-
Notifications
You must be signed in to change notification settings - Fork 216
269 fea validator - add validator code. #652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@aucahuasi - it is failing for test-full-ai, unsure how to fix that. Known issue? |
return True | ||
|
||
|
||
def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
add signature to
Plottable
interface: https://github.com/graphistry/pygraphistry/blob/master/graphistry/Plottable.py -
expose for chained use in
PlotterBase
, e.g., :pygraphistry/graphistry/PlotterBase.py
Line 1428 in 320326c
def upload(
def check_node_dataframe_exists(g, verbose=True): | ||
if g._nodes is None: | ||
if verbose: | ||
print("Warning: graph was created with only edges. Skipping Node ID check if Node IDs match edge IDs. Use g2 = g.materialize_nodes() to force node df creation. Exiting.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switch from print
statements to warn
+ logger.info
depending on use
removes need for verbose
param via Pythonic conventions
|
||
|
||
def validate_graph(g, verbose=True): | ||
if not check_node_dataframe_exists(g, verbose): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
node existence should be more like:
- either _node and _nodes are both defined or neither defined
return True | ||
|
||
|
||
def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for validators used for debugging, instead of doing an immediate error, good to collect errors and report them all, with failure only at the end
|
||
|
||
def check_nan_node_ids(g, verbose=True): | ||
if g._nodes[g._node].isnull().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compute & report .sum()
vs just .any()
a more verbose mode might return the df to help save time
|
||
|
||
def check_duplicate_node_ids(g, verbose=True): | ||
if g._nodes[g._node].duplicated().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
def check_edge_sources_exist_in_nodes(g, verbose=True): | ||
if not g._edges[g._source].isin(g._nodes[g._node]).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
def check_edge_destinations_exist_in_nodes(g, verbose=True): | ||
if not g._edges[g._destination].isin(g._nodes[g._node]).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not check_duplicate_node_ids(g, verbose): | ||
return False | ||
check_edge_sources_exist_in_nodes(g, verbose) # Warnings only | ||
check_edge_destinations_exist_in_nodes(g, verbose) # Warnings only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not returning False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I guess I was considering only graph with edge IDs that could be valid.
However, to your point, we should error if g._node and g._nodes exist and there are missing node IDs from g._edges src/dst, that should throw an error.
return True | ||
|
||
|
||
def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a public function, should have a docstr matching other docstrs format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments
bigger picture:
-
always-on for validated uploads: factor out fast metadata-only checks (e.g.,
validate_schema_bindings
binding name <> col name match up) from data-level checks (e.g.,validate_indexes
checking col for nan, value mismatches, etc): updateArrowUploader::create_dataset
->if validate: ...
to include the fast metadata-only checks. Arguably we should do data-level checks here too, but that changes perf characteristics more fundamentally so let's not for now.. -
expose on
Plotter
/PlotterBase
, ideally calling both this +validate_encodings
-
switch to
logging
/warn
vsverbse => print()
, and report error count (# hits per violated condition) and ideally all errors vs early-exist on first -
add an ipynb demo
-
some potential items to warn or error on
return True | ||
|
||
|
||
def validate_graph(g, verbose=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend also checking that dtypes of src/dst match, as does node/src if exist
type mismatch is more of a warning vs a fail, as we can support bipartite graphs, like imagine user:str <> day:int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise, as a warn vs fail, if there is no overlap of src/dst edges, they may have a mismatch needing cleanup, imagine 123
<> id:123
Added a graph validator.
Checks: