Skip to content

Conversation

SinsBre
Copy link
Contributor

@SinsBre SinsBre commented Feb 12, 2025

Added a graph validator.

Checks:

  • Missing nodes
  • Nan nodes
  • Duplicate node ids

@SinsBre SinsBre linked an issue Feb 12, 2025 that may be closed by this pull request
@SinsBre
Copy link
Contributor Author

SinsBre commented Mar 10, 2025

@aucahuasi - it is failing for test-full-ai, unsure how to fix that. Known issue?

return True


def validate_graph(g, verbose=True):
Copy link
Contributor

@lmeyerov lmeyerov Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def check_node_dataframe_exists(g, verbose=True):
if g._nodes is None:
if verbose:
print("Warning: graph was created with only edges. Skipping Node ID check if Node IDs match edge IDs. Use g2 = g.materialize_nodes() to force node df creation. Exiting.")
Copy link
Contributor

@lmeyerov lmeyerov Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch from print statements to warn + logger.info depending on use

removes need for verbose param via Pythonic conventions



def validate_graph(g, verbose=True):
if not check_node_dataframe_exists(g, verbose):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node existence should be more like:

  • either _node and _nodes are both defined or neither defined

return True


def validate_graph(g, verbose=True):
Copy link
Contributor

@lmeyerov lmeyerov Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for validators used for debugging, instead of doing an immediate error, good to collect errors and report them all, with failure only at the end



def check_nan_node_ids(g, verbose=True):
if g._nodes[g._node].isnull().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compute & report .sum() vs just .any()

a more verbose mode might return the df to help save time



def check_duplicate_node_ids(g, verbose=True):
if g._nodes[g._node].duplicated().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def check_edge_sources_exist_in_nodes(g, verbose=True):
if not g._edges[g._source].isin(g._nodes[g._node]).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



def check_edge_destinations_exist_in_nodes(g, verbose=True):
if not g._edges[g._destination].isin(g._nodes[g._node]).all():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not check_duplicate_node_ids(g, verbose):
return False
check_edge_sources_exist_in_nodes(g, verbose) # Warnings only
check_edge_destinations_exist_in_nodes(g, verbose) # Warnings only
Copy link
Contributor

@lmeyerov lmeyerov Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not returning False?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I guess I was considering only graph with edge IDs that could be valid.

However, to your point, we should error if g._node and g._nodes exist and there are missing node IDs from g._edges src/dst, that should throw an error.

return True


def validate_graph(g, verbose=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a public function, should have a docstr matching other docstrs format

Copy link
Contributor

@lmeyerov lmeyerov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

bigger picture:

  • always-on for validated uploads: factor out fast metadata-only checks (e.g., validate_schema_bindings binding name <> col name match up) from data-level checks (e.g., validate_indexes checking col for nan, value mismatches, etc): update ArrowUploader::create_dataset -> if validate: ... to include the fast metadata-only checks. Arguably we should do data-level checks here too, but that changes perf characteristics more fundamentally so let's not for now..

  • expose on Plotter / PlotterBase, ideally calling both this + validate_encodings

  • switch to logging / warn vs verbse => print(), and report error count (# hits per violated condition) and ideally all errors vs early-exist on first

  • add an ipynb demo

  • some potential items to warn or error on

return True


def validate_graph(g, verbose=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommend also checking that dtypes of src/dst match, as does node/src if exist

type mismatch is more of a warning vs a fail, as we can support bipartite graphs, like imagine user:str <> day:int

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, as a warn vs fail, if there is no overlap of src/dst edges, they may have a mismatch needing cleanup, imagine 123 <> id:123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] validator
5 participants