Merged
54 commits
d360250
add data generation to benchmarking
a-s-g93 Jan 3, 2025
c9ad65f
Update create_data.py
a-s-g93 Jan 3, 2025
c936051
Merge branch 'main' into benchmarking
a-s-g93 Jan 6, 2025
a997db3
write main benchmarking notebook
a-s-g93 Jan 6, 2025
e8eec17
Change to powerlaw value distribution
smithna Jan 6, 2025
a0771a5
formatting and update df save function to track
a-s-g93 Jan 6, 2025
e641755
first benchmarking run
a-s-g93 Jan 6, 2025
598c6cb
refactor benchmarking code into main script, write makefile command
a-s-g93 Jan 7, 2025
7aab2c8
allow benchmarking of multiple groups per method
a-s-g93 Jan 7, 2025
52c56f6
begin viz analytics functions
a-s-g93 Jan 7, 2025
da74dd3
Merge branch 'main' into benchmarking
a-s-g93 Jan 7, 2025
dad58ec
test benchmarking, got deadlock error for monopartite
a-s-g93 Jan 7, 2025
9c6d362
Merge branch 'main' into benchmarking
a-s-g93 Jan 7, 2025
9d4c6a9
Merge branch 'main' into benchmarking
a-s-g93 Jan 7, 2025
99efbbf
Update main.py
a-s-g93 Jan 7, 2025
9fcd972
update run group params
a-s-g93 Jan 7, 2025
dff36b7
Merge branch 'main' into benchmarking
a-s-g93 Jan 9, 2025
80a0f22
run benchmarks for v0.2.4 and create more visuals
a-s-g93 Jan 9, 2025
8e02804
formatting
a-s-g93 Jan 9, 2025
817a6a2
Update .gitignore
a-s-g93 Jan 9, 2025
1540582
git cache cleared
a-s-g93 Jan 9, 2025
789cb4c
update sampling method, add visuals
a-s-g93 Jan 10, 2025
eec4cf3
Delete benchmark.ipynb
a-s-g93 Jan 10, 2025
d734535
Delete _.ipynb
a-s-g93 Jan 10, 2025
0f0798b
Update CHANGELOG.md
a-s-g93 Jan 10, 2025
ec9b2d7
update analysis code
a-s-g93 Jan 10, 2025
6b37900
Create larger datasets
smithna Jan 10, 2025
599ae1d
add spark details to ouput csv
a-s-g93 Jan 10, 2025
3b4808f
Merge branch 'benchmarking' of https://github.com/neo4j-field/neo4j-p…
a-s-g93 Jan 10, 2025
d0d598d
add numpy import
a-s-g93 Jan 10, 2025
8b9a402
benchmark with 1mil rows
a-s-g93 Jan 10, 2025
0c425cb
Merge branch 'main' into benchmarking
a-s-g93 Jan 13, 2025
c04085c
Update poetry.lock
a-s-g93 Jan 13, 2025
7578ba6
0.3.0 benchmarking
a-s-g93 Jan 13, 2025
4aabfbc
update num_groups for v0.3.1
a-s-g93 Jan 13, 2025
b57b214
Merge branch 'main' into benchmarking
a-s-g93 Jan 13, 2025
9938941
0.3.1 benchmarking
a-s-g93 Jan 13, 2025
4132cbb
add more dataset options
a-s-g93 Jan 14, 2025
d12ed37
update benchmarking with real data
a-s-g93 Jan 14, 2025
7cc806b
Update CHANGELOG.md
a-s-g93 Jan 14, 2025
e5404a7
update delete to use driver
a-s-g93 Jan 15, 2025
e7acc31
update neo4j spark conf
a-s-g93 Jan 15, 2025
1e38203
update for aura benchmarking in databricks
a-s-g93 Jan 15, 2025
f3e86b7
Benchmark with Databricks
Jan 21, 2025
36001b5
Update visualizations
Jan 22, 2025
76a15c2
Ignore header in Twitch file
Jan 23, 2025
b696cbb
Add full samples
Jan 25, 2025
bccc063
Change monopartite coloring for even numbered groups
Jan 26, 2025
5f389af
Update README
Jan 26, 2025
e641d48
Update for local testing
smithna Jan 26, 2025
77edf8e
Fix predefined components load bug
Jan 27, 2025
9cbba43
Update running instructions
Jan 27, 2025
20f8b24
Update CHANGELOG.md
smithna Jan 28, 2025
3a51db5
Update monopartite coloring test for new batching strategy for self l…
smithna Jan 28, 2025
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,5 +1,7 @@
Neo4j-*.txt
.DS_Store

benchmarking/data/*.csv
data/datasets/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,21 @@

### Added

## 0.4.0

### Changed

* For monopartite batching, assign self-loop relationships for two node groups to the same relationship group. This allows for improved efficiency for most real-world monopartite graphs.

### Added

* Benchmarking module including:
  * Methods to generate synthetic data that maps to each ingest method in the package
  * A `generate_benchmarks` method to iterate through parameters and collect benchmarking information for each ingest method in the package
  * Visualizations to easily analyze results
  * A method to partition results by package version automatically
  * Methods to retrieve and format real data for benchmarking scenarios

## 0.3.1

### Changed
8 changes: 8 additions & 0 deletions Makefile
@@ -3,6 +3,14 @@
# Default target executed when no arguments are given to make.
all: help

benchmark:
	docker compose -f benchmarking/docker-compose.yml up -d
	poetry run python3 benchmarking/main.py --env=$(env) --datatype=$(datatype)
	docker compose -f benchmarking/docker-compose.yml stop

benchmark_databricks:
	poetry run python3 benchmarking/main.py --env=$(env) --datatype=$(datatype)
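
# Example invocation (hypothetical values for illustration; the valid options for
# env and datatype are described in benchmarking/README.md):
#   make benchmark env=local datatype=generated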

coverage:
	poetry run coverage run -m pytest tests/unit
	poetry run coverage report --fail-under=85
19 changes: 10 additions & 9 deletions README.md
@@ -67,7 +67,7 @@ Each grouping and batching scenario has its own module. The `group_and_batch_spa

In some relationship data, the relationships can be broken into distinct components based on a field in the relationship data. For example, you might have a DataFrame of HR data with columns for `employeeId`, `managerId`, and `department`. If we want to create a `MANAGES` relationship between employees and managers, and we know in advance that all managers are in the same department as the employees they manage, we can separate the rows of the DataFrame into components based on the `department` key.

Often the number of predefined components is greater than the number of workers in the Spark cluster, and the number of rows within each component is unequal. When running `parallel_spark_loader.predefined_components.group_and_batch_spark_dataframe()`, you specify the number of groups that you want to collect the partitioned data into. This value should be equal to the number of workers in your Spark cluster. Neo4j Parallel Spark Loader uses a greedy algorithm to assign partitions into groups in a way that attempts to balance the number of relationships within each group. When loading this ensures that each Spark worker stays equally instead of some workers waiting while other workers finish loading larger groups.
Often the number of predefined components is greater than the number of workers in the Spark cluster, and the number of rows within each component is unequal. When running `parallel_spark_loader.predefined_components.group_and_batch_spark_dataframe()`, you specify the number of groups that you want to collect the partitioned data into. The optimal number of groups depends on the capacity of your Spark cluster and the Neo4j instance you are loading into. As a rule of thumb, the number of groups should be less than or equal to the total number of executor CPUs on your Spark cluster. Neo4j Parallel Spark Loader uses a greedy algorithm to assign partitions into groups in a way that attempts to balance the number of relationships within each group. When loading, this ensures that each Spark worker stays equally busy instead of some workers waiting while other workers finish loading larger groups.

![Diagram showing nodes and relationships assigned to groups](./docs/assets/images/predefined-components.png)
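
Below is a minimal PySpark sketch of this scenario. It assumes the function accepts the DataFrame, the partitioning column, and a `num_groups` value; the exact import path, signature, and argument names should be checked against the package documentation.

```python
# Hypothetical sketch of grouping and batching by a predefined component key.
# Column names and argument names are illustrative assumptions.
from pyspark.sql import SparkSession

from neo4j_parallel_spark_loader.predefined_components import group_and_batch_spark_dataframe

spark = SparkSession.builder.getOrCreate()

# Relationship rows: employeeId, managerId, department
hr_df = spark.createDataFrame(
    [("e1", "m1", "Sales"), ("e2", "m1", "Sales"), ("e3", "m2", "Engineering")],
    ["employeeId", "managerId", "department"],
)

# Assign rows to balanced groups based on the predefined `department` components.
grouped_hr_df = group_and_batch_spark_dataframe(hr_df, "department", num_groups=8)

# Load one batch at a time; groups within a batch can be written to Neo4j in parallel.
for batch_row in grouped_hr_df.select("batch").distinct().collect():
    batch_df = grouped_hr_df.where(grouped_hr_df["batch"] == batch_row["batch"])
    # batch_df.write ...  (Neo4j Spark Connector options omitted)
```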

@@ -79,7 +79,7 @@ We can visualize the nodes within the same group as a single aggregated node and

In many relationship datasets, there is not a partitioning key in the Spark DataFrame that can be used to divide the relationships into predefined components. However, we know that no nodes in the dataset will be *both a source and a target* for this relationship type. Often this is because the source nodes and the target nodes have different node labels and they represent different classes of things in the real world. For example, you might have a DataFrame of order data with columns for `orderId`, `productId`, and `quantity`, and you want to create `INCLUDES_PRODUCT` relationships between `Order` and `Product` nodes. You know that all source nodes of `INCLUDES_PRODUCT` relationships will be `Order` nodes, and all target nodes will be `Product` nodes. No nodes will be *both source and target* of that relationship.

When running `parallel_spark_loader.bipartite.group_and_batch_spark_dataframe()`, you specify the number of groups that you want to collect the source and target nodes into. This value should be equal to the number of workers in your Spark cluster. Neo4j Parallel Spark Loader uses a greedy alogrithm to assign source node values to source-node groups so that each group represents roughly the same number of rows in the relationship DataFrame. Similarly, the library groups the target node values into target-node groups with roughly balanced size.
When running `parallel_spark_loader.bipartite.group_and_batch_spark_dataframe()`, you specify the number of groups that you want to collect the source and target nodes into. The optimal number of node groups depends on the capacity of your Spark cluster and the Neo4j instance you are loading into. As a rule of thumb, the number of node groups should be less than or equal to the total number of executor CPUs on your Spark cluster. Neo4j Parallel Spark Loader uses a greedy algorithm to assign source node values to source-node groups so that each group represents roughly the same number of rows in the relationship DataFrame. Similarly, the library groups the target node values into target-node groups with roughly balanced size.

We can visualize the nodes within the same group as a single aggregated node and the relationships that connect nodes within the same group as a single aggregated relationship.
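
A hedged sketch of the bipartite scenario, assuming the function takes the source column, the target column, and a `num_groups` value (check the package documentation for the exact signature):

```python
from pyspark.sql import SparkSession

from neo4j_parallel_spark_loader.bipartite import group_and_batch_spark_dataframe

spark = SparkSession.builder.getOrCreate()

# Relationship rows: orderId (source), productId (target), quantity
orders_df = spark.createDataFrame(
    [("o1", "p1", 2), ("o1", "p2", 1), ("o2", "p1", 5)],
    ["orderId", "productId", "quantity"],
)

# Source column, target column, and the desired number of node groups
# (argument names are illustrative assumptions).
grouped_orders_df = group_and_batch_spark_dataframe(
    orders_df, "orderId", "productId", num_groups=8
)
```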

@@ -93,24 +93,25 @@ In some relationship datasets, the same node is the source node of some relation

When running `parallel_spark_loader.monopartite.group_and_batch_spark_dataframe()`, the library uses the union of the source and target nodes as the basis for assigning nodes to groups. As with other scenarios, you select the number of groups that should be created, and a greedy algorithm assigns node IDs to groups so that the combined number of source and target rows for the IDs in a group is roughly equal.

As with the other scenarios, you set the number of groups that will be assigned by the algorithm. However, unlike the predefined components and bipartite scenarios, in the monopartite scenario, *it is not recommended that the number of groups equals the number of workers in the Spark cluster*. This is because a group can represent the source of a relationship and the target of a relationship. In the monopartite scenario, it is recommended to set `num_groups = (2 * num_workers) - 1`
As with the other scenarios, you set the number of groups that will be assigned by the algorithm. The optimal number of groups depends on the resources of the Spark cluster and the Neo4j instance. However, unlike the predefined components and bipartite scenarios, in the monopartite scenario, *the number of node groups should be 2 times the number of parallel transactions that you want to execute*. This is because a group can represent both the source of a relationship and the target of a relationship.
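
For example, a hedged monopartite sketch that targets five parallel transactions (and so requests ten node groups); column and argument names here are illustrative assumptions rather than the documented signature:

```python
from pyspark.sql import SparkSession

from neo4j_parallel_spark_loader.monopartite import group_and_batch_spark_dataframe

spark = SparkSession.builder.getOrCreate()

# Relationship rows where the same user id can appear as either source or target
follows_df = spark.createDataFrame(
    [("u1", "u2"), ("u2", "u3"), ("u3", "u1")],
    ["sourceUserId", "targetUserId"],
)

# num_groups = 2 * desired parallel transactions = 10
grouped_follows_df = group_and_batch_spark_dataframe(
    follows_df, "sourceUserId", "targetUserId", num_groups=10
)
```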

We can visualize the nodes within the same group as a single aggregated node and the relationships that connect nodes within the same group as a single aggregated relationship.

![Diagram showing aggregated bipartite relationships colored by group](./docs/assets/images/monopartite-coloring-diagram.png)

In the aggregated biparite diagram, multiple relationships (each representing a group of individual relationships) connect to each node (representing a group of nodes). Because nodes could be either source or target, there are no arrow heads in the diagram representing relationship direction. However, the nodes are always stored with a direction in Neo4j. Using the rotational symmetry of the complete graph, the relationships are colored so that no relationships of the same color connect to the same node. The relationship colors represent the batches applied to the data. In the picture above, the relationship groups represented by red arrows can be processed in parallel because no node groups are connected to more than one red relationship group. After the red batch has completed, each additional color batch can be processed in turn until all relationships have been loaded. Notice that with five node groups, each color batch contains three relationship groups. This demonstrates why the number of groups should be larger than the number of Spark workers that you want to keep occupied.
In the aggregated monopartite diagram, multiple relationships (each representing a group of individual relationships) connect to each node (representing a group of nodes). Because nodes could be either source or target, there are no arrowheads in the diagram representing relationship direction. However, the relationships are always stored with a direction in Neo4j. Using the rotational symmetry of the complete graph, the relationships are colored so that no relationships of the same color connect to the same node. The relationship colors represent the batches applied to the data. In the picture above, the relationship groups represented by red arrows can be processed in parallel because no node groups are connected to more than one red relationship group. After the red batch has completed, each additional color batch can be processed in turn until all relationships have been loaded. Notice that with five node groups, each color batch contains three relationship groups. This demonstrates why the number of groups should be larger than the number of parallel transactions that you want to execute.

## Workflow Visualization

The visualization module may be used to create a heatmap of the workflow.

* Groups are identified as rows and batches are identified as columns.
* Batches are displayed sequentially in the order they are processed in.
* The value in each group & batch pair indicates how many rows are processed in that step.
* Batches are identified as rows.
* Groups that will be processed in a batch are shown as cells in the row.
* The number of relationships in the relationship group and the relationship group name are shown in each cell.
* For optimal processing, the number of relationships in each row should be similar.

This function may be imported with `from neo4j_parallel_spark_loader.visualize import create_ingest_heatmap` and takes a Spark DataFrame with columns including `group` and `batch` as input.
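
A minimal sketch of calling the function on a grouped-and-batched DataFrame, reusing `grouped_follows_df` from the monopartite sketch above; the return value is assumed here to be a plot object that can be displayed or saved:

```python
from neo4j_parallel_spark_loader.visualize import create_ingest_heatmap

# The DataFrame must already carry the `group` and `batch` columns added by
# group_and_batch_spark_dataframe().
heatmap = create_ingest_heatmap(grouped_follows_df)
```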

Here is an example of a generated heatmap with dummy data:
Here is an example of a generated heatmap with monopartite data with ten node groups:

![Example heatmap generated with the visualization module](./docs/assets/images/example-heatmap.png)
![Example heatmap generated with the visualization module](./docs/assets/images/monopartite_heatmap.png)
23 changes: 23 additions & 0 deletions benchmarking/README.md
@@ -0,0 +1,23 @@
# Benchmarking

This directory contains code to perform comparisons between serial and parallel relationship loading using Spark. Parallel loading utilizes the neo4j-parallel-spark-loader Python package.

## Datasets
You can choose to run the benchmarking code with synthetic data or with example graph datasets downloaded from public websites by setting a `data_source` parameter. If you choose `data_source="real"`, the benchmarking code reads from these three data sources:

* Predefined components: [Reddit threads](https://snap.stanford.edu/data/reddit_threads.html)
* Bipartite graph: [Amazon ratings](https://networkrepository.com/rec-amazon-ratings.php)
* Monopartite graph: [Twitch gamers](https://snap.stanford.edu/data/twitch_gamers.html)

## Running in Databricks

To run the benchmarking code in Databricks, open the `databricks_benchmarking.ipynb` notebook in this directory, update the notebook widgets with connection information for your Neo4j database, and run the notebook. It will generate a new output file in the `./output/{version}` directory with statistics for all of the benchmarking runs.

## Running locally

To run the benchmarking code locally, set environment variables for `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_URI`, and `NEO4J_DATABASE`, or include them in a `.env` file. Then run `main.py`, passing the `-d` argument with a value of `"generated"` or `"real"` and the `-e` argument with `"local"`.

## Analyzing results
The `analyze_results.ipynb` notebook will generate charts to show the load times for each graph loading scenario in parallel and at two different group sizes. The results are read from the `./output` directory for the most recent run at the package version that you specify.

The `visualize_groups_and_batches.ipynb` notebook will create heat maps to show the distribution of rows across groups and batches for the public graph datasets at the group number that you select.