
Commit 28522a2

Author: Benjamin Hilprecht
Added tpc-ds example for single table for UWarwick
1 parent 071ae9e commit 28522a2

36 files changed: +2671 −475 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -129,6 +129,7 @@ benchmarks/maqp_scripts/rsync_dm.sh
 # profiling
 profiling_results
 profiling.py
+bar.pdf
 *.lprof

 optimized_inference.cpp

README.md

Lines changed: 64 additions & 14 deletions
@@ -9,6 +9,7 @@ Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian
 ![DeepDB Overview](baselines/plots/overview.png "DeepDB Overview")

 # Setup
+Tested with Python 3.7 and Python 3.8.
 ```
 git clone https://github.com/DataManagementLab/deepdb-public.git
 cd deepdb-public
@@ -18,21 +19,12 @@ source venv/bin/activate
 pip3 install -r requirements.txt
 ```

-# How to experiment with DeepDB on a new Dataset
-- Specify a new schema in the schemas folder
-- Due to the current implementation, make sure to declare
-  - the primary key,
-  - the filename of the csv sample file,
-  - the correct table size and sample rate,
-  - the relationships among tables if you do not just run queries over a single table,
-  - any non-key functional dependencies (this is rather an implementation detail),
-  - and include all columns in the no-compression list by default (as done for the IMDB benchmark),
-- To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
-- Generate the HDF/sampled HDF files and learn the RSPN ensemble
-- Use the RSPN ensemble to answer queries
-- For reference, please check the commands to reproduce the results of the paper
+For Python 3.8: installing spflow sometimes fails. In this case, remove spflow from requirements.txt, install the remaining requirements, and run
+```
+pip3 install spflow --no-deps
+```

-# How to Reproduce Experiments in the Paper
+# Reproduce Experiments

 ## Cardinality Estimation
 Download the [Job dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz).
@@ -288,3 +280,61 @@ python3 maqp.py --evaluate_confidence_intervals
 --confidence_upsampling_factor 100
 --confidence_sample_size 10000000
 ```
+
+### TPC-DS (Single Table) Pipeline
+As an additional example of how to work with DeepDB, we provide a pipeline for a single table of the TPC-DS schema, using the queries in `./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql`. As a prerequisite, you need a sample of 10 million tuples of the store_sales table at `../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv`. Afterwards, you can run the following commands. To compute the ground truth, you need a Postgres instance with a 1T TPC-DS dataset.
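The sample file itself is not shipped with the repository. One illustrative way to produce such a sample (assuming the full store_sales table has been exported as a pipe-separated file, e.g. `store_sales.dat` from dsdgen; the function and file names below are hypothetical, not part of DeepDB) is single-pass reservoir sampling:

```python
import csv
import random

def sample_csv(src, dst, k, delimiter="|", seed=0):
    """Reservoir-sample k rows from src into dst in one pass, O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    with open(src, newline="") as f:
        for i, row in enumerate(csv.reader(f, delimiter=delimiter)):
            if i < k:
                reservoir.append(row)        # fill the reservoir first
            else:
                j = rng.randrange(i + 1)     # keep row with probability k/(i+1)
                if j < k:
                    reservoir[j] = row
    with open(dst, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(reservoir)

# e.g.: sample_csv("store_sales.dat", "store_sales_sampled.csv", k=10_000_000)
```

For a 1T-scale table, a database-side sample (e.g. Postgres `TABLESAMPLE`) will be far faster; the sketch only illustrates the expected output format.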
+
+Generate HDF files from the CSVs:
+```
+python3 maqp.py --generate_hdf
+--dataset tpc-ds-1t
+--csv_seperator |
+--csv_path ../mqp-data/tpc-ds-benchmark
+--hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
+```
+
+Learn the ensemble:
+```
+python3 maqp.py --generate_ensemble
+--dataset tpc-ds-1t
+--samples_per_spn 10000000
+--ensemble_strategy single
+--hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
+--ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles
+--rdc_threshold 0.3
+--post_sampling_factor 10
+```
+
+Compute the ground truth:
+```
+python3 maqp.py --aqp_ground_truth
+--dataset tpc-ds-1t
+--query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
+--target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
+--database_name tcpds
+```
+
+Evaluate the AQP queries:
+```
+python3 maqp.py --evaluate_aqp_queries
+--dataset tpc-ds-1t
+--target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv
+--ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl
+--query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
+--ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
+```
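For reference, AQP quality is conventionally reported as the relative error of each estimate against the ground truth. A minimal sketch of that metric (the actual evaluation logic lives in `maqp.py`; the functions below are illustrative only):

```python
from statistics import median

def relative_error(estimate, truth):
    """Relative error |estimate - truth| / |truth|; undefined for truth == 0."""
    if truth == 0:
        raise ValueError("relative error is undefined for a zero ground truth")
    return abs(estimate - truth) / abs(truth)

def median_relative_error(pairs):
    """Median relative error over (estimate, ground_truth) pairs."""
    return median(relative_error(e, t) for e, t in pairs)
```

A 5% overestimate, e.g. `relative_error(105.0, 100.0)`, yields 0.05.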
+
+# How to experiment with DeepDB on a new Dataset
+- Specify a new schema in the schemas folder
+- Due to the current implementation, make sure to declare
+  - the primary key,
+  - the filename of the csv sample file,
+  - the correct table size and sample rate,
+  - the relationships among tables if you do not just run queries over a single table,
+  - any non-key functional dependencies (this is rather an implementation detail),
+  - and include all columns in the no-compression list by default (as done for the IMDB benchmark)
+- To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
+- Generate the HDF/sampled HDF files and learn the RSPN ensemble
+- Use the RSPN ensemble to answer queries
+- For reference, please check the commands to reproduce the results of the paper
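The per-table declarations in the list above amount to plain metadata. The sketch below models them with standard dataclasses for illustration only; real schemas use DeepDB's own classes, so take the actual API from the IMDB example in the `schemas` folder. The TPC-DS attribute names are real, but `table_size` and all other values are placeholders:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TableSpec:
    """Illustrative stand-in for a DeepDB table declaration."""
    name: str
    attributes: List[str]
    csv_file_location: str       # filename of the csv sample file
    table_size: int              # size of the full table, not of the sample
    sample_rate: float           # fraction of the full table kept in the sample
    primary_key: List[str]
    no_compression: List[str] = field(default_factory=list)

# Hypothetical single-table declaration; table_size is a placeholder and
# must match the actual row count of your dataset.
store_sales = TableSpec(
    name="store_sales",
    attributes=["ss_item_sk", "ss_ticket_number", "ss_quantity", "ss_net_paid"],
    csv_file_location="../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv",
    table_size=2_880_000_000,
    sample_rate=10_000_000 / 2_880_000_000,
    primary_key=["ss_item_sk", "ss_ticket_number"],
    no_compression=["ss_item_sk", "ss_ticket_number", "ss_quantity", "ss_net_paid"],
)
```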
