@@ -9,6 +9,7 @@ Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian
![DeepDB Overview](baselines/plots/overview.png "DeepDB Overview")

# Setup
+ Tested with Python 3.7 and Python 3.8.
```
git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
@@ -18,21 +19,12 @@ source venv/bin/activate
pip3 install -r requirements.txt
```
- # How to experiment with DeepDB on a new Dataset
- - Specify a new schema in the schemas folder
- - Due to the current implementation, make sure to declare
- - the primary key,
- - the filename of the csv sample file,
- - the correct table size and sample rate,
- - the relationships among tables if you do not just run queries over a single table,
- - any non-key functional dependencies (this is rather an implementation detail),
- - and include all columns in the no-compression list by default (as done for the IMDB benchmark),
- - To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
- - Generate the HDF/sampled HDF files and learn the RSPN ensemble
- - Use the RSPN ensemble to answer queries
- - For reference, please check the commands to reproduce the results of the paper
+ For Python 3.8: the installation of spflow sometimes fails. In this case, remove spflow from requirements.txt, install the remaining requirements, and run
+ ```
+ pip3 install spflow --no-deps
+ ```

- # How to Reproduce Experiments in the Paper
+ # Reproduce Experiments
## Cardinality Estimation
Download the [Job dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz).
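One way to fetch and unpack the archive (the target directory below is only a placeholder; use whatever directory you later pass to the csv path options):
```
wget http://homepages.cwi.nl/~boncz/job/imdb.tgz
mkdir -p ../imdb-benchmark          # placeholder target directory
tar -xvzf imdb.tgz -C ../imdb-benchmark
```
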
@@ -288,3 +280,61 @@ python3 maqp.py --evaluate_confidence_intervals
--confidence_upsampling_factor 100
--confidence_sample_size 10000000
```
+
+ ### TPC-DS (Single Table) Pipeline
+ As an additional example of how to work with DeepDB, we provide a pipeline for just a single table of the TPC-DS schema, using the queries in `./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql`. As a prerequisite, you need a 10 million tuple sample of the store_sales table at `../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv` (one way to create it is sketched below). Afterwards, you can run the following commands. To compute the ground truth, you need a Postgres instance with the 1T TPC-DS dataset loaded.
+
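+ A minimal sketch for drawing this sample (it assumes the 1T data is loaded into the `tcpds` database used below, that a uniform random sample is acceptable, and that the CSV should use `|` as separator without a header; adjust to your setup):
+ ```
+ psql -d tcpds -c "\copy (SELECT * FROM store_sales ORDER BY random() LIMIT 10000000) TO '../mqp-data/tpc-ds-benchmark/store_sales_sampled.csv' WITH (FORMAT csv, DELIMITER '|')"
+ ```
+ ORDER BY random() over the full table is simple but slow at this scale; a TABLESAMPLE-based query is a faster alternative.
+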
+ Generate HDF files from the CSVs
+ ```
+ python3 maqp.py --generate_hdf
+ --dataset tpc-ds-1t
+ --csv_seperator |
+ --csv_path ../mqp-data/tpc-ds-benchmark
+ --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
+ ```
+
+ Learn the ensemble
+ ```
+ python3 maqp.py --generate_ensemble
+ --dataset tpc-ds-1t
+ --samples_per_spn 10000000
+ --ensemble_strategy single
+ --hdf_path ../mqp-data/tpc-ds-benchmark/gen_hdf
+ --ensemble_path ../mqp-data/tpc-ds-benchmark/spn_ensembles
+ --rdc_threshold 0.3
+ --post_sampling_factor 10
+ ```
+
+ Compute ground truth
+ ```
+ python3 maqp.py --aqp_ground_truth
+ --dataset tpc-ds-1t
+ --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
+ --target_path ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
+ --database_name tcpds
+ ```
+
+ Evaluate the AQP queries
+ ```
+ python3 maqp.py --evaluate_aqp_queries
+ --dataset tpc-ds-1t
+ --target_path ./baselines/aqp/results/deepDB/tpcds1t_model_based.csv
+ --ensemble_location ../mqp-data/tpc-ds-benchmark/spn_ensembles/ensemble_single_tpc-ds-1t_10000000.pkl
+ --query_file_location ./benchmarks/tpc_ds_single_table/sql/aqp_queries.sql
+ --ground_truth_file_location ./benchmarks/tpc_ds_single_table/ground_truth_1t.pkl
+ ```
+
+ # How to experiment with DeepDB on a new Dataset
+ - Specify a new schema in the schemas folder
+ - Due to the current implementation, make sure to declare
+   - the primary key,
+   - the filename of the csv sample file,
+   - the correct table size and sample rate,
+   - the relationships among tables if you do not just run queries over a single table,
+   - any non-key functional dependencies (this is rather an implementation detail),
+   - and include all columns in the no-compression list by default (as done for the IMDB benchmark)
+ - To further reduce the training time, you can exclude columns you do not need in your experiments (also done in the IMDB benchmark)
+ - Generate the HDF/sampled HDF files and learn the RSPN ensemble (see the sketch after this list)
+ - Use the RSPN ensemble to answer queries
+ - For reference, please check the commands to reproduce the results of the paper
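+
+ A rough end-to-end sketch for a hypothetical single-table dataset (here called `my-dataset`; all names, paths, and hyperparameter values are placeholders, and the new dataset presumably has to be wired up wherever the --dataset flag is resolved, as for the existing benchmarks):
+ ```
+ # 1. Generate HDF files from the csv sample (placeholder paths)
+ python3 maqp.py --generate_hdf
+ --dataset my-dataset
+ --csv_seperator ,
+ --csv_path ../mqp-data/my-dataset
+ --hdf_path ../mqp-data/my-dataset/gen_hdf
+
+ # 2. Learn the RSPN ensemble over the sample
+ python3 maqp.py --generate_ensemble
+ --dataset my-dataset
+ --samples_per_spn 1000000
+ --ensemble_strategy single
+ --hdf_path ../mqp-data/my-dataset/gen_hdf
+ --ensemble_path ../mqp-data/my-dataset/spn_ensembles
+
+ # 3. Use the ensemble to answer queries (the ensemble filename follows the pattern of the TPC-DS example above)
+ python3 maqp.py --evaluate_aqp_queries
+ --dataset my-dataset
+ --ensemble_location ../mqp-data/my-dataset/spn_ensembles/ensemble_single_my-dataset_1000000.pkl
+ --query_file_location ./benchmarks/my_dataset/aqp_queries.sql
+ --ground_truth_file_location ./benchmarks/my_dataset/ground_truth.pkl
+ --target_path ./baselines/aqp/results/deepDB/my_dataset_model_based.csv
+ ```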