From 414bfe365a8eb51311fdfdda204700299322291b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Fri, 31 Jan 2025 15:18:23 +0100
Subject: [PATCH 01/12] start work on vignette

---
 vignettes/duckplyr.Rmd | 89 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 vignettes/duckplyr.Rmd

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
new file mode 100644
index 000000000..7cb31041d
--- /dev/null
+++ b/vignettes/duckplyr.Rmd
@@ -0,0 +1,89 @@
+---
+title: "duckplyr"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{duckplyr}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+```{r setup}
+library(duckplyr)
+```
+
+## Design principles
+
+The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
+These two facts create a tension:
+
+- When using dplyr, we are not used to explicitly collect results: the data.frames are eager by default.
+  Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
+  Therefore, _duckplyr needs eagerness_!
+
+- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
+  _Therefore, duckplyr needs laziness_!
+
+As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**.
+
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."
+
+If the duckplyr data.frame is accessed by...
+
+- not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
+- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+
+Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
+
+Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
+Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
+A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
+By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
+
+## How to use duckplyr
+
+### For normal sized data (instead of dplyr)
+
+To replace dplyr with duckplyr, you can either
+
+- load duckplyr and then keep your pipeline as is.
+
+```r
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+```
+
+- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
+
+In both cases, if an operation cannot be performed
+by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
+You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
+You can disable fallbacks by turning off automatic materialization.
+In that case, if an operation cannot be performed by duckplyr, your code will error.
+
+### For large data (instead of dbplyr)
+
+With large datasets, you want:
+
+- input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
+- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
+- the output to not clutter all the memory. Therefore you can make use of these features:
+  - funneling see vignette TODO ADD CURRENT NAME to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+  - computation to files using `compute_parquet()` or `compute_csv()`.
+
+
+
+A drawback of analyzing large data with duckplyr is that the limitations of duckplyr
+(unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
+
+## How to improve duckplyr
+
+- telemetry
+- report issues, contribute

From 460c4be7aeb083b781964e2234f285fe293f9536 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Fri, 31 Jan 2025 17:19:05 +0100
Subject: [PATCH 02/12] Tweaks

---
 vignettes/duckplyr.Rmd | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 7cb31041d..89008df74 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -42,7 +42,7 @@ If the duckplyr data.frame is accessed by...
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
+Therefore, the duckplyr package has a **safeguard called funneling**.
 A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
 By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
 
@@ -62,8 +62,7 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
 
 - convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
 
-In both cases, if an operation cannot be performed
-by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
+In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
 You can disable fallbacks by turning off automatic materialization.
 In that case, if an operation cannot be performed by duckplyr, your code will error.
@@ -75,13 +74,12 @@ With large datasets, you want:
 - input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
 - efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
 - the output to not clutter all the memory. Therefore you can make use of these features:
-  - funneling see vignette TODO ADD CURRENT NAME to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+  - funneling (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
   - computation to files using `compute_parquet()` or `compute_csv()`.
 
 
-
-A drawback of analyzing large data with duckplyr is that the limitations of duckplyr
-(unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
+
+A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
 
 ## How to improve duckplyr
 

From 1ff8ffc921ec6204332d0c72a6841da9584f190f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Fri, 31 Jan 2025 17:19:13 +0100
Subject: [PATCH 03/12] prudence

---
 vignettes/duckplyr.Rmd | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 89008df74..cc1f78ff2 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -42,9 +42,9 @@ If the duckplyr data.frame is accessed by...
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called funneling**.
-A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
-By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
+Therefore, the duckplyr package has a **safeguard called prudence**.
+A prudent data.frame cannot be materialized by default, it needs a call to a `compute()` function.
+By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _frugal_.
 
 ## How to use duckplyr
 
@@ -74,7 +74,7 @@ With large datasets, you want:
 - input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
 - efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
 - the output to not clutter all the memory. Therefore you can make use of these features:
-  - funneling (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+  - prudence (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
   - computation to files using `compute_parquet()` or `compute_csv()`.
 
 

From 070292e8c330e36e397d3923056aef85916a1eb7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Sat, 1 Feb 2025 07:01:17 +0100
Subject: [PATCH 04/12] Authorship, index

---
 vignettes/duckplyr.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index cc1f78ff2..f04550487 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -1,8 +1,9 @@
 ---
 title: "duckplyr"
 output: rmarkdown::html_vignette
+author: Maëlle Salmon
 vignette: >
-  %\VignetteIndexEntry{duckplyr}
+  %\VignetteIndexEntry{00 Get started}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---

From d622cce5c454f1fb0fabe7de674a12887d89c407 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Sat, 1 Feb 2025 07:01:33 +0100
Subject: [PATCH 05/12] Silence conflict output

---
 vignettes/duckplyr.Rmd | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index f04550487..0f049ba0c 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -13,6 +13,8 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>"
 )
+
+options(conflicts.policy = list(warn = FALSE))
 ```
 
 ```{r setup}

From dbb6f33d6212a9c4a9fae92f4c146e5bd04144ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Sat, 1 Feb 2025 07:01:43 +0100
Subject: [PATCH 06/12] Logic

---
 vignettes/duckplyr.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 0f049ba0c..97acb44d8 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -39,8 +39,8 @@ As a consequence, duckplyr is lazy on the inside for all DuckDB operations but e
 
 If the duckplyr data.frame is accessed by...
 
-- not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
 - duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+- not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 

From d353a06decfabdee6ee47a72fb25c87928e1b296 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?=
Date: Sat, 1 Feb 2025 07:03:10 +0100
Subject: [PATCH 07/12] Jargon

---
 vignettes/duckplyr.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 97acb44d8..48387f6e4 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -46,8 +46,8 @@ Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for t
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
 Therefore, the duckplyr package has a **safeguard called prudence**.
-A prudent data.frame cannot be materialized by default, it needs a call to a `compute()` function.
-By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _frugal_.
+A _frugal_ data.frame cannot be materialized by default, it needs a call to a `collect.duckplyr_df()` function.
+By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _thrifty_.
 
 ## How to use duckplyr
 

From 0815adbf7bce780fe967e5ba90e4d478514c3681 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Thu, 6 Feb 2025 13:36:14 +0100
Subject: [PATCH 08/12] prudence

---
 vignettes/duckplyr.Rmd | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 48387f6e4..0d31b93a2 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -44,9 +44,13 @@ If the duckplyr data.frame is accessed by...
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
-Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called prudence**.
-A _frugal_ data.frame cannot be materialized by default, it needs a call to a `collect.duckplyr_df()` function.
+Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
+Therefore, the duckplyr package has a **safeguard called prudence** with three levels.
+
+- `"lavish"`: automatically materialize _regardless of size_,
+- `"frugal"`: _never_ automatically materialize,
+- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
+
 By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _thrifty_.
 
 ## How to use duckplyr

From 79d59697a674c1f27aaefb204b638cfefed578af Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Thu, 6 Feb 2025 14:19:23 +0100
Subject: [PATCH 09/12] simpler phrasing

---
 vignettes/duckplyr.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 0d31b93a2..d2e0c400f 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -67,7 +67,7 @@ library(duckplyr)
 conflict_prefer("filter", "dplyr", quiet = TRUE)
 ```
 
-- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
+- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use conversion functions like `duckdb_tibble()` or `as_duckdb_tibble()`, or ingestion functions like `read_csv_duckdb()`.
 
 In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.

From df80e1cdcf1b230dfb26bf8389b95cd96849db6e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Thu, 6 Feb 2025 14:38:59 +0100
Subject: [PATCH 10/12] diagram placeholder

---
 vignettes/duckplyr.Rmd | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index d2e0c400f..e125a8306 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -21,7 +21,17 @@ options(conflicts.policy = list(warn = FALSE))
 library(duckplyr)
 ```
 
-## Design principles
+## What is duckplyr
+
+DIAGRAM, described with words.
+
+The duckplyr package is a drop-in replacement for dplyr that uses DuckDB for speed.
+Data is inputted using either conversion (from data in memory) or ingestion (from data in files) functions.
+The data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
+The duckplyr package performs the computation using DuckDB, or, if a specific operation is not supported, fallbacks to dplyr.
+The result can be materialized to memory, or computed temporarily, or computed to a file.
+
+### Design principles: lazy and eager
 
 The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
 These two facts create a tension:
@@ -35,7 +45,7 @@ These two facts create a tension:
 
 As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**.
 
-> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
 
 If the duckplyr data.frame is accessed by...
 
@@ -44,6 +54,8 @@ If the duckplyr data.frame is accessed by...
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
+### Memory protection
+
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
 Therefore, the duckplyr package has a **safeguard called prudence** with three levels.
 
@@ -84,11 +96,22 @@ With large datasets, you want:
   - prudence (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
   - computation to files using `compute_parquet()` or `compute_csv()`.
 
-
-
 A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
 
 ## How to improve duckplyr
 
-- telemetry
+You can help us make duckplyr better!
+
+### Automatically report fallbacks to inform development
+
+If you allow duckplyr to log and upload fallback reports, the duckplyr development team will have better data to decide on what feature to work next.
+See `vignette("telemetry")`.
+
+### Contribute
+
+Please report any issue especially regarding:
+
+
+
+
 - report issues, contribute

From a97d20e6a9e1af9e8ed7adac23f1969a70e7d01d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Thu, 6 Feb 2025 14:39:49 +0100
Subject: [PATCH 11/12] crossrefs

---
 vignettes/duckplyr.Rmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index e125a8306..e15028b9e 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -81,10 +81,12 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
 
 - convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use conversion functions like `duckdb_tibble()` or `as_duckdb_tibble()`, or ingestion functions like `read_csv_duckdb()`.
 
-In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
+In both cases, if an operation cannot be performed by duckplyr (see `vignette("limits")`), it will be outsourced to dplyr.
+
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
 You can disable fallbacks by turning off automatic materialization.
 In that case, if an operation cannot be performed by duckplyr, your code will error.
+See `vignette("fallback")`.
 
 ### For large data (instead of dbplyr)
 

From e57aa63c094b7e68c884f0b97b629718166dc6db Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?=
Date: Thu, 6 Feb 2025 14:42:22 +0100
Subject: [PATCH 12/12] contribute

---
 vignettes/duckplyr.Rmd | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index e15028b9e..50d132f96 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -111,9 +111,6 @@ See `vignette("telemetry")`.
 
 ### Contribute
 
-Please report any issue especially regarding:
+Please report any issue especially regarding unknown incompabilities. See `vignette("limits")`.
 
-
-
-
-- report issues, contribute
+You can also contribute further functionality to duckplyr, refer to our [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html) for details.
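
Reader's note on the series: the "lazy inside, eager outside" workflow that the vignette describes for in-memory data might look roughly like the sketch below. It is an illustration under assumptions, not code taken from the patches: the `flights` data and its columns are made up, and it uses only `duckdb_tibble()` (named in the vignette) plus ordinary dplyr verbs and `collect()`. Prudence settings are left at their defaults because the level names changed during this series.

```r
library(dplyr)
library(duckplyr)

# Conversion: build a duck frame from in-memory data (hypothetical example data).
flights <- duckdb_tibble(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  delay   = c(10, 20, 5, NA, 0)
)

# Same syntax as a dplyr pipeline; duckplyr runs it with DuckDB where it can
# and falls back to dplyr otherwise.
by_carrier <- flights |>
  filter(!is.na(delay)) |>
  summarise(mean_delay = mean(delay), .by = carrier)

# Explicit materialization into a regular tibble.
result <- collect(by_carrier)
```

For a frame with default settings the final `collect()` is optional, since handing `by_carrier` to code outside duckplyr would also materialize it; calling it explicitly just makes the materialization point visible.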