From 414bfe365a8eb51311fdfdda204700299322291b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Fri, 31 Jan 2025 15:18:23 +0100
Subject: [PATCH 01/12] start work on vignette

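A minimal sketch of the behaviour this vignette describes: lazy inside
duckplyr, eager for everything else. It assumes duckplyr and its DuckDB
backend are installed; the column names and values are made up for
illustration.

```r
library(duckplyr)

# Build a duck frame from in-memory data (illustrative values).
df <- duckdb_tibble(x = 1:5, y = c(2, 4, 6, 8, 10))

# dplyr verbs applied to a duck frame stay lazy: nothing has to be computed yet.
out <- df |>
  filter(x > 2) |>
  summarise(total = sum(y))

# Code outside duckplyr, here `$`, triggers materialization transparently.
total <- out$total

# Explicit collection is also possible and returns a regular tibble.
res <- collect(out)
```

No `collect()` is needed before `out$total`, which is what the "drop-in
replacement" claim relies on.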
---
 vignettes/duckplyr.Rmd | 89 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 vignettes/duckplyr.Rmd

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
new file mode 100644
index 000000000..7cb31041d
--- /dev/null
+++ b/vignettes/duckplyr.Rmd
@@ -0,0 +1,89 @@
+---
+title: "duckplyr"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{duckplyr}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+```{r setup}
+library(duckplyr)
+```
+
+## Design principles
+
+The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
+These two facts create a tension:
+
+-   When using dplyr, we are not used to explicitly collecting results: the data.frames are eager by default.
+    Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
+    Therefore, _duckplyr needs eagerness_!
+
+-   The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
+    Therefore, _duckplyr needs laziness_!
+
+As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**.
+
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."
+
+If the duckplyr data.frame is accessed by...
+
+-   not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
+-   duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+
+Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
+
+Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
+Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
+A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
+By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
+
+## How to use duckplyr
+
+### For normal-sized data (instead of dplyr)
+
+To replace dplyr with duckplyr, you can either
+
+- load duckplyr and then keep your pipeline as is.
+
+```r
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+```
+
+- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
+
+In both cases, if an operation cannot be performed 
+by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr. 
+You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
+You can disable fallbacks by turning off automatic materialization.
+In that case, if an operation cannot be performed by duckplyr, your code will error.
+
+### For large data (instead of dbplyr)
+
+With large datasets, you want:
+
+- input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
+- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
+- the output to not clutter all the memory. Therefore you can make use of these features:
+    - funneling see vignette TODO ADD CURRENT NAME to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+    - computation to files using  `compute_parquet()` or `compute_csv()`.
+    
+
+
+A drawback of analyzing large data with duckplyr is that the limitations of duckplyr 
+(unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
+
+## How to improve duckplyr
+
+- telemetry
+- report issues, contribute

From 460c4be7aeb083b781964e2234f285fe293f9536 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Fri, 31 Jan 2025 17:19:05 +0100
Subject: [PATCH 02/12] Tweaks

---
 vignettes/duckplyr.Rmd | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 7cb31041d..89008df74 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -42,7 +42,7 @@ If the duckplyr data.frame is accessed by...
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called funneling** (in the current development version of the package).
+Therefore, the duckplyr package has a **safeguard called funneling**.
 A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
 By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
 
@@ -62,8 +62,7 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
 
 - convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
 
-In both cases, if an operation cannot be performed 
-by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr. 
+In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
 You can disable fallbacks by turning off automatic materialization.
 In that case, if an operation cannot be performed by duckplyr, your code will error.
@@ -75,13 +74,12 @@ With large datasets, you want:
 - input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
 - efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
 - the output to not clutter all the memory. Therefore you can make use of these features:
-    - funneling see vignette TODO ADD CURRENT NAME to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+    - funneling (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
     - computation to files using  `compute_parquet()` or `compute_csv()`.
-    
 
 
-A drawback of analyzing large data with duckplyr is that the limitations of duckplyr 
-(unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
+
+A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
 
 ## How to improve duckplyr
 

From 1ff8ffc921ec6204332d0c72a6841da9584f190f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Fri, 31 Jan 2025 17:19:13 +0100
Subject: [PATCH 03/12] prudence

---
 vignettes/duckplyr.Rmd | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 89008df74..cc1f78ff2 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -42,9 +42,9 @@ If the duckplyr data.frame is accessed by...
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called funneling**.
-A funneled data.frame cannot be materialized by default, it needs a call to a `compute()` function.
-By default, duckplyr frames are _unfunneled_, but duckplyr frames created from Parquet data (presumedly large) are _funneled_.
+Therefore, the duckplyr package has a **safeguard called prudence**.
+A prudent data.frame cannot be materialized by default, it needs a call to a `compute()` function.
+By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _frugal_.
 
 ## How to use duckplyr
 
@@ -74,7 +74,7 @@ With large datasets, you want:
 - input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
 - efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use another syntax than dplyr.
 - the output to not clutter all the memory. Therefore you can make use of these features:
-    - funneling (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
+    - prudence (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
     - computation to files using  `compute_parquet()` or `compute_csv()`.
 
 

From 070292e8c330e36e397d3923056aef85916a1eb7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Sat, 1 Feb 2025 07:01:17 +0100
Subject: [PATCH 04/12] Authorship, index

---
 vignettes/duckplyr.Rmd | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index cc1f78ff2..f04550487 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -1,8 +1,9 @@
 ---
 title: "duckplyr"
 output: rmarkdown::html_vignette
+author: Maëlle Salmon
 vignette: >
-  %\VignetteIndexEntry{duckplyr}
+  %\VignetteIndexEntry{00 Get started}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---

From d622cce5c454f1fb0fabe7de674a12887d89c407 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Sat, 1 Feb 2025 07:01:33 +0100
Subject: [PATCH 05/12] Silence conflict output

---
 vignettes/duckplyr.Rmd | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index f04550487..0f049ba0c 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -13,6 +13,8 @@ knitr::opts_chunk$set(
   collapse = TRUE,
   comment = "#>"
 )
+
+options(conflicts.policy = list(warn = FALSE))
 ```
 
 ```{r setup}

From dbb6f33d6212a9c4a9fae92f4c146e5bd04144ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Sat, 1 Feb 2025 07:01:43 +0100
Subject: [PATCH 06/12] Logic

---
 vignettes/duckplyr.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 0f049ba0c..97acb44d8 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -39,8 +39,8 @@ As a consequence, duckplyr is lazy on the inside for all DuckDB operations but e
 
 If the duckplyr data.frame is accessed by...
 
--   not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
 -   duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+-   not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 

From d353a06decfabdee6ee47a72fb25c87928e1b296 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kirill=20M=C3=BCller?= <kirill@cynkra.com>
Date: Sat, 1 Feb 2025 07:03:10 +0100
Subject: [PATCH 07/12] Jargon

---
 vignettes/duckplyr.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 97acb44d8..48387f6e4 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -46,8 +46,8 @@ Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for t
 
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
 Therefore, the duckplyr package has a **safeguard called prudence**.
-A prudent data.frame cannot be materialized by default, it needs a call to a `compute()` function.
-By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _frugal_.
+A _frugal_ data.frame cannot be materialized by default, it needs a call to a `collect.duckplyr_df()` function.
+By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _thrifty_.
 
 ## How to use duckplyr
 

From 0815adbf7bce780fe967e5ba90e4d478514c3681 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Thu, 6 Feb 2025 13:36:14 +0100
Subject: [PATCH 08/12] prudence

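A rough sketch of how a prudence level might be chosen when creating a
duck frame. The `.prudence` argument name is an assumption that this patch
does not show, and the strictest level (called `"frugal"` here) is left
out of the code because its name may differ between duckplyr versions.

```r
library(duckplyr)

# Default: a lavish frame materializes automatically, whatever its size.
lavish <- duckdb_tibble(x = 1:3)

# Thrifty: automatic materialization only up to the size limit.
# `.prudence` is an assumed argument name, see the note above.
thrifty <- duckdb_tibble(x = 1:3, .prudence = "thrifty")

# Three cells are far below the limit, so plain column access still works.
total <- sum(thrifty$x)
```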
---
 vignettes/duckplyr.Rmd | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 48387f6e4..0d31b93a2 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -44,9 +44,13 @@ If the duckplyr data.frame is accessed by...
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
-Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all RAM?
-Therefore, the duckplyr package has a **safeguard called prudence**.
-A _frugal_ data.frame cannot be materialized by default, it needs a call to a `collect.duckplyr_df()` function.
+Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
+Therefore, the duckplyr package has a **safeguard called prudence** with three levels.
+
+- `"lavish"`: automatically materialize _regardless of size_,
+- `"frugal"`: _never_ automatically materialize,
+- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
+
 By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _thrifty_.
 
 ## How to use duckplyr

From 79d59697a674c1f27aaefb204b638cfefed578af Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Thu, 6 Feb 2025 14:19:23 +0100
Subject: [PATCH 09/12] simpler phrasing

---
 vignettes/duckplyr.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index 0d31b93a2..d2e0c400f 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -67,7 +67,7 @@ library(duckplyr)
 conflict_prefer("filter", "dplyr", quiet = TRUE)
 ```
 
-- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use `duckdb_tibble()`, `as_duckdb_tibble()` or read data using `read_*()` functions like `read_csv_duckdb()`.
+- convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use conversion functions like `duckdb_tibble()` or `as_duckdb_tibble()`, or ingestion functions like `read_csv_duckdb()`.
 
 In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.

From df80e1cdcf1b230dfb26bf8389b95cd96849db6e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Thu, 6 Feb 2025 14:38:59 +0100
Subject: [PATCH 10/12] diagram placeholder

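The flow that the diagram placeholder describes (input, pipeline, output to
a file) could look roughly like the sketch below. It assumes duckplyr is
installed; the temporary path, column names and values are hypothetical.

```r
library(duckplyr)

# Hypothetical output location for this sketch.
path <- tempfile(fileext = ".parquet")

# Conversion from in-memory data, then computation straight to a Parquet file.
duckdb_tibble(id = 1:4, value = c(10, 20, 30, 40)) |>
  filter(value > 15) |>
  compute_parquet(path)

# Ingestion from the file; only the small aggregated result is collected.
res <- read_parquet_duckdb(path) |>
  summarise(n = n(), total = sum(value)) |>
  collect()
```

Writing intermediate results to a file keeps the full table out of R's
memory, which matters most for the large-data use case.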
---
 vignettes/duckplyr.Rmd | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index d2e0c400f..e125a8306 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -21,7 +21,17 @@ options(conflicts.policy = list(warn = FALSE))
 library(duckplyr)
 ```
 
-## Design principles
+## What is duckplyr
+
+DIAGRAM, described with words.
+
+The duckplyr package is a drop-in replacement for dplyr that uses DuckDB for speed.
+Data is read in using either conversion (from data in memory) or ingestion (from data in files) functions.
+The data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
+The duckplyr package performs the computation using DuckDB, or, if a specific operation is not supported, falls back to dplyr.
+The result can be materialized to memory, or computed temporarily, or computed to a file.
+
+### Design principles: lazy and eager
 
 The duckplyr package uses **DuckDB under the hood** but is also a **drop-in replacement for dplyr**.
 These two facts create a tension:
@@ -35,7 +45,7 @@ These two facts create a tension:
 
 As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports **deferred evaluation**.
 
-> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed."
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
 
 If the duckplyr data.frame is accessed by...
 
@@ -44,6 +54,8 @@ If the duckplyr data.frame is accessed by...
 
 Therefore, duckplyr can be **both lazy** (within itself) **and not lazy** (for the outside world).
 
+### Memory protection
+
 Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
 Therefore, the duckplyr package has a **safeguard called prudence** with three levels.
 
@@ -84,11 +96,22 @@ With large datasets, you want:
     - prudence (see `vignette("funnel")`) to disable automatic materialization completely or to disable automatic materialization up to a certain output size.
     - computation to files using  `compute_parquet()` or `compute_csv()`.
 
-
-
 A drawback of analyzing large data with duckplyr is that the limitations of duckplyr (unsupported verbs or data types, see `vignette("limits")`) won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
 
 ## How to improve duckplyr
 
-- telemetry
+You can help us make duckplyr better!
+
+### Automatically report fallbacks to inform development
+
+If you allow duckplyr to log and upload fallback reports, the duckplyr development team will have better data to decide which feature to work on next.
+See `vignette("telemetry")`.
+
+### Contribute
+
+Please report any issue especially regarding:
+
+
+
+
 - report issues, contribute

From a97d20e6a9e1af9e8ed7adac23f1969a70e7d01d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Thu, 6 Feb 2025 14:39:49 +0100
Subject: [PATCH 11/12] crossrefs

---
 vignettes/duckplyr.Rmd | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index e125a8306..e15028b9e 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -81,10 +81,12 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
 
 - convert individual data.frames to duck frames which allows you to control their automatic materialization parameters. To do that, you use conversion functions like `duckdb_tibble()` or `as_duckdb_tibble()`, or ingestion functions like `read_csv_duckdb()`.
 
-In both cases, if an operation cannot be performed by duckplyr (see `vignettes("limits")`), it will be outsourced to dplyr.
+In both cases, if an operation cannot be performed by duckplyr (see `vignette("limits")`), it will be outsourced to dplyr.
+
 You can choose to be informed about fallbacks to dplyr, see `?fallback_config`.
 You can disable fallbacks by turning off automatic materialization.
 In that case, if an operation cannot be performed by duckplyr, your code will error.
+See `vignette("fallback")`.
 
 ### For large data (instead of dbplyr)
 

From e57aa63c094b7e68c884f0b97b629718166dc6db Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ma=C3=ABlle=20Salmon?= <maelle.salmon@yahoo.se>
Date: Thu, 6 Feb 2025 14:42:22 +0100
Subject: [PATCH 12/12] contribute

---
 vignettes/duckplyr.Rmd | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/vignettes/duckplyr.Rmd b/vignettes/duckplyr.Rmd
index e15028b9e..50d132f96 100644
--- a/vignettes/duckplyr.Rmd
+++ b/vignettes/duckplyr.Rmd
@@ -111,9 +111,6 @@ See `vignette("telemetry")`.
 
 ### Contribute
 
-Please report any issue especially regarding:
+Please report any issues, especially regarding unknown incompatibilities. See `vignette("limits")`.
 
-
-
-
-- report issues, contribute
+You can also contribute further functionality to duckplyr; refer to our [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html) for details.