Commit 00b4546

docs: some vignette tweaks
1 parent 1179bf5 commit 00b4546

File tree: 1 file changed, +69 −57 lines


vignettes/prudence.Rmd

Lines changed: 69 additions & 57 deletions
@@ -1,8 +1,8 @@
 ---
-title: "Memory protection: Prudence"
+title: "Memory protection: controlling automatic materialization"
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{10 Memory protection: Prudence}
+  %\VignetteIndexEntry{10 Memory protection: controlling automatic materialization}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
@@ -30,22 +30,19 @@ knitr::opts_chunk$set(
 Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
 ```
 
-Unlike traditional data frames, duckplyr defers computation until absolutely necessary, allowing DuckDB to optimize execution.
-This article explains how to control the materialization of data while maintaining a seamless dplyr-like experience.
-
-
+This article explains how to control the materialization of data to maintain a seamless dplyr-like experience as well as to protect memory.
 
 ```{r attach}
 library(conflicted)
 library(dplyr)
 conflict_prefer("filter", "dplyr")
 ```
 
-## Introduction
+## dplyr drop-in replacement: eager data frames
 
-Data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects.
+Data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects from a user's perspective.
 In particular, direct column access like `df$x`, or retrieving the number of rows with `nrow()`, works identically.
-Conceptually, duckplyr frames are "eager": from a user's perspective, they behave like regular data frames.
+Therefore, conceptually, duckplyr frames are "eager".
 
 ```{r}
 df <-
@@ -60,11 +57,14 @@ df$y
 nrow(df)
 ```
 
-Under the hood, two key differences provide improved performance and usability:
-lazy materialization and prudence.
+Under the hood, though, two key differences provide improved performance and usability:
+
+- **lazy materialization**: Unlike traditional data frames, duckplyr defers computation until absolutely necessary, i.e. lazily, allowing DuckDB to optimize execution.
+- **prudence**: Automatic materialization is controllable, since automatically materializing large data could otherwise inadvertently lead to memory problems.
+
 The term "prudence" is introduced here to set a clear distinction from the concept of "laziness", and because "control of automatic materialization" is a mouthful.
 
-## Eager and lazy computation
+## DuckDB optimization: lazy evaluation
 
 For a duckplyr frame that is the result of a dplyr operation, accessing column data or retrieving the number of rows will trigger a computation that is carried out by DuckDB, not dplyr.
 In this sense, duckplyr frames are also "lazy": the computation is deferred until the last possible moment, allowing DuckDB to optimize the whole pipeline.
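The deferred computation described in this hunk can be sketched as follows. This is an illustrative sketch, not part of the commit: it assumes the duckplyr and dplyr packages are installed, and the data and object names (`lazy_sum`) are made up for the example.

```r
library(dplyr)

# Building the pipeline records the operations; nothing is computed yet.
lazy_sum <-
  data.frame(g = c("a", "a", "b"), x = c(1, 2, 3)) |>
  duckplyr::as_duckdb_tibble() |>
  summarise(total = sum(x), .by = g)

# Accessing a column triggers DuckDB to run the whole pipeline at once.
lazy_sum$total
```

With the default (lavish) prudence, the access succeeds and the result is cached for subsequent requests.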
@@ -109,10 +109,11 @@ The result becomes available when accessed:
 system.time(mean_arr_delay_ewr$mean_arr_delay[[1]])
 ```
 
-### Comparison
+### Comparison with similar tools
 
-The functionality is similar to lazy tables in dbplyr and lazy frames in dtplyr.
+The functionality is similar to lazy tables in [dbplyr](https://dbplyr.tidyverse.org/) and lazy frames in [dtplyr](https://dtplyr.tidyverse.org/).
 However, the behavior is different: at the time of writing, the internal structure of a lazy table or frame is different from a data frame, and columns cannot be accessed directly.
+Users need to explicitly `collect()` the data; the data frame is not "eager" at all.
 
 |             | **Eager** 😃 | **Lazy** 😴 |
 |-------------|:------------:|:-----------:|
@@ -121,7 +122,7 @@ However, the behavior is different: at the time of writing, the internal structu
 | **dtplyr**  |      ❌      |     ✅      |
 | **duckplyr**|      ✅      |     ✅      |
 
-In contrast, with dplyr, each intermediate step and also the final result is a proper data frame, and computed right away, forfeiting the opportunity for optimization:
+In contrast, with [dplyr](https://dplyr.tidyverse.org/), each intermediate step and also the final result is a proper data frame, computed right away, forfeiting the opportunity for optimization:
 
 ```{r}
 system.time(
@@ -139,31 +140,65 @@ system.time(
 
 See also the [duckplyr: dplyr Powered by DuckDB](https://duckdb.org/2024/04/02/duckplyr.html) blog post for more information.
 
-## Prudence
+## Memory protection: control of automatic materialization with `prudence`
 
 Being both "eager" and "lazy" at the same time introduces a challenge:
-it is too easy to accidentally trigger computation,
-which is prohibitive if an intermediate result is too large.
-Prudence is a setting for duckplyr frames that limits the size of the data that is materialized automatically.
+**it is too easy to accidentally trigger computation**,
+which is prohibitive if an intermediate result is too large to fit into memory.
 
-### Concept
+Fortunately, duckplyr frames have a setting called `prudence` that limits the size of the data that is materialized automatically,
+and that the user can choose based on the data size.
+
+### When to automatically materialize?
 
 Three levels of prudence are available:
 
-- _Lavish_: materialize the data right away, as in the first example.
-- _Frugal_: throw an error when attempting to access the data.
-- _Thrifty_: materialize the data if it is small, otherwise throw an error.
+- __lavish__: _always_ automatically materialize, as in the first example.
+- __frugal__: _never_ automatically materialize; throw an error when attempting to access the data.
+- __thrifty__: automatically materialize the data _if it is small_, otherwise throw an error.
 
 For lavish duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request.
 Once the results are computed, they are cached and subsequent requests are fast.
 This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage.
 This is the default for `duckdb_tibble()` and `as_duckdb_tibble()`.
 
 For frugal duckplyr frames, accessing a column or requesting the number of rows triggers an error.
-This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out.
+This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out and how (to memory, or to a file).
 Results can be materialized explicitly with `collect()` and other functions.
 
-Thrifty duckplyr frames are a compromise between lavish and frugal, discussed further below.
+Thrifty duckplyr frames are a compromise between lavish and frugal, discussed below.
+
+### Thrift
+
+Thrifty is a compromise between frugal and lavish.
+Materialization is allowed for data up to a certain size, measured in cells (values) and rows in the resulting data frame.
+
+```{r}
+nrow(flights)
+flights_partial <-
+  flights |>
+  duckplyr::as_duckdb_tibble(prudence = "thrifty")
+```
+
+With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns).
+
+```{r error = TRUE}
+flights_partial |>
+  select(origin, dest, dep_delay, arr_delay) |>
+  nrow()
+```
+
+The original input is too large to be materialized, so the operation fails.
+On the other hand, the result after aggregation is small enough to be materialized:
+
+```{r}
+flights_partial |>
+  count(origin) |>
+  nrow()
+```
+
+Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results.
+This is the default for ingestion functions like `read_parquet_duckdb()`.
 
 
 ### Example
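The frugal behavior added in the hunk above can be sketched in miniature. This is an illustrative sketch, not part of the commit: it assumes duckplyr and dplyr are installed, and the toy data and the name `frugal_df` are made up for the example.

```r
library(dplyr)

# A frugal frame never materializes automatically.
frugal_df <-
  data.frame(x = 1:5) |>
  duckplyr::as_duckdb_tibble(prudence = "frugal") |>
  mutate(y = x * 2)

# Direct access errors instead of silently computing:
try(frugal_df$y)

# Explicit collection is always available:
sum(collect(frugal_df)$y)
```

The `try()` call is only there to show that access fails; in real code the error is the point, forcing a deliberate `collect()`.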
@@ -198,7 +233,7 @@ flights_frugal[[1]]
 ```
 
 
-### Enforcing DuckDB operation
+### Side effect: Enforcing DuckDB operation
 
 For operations not supported by duckplyr, the original dplyr implementation is used as a fallback.
 As the original dplyr implementation accesses columns directly, the data must be materialized before a fallback can be executed.
@@ -224,7 +259,7 @@ flights_frugal |>
 By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way.
 
 
-### From frugal to lavish
+### Conversion between prudence levels
 
 A frugal duckplyr frame can be converted to a lavish one with `as_duckdb_tibble(prudence = "lavish")`.
 The `collect.duckplyr_df()` method triggers computation and converts to a plain tibble.
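The conversion described in these context lines can be sketched as follows. This is an illustrative sketch, not part of the commit: it assumes duckplyr and dplyr are installed, and the toy data and the names `frugal_df`/`lavish_df` are made up for the example.

```r
library(dplyr)

frugal_df <-
  data.frame(x = 1:3) |>
  duckplyr::as_duckdb_tibble(prudence = "frugal") |>
  mutate(y = x + 1)

# Relax prudence: automatic materialization is allowed again.
lavish_df <- duckplyr::as_duckdb_tibble(frugal_df, prudence = "lavish")
lavish_df$y

# Or materialize explicitly to a plain tibble:
class(collect(frugal_df))
```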
@@ -252,43 +287,20 @@ flights_frugal |>
   class()
 ```
 
-### Comparison
+### Comparison with similar tools
 
 Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you *explicitly* request it with `collect.duckplyr_df()` or through other means.
 However, frugal duckplyr frames can be converted to lavish ones at any time, and vice versa.
 In dtplyr and dbplyr, there are no lavish frames: collection always needs to be explicit.
 
+## Conclusion
 
-## Thrift
-
-Thrifty is a compromise between frugal and lavish.
-Materialization is allowed for data up to a certain size, measured in cells (values) and rows in the resulting data frame.
-
-```{r}
-nrow(flights)
-flights_partial <-
-  flights |>
-  duckplyr::as_duckdb_tibble(prudence = "thrifty")
-```
-
-With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns).
-
-```{r error = TRUE}
-flights_partial |>
-  select(origin, dest, dep_delay, arr_delay) |>
-  nrow()
-```
-
-The original input is too large to be materialized, so the operation fails.
-On the other hand, the result after aggregation is small enough to be materialized:
+The duckplyr package provides
 
-```{r}
-flights_partial |>
-  count(origin) |>
-  nrow()
-```
+- a drop-in replacement for dplyr, which necessitates "eager" data frames that automatically materialize like in dplyr,
+- optimization by DuckDB, which means lazy evaluation where the data is materialized at the latest possible stage.
 
-Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results.
-This is the default for the ingestion functions like `read_parquet_duckdb()`.
+Automatic materialization can be dangerous for memory with large data, so duckplyr provides a setting called `prudence` that controls automatic materialization:
+is the data automatically materialized _always_ ("lavish" frames), _never_ ("frugal" frames), or only _up to a certain size_ ("thrifty" frames)?
 
 See `vignette("large")` for more details on working with large data sets, `vignette("fallback")` for fallbacks to dplyr, and `vignette("limits")` for the operations supported by duckplyr.
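The three prudence levels summarized in the conclusion can be contrasted side by side. This is an illustrative sketch, not part of the commit: it assumes duckplyr and dplyr are installed, and the toy data is made up for the example.

```r
library(dplyr)

d <- data.frame(x = 1:4)

lavish  <- duckplyr::as_duckdb_tibble(d, prudence = "lavish")
frugal  <- duckplyr::as_duckdb_tibble(d, prudence = "frugal")
thrifty <- duckplyr::as_duckdb_tibble(d, prudence = "thrifty")

nrow(mutate(lavish, y = x + 1))       # always materializes automatically
try(nrow(mutate(frugal, y = x + 1)))  # never materializes: errors
nrow(mutate(thrifty, y = x + 1))      # small result: materializes
```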
