Commit 88cb100

maelle authored and krlmlr committed
docs: some vignette tweaks
1 parent dc2967f commit 88cb100

1 file changed: +28 -14 lines changed

vignettes/prudence.Rmd

Lines changed: 28 additions & 14 deletions
@@ -1,8 +1,8 @@
 ---
-title: "Memory protection: Prudence"
+title: "Memory protection: controlling automatic materialization"
 output: rmarkdown::html_vignette
 vignette: >
-%\VignetteIndexEntry{10 Memory protection: Prudence}
+%\VignetteIndexEntry{10 Memory protection: controlling automatic materialization}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
 ---
@@ -31,7 +31,7 @@ Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0)
 ```

 Unlike traditional data frames, duckplyr defers computation until absolutely necessary, allowing DuckDB to optimize execution.
-This article explains how to control the materialization of data while maintaining a seamless dplyr-like experience.
+This article explains how to control the materialization of data to maintain a seamless dplyr-like experience while remaining cautious of memory usage.



@@ -43,9 +43,9 @@ conflict_prefer("filter", "dplyr")

 ## Introduction

-Data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects.
+From a user's perspective, data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects.
 In particular, direct column access like `df$x`, or retrieving the number of rows with `nrow()`, works identically.
-Conceptually, duckplyr frames are "eager": from a user's perspective, they behave like regular data frames.
+Conceptually, duckplyr frames are "eager":

 ```{r}
 df <-
@@ -61,7 +61,10 @@ nrow(df)
 ```

 Under the hood, two key differences provide improved performance and usability:
-lazy materialization and prudence.
+
+- **lazy materialization**: Unlike traditional data frames, duckplyr defers computation until absolutely necessary, i.e. lazily, allowing DuckDB to optimize execution.
+- **prudence**: Automatic materialization is controllable, as automatic materialization of large data might otherwise inadvertently lead to memory problems.
+
 The term "prudence" is introduced here to set a clear distinction from the concept of "laziness", and because "control of automatic materialization" is a mouthful.

 ## Eager and lazy computation
@@ -111,7 +114,7 @@ system.time(mean_arr_delay_ewr$mean_arr_delay[[1]])

 ### Comparison

-The functionality is similar to lazy tables in dbplyr and lazy frames in dtplyr.
+The functionality is similar to lazy tables in [dbplyr](https://dbplyr.tidyverse.org/) and lazy frames in [dtplyr](https://dtplyr.tidyverse.org/).
 However, the behavior is different: at the time of writing, the internal structure of a lazy table or frame is different from a data frame, and columns cannot be accessed directly.

 | | **Eager** 😃 | **Lazy** 😴 |
@@ -121,7 +124,7 @@ However, the behavior is different: at the time of writing, the internal structu
 | **dtplyr** | ||
 | **duckplyr**|||

-In contrast, with dplyr, each intermediate step and also the final result is a proper data frame, and computed right away, forfeiting the opportunity for optimization:
+In contrast, with [dplyr](https://dplyr.tidyverse.org/), each intermediate step and also the final result is a proper data frame, and computed right away, forfeiting the opportunity for optimization:

 ```{r}
 system.time(
@@ -143,24 +146,24 @@ See also the [duckplyr: dplyr Powered by DuckDB](https://duckdb.org/2024/04/02/d

 Being both "eager" and "lazy" at the same time introduces a challenge:
 it is too easy to accidentally trigger computation,
-which is prohibitive if an intermediate result is too large.
+which is prohibitive if an intermediate result is too large to fit into memory.
 Prudence is a setting for duckplyr frames that limits the size of the data that is materialized automatically.

 ### Concept

 Three levels of prudence are available:

-- _Lavish_: materialize the data right away, as in the first example.
-- _Frugal_: throw an error when attempting to access the data.
-- _Thrifty_: materialize the data if it is small, otherwise throw an error.
+- _lavish_: always automatically materialize, as in the first example.
+- _frugal_: never automatically materialize, throw an error when attempting to access the data.
+- _thrifty_: only automatically materialize the data if it is small, otherwise throw an error.

 For lavish duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request.
 Once the results are computed, they are cached and subsequent requests are fast.
 This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage.
 This is the default for `duckdb_tibble()` and `as_duckdb_tibble()`.

 For frugal duckplyr frames, accessing a column or requesting the number of rows triggers an error.
-This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out.
+This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out and where the results are stored.
 Results can be materialized explicitly with `collect()` and other functions.

 Thrifty duckplyr frames are a compromise between lavish and frugal, discussed further below.
@@ -254,7 +257,7 @@ flights_frugal |>

 ### Comparison

-Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you *explicitly* request it with `collect.duckplyr_df()` or through other means.
+Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you _explicitly_ request it with `collect.duckplyr_df()` or through other means.
 However, frugal duckplyr frames can be converted to lavish ones at any time, and vice versa.
 In dtplyr and dbplyr, there are no lavish frames: collection always needs to be explicit.

@@ -291,4 +294,15 @@ flights_partial |>
 Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results.
 This is the default for the ingestion functions like `read_parquet_duckdb()`.

+
+## Conclusion
+
+The duckplyr package provides
+
+- a drop-in replacement for dplyr, which necessitates "eager" data frames that automatically materialize like in dplyr,
+- optimization by DuckDB, which means "lazy" evaluation where the data is materialized at the latest possible stage.
+
+Automatic materialization can be dangerous for memory with large data, so duckplyr provides a setting called `prudence` that controls automatic materialization:
+is the data automatically materialized _always_ ("lavish" frames), _never_ ("frugal" frames) or _up to a certain size_ ("thrifty" frames).
+
 See `vignette("large")` for more details on working with large data sets, `vignette("fallback")` for fallbacks to dplyr, and `vignette("limits")` for the operations supported by duckplyr.
