@gadenbuie
Contributor

Closes #51

How does this work?

pkgload::load_all("~/work/posit-dev/querychat/pkg-r")
#> ℹ Loading querychat
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
duckdb::dbWriteTable(con, "mtcars", mtcars)

mtcars_db <- tbl(con, "mtcars")

Simple tbl source

First, we can create a new data source from the tbl() object.

src <- TblLazySource$new(mtcars_db)
(res <- src$execute_query("SELECT * FROM mtcars WHERE cyl > 4"))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # ℹ more rows

This returns a tbl() that can be chained into further dplyr operations.

res |> count(cyl, gear)
#> # Source:   SQL [?? x 3]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>     cyl  gear     n
#>   <dbl> <dbl> <dbl>
#> 1     6     5     1
#> 2     6     3     2
#> 3     8     3    12
#> 4     6     4     4
#> 5     8     5     2

Complicated tbl source

This same process even works for more complicated tibbles, like the result of
a dplyr pipeline on SQL tibbles.

mtcars_6_8_cyl <- mtcars_db |> inner_join(mtcars_db |> dplyr::filter(cyl > 4))
#> Joining with `by = join_by(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear,
#> carb)`
src <- TblLazySource$new(mtcars_6_8_cyl)

And again, the result is a tbl() that can be folded into further dplyr
operations.

(res2 <- src$execute_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6"))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  4  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
#> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
#> # ℹ more rows
res2 |> count(cyl, gear)
#> # Source:   SQL [?? x 3]
#> # Database: DuckDB 1.4.1 [root@Darwin 25.0.0:R 4.5.2/:memory:]
#>     cyl  gear     n
#>   <dbl> <dbl> <dbl>
#> 1     6     3     2
#> 2     8     5     2
#> 3     6     4     4
#> 4     6     5     1
#> 5     8     3    12

This works by extracting the SQL for the dplyr pipeline at the point where we
create the data source. For complicated queries at least, that SQL becomes a
local CTE, letting the LLM write queries against the CTE as if it were a
fixed table.

src$complete_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6") |> cat()
#> Error in cat(src$complete_query("SELECT * FROM mtcars_6_8_cyl WHERE gear < 6")): attempt to apply non-function
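The reprex call above errored, so as a stand-in here is a minimal sketch of the wrapping step itself, modeled on the sprintf() call in this PR's diff. wrap_in_cte() and the inner_sql string are hypothetical names for illustration, and DBI::ANSI() stands in for a real connection so the snippet runs without a database.

```r
library(DBI)

# Hypothetical helper: wrap the SQL behind a lazy tbl in a CTE named
# `table_name`, then append the LLM-written query that targets that name.
wrap_in_cte <- function(conn, table_name, inner_sql, query) {
  sprintf(
    "WITH %s AS (\n%s\n)\n%s",
    as.character(DBI::dbQuoteIdentifier(conn, table_name)),
    inner_sql,
    query
  )
}

# Placeholder for the SQL that dbplyr would render for the pipeline
inner_sql <- "SELECT * FROM mtcars WHERE cyl > 4"

full_query <- wrap_in_cte(
  DBI::ANSI(),
  "mtcars_6_8_cyl",
  inner_sql,
  "SELECT * FROM mtcars_6_8_cyl WHERE gear < 6"
)
cat(full_query)
```

With a real connection, the inner SQL would presumably come from something like dbplyr::remote_query(); that part is an assumption here, not taken from the PR.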

Amazingly, we can even apply this strategy to get the schema of the CTE. This
took a small amount of updating to get_schema_impl() to make it work, but
the core logic is exactly the same.

src$get_schema() |> cat()
#> Table: mtcars_6_8_cyl
#> Columns:
#> - mpg (FLOAT)
#>   Range: 10.4 to 21.4
#> - cyl (FLOAT)
#>   Range: 6 to 8
#> - disp (FLOAT)
#>   Range: 145 to 472
#> - hp (FLOAT)
#>   Range: 105 to 335
#> - drat (FLOAT)
#>   Range: 2.76 to 4.22
#> - wt (FLOAT)
#>   Range: 2.62 to 5.424
#> - qsec (FLOAT)
#>   Range: 14.5 to 20.22
#> - vs (FLOAT)
#>   Range: 0 to 1
#> - am (FLOAT)
#>   Range: 0 to 1
#> - gear (FLOAT)
#>   Range: 3 to 5
#> - carb (FLOAT)
#>   Range: 1 to 8

if (!is_missing(table_name)) {
  check_sql_table_name(table_name)
  self$table_name <- table_name
  use_cte <- identical(table_name, remote_name %||% remote_table)
Contributor

remote_table isn't defined

Contributor

Also, is the logic here inverted? That is, should it be use_cte <- !identical(table_name, remote_name)?
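
If the logic is indeed inverted, the check might look like the sketch below. It assumes remote_name is NULL whenever the tbl is not a plain table reference (which is what dbplyr::remote_name() returns in that case); needs_cte() is a hypothetical name, not something in the PR.

```r
# Hypothetical helper: a CTE is only needed when the requested table name
# does not match the name of an existing remote table.
needs_cte <- function(table_name, remote_name) {
  !identical(table_name, remote_name)
}

# Plain table reference: the name matches, so queries can hit the table directly
needs_cte("mtcars", "mtcars")

# Pipeline result: no remote name (NULL), so the pipeline SQL must be wrapped
needs_cte("mtcars_6_8_cyl", NULL)
```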

Comment on lines +439 to +440
# Collect various signals to infer the table name
obj_name <- deparse1(substitute(tbl))
Contributor

@cpsievert Dec 15, 2025

I think we should be making table_name required and keep any substitute() magic further up in the call stack.

Somewhat relatedly, do you see any utility to exporting the DataSource implementations (i.e., DataFrameSource, etc)?
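
One way to read this suggestion, as a rough sketch: the low-level constructor takes table_name as a plain required string, and only the user-facing entry point infers a name from the call. Both function names below are hypothetical stand-ins, not the package's actual API.

```r
# Hypothetical low-level constructor: `table_name` is required and validated,
# with no non-standard evaluation.
new_lazy_source <- function(tbl, table_name) {
  stopifnot(is.character(table_name), length(table_name) == 1L)
  list(tbl = tbl, table_name = table_name)
}

# Hypothetical user-facing wrapper: the only place that inspects the call
# to guess a table name when none is given.
make_source <- function(tbl, table_name = NULL) {
  if (is.null(table_name)) {
    # substitute() must run here, before `tbl` is passed further down
    table_name <- deparse1(substitute(tbl))
  }
  new_lazy_source(tbl, table_name)
}

inferred <- make_source(mtcars)
explicit <- make_source(mtcars, table_name = "cars")
```

Keeping substitute() at the top means the inner constructor behaves the same whether it is called directly, from tests, or from another wrapper.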

Comment on lines +525 to +530
sprintf(
  "WITH %s AS (\n%s\n)\n%s",
  DBI::dbQuoteIdentifier(private$conn, self$table_name),
  private$tbl_cte,
  query
)
Contributor

Very clever! 💯


output$dt <- DT::renderDT({
  df <- qc_vals$df()
  if (inherits(df, "tbl_sql")) {
Contributor Author

Check that data_source is a TblLazySource instead of sniffing the df result

Contributor Author

All of this isn't really necessary -- I've included it because dplyr is in Suggests and I just want to make sure we've gone past a check_installed("dplyr") (which happens if you've created a TblLazySource).

Development

Successfully merging this pull request may close these issues.

(R) Return a dbplyr::tbl object from querychat_server

2 participants