diff --git a/docs/collect/configure.md b/docs/collect/configure.md
index ea81c83..fa8aaf4 100644
--- a/docs/collect/configure.md
+++ b/docs/collect/configure.md
@@ -15,11 +15,11 @@ Tailpipe [plugins](/docs/collect/plugins) define tables for common log sources a
 If your logs are not in a standard format or are not currently supported by a plugin, you can create [custom tables](/docs/collect/custom-tables) to collect data from arbitrary log files and other sources.
 
-Tables are implemented as DuckDB views over the Parquet files. Tailpipe creates tables (that is, creates views in the `tailpipe.db` database) based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
+Tailpipe creates DuckLake tables based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
 
-When you run `tailpipe query` or `tailpipe connect`, Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and adds a view for the table. The view definitions will include qualifiers that implement any filter arguments that you specify (`--from`,`--to`,`--index`,`--partition`).
+When you run `tailpipe query` or `tailpipe connect`, Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and applies any filter arguments you specify (`--from`, `--to`, `--index`, `--partition`) to the table views.
 
-You can see what tables are available with the `tailpipe plugin list` command.
+You can see what tables are available with the `tailpipe table list` command.
 
 ## Partitions
 
 A partition represents data gathered from a [source](/docs/collect/configure#sources). Partitions are defined [in HCL](/docs/reference/config-files/partition) and are required for [collection](/docs/collect/collect).
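+
+For example, once a partition is defined you collect it by its `table.partition` name. A minimal sketch (the `prod` partition name here is hypothetical):
+
+```bash
+# Collect the hypothetical aws_cloudtrail_log.prod partition,
+# limiting collection to rows on or after January 1, 2025
+tailpipe collect aws_cloudtrail_log.prod --from 2025-01-01
+```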
@@ -61,20 +61,22 @@ The standard partitioning/hive structure enables efficient queries that only nee
 tp_table=aws_cloudtrail_log
 └── tp_partition=prod
     └── tp_index=default
-        ├── tp_date=2024-12-31
-        │   └── data_20250106140713_740378_0.parquet
-        ├── tp_date=2025-01-01
-        │   └── data_20250106140713_740378_0.parquet
-        ├── tp_date=2025-01-02
-        │   └── snap_20250106140823_952067.parquet
-        ├── tp_date=2025-01-03
-        │   └── snap_20250106140824_011599.parquet
-        ├── tp_date=2025-01-04
-        │   └── data_20250106140752_829722_0.parquet
-        ├── tp_date=2025-01-05
-        │   └── snap_20250106140824_073116.parquet
-        └── tp_date=2025-01-06
-            └── snap_20250106140824_131637.parquet
+        └── year=2024
+            ├── month=7
+            │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+            │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+            │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+            │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
+            ├── month=8
+            │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+            │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+            │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+            │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
+            └── month=9
+                ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+                ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+                ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+                └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
 ```
diff --git a/docs/collect/manage-data.md b/docs/collect/manage-data.md
index 296dd63..0f448b1 100644
--- a/docs/collect/manage-data.md
+++ b/docs/collect/manage-data.md
@@ -292,16 +292,16 @@
 ## Connecting from Other Tools
 
-You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, you can generate a new db file for the connection using `tailpipe connect`:
+You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, generate a SQL script that initializes DuckDB to use the Tailpipe database with `tailpipe connect`:
 
 ```bash
 tailpipe connect
 ```
 
-A new DB file will be generated and returned:
+The path to a new SQL script will be returned:
 
 ```bash
 $ tailpipe connect
-/Users/jsmyth/.tailpipe/data/default/tailpipe_20250409151453.db
+/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918210704.sql
 ```
 
 If you've collected a lot of data and want to optimize your queries for a subset of it, you can pre-filter the database. You can restrict to the most recent 45 days:
diff --git a/docs/query/index.md b/docs/query/index.md
index 864f844..27bbef2 100644
--- a/docs/query/index.md
+++ b/docs/query/index.md
@@ -2,9 +2,16 @@
 title: Query Tailpipe
 ---
 
-# Powered by DuckDB!
+# Powered by DuckDB + DuckLake!
 
-Tailpipe [collects](/docs/collect/collect) logs into a [DuckDB](https://duckdb.org/) database that uses [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html) to query. It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-dections/queries) in every published mod!
+Tailpipe [collects](/docs/collect/collect) logs into open Parquet files and catalogs them with [DuckLake](https://ducklake.select/), so you query everything with [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html). This brings a simple "lakehouse" model: open data files, a lightweight metadata catalog, and fast local analytics.
+
+- Open formats: data is stored as Parquet on disk.
+- Cataloged: DuckLake tracks tables, columns, and partitions for efficient queries.
+- Fast by design: partition pruning and vectorized execution via DuckDB.
+- SQL-first: use familiar DuckDB syntax, functions, and tooling.
+
+It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-detections/queries) in every published mod!
 
 ## Interactive Query Shell
 
diff --git a/docs/query/snapshots.md b/docs/query/snapshots.md
index b553d24..5b65239 100644
--- a/docs/query/snapshots.md
+++ b/docs/query/snapshots.md
@@ -16,7 +16,7 @@ To upload snapshots to Turbot Pipes, you must either [log in via the `powerpipe
 To take a snapshot and save it to [Turbot Pipes](https://turbot.com/pipes/docs), simply add the `--snapshot` flag to your command.
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --snapshot
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --snapshot
 ```
 
 ```bash
@@ -34,13 +34,13 @@ powerpipe benchmark run cloudtrail_log_detections --share
 ```
 
 You can set a snapshot title in Turbot Pipes with the `--snapshot-title` argument.
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
 ```
 
 If you wish to save the snapshot to a different workspace, such as an org workspace, you can use the `--snapshot-location` argument with `--share` or `--snapshot`:
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-location my-org/my-workspace
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-location my-org/my-workspace
 ```
diff --git a/docs/reference/cli/connect.md b/docs/reference/cli/connect.md
index b0d241c..f89a792 100644
--- a/docs/reference/cli/connect.md
+++ b/docs/reference/cli/connect.md
@@ -4,7 +4,12 @@
 # tailpipe connect
 
-Return a connection string for a database with a schema determined by the provided parameters.
+Return the path of a SQL script that initializes DuckDB to use the Tailpipe database.
+
+The generated SQL script contains:
+- DuckDB extension installations (sqlite, ducklake)
+- Database attachment configuration
+- View definitions with optional filters
 
 ## Usage
 
 ```bash
@@ -32,15 +37,15 @@ tailpipe connect --from 2025-01-01
 ```
 
 ```bash
-/home/jon/.tailpipe/data/default/tailpipe_20250115140447.db
+/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
 ```
 
 > [!NOTE]
-> You can use this connection string with DuckDB to directly query the Tailpipe database.
-To ensure compatibility with tables that include JSON columns, make sure you’re using DuckDB version 1.1.3 or later.
+> You can use this SQL script with DuckDB to directly query the Tailpipe database.
+To ensure compatibility with DuckLake features, make sure you’re using DuckDB version 1.4.0 or later.
 >
 > ```bash
-> duckdb /home/jon/.tailpipe/data/default/tailpipe_20241212134120.db
+> duckdb -init /Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
 > ```
 
 Connect with no filter, show output as json:
@@ -50,6 +55,6 @@ tailpipe connect --output json
 ```
 
 ```bash
-{"database_filepath":"/Users/jonudell/.tailpipe/data/default/tailpipe_20250129204416.db"}
+{"init_script_path":"/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204828.sql"}
 ```
diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md
index 9dfd48b..bb95781 100644
--- a/docs/reference/glossary.md
+++ b/docs/reference/glossary.md
@@ -30,7 +30,7 @@ A detection is a Tailpipe query, optionally bundled into a benchmark, that runs
 
 ## DuckDB
 
-Tailpipe uses DuckDB, an embeddable column-oriented database. DuckDB reads the Parquet files created by `tailpipe collect` and enables queries against that data.
+Tailpipe uses DuckDB for fast local analytics over Parquet data. DuckLake maintains a lightweight metadata catalog (`metadata.sqlite`) that references the Parquet files collected by Tailpipe, so you query with standard DuckDB SQL while benefiting from partition pruning and a lakehouse-style layout.
 
 ## Format
 A [format](/docs/reference/config-files/format) describe the layout of the source data so that it can be collected into a table.
@@ -40,7 +40,7 @@ A [format type](/docs/reference/config-files/format#format-types) defines the pa
 
 ## Hive
 
-A tree of Parquet files in the Tailpipe workspace (by default,`~/.tailpipe/data/default`). The `tailpipe.db` in `~/.tailpipe/data/default` (and derivatives created by `tailpipe connect`, e.g. `tailpipe_20241212152506.db`) are thin wrappers that materialize views over the Parquet data.
+A tree of Parquet files in the Tailpipe workspace (by default, `~/.tailpipe/data/default`), organized with hive-style partition keys (for example, `tp_table=.../tp_partition=.../tp_index=.../year=YYYY/month=mm`). DuckLake’s catalog (`metadata.sqlite`) points to these files to enable efficient SQL queries.
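+
+For example, a query that filters on the `tp_partition` key and the timestamp lets DuckLake read only the files in the matching hive directories. A minimal sketch, assuming a collected `aws_cloudtrail_log` table:
+
+```sql
+-- Only files under tp_partition=prod, in year/month directories
+-- covering 2025-01-01 onward, need to be scanned
+select count(*)
+from aws_cloudtrail_log
+where tp_partition = 'prod'
+  and tp_timestamp >= date '2025-01-01';
+```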
 
 ## Index
 
diff --git a/docs/sql/index.md b/docs/sql/index.md
index d64e2df..5c3e365 100644
--- a/docs/sql/index.md
+++ b/docs/sql/index.md
@@ -27,7 +27,7 @@ You can **filter** rows where columns only have a specific value:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
@@ -41,7 +41,7 @@ or a **range** of values:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
@@ -55,7 +55,7 @@ or match a **pattern**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -70,7 +70,7 @@ You can **filter on multiple columns**, joined by `and` or `or`:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -78,7 +78,7 @@ from
   aws_cloudtrail_log
 where
   event_name = 'UpdateTrail'
-  and tp_date > date '2024-11-06';
+  and tp_timestamp > date '2024-11-06';
 ```
 
 You can **sort** your results:
@@ -86,7 +86,7 @@ You can **sort** your results:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -101,7 +101,7 @@ You can **sort on multiple columns, ascending or descending**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -109,7 +109,7 @@ from
   aws_cloudtrail_log
 order by
   aws_region asc,
-  tp_date desc;
+  tp_timestamp desc;
 ```
 
 You can group and use standard aggregate functions. You can **count** results:
@@ -147,7 +147,7 @@ or exclude **all but one matching row**:
 ```sql
 select distinct on (event_type)
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
diff --git a/docs/sql/querying-ips.md b/docs/sql/querying-ips.md
index 0350088..d22e486 100644
--- a/docs/sql/querying-ips.md
+++ b/docs/sql/querying-ips.md
@@ -12,7 +12,7 @@ You can find requests **from a specific IP address**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
diff --git a/docs/sql/tips.md b/docs/sql/tips.md
index 55a44b9..614f0e8 100644
--- a/docs/sql/tips.md
+++ b/docs/sql/tips.md
@@ -20,10 +20,10 @@ select count(*) from aws_cloudtrail_log where partition = 'prod'
 select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789
 ```
 
-*Date*. Each file contains log data for one day. You can filter to include only files for that day.
+*Timestamp*. Filter by timestamp to efficiently read only the files that contain matching data.
 
 ```sql
-select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_date = '2024-12-01'
+select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_timestamp > date '2024-12-01'
 ```
 
 The [hive directory structure](/docs/collect/configure#hive-partitioning) enables you to exclude large numbers of Parquet files.
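+
+To check that a filter is actually pruning files rather than scanning everything, you can prepend DuckDB's `explain` to a query and inspect the plan. A quick sanity check (the exact plan output depends on your DuckDB version and data):
+
+```sql
+-- The file and row counts in the plan shrink as partition-key filters are added
+explain select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_timestamp > date '2024-12-01'
+```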