36 changes: 19 additions & 17 deletions docs/collect/configure.md
@@ -15,11 +15,11 @@ Tailpipe [plugins](/docs/collect/plugins) define tables for common log sources a

If your logs are not in a standard format or are not currently supported by a plugin, you can create [custom tables](/docs/collect/custom-tables) to collect data from arbitrary log files and other sources.

Tables are implemented as DuckDB views over the Parquet files. Tailpipe creates tables (that is, creates views in the `tailpipe.db` database) based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
Tailpipe creates DuckLake tables based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.

When you run `tailpipe query` or `tailpipe connect`, Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and adds a view for the table. The view definitions will include qualifiers that implement any filter arguments that you specify (`--from`,`--to`,`--index`,`--partition`).
When you run `tailpipe query` or `tailpipe connect` with any filter arguments (`--from`, `--to`, `--index`, `--partition`), Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and filters each table's view accordingly.

You can see what tables are available with the `tailpipe plugin list` command.
You can see what tables are available with the `tailpipe table list` command.
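
For example, a typical sequence might look like this (the partition name and date are illustrative):

```bash
# see which tables have been collected into the current workspace
tailpipe table list

# open the interactive shell, filtered to one partition and a start date
tailpipe query --partition aws_cloudtrail_log.prod --from 2025-01-01
```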

## Partitions
A partition represents data gathered from a [source](/docs/collect/configure#sources). Partitions are defined [in HCL](/docs/reference/config-files/partition) and are required for [collection](/docs/collect/collect).
@@ -61,20 +61,22 @@ The standard partitioning/hive structure enables efficient queries that only nee
tp_table=aws_cloudtrail_log
└── tp_partition=prod
└── tp_index=default
├── tp_date=2024-12-31
│   └── data_20250106140713_740378_0.parquet
├── tp_date=2025-01-01
│   └── data_20250106140713_740378_0.parquet
├── tp_date=2025-01-02
│   └── snap_20250106140823_952067.parquet
├── tp_date=2025-01-03
│   └── snap_20250106140824_011599.parquet
├── tp_date=2025-01-04
│   └── data_20250106140752_829722_0.parquet
├── tp_date=2025-01-05
│   └── snap_20250106140824_073116.parquet
└── tp_date=2025-01-06
└── snap_20250106140824_131637.parquet
└── year=2024
    ├── month=7
    │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
    │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
    │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
    │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
    ├── month=8
    │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
    │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
    │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
    │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
    └── month=9
        ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
        ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
        ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
        └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
```
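
Because the partition keys are encoded in the directory paths, a query that filters on them only needs to read the matching files. For example (the values here are illustrative):

```sql
select count(*)
from aws_cloudtrail_log
where tp_partition = 'prod'
  and tp_timestamp >= date '2024-07-01';
```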


6 changes: 3 additions & 3 deletions docs/collect/manage-data.md
@@ -292,16 +292,16 @@ Plugin: hub.tailpipe.io/plugins/turbot/aws@latest


## Connecting from Other Tools
You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, you can generate a new db file for the connection using `tailpipe connect`:
You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, use `tailpipe connect` to generate a SQL script that initializes DuckDB to use the Tailpipe database:

```bash
tailpipe connect
```

A new DB file will be generated and returned:
The path to the generated SQL script is returned:
```bash
$ tailpipe connect
/Users/jsmyth/.tailpipe/data/default/tailpipe_20250409151453.db
/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918210704.sql
```
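
You can then launch DuckDB with that script (using the path returned above):

```bash
duckdb -init /Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918210704.sql
```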

If you've collected a lot of data and want to optimize your queries for a subset of it, you can pre-filter the database. You can restrict to the most recent 45 days:
11 changes: 9 additions & 2 deletions docs/query/index.md
@@ -2,9 +2,16 @@
title: Query Tailpipe
---

# Powered by DuckDB!
# Powered by DuckDB + DuckLake!

Tailpipe [collects](/docs/collect/collect) logs into a [DuckDB](https://duckdb.org/) database that uses [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html) to query. It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-dections/queries) in every published mod!
Tailpipe [collects](/docs/collect/collect) logs into open Parquet files and catalogs them with [DuckLake](https://ducklake.select/), so you can query everything with [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html). This brings a simple "lakehouse" model: open data files, a lightweight metadata catalog, and fast local analytics.

- Open formats: data is stored as Parquet on disk.
- Cataloged: DuckLake tracks tables/columns/partitions for efficient queries.
- Fast by design: partition pruning and vectorized execution via DuckDB.
- SQL-first: use familiar DuckDB syntax, functions, and tooling.
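
For instance, a first query against the AWS plugin's CloudTrail table might look like this (table and column names as used elsewhere in these docs):

```sql
-- ten most recent CloudTrail events, newest first
select
  tp_timestamp,
  event_name,
  aws_region
from
  aws_cloudtrail_log
order by
  tp_timestamp desc
limit 10;
```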

It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-dections/queries) in every published mod!


## Interactive Query Shell
6 changes: 3 additions & 3 deletions docs/query/snapshots.md
@@ -16,7 +16,7 @@ To upload snapshots to Turbot Pipes, you must either [log in via the `powerpipe
To take a snapshot and save it to [Turbot Pipes](https://turbot.com/pipes/docs), simply add the `--snapshot` flag to your command.

```bash
powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --snapshot
powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --snapshot
```

```bash
@@ -34,13 +34,13 @@ powerpipe benchmark run cloudtrail_log_detections --share
You can set a snapshot title in Turbot Pipes with the `--snapshot-title` argument.

```bash
powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
```

If you wish to save the snapshot to a different workspace, such as an org workspace, you can use the `--snapshot-location` argument with `--share` or `--snapshot`:

```bash
powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-location my-org/my-workspace
powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-location my-org/my-workspace

```

17 changes: 11 additions & 6 deletions docs/reference/cli/connect.md
@@ -4,7 +4,12 @@ title: tailpipe connect

# tailpipe connect

Return a connection string for a database with a schema determined by the provided parameters.
Return the path of a SQL script that initializes DuckDB to use the Tailpipe database.

The generated SQL script contains:
- DuckDB extension installations (sqlite, ducklake)
- Database attachment configuration
- View definitions with optional filters
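
For illustration, a generated script might look roughly like the following; this is only a sketch, and the exact statements, catalog filename, and attach options that Tailpipe emits may differ:

```sql
-- install and load the extensions needed to read the DuckLake catalog
INSTALL sqlite;
INSTALL ducklake;
LOAD sqlite;
LOAD ducklake;

-- attach the catalog (metadata.sqlite) that references the workspace's Parquet files
ATTACH 'ducklake:sqlite:/Users/pskrbasu/.tailpipe/data/default/metadata.sqlite' AS tailpipe;
USE tailpipe;

-- per-table views created here would apply any --from/--to/--index/--partition filters
```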

## Usage
```bash
@@ -32,15 +37,15 @@ tailpipe connect --from 2025-01-01
```

```bash
/home/jon/.tailpipe/data/default/tailpipe_20250115140447.db
/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
```

> [!NOTE]
> You can use this connection string with DuckDB to directly query the Tailpipe database.
To ensure compatibility with tables that include JSON columns, make sure you’re using DuckDB version 1.1.3 or later.
> You can use this SQL script with DuckDB to directly query the Tailpipe database.
> To ensure compatibility with DuckLake features, make sure you’re using DuckDB version 1.4.0 or later.
>
> ```bash
> duckdb /home/jon/.tailpipe/data/default/tailpipe_20241212134120.db
> duckdb -init /Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
> ```

Connect with no filter, show output as json:
@@ -50,6 +55,6 @@ tailpipe connect --output json
```

```bash
{"database_filepath":"/Users/jonudell/.tailpipe/data/default/tailpipe_20250129204416.db"}
{"init_script_path":"/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204828.sql"}
```

4 changes: 2 additions & 2 deletions docs/reference/glossary.md
@@ -30,7 +30,7 @@ A detection is a Tailpipe query, optionally bundled into a benchmark, that runs

## DuckDB

Tailpipe uses DuckDB, an embeddable column-oriented database. DuckDB reads the Parquet files created by `tailpipe collect` and enables queries against that data.
Tailpipe uses DuckDB for fast local analytics over Parquet data. DuckLake maintains a lightweight metadata catalog (`metadata.sqlite`) that references the Parquet files collected by Tailpipe, so you query with standard DuckDB SQL while benefiting from partition pruning and a lakehouse-style layout.

## Format
A [format](/docs/reference/config-files/format) describes the layout of the source data so that it can be collected into a table.
@@ -40,7 +40,7 @@ A [format type](/docs/reference/config-files/format#format-types) defines the pa

## Hive

A tree of Parquet files in the Tailpipe workspace (by default,`~/.tailpipe/data/default`). The `tailpipe.db` in `~/.tailpipe/data/default` (and derivatives created by `tailpipe connect`, e.g. `tailpipe_20241212152506.db`) are thin wrappers that materialize views over the Parquet data.
A tree of Parquet files in the Tailpipe workspace (by default, `~/.tailpipe/data/default`), organized with hive-style partition keys (for example, `tp_table=.../tp_partition=.../tp_index=.../year=YYYY/month=mm`). DuckLake’s catalog (`metadata.sqlite`) points to these files to enable efficient SQL queries.

## Index

18 changes: 9 additions & 9 deletions docs/sql/index.md
@@ -27,7 +27,7 @@ You can **filter** rows where columns only have a specific value:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type
from
@@ -41,7 +41,7 @@ or a **range** of values:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type
from
@@ -55,7 +55,7 @@ or match a **pattern**:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type,
event_name
@@ -70,23 +70,23 @@ You can **filter on multiple columns**, joined by `and` or `or`:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type,
event_name
from
aws_cloudtrail_log
where
event_name = 'UpdateTrail'
and tp_date > date '2024-11-06';
and tp_timestamp > date '2024-11-06';
```

You can **sort** your results:

```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type,
event_name
@@ -101,15 +101,15 @@ You can **sort on multiple columns, ascending or descending**:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type,
event_name
from
aws_cloudtrail_log
order by
aws_region asc,
tp_date desc;
tp_timestamp desc;
```

You can group and use standard aggregate functions. You can **count** results:
@@ -147,7 +147,7 @@ or exclude **all but one matching row**:
```sql
select distinct on (event_type)
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type,
event_name
2 changes: 1 addition & 1 deletion docs/sql/querying-ips.md
@@ -12,7 +12,7 @@ You can find requests **from a specific IP address**:
```sql
select
tp_partition,
tp_date,
tp_timestamp,
aws_region,
event_type
from
4 changes: 2 additions & 2 deletions docs/sql/tips.md
@@ -20,10 +20,10 @@ select count(*) from aws_cloudtrail_log where partition = 'prod'
select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789
```

*Date*. Each file contains log data for one day. You can filter to include only files for that day.
*Timestamp*. Filter by timestamp to efficiently read only the matching files.

```sql
select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_date = '2024-12-01'
select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_timestamp > date '2024-12-01'
```

The [hive directory structure](/docs/collect/configure#hive-partitioning) enables you to exclude large numbers of Parquet files.