1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -14,6 +14,7 @@ set(EXTENSION_SOURCES
src/parse_tables.cpp
src/parse_where.cpp
src/parse_functions.cpp
src/parse_columns.cpp
)

build_static_extension(${TARGET_NAME} ${EXTENSION_SOURCES})
80 changes: 78 additions & 2 deletions README.md
@@ -10,15 +10,19 @@ An experimental DuckDB extension that exposes functionality from DuckDB's native

- **Extract table references** from a SQL query with context information (e.g. `FROM`, `JOIN`, etc.)
- **Extract function calls** from a SQL query with context information (e.g. `SELECT`, `WHERE`, `HAVING`, etc.)
- **Extract column references** from a SQL query with comprehensive dependency tracking
- **Parse WHERE clauses** to extract conditions and operators
- Support for **window functions**, **nested functions**, and **CTEs**
- **Alias chain tracking** for complex column dependencies
- **Nested struct field access** parsing (e.g., `table.column.field.subfield`)
- **Input vs output column distinction** for complete dependency analysis
- Includes **schema**, **name**, and **context** information for all extractions
- Built on DuckDB's native SQL parser
- Simple SQL interface — no external tooling required


## Known Limitations
- Only `SELECT` statements are supported for table and function parsing
- Only `SELECT` statements are supported for table, function, and column parsing
- WHERE clause parsing supports additional statement types
- Full parse tree is not exposed (only specific structural elements)

@@ -92,9 +96,17 @@ Context helps identify where elements are used in the query.
- `group_by`: function in a `GROUP BY` clause
- `nested`: function call nested within another function

### Column Context
- `select`: column in a `SELECT` clause
- `where`: column in a `WHERE` clause
- `having`: column in a `HAVING` clause
- `order_by`: column in an `ORDER BY` clause
- `group_by`: column in a `GROUP BY` clause
- `function_arg`: column used as a function argument

## Functions

This extension provides parsing functions for tables, functions, and WHERE clauses. Each category includes both table functions (for detailed results) and scalar functions (for programmatic use).
This extension provides parsing functions for tables, functions, columns, and WHERE clauses. Each category includes both table functions (for detailed results) and scalar functions (for programmatic use).

In general, errors (e.g. Parse Exception) will not be exposed to the user, but instead will result in an empty result. This simplifies batch processing. When validity is needed, [is_parsable](#is_parsablesql_query--scalar-function) can be used.

@@ -190,6 +202,70 @@ SELECT list_filter(parse_functions('SELECT upper(name) FROM users WHERE lower(em

---

### Column Parsing Functions

These functions extract column references from SQL queries, providing comprehensive dependency tracking including alias chains, nested struct field access, and input/output column distinction.

#### `parse_columns(sql_query)` – Table Function

Parses a SQL `SELECT` query and returns all column references along with their context, schema qualification, and dependency information.

##### Usage
```sql
SELECT * FROM parse_columns('SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id;');
```

##### Returns
A table with:
- `expression_identifiers`: JSON array of identifier paths (e.g., `[["u","name"]]` or `[["schema","table","column","field"]]`)
- `table_schema`: schema name for table columns (NULL for aliases/expressions)
- `table_name`: table name for table columns (NULL for aliases/expressions)
- `column_name`: column name for simple references (NULL for complex expressions)
- `context`: where the column appears in the query (select, where, function_arg, etc.)
- `expression`: full expression text as it appears in the SQL
- `selected_name`: output column name for SELECT items (NULL for input columns)

##### Basic Example
```sql
SELECT * FROM parse_columns('SELECT name, age FROM users;');
```

| expression_identifiers | table_schema | table_name | column_name | context | expression | selected_name |
|------------------------|--------------|------------|-------------|---------|------------|---------------|
| [["name"]] | NULL | NULL | name | select | name | NULL |
| [["age"]] | NULL | NULL | age | select | age | NULL |

##### Alias Chain Example
```sql
SELECT * FROM parse_columns('SELECT 1 AS a, users.age AS b, a+b AS c FROM users;');
```

| expression_identifiers | table_schema | table_name | column_name | context | expression | selected_name |
|------------------------|--------------|------------|-------------|--------------|------------|---------------|
| [["users","age"]] | main | users | age | select | users.age | NULL |
| [["users","age"]] | NULL | NULL | NULL | select | users.age | b |
| [["a"]] | NULL | NULL | a | function_arg | a | NULL |
| [["b"]] | NULL | NULL | b | function_arg | b | NULL |
| [["a"],["b"]] | NULL | NULL | NULL | select | (a + b) | c |
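
The rows above can be followed by hand to find what an output column ultimately depends on: `c` refers to `a` and `b`, and `b` in turn refers to `users.age`. A minimal C++ sketch of that walk follows. It is not the extension's code; the names `Path`, `AliasMap`, `CollectLeaves`, and `ResolveOutput` are hypothetical, and the sketch assumes the caller has already built a map from each `selected_name` to its `expression_identifiers`.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
using Path = std::vector<std::string>;                      // e.g. {"users", "age"}
using AliasMap = std::map<std::string, std::vector<Path>>;  // selected_name -> identifier paths

// Walk the identifier paths of `name`, following single-part identifiers
// that are themselves output aliases, and collect everything else as a leaf.
void CollectLeaves(const AliasMap &aliases, const std::string &name,
                   std::set<std::string> &visiting, std::set<Path> &leaves) {
    auto it = aliases.find(name);
    if (it == aliases.end()) {
        return; // not an output alias, nothing to expand
    }
    visiting.insert(name);
    for (const Path &path : it->second) {
        // Follow an alias only if it is not already on the current chain,
        // so that something like `a AS a` cannot recurse forever.
        bool follows_alias = path.size() == 1 &&
                             aliases.count(path[0]) > 0 &&
                             visiting.count(path[0]) == 0;
        if (follows_alias) {
            CollectLeaves(aliases, path[0], visiting, leaves);
        } else {
            leaves.insert(path);
        }
    }
    visiting.erase(name);
}

// Return the identifier paths that `name` depends on once aliases are expanded.
std::set<Path> ResolveOutput(const AliasMap &aliases, const std::string &name) {
    std::set<std::string> visiting;
    std::set<Path> leaves;
    CollectLeaves(aliases, name, visiting, leaves);
    return leaves;
}
```

With the two aliased rows from the table (`b` and `c`) as input, resolving `c` yields the paths `["a"]` and `["users","age"]`. The identifier `a` stays unexpanded because the table contains no row with `selected_name = a`: its expression `1` references no columns.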

##### Nested Struct Example
```sql
SELECT * FROM parse_columns('SELECT users.profile.address.city FROM users;');
```

| expression_identifiers | table_schema | table_name | column_name | context | expression | selected_name |
|------------------------------------------------|--------------|------------|-------------|---------|------------------------------|---------------|
| [["users","profile","address","city"]] | users | profile | address | select | users.profile.address.city | NULL |
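
The example rows in this section are consistent with a purely positional reading of dotted paths: one part is a column, two parts are `table.column` (reported with schema `main`), and three or more parts are `schema.table.column` followed by struct fields. The C++ sketch below encodes that reading. It is an assumption inferred from the tables above, not the extension's actual logic, and `QualifiedColumn` and `SplitIdentifierPath` are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result type for illustration; empty strings stand in for NULL.
struct QualifiedColumn {
    std::string table_schema;
    std::string table_name;
    std::string column_name;
    std::vector<std::string> field_path; // trailing parts, read as struct fields
};

// Assumed positional rule, inferred from the README example tables:
//   1 part   -> column
//   2 parts  -> table.column, schema set to default_schema
//   3+ parts -> schema.table.column, remaining parts are struct fields
QualifiedColumn SplitIdentifierPath(const std::vector<std::string> &parts,
                                    const std::string &default_schema = "main") {
    QualifiedColumn out;
    if (parts.empty()) {
        return out; // nothing to classify
    }
    if (parts.size() == 1) {
        out.column_name = parts[0];
    } else if (parts.size() == 2) {
        out.table_schema = default_schema;
        out.table_name = parts[0];
        out.column_name = parts[1];
    } else {
        out.table_schema = parts[0];
        out.table_name = parts[1];
        out.column_name = parts[2];
        out.field_path.assign(parts.begin() + 3, parts.end());
    }
    return out;
}
```

Under this reading, `users.profile.address.city` comes out with schema `users`, table `profile`, and column `address`, matching the row above. A parser without catalog access cannot tell a struct field from a schema-qualified column, so consumers that know `users` is a table may prefer to reinterpret `expression_identifiers` themselves.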

##### Complex Multi-table Example
```sql
SELECT * FROM parse_columns('SELECT u.name, o.total, u.age + o.total AS score FROM users u JOIN orders o ON u.id = o.user_id WHERE u.status = ''active'';');
```

This query returns column references from both tables across several contexts, including `select`, `function_arg`, and `where`, as well as the columns used in the join condition.

---

### Table Parsing Functions

#### `parse_tables(sql_query)` – Table Function
74 changes: 74 additions & 0 deletions column_parser_examples.sql
@@ -0,0 +1,74 @@
-- Column Parser Examples - Demonstrating Key Features
LOAD parser_tools;

SELECT '=== Example 1: Basic Column References ===' as example;
SELECT * FROM parse_columns('SELECT name, age, email FROM customers') LIMIT 3;

SELECT '=== Example 2: Alias Chain (Key Innovation) ===' as example;
SELECT * FROM parse_columns('SELECT 1 AS a, users.age AS b, a+b AS c, b AS d FROM users');

SELECT '=== Example 3: Schema-Qualified Columns ===' as example;
SELECT * FROM parse_columns('SELECT main.customers.name, main.customers.email FROM main.customers') LIMIT 2;

SELECT '=== Example 4: Nested Struct Field Access ===' as example;
SELECT expression_identifiers, expression, table_schema, table_name, column_name
FROM parse_columns('SELECT customers.profile.address.city, customers.profile.address.street FROM customers');

SELECT '=== Example 5: Multi-table JOIN with Complex Expressions ===' as example;
SELECT column_name, context, expression, selected_name
FROM parse_columns('
SELECT
c.name AS customer_name,
o.total AS order_amount,
c.age + o.total AS customer_score
FROM customers c
JOIN orders o ON c.id = o.customer_id
')
WHERE column_name IS NOT NULL OR selected_name IS NOT NULL;

SELECT '=== Example 6: Input vs Output Column Distinction ===' as example;
SELECT
CASE WHEN selected_name IS NULL THEN 'INPUT' ELSE 'OUTPUT' END as column_type,
COALESCE(selected_name, column_name) as identifier,
expression,
context
FROM parse_columns('
SELECT
customers.name AS customer_name,
orders.total * 1.1 AS total_with_tax,
customers.age
FROM customers
JOIN orders ON customers.id = orders.customer_id
')
ORDER BY column_type, identifier;

SELECT '=== Example 7: Different SQL Contexts ===' as example;
SELECT context, COUNT(*) as count
FROM parse_columns('
SELECT
c.name,
COUNT(*) as order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE c.age > 25 AND c.status = ''active''
GROUP BY c.id, c.name
HAVING COUNT(*) > 2
ORDER BY c.name
')
GROUP BY context
ORDER BY context;

SELECT '=== Example 8: Function Arguments vs Select Items ===' as example;
SELECT
context,
column_name,
expression,
COALESCE(selected_name, 'N/A') as output_name
FROM parse_columns('
SELECT
UPPER(c.name) AS customer_name,
CONCAT(c.first_name, '' '', c.last_name) AS full_name,
LENGTH(c.email) AS email_length
FROM customers c
')
ORDER BY context, column_name;
25 changes: 25 additions & 0 deletions src/include/parse_columns.hpp
@@ -0,0 +1,25 @@
#pragma once

#include "duckdb.hpp"
#include <string>
#include <vector>

namespace duckdb {

// Forward declarations
class DatabaseInstance;

struct ColumnResult {
vector<vector<string>> expression_identifiers; // All identifiers in expression
string table_schema; // NULL for aliases, schema name for table columns
string table_name; // NULL for aliases, table name for table columns
string column_name; // Column name (for single column refs), NULL for complex expressions
string context; // Context where column appears (select, where, function_arg, etc.)
string expression; // Full expression text
string selected_name; // NULL for input columns, output column name for SELECT items
};

void RegisterParseColumnsFunction(DatabaseInstance &db);
void RegisterParseColumnScalarFunction(DatabaseInstance &db);

} // namespace duckdb
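
The README describes `expression_identifiers` as a JSON array of identifier paths, for example `[["a"],["b"]]`, while `ColumnResult` stores it as a nested vector. One way the nested vector could be rendered into that text is sketched below. This is an illustration under stated assumptions, not the code in `parse_columns.cpp`: it uses `std::vector` and `std::string` in place of DuckDB's own aliases, and `IdentifierPaths`, `QuoteJson`, and `IdentifiersToJson` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alias mirroring the shape of ColumnResult::expression_identifiers.
using IdentifierPaths = std::vector<std::vector<std::string>>;

// Quote one identifier as a JSON string, escaping '"' and '\'.
// Control characters are not handled in this sketch.
std::string QuoteJson(const std::string &s) {
    std::string out = "\"";
    for (char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Render the paths as a compact JSON array of arrays, e.g. [["u","name"]].
std::string IdentifiersToJson(const IdentifierPaths &paths) {
    std::string out = "[";
    for (std::size_t i = 0; i < paths.size(); i++) {
        if (i > 0) {
            out += ",";
        }
        out += "[";
        for (std::size_t j = 0; j < paths[i].size(); j++) {
            if (j > 0) {
                out += ",";
            }
            out += QuoteJson(paths[i][j]);
        }
        out += "]";
    }
    out += "]";
    return out;
}
```

Escaping matters because quoted SQL identifiers may contain double quotes, which would otherwise produce invalid JSON.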