diff --git a/docs/cn/guides/54-query/00-sql-analytics.md b/docs/cn/guides/54-query/00-sql-analytics.md
index 995cada4b1..5b821c8c99 100644
--- a/docs/cn/guides/54-query/00-sql-analytics.md
+++ b/docs/cn/guides/54-query/00-sql-analytics.md
@@ -2,12 +2,12 @@
 title: SQL 分析
 ---
 
-> **场景:** CityDrive 会把所有行车视频写入共享的关系表,分析师因此可以在同一批 `video_id` / `frame_id` 上做过滤、连接与聚合,供后续的 JSON、向量、地理和 ETL 负载共用。
+> **场景:** CityDrive 将所有行车记录暂存到统一的关系型表中。这些关系型数据(如视频元信息、事件标签)均由后台处理流程从原始行车视频的关键帧中提取而来。这样分析师就可以针对同一批 `video_id` / `frame_id` 数据进行过滤、关联和聚合,并供所有下游业务复用。
 
-本演练建模了 CityDrive 编目中的关系层,并串起常见的 SQL 积木。这里出现的示例 ID 会在其余指南中再次用到。
+本指南将对该目录的关系型数据部分进行建模,并重点介绍实用的 SQL 构建模块。这里用到的示例 ID 也会在后续的 JSON、向量、地理空间和 ETL 指南中反复出现。
 
 ## 1. 创建基础表
-`citydrive_videos` 保存视频级元数据,而 `frame_events` 记录每段视频里抽出的关键帧。
+`citydrive_videos` 用于存储视频片段的元数据,而 `frame_events` 则记录从每个片段中提取出的关键帧(Keyframes)。
 
 ```sql
 CREATE OR REPLACE TABLE citydrive_videos (
@@ -41,21 +41,39 @@ INSERT INTO frame_events VALUES
     ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8),
     ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1),
     ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6),
-    ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9);
+    ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9),
+    -- 故意保留一个孤立事件,用于演示 NOT EXISTS
+    ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0);
+
+-- 下面的 JOIN 模式需要此表;表结构与“JSON 与搜索”指南中的一致。
+CREATE OR REPLACE TABLE frame_metadata_catalog (
+    doc_id STRING,
+    meta_json VARIANT,
+    captured_at TIMESTAMP,
+    INVERTED INDEX idx_meta_json (meta_json)
+);
+
+INSERT INTO frame_metadata_catalog VALUES
+    ('FRAME-0101', 
PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` 文档:[CREATE TABLE](/sql/sql-commands/ddl/table/ddl-create-table)、[INSERT](/sql/sql-commands/dml/dml-insert)。 --- -## 2. 只看最新车次 -把调查范围控制在最近 3 天的导航路线。 +## 2. 
过滤工作集 +将查询范围限定在种子数据中 1 月 1 日至 3 日的快照上,以确保演示查询始终能返回结果。 ```sql WITH recent_videos AS ( SELECT * FROM citydrive_videos - WHERE capture_date >= DATEADD('day', -3, TODAY()) + WHERE capture_date >= '2025-01-01' + AND capture_date < '2025-01-04' ) SELECT v.video_id, v.route_name, @@ -69,10 +87,20 @@ ORDER BY flagged_frames DESC; 文档:[DATEADD](/sql/sql-functions/datetime-functions/date-add)、[GROUP BY](/sql/sql-commands/query-syntax/query-select#group-by-clause)。 +示例输出: + +``` +video_id | route_name | weather | flagged_frames +VID-20250101-001| Downtown Loop | Rain | 2 +VID-20250101-002| Port Perimeter | Overcast | 1 +VID-20250102-001| Airport Connector | Clear | 1 +VID-20250103-001| CBD Night Sweep | LightFog | 1 +``` + --- -## 3. 常见 JOIN 模式 -### INNER JOIN:取帧上下文 +## 3. 连接模式 (JOIN Patterns) +### INNER JOIN:获取帧上下文 ```sql SELECT f.frame_id, f.event_tag, @@ -84,7 +112,18 @@ JOIN citydrive_videos AS v USING (video_id) ORDER BY f.collected_at; ``` -### NOT EXISTS:做 QA +示例输出: + +``` +frame_id | event_tag | risk_score | route_name | camera_source +FRAME-0101| hard_brake | 0.81 | Downtown Loop | roof_cam +FRAME-0102| pedestrian | 0.67 | Downtown Loop | roof_cam +FRAME-0201| lane_merge | 0.74 | Port Perimeter | front_cam +FRAME-0301| hard_brake | 0.59 | Airport Connector | front_cam +FRAME-0401| night_lowlight | 0.63 | CBD Night Sweep | rear_cam +``` + +### 反连接 (Anti Join):质量检查 (QA) ```sql SELECT frame_id FROM frame_events f @@ -95,7 +134,14 @@ WHERE NOT EXISTS ( ); ``` -### LATERAL FLATTEN:展开 JSON 检测 +示例输出: + +``` +frame_id +FRAME-0501 +``` + +### LATERAL FLATTEN:展开嵌套检测结果 ```sql SELECT f.frame_id, obj.value['type']::STRING AS detected_type, @@ -107,12 +153,20 @@ WHERE f.event_tag = 'pedestrian' ORDER BY confidence DESC; ``` +示例输出: + +``` +frame_id | detected_type | confidence +FRAME-0102| pedestrian | 0.92 +FRAME-0102| bike | 0.35 +``` + 文档:[JOIN](/sql/sql-commands/query-syntax/query-join)、[FLATTEN](/sql/sql-functions/table-functions/flatten)。 --- ## 4. 
车队 KPI 聚合 -### 分路线的行为统计 +### 按路线统计驾驶行为 ```sql SELECT v.route_name, f.event_tag, @@ -124,7 +178,18 @@ GROUP BY v.route_name, f.event_tag ORDER BY avg_risk DESC, occurrences DESC; ``` -### ROLLUP 总计 +示例输出: + +``` +route_name | event_tag | occurrences | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +Port Perimeter | lane_merge | 1 | 0.74 +Downtown Loop | pedestrian | 1 | 0.67 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Airport Connector | hard_brake | 1 | 0.59 +``` + +### ROLLUP:计算总计 ```sql SELECT v.route_name, f.event_tag, @@ -135,7 +200,20 @@ GROUP BY ROLLUP(v.route_name, f.event_tag) ORDER BY v.route_name NULLS LAST, f.event_tag; ``` -### CUBE:路线 × 天气 覆盖 +示例输出(前 6 行): + +``` +route_name | event_tag | occurrences +Airport Connector | hard_brake | 1 +Airport Connector | NULL | 1 +CBD Night Sweep | night_lowlight | 1 +CBD Night Sweep | NULL | 1 +Downtown Loop | hard_brake | 1 +Downtown Loop | pedestrian | 1 +... (total rows: 10) +``` + +### CUBE:路线 × 天气覆盖率 ```sql SELECT v.route_name, v.weather, @@ -145,10 +223,23 @@ GROUP BY CUBE(v.route_name, v.weather) ORDER BY v.route_name NULLS LAST, v.weather NULLS LAST; ``` +示例输出(前 6 行): + +``` +route_name | weather | videos +Airport Connector | Clear | 1 +Airport Connector | NULL | 1 +CBD Night Sweep | LightFog | 1 +CBD Night Sweep | NULL | 1 +Downtown Loop | Rain | 1 +Downtown Loop | NULL | 1 +... (total rows: 13) +``` + --- ## 5. 
窗口函数 -### 单次视频的风险累计 +### 单个视频的累积风险 ```sql WITH ordered_events AS ( SELECT video_id, collected_at, risk_score @@ -166,7 +257,19 @@ FROM ordered_events ORDER BY video_id, collected_at; ``` -### 帧级滑动平均 +示例输出(前 6 行): + +``` +video_id | collected_at | risk_score | cumulative_risk +VID-20250101-001| 2025-01-01 08:15:21 | 0.81 | 0.81 +VID-20250101-001| 2025-01-01 08:33:54 | 0.67 | 1.48 +VID-20250101-002| 2025-01-01 11:12:02 | 0.74 | 0.74 +VID-20250102-001| 2025-01-02 09:44:18 | 0.59 | 0.59 +VID-20250103-001| 2025-01-03 21:18:07 | 0.63 | 0.63 +VID-MISSING-001 | 2025-01-04 10:00:00 | 0.25 | 0.25 +``` + +### 最近帧的滑动平均值 ```sql SELECT video_id, frame_id, @@ -181,12 +284,24 @@ FROM frame_events ORDER BY video_id, frame_index; ``` +示例输出(前 6 行): + +``` +video_id | frame_id | frame_index | risk_score | rolling_avg_risk +VID-20250101-001| FRAME-0101 | 125 | 0.81 | 0.81 +VID-20250101-001| FRAME-0102 | 416 | 0.67 | 0.74 +VID-20250101-002| FRAME-0201 | 298 | 0.74 | 0.74 +VID-20250102-001| FRAME-0301 | 188 | 0.59 | 0.59 +VID-20250103-001| FRAME-0401 | 522 | 0.63 | 0.63 +VID-MISSING-001 | FRAME-0501 | 10 | 0.25 | 0.25 +``` + 窗口函数可以在 SQL 中直接表达滚动求和或滑动平均。完整列表见:[窗口函数](/sql/sql-functions/window-functions)。 --- -## 6. 聚合索引提速 -使用 [Aggregating Index](/guides/performance/aggregating-index) 缓存高频汇总,让仪表盘查询避开全表扫描。 +## 6. 
聚合索引加速 +持久化常用的仪表盘汇总数据。 ```sql CREATE OR REPLACE AGGREGATING INDEX idx_video_event_summary @@ -199,4 +314,74 @@ FROM frame_events GROUP BY video_id, event_tag; ``` -当你再次运行相同的汇总(如路线事件分布)时,`EXPLAIN` 会显示 `AggregatingIndex` 节点,说明查询已经命中上面的摘要副本。索引会在新的帧写入后自动刷新,无须额外 ETL 即可保持秒级体验。 +当分析师再次查询相同的 KPI 时,优化器会直接从索引中读取数据: + +```sql +SELECT v.route_name, + e.event_tag, + COUNT(*) AS event_count, + AVG(e.risk_score) AS avg_risk +FROM frame_events e +JOIN citydrive_videos v USING (video_id) +WHERE v.capture_date >= '2025-01-01' +GROUP BY v.route_name, e.event_tag +ORDER BY avg_risk DESC; +``` + +示例输出: + +``` +route_name | event_tag | event_count | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +Port Perimeter | lane_merge | 1 | 0.74 +Downtown Loop | pedestrian | 1 | 0.67 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Airport Connector | hard_brake | 1 | 0.59 +``` + +文档:[Aggregating Index](/guides/performance/aggregating-index) 和 [EXPLAIN](/sql/sql-commands/explain-cmds/explain)。 + +--- + +## 7. 存储过程自动化 +将逻辑封装起来,确保定时任务始终生成一致的报告。 + +```sql +CREATE OR REPLACE PROCEDURE citydrive_route_report(days_back UINT8) +RETURNS TABLE(route_name STRING, event_tag STRING, event_count BIGINT, avg_risk DOUBLE) +LANGUAGE SQL +AS +$$ +BEGIN + RETURN TABLE ( + SELECT v.route_name, + e.event_tag, + COUNT(*) AS event_count, + AVG(e.risk_score) AS avg_risk + FROM frame_events e + JOIN citydrive_videos v USING (video_id) + WHERE v.capture_date >= DATEADD('day', -:days_back, DATE '2025-01-04') + GROUP BY v.route_name, e.event_tag + ); +END; +$$; + +CALL PROCEDURE citydrive_route_report(30); +``` + +示例输出: + +``` +route_name | event_tag | event_count | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Downtown Loop | pedestrian | 1 | 0.67 +Airport Connector | hard_brake | 1 | 0.59 +Port Perimeter | lane_merge | 1 | 0.74 +``` + +存储过程可以手动触发,也可以通过 [TASKS](/guides/load-data/continuous-data-pipelines/task) 或编排工具触发。 + +--- + +有了这些表和模式,CityDrive 指南的其余部分就可以引用完全相同的 `video_id` 
键——无论是用于 JSON 搜索的 `frame_metadata_catalog`、用于相似度分析的帧嵌入、用于地理查询的 GPS 位置,还是保持它们同步的单一 ETL 链路。 diff --git a/docs/cn/guides/54-query/01-json-search.md b/docs/cn/guides/54-query/01-json-search.md index c33105526e..f5cc37a4e6 100644 --- a/docs/cn/guides/54-query/01-json-search.md +++ b/docs/cn/guides/54-query/01-json-search.md @@ -2,12 +2,12 @@ title: JSON 与搜索 --- -> **场景:** CityDrive 会为每个抽取出来的帧附带一份 JSON 元数据,并希望直接在 Databend 内用 Elasticsearch 风格的过滤语法完成检索,而不用把数据复制到别的系统。 +> **场景:** CityDrive 会为每个提取的视频帧关联一份 JSON 元数据。这些 JSON 数据由后台工具从视频关键帧中提取,包含了场景识别、物体检测等丰富的非结构化信息。我们需要直接在 Databend 中使用类似 Elasticsearch 的语法对这些 JSON 进行检索,避免将数据复制到外部系统。 -Databend 可以在同一仓库里托管多模态信号:VARIANT 列支持倒排索引,位图表刻画标签覆盖率,向量索引用于相似度查询,原生 GEOMETRY 列提供空间过滤。 +Databend 将这些多模态信号统一存储在一个数仓中。通过倒排索引,我们可以在 VARIANT 列上实现 ES 风格的搜索;利用位图索引(Bitmap)加速标签筛选;使用向量索引处理相似度查询;同时借助原生的 GEOMETRY 列支持地理空间过滤。 ## 1. 创建元数据表 -每个帧保存一份 JSON,有了共同的结构,任意查询都可以复用。 +为每一帧存储一份 JSON 数据,确保所有查询都基于统一的结构进行。 ```sql CREATE DATABASE IF NOT EXISTS video_unified_demo; @@ -18,12 +18,20 @@ CREATE OR REPLACE TABLE frame_metadata_catalog ( meta_json VARIANT, captured_at TIMESTAMP, INVERTED INDEX idx_meta_json (meta_json) -) CLUSTER BY (captured_at); +); + +-- 下方查询示例用到的样本行。 +INSERT INTO frame_metadata_catalog VALUES + ('FRAME-0101', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', 
PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` -> 需要同时管理多模态数据(向量嵌入、GPS 轨迹、标签位图)?可以直接复用 [向量](./02-vector-db.md) 与 [地理](./03-geo-analytics.md) 指南里的建表语句,再同 JSON 结果拼接。 +> 需要多模态数据(如向量嵌入、GPS 轨迹、标签位图)?可以参考 [向量](./02-vector-db.md) 和 [地理](./03-geo-analytics.md) 指南中的表结构,将它们与这里的搜索结果关联起来。 -## 2. 使用 `QUERY()` 的检索模式 +## 2. 
使用 `QUERY()` 进行检索 ### 数组匹配 ```sql SELECT doc_id, @@ -35,7 +43,14 @@ ORDER BY captured_at DESC LIMIT 5; ``` -### 布尔 AND +示例输出: + +``` +doc_id | captured_at | detections +FRAME-0102 | 2025-01-01 08:33:54 | {"objects":[{"confidence":0.92,"type":"pedestrian"},{"confidence":0.35,"type":"bike"}]} +``` + +### 布尔与 (AND) ```sql SELECT doc_id, captured_at FROM frame_metadata_catalog @@ -44,7 +59,15 @@ WHERE QUERY('meta_json.scene.weather_code:rain ORDER BY captured_at; ``` -### 布尔 OR / 列表 +示例输出: + +``` +doc_id | captured_at +FRAME-0101 | 2025-01-01 08:15:21 +FRAME-0102 | 2025-01-01 08:33:54 +``` + +### 布尔或 (OR) / 列表匹配 ```sql SELECT doc_id, meta_json['media_meta']['tagging']['labels'] AS labels @@ -54,6 +77,16 @@ ORDER BY captured_at DESC LIMIT 10; ``` +示例输出: + +``` +doc_id | labels +FRAME-0301 | ["hard_brake","highway"] +FRAME-0201 | ["lane_merge","urban"] +FRAME-0102 | ["pedestrian","swerve","crosswalk"] +FRAME-0101 | ["hard_brake","rain","downtown_loop"] +``` + ### 数值范围 ```sql SELECT doc_id, @@ -64,7 +97,17 @@ ORDER BY speed DESC LIMIT 10; ``` -### 权重(Boosting) +示例输出: + +``` +doc_id | speed +FRAME-0301 | 52.6 +FRAME-0201 | 48.1 +FRAME-0401 | 38.9 +FRAME-0101 | 32.4 +``` + +### 权重提升 (Boosting) ```sql SELECT doc_id, SCORE() AS relevance @@ -74,4 +117,12 @@ ORDER BY relevance DESC LIMIT 8; ``` -`QUERY()` 遵循 Elasticsearch 的语义(布尔逻辑、范围、权重、列表等),`SCORE()` 则暴露检索相关性,方便在 SQL 里直接排序。完整算子列表见:[搜索函数](/sql/sql-functions/search-functions)。 +示例输出: + +``` +doc_id | relevance +FRAME-0101 | 7.0161 +FRAME-0102 | 3.6252 +``` + +`QUERY()` 支持 Elasticsearch 的语义(包括布尔逻辑、范围查询、权重提升、列表匹配等)。`SCORE()` 函数则返回相关性评分,方便直接在 SQL 中对结果进行重排序。完整的算子列表请参考:[搜索函数](/sql/sql-functions/search-functions)。 diff --git a/docs/cn/guides/54-query/02-vector-db.md b/docs/cn/guides/54-query/02-vector-db.md index 30e8404ff9..67d746ba61 100644 --- a/docs/cn/guides/54-query/02-vector-db.md +++ b/docs/cn/guides/54-query/02-vector-db.md @@ -2,12 +2,12 @@ title: 向量搜索 --- -> **场景:** CityDrive 把每个帧的嵌入直接存放在 Databend,语义相似搜索(“找出和它看起来像的帧”)便可与传统 
SQL 分析一同运行,无需再部署独立的向量服务。 +> **场景:** CityDrive 将每一帧的向量嵌入(Embeddings)直接存储在 Databend 中。这些向量数据是 AI 模型对视频关键帧进行推理的结果,用于捕捉画面的视觉语义特征。这样一来,语义相似度搜索(即“查找与此画面相似的帧”)就可以与传统的 SQL 分析任务并行运行,而无需部署额外的向量数据库服务。 -`frame_embeddings` 表与 `frame_events`、`frame_metadata_catalog`、`frame_geo_points` 共用同一批 `frame_id`,让语义检索与常规 SQL 牢牢绑定在一起。 +`frame_embeddings` 表与 `frame_events`、`frame_metadata_catalog` 以及 `frame_geo_points` 表共用同一套 `frame_id` 主键,这使得语义搜索能够与经典 SQL 查询紧密结合,无缝衔接。 -## 1. 准备嵌入表 -生产模型通常输出 512–1536 维,本例使用 512 维方便直接复制到演示集群。 +## 1. 准备向量表 +生产环境中的模型通常会输出 512 到 1536 维的向量。为了方便您直接复制到演示集群中运行,本例将使用 512 维向量,无需修改 DDL。 ```sql CREATE OR REPLACE TABLE frame_embeddings ( @@ -16,23 +16,75 @@ CREATE OR REPLACE TABLE frame_embeddings ( sensor_view STRING, embedding VECTOR(512), encoder_build STRING, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + created_at TIMESTAMP, VECTOR INDEX idx_frame_embeddings(embedding) distance='cosine' ); -INSERT INTO frame_embeddings VALUES - ('FRAME-0101', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), - ('FRAME-0102', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), - ('FRAME-0201', 'VID-20250101-002', 'front_cam',RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT), - ('FRAME-0401', 'VID-20250103-001', 'rear_cam', RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT); +-- SQL UDF:通过 ARRAY_AGG + 窗口函数构建 512 维向量;仅作为教程占位符。 +CREATE OR REPLACE FUNCTION demo_random_vector(seed STRING) +RETURNS TABLE(embedding VECTOR(512)) +AS $$ +SELECT CAST( + ARRAY_AGG(rand_val) OVER ( + PARTITION BY seed + ORDER BY seq + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) + AS VECTOR(512) + ) AS embedding +FROM ( + SELECT seed, + dims.number AS seq, + (RAND() * 0.2 - 0.1)::FLOAT AS rand_val + FROM numbers(512) AS dims +) vals +QUALIFY ROW_NUMBER() OVER (PARTITION BY seed ORDER BY seq) = 1; +$$; + +INSERT INTO frame_embeddings (frame_id, video_id, sensor_view, embedding, encoder_build, created_at) +SELECT 'FRAME-0101', 
'VID-20250101-001', 'roof_cam', embedding, 'clip-lite-v1', '2025-01-01 08:15:21'
+FROM demo_random_vector('FRAME-0101')
+UNION ALL
+SELECT 'FRAME-0102', 'VID-20250101-001', 'roof_cam', embedding, 'clip-lite-v1', '2025-01-01 08:33:54'
+FROM demo_random_vector('FRAME-0102')
+UNION ALL
+SELECT 'FRAME-0201', 'VID-20250101-002', 'front_cam', embedding, 'night-fusion-v2', '2025-01-01 11:12:02'
+FROM demo_random_vector('FRAME-0201')
+UNION ALL
+SELECT 'FRAME-0401', 'VID-20250103-001', 'rear_cam', embedding, 'night-fusion-v2', '2025-01-03 21:18:07'
+FROM demo_random_vector('FRAME-0401');
 ```
 
-文档:[向量类型](/sql/sql-reference/data-types/vector)、[向量索引](/sql/sql-reference/data-types/vector#vector-indexing)。
+> 此向量生成器仅用于让教程自包含;在生产环境中,请将其替换为由模型推理得到的真实嵌入。
+
+如果您尚未运行 SQL 分析指南,请先创建配套的 `frame_events` 表,并插入本向量演练后续要连接的同一批样本行:
+
+```sql
+CREATE OR REPLACE TABLE frame_events (
+    frame_id STRING,
+    video_id STRING,
+    frame_index INT,
+    collected_at TIMESTAMP,
+    event_tag STRING,
+    risk_score DOUBLE,
+    speed_kmh DOUBLE
+);
+
+INSERT INTO frame_events VALUES
+    ('FRAME-0101', 'VID-20250101-001', 125, '2025-01-01 08:15:21', 'hard_brake', 0.81, 32.4),
+    ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8),
+    ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1),
+    ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6),
+    ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9),
+    ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0);
+```
+
+文档:[Vector 类型](/sql/sql-reference/data-types/vector) 和 [Vector 索引](/sql/sql-reference/data-types/vector#vector-indexing)。
 
 ---
 
-## 2. 运行余弦搜索
-先取出某一帧的嵌入,再让 HNSW 索引返回最近邻。
+## 2. 
运行余弦相似度搜索 +提取某一帧的向量嵌入,并利用 HNSW 索引快速返回其最近邻(Nearest Neighbours)。 ```sql WITH query_embedding AS ( @@ -49,9 +101,18 @@ ORDER BY distance LIMIT 3; ``` -距离越小越相似。即便有数百万帧,`VECTOR INDEX` 也能让响应保持毫秒级。 +示例输出: + +``` +frame_id | video_id | distance +FRAME-0101| VID-20250101-001 | 0.0000 +FRAME-0201| VID-20250101-002 | 0.9801 +FRAME-0102| VID-20250101-001 | 0.9842 +``` -继续叠加传统谓词(如路线、视频、传感器视角),即可在向量比对前后收窄候选集。 +距离越小 = 越相似。即使有数百万帧,`VECTOR INDEX` 也能保持低延迟。 + +在向量比较之前或之后添加传统谓词(路线、视频、传感器视图)以缩小候选集。 ```sql WITH query_embedding AS ( @@ -69,12 +130,19 @@ ORDER BY distance LIMIT 5; ``` -优化器会在满足 `sensor_view` 过滤的同时继续走向量索引。 +示例输出: + +``` +frame_id | sensor_view | distance +FRAME-0401| rear_cam | 1.0537 +``` + +优化器仍然使用向量索引,同时遵循 `sensor_view` 过滤器。 --- -## 3. 丰富相似帧 -把 Top-N 相似帧物化,再与 `frame_events` 连接,方便下游分析。 +## 3. 丰富相似帧信息 +将匹配度最高的 Top-N 帧物化(Materialize),然后关联 `frame_events` 表,为下游分析提供更多上下文信息。 ```sql WITH query_embedding AS ( @@ -101,4 +169,14 @@ LEFT JOIN frame_events fe USING (frame_id) ORDER BY sf.distance; ``` -嵌入与关系表同库共存,调查人员可以立即从“视觉相似”跳转到“同时伴随 `hard_brake` 标签、特定天气或 JSON 检测”的线索,无需导出数据。 +示例输出: + +``` +frame_id | video_id | event_tag | risk_score | distance +FRAME-0102| VID-20250101-001 | pedestrian | 0.67 | 0.0000 +FRAME-0201| VID-20250101-002 | lane_merge | 0.74 | 0.9802 +FRAME-0101| VID-20250101-001 | hard_brake | 0.81 | 0.9842 +FRAME-0401| VID-20250103-001 | night_lowlight | 0.63 | 1.0020 +``` + +由于向量嵌入与关系型表存储在同一个库中,您可以轻松地从“查找相似画面”跳转到“查找同时包含急刹车标签、特定天气或 JSON 检测结果的帧”,而无需将数据导出到其他服务中进行处理。 diff --git a/docs/cn/guides/54-query/03-geo-analytics.md b/docs/cn/guides/54-query/03-geo-analytics.md index 0ab247ff1d..9400a0bb1d 100644 --- a/docs/cn/guides/54-query/03-geo-analytics.md +++ b/docs/cn/guides/54-query/03-geo-analytics.md @@ -2,12 +2,12 @@ title: 地理分析 --- -> **场景:** CityDrive 会为每个被标记的帧记录精准的 GPS 定位以及与信号灯的距离,运营人员可以纯 SQL 回答“事故发生在什么位置?”之类的问题。 +> **场景:** CityDrive 为每个标记帧记录了精确的 GPS 定位以及与交通信号灯的距离。这些地理空间数据源于行车记录仪的 GPS 模块,并已与视频关键帧的时间戳精准对齐。运营团队可以完全通过 SQL 来回答“事件发生在哪里?”这类问题。 -`frame_geo_points` 与 
`signal_contact_points` 同样复用本指南里的 `video_id` / `frame_id`,因此可以在不复制数据的情况下把 SQL 指标延伸到地图视图。 +`frame_geo_points` 和 `signal_contact_points` 表与本指南其他部分一样,共用 `video_id`/`frame_id` 键。这意味着您无需复制数据,即可将分析视角从 SQL 指标切换到地图视图。 ## 1. 创建位置表 -如果你已完成 JSON 指南,这些表应该已经存在。下方片段包含表结构以及几条深圳示例数据。 +如果您已经完成了 JSON 指南的操作,这些表可能已经存在。下方的代码展示了表结构以及几条深圳地区的示例数据。 ```sql CREATE OR REPLACE TABLE frame_geo_points ( @@ -20,11 +20,11 @@ CREATE OR REPLACE TABLE frame_geo_points ( ); INSERT INTO frame_geo_points VALUES - ('VID-20250101-001','FRAME-0101',TO_GEOMETRY('SRID=4326;POINT(114.0579 22.5431)'),104,'fusion_gnss','2025-01-01 08:15:21'), - ('VID-20250101-001','FRAME-0102',TO_GEOMETRY('SRID=4326;POINT(114.0610 22.5460)'),104,'fusion_gnss','2025-01-01 08:33:54'), - ('VID-20250101-002','FRAME-0201',TO_GEOMETRY('SRID=4326;POINT(114.1040 22.5594)'),104,'fusion_gnss','2025-01-01 11:12:02'), - ('VID-20250102-001','FRAME-0301',TO_GEOMETRY('SRID=4326;POINT(114.0822 22.5368)'),104,'fusion_gnss','2025-01-02 09:44:18'), - ('VID-20250103-001','FRAME-0401',TO_GEOMETRY('SRID=4326;POINT(114.1195 22.5443)'),104,'fusion_gnss','2025-01-03 21:18:07'); + ('VID-20250101-001','FRAME-0101',TO_GEOMETRY('SRID=4326;POINT(114.0579123456789 22.543123456789)'),104,'fusion_gnss','2025-01-01 08:15:21'), + ('VID-20250101-001','FRAME-0102',TO_GEOMETRY('SRID=4326;POINT(114.0610987654321 22.546098765432)'),104,'fusion_gnss','2025-01-01 08:33:54'), + ('VID-20250101-002','FRAME-0201',TO_GEOMETRY('SRID=4326;POINT(114.104012345678 22.559456789012)'),104,'fusion_gnss','2025-01-01 11:12:02'), + ('VID-20250102-001','FRAME-0301',TO_GEOMETRY('SRID=4326;POINT(114.082265432109 22.53687654321)'),104,'fusion_gnss','2025-01-02 09:44:18'), + ('VID-20250103-001','FRAME-0401',TO_GEOMETRY('SRID=4326;POINT(114.119501234567 22.544365432101)'),104,'fusion_gnss','2025-01-03 21:18:07'); CREATE OR REPLACE TABLE signal_contact_points ( node_id STRING, @@ -35,6 +35,43 @@ CREATE OR REPLACE TABLE signal_contact_points ( distance_m DOUBLE, created_at TIMESTAMP ); + 
+INSERT INTO signal_contact_points VALUES + ('SIG-0001', TO_GEOMETRY('SRID=4326;POINT(114.058500123456 22.543800654321)'), 'VID-20250101-001', 'FRAME-0101', TO_GEOMETRY('SRID=4326;POINT(114.0579123456789 22.543123456789)'), 0.012345, '2025-01-01 08:15:30'), + ('SIG-0002', TO_GEOMETRY('SRID=4326;POINT(114.118900987654 22.544800123456)'), 'VID-20250103-001', 'FRAME-0401', TO_GEOMETRY('SRID=4326;POINT(114.119501234567 22.544365432101)'), 0.008765, '2025-01-03 21:18:20'); + +-- 下方查询会连接到的帧事件与 JSON 表(与 SQL/搜索指南里一致)。 +CREATE OR REPLACE TABLE frame_events ( + frame_id STRING, + video_id STRING, + frame_index INT, + collected_at TIMESTAMP, + event_tag STRING, + risk_score DOUBLE, + speed_kmh DOUBLE +); + +INSERT INTO frame_events VALUES + ('FRAME-0101', 'VID-20250101-001', 125, '2025-01-01 08:15:21', 'hard_brake', 0.81, 32.4), + ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8), + ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1), + ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6), + ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9), + ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0); + +CREATE OR REPLACE TABLE frame_metadata_catalog ( + doc_id STRING, + meta_json VARIANT, + captured_at TIMESTAMP, + INVERTED INDEX idx_meta_json (meta_json) +); + +INSERT INTO frame_metadata_catalog VALUES + ('FRAME-0101', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', 
PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` 文档:[地理空间数据类型](/sql/sql-reference/data-types/geospatial)。 @@ -42,7 +79,7 @@ CREATE OR REPLACE TABLE signal_contact_points ( --- ## 2. 
空间过滤
-可计算帧与市中心坐标的距离,或检查它是否落在多边形内部。需要以米为单位时,把坐标投影到 SRID 3857。
+计算每一帧与市中心关键坐标的距离,或者检查它是否落在某个多边形区域内。当需要以米为单位计算距离时,请将坐标转换为 SRID 3857 投影。
 
 ```sql
 SELECT l.frame_id,
@@ -61,7 +98,15 @@ WHERE ST_DISTANCE(
 ORDER BY meters_from_hq;
 ```
 
-调试时可以输出 `ST_ASTEXT(l.position_wgs84)`,若偏好直接使用球面距离,可改用 [`HAVERSINE`](/sql/sql-functions/geospatial-functions#trigonometric-distance-functions)。
+示例输出:
+
+```
+frame_id | video_id | event_tag | meters_from_hq
+FRAME-0102| VID-20250101-001 | pedestrian | 180.277138577
+FRAME-0101| VID-20250101-001 | hard_brake | 324.291965923
+```
+
+提示:调试时可以使用 `ST_ASTEXT(l.position_wgs84)` 查看几何文本;如果更偏好大圆距离计算,可以使用 [`HAVERSINE`](/sql/sql-functions/geospatial-functions#trigonometric-distance-functions)。
 
 ```sql
 WITH school_zone AS (
@@ -82,10 +127,18 @@ CROSS JOIN school_zone
 WHERE ST_CONTAINS(poly, l.position_wgs84);
 ```
 
+示例输出:
+
+```
+frame_id | video_id | event_tag
+FRAME-0101| VID-20250101-001 | hard_brake
+FRAME-0102| VID-20250101-001 | pedestrian
+```
+
 ---
 
-## 3. 六边形聚合
-把风险帧聚合进 H3 单元,用于仪表盘或热力图。
+## 3. 六边形网格聚合
+将风险帧聚合到六边形网格(H3)中,以便在仪表盘上进行展示。
 
 ```sql
 SELECT GEO_TO_H3(ST_X(position_wgs84), ST_Y(position_wgs84), 8) AS h3_cell,
@@ -97,12 +150,23 @@ GROUP BY h3_cell
 ORDER BY avg_risk DESC;
 ```
 
+示例输出:
+
+```
+h3_cell | frame_count | avg_risk
+613635011200942079| 1 | 0.81
+613635011532292095| 1 | 0.74
+613635011238690815| 1 | 0.67
+613635015391051775| 1 | 0.63
+613635011309993983| 1 | 0.59
+```
+
 文档:[H3 函数](/sql/sql-functions/geospatial-functions#h3-indexing--conversion)。
 
 ---
 
-## 4. 交通信号上下文
-连接 `signal_contact_points` 与 `frame_geo_points`,即可验证存量指标或把空间条件与 JSON 搜索联动。
+## 4. 
交通环境上下文 +将 `signal_contact_points` 与 `frame_geo_points` 进行连接,以验证存储的指标,或者将空间查询条件与 JSON 搜索相结合。 ```sql SELECT t.node_id, @@ -113,18 +177,26 @@ SELECT t.node_id, l.source_system FROM signal_contact_points AS t JOIN frame_geo_points AS l USING (frame_id) -WHERE t.distance_m < 0.03 -- 不同投影下约等于 30 米 +WHERE t.distance_m < 0.03 -- 约小于 30 米,具体取决于 SRID ORDER BY t.distance_m; ``` +示例输出: + +``` +node_id | video_id | frame_id | recomputed_distance | stored_distance | source_system +SIG-0002| VID-20250103-001 | FRAME-0401| 0.000741116 | 0.008765 | fusion_gnss +SIG-0001| VID-20250101-001 | FRAME-0101| 0.000896705 | 0.012345 | fusion_gnss +``` + ```sql WITH near_junction AS ( SELECT frame_id FROM frame_geo_points WHERE ST_DISTANCE( ST_TRANSFORM(position_wgs84, 3857), - ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0700 22.5400)'), 3857) - ) <= 150 + ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0830 22.5370)'), 3857) + ) <= 200 ) SELECT f.frame_id, f.event_tag, @@ -136,12 +208,19 @@ JOIN frame_metadata_catalog AS meta WHERE QUERY('meta_json.media_meta.tagging.labels:hard_brake'); ``` -这类模式可以先按地理范围筛选,再对剩余帧执行 JSON 搜索。 +示例输出: + +``` +frame_id | event_tag | labels +FRAME-0301| hard_brake | ["hard_brake","highway"] +``` + +这种模式允许您先通过地理范围进行过滤,然后再对筛选出的帧应用 JSON 搜索。 --- -## 5. 发布热力视图 -把空间摘要封装成视图,供 BI 或 GIS 工具直接查询。 +## 5. 
发布热力图视图 +将地理热力图封装为视图,供 BI 或 GIS 工具直接查询,避免重复运行繁重的 SQL 计算。 ```sql CREATE OR REPLACE VIEW v_citydrive_geo_heatmap AS @@ -153,4 +232,15 @@ JOIN frame_events AS f USING (frame_id) GROUP BY h3_cell; ``` -同一批 `video_id` 现在既能支撑向量、文本,也能支撑空间查询,调查团队不再需要维护额外的管道。 +示例输出: + +``` +h3_cell | frames | avg_risk +609131411584057343| 1 | 0.81 +609131411919601663| 1 | 0.74 +609131411617611775| 1 | 0.67 +609131415778361343| 1 | 0.63 +609131411684720639| 1 | 0.59 +``` + +现在,Databend 可以基于完全相同的 `video_id` 同时提供向量、文本和空间查询服务,调查团队再也不需要费力地协调多个独立的数据管道了。 diff --git a/docs/cn/guides/54-query/04-lakehouse-etl.md b/docs/cn/guides/54-query/04-lakehouse-etl.md index 6de3620406..61190a626b 100644 --- a/docs/cn/guides/54-query/04-lakehouse-etl.md +++ b/docs/cn/guides/54-query/04-lakehouse-etl.md @@ -1,21 +1,21 @@ --- -title: 湖仓 ETL +title: Lakehouse ETL --- -> **场景:** CityDrive 的数据工程团队会把每一批行车录像导出成 Parquet(视频、帧事件、JSON 元数据、嵌入、GPS 轨迹、信号灯距离),希望用一套 COPY 流程将共享表刷新到 Databend。 +> **场景:** CityDrive 的数据工程团队将每一批行车记录仪数据导出为 Parquet 格式(包含视频、帧事件、元数据 JSON、向量嵌入、GPS 轨迹、交通信号灯距离)。这些文件汇聚了从原始视频流中提取的所有多模态信号,构成了数仓的数据基础。团队希望通过一条 COPY 管道将这些数据更新到 Databend 的表中。 -加载闭环非常直接: +加载流程非常直观: ``` -对象存储 → STAGE → COPY INTO 表 → (可选)STREAMS / TASKS +对象存储 -> STAGE -> COPY INTO 表 -> (可选) STREAMS/TASKS ``` -根据自己的桶路径或格式进行调整,然后直接执行下面的 SQL。语法与[加载数据指南](/guides/load-data/)一致。 +请根据您的环境调整存储桶路径或文件格式,然后粘贴以下命令。语法与 [数据加载指南](/guides/load-data/) 保持一致。 --- ## 1. 创建 Stage -为 CityDrive 导出的桶创建可复用的 Stage。示例使用 Parquet,你可以改成任意受支持的格式。 +创建一个可复用的 Stage,指向存储 CityDrive 导出数据的存储桶。请将凭证/URL 替换为您自己的账户信息;本例使用 Parquet 格式,但只要指定相应的 `FILE_FORMAT`,任何支持的格式均可使用。 ```sql CREATE OR REPLACE CONNECTION citydrive_s3 @@ -30,9 +30,9 @@ CREATE OR REPLACE STAGE citydrive_stage ``` > [!IMPORTANT] -> 请把示例中的 AWS 密钥与桶地址替换成真实值,否则 `LIST`、`SELECT ... FROM @citydrive_stage`、`COPY INTO` 都会因为 403/`InvalidAccessKeyId` 失败。 +> 请务必将占位符 AWS 密钥和存储桶 URL 替换为您环境中的真实值。如果没有有效的凭证,`LIST`、`SELECT ... 
FROM @citydrive_stage` 和 `COPY INTO` 语句将因 S3 返回 `InvalidAccessKeyId`/403 错误而失败。 -快速检查: +快速完整性检查: ```sql LIST @citydrive_stage/videos/; @@ -46,7 +46,7 @@ LIST @citydrive_stage/traffic-lights/; --- ## 2. 预览文件 -在装载前对 Stage 做一次 `SELECT`,确认 schema 与样例行。 +在正式加载之前,使用 `SELECT` 查询 Stage 中的文件,以确认 Schema 和样本行数据。 ```sql SELECT * @@ -58,12 +58,12 @@ FROM @citydrive_stage/frame-events/batch_2025_01_01.parquet LIMIT 5; ``` -Databend 会沿用 Stage 定义的文件格式,因此无需额外参数。 +Databend 会根据 Stage 定义自动推断格式,因此此处无需指定额外选项。 --- -## 3. COPY INTO 统一表 -每份导出都对应指南里的一张共享表。内联的 `::TYPE` 转换可以保证上下游 schema 一致。 +## 3. COPY INTO 目标表 +每个导出文件对应指南中使用的表之一。使用内联转换(Inline Casts)可以保持 Schema 的一致性,即使上游字段顺序发生变化也不受影响。 ### `citydrive_videos` ```sql @@ -116,7 +116,7 @@ FROM ( SELECT frame_id::STRING, video_id::STRING, sensor_view::STRING, - embedding::VECTOR(768), -- 根据实际维度调整 + embedding::VECTOR(768), -- 请替换为您的实际维度 encoder_build::STRING, created_at::TIMESTAMP FROM @citydrive_stage/frame-embeddings/ @@ -155,27 +155,27 @@ FROM ( FILE_FORMAT = (TYPE = 'PARQUET'); ``` -完成后,SQL 分析、`QUERY()` 搜索、向量相似、地理过滤等所有负载都会读取完全相同的数据。 +在此步骤之后,每个下游工作负载——SQL 分析、Elasticsearch `QUERY()`、向量相似度、地理空间过滤——都将读取完全相同的数据。 --- -## 4. Streams(可选) -想让下游作业只消费最近一次批量新增的数据?给目标表创建 Stream。 +## 4. 使用 Streams 进行增量处理(可选) +如果希望下游任务仅消费自上一批次以来新增的行,可以使用 Streams。 ```sql CREATE OR REPLACE STREAM frame_events_stream ON TABLE frame_events; -SELECT * FROM frame_events_stream; -- 查看刚 COPY 的新行 --- …处理… -SELECT * FROM frame_events_stream WITH CONSUME; -- 推进游标 +SELECT * FROM frame_events_stream; -- 显示新复制的行 +-- …处理行… +SELECT * FROM frame_events_stream WITH CONSUME; -- 推进偏移量 ``` -`WITH CONSUME` 会在你处理完行后向前推进 offset。参考:[Streams](/guides/load-data/continuous-data-pipelines/stream)。 +`WITH CONSUME` 确保 Stream 游标在行被处理后向前移动。参考文档:[Streams](/guides/load-data/continuous-data-pipelines/stream)。 --- -## 5. Tasks(可选) -Task 会按计划运行**单条 SQL**。你可以为每张表建一个轻量 Task,或把逻辑写成存储过程后在 Task 中调用。 +## 5. 
使用 Tasks 进行定时加载(可选) +Task 可以按计划运行**一条 SQL 语句**。您可以为每个表创建轻量级的 Task,或者如果更喜欢单一入口,也可以将逻辑封装在存储过程中。 ```sql CREATE OR REPLACE TASK task_load_citydrive_videos @@ -200,7 +200,7 @@ ALTER TASK task_load_citydrive_videos RESUME; CREATE OR REPLACE TASK task_load_frame_events WAREHOUSE = 'default' SCHEDULE = 10 MINUTE -AS + AS COPY INTO frame_events (frame_id, video_id, frame_index, collected_at, event_tag, risk_score, speed_kmh) FROM ( SELECT frame_id::STRING, @@ -217,8 +217,8 @@ AS ALTER TASK task_load_frame_events RESUME; ``` -其余表可以按同样模式新增 Task。更多调度/依赖选项见:[Tasks](/guides/load-data/continuous-data-pipelines/task)。 +使用相同的模式为 `frame_metadata_catalog`、嵌入或 GPS 数据添加更多 Task。完整选项请参考:[Tasks](/guides/load-data/continuous-data-pipelines/task)。 --- -当这些作业运行后,“统一工作负载”系列里的每个指南都读取相同的 CityDrive 表——无需额外 ETL,也不需要重复存储。 +一旦这些任务运行起来,统一工作负载系列中的每个指南都将读取相同的 CityDrive 表——无需额外的 ETL 层,也无需重复存储数据。 diff --git a/docs/cn/guides/54-query/index.md b/docs/cn/guides/54-query/index.md index 035b9c9200..79aca39849 100644 --- a/docs/cn/guides/54-query/index.md +++ b/docs/cn/guides/54-query/index.md @@ -1,15 +1,17 @@ --- -title: 统一引擎场景 +title: 统一工作负载 --- -CityDrive Intelligence 会保存每一次行车记录:把整段视频拆成帧,并为每个 `video_id` 写入结构化元数据、JSON 清单、行为标签、向量特征以及 GPS 轨迹。下面这一组指南展示 Databend 如何把这些需求都跑在同一个数仓里,既不需要复制数据,也不用额外搭建搜索或向量集群。 +CityDrive Intelligence 记录每一次行车过程的视频。通过后台处理工具,系统将视频流拆解为关键帧图片,进而从每张图片中提取出丰富的多模态信息,并以 `video_id` 为核心进行存储。这些信息涵盖了关系型元数据、JSON 清单、行为标签、向量嵌入以及 GPS 轨迹。 -| 指南 | 内容摘要 | +本系列指南将展示 Databend 如何在一个数仓中统一处理所有这些工作负载——既无需数据搬运,也无需维护额外的搜索集群。 + +| 指南 | 涵盖内容 | |-------|----------------| -| [SQL 分析](./00-sql-analytics.md) | 构建基础表,示范过滤、连接、窗口与聚合索引 | -| [JSON 与搜索](./01-json-search.md) | 加载 `frame_metadata_catalog`,运行 Elasticsearch `QUERY()`,关联位图标签 | -| [向量搜索](./02-vector-db.md) | 保留向量特征,用余弦距离做语义相似度检索,并联动风险指标 | -| [地理分析](./03-geo-analytics.md) | 运用 `GEOMETRY`、距离/多边形过滤以及信号灯关联 | -| [湖仓 ETL](./04-lakehouse-etl.md) | 一次暂存,`COPY INTO` 共享表,并可选配 Streams/Tasks | +| [SQL 分析](./00-sql-analytics.md) | 基础表设计、过滤、连接、窗口函数及聚合索引 | +| [JSON 
与搜索](./01-json-search.md) | 加载 `frame_metadata_catalog`,执行 Elasticsearch 风格的 `QUERY()`,并关联位图标签 | +| [向量搜索](./02-vector-db.md) | 存储向量嵌入,运行余弦相似度搜索,并关联风险指标 | +| [地理空间分析](./03-geo-analytics.md) | 利用 `GEOMETRY` 类型,进行距离/多边形过滤及红绿灯关联查询 | +| [Lakehouse ETL](./04-lakehouse-etl.md) | 一次暂存 (Stage),通过 `COPY INTO` 写入目标表,并添加流/任务 (Streams/Tasks) | -按顺序体验,即可看到同一批 CityDrive 标识符如何贯穿经典 SQL、全文检索、向量、地理和 ETL,全程由 Databend 的单一执行引擎托管。 +建议按顺序阅读,体验同一套标识符如何贯穿经典 SQL、文本搜索、向量分析、地理空间分析和 ETL 流程——所有这一切都基于同一个 CityDrive 业务场景。 diff --git a/docs/en/guides/54-query/00-sql-analytics.md b/docs/en/guides/54-query/00-sql-analytics.md index ae1d8ad852..e0a48b2c7a 100644 --- a/docs/en/guides/54-query/00-sql-analytics.md +++ b/docs/en/guides/54-query/00-sql-analytics.md @@ -2,7 +2,7 @@ title: SQL Analytics --- -> **Scenario:** CityDrive stages every dash-cam run into shared relational tables so analysts can filter, join, and aggregate the same `video_id` / `frame_id` pairs for all downstream workloads. +> **Scenario:** CityDrive stages all dash-cam records into shared relational tables. This relational data (e.g., video metadata, event tags) is extracted by background processing pipelines from keyframes of the raw dash-cam video. Analysts can then filter, join, and aggregate on the same `video_id` / `frame_id` pairs used by all downstream workloads. This walkthrough models the relational side of that catalog and highlights practical SQL building blocks. The sample IDs here appear again in the JSON, vector, geo, and ETL guides. 
@@ -41,7 +41,24 @@ INSERT INTO frame_events VALUES ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8), ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1), ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6), - ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9); + ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9), + -- Deliberate orphan to illustrate NOT EXISTS + ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0); + +-- Needed for the JOIN patterns below; same schema as the JSON & Search guide. +CREATE OR REPLACE TABLE frame_metadata_catalog ( + doc_id STRING, + meta_json VARIANT, + captured_at TIMESTAMP, + INVERTED INDEX idx_meta_json (meta_json) +); + +INSERT INTO frame_metadata_catalog VALUES + ('FRAME-0101', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', 
PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` Docs: [CREATE TABLE](/sql/sql-commands/ddl/table/ddl-create-table), [INSERT](/sql/sql-commands/dml/dml-insert). @@ -49,13 +66,14 @@ Docs: [CREATE TABLE](/sql/sql-commands/ddl/table/ddl-create-table), [INSERT](/sq --- ## 2. Filter the Working Set -Keep investigations focused on fresh drives. +Keep investigations focused on the Jan 1–3 snapshot from the seed data so the demo always returns rows. ```sql WITH recent_videos AS ( SELECT * FROM citydrive_videos - WHERE capture_date >= DATEADD('day', -3, TODAY()) + WHERE capture_date >= '2025-01-01' + AND capture_date < '2025-01-04' ) SELECT v.video_id, v.route_name, @@ -69,6 +87,16 @@ ORDER BY flagged_frames DESC; Docs: [DATEADD](/sql/sql-functions/datetime-functions/date-add), [GROUP BY](/sql/sql-commands/query-syntax/query-select#group-by-clause). +Sample output: + +``` +video_id | route_name | weather | flagged_frames +VID-20250101-001| Downtown Loop | Rain | 2 +VID-20250101-002| Port Perimeter | Overcast | 1 +VID-20250102-001| Airport Connector | Clear | 1 +VID-20250103-001| CBD Night Sweep | LightFog | 1 +``` + --- ## 3. 
JOIN Patterns @@ -84,6 +112,17 @@ JOIN citydrive_videos AS v USING (video_id) ORDER BY f.collected_at; ``` +Sample output: + +``` +frame_id | event_tag | risk_score | route_name | camera_source +FRAME-0101| hard_brake | 0.81 | Downtown Loop | roof_cam +FRAME-0102| pedestrian | 0.67 | Downtown Loop | roof_cam +FRAME-0201| lane_merge | 0.74 | Port Perimeter | front_cam +FRAME-0301| hard_brake | 0.59 | Airport Connector | front_cam +FRAME-0401| night_lowlight | 0.63 | CBD Night Sweep | rear_cam +``` + ### Anti join QA ```sql SELECT frame_id @@ -95,6 +134,13 @@ WHERE NOT EXISTS ( ); ``` +Sample output: + +``` +frame_id +FRAME-0501 +``` + ### LATERAL FLATTEN for nested detections ```sql SELECT f.frame_id, @@ -107,6 +153,14 @@ WHERE f.event_tag = 'pedestrian' ORDER BY confidence DESC; ``` +Sample output: + +``` +frame_id | detected_type | confidence +FRAME-0102| pedestrian | 0.92 +FRAME-0102| bike | 0.35 +``` + Docs: [JOIN](/sql/sql-commands/query-syntax/query-join), [FLATTEN](/sql/sql-functions/table-functions/flatten). --- @@ -124,6 +178,17 @@ GROUP BY v.route_name, f.event_tag ORDER BY avg_risk DESC, occurrences DESC; ``` +Sample output: + +``` +route_name | event_tag | occurrences | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +Port Perimeter | lane_merge | 1 | 0.74 +Downtown Loop | pedestrian | 1 | 0.67 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Airport Connector | hard_brake | 1 | 0.59 +``` + ### ROLLUP totals ```sql SELECT v.route_name, @@ -135,6 +200,19 @@ GROUP BY ROLLUP(v.route_name, f.event_tag) ORDER BY v.route_name NULLS LAST, f.event_tag; ``` +Sample output (first 6 rows): + +``` +route_name | event_tag | occurrences +Airport Connector | hard_brake | 1 +Airport Connector | NULL | 1 +CBD Night Sweep | night_lowlight | 1 +CBD Night Sweep | NULL | 1 +Downtown Loop | hard_brake | 1 +Downtown Loop | pedestrian | 1 +... 
(total rows: 10) +``` + ### CUBE for route × weather coverage ```sql SELECT v.route_name, @@ -145,6 +223,19 @@ GROUP BY CUBE(v.route_name, v.weather) ORDER BY v.route_name NULLS LAST, v.weather NULLS LAST; ``` +Sample output (first 6 rows): + +``` +route_name | weather | videos +Airport Connector | Clear | 1 +Airport Connector | NULL | 1 +CBD Night Sweep | LightFog | 1 +CBD Night Sweep | NULL | 1 +Downtown Loop | Rain | 1 +Downtown Loop | NULL | 1 +... (total rows: 13) +``` + --- ## 5. Window Functions @@ -166,6 +257,18 @@ FROM ordered_events ORDER BY video_id, collected_at; ``` +Sample output (first 6 rows): + +``` +video_id | collected_at | risk_score | cumulative_risk +VID-20250101-001| 2025-01-01 08:15:21 | 0.81 | 0.81 +VID-20250101-001| 2025-01-01 08:33:54 | 0.67 | 1.48 +VID-20250101-002| 2025-01-01 11:12:02 | 0.74 | 0.74 +VID-20250102-001| 2025-01-02 09:44:18 | 0.59 | 0.59 +VID-20250103-001| 2025-01-03 21:18:07 | 0.63 | 0.63 +VID-MISSING-001 | 2025-01-04 10:00:00 | 0.25 | 0.25 +``` + ### Rolling average over recent frames ```sql SELECT video_id, @@ -181,6 +284,18 @@ FROM frame_events ORDER BY video_id, frame_index; ``` +Sample output (first 6 rows): + +``` +video_id | frame_id | frame_index | risk_score | rolling_avg_risk +VID-20250101-001| FRAME-0101 | 125 | 0.81 | 0.81 +VID-20250101-001| FRAME-0102 | 416 | 0.67 | 0.74 +VID-20250101-002| FRAME-0201 | 298 | 0.74 | 0.74 +VID-20250102-001| FRAME-0301 | 188 | 0.59 | 0.59 +VID-20250103-001| FRAME-0401 | 522 | 0.63 | 0.63 +VID-MISSING-001 | FRAME-0501 | 10 | 0.25 | 0.25 +``` + Docs: [Window functions](/sql/sql-functions/window-functions). 
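The same `OVER` clause also supports navigation functions. As a quick sketch (not part of the original walkthrough, but using only the `frame_events` columns seeded above), `LAG()` surfaces sudden risk jumps between consecutive frames of the same video:

```sql
-- Sketch: risk delta between consecutive frames per video.
-- LAG() returns NULL for the first frame of each video, so risk_delta is NULL there.
SELECT video_id,
       frame_id,
       risk_score,
       risk_score - LAG(risk_score) OVER (
           PARTITION BY video_id
           ORDER BY collected_at
       ) AS risk_delta
FROM frame_events
ORDER BY video_id, collected_at;
```

With the seed data most videos have a single frame, so only `VID-20250101-001` produces a non-NULL delta.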
--- @@ -208,11 +323,22 @@ SELECT v.route_name, AVG(e.risk_score) AS avg_risk FROM frame_events e JOIN citydrive_videos v USING (video_id) -WHERE v.capture_date >= DATEADD('day', -14, TODAY()) +WHERE v.capture_date >= '2025-01-01' GROUP BY v.route_name, e.event_tag ORDER BY avg_risk DESC; ``` +Sample output: + +``` +route_name | event_tag | event_count | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +Port Perimeter | lane_merge | 1 | 0.74 +Downtown Loop | pedestrian | 1 | 0.67 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Airport Connector | hard_brake | 1 | 0.59 +``` + Docs: [Aggregating Index](/guides/performance/aggregating-index) and [EXPLAIN](/sql/sql-commands/explain-cmds/explain). --- @@ -234,7 +360,7 @@ BEGIN AVG(e.risk_score) AS avg_risk FROM frame_events e JOIN citydrive_videos v USING (video_id) - WHERE v.capture_date >= DATEADD('day', -:days_back, TODAY()) + WHERE v.capture_date >= DATEADD('day', -:days_back, DATE '2025-01-04') GROUP BY v.route_name, e.event_tag ); END; @@ -243,6 +369,17 @@ $$; CALL PROCEDURE citydrive_route_report(30); ``` +Sample output: + +``` +route_name | event_tag | event_count | avg_risk +Downtown Loop | hard_brake | 1 | 0.81 +CBD Night Sweep | night_lowlight | 1 | 0.63 +Downtown Loop | pedestrian | 1 | 0.67 +Airport Connector | hard_brake | 1 | 0.59 +Port Perimeter | lane_merge | 1 | 0.74 +``` + Stored procedures can be triggered manually, via [TASKS](/guides/load-data/continuous-data-pipelines/task), or from orchestration tools. --- diff --git a/docs/en/guides/54-query/01-json-search.md b/docs/en/guides/54-query/01-json-search.md index 5f766298f0..05d5d45ec5 100644 --- a/docs/en/guides/54-query/01-json-search.md +++ b/docs/en/guides/54-query/01-json-search.md @@ -2,7 +2,7 @@ title: JSON & Search --- -> **Scenario:** CityDrive attaches a metadata JSON payload to every extracted frame and needs Elasticsearch-style filtering on that JSON without copying it out of Databend. 
+> **Scenario:** CityDrive attaches a metadata JSON payload to every extracted frame. This JSON data is extracted from video keyframes by background tools, containing rich unstructured information like scene recognition and object detection. We need to filter this JSON in Databend with Elasticsearch-style syntax without replicating it to an external system. Databend keeps these heterogeneous signals in one warehouse. Inverted indexes power Elasticsearch-style search on VARIANT columns, bitmap tables summarize label coverage, vector indexes answer similarity lookups, and native GEOMETRY columns support spatial filters. @@ -18,7 +18,15 @@ CREATE OR REPLACE TABLE frame_metadata_catalog ( meta_json VARIANT, captured_at TIMESTAMP, INVERTED INDEX idx_meta_json (meta_json) -) CLUSTER BY (captured_at); +); + +-- Sample rows for the queries below. +INSERT INTO frame_metadata_catalog VALUES + ('FRAME-0101', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', 
PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` > Need multimodal data (vector embeddings, GPS trails, tag bitmaps)? Grab the schemas from the [Vector](./02-vector-db.md) and [Geo](./03-geo-analytics.md) guides so you can combine them with the search results shown here. @@ -35,6 +43,13 @@ ORDER BY captured_at DESC LIMIT 5; ``` +Sample output: + +``` +doc_id | captured_at | detections +FRAME-0102 | 2025-01-01 08:33:54 | {"objects":[{"confidence":0.92,"type":"pedestrian"},{"confidence":0.35,"type":"bike"}]} +``` + ### Boolean AND ```sql SELECT doc_id, captured_at @@ -44,6 +59,14 @@ WHERE QUERY('meta_json.scene.weather_code:rain ORDER BY captured_at; ``` +Sample output: + +``` +doc_id | captured_at +FRAME-0101 | 2025-01-01 08:15:21 +FRAME-0102 | 2025-01-01 08:33:54 +``` + ### Boolean OR / List ```sql SELECT doc_id, @@ -54,6 +77,16 @@ ORDER BY captured_at DESC LIMIT 10; ``` +Sample output: + +``` +doc_id | labels +FRAME-0301 | ["hard_brake","highway"] +FRAME-0201 | ["lane_merge","urban"] +FRAME-0102 | ["pedestrian","swerve","crosswalk"] +FRAME-0101 | ["hard_brake","rain","downtown_loop"] +``` + ### Numeric Ranges ```sql SELECT doc_id, @@ -64,6 +97,16 @@ ORDER BY speed DESC LIMIT 10; ``` +Sample output: + +``` +doc_id | speed +FRAME-0301 | 52.6 +FRAME-0201 | 48.1 +FRAME-0401 | 38.9 +FRAME-0101 | 32.4 +``` + ### Boosting ```sql SELECT doc_id, @@ -74,4 
+117,12 @@ ORDER BY relevance DESC LIMIT 8; ``` +Sample output: + +``` +doc_id | relevance +FRAME-0101 | 7.0161 +FRAME-0102 | 3.6252 +``` + `QUERY()` follows Elasticsearch semantics (boolean logic, ranges, boosts, lists). `SCORE()` exposes the Elasticsearch relevance so you can re-rank results inside SQL. See [Search functions](/sql/sql-functions/search-functions) for the full operator list. diff --git a/docs/en/guides/54-query/02-vector-db.md b/docs/en/guides/54-query/02-vector-db.md index 6bb174830c..bf1c0c80d1 100644 --- a/docs/en/guides/54-query/02-vector-db.md +++ b/docs/en/guides/54-query/02-vector-db.md @@ -2,7 +2,7 @@ title: Vector Search --- -> **Scenario:** CityDrive keeps per-frame embeddings in Databend so semantic similarity search (“find frames that look like this”) runs alongside traditional SQL analytics—no extra vector service required. +> **Scenario:** CityDrive keeps embeddings for every frame directly in Databend. These vector embeddings are the result of AI models inferencing on video keyframes to capture visual semantic features. Semantic similarity search ("find frames that look like this") can run alongside traditional SQL analytics—no separate vector service required. The `frame_embeddings` table shares the same `frame_id` keys as `frame_events`, `frame_metadata_catalog`, and `frame_geo_points`, which keeps semantic search and classic SQL glued together. 
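Because every modality shares the `frame_id` key, a quick coverage check is possible before running similarity queries. This is a sketch, not part of the original guide: it assumes the `frame_events` and `frame_metadata_catalog` tables from the SQL Analytics and JSON & Search guides are already loaded.

```sql
-- Sketch: which event frames also have a JSON metadata document?
SELECT e.frame_id,
       m.doc_id IS NOT NULL AS has_metadata
FROM frame_events AS e
LEFT JOIN frame_metadata_catalog AS m
  ON m.doc_id = e.frame_id
ORDER BY e.frame_id;
```

With the seed data, only the deliberate orphan `FRAME-0501` should report `has_metadata = false`.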
@@ -16,15 +16,67 @@ CREATE OR REPLACE TABLE frame_embeddings ( sensor_view STRING, embedding VECTOR(512), encoder_build STRING, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + created_at TIMESTAMP, VECTOR INDEX idx_frame_embeddings(embedding) distance='cosine' ); -INSERT INTO frame_embeddings VALUES - ('FRAME-0101', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), - ('FRAME-0102', 'VID-20250101-001', 'roof_cam', RANDOM_VECTOR(512), 'clip-lite-v1', DEFAULT), - ('FRAME-0201', 'VID-20250101-002', 'front_cam',RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT), - ('FRAME-0401', 'VID-20250103-001', 'rear_cam', RANDOM_VECTOR(512), 'night-fusion-v2', DEFAULT); +-- SQL UDF: build 512 dims via ARRAY_AGG + window frame; tutorial placeholder only. +CREATE OR REPLACE FUNCTION demo_random_vector(seed STRING) +RETURNS TABLE(embedding VECTOR(512)) +AS $$ +SELECT CAST( + ARRAY_AGG(rand_val) OVER ( + PARTITION BY seed + ORDER BY seq + ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING + ) + AS VECTOR(512) + ) AS embedding +FROM ( + SELECT seed, + dims.number AS seq, + (RAND() * 0.2 - 0.1)::FLOAT AS rand_val + FROM numbers(512) AS dims +) vals +QUALIFY ROW_NUMBER() OVER (PARTITION BY seed ORDER BY seq) = 1; +$$; + +INSERT INTO frame_embeddings (frame_id, video_id, sensor_view, embedding, encoder_build, created_at) +SELECT 'FRAME-0101', 'VID-20250101-001', 'roof_cam', embedding, 'clip-lite-v1', '2025-01-01 08:15:21' +FROM demo_random_vector('FRAME-0101') +UNION ALL +SELECT 'FRAME-0102', 'VID-20250101-001', 'roof_cam', embedding, 'clip-lite-v1', '2025-01-01 08:33:54' +FROM demo_random_vector('FRAME-0102') +UNION ALL +SELECT 'FRAME-0201', 'VID-20250101-002', 'front_cam', embedding, 'night-fusion-v2', '2025-01-01 11:12:02' +FROM demo_random_vector('FRAME-0201') +UNION ALL +SELECT 'FRAME-0401', 'VID-20250103-001', 'rear_cam', embedding, 'night-fusion-v2', '2025-01-03 21:18:07' +FROM demo_random_vector('FRAME-0401'); +``` + +> This array generator is 
just to keep the tutorial self-contained. Replace it with real embeddings from your model in production. + +If you haven’t run the SQL Analytics guide yet, create the supporting `frame_events` table and seed the same sample rows the vector walkthrough joins against: + +```sql +CREATE OR REPLACE TABLE frame_events ( + frame_id STRING, + video_id STRING, + frame_index INT, + collected_at TIMESTAMP, + event_tag STRING, + risk_score DOUBLE, + speed_kmh DOUBLE +); + +INSERT INTO frame_events VALUES + ('FRAME-0101', 'VID-20250101-001', 125, '2025-01-01 08:15:21', 'hard_brake', 0.81, 32.4), + ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8), + ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1), + ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6), + ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9), + ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0); ``` Docs: [Vector type](/sql/sql-reference/data-types/vector) and [Vector index](/sql/sql-reference/data-types/vector#vector-indexing). @@ -49,6 +101,15 @@ ORDER BY distance LIMIT 3; ``` +Sample output: + +``` +frame_id | video_id | distance +FRAME-0101| VID-20250101-001 | 0.0000 +FRAME-0201| VID-20250101-002 | 0.9801 +FRAME-0102| VID-20250101-001 | 0.9842 +``` + Lower distance = more similar. The `VECTOR INDEX` keeps latency low even with millions of frames. Add traditional predicates (route, video, sensor view) before or after the vector comparison to narrow the candidate set. @@ -69,6 +130,13 @@ ORDER BY distance LIMIT 5; ``` +Sample output: + +``` +frame_id | sensor_view | distance +FRAME-0401| rear_cam | 1.0537 +``` + The optimizer still uses the vector index while honoring the `sensor_view` filter. 
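One way to confirm that the filtered query still hits the index is to inspect the plan. The statement below is a sketch; the exact plan output varies by Databend version, and the scalar subquery stands in for a real query embedding.

```sql
-- Sketch: inspect the plan of the filtered KNN query.
EXPLAIN
SELECT frame_id,
       sensor_view,
       COSINE_DISTANCE(
           embedding,
           (SELECT embedding FROM frame_embeddings WHERE frame_id = 'FRAME-0401')
       ) AS distance
FROM frame_embeddings
WHERE sensor_view = 'rear_cam'
ORDER BY distance
LIMIT 5;
```

Look for the vector index name (`idx_frame_embeddings`) in the plan to verify the pushdown.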
--- @@ -101,4 +169,14 @@ LEFT JOIN frame_events fe USING (frame_id) ORDER BY sf.distance; ``` +Sample output: + +``` +frame_id | video_id | event_tag | risk_score | distance +FRAME-0102| VID-20250101-001 | pedestrian | 0.67 | 0.0000 +FRAME-0201| VID-20250101-002 | lane_merge | 0.74 | 0.9802 +FRAME-0101| VID-20250101-001 | hard_brake | 0.81 | 0.9842 +FRAME-0401| VID-20250103-001 | night_lowlight | 0.63 | 1.0020 +``` + Because the embeddings live next to relational tables, you can pivot from “frames that look alike” to “frames that also had `hard_brake` tags, specific weather, or JSON detections” without exporting data to another service. diff --git a/docs/en/guides/54-query/03-geo-analytics.md b/docs/en/guides/54-query/03-geo-analytics.md index 334149c22c..9e2c8b3737 100644 --- a/docs/en/guides/54-query/03-geo-analytics.md +++ b/docs/en/guides/54-query/03-geo-analytics.md @@ -2,7 +2,7 @@ title: Geo Analytics --- -> **Scenario:** CityDrive records precise GPS fixes and traffic-signal distances for each flagged frame so operations teams can answer “where did this happen?” entirely in SQL. +> **Scenario:** CityDrive records precise GPS positioning and distance-to-signal for every flagged frame. This geospatial data originates from the dash-cam's GPS module and is precisely aligned with the timestamps of video keyframes. Ops teams can answer "where did this happen?" purely in SQL. `frame_geo_points` and `signal_contact_points` share the same `video_id`/`frame_id` keys as the rest of the guide, so you can move from SQL metrics to maps without copying data. 
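The walkthrough measures distances with `ST_DISTANCE` on transformed geometries. For a rough great-circle estimate you can also use `HAVERSINE` directly on latitude/longitude pairs; this sketch assumes the `(lat1, lon1, lat2, lon2)` argument order and kilometer output documented for Databend, so verify against your version before relying on it.

```sql
-- Sketch: approximate great-circle distance between the first two sample fixes (km).
SELECT HAVERSINE(22.5431, 114.0579, 22.5460, 114.0610) AS approx_km;
```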
@@ -20,11 +20,11 @@ CREATE OR REPLACE TABLE frame_geo_points ( ); INSERT INTO frame_geo_points VALUES - ('VID-20250101-001','FRAME-0101',TO_GEOMETRY('SRID=4326;POINT(114.0579 22.5431)'),104,'fusion_gnss','2025-01-01 08:15:21'), - ('VID-20250101-001','FRAME-0102',TO_GEOMETRY('SRID=4326;POINT(114.0610 22.5460)'),104,'fusion_gnss','2025-01-01 08:33:54'), - ('VID-20250101-002','FRAME-0201',TO_GEOMETRY('SRID=4326;POINT(114.1040 22.5594)'),104,'fusion_gnss','2025-01-01 11:12:02'), - ('VID-20250102-001','FRAME-0301',TO_GEOMETRY('SRID=4326;POINT(114.0822 22.5368)'),104,'fusion_gnss','2025-01-02 09:44:18'), - ('VID-20250103-001','FRAME-0401',TO_GEOMETRY('SRID=4326;POINT(114.1195 22.5443)'),104,'fusion_gnss','2025-01-03 21:18:07'); + ('VID-20250101-001','FRAME-0101',TO_GEOMETRY('SRID=4326;POINT(114.0579123456789 22.543123456789)'),104,'fusion_gnss','2025-01-01 08:15:21'), + ('VID-20250101-001','FRAME-0102',TO_GEOMETRY('SRID=4326;POINT(114.0610987654321 22.546098765432)'),104,'fusion_gnss','2025-01-01 08:33:54'), + ('VID-20250101-002','FRAME-0201',TO_GEOMETRY('SRID=4326;POINT(114.104012345678 22.559456789012)'),104,'fusion_gnss','2025-01-01 11:12:02'), + ('VID-20250102-001','FRAME-0301',TO_GEOMETRY('SRID=4326;POINT(114.082265432109 22.53687654321)'),104,'fusion_gnss','2025-01-02 09:44:18'), + ('VID-20250103-001','FRAME-0401',TO_GEOMETRY('SRID=4326;POINT(114.119501234567 22.544365432101)'),104,'fusion_gnss','2025-01-03 21:18:07'); CREATE OR REPLACE TABLE signal_contact_points ( node_id STRING, @@ -35,6 +35,43 @@ CREATE OR REPLACE TABLE signal_contact_points ( distance_m DOUBLE, created_at TIMESTAMP ); + +INSERT INTO signal_contact_points VALUES + ('SIG-0001', TO_GEOMETRY('SRID=4326;POINT(114.058500123456 22.543800654321)'), 'VID-20250101-001', 'FRAME-0101', TO_GEOMETRY('SRID=4326;POINT(114.0579123456789 22.543123456789)'), 0.012345, '2025-01-01 08:15:30'), + ('SIG-0002', TO_GEOMETRY('SRID=4326;POINT(114.118900987654 22.544800123456)'), 'VID-20250103-001', 'FRAME-0401', 
TO_GEOMETRY('SRID=4326;POINT(114.119501234567 22.544365432101)'), 0.008765, '2025-01-03 21:18:20'); + +-- Frames and JSON tables these queries join against (same rows as SQL & Search guides). +CREATE OR REPLACE TABLE frame_events ( + frame_id STRING, + video_id STRING, + frame_index INT, + collected_at TIMESTAMP, + event_tag STRING, + risk_score DOUBLE, + speed_kmh DOUBLE +); + +INSERT INTO frame_events VALUES + ('FRAME-0101', 'VID-20250101-001', 125, '2025-01-01 08:15:21', 'hard_brake', 0.81, 32.4), + ('FRAME-0102', 'VID-20250101-001', 416, '2025-01-01 08:33:54', 'pedestrian', 0.67, 24.8), + ('FRAME-0201', 'VID-20250101-002', 298, '2025-01-01 11:12:02', 'lane_merge', 0.74, 48.1), + ('FRAME-0301', 'VID-20250102-001', 188, '2025-01-02 09:44:18', 'hard_brake', 0.59, 52.6), + ('FRAME-0401', 'VID-20250103-001', 522, '2025-01-03 21:18:07', 'night_lowlight', 0.63, 38.9), + ('FRAME-0501', 'VID-MISSING-001', 10, '2025-01-04 10:00:00', 'sensor_fault', 0.25, 15.0); + +CREATE OR REPLACE TABLE frame_metadata_catalog ( + doc_id STRING, + meta_json VARIANT, + captured_at TIMESTAMP, + INVERTED INDEX idx_meta_json (meta_json) +); + +INSERT INTO frame_metadata_catalog VALUES + ('FRAME-0101', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":32.4},"detections":{"objects":[{"type":"vehicle","confidence":0.88},{"type":"brake_light","confidence":0.64}]},"media_meta":{"tagging":{"labels":["hard_brake","rain","downtown_loop"]}}}'), '2025-01-01 08:15:21'), + ('FRAME-0102', PARSE_JSON('{"scene":{"weather_code":"rain","lighting":"day"},"camera":{"sensor_view":"roof"},"vehicle":{"speed_kmh":24.8},"detections":{"objects":[{"type":"pedestrian","confidence":0.92},{"type":"bike","confidence":0.35}]},"media_meta":{"tagging":{"labels":["pedestrian","swerve","crosswalk"]}}}'), '2025-01-01 08:33:54'), + ('FRAME-0201', 
PARSE_JSON('{"scene":{"weather_code":"overcast","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":48.1},"detections":{"objects":[{"type":"lane_merge","confidence":0.74},{"type":"vehicle","confidence":0.41}]},"media_meta":{"tagging":{"labels":["lane_merge","urban"]}}}'), '2025-01-01 11:12:02'), + ('FRAME-0301', PARSE_JSON('{"scene":{"weather_code":"clear","lighting":"day"},"camera":{"sensor_view":"front"},"vehicle":{"speed_kmh":52.6},"detections":{"objects":[{"type":"vehicle","confidence":0.82},{"type":"hard_brake","confidence":0.59}]},"media_meta":{"tagging":{"labels":["hard_brake","highway"]}}}'), '2025-01-02 09:44:18'), + ('FRAME-0401', PARSE_JSON('{"scene":{"weather_code":"lightfog","lighting":"night"},"camera":{"sensor_view":"rear"},"vehicle":{"speed_kmh":38.9},"detections":{"objects":[{"type":"traffic_light","confidence":0.78},{"type":"vehicle","confidence":0.36}]},"media_meta":{"tagging":{"labels":["night_lowlight","traffic_light"]}}}'), '2025-01-03 21:18:07'); ``` Docs: [Geospatial types](/sql/sql-reference/data-types/geospatial). @@ -61,6 +98,14 @@ WHERE ST_DISTANCE( ORDER BY meters_from_hq; ``` +Sample output: + +``` +frame_id | video_id | event_tag | meters_from_hq +FRAME-0102| VID-20250101-001 | pedestrian | 180.277138577 +FRAME-0101| VID-20250101-001 | hard_brake | 324.291965923 +``` + Tip: add `ST_ASTEXT(l.geom)` while debugging or switch to [`HAVERSINE`](/sql/sql-functions/geospatial-functions#trigonometric-distance-functions) for great-circle math. ```sql @@ -82,6 +127,14 @@ CROSS JOIN school_zone WHERE ST_CONTAINS(poly, l.position_wgs84); ``` +Sample output: + +``` +frame_id | video_id | event_tag +FRAME-0101| VID-20250101-001 | hard_brake +FRAME-0102| VID-20250101-001 | pedestrian +``` + --- ## 3. 
Hex Aggregations @@ -97,6 +150,17 @@ GROUP BY h3_cell ORDER BY avg_risk DESC; ``` +Sample output: + +``` +h3_cell | frame_count | avg_risk +613635011200942079| 1 | 0.81 +613635011532292095| 1 | 0.74 +613635011238690815| 1 | 0.67 +613635015391051775| 1 | 0.63 +613635011309993983| 1 | 0.59 +``` + Docs: [H3 functions](/sql/sql-functions/geospatial-functions#h3-indexing--conversion). --- @@ -117,14 +181,22 @@ WHERE t.distance_m < 0.03 -- roughly < 30 meters depending on SRID ORDER BY t.distance_m; ``` +Sample output: + +``` +node_id | video_id | frame_id | recomputed_distance | stored_distance | source_system +SIG-0002| VID-20250103-001 | FRAME-0401| 0.000741116 | 0.008765 | fusion_gnss +SIG-0001| VID-20250101-001 | FRAME-0101| 0.000896705 | 0.012345 | fusion_gnss +``` + ```sql WITH near_junction AS ( SELECT frame_id FROM frame_geo_points WHERE ST_DISTANCE( ST_TRANSFORM(position_wgs84, 3857), - ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0700 22.5400)'), 3857) - ) <= 150 + ST_TRANSFORM(TO_GEOMETRY('SRID=4326;POINT(114.0830 22.5370)'), 3857) + ) <= 200 ) SELECT f.frame_id, f.event_tag, @@ -136,6 +208,13 @@ JOIN frame_metadata_catalog AS meta WHERE QUERY('meta_json.media_meta.tagging.labels:hard_brake'); ``` +Sample output: + +``` +frame_id | event_tag | labels +FRAME-0301| hard_brake | ["hard_brake","highway"] +``` + This pattern lets you filter by geography first, then apply JSON search to the surviving frames. --- @@ -153,4 +232,15 @@ JOIN frame_events AS f USING (frame_id) GROUP BY h3_cell; ``` +Sample output: + +``` +h3_cell | frames | avg_risk +609131411584057343| 1 | 0.81 +609131411919601663| 1 | 0.74 +609131411617611775| 1 | 0.67 +609131415778361343| 1 | 0.63 +609131411684720639| 1 | 0.59 +``` + Databend now serves vector, text, and spatial queries off the exact same `video_id`, so investigation teams never have to reconcile separate pipelines. 
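Map layers in BI and GIS tools often expect GeoJSON rather than WKT. As a sketch (assuming `ST_ASGEOJSON` is available in your Databend version), the flagged positions can be serialized straight out of SQL:

```sql
-- Sketch: emit each flagged frame's position as GeoJSON for a map layer.
SELECT f.frame_id,
       f.event_tag,
       ST_ASGEOJSON(g.position_wgs84) AS geojson
FROM frame_geo_points AS g
JOIN frame_events AS f USING (frame_id)
ORDER BY f.collected_at;
```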
diff --git a/docs/en/guides/54-query/04-lakehouse-etl.md b/docs/en/guides/54-query/04-lakehouse-etl.md index f8f5892596..5e07a85037 100644 --- a/docs/en/guides/54-query/04-lakehouse-etl.md +++ b/docs/en/guides/54-query/04-lakehouse-etl.md @@ -2,7 +2,7 @@ title: Lakehouse ETL --- -> **Scenario:** CityDrive’s data engineering team exports each dash-cam batch as Parquet (videos, frame events, metadata JSON, embeddings, GPS tracks, traffic-signal distances) and wants one COPY pipeline to refresh the shared tables in Databend. +> **Scenario:** CityDrive's data engineering team exports every batch of dash-cam data as Parquet (videos, frame events, metadata JSON, embeddings, GPS traces, traffic light distances). These Parquet files aggregate all multimodal signals extracted from the raw video streams, forming the foundation of the warehouse. They want to refresh Databend's shared tables via a single COPY pipeline. The loading loop is straightforward: diff --git a/docs/en/guides/54-query/index.md b/docs/en/guides/54-query/index.md index c53511ce66..72ae961e78 100644 --- a/docs/en/guides/54-query/index.md +++ b/docs/en/guides/54-query/index.md @@ -2,7 +2,9 @@ title: Unified Workloads --- -CityDrive Intelligence records every dash-cam drive, splits it into frames, and stores multiple signals per `video_id`: relational metadata, JSON manifests, behaviour tags, embeddings, and GPS traces. This guide set shows how Databend keeps all those workloads in one warehouse—no copy jobs, no extra search cluster. +CityDrive Intelligence records video of every drive. Background processing tools split the video stream into keyframe images, extracting rich multimodal information from each image and storing it by `video_id`. These signals include relational metadata, JSON manifests, behavior tags, vector embeddings, and GPS traces. + +This guide set shows how Databend keeps all those workloads in one warehouse—no copy jobs, no extra search cluster. 
| Guide | What it covers | |-------|----------------|