Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 4 additions & 6 deletions docs/content.zh/docs/dev/table/concepts/overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ under the License.
# 流式概念

Flink 的 [Table API]({{< ref "docs/dev/table/tableApi" >}}) 和 [SQL]({{< ref "docs/dev/table/sql/overview" >}}) 是流批统一的 API。
这意味着 Table API & SQL 在无论有限的批式输入还是无限的流式输入下,都具有相同的语义
这意味着,无论输入数据是有界的批处理输入还是无界的流处理输入,Table API SQL 查询都具有相同的语义
因为传统的关系代数以及 SQL 最开始都是为了批式处理而设计的,
关系型查询在流式场景下不如在批式场景下容易懂。

Expand Down Expand Up @@ -58,13 +58,11 @@ Flink 的 [Table API]({{< ref "docs/dev/table/tableApi" >}}) 和 [SQL]({{< ref "

#### 状态算子

包含诸如[连接]({{< ref "docs/dev/table/sql/queries/joins" >}})、[聚合]({{< ref "docs/dev/table/sql/queries/group-agg" >}})或[去重]({{< ref "docs/dev/table/sql/queries/deduplication" >}}) 等操作的语句需要在 Flink 抽象的容错存储内保存中间结果
查询中若包含诸如[连接]({{< ref "docs/dev/table/sql/queries/joins" >}})、[聚合]({{< ref "docs/dev/table/sql/queries/group-agg" >}})或[去重]({{< ref "docs/dev/table/sql/queries/deduplication" >}})等有状态操作,就需要将中间结果存储在具备容错能力的存储系统中 —— 而 Flink 的状态抽象机制,正是用于实现这一需求的核心组件

例如对两个表进行 join 操作的普通 SQL 需要算子保存两个表的全部输入。基于正确的 SQL 语义,运行时假设两表会在任意时间点进行匹配。
Flink 提供了 [优化窗口和时段 Join 聚合]({{< ref "docs/dev/table/sql/queries/joins" >}})
以利用 [watermarks]({{< ref "docs/dev/table/concepts/time_attributes" >}}) 概念来让保持较小的状态规模。
例如,对两张表执行常规 SQL 连接(join)时,算子需要把两侧的输入表完整地保存在状态中。为了保证 SQL 语义的正确性,运行时环境要假定 “两侧数据在任何时间点进行匹配”。而 Flink 提供了 [优化窗口和时段 Join 聚合]({{< ref "docs/dev/table/sql/queries/joins" >}}) —— 它们借助水印 [watermarks]({{< ref "docs/dev/table/concepts/time_attributes" >}})机制,能有效控制状态数据的规模。

另一个计算词频的例子如下
再举一个例子,比如下面这个计算单词计数(word count)的查询。

```sql
CREATE TABLE doc (
Expand Down
50 changes: 16 additions & 34 deletions docs/content.zh/docs/dev/table/data_stream_api.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,59 +22,41 @@ specific language governing permissions and limitations
under the License.
-->

# DataStream API Integration
# DataStream API 集成

Both Table API and DataStream API are equally important when it comes to defining a data
processing pipeline.
在定义数据处理管道时,Table API 和 DataStream API 有着同等重要的地位。

The DataStream API offers the primitives of stream processing (namely time, state, and dataflow
management) in a relatively low-level imperative programming API. The Table API abstracts away many
internals and provides a structured and declarative API.
DataStream API 以相对底层的命令式编程接口的方式,提供流处理的基础构件(即时间、状态和数据流管理 )。而 Table API 则对诸多内部细节进行了抽象封装,提供结构化且声明式的编程接口。

Both APIs can work with bounded *and* unbounded streams.
两种 API 均支持处理有界流(bounded)与无界流(unbounded)。

Bounded streams need to be managed when processing historical data. Unbounded streams occur
in real-time processing scenarios that might be initialized with historical data first.
有界流在处理历史数据(如离线业务日志、归档交易记录等)场景中,需对这类数据边界明确的流进行管理。
无界流常见于实时处理场景,且此类场景可能会先基于历史数据完成初始化。

For efficient execution, both APIs offer processing bounded streams in an optimized batch execution
mode. However, since batch is just a special case of streaming, it is also possible to run pipelines
of bounded streams in regular streaming execution mode.
两种 API 均对批模式处理有界流的行为实现了高性能的优化。但由于批处理本质上是流处理的特殊形式(即有界流处理),因此也可将有界流的数据处理管道运行在常规的流执行模式中,灵活适配不同业务的执行需求。

Pipelines in one API can be defined end-to-end without dependencies on the other API. However, it
might be useful to mix both APIs for various reasons:
使用其中一种 API 可独立定义端到端的数据处理管道,无需依赖另一 API。但在以下场景中,混合使用两种 API 可能会很有用:

- Use the table ecosystem for accessing catalogs or connecting to external systems easily, before
implementing the main pipeline in DataStream API.
- Access some of the SQL functions for stateless data normalization and cleansing, before
implementing the main pipeline in DataStream API.
- Switch to DataStream API every now and then if a more low-level operation (e.g. custom timer
handling) is not present in Table API.
- 基于 DataStream API 实现主处理管道前,借助 Table API 的生态体系,可便捷访问Catalog或连接外部系统。
- 基于DataStream API 实现主管道逻辑前,利用 Table API 中的 SQL 函数,对无状态的数据进行规范化和清洗。
- 若 Table API 未提供某类底层操作(如自定义定时器处理),可适时切换使用 DataStream API 完成此类需求。

Flink provides special bridging functionalities to make the integration with DataStream API as smooth
as possible.
Flink 提供了专门的桥接功能,旨在尽可能简化 Table API 与 DataStream API 的集成过程。

{{< hint info >}}
Switching between DataStream and Table API adds some conversion overhead. For example, internal data
structures of the table runtime (i.e. `RowData`) that partially work on binary data need to be converted
to more user-friendly data structures (i.e. `Row`). Usually, this overhead can be neglected but is
mentioned here for completeness.
在 DataStream API 与 Table API 之间切换会增加一些转换开销。例如,Table 运行时的内部数据结构(即 `RowData`,部分基于二进制数据工作)需转换为更便于用户使用的数据结构(即 `Row`)。通常情况下,该开销可忽略不计,但为保证内容完整性,此处仍予以说明。
{{< /hint >}}

{{< top >}}

<a name="converting-between-datastream-and-table"></a>

Converting between DataStream and Table
DataStream Table API 间的转换
---------------------------------------

Flink provides a specialized `StreamTableEnvironment` for integrating with the
DataStream API. Those environments extend the regular `TableEnvironment` with additional methods
and take the `StreamExecutionEnvironment` used in the DataStream API as a parameter.
Flink 提供了专门的 `StreamTableEnvironment`,用于实现与 DataStream API 的集成。这类环境在常规 `TableEnvironment` 的基础上扩展了额外方法,并将 DataStream API 中使用的 `StreamExecutionEnvironment` 作为参数传入。

The following code shows an example of how to go back and forth between the two APIs. Column names
and types of the `Table` are automatically derived from the `TypeInformation` of the `DataStream`.
Since the DataStream API does not support changelog processing natively, the code assumes
append-only/insert-only semantics during the stream-to-table and table-to-stream conversion.
以下代码展示了两种 API 间双向转换的示例。Table 的列名与数据类型会从 `DataStream` 的 `TypeInformation` 中自动推导。由于 DataStream API 本身不原生支持变更日志(changelog)处理,因此在流表转换(流转表、表转流)过程中,该代码默认采用仅追加(append-only)/ 仅插入(insert-only)语义。

{{< tabs "6ec84aa4-d91d-4c47-9fa2-b1aae1e3cdb5" >}}
{{< tab "Java" >}}
Expand Down