|
1 | 1 | # Dataflow Specification |
2 | 2 |
|
3 | | -Dataflows are specified through a YAML file. This section presents our current draft for the file format. It only includes basic functionality for now, we will extend it later when we introduce more advanced features. |
4 | 3 |
|
5 | | -## Dataflow |
| 4 | +Dora dataflows are specified through a YAML file. |
| 5 | +This dataflow configuration file specifies the nodes of the dataflow and their inputs and outputs. |
| 6 | +It also allows configuring communication parameters and enabling debug options. |
6 | 7 |
|
7 | | -Dataflows are specified through the following format: |
| 8 | +This article provides an introduction to the dataflow file format and its most important fields. |
| 9 | +For a complete reference of all available fields and their behavior, check out the documentation of the |
| 10 | +[`Descriptor`](https://docs.rs/dora-core/0.3.12/dora_core/descriptor/struct.Descriptor.html) and |
| 11 | +[`Node`](https://docs.rs/dora-core/0.3.12/dora_core/descriptor/struct.Node.html) structs. |
| 12 | + |
| 13 | +## Defining Nodes |
| 14 | + |
| 15 | +The most important field in a dataflow configuration file is the `nodes` field, which lists the |
| 16 | +nodes of the dataflow. |
| 17 | +Each node is identified by a unique `id`: |
8 | 18 |
|
9 | 19 | ```yaml |
10 | 20 | nodes: |
11 | 21 | - id: foo |
| 22 | + path: path/to/the/executable |
12 | 23 | # ... (see below) |
13 | 24 | - id: bar |
| 25 | + path: path/to/another/executable |
14 | 26 | # ... (see below) |
15 | 27 | ``` |
16 | 28 |
|
| 29 | +For each node, you need to specify the `path` of the executable or script that Dora should run when starting the node. |
| 30 | +Most of the other node fields are optional, but you typically want to specify at least some `inputs` and/or `outputs`. |
| 31 | + |
17 | 32 | ### Inputs and Outputs |
18 | 33 |
|
19 | | -Each operator or custom node has a separate namespace for its outputs. To refer to outputs, the \<operator\>/\<output\> syntax is used. This way, there are no name conflicts between operators. |
| 34 | +Nodes can send output messages that can be received by other nodes as input. |
| 35 | +All inputs and outputs need to be specified in the dataflow configuration file. |
| 36 | + |
| 37 | +For each node, list all output IDs that it sends under the |
| 38 | +[`outputs`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.outputs) |
| 39 | +key. |
| 40 | +Only the listed output IDs can be passed to output-sending functions such as |
| 41 | +[`send_output`](https://docs.rs/dora-node-api/latest/dora_node_api/struct.DoraNode.html#method.send_output). |
| 42 | + |
| 43 | +Receiving nodes can subscribe to outputs by listing them in their |
| 44 | +[`inputs`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.inputs) |
| 45 | +field. |
| 46 | +The `inputs` field should be a key-value map of the following format: |
| 47 | +`input_id: source_node_id/source_node_output_id` |
| 48 | + |
| 49 | +The components are defined as follows: |
| 50 | + - `input_id` is the local identifier that should be used for this input. |
| 51 | + This will map to the `id` field of |
| 52 | + [`Event::Input`](https://docs.rs/dora-node-api/latest/dora_node_api/enum.Event.html#variant.Input) |
| 53 | + events sent to the node event loop. |
| 54 | + - `source_node_id` should be the `id` field of the node that sends the output that we want |
| 55 | + to subscribe to. |
| 56 | + - `source_node_output_id` should be the identifier of the output that we want |
| 57 | + to subscribe to. |
| 58 | + |
| 59 | +#### Input/Output Example |
| 60 | + |
| 61 | +```yaml |
| 62 | +nodes: |
| 63 | + - id: example-node |
| 64 | + outputs: |
| 65 | + - one |
| 66 | + - two |
| 67 | + - id: receiver |
| 68 | + inputs: |
| 69 | + my_input: example-node/two |
| 70 | +``` |
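The mapping in the example above can be checked mechanically. The following Python sketch is not part of Dora: `parse_input_source` and `check_inputs` are hypothetical helpers, and the node list is written as plain dictionaries rather than parsed from a YAML file. It assumes only the plain `source_node_id/source_node_output_id` form described in this section and reports every input that refers to an output no node declares.

```python
def parse_input_source(value: str) -> tuple[str, str]:
    """Split 'source_node_id/source_node_output_id' into its two parts.

    Assumption: only the two-part form from this section is handled;
    any other source forms Dora may accept are out of scope here.
    """
    node_id, sep, output_id = value.partition("/")
    if not sep or not node_id or not output_id:
        raise ValueError(f"expected 'node_id/output_id', got {value!r}")
    return node_id, output_id


def check_inputs(nodes: list[dict]) -> list[str]:
    """Return a description of every input that has no matching output."""
    # Collect the declared output IDs of each node, keyed by node id.
    declared = {node["id"]: set(node.get("outputs", [])) for node in nodes}
    problems = []
    for node in nodes:
        for input_id, source in node.get("inputs", {}).items():
            try:
                src_node, src_output = parse_input_source(source)
            except ValueError as err:
                problems.append(f"{node['id']}.{input_id}: {err}")
                continue
            if src_output not in declared.get(src_node, set()):
                problems.append(
                    f"{node['id']}.{input_id}: {src_node!r} does not "
                    f"declare output {src_output!r}"
                )
    return problems


# The same two nodes as in the YAML example, written as dictionaries.
nodes = [
    {"id": "example-node", "outputs": ["one", "two"]},
    {"id": "receiver", "inputs": {"my_input": "example-node/two"}},
]
print(check_inputs(nodes))
```

For the example dataflow the list of problems is empty. Changing the source to `example-node/three` would produce one entry, because `three` is not listed under the `outputs` of `example-node`.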
| 71 | +
|
| 72 | +
|
| 73 | +### Fields Controlling Node Execution |
| 74 | +
|
| 75 | +Use the following fields to define how a node is executed, including command-line arguments and environment |
| 76 | +variables. |
| 77 | +
|
| 78 | +#### `path` (required) |
| 79 | + |
| 80 | +Specifies the path of the executable or script that Dora should run when starting the dataflow. |
| 81 | +This can point to a normal executable (e.g. when using a compiled language such as Rust) or a Python script. |
| 82 | + |
| 83 | +```yaml |
| 84 | +nodes: |
| 85 | + - id: rust-example |
| 86 | + path: target/release/rust-node |
| 87 | + - id: python-example |
| 88 | + path: ./receive_data.py |
| 89 | +``` |
| 90 | + |
| 91 | +See the |
| 92 | +[`path` field documentation](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.path) |
| 93 | +for details. |
| 94 | + |
| 95 | +#### `args` and `env` |
| 96 | + |
| 97 | +Use the |
| 98 | +[`args`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.args) |
| 99 | +field to specify command-line arguments that should be passed to the executable/script specified in |
| 100 | +`path`. |
| 101 | +Use the |
| 102 | +[`env`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.env) |
| 103 | +field for setting environment variables. |
| 104 | + |
| 105 | +```yaml |
| 106 | +nodes: |
| 107 | + - id: example |
| 108 | + path: example-node |
| 109 | + args: -v --some-flag foo |
| 110 | + env: |
| 111 | + IMAGE_WIDTH: 640 |
| 112 | + IMAGE_HEIGHT: 480 |
| 113 | +``` |
| 114 | + |
| 115 | +### Fields Controlling Node Build |
| 116 | + |
| 117 | +Use the build fields to define how a node is set up and built when running `dora build`. |
| 118 | +All build fields are optional. |
| 119 | + |
| 120 | +#### `build` |
20 | 121 |
|
21 | | -Input operands are specified using the \<name\>: \<operator\>/\<output\> syntax, where \<data\> is the internal name that should be used for the operand. The main advantage of this name mapping is that the same operator executable can be reused multiple times on different input. |
| 122 | +The |
| 123 | +[`build`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.build) |
| 124 | +field specifies the command that should be invoked for building the node. |
22 | 125 |
|
23 | | -## Nodes |
| 126 | +```yaml |
| 127 | +- id: build-example |
| 128 | + build: cargo build -p receive_data --release |
| 129 | + path: target/release/receive_data |
| 130 | +- id: multi-line-example |
| 131 | + build: | |
| 132 | + pip install flash-attn |
| 133 | + pip install -e ../../node-hub/dora-phi4 |
| 134 | + path: dora-phi4 |
| 135 | +``` |
| 136 | + |
| 137 | +**Special treatment of `pip`:** Build lines that start with `pip` or `pip3` are treated in a |
| 138 | +special way: |
| 139 | +If the `--uv` argument is passed to the `dora build` command, all `pip`/`pip3` commands are run |
| 140 | +through the [`uv` package manager](https://docs.astral.sh/uv/). |
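The rule above can be illustrated with a short Python sketch. It is not Dora's implementation: `rewrite_build_line` and `rewrite_build` are hypothetical helpers, and the sketch assumes that running a command "through `uv`" means replacing the leading `pip`/`pip3` with `uv pip`, which is uv's pip-compatible interface. How Dora actually invokes `uv` is not stated in this section.

```python
import shlex


def rewrite_build_line(line: str, use_uv: bool) -> list[str]:
    """Return the argument vector to run for one build line.

    Assumption: with use_uv set, a leading `pip` or `pip3` is replaced
    by `uv pip`; every other command is left untouched.
    """
    argv = shlex.split(line)
    if use_uv and argv and argv[0] in ("pip", "pip3"):
        return ["uv", "pip", *argv[1:]]
    return argv


def rewrite_build(build: str, use_uv: bool) -> list[list[str]]:
    """Apply rewrite_build_line to each non-empty line of a `build` value."""
    return [
        rewrite_build_line(line, use_uv)
        for line in build.splitlines()
        if line.strip()
    ]


# The multi-line build value from the example above.
build = "pip install flash-attn\npip install -e ../../node-hub/dora-phi4\n"
for argv in rewrite_build(build, use_uv=True):
    print(argv)
```

Lines that do not start with `pip` or `pip3`, such as `cargo build -p receive_data --release`, pass through unchanged in this sketch regardless of the flag.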
| 141 | + |
| 142 | +#### `git` |
| 143 | + |
| 144 | +The [`git`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.git) |
| 145 | +field allows downloading nodes from git repositories. |
| 146 | +This can be especially useful for distributed dataflows. |
24 | 147 |
|
25 | | -Nodes are defined using the following format: |
| 148 | +When a `git` key is specified, `dora build` automatically clones the specified repository (or reuses |
| 149 | +an existing clone). |
| 150 | +Then it checks out the specified [`branch`](#branch), [`tag`](#tag), or [`rev`](#rev), or the |
| 151 | +default branch if none of them are specified. |
| 152 | +Afterwards it runs the `build` command if specified. |
26 | 153 |
|
27 | 154 | ```yaml |
28 | 155 | nodes: |
29 | | - - id: some-unique-id |
30 | | - # For nodes with multiple operators |
31 | | - operators: |
32 | | - - id: operator-1 |
33 | | - # ... (see below) |
34 | | - - id: operator-2 |
35 | | - # ... (see below) |
36 | | - |
37 | | - - id: some-unique-id-2 |
38 | | - custom: |
39 | | - source: path/to/timestamp |
40 | | - env: |
41 | | - - ENVIRONMENT_VARIABLE_1: true |
42 | | - working-directory: some/path |
43 | | - |
44 | | - inputs: |
45 | | - input_1: operator_2/output_4 |
46 | | - input_2: custom_node_2/output_4 |
47 | | - outputs: |
48 | | - - output_1 |
49 | | - |
50 | | - # Unique operator |
51 | | - - id: some-unique-id-3 |
52 | | - operator: |
53 | | - # ... (see below) |
| 156 | + - id: rust-node |
| 157 | + git: https://github.com/dora-rs/dora.git |
| 158 | + branch: main |
| 159 | + build: cargo build -p rust-dataflow-example-node |
| 160 | + path: target/debug/rust-dataflow-example-node |
54 | 161 | ``` |
55 | 162 |
|
56 | | -Nodes specify the executable name and arguments like a normal shell operation through the `run` field. Through the optional `env` field, it is possible to set environment variables for the process. The optional `working-directory` field allows to overwrite the directory in which the program is started. |
57 | 163 |
|
58 | | -To integrate with the rest of the dora dataflow, custom nodes must specify their inputs and outputs, similar to operators. They can reference outputs of both operators, and other custom nodes. |
| 164 | +## Operators |
| 165 | + |
| 166 | +Operators are an experimental, lightweight alternative to nodes. |
| 167 | +Instead of running as a separate process, operators are linked into a runtime process. |
| 168 | +This allows multiple operators to share a single address space (not supported for Python currently). |
| 169 | + |
| 170 | +Operators are defined as part of the node list, as children of a runtime node. |
| 171 | +A runtime node is a special node that specifies no `path` field, but contains an |
| 172 | +[`operators`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.operators) |
| 173 | +field instead. |
| 174 | + |
| 175 | +## Other Dataflow Fields |
| 176 | + |
| 177 | +See the [`Descriptor`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Descriptor.html) |
| 178 | +struct for a full list of supported fields. |
| 179 | + |