Commit d1b0fde

Merge pull request #41 from dora-rs/dataflow-doc
Rewrite 'Dataflow Specification' chapter and explain most important fields
2 parents 7f15e62 + 883e815 commit d1b0fde

1 file changed

+155
-34
lines changed

docs/api/dataflow-config.md

Lines changed: 155 additions & 34 deletions
@@ -1,58 +1,179 @@
11
# Dataflow Specification
22

3-
Dataflows are specified through a YAML file. This section presents our current draft for the file format. It only includes basic functionality for now, we will extend it later when we introduce more advanced features.
43

5-
## Dataflow
4+
Dora dataflows are specified through a YAML file.
5+
This dataflow configuration file specifies the nodes of the dataflow and their inputs and outputs.
6+
It also allows configuring communication parameters and enabling debug options.
67

7-
Dataflows are specified through the following format:
8+
This article provides an introduction to the dataflow file format and its most important fields.
9+
For a complete reference of all available fields and their behavior, check out the documentation of the
10+
[`Descriptor`](https://docs.rs/dora-core/0.3.12/dora_core/descriptor/struct.Descriptor.html) and
11+
[`Node`](https://docs.rs/dora-core/0.3.12/dora_core/descriptor/struct.Node.html) structs.
12+
13+
## Defining Nodes
14+
15+
The most important field in a dataflow configuration file is the `nodes` field, which lists the
16+
nodes of the dataflow.
17+
Each node is identified by a unique `id`:
818

919
```yaml
1020
nodes:
1121
  - id: foo
22+
    path: path/to/the/executable
1223
    # ... (see below)
1324
  - id: bar
25+
    path: path/to/another/executable
1426
    # ... (see below)
1527
```
1628

29+
For each node, you need to specify the `path` of the executable or script that Dora should run when starting the node.
30+
Most of the other node fields are optional, but you typically want to specify at least some `inputs` and/or `outputs`.
31+
1732
### Inputs and Outputs
1833

19-
Each operator or custom node has a separate namespace for its outputs. To refer to outputs, the \<operator\>/\<output\> syntax is used. This way, there are no name conflicts between operators.
34+
Nodes can send output messages that can be received by other nodes as input.
35+
All inputs and outputs need to be specified in the dataflow configuration file.
36+
37+
For each node, list all output IDs that it sends under the
38+
[`outputs`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.outputs)
39+
key.
40+
Only the output IDs listed here can be passed to output-sending functions such as
41+
[`send_output`](https://docs.rs/dora-node-api/latest/dora_node_api/struct.DoraNode.html#method.send_output).
42+
43+
Receiving nodes can subscribe to outputs by listing them in their
44+
[`inputs`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.inputs)
45+
field.
46+
The `inputs` field should be a key-value map of the following format:
47+
`input_id: source_node_id/source_node_output_id`
48+
49+
The components are defined as follows:
50+
- `input_id` is the local identifier that should be used for this input.
51+
This will map to the `id` field of
52+
[`Event::Input`](https://docs.rs/dora-node-api/latest/dora_node_api/enum.Event.html#variant.Input)
53+
events sent to the node event loop.
54+
- `source_node_id` should be the `id` field of the node that sends the output that we want
55+
to subscribe to
56+
- `source_node_output_id` should be the identifier of the output that we want
57+
to subscribe to
58+
59+
#### Input/Output Example
60+
61+
```yaml
62+
nodes:
63+
  - id: example-node
64+
    outputs:
65+
      - one
66+
      - two
67+
  - id: receiver
68+
    inputs:
69+
      my_input: example-node/two
70+
```
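
The mapping rules above can also be sketched with a slightly larger dataflow in which one output is consumed by two different nodes under different local names. All node IDs, output IDs, and script paths here are hypothetical and chosen only for illustration:

```yaml
nodes:
  # Hypothetical source node that declares a single output.
  - id: camera
    path: ./camera.py
    outputs:
      - image

  # Subscribes to `camera/image` under the local input ID `frame`
  # and declares an output of its own.
  - id: detector
    path: ./detector.py
    inputs:
      frame: camera/image
    outputs:
      - boxes

  # Subscribes to outputs of two different nodes; each input gets
  # its own local ID, independent of the source output's name.
  - id: viewer
    path: ./viewer.py
    inputs:
      image: camera/image
      detections: detector/boxes
```

In this sketch, the `detector` node would see the camera data as an input with ID `frame`, while the `viewer` node would see the same data as `image`.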
71+
72+
73+
### Fields Controlling Node Execution
74+
75+
Use the following fields to define how a node is executed, including command-line arguments and environment
76+
variables.
77+
78+
##### `path` (required)
79+
80+
Specifies the path of the executable or script that Dora should run when starting the dataflow.
81+
This can point to a normal executable (e.g. when using a compiled language such as Rust) or a Python script.
82+
83+
```yaml
84+
nodes:
85+
  - id: rust-example
86+
    path: target/release/rust-node
87+
  - id: python-example
88+
    path: ./receive_data.py
89+
```
90+
91+
See the
92+
[`path` field documentation](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.path)
93+
for details.
94+
95+
##### `args` and `env`
96+
97+
Use the
98+
[`args`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.args)
99+
field to specify command-line arguments that should be passed to the executable/script specified in
100+
`path`.
101+
Use the
102+
[`env`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.env)
103+
field for setting environment variables.
104+
105+
```yaml
106+
nodes:
107+
  - id: example
108+
    path: example-node
109+
    args: -v --some-flag foo
110+
    env:
111+
      IMAGE_WIDTH: 640
112+
      IMAGE_HEIGHT: 480
113+
```
114+
115+
### Fields Controlling Node Build
116+
117+
Build fields define how a node is set up and built on `dora build`.
118+
All build fields are optional.
119+
120+
##### `build`
20121

21-
Input operands are specified using the \<name\>: \<operator\>/\<output\> syntax, where \<data\> is the internal name that should be used for the operand. The main advantage of this name mapping is that the same operator executable can be reused multiple times on different input.
122+
The
123+
[`build`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.build)
124+
field specifies the command that should be invoked for building the node.
22125

23-
## Nodes
126+
```yaml
127+
- id: build-example
128+
  build: cargo build -p receive_data --release
129+
  path: target/release/receive_data
130+
- id: multi-line-example
131+
  build: |
132+
    pip install flash-attn
133+
    pip install -e ../../node-hub/dora-phi4
134+
  path: dora-phi4
135+
```
136+
137+
**Special treatment of `pip`:** Build lines that start with `pip` or `pip3` are treated in a
138+
special way:
139+
If the `--uv` argument is passed to the `dora build` command, all `pip`/`pip3` commands are run
140+
through the [`uv` package manager](https://docs.astral.sh/uv/).
141+
142+
##### `git`
143+
144+
The [`git`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.git)
145+
field allows downloading nodes from git repositories.
146+
This can be especially useful for distributed dataflows.
24147

25-
Nodes are defined using the following format:
148+
When a `git` key is specified, `dora build` automatically clones the specified repository (or reuses
149+
an existing clone).
150+
Then it checks out the specified [`branch`](#branch), [`tag`](#tag), or [`rev`](#rev), or the
151+
default branch if none of them are specified.
152+
Afterwards it runs the `build` command if specified.
26153

27154
```yaml
28155
nodes:
29-
- id: some-unique-id
30-
# For nodes with multiple operators
31-
operators:
32-
- id: operator-1
33-
# ... (see below)
34-
- id: operator-2
35-
# ... (see below)
36-
37-
- id: some-unique-id-2
38-
custom:
39-
source: path/to/timestamp
40-
env:
41-
- ENVIRONMENT_VARIABLE_1: true
42-
working-directory: some/path
43-
44-
inputs:
45-
input_1: operator_2/output_4
46-
input_2: custom_node_2/output_4
47-
outputs:
48-
- output_1
49-
50-
# Unique operator
51-
- id: some-unique-id-3
52-
operator:
53-
# ... (see below)
156+
  - id: rust-node
157+
    git: https://github.com/dora-rs/dora.git
158+
    branch: main
159+
    build: cargo build -p rust-dataflow-example-node
160+
    path: target/debug/rust-dataflow-example-node
54161
```
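
As a variation on the example above, a node can presumably be pinned to a fixed release by using `tag` instead of `branch`. This is a minimal sketch: the node ID is hypothetical, and the tag name `v0.3.12` is an assumption about how the repository names its release tags.

```yaml
nodes:
  - id: pinned-rust-node   # hypothetical node ID
    git: https://github.com/dora-rs/dora.git
    # Assumed tag name; replace it with a tag that exists in the repository.
    tag: v0.3.12
    build: cargo build -p rust-dataflow-example-node
    path: target/debug/rust-dataflow-example-node
```

Pinning to a `tag` or `rev` rather than a branch should make `dora build` check out the same source on every machine, which matters most for distributed dataflows.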
55162

56-
Nodes specify the executable name and arguments like a normal shell operation through the `run` field. Through the optional `env` field, it is possible to set environment variables for the process. The optional `working-directory` field allows to overwrite the directory in which the program is started.
57163

58-
To integrate with the rest of the dora dataflow, custom nodes must specify their inputs and outputs, similar to operators. They can reference outputs of both operators, and other custom nodes.
164+
## Operators
165+
166+
Operators are an experimental, lightweight alternative to nodes.
167+
Instead of running as a separate process, operators are linked into a runtime process.
168+
This allows multiple operators to share a single address space (not supported for Python currently).
169+
170+
Operators are defined as part of the node list, as children of a runtime node.
171+
A runtime node is a special node that specifies no `path` field, but contains an
172+
[`operators`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Node.html#structfield.operators)
173+
field instead.
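
A runtime node could look roughly like the following sketch. The node and operator IDs are hypothetical, and the operator-specific fields are deliberately left out; see the `operators` field documentation for the fields that are actually supported.

```yaml
nodes:
  # Runtime node: it has no `path` field and lists its operators instead.
  - id: runtime-node       # hypothetical node ID
    operators:
      - id: operator-1     # hypothetical operator ID
        # ... operator-specific fields (see the `operators` documentation)
      - id: operator-2     # hypothetical operator ID
        # ... operator-specific fields (see the `operators` documentation)
```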
174+
175+
## Other Dataflow Fields
176+
177+
See the [`Descriptor`](https://docs.rs/dora-core/latest/dora_core/descriptor/struct.Descriptor.html)
178+
struct for a full list of supported fields.
179+
