
Commit ecc9185

docs: update training, stream and torch components docs
1 parent af06396 commit ecc9185

8 files changed (+254, -93 lines)

docs/assets/templates/python/material/docstring.html

Lines changed: 11 additions & 11 deletions
@@ -2,27 +2,27 @@
 {{ log.debug("Rendering docstring") }}
 {% for section in docstring_sections %}
 {% if not config.only_parameters %}
-{% if section.kind.value == "text" %}
+{% if section.kind.value == "text" and (not config.sections or "text" in config.sections) %}
 {{ section.value|convert_markdown(heading_level - 1, html_id) }}
-{% elif section.kind.value == "attributes" %}
+{% elif section.kind.value == "attributes" and (not config.sections or "attributes" in config.sections) %}
 {% include "docstring/attributes.html" with context %}
-{% elif section.kind.value == "parameters" %}
+{% elif section.kind.value == "parameters" and (not config.sections or "parameters" in config.sections) %}
 {% include "docstring/parameters.html" with context %}
-{% elif section.kind.value == "other parameters" %}
+{% elif section.kind.value == "other parameters" and (not config.sections or "parameters" in config.sections) %}
 {% include "docstring/other_parameters.html" with context %}
-{% elif section.kind.value == "raises" %}
+{% elif section.kind.value == "raises" and (not config.sections or "raises" in config.sections) %}
 {% include "docstring/raises.html" with context %}
-{% elif section.kind.value == "warns" %}
+{% elif section.kind.value == "warns" and (not config.sections or "warns" in config.sections) %}
 {% include "docstring/warns.html" with context %}
-{% elif section.kind.value == "yields" %}
+{% elif section.kind.value == "yields" and (not config.sections or "yields" in config.sections) %}
 {% include "docstring/yields.html" with context %}
-{% elif section.kind.value == "receives" %}
+{% elif section.kind.value == "receives" and (not config.sections or "receives" in config.sections) %}
 {% include "docstring/receives.html" with context %}
-{% elif section.kind.value == "returns" %}
+{% elif section.kind.value == "returns" and (not config.sections or "returns" in config.sections) %}
 {% include "docstring/returns.html" with context %}
-{% elif section.kind.value == "examples" %}
+{% elif section.kind.value == "examples" and (not config.sections or "examples" in config.sections) %}
 {% include "docstring/examples.html" with context %}
-{% elif section.kind.value == "admonition" %}
+{% elif section.kind.value == "admonition" and (not config.sections or "admonition" in config.sections) %}
 {% include "docstring/admonition.html" with context %}
 {% endif %}
 {% elif section.kind.value == "parameters" %}

docs/assets/templates/python/material/docstring/parameters.html

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-{% if config.only_parameters != "no-header" %}
+{% if config.only_parameters != "no-header" and config.header != false %}
 {{ "# Parameters\n"|convert_markdown(heading_level, html_id) }}
 {% endif %}
 {% if config.docstring_section_style == "table" %}

docs/concepts/inference.md

Lines changed: 95 additions & 17 deletions
@@ -32,22 +32,20 @@ doc = nlp(text)

 To leverage multiple GPUs when processing multiple documents, refer to the [multiprocessing backend][edsnlp.processing.multiprocessing.execute_multiprocessing_backend] description below.

-## Inference on multiple documents {: #edsnlp.core.stream.Stream }
+## Streams

 When processing multiple documents, we can optimize the inference by parallelizing the computation on a single core, multiple cores and GPUs, or even multiple machines.

-### Streams
-
 These optimizations are enabled by performing *lazy inference*: the operations (e.g., reading a document, converting it to a Doc, running the different pipes of a model or writing the result somewhere) are not executed immediately but are instead scheduled in a [Stream][edsnlp.core.stream.Stream] object. The stream can then be executed by calling its `execute` method, iterating over it, or calling a writing method (e.g., `to_pandas`). In fact, data connectors like `edsnlp.data.read_json` return a stream, as does the `nlp.pipe` method.

 A stream contains:

 - a `reader`: the source of the data (e.g., a file, a database, a list of strings, etc.)
-- the list of operations to perform under a `pipeline` attribute containing the name if any, function / pipe, keyword arguments and context for each operation
+- the list of operations to perform (`stream.ops`), containing the function / pipe, keyword arguments and context for each operation
 - an optional `writer`: the destination of the data (e.g., a file, a database, a list of strings, etc.)
 - the execution `config`, containing the backend to use and its configuration such as the number of workers, the batch size, etc.

-All methods (`.map`, `.map_batches`, `.map_gpu`, `.map_pipeline`, `.set_processing`) of the stream are chainable, meaning that they return a new stream object (no in-place modification).
+All methods (`map()`, `map_batches()`, `map_gpu()`, `map_pipeline()`, `set_processing()`) of the stream are chainable, meaning that they return a new stream object (no in-place modification).

 For instance, the following code will load a model, read a folder of JSON files, apply the model to each document and write the result in a Parquet folder, using 4 CPUs and 2 GPUs.
@@ -88,53 +86,133 @@ data = data.set_processing(
     num_cpu_workers=4,
     # 2 GPUs to accelerate deep-learning pipes
     num_gpu_workers=2,
-
+    # Show the progress bar
+    show_progress=True,
 )

 # Write the result, this will execute the stream
 data.write_parquet("path/to/output_folder", converter="...", write_in_worker=True)
 ```

-### Applying operations to a stream
+Streams support a variety of operations, such as applying a function to each element of the stream, batching the elements, applying a model to the elements, etc. In each case, the operations will not be executed immediately but will be scheduled to run when iterating over the collection, or when calling the `execute()`, `to_*()` or `write_*()` methods.
+
+### `map()` {: #edsnlp.core.stream.Stream.map }
+
+::: edsnlp.core.stream.Stream.map
+    options:
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false

-To apply an operation to a stream, you can use the `.map` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection.
+### `map_batches()` {: #edsnlp.core.stream.Stream.map_batches }
+
+To apply an operation to a stream in batches, you can use the `map_batches()` method. It takes a callable as input, an optional dictionary of keyword arguments and batching arguments.
+
+::: edsnlp.core.stream.Stream.map_batches
+    options:
+        heading_level: 3
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false
+
+### `map_pipeline()` {: #edsnlp.core.stream.Stream.map_pipeline }
+
+::: edsnlp.core.stream.Stream.map_pipeline
+    options:
+        heading_level: 3
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false
+
+### `map_gpu()` {: #edsnlp.core.stream.Stream.map_gpu }
+
+::: edsnlp.core.stream.Stream.map_gpu
+    options:
+        heading_level: 3
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false

-To apply an operation to a stream in batches, you can use the `.map_batches` method. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each batch of the collection (as a list of elements), and should return a list of results that will be concatenated at the end.
+### `loop()` {: #edsnlp.core.stream.Stream.loop }

-To apply a model, you can use the `.map_pipeline` method. It takes a model as input and will add every pipe of the model to the scheduled operations.
+::: edsnlp.core.stream.Stream.loop
+    options:
+        heading_level: 3
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false

-To run a specific function on a GPU (for advanced users, otherwise `map_pipeline` should accommodate most use cases), you can use the `.map_gpu` method. It takes two or three callables as input: the first one (`prepare_batches`) takes a batch of inputs and should return some tensors that will be sent to the GPU and passed to the second callable (`forward`), which applies the deep learning ops and returns the results. The third callable (`postprocess`) gets the batch of inputs as well as the `forward` results and should return the final results (for instance, the input documents annotated with the predictions).
+### `shuffle()` {: #edsnlp.core.stream.Stream.shuffle }

-In each case, the operations will not be executed immediately but will be scheduled to run when iterating over the collection, or when calling the `.execute`, `.to_*` or `.write_*` methods.
+::: edsnlp.core.stream.Stream.shuffle
+    options:
+        heading_level: 3
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false

-### Execution of a stream {: #edsnlp.core.stream.Stream.set_processing }
+### Configure the execution with `set_processing()` {: #edsnlp.core.stream.Stream.set_processing }

 You can configure how the operations performed in the stream are executed by calling its `set_processing(...)` method. The following options are available:

 ::: edsnlp.core.stream.Stream.set_processing
     options:
         heading_level: 3
-        only_parameters: "no-header"
+        sections: ['text', 'parameters']
+        header: false
+        show_source: false

 ## Backends

-### Simple backend {: #edsnlp.processing.simple.execute_simple_backend }
+The `backend` parameter of `set_processing()` supports the following values:
+
+### `simple` {: #edsnlp.processing.simple.execute_simple_backend }

 ::: edsnlp.processing.simple.execute_simple_backend
     options:
         heading_level: 3
         show_source: false

-### Multiprocessing backend {: #edsnlp.processing.multiprocessing.execute_multiprocessing_backend }
+### `multiprocessing` {: #edsnlp.processing.multiprocessing.execute_multiprocessing_backend }

 ::: edsnlp.processing.multiprocessing.execute_multiprocessing_backend
     options:
         heading_level: 3
         show_source: false

-### Spark backend {: #edsnlp.processing.spark.execute_spark_backend }
+### `spark` {: #edsnlp.processing.spark.execute_spark_backend }

 ::: edsnlp.processing.spark.execute_spark_backend
     options:
         heading_level: 3
         show_source: false
+
+## Batching
+
+Many operations rely on batching, either to be more efficient or because they require a fixed-size input. The `batch_size` and `batch_by` arguments of the `map_batches()` method let you specify the size of the batches and which function to use to compute the size of a batch.
+
+```{ .python .no-check }
+# Accumulate in chunks of 1024 documents
+lengths = data.map_batches(len, batch_size=1024)
+
+# Accumulate in chunks of 100 000 words
+lengths = data.map_batches(len, batch_size=100_000, batch_by="words")
+# or
+lengths = data.map_batches(len, batch_size="100_000 words")
+```
+
+We also support special values for `batch_size` which use "sentinels" (i.e., markers inserted in the stream) to delimit the batches.
+
+```{ .python .no-check }
+# Accumulate every element of the input in a single batch,
+# which is useful when looping over the data in training
+lengths = data.map_batches(len, batch_size="dataset")
+
+# Accumulate in chunks of fragments, in the case of parquet datasets
+lengths = data.map_batches(len, batch_size="fragments")
+```
+
+Note that these batch functions are only available under specific conditions:
+
+- either `backend="simple"` or `deterministic=True` (default) if `backend="multiprocessing"`, otherwise elements might be processed out of order
+- if every op before was elementwise (e.g. `map()`, `map_gpu()`, `map_pipeline()` and no generator function), or `sentinel_mode` was explicitly set to `"split"` in `map_batches()`; otherwise the sentinels are dropped by default when the user requires batching.
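
To see the stream API documented by this commit end to end, here is a minimal sketch chaining the pieces together. The model path, the folder paths and the `converter="omop"` choice are illustrative assumptions, not part of the commit:

```python
import edsnlp

# Load a pipeline (the path is an illustrative assumption)
nlp = edsnlp.load("path/to/model")

data = (
    edsnlp.data.read_json("path/to/input_folder", converter="omop")
    .map_pipeline(nlp)  # schedule every pipe of the model
    .set_processing(
        backend="multiprocessing",
        num_cpu_workers=4,
        num_gpu_workers=2,
        show_progress=True,
    )
)

# Nothing has been executed yet: the writer below triggers the whole stream
data.write_parquet("path/to/output_folder", converter="omop")
```

Since every method returns a new stream, the chain can be built incrementally and reused without side effects.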

docs/concepts/torch-component.md

Lines changed: 81 additions & 9 deletions
@@ -120,10 +120,10 @@ During the training loop, when computing the loss for each component, the forward

 ## Implementation example

-Here is an example of a trainable component:
+Here is a draft of a trainable component:

 ```python
-from typing import Any, Dict, Iterable, Sequence
+from typing import Any, Dict, Iterable, Sequence, List, Set

 import torch
 from tqdm import tqdm
@@ -138,12 +138,27 @@ class MyComponent(TorchComponent):
         self,
         nlp: Pipeline,
         name: str,
+        *,
         embedding: TorchComponent,  # A subcomponent
     ):
         super().__init__(nlp=nlp, name=name)
         self.embedding = embedding

     def post_init(self, gold_data: Iterable["spacy.tokens.Doc"], exclude: set):
+        """
+        This method completes the attributes of the component by looking at some
+        documents. It is especially useful to build vocabularies or detect the
+        labels of a classification task.
+
+        Parameters
+        ----------
+        gold_data: Iterable[Doc]
+            The documents to use for initialization.
+        exclude: Set
+            The names of components to exclude from initialization.
+            This argument will be gradually updated with the names of initialized
+            components.
+        """
         super().post_init(gold_data, exclude)

         # Initialize the component with the gold documents
@@ -159,23 +174,61 @@ class MyComponent(TorchComponent):
         # Initialize any layer that might be missing from the module
         self.classifier = torch.nn.Linear(...)

-    def preprocess(self, doc: "spacy.tokens.Doc") -> Dict[str, Any]:
-        # Preprocess the doc to extract features required to run the embedding
-        # subcomponent, and this component
+    def preprocess(self, doc: "spacy.tokens.Doc", **kwargs) -> Dict[str, Any]:
+        """
+        Preprocess the document to extract features that will be used by the
+        neural network and its subcomponents to perform its predictions.
+
+        Parameters
+        ----------
+        doc: Doc
+            Document to preprocess
+
+        Returns
+        -------
+        Dict[str, Any]
+            Dictionary (optionally nested) containing the features extracted from
+            the document.
+        """
         return {
             "embedding": self.embedding.preprocess(doc),
             "my-feature": ...,
         }

     def collate(self, batch) -> Dict:
-        # Collate the features of the "embedding" subcomponent
-        # and the features of this component as well
+        """
+        Collate the batch of features into a single batch of tensors that can be
+        used by the forward method of the component.
+
+        Parameters
+        ----------
+        batch: Dict[str, Any]
+            Batch of features
+
+        Returns
+        -------
+        BatchInput
+            Dictionary (optionally nested) containing the collated tensors
+        """
         return {
             "embedding": self.embedding.collate(batch["embedding"]),
             "my-feature": torch.as_tensor(batch["my-feature"]),
         }

     def forward(self, batch: Dict) -> Dict:
+        """
+        Perform the forward pass of the neural network.
+
+        Parameters
+        ----------
+        batch: BatchInput
+            Batch of tensors (nested dictionary) computed by the collate method
+
+        Returns
+        -------
+        BatchOutput
+            Dict of scores, losses, embeddings tensors, etc.
+        """
         # Call the embedding subcomponent
         embeds = self.embedding(batch["embedding"])

@@ -185,9 +238,28 @@ class MyComponent(TorchComponent):
         return output

     def postprocess(
-        self, docs: Sequence["spacy.tokens.Doc"], output: Dict
+        self,
+        docs: Sequence["spacy.tokens.Doc"],
+        results: Dict,
+        inputs: List[Dict[str, Any]],
     ) -> Sequence["spacy.tokens.Doc"]:
-        # Annotate the docs with the outputs of the forward method
+        """
+        Update the documents with the predictions of the neural network.
+        By default, this is a no-op.
+
+        Parameters
+        ----------
+        docs: Sequence[Doc]
+            List of documents to update
+        results: BatchOutput
+            Batch of predictions, as returned by the forward method
+        inputs: List[Dict[str, Any]]
+            List of preprocessed features, as returned by the preprocess method
+
+        Returns
+        -------
+        Sequence[Doc]
+        """
         ...
         return docs
 ```
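
To make the lifecycle of these four methods concrete, here is a rough sketch of how a batch flows through them at inference time. It only approximates what the framework does behind the scenes; `component` stands for an initialized `MyComponent` and `docs` for a list of `Doc` objects, both assumed for illustration:

```python
# 1. Extract features from each document (plain Python objects)
inputs = [component.preprocess(doc) for doc in docs]

# 2. Regroup the per-document feature dicts into one dict of lists,
#    then collate them into (nested) tensors
features = {
    "embedding": [x["embedding"] for x in inputs],
    "my-feature": [x["my-feature"] for x in inputs],
}
batch = component.collate(features)

# 3. Run the neural network on the collated batch
results = component.forward(batch)

# 4. Write the predictions back onto the documents
docs = component.postprocess(docs, results, inputs)
```

This split also shows why `preprocess` and `postprocess` work on plain Python objects while `collate` and `forward` work on tensors: only the middle two steps need to touch the GPU.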
