changelog

Thomzoy · Thomzoy · commit 5f31166b7bb0 · 2024-10-15T11:22:03.000+02:00
diff --git a/changelog.md b/changelog.md
@@ -1,5 +1,17 @@
 # Changelog
 
+## Unreleased
+
+### Added
+
+- `EDS.Tokenizer` now handles `-\n` (found in text when a long word is split across a linebreak) as a specific token, which can be discarded by the normalizer pipe.
+
+### Fixed
+
+- When using `ignore_space_tokens=True`, words separated only by linebreaks will be collected (via `get_text()`) with spaces in between
+- The `process` method of `Qualifiers` now accepts `Span` as input, and treats it as a `Doc` to avoid alignment issues
+- The `detailed_status_mapping` of disorder/behavior pipes is now a `defaultdict` to avoid the `KeyError: None` that can occur when loading pre-annotated docs without instantiating pipes beforehand
+
 ## v0.13.1
 
 ### Added
diff --git a/docs/pipes/ner/behaviors/index.md b/docs/pipes/ner/behaviors/index.md
@@ -9,25 +9,24 @@ Some general considerations about those components:
 - The matched comorbidity is also available under the `ent.label_` of each match.
 - Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
 - Some components add additional information to matches. For instance, the `tobacco` component adds, if relevant, the extracted *pack-year* (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
-  ```{ .python .no-check }
-  nlp.add_pipe(
-      eds.normalizer(
-          accents=True,
-          lowercase=True,
-          quotes=True,
-          spaces=True,
-          pollution=dict(
-              information=True,
-              bars=True,
-              biology=True,
-              doctors=True,
-              web=True,
-              coding=True,
-              footer=True,
-          ),
-      ),
-  )
+- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with these additional flags:
+
+    ```{ .python .no-check }
+    import edsnlp, edsnlp.pipes as eds
+    ...
+
+    nlp.add_pipe(
+        eds.normalizer(
+            accents=True,
+            lowercase=True,
+            quotes=True,
+            spaces=True,
+            pollution=dict(
+                biology=True,
+                coding=True,
+            ),
+        ),
+    )
   ```
 
 --8<-- "docs/pipes/ner/disorders/warning.md"
diff --git a/docs/pipes/ner/disorders/index.md b/docs/pipes/ner/disorders/index.md
@@ -12,7 +12,7 @@ Some general considerations about those components:
 - The matched comorbidity is also available under the `ent.label_` of each match.
 - Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
 - Some components add additional information to matches. For instance, the `tobacco` component adds, if relevant, the extracted *pack-year* (= *paquet-année*). This information is available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
+- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with these additional flags:
 
     ```{ .python .no-check }
     import edsnlp, edsnlp.pipes as eds
@@ -25,13 +25,8 @@ Some general considerations about those components:
             quotes=True,
             spaces=True,
             pollution=dict(
-                information=True,
-                bars=True,
                 biology=True,
-                doctors=True,
-                web=True,
                 coding=True,
-                footer=True,
             ),
         ),
     )
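
The `detailed_status_mapping` fix noted in the changelog can be illustrated with a minimal, standalone sketch. It uses only the standard library and is not the actual EDS-NLP code: the status labels, the `detailed_status` helper, and the choice of `None` as the fallback value are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical labels: the real mapping is component-dependent.
plain_mapping = {1: "PRESENT", 2: "SEVERE"}

# Assumption: unknown statuses (including None) fall back to None.
safe_mapping = defaultdict(lambda: None, plain_mapping)


def detailed_status(mapping, status):
    """Look up the human-readable label for a status (hypothetical helper)."""
    return mapping[status]


# A plain dict raises KeyError when the status was never set (None).
try:
    detailed_status(plain_mapping, None)
    raised = False
except KeyError:
    raised = True

# The defaultdict returns the fallback instead of raising.
fallback = detailed_status(safe_mapping, None)
```

Note that looking up a missing key in a `defaultdict` also stores the fallback under that key, so `None` becomes an entry of `safe_mapping` after the first lookup.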