docs: add details for disorders and behavior pipes

Thomzoy · Thomzoy · commit c1cf75077b64 · 2024-10-16T09:46:56.000+02:00
diff --git a/docs/pipes/ner/behaviors/alcohol.md b/docs/pipes/ner/behaviors/alcohol.md
@@ -1,5 +1,7 @@
 # Alcohol consumption {: #edsnlp.pipes.ner.behaviors.alcohol.factory.create_component }
 
+--8<-- "docs/pipes/ner/disorders/warning.md"
+
 ::: edsnlp.pipes.ner.behaviors.alcohol.factory.create_component
     options:
         heading_level: 2
diff --git a/docs/pipes/ner/behaviors/index.md b/docs/pipes/ner/behaviors/index.md
@@ -2,99 +2,6 @@
 
 ## Presentation
 
-EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the ContextualMatcher component.
-Some general considerations about those components:
+EDS-NLP offers two components to extract behavioral patterns, namely the tobacco and alcohol consumption status. Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on `eds.contextual_matcher` component.
 
-- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
-- The matched comorbidity is also available under the `ent.label_` of each match.
-- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
-- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
-  ```{ .python .no-check }
-  nlp.add_pipe(
-      eds.normalizer(
-          accents=True,
-          lowercase=True,
-          quotes=True,
-          spaces=True,
-          pollution=dict(
-              information=True,
-              bars=True,
-              biology=True,
-              doctors=True,
-              web=True,
-              coding=True,
-              footer=True,
-          ),
-      ),
-  )
-  ```
-
-!!! warning "Use qualifiers"
-    Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.
-
-    !!! aphp "Use the ML model"
-
-        The model will soon be available in the models catalogue of AP-HP's CDW.
-
-## Usage
-
-```{ .python .no-check }
-import edsnlp, edsnlp.pipes as eds
-
-nlp = edsnlp.blank("eds")
-nlp.add_pipe(eds.sentences())
-nlp.add_pipe(
-    eds.normalizer(
-        accents=True,
-        lowercase=True,
-        quotes=True,
-        spaces=True,
-        pollution=dict(
-            information=True,
-            bars=True,
-            biology=True,
-            doctors=True,
-            web=True,
-            coding=True,
-            footer=True,
-        ),
-    ),
-)
-nlp.add_pipe(eds.tobacco())
-nlp.add_pipe(eds.diabetes())
-
-text = """
-Compte-rendu de consultation.
-
-Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
-Le patient va bien depuis la dernière fois.
-Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
-
-Sur le plan de son diabète, la glycémie est stable.
-"""
-
-doc = nlp(text)
-
-doc.spans
-# Out: {
-# 'pollutions': [],
-# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
-# 'diabetes': [rétinopathie diabétique, diabète]
-# }
-
-tobacco_matches = doc.spans["tobacco"]
-tobacco_matches[0]._.detailed_status
-# Out: "ABSTINENCE" #
-
-tobacco_matches[0]._.assigned["PA"]  # paquet-année
-# Out: 10 # (1)
-
-
-diabetes = doc.spans["diabetes"]
-(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
-# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (2)
-```
-
-1. Here we see an example of additional information that can be extracted
-2. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
+--8<-- "docs/pipes/ner/disorders/presentation.md"
diff --git a/docs/pipes/ner/disorders/index.md b/docs/pipes/ner/disorders/index.md
@@ -2,58 +2,6 @@
 
 ## Presentation
 
-The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the ContextualMatcher component.
+The following components extract 16 different conditions from the [Charlson Comorbidity Index](https://www.rdplf.org/calculateurs/pages/charlson/charlson.html). Each component is based on the [ContextualMatcher][edsnlp.pipes.core.contextual_matcher.ContextualMatcher] matcher, itself based on `eds.contextual_matcher` component.
 
-The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024]
-
-Some general considerations about those components:
-
-- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
-- The matched comorbidity is also available under the `ent.label_` of each match.
-- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
-- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
-- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline with the following parameters:
-
-    ```{ .python .no-check }
-    import edsnlp, edsnlp.pipes as eds
-    ...
-
-    nlp.add_pipe(
-        eds.normalizer(
-            accents=True,
-            lowercase=True,
-            quotes=True,
-            spaces=True,
-            pollution=dict(
-                information=True,
-                bars=True,
-                biology=True,
-                doctors=True,
-                web=True,
-                coding=True,
-                footer=True,
-            ),
-        ),
-    )
-    ```
-
-!!! warning "Use qualifiers"
-    Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you can use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.
-
-    !!! aphp "Use the ML model"
-
-        The model will soon be available in the models catalogue of AP-HP's CDW.
-
-!!! tip "On the medical definition of the comorbidities"
-
-    Those components were developped to extract **chronic** and **symptomatic** conditions only.
-
-## Aggregation
-
-For relevant phenotyping, matches should be aggregated at the document-level. For instance, a document might mention a complicated diabetes at the beginning ("*Le patient a une rétinopathie diabétique*"), and then refer to this diabetes without mentionning that it is complicated anymore ("*Concernant son diabète, le patient ...*").
-Thus, a good and simple aggregation rule is, for each comorbidity, to
-
-- disregard all entities tagged as irrelevant by the qualification component(s)
-- take the maximum (i.e., the most severe) status of the leftover entities
-
-An implementation of this rule is presented [here][aggregating-results]
+--8<-- "docs/pipes/ner/disorders/presentation.md"
diff --git a/docs/pipes/ner/disorders/presentation.md b/docs/pipes/ner/disorders/presentation.md
@@ -0,0 +1,77 @@
+The components were developed by AP-HP's Data Science team with a team of medical experts, following the insights of the algorithm proposed by [@petitjean_2024]
+
+Some general considerations about those components:
+
+- Extracted entities are stored in `doc.ents` and `doc.spans`. For instance, the `eds.tobacco` component stores matches in `doc.spans["tobacco"]`.
+- The matched comorbidity is also available under the `ent.label_` of each match.
+- Matches have an associated `_.status` attribute taking the value `1`, or `2`. A corresponding `_.detailed_status` attribute stores the human-readable status, which can be component-dependent. See each component documentation for more details.
+- Some components add additional information to matches. For instance, the `tobacco` adds, if relevant, extracted *pack-year* (= *paquet-année*). Those information are available under the `ent._.assigned` attribute.
+- Those components work on **normalized** documents. Please use the `eds.normalizer` pipeline (see [Usage](#usage) below)
+
+--8<-- "docs/pipes/ner/disorders/warning.md"
+
+!!! warning "Use qualifiers"
+    Those components **should be used with a qualification pipeline** to avoid extracted unwanted matches. At the very least, you should use available rule-based qualifiers (`eds.negation`, `eds.hypothesis` and `eds.family`). Better, a machine learning qualification component was developed and trained specifically for those components. For privacy reason, the model isn't publicly available yet.
+
+    !!! aphp "Use the ML model"
+
+        For projects working on AP-HP's CDW, this model is available via its models catalogue.
+
+## Usage
+
+```{ .python .no-check }
+import edsnlp, edsnlp.pipes as eds
+
+nlp = edsnlp.blank("eds")
+nlp.add_pipe(eds.sentences())
+nlp.add_pipe(
+    eds.normalizer(
+        accents=True,
+        lowercase=True,
+        quotes=True,
+        spaces=True,
+        pollution=dict(
+            biology=True, #(1)
+            coding=True, #(2)
+        ),
+    ),
+)
+nlp.add_pipe(eds.tobacco())
+nlp.add_pipe(eds.diabetes())
+
+text = """
+Compte-rendu de consultation.
+
+Je vois ce jour M. SCOTT pour le suivi de sa rétinopathie diabétique.
+Le patient va bien depuis la dernière fois.
+Je le félicite pour la poursuite de son sevrage tabagique (toujours à 10 paquet-année).
+
+Sur le plan de son diabète, la glycémie est stable.
+"""
+
+doc = nlp(text)
+
+doc.spans
+# Out: {
+# 'pollutions': [],
+# 'tobacco': [sevrage tabagique (toujours à 10 paquet-année],
+# 'diabetes': [rétinopathie diabétique, diabète]
+# }
+
+tobacco_matches = doc.spans["tobacco"]
+tobacco_matches[0]._.detailed_status
+# Out: "ABSTINENCE" #
+
+tobacco_matches[0]._.assigned["PA"]  # paquet-année
+# Out: 10 # (3)
+
+
+diabetes = doc.spans["diabetes"]
+(diabetes[0]._.detailed_status, diabetes[1]._.detailed_status)
+# Out: ('WITH_COMPLICATION', 'WITHOUT_COMPLICATION') # (4)
+```
+
+1. This will discard mentions of biology results, which often leads to false positive
+2. This will discard mentions of ICD10 coding that sometimes appears at the end of clinical documents
+3. Here we see an example of additional information that can be extracted
+4. Here we see the importance of document-level aggregation to extract the correct severity of each comorbidity.
diff --git a/docs/pipes/ner/disorders/warning.md b/docs/pipes/ner/disorders/warning.md
@@ -0,0 +1,7 @@
+!!! danger "On overlapping entities"
+    When using multiple disorders or behavior pipelines, some entities may be extracted from different pipes. For instance:
+
+    * "Intoxication éthylotabagique" will be tagged both by `eds.tobacco` and `eds.alcohol`
+    * "Chirrose alcoolique" will be tagged both by `eds.liver_disease` and `eds.alcohol`
+
+    As `doc.ents` discards overlapping entities, you should use `doc.spans` instead.