Improvement: new TNM regex #366

LucasDedieu · 2025-01-24T17:42:50Z

Description

Add a new TNM regex that outperforms the old one. By default, eds.tnm will use the new regex pattern, but the old one will remain accessible.

Installation:

 pip install git+https://github.com/aphp/edsnlp.git@tnm_new_regex

Code example:

import edsnlp, edsnlp.pipes as eds
from edsnlp.pipes.ner.tnm.patterns_new import tnm_pattern_new
from edsnlp.pipes.ner.tnm.patterns import tnm_pattern

text = "Mise à jour de la classification : T3 N1b M0."

# Old pattern: the TNM mention is not detected
nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.tnm(pattern=tnm_pattern))
print(nlp(text).ents)
# Out: ()

# New pattern (the new default): the TNM mention is detected
nlp_new = edsnlp.blank("eds")
nlp_new.add_pipe(eds.tnm(pattern=tnm_pattern_new))
print(nlp_new(text).ents)
# Out: (T3 N1b M0)
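
For readers unfamiliar with how a TNM mention can be captured by a regex, here is a minimal standard-library sketch. The pattern `TOY_TNM` and its group names are hypothetical and heavily simplified. It is not the regex shipped in patterns.py or patterns_new.py, and it only shows the general idea of named groups for the tumour, node and metastasis parts.

```python
import re

# Hypothetical, simplified pattern for illustration only.
# This is NOT the regex from edsnlp's patterns.py or patterns_new.py.
TOY_TNM = re.compile(
    r"""
    (?P<tumour_prefix>[cpyra]{0,2})                      # optional prefix letters
    T\s?(?P<tumour>[0-4x]|is)(?P<tumour_suffix>[a-d])?   # tumour part
    \s*
    N\s?(?P<node>[0-3x])(?P<node_suffix>[a-c])?          # node part
    \s*
    M\s?(?P<metastasis>[01x])(?P<metastasis_suffix>[a-c])?  # metastasis part
    """,
    re.VERBOSE,
)

text = "Mise à jour de la classification : T3 N1b M0."
match = TOY_TNM.search(text)

# Full matched mention
print(match.group(0))

# Only the groups that actually captured something
components = {name: value for name, value in match.groupdict().items() if value}
print(components)
```

On the sentence from the example above, this toy pattern captures the whole mention `T3 N1b M0` and exposes the node suffix `b` as a separate group.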

Changes

patterns_new.py: New file containing the new TNM regex. Compared to the old one, it adds many new sections.
patterns.py: Old regex file. Some sections were renamed to match the new section names used in model.py.
tnm.py: Changed the default pattern to the new pattern.
test_tnm.py: Changed the tnm pipe definition so that the tests still use the old regex.
model.py: Removed part of the pydantic typing validation so that the model works with both the old and new patterns.

TODO

model.py: restore proper pydantic typing
test_tnm.py: update unit tests
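
As a rough idea of what the updated unit tests could look like, the sketch below exercises optional branches of a normalisation method. `ToyTNM` is a hypothetical stand-in written for this example, not the model from model.py. Some field names (`node_prefix`, `metastasis_prefix`, `pleura`) are borrowed from the coverage report, and the output format of `norm()` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTNM:
    """Hypothetical stand-in for the TNM model (not the edsnlp class)."""

    tumour: Optional[str] = None
    node_prefix: Optional[str] = None
    node: Optional[str] = None
    metastasis_prefix: Optional[str] = None
    metastasis: Optional[str] = None
    pleura: Optional[str] = None

    def norm(self) -> str:
        # Each optional field adds a fragment only when it is set,
        # so every `if` is a branch that a test has to reach.
        parts = []
        if self.tumour:
            parts.append(f"T{self.tumour}")
        if self.node_prefix:
            parts.append(self.node_prefix)
        if self.node:
            parts.append(f"N{self.node}")
        if self.metastasis_prefix:
            parts.append(self.metastasis_prefix)
        if self.metastasis:
            parts.append(f"M{self.metastasis}")
        if self.pleura:
            parts.append(f"PL{self.pleura}")
        return "".join(parts)


def test_norm_optional_branches():
    # Plain case: no optional prefix is set
    assert ToyTNM(tumour="3", node="1", metastasis="0").norm() == "T3N1M0"
    # Case reaching the prefix and pleura branches
    full = ToyTNM(
        tumour="2",
        node_prefix="p",
        node="0",
        metastasis_prefix="c",
        metastasis="0",
        pleura="1",
    )
    assert full.norm() == "T2pN0cM0PL1"


test_norm_optional_branches()
```

In the real test suite, the equivalent cases would go through the eds.tnm pipe with input texts that trigger each optional section.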

Checklist

[] If this PR is a bug fix, the bug is documented in the test suite.
[] Changes were documented in the changelog (pending section).
[] If necessary, changes were made to the documentation (eg new pipeline).

sonarqubecloud · 2025-01-24T17:43:47Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2025-01-24T18:06:27Z

Coverage Report

Name	Stmts	Miss	∆ Miss	Cover

edsnlp/pipes/ner/tnm/model.py

New missing coverage at line 25 !

     def __str__(self) -> str:
-         return self.value

New missing coverage at line 131 !

         if self.node_prefix:
-             norm.append(f"{self.node_prefix or ''}")

New missing coverage at line 141 !

         if self.metastasis_prefix:
-             norm.append(f"{self.metastasis_prefix or ''}")

New missing coverage at line 146 !

             if self.metastasis_specification:
-                 norm.append(f"{self.metastasis_specification or ''}")

New missing coverage at line 149 !

         if self.pleura:
-             norm.append(f"PL{self.pleura}")

New missing coverage at line 154 !

             if self.resection_specification:
-                 norm.append(f"{self.resection_specification or ''}")
             if self.resection_loc:

New missing coverage at line 156 !

             if self.resection_loc:
-                 norm.append(f"{self.resection_loc or ''}")

Was already missing at line 164

     def __str__(self):
-         return self.norm()

Was already missing at line 188

             )
-             exclude_unset = skip_defaults

New missing coverage at line 220 !

             if isinstance(v, TnmEnum):
-                 d[k] = v.value

edsnlp/pipes/ner/tnm/model.py	134	10	8	92.54%
TOTAL	11575	231	8	98.00%

Files without new missing coverage

Name	Stmts	Miss	Cover
edsnlp/utils/torch.py Was already missing at line 102 def load_pruned_obj(obj, _): - return obj Was already missing at line 118 def save_align_devices_hook(pickler, obj): - pickler.save_reduce(load_align_devices_hook, (obj.__dict__,), obj=obj) Was already missing at lines 121-128 def load_align_devices_hook(state): - state["execution_device"] = MAP_LOCATION ... - AlignDevicesHook = None Was already missing at line 143 if torch.Tensor in copyreg.dispatch_table: - old_dispatch[torch.Tensor] = copyreg.dispatch_table[torch.Tensor] copyreg.pickle(torch.Tensor, reduce_empty)	83	9	89.16%
edsnlp/utils/span_getters.py Was already missing at lines 78-80 if span_getter is None: - yield doc[:], None - return if callable(span_getter): Was already missing at lines 81-83 if callable(span_getter): - yield from span_getter(doc) - return for key, span_filter in span_getter.items(): Was already missing at line 85 if key == "*": - candidates = ( (span, group) for group in doc.spans.values() for span in group Was already missing at lines 94-97 else: - for span, group in candidates: - if span.label_ in span_filter: - yield span, group Was already missing at line 101 if callable(span_setter): - span_setter(doc, matches) else: Was already missing at line 181 elif isinstance(v, str): - new_value[k] = [v] elif isinstance(v, list) and all(isinstance(i, str) for i in v):	231	10	95.67%
edsnlp/utils/resources.py Was already missing at line 33 if not verbs: - return conjugated_verbs	24	1	95.83%
edsnlp/utils/numbers.py Was already missing at line 34 else: - string = s string = string.lower().strip() Was already missing at lines 38-41 return int(string) - except ValueError: - parsed = DIGITS_MAPPINGS.get(string, None) - return parsed	16	4	75.00%
edsnlp/utils/filter.py Was already missing at line 206 if isinstance(label, int): - return [span for span in spans if span.label == label] else:	74	1	98.65%
edsnlp/tune.py Was already missing at line 169 ) - except RuntimeError as e: if "zero total variance" in str(e): # pragma: no cover Was already missing at line 684 else: - n_trials = compute_n_trials( gpu_hours, compute_time_per_trial(study, ema=True)	289	2	99.31%
edsnlp/training/trainer.py Was already missing at line 88 if result is None: - result = {} if isinstance(x, dict): Was already missing at lines 762-769 accelerator.backward(loss) - except torch.cuda.OutOfMemoryError: ... - raise del loss Was already missing at lines 808-810 ) > grad_max_dev * math.sqrt(grad_var): - spike = True - cumulated_data["spikes"] += 1 else: Was already missing at line 817 if spike and grad_dev_policy == "clip_mean": - torch.nn.utils.clip_grad_norm_( grad_params, grad_mean, norm_type=2 Was already missing at line 821 elif spike and grad_dev_policy == "clip_threshold": - torch.nn.utils.clip_grad_norm_( grad_params,	306	9	97.06%
edsnlp/reducers.py Was already missing at line 115 if not hasattr(module, "__file__"): - return True if module.__file__ is None: Was already missing at line 117 if module.__file__ is None: - return False # Hack to avoid copying the full module dict	67	2	97.01%
edsnlp/processing/spark.py Was already missing at line 50 getActiveSession = SparkSession.getActiveSession - except AttributeError:	47	1	97.87%
edsnlp/processing/multiprocessing.py Was already missing at lines 393-398 self.on_stop() - except BaseException as e: ... - self.main_control_queue.put(e) finally: Was already missing at lines 402-404 pass - except StopSignal: - pass for name, queue in self.consumer_queues(stage): Was already missing at line 542 while schedule[task_idx] is None: - task_idx = (task_idx + 1) % len(schedule) Was already missing at lines 606-608 if isinstance(docs, StreamSentinel): - self.active_batches[stage].append([None, None, None, docs]) - continue batch_id = str(hash(tuple(id(x) for x in docs)))[-8:] + "-" + self.uid Was already missing at lines 1121-1127 if out[0].kind == requires_sentinel: - missing_sentinels -= 1 ... - missing_sentinels = len(self.cpu_worker_names) continue	626	14	97.76%
edsnlp/processing/deprecated_pipe.py Was already missing at lines 207-209 def converter(doc): - res = results_extractor(doc) - return ( [{"note_id": doc._.note_id, **row} for row in res]	57	2	96.49%
edsnlp/pipes/trainable/span_linker/span_linker.py Was already missing at lines 402-404 if self.reference_mode == "synonym": - embeds = embeds.to(new_lin.weight) - new_lin.weight.data = embeds else:	173	2	98.84%
edsnlp/pipes/trainable/span_classifier/span_classifier.py Was already missing at line 373 if not all(keep_bindings): - logger.warning( "Some attributes have no labels or values and have been removed:"	164	1	99.39%
edsnlp/pipes/trainable/ner_crf/ner_crf.py Was already missing at line 301 if self.labels is not None and not self.infer_span_setter: - return Was already missing at lines 309-311 if callable(self.target_span_getter): - for span in get_spans(doc, self.target_span_getter): - inferred_labels.add(span.label_) else:	172	3	98.26%
edsnlp/pipes/trainable/layers/crf.py Was already missing at line 21 # out: 2 * N * O - return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).logsumexp(-2) Was already missing at line 29 # out: 2 * N * O - return (log_A.unsqueeze(-1) + log_B.unsqueeze(-3)).max(-2) Was already missing at line 98 if learnable_transitions: - self.transitions = torch.nn.Parameter( torch.zeros_like(forbidden_transitions, dtype=torch.float) Was already missing at line 108 if learnable_transitions and with_start_end_transitions: - self.start_transitions = torch.nn.Parameter( torch.zeros(num_tags, dtype=torch.float) Was already missing at line 117 if learnable_transitions and with_start_end_transitions: - self.end_transitions = torch.nn.Parameter( torch.zeros(num_tags, dtype=torch.float)	137	5	96.35%
edsnlp/pipes/trainable/embeddings/transformer/transformer.py Was already missing at line 165 if quantization is not None: - kwargs["quantization_config"] = quantization Was already missing at line 185 if self.cls_token_id is None: - [self.cls_token_id] = self.tokenizer.convert_tokens_to_ids( [self.tokenizer.special_tokens_map["bos_token"]] Was already missing at line 189 if self.sep_token_id is None: - [self.sep_token_id] = self.tokenizer.convert_tokens_to_ids( [self.tokenizer.special_tokens_map["eos_token"]]	166	3	98.19%
edsnlp/pipes/qualifiers/reported_speech/reported_speech.py Was already missing at lines 24-28 return "REPORTED" - elif token._.rspeech is False: - return "DIRECT" - else: - return None	100	3	97.00%
edsnlp/pipes/qualifiers/negation/negation.py Was already missing at line 28 else: - return None	101	1	99.01%
edsnlp/pipes/qualifiers/hypothesis/hypothesis.py Was already missing at line 27 else: - return None	98	1	98.98%
edsnlp/pipes/qualifiers/history/history.py Was already missing at lines 26-32 def history_getter(token: Union[Token, Span]) -> Optional[str]: - if token._.history is True: - return "ATCD" - elif token._.history is False: - return "CURRENT" - else: - return None Was already missing at lines 353-359 ) - except ValueError: ... - note_datetime = None Was already missing at lines 368-374 ) - except ValueError: ... - birth_datetime = None Was already missing at lines 437-440 ) - except ValueError as e: - absolute_date = None - logger.warning( "In doc {}, the following date {} raises this error: {}. "	179	14	92.18%
edsnlp/pipes/qualifiers/family/family.py Was already missing at line 27 else: - return None	83	1	98.80%
edsnlp/pipes/ner/scores/sofa/sofa.py Was already missing at line 32 if not assigned: - continue if assigned.get("method_max") is not None: Was already missing at line 40 else: - method = "Non précisée"	25	2	92.00%
edsnlp/pipes/ner/scores/elston_ellis/patterns.py Was already missing at line 26 if x <= 5: - return 1 Was already missing at lines 32-36 else: - return 3 - - except ValueError: - return None	21	4	80.95%
edsnlp/pipes/ner/scores/charlson/patterns.py Was already missing at lines 21-23 return int(extracted_score) - except ValueError: - return None	13	2	84.62%
edsnlp/pipes/ner/disorders/solid_tumor/solid_tumor.py Was already missing at lines 131-137 for span in spans: - span.label_ = "solid_tumor" ... - yield span	38	6	84.21%
edsnlp/pipes/ner/disorders/peripheral_vascular_disease/peripheral_vascular_disease.py Was already missing at line 108 if "peripheral" not in span._.assigned.keys(): - continue	16	1	93.75%
edsnlp/pipes/ner/disorders/diabetes/diabetes.py Was already missing at line 131 # Mostly FP - continue Was already missing at line 134 elif self.has_far_complications(span): - span._.status = 2 Was already missing at line 145 if next(iter(self.complication_matcher(context)), None) is not None: - return True return False	30	3	90.00%
edsnlp/pipes/ner/disorders/connective_tissue_disease/connective_tissue_disease.py Was already missing at line 104 # Huge change of FP / Title section - continue	15	1	93.33%
edsnlp/pipes/ner/disorders/ckd/ckd.py Was already missing at lines 121-124 dfg_value = float(dfg_span.text.replace(",", ".").strip()) - except ValueError: - logger.trace(f"DFG value couldn't be extracted from {dfg_span.text}") - return False	30	3	90.00%
edsnlp/pipes/ner/disorders/cerebrovascular_accident/cerebrovascular_accident.py Was already missing at lines 112-114 if span._.source == "ischemia": - if "brain" not in span._.assigned.keys(): - continue	18	2	88.89%
edsnlp/pipes/ner/adicap/models.py Was already missing at line 15 def norm(self) -> str: - return self.code Was already missing at line 18 def __str__(self): - return self.norm()	16	2	87.50%
edsnlp/pipes/misc/split/split.py Was already missing at lines 183-185 if max_length <= 0 and self.regex is None: - yield doc - return	73	2	97.26%
edsnlp/pipes/misc/sections/sections.py Was already missing at line 126 if sections is None: - sections = patterns.sections sections = dict(sections)	45	1	97.78%
edsnlp/pipes/misc/quantities/quantities.py Was already missing at lines 147-149 def __getitem__(self, item: int): - assert isinstance(item, int) - return [self][item] Was already missing at lines 160-163 def __eq__(self, other: Any): - if isinstance(other, SimpleQuantity): - return self.convert_to(other.unit) == other.value - return False Was already missing at line 166 if other.unit == self.unit: - return SimpleQuantity(self.value + other.value, self.unit, self.registry) return SimpleQuantity( Was already missing at line 193 return self.convert_to(other_unit) - except KeyError: raise AttributeError(f"Unit {other_unit} not found") Was already missing at line 198 def verify(cls, ent): - return True Was already missing at line 264 def __lt__(self, other: Union[SimpleQuantity, "RangeQuantity"]): - return max(self.convert_to(other.unit)) < min((part.value for part in other)) Was already missing at line 275 return self.convert_to(other.unit) == other.value - return False Was already missing at line 289 def verify(cls, ent): - return True Was already missing at line 888 if snippet.end != last and doclike.doc[last: snippet.end].text.strip() == "": - pseudo.append("w") pseudo = "".join(pseudo) Was already missing at line 1069 if start_line is None: - continue Was already missing at lines 1100-1102 unit_norm = self.unit_followers[unit_before.label_] - except (KeyError, AttributeError, IndexError): - pass Was already missing at line 1145 ): - ent = doc[unit_text.start: number.end] else: Was already missing at lines 1152-1154 dims = self.unit_registry.parse_unit(unit_norm)[0] - except KeyError: - continue Was already missing at lines 1260-1262 last._.set(last.label_, new_value) - except (AttributeError, TypeError): - merged.append(ent) else:	440	20	95.45%
edsnlp/pipes/misc/dates/models.py Was already missing at line 165 else: - d["month"] = note_datetime.month if self.day is None: Was already missing at lines 169-175 else: - if self.year is None: ... - d["day"] = default_day Was already missing at lines 183-185 return dt - except ValueError: - return None Was already missing at line 201 else: - return None Was already missing at line 217 if self.second: - norm += f"{self.second:02}s"	206	11	94.66%
edsnlp/pipes/misc/dates/dates.py Was already missing at line 249 if isinstance(absolute, str): - absolute = [absolute] if isinstance(relative, str): Was already missing at line 251 if isinstance(relative, str): - relative = [relative] if isinstance(duration, str): Was already missing at line 253 if isinstance(duration, str): - relative = [duration] if isinstance(false_positive, str): Was already missing at lines 357-366 if self.merge_mode == "align": - alignments = align_spans(matches, spans, sort_by_overlap=True) ... - matches.append(span) Was already missing at lines 462-464 if v1.mode == Mode.DURATION: - m1 = Bound.FROM if v2.bound == Bound.UNTIL else Bound.UNTIL - m2 = v2.mode or Bound.FROM elif v2.mode == Mode.DURATION:	153	14	90.85%
edsnlp/pipes/misc/consultation_dates/consultation_dates.py Was already missing at line 131 else: - self.date_matcher = None Was already missing at line 134 if not consultation_mention: - consultation_mention = [] elif consultation_mention is True:	48	2	95.83%
edsnlp/pipes/core/normalizer/__init__.py Was already missing at line 7 def excluded_or_space_getter(t): - return t.is_space or t.tag_ == "EXCLUDED"	5	1	80.00%
edsnlp/pipes/core/endlines/endlines.py Was already missing at lines 160-164 if end_lines_model is None: - path = build_path(__file__, "base_model.pkl") - - with open(path, "rb") as inp: - self.model = pickle.load(inp) elif isinstance(end_lines_model, str): Was already missing at lines 167-169 self.model = pickle.load(inp) - elif isinstance(end_lines_model, EndLinesModel): - self.model = end_lines_model else: Was already missing at line 200 ): - return "ENUMERATION" Was already missing at line 287 if np.isnan(sigma): - sigma = 1	89	7	92.13%
edsnlp/pipes/core/contextual_matcher/contextual_matcher.py Was already missing at lines 241-243 ): - to_keep = False - break	130	2	98.46%
edsnlp/patch_spacy.py Was already missing at lines 67-69 # if module is reloaded. - existing_func = registry.factories.get(internal_name) - if not util.is_same_func(factory_func, existing_func): raise ValueError(	31	2	93.55%
edsnlp/package.py Was already missing at lines 474-476 version = version or pyproject["project"]["version"] - except (KeyError, TypeError): - version = "0.1.0" name = name or pyproject["project"]["name"] Was already missing at line 480 else: - main_package = None model_package = snake_case(name.lower())	214	3	98.60%
edsnlp/metrics/span_attribute.py Was already missing at lines 67-69 ) - assert attributes is None - attributes = kwargs.pop("qualifiers") if attributes is None:	75	2	97.33%
edsnlp/matchers/simstring.py Was already missing at line 280 if custom: - attr = attr[1:].lower() Was already missing at line 295 if custom: - token_text = getattr(token._, attr) else:	146	2	98.63%
edsnlp/language.py Was already missing at line 103 if last != begin: - logger.warning( "Missed some characters during"	51	1	98.04%
edsnlp/data/standoff.py Was already missing at line 38 def __init__(self, ann_file, line): - super().__init__(f"File {ann_file}, unrecognized Brat line {line}") Was already missing at line 192 ) - except Exception: raise Exception(	186	2	98.92%
edsnlp/data/polars.py Was already missing at line 36 if hasattr(data, "collect"): - data = data.collect() assert isinstance(data, pl.DataFrame)	55	1	98.18%
edsnlp/data/json.py Was already missing at line 81 return records - except Exception as e: raise Exception(f"Cannot read {file}: {e}")	112	1	99.11%
edsnlp/data/converters.py Was already missing at line 428 elif key == "XPOS": - word.tag_ = value elif key == "FEATS": Was already missing at line 822 for attr in bool_attributes: - self.default_attributes[attr] = False self.opener = opener or self.PRESETS[preset]["opener"] Was already missing at line 830 if self.keep_raw_attribute_values: - return value try: Was already missing at lines 869-872 ) - except StopIteration: - warnings.warn(f"Unmatched closing tag for '{sep.group()}'") - continue start, start_label, start_attrs = starts.pop(idx) Was already missing at line 900 ): - if not Span.has_extension(dst): Span.set_extension(dst, default=None) Was already missing at line 907 if span is None: - continue for k, v in attrs.items(): Was already missing at lines 922-925 for attr, value in self.default_attributes.items(): - for span in spans: - if span._.get(attr) is None: - span._.set(attr, value) Was already missing at line 960 if isinstance(converter, type): - return converter(**kwargs), {} return converter, validate_kwargs(converter, kwargs)	317	11	96.53%
edsnlp/data/conll.py Was already missing at lines 81-83 ) - except StopIteration: - cols = DEFAULT_COLUMNS warnings.warn( Was already missing at lines 92-96 if not line: - if doc["words"]: - yield doc - doc = {"words": []} - continue if line.startswith("#"):	76	6	92.11%
edsnlp/core/torch_component.py Was already missing at line 392 if hasattr(self, "compiled"): - res = self.compiled(batch) else: Was already missing at line 438 """ - return self.preprocess(doc) Was already missing at line 463 if object.__repr__(self) in exclude: - return exclude.add(object.__repr__(self))	187	3	98.40%
edsnlp/core/stream.py Was already missing at lines 190-192 if isinstance(batch, StreamSentinel): - yield batch - continue results = [] Was already missing at lines 1007-1009 elif op.batch_fn is None: - batch_size = op.size - batch_fn = batchify else:	356	4	98.88%
edsnlp/core/pipeline.py Was already missing at line 605 if name in exclude: - continue if name not in components: Was already missing at lines 716-719 """ - res = Stream.ensure_stream(docs) - res = res.map(functools.partial(self.preprocess, supervision=supervision)) - return res	446	4	99.10%
edsnlp/connectors/omop.py Was already missing at line 69 if not isinstance(row.ents, list): - continue Was already missing at line 87 else: - doc.spans[span.label_].append(span) Was already missing at line 127 if df.note_id.isna().any(): - df["note_id"] = range(len(df)) Was already missing at line 171 if i > 0: - df.term_modifiers += ";" df.term_modifiers += ext + "=" + df[ext].astype(str)	84	4	95.24%

277 files skipped due to complete coverage.

Coverage failure: total of 98.00% is less than 98.07% ❌
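
The figures in this report can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of coverage as covered statements divided by total statements. The helper `coverage` is written for this example, and the second figure assumes the total statement count stays the same once the 8 newly missed lines are covered.

```python
def coverage(stmts: int, miss: int) -> float:
    """Coverage in percent: share of statements that were executed."""
    return 100 * (stmts - miss) / stmts


# Figures taken from the report above
total_stmts, total_miss, new_miss = 11575, 231, 8

current = coverage(total_stmts, total_miss)
without_new_misses = coverage(total_stmts, total_miss - new_miss)
model_py = coverage(134, 10)

print(f"TOTAL: {current:.2f}%")
print(f"TOTAL if the 8 new misses were covered: {without_new_misses:.2f}%")
print(f"model.py: {model_py:.2f}%")
```

Under these assumptions, covering the 8 newly missed lines in model.py would bring the total back to the 98.07% threshold.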

sonarqubecloud · 2025-05-14T14:55:07Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2025-05-14T14:59:58Z

Docs preview URL

https://edsnlp-tnmnewregex.vercel.app/

sonarqubecloud · 2025-07-04T15:28:24Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

percevalw force-pushed the master branch from c10a458 to 8276cc9 Compare April 18, 2025 01:15

LucasDedieu force-pushed the tnm_new_regex branch from 4a7168c to 8e7ee3a Compare May 14, 2025 14:54

fix: add a new tnm regex pattern by default.

0c012f5

LucasDedieu force-pushed the tnm_new_regex branch from d8d2c40 to 0c012f5 Compare July 4, 2025 15:21

linting

e40f154

percevalw force-pushed the master branch 2 times, most recently from d2e1f39 to 65669dc Compare September 4, 2025 07:26
