Commit ed56f2d

touma-I and dolfim-ibm authored
fix(html): Parse rowspan and colspan when they include non-numerical values (#2048)
* use re to stop at first non-digit

Signed-off-by: Maroun Touma <[email protected]>

* Allow digit in first place followed by non numerical values

Signed-off-by: Maroun Touma <[email protected]>

* refactor to match type checker

Signed-off-by: Michele Dolfi <[email protected]>

---------

Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
1 parent bfda6d3 commit ed56f2d

1 file changed: +10 -2 lines changed

docling/backend/html_backend.py

Lines changed: 10 additions & 2 deletions
@@ -511,9 +511,17 @@ def _get_cell_spans(cell: Tag) -> tuple[int, int]:
             str(cell.get("colspan", "1")),
             str(cell.get("rowspan", "1")),
         )
+
+        def _extract_num(s: str) -> int:
+            if s and s[0].isnumeric():
+                match = re.search(r"\d+", s)
+                if match:
+                    return int(match.group())
+            return 1
+
         int_spans: tuple[int, int] = (
-            int(raw_spans[0]) if raw_spans[0].isnumeric() else 1,
-            int(raw_spans[1]) if raw_spans[0].isnumeric() else 1,
+            _extract_num(raw_spans[0]),
+            _extract_num(raw_spans[1]),
         )
 
         return int_spans
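
To see what the change does in practice, the sketch below copies the `_extract_num` helper from this commit into a standalone function and compares it with the rule it replaces. The names `extract_num`, `old_parse`, and `samples` are illustrative and do not appear in the repository, and `old_parse` is a simplified reading of the removed lines (it checks each value against itself, whereas the removed code checked `raw_spans[0]` for both values).

```python
import re


def extract_num(s: str) -> int:
    """Standalone copy of the `_extract_num` helper added in this commit."""
    # Accept the value only if its first character is numeric,
    # then keep the first run of digits; otherwise fall back to 1.
    if s and s[0].isnumeric():
        match = re.search(r"\d+", s)
        if match:
            return int(match.group())
    return 1


def old_parse(s: str) -> int:
    """Hypothetical per-value form of the removed rule: the whole string must be numeric."""
    return int(s) if s.isnumeric() else 1


# Illustrative attribute values, not taken from the repository's tests.
samples = ["2", "3;", "2px", "abc", "", "x4"]
results = {s: (old_parse(s), extract_num(s)) for s in samples}

for value, (old, new) in results.items():
    print(f"{value!r}: old={old}, new={new}")
```

Values such as `"3;"` or `"2px"` used to collapse to a span of 1 and now yield their leading number, while values that do not start with a digit still fall back to 1.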

0 commit comments
