25 changes: 25 additions & 0 deletions src/core/evaluator.js
Contributor

Can args[1]?.get("ActualText") be exposed in the getOperatorList() result as well?

e.g. something like this

args = [
    args[0].name,
    args[1] instanceof Dict ? args[1].get("MCID") : null,
    args[1] instanceof Dict ? args[1].get("ActualText") : null  // <--- extra arg
];

in

pdf.js/src/core/evaluator.js

Lines 2300 to 2303 in d2a6638

args = [
    args[0].name,
    args[1] instanceof Dict ? args[1].get("MCID") : null,
];

I'm not sure whether it's a breaking change, but it's crucial for reconstructing content (e.g. SVG) from the results of getOperatorList() when not using getTextContent().
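To make the proposal concrete, here is a minimal, self-contained sketch of the extended argument list. The `Dict` class below is only a stand-in for the internal pdf.js class, and `buildMarkedContentArgs` is a hypothetical helper name; neither is meant as the actual change to evaluator.js.

```javascript
// Hypothetical stand-in for the internal pdf.js Dict class, for this sketch only.
class Dict {
  #map = new Map();
  set(key, value) {
    this.#map.set(key, value);
  }
  get(key) {
    return this.#map.get(key);
  }
}

// Illustrative helper (not a pdf.js function): builds the operator arguments
// with the proposed extra ActualText slot. Missing entries become null here,
// which is an assumption of this sketch.
function buildMarkedContentArgs(args) {
  const props = args[1] instanceof Dict ? args[1] : null;
  return [
    args[0].name,
    props ? (props.get("MCID") ?? null) : null,
    props ? (props.get("ActualText") ?? null) : null, // proposed extra arg
  ];
}
```

Appending the new value at the end would leave the positions of the existing two arguments unchanged for current consumers.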

Contributor Author

Could you file a bug and explain why it'd be useful to have such a feature?
Could it help to fix an existing issue in the current viewer?

Contributor

@fnlctrl Jun 17, 2025

@calixteman OK, I'll open another ticket for it. I don't think it's related to the current issue with the viewer.

Actually, I opened the original ticket because I got wrong text from getOperatorList(). The viewer is affected too, so I used it in the ticket because it's easier to reproduce than a code snippet.

I was building a PDF-to-SVG conversion tool with getOperatorList(). I found getTextContent() not useful for this: it only extracts text, the shape info can only be obtained from getOperatorList(), and there's no easy way to interleave the text and shapes back into the correct order from the results of both functions. So I ditched getTextContent() and use only getOperatorList() to obtain the text as well.
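A rough sketch of that use case follows. It walks the parallel `fnArray`/`argsArray` of an operator list in order, so text and shapes stay interleaved. Several things here are assumptions: the opcode numbers are stand-ins (real code should take them from the `OPS` table exported by pdf.js), the third marked-content argument is the ActualText slot proposed above and does not exist today, and the glyph objects are assumed to carry a `unicode` field. `toDrawingSequence` is a hypothetical name.

```javascript
// Stand-in opcode values for this sketch only; use the pdf.js OPS table in real code.
const OPS = {
  constructPath: 1,
  showText: 2,
  beginMarkedContentProps: 3,
  endMarkedContent: 4,
};

// Converts an operator list into an ordered sequence of text and shape entries.
// Nested marked-content sections are not handled in this sketch.
function toDrawingSequence({ fnArray, argsArray }) {
  const sequence = [];
  let actualText = null; // replacement text for the current span, if any
  let emitted = false; // whether the replacement was already emitted

  for (let i = 0; i < fnArray.length; i++) {
    const args = argsArray[i];
    switch (fnArray[i]) {
      case OPS.beginMarkedContentProps:
        // args[2] is the hypothetical ActualText argument proposed above.
        actualText = args[2] ?? null;
        emitted = false;
        break;
      case OPS.endMarkedContent:
        actualText = null;
        break;
      case OPS.showText: {
        if (actualText !== null) {
          // Emit the replacement once and skip the glyphs it stands for.
          if (!emitted) {
            sequence.push({ kind: "text", text: actualText });
            emitted = true;
          }
          break;
        }
        // Assumed glyph shape: objects with a `unicode` field, mixed with
        // numbers that encode spacing adjustments.
        const text = args[0]
          .filter(g => typeof g === "object" && g !== null)
          .map(g => g.unicode)
          .join("");
        sequence.push({ kind: "text", text });
        break;
      }
      case OPS.constructPath:
        sequence.push({ kind: "shape" });
        break;
    }
  }
  return sequence;
}
```

Without the ActualText argument, the loop could only emit the raw glyph text inside the span, which is the wrong-text problem described above.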

@@ -2406,6 +2406,7 @@ class PartialEvaluator {
transform: null,
fontName: null,
hasEOL: false,
span: "",
};

// Use a circular buffer (length === 2) to save the last chars in the
@@ -3070,6 +3071,19 @@ class PartialEvaluator {
textContentItem.str.length = 0;
}

function replaceTextContentBySpan() {
const { span, str } = textContentItem;
if (!span) {
return;
}
textContentItem.span = "";
if (/^\s+$/.test(span)) {
return;
}
str.length = 0;
str.push(span);
}

function enqueueChunk(batch = false) {
const length = textContent.items.length;
if (length === 0) {
@@ -3446,6 +3460,11 @@ class PartialEvaluator {
return;
case OPS.beginMarkedContent:
flushTextContentItem();
if (args[0]?.name === "Span") {
textContentItem.span = stringToPDFString(
args[1]?.get("ActualText") || ""
Contributor

Does this PR also fix #12237 perhaps?

Contributor

This would be a start at fixing that issue. It is the first step: getting the /ActualText into the text content. That issue asks for src/display/text_layer.js:#processItems to draw spans containing this actual text in the right places. That will also mean accumulating the text drawing that would have been done (to know the bounds of the glyphs that would be drawn) so that the bounds of the span can be calculated.
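As an illustration of the bounds accumulation mentioned here, the sketch below unions axis-aligned glyph boxes into one span box. The `[x0, y0, x1, y1]` box format and the `unionBounds` name are assumptions made for the example; this is not how text_layer.js currently works.

```javascript
// Returns the smallest box covering all glyph boxes, or null for an empty list.
// Each box is assumed to be [x0, y0, x1, y1] with x0 <= x1 and y0 <= y1.
function unionBounds(boxes) {
  if (boxes.length === 0) {
    return null;
  }
  let [minX, minY, maxX, maxY] = boxes[0];
  for (const [x0, y0, x1, y1] of boxes.slice(1)) {
    minX = Math.min(minX, x0);
    minY = Math.min(minY, y0);
    maxX = Math.max(maxX, x1);
    maxY = Math.max(maxY, y1);
  }
  return [minX, minY, maxX, maxY];
}
```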

);
}
Comment on lines +3463 to +3467
Contributor

I'm not sure what this addition does. This is for BMC, which is just a tag, so there is never an args[1], is there? beginMarkedContentProps (below) is for BDC, which is a tag and a dictionary.

if (includeMarkedContent) {
markedContentData.level++;

@@ -3457,6 +3476,11 @@ class PartialEvaluator {
break;
case OPS.beginMarkedContentProps:
flushTextContentItem();
if (args[0]?.name === "Span") {
textContentItem.span = stringToPDFString(
args[1]?.get("ActualText") || ""
);
}
if (includeMarkedContent) {
markedContentData.level++;

@@ -3474,6 +3498,7 @@ class PartialEvaluator {
}
break;
case OPS.endMarkedContent:
replaceTextContentBySpan();
flushTextContentItem();
if (includeMarkedContent) {
if (markedContentData.level === 0) {
1 change: 1 addition & 0 deletions test/pdfs/.gitignore
@@ -726,3 +726,4 @@
!chrome-text-selection-markedContent.pdf
!bug1963407.pdf
!issue19517.pdf
!issue20007.pdf
Binary file added test/pdfs/issue20007.pdf
Binary file not shown.
14 changes: 14 additions & 0 deletions test/unit/api_spec.js
@@ -3923,6 +3923,20 @@ Caron Broadcasting, Inc., an Ohio corporation (“Lessee”).`)
expect(items[1].fontName).not.toEqual(items[0].fontName);
});

it("gets text content from /ActualText", async function () {
const loadingTask = getDocument(buildGetDocumentParams("issue20007.pdf"));
Contributor

I don't really know why, but the unit test failure suggests that this can't be loaded:

TEST-UNEXPECTED-FAIL | get the text a content stream containing some ActualText | in firefox | ResponseException: Unexpected server response (404) while retrieving PDF "http://127.0.0.1:38175/test/pdfs/issue20007.pdf". in http://127.0.0.1:38175/src/shared/util.js (line 501)

Moreover, is the movement in the reference test expected?

const pdfDoc = await loadingTask.promise;
const pdfPage = await pdfDoc.getPage(1);

const { items } = await pdfPage.getTextContent({
disableNormalization: true,
});
const text = mergeText(items);
expect(text).toEqual("The quick brown fox jumps over the lazy dog");

await loadingTask.destroy();
});

it("gets empty structure tree", async function () {
const tree = await page.getStructTree();
