feat(models): add API usage to picture descriptions; unify response type #2445

Conversation
- Introduce `ApiImageResponse` and `OpenAiResponseUsage` to carry usage metadata from image API calls
- Add `DescriptionAnnotationWithUsage` to store usage alongside description text
- Change `_annotate_images` to return `Iterable[ApiImageResponse]`; update API and VLM models to comply
- Fix `ApiVlmModel` to decode responses using `response.text` instead of the raw response object

Why: enables tracking/reporting of OpenAI/VLM token usage in picture description annotations.

BREAKING CHANGE: subclasses of `PictureDescriptionBaseModel` must update `_annotate_images()` to return `ApiImageResponse`

Signed-off-by: FrigaZzz <[email protected]>
✅ DCO Check Passed. Thanks @FrigaZzz, all your commits are properly signed off. 🎉
Hi! The main issue I see with moving the usage field into the annotation type is that the export logic (the HTML and Markdown serializers) relies on type checking to verify that annotations derive from the `DescriptionAnnotation` type. These checks are implemented in several places, and they all check the annotation's concrete type. That is why I created `DescriptionAnnotationWithUsage` as a subclass, so those checks keep passing.

It's actually not a bad idea to keep the fundamental `DescriptionAnnotation` type untouched. The trade-off: this does introduce some ambiguity for end users. When developing plugin extensions or working with annotations directly, it is not obvious which concrete type they will receive. The cleanest approach would still be to add the usage field to the canonical type in `docling_core`.
All correct. I initially missed that the usage is going in the docling-core `DescriptionAnnotation`. I would anyway like to emphasize a bit more the fact that:

```python
class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str
    inference_details: dict[str, Any] = {}  # add usage here, or we could make it a BaseModel with usage and potentially other details
```
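As a concrete illustration of the dict-based option above, here is a minimal standalone sketch of how usage could be stored and read back from such an `inference_details` dict (the annotation values and the `total_tokens` helper are hypothetical, shown only to make the shape concrete):

```python
# Hypothetical annotation payload mirroring the proposed dict-based field.
annotation = {
    "kind": "description",
    "text": "A cat sitting on a windowsill.",
    "provenance": "example-vlm",  # illustrative provenance string
    "inference_details": {
        # free-form dict, so provider schema changes don't break validation
        "usage": {"prompt_tokens": 211, "completion_tokens": 32, "total_tokens": 243}
    },
}

def total_tokens(ann: dict) -> int:
    """Sum of reported tokens; 0 when the backend reported no usage."""
    return ann.get("inference_details", {}).get("usage", {}).get("total_tokens", 0)

print(total_tokens(annotation))  # → 243
print(total_tokens({}))          # → 0
```

The upside of the plain dict is exactly this tolerance of missing or extra keys; the downside, as discussed next, is that consumers get no validation or type safety.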
Good point on making it a BaseModel. Actually, I think we should go with that approach because it gives us validation, type safety, and automatic serialization without losing the flexibility we need. We can define a base class that all providers conform to:

```python
# file: docling-core/docling_core/types/doc/document.py
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""

    pass


class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""

    usage: dict[str, int]  # keep it a dict so we won't run into issues if OpenAI or other providers change their data model


class CustomProviderInferenceDetails(BaseInferenceDetails):
    """Custom provider inference details."""

    tokens_in: int
    tokens_out: int
    processing_time_ms: float
    # more fields
```

Then in `document.py`:

```python
# file: docling-core/docling_core/types/doc/document.py
class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str
    inference_details: BaseInferenceDetails = BaseInferenceDetails()
```

This way, end users still take responsibility for deserializing into their expected model (they define their own subclass), but we get validation and type safety out of the box. The callbacks remain the same: they just return the instantiated BaseModel instead of a raw dict.

Development Plan

First, we update the `docling-core` data model. Then, we attach this new version to the docling repository and update the affected picture description models.

Handling Inference Details Extraction

We add callback functions to `docling/docling/datamodel/pipeline_options.py`:
```python
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
    kind: ClassVar[Literal["api"]] = "api"

    url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
    headers: Dict[str, str] = {}
    params: Dict[str, Any] = {}
    timeout: float = 20
    concurrency: int = 1

    prompt: str = "Describe this image in a few sentences."
    provenance: str = ""

    inference_details_process_callback: Callable[[Any], BaseInferenceDetails] | None = None
    text_content_extraction_callback: Callable[[Any], str] | None = None
```

The callbacks take the raw API response and return typed models. If they are not provided, we fall back to the default behavior (the fields expected from an OpenAI-style response) for backward compatibility.

Updated `api_image_request`
```python
# file: docling/utils/api_image_request.py
from typing import Any, Callable, Dict, Optional

from PIL import Image
from pydantic import AnyUrl, BaseModel


# it could be either empty or keep at least the usage field
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""

    usage: dict[str, int]


# note: maybe we could drop OpenAiInferenceDetails and keep only BaseInferenceDetails
class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""

    pass


class ApiImageResponse(BaseModel):
    """Generic response from image-based API calls."""

    text: str
    inference_details: BaseInferenceDetails


# the current implementation
def _default_text_extraction(api_response: Any) -> str:
    """Default text extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    return api_resp.choices[0].message.content.strip()


# we assume an OpenAI-style usage field
def _default_inference_details_extraction(api_response: Any) -> BaseInferenceDetails:
    """Default inference details extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    if api_resp.usage is None:
        usage = OpenAiResponseUsage(
            prompt_tokens=0, completion_tokens=0, total_tokens=0
        )
    else:
        usage = api_resp.usage
    return OpenAiInferenceDetails(usage=usage.model_dump())


def api_image_request(
    image: Image.Image,
    prompt: str,
    url: AnyUrl,
    timeout: float = 20,
    headers: Optional[Dict[str, str]] = None,
    text_content_extraction_callback: Optional[Callable[[Any], str]] = None,
    inference_details_process_callback: Optional[Callable[[Any], BaseInferenceDetails]] = None,
    **params,
) -> ApiImageResponse:
    """Send the image request and return a typed ApiImageResponse."""
    # updated code; the same applies to api_image_request_streaming
    ...
```

Other impacted files:
Does this direction work better? Should we proceed with the BaseModel approach for `inference_details`?
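To make the callback contract above concrete, here is a minimal standalone sketch of custom extraction callbacks for a hypothetical non-OpenAI backend (the response shape, `my_text_callback`, and `my_usage_callback` are all illustrative, not part of docling's API; the real callbacks would return the typed models described above rather than plain values):

```python
import json
from typing import Any

# Hypothetical backend whose responses look like
# {"result": {"caption": "...", "usage": {...}}} -- illustrative only.
def my_text_callback(api_response: Any) -> str:
    """Extract the generated caption text from the raw response."""
    return json.loads(api_response)["result"]["caption"].strip()

def my_usage_callback(api_response: Any) -> dict:
    """Extract token usage, falling back to zeroed counters when absent,
    mirroring the default OpenAI path sketched above."""
    payload = json.loads(api_response)
    return payload["result"].get("usage") or {
        "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0
    }

raw = '{"result": {"caption": "  A red bicycle.  ", "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}}}'
raw_no_usage = '{"result": {"caption": "A dog."}}'
print(my_text_callback(raw))                      # → A red bicycle.
print(my_usage_callback(raw)["total_tokens"])     # → 13
print(my_usage_callback(raw_no_usage)["total_tokens"])  # → 0
```

Passing such callbacks via `PictureDescriptionApiOptions` would let any backend participate in usage tracking without docling knowing its response schema.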
It can indeed be very useful to expose such information when available. Some relevant context: we are currently amidst a rework of the metadata modeling in docling-core ("metadata" being a new term for what has been "annotations", i.e. derived information, e.g. a summary). As we are finalizing this work in the next couple of days, we are going to account for this use case and ensure it is adequately accommodated by the new setup.
This PR introduces standardized capture and propagation of usage metadata (token counters, etc.) from OpenAI/VLM-compatible picture description backends within docling.
Context and motivation
References: #2271, #2402, #2403
Currently, docling has no built-in way to track resource consumption (tokens, API calls) when using picture description models, which makes it difficult for users to monitor and report usage.

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on positive feedback, this PR integrates usage tracking directly into docling's model runtime.

Long-term plan: move the `usage` field into `docling_core` so it becomes part of the canonical annotation data model. This PR focuses on the runtime wiring in docling to keep changes reviewable and unblock immediate usage capture needs.

What's changed
1. New response types for image API calls

- `ApiImageResponse`: carries both generated text and optional usage metadata from image APIs
- `OpenAiResponseUsage`: represents token usage (`input_tokens`, `output_tokens`, `total_tokens`, etc.) from OpenAI-compatible backends

2. Usage metadata storage

- `DescriptionAnnotationWithUsage`: temporary wrapper enabling the runtime to attach usage metadata to each `Description` annotation produced by picture description models

3. Runtime integration

- `PictureDescriptionBaseModel._annotate_images()` now returns `Iterable[ApiImageResponse]` (previously plain text strings)
- `ApiVlmModel` updated to decode using `response.text` instead of raw response objects

4. Backward compatibility

- The `usage` field remains `None` and pipeline output is unchanged when no usage data is available

Breaking changes

Subclasses of `PictureDescriptionBaseModel` must update `_annotate_images()` to return `ApiImageResponse` instead of `str`.

Migration example:
Documentation

- Follow-up `docling_core` integration to document the usage field on description annotations
Current limitation: The canonical
Descriptionannotation type lives indocling_core. In this PR, usage is temporarily attached viaDescriptionAnnotationWithUsagein docling to validate the wiring.Proposed follow-up PRs:
docling_core: Add optionalusagefield to the canonicalDescriptionAnnotationdocling: Adopt the new core release, remove temporaryDescriptionAnnotationWithUsage, and complete end-to-end integrationTesting
Checklist

- Items pending `docling_core` integration

References