
Conversation

@FrigaZzz

This PR introduces standardized capture and propagation of usage metadata (token counters, etc.) from OpenAI/VLM-compatible picture description backends within docling.

Context and motivation

References: #2271, #2402, #2403

Currently, docling has no built-in way to track resource consumption (tokens, API calls) when using picture description models. This makes it difficult for users to:

  • Monitor and optimize API costs
  • Debug performance issues
  • Implement rate limiting or usage quotas

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on positive feedback, this PR integrates usage tracking directly into docling's model runtime.

Long-term plan: Move the usage field into docling_core so it becomes part of the canonical annotation data model. This PR focuses on the runtime wiring in docling to keep changes reviewable and unblock immediate usage capture needs.

What's changed

1. New response types for image API calls

  • ApiImageResponse: Carries both generated text and optional usage metadata from image APIs
  • OpenAiResponseUsage: Represents token usage (input_tokens, output_tokens, total_tokens, etc.) from OpenAI-compatible backends
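As a rough sketch of these two types (plain dataclasses stand in for the actual pydantic models in docling/utils/api_image_request.py, and the exact token field names here follow the OpenAI convention used later in this thread):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OpenAiResponseUsage:
    # Token counters as reported by OpenAI-compatible backends.
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

@dataclass
class ApiImageResponse:
    # Generated description text plus optional usage metadata.
    text: str
    usage: Optional[OpenAiResponseUsage] = None
```

When a backend reports no usage, the `usage` field simply stays `None`, which is what keeps pipeline output unchanged for such backends.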

2. Usage metadata storage

  • DescriptionAnnotationWithUsage: Temporary wrapper enabling the runtime to attach usage metadata to each Description annotation produced by picture description models
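The wrapper pattern can be sketched as follows; the `DescriptionAnnotation` base here is a simplified stand-in, since the real class lives in docling_core:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DescriptionAnnotation:
    # Stand-in for the canonical docling_core type
    # (the real one also carries a kind discriminator).
    text: str
    provenance: str = ""

@dataclass
class DescriptionAnnotationWithUsage(DescriptionAnnotation):
    # Optional usage payload; None when the backend reports nothing.
    usage: Optional[dict] = None
```

Because the wrapper subclasses the canonical type, existing `isinstance` checks in the docling-core serializers keep working unmodified.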

3. Runtime integration

  • PictureDescriptionBaseModel._annotate_images() now returns Iterable[ApiImageResponse] (previously plain text strings)
  • API-backed and VLM-backed picture description models updated to use the new response type and propagate usage
  • ApiVlmModel updated to decode using response.text instead of raw response objects

4. Backward compatibility

  • No behavior change if a backend doesn't report usage: the usage field remains None and pipeline output is unchanged
  • Existing pipelines continue to work without modification

Breaking changes

⚠️ Subclasses of PictureDescriptionBaseModel must update _annotate_images() to return ApiImageResponse instead of str.

Migration example:

# Before
def _annotate_images(self, images: List[Image]) -> Iterable[str]:
    return [self._describe(img) for img in images]

# After
def _annotate_images(self, images: List[Image]) -> Iterable[ApiImageResponse]:
    return [
        ApiImageResponse(text=self._describe(img), usage=self._get_usage())
        for img in images
    ]

Documentation

Limitations and next steps

Current limitation: The canonical Description annotation type lives in docling_core. In this PR, usage is temporarily attached via DescriptionAnnotationWithUsage in docling to validate the wiring.

Proposed follow-up PRs:

  1. docling_core: Add optional usage field to the canonical DescriptionAnnotation
  2. docling: Adopt the new core release, remove temporary DescriptionAnnotationWithUsage, and complete end-to-end integration
  3. Documentation: Add "Usage telemetry" section for picture descriptions; optionally add CLI/debug utilities to print usage per annotation

Testing

Checklist

  • Commit messages follow conventional commits
  • Runtime and types updated consistently
  • Breaking changes documented with migration guide
  • Public documentation (pending docling_core integration)
  • Example code updates (pending docling_core integration)

References

FrigaZzz and others added 2 commits October 11, 2025 13:38
- Introduce ApiImageResponse and OpenAiResponseUsage to carry usage metadata from image API calls
- Add DescriptionAnnotationWithUsage to store usage alongside description text
- Change _annotate_images to return Iterable[ApiImageResponse]; update API and VLM models to comply
- Fix ApiVlmModel to decode responses using response.text instead of the raw response object

Why: enables tracking/reporting of OpenAI/VLM token usage in picture description annotations.

BREAKING CHANGE: subclasses of PictureDescriptionBaseModel must update _annotate_images() to return ApiImageResponse
Signed-off-by: FrigaZzz <[email protected]>
@github-actions
Contributor

DCO Check Passed

Thanks @FrigaZzz, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Oct 11, 2025

Related Documentation

Checked 2 published document(s). No updates required.


@mergify

mergify bot commented Oct 11, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@FrigaZzz FrigaZzz changed the title feat(models): add API usage to picture descriptions; unify response type; fix VLM decoding feat(models): add API usage to picture descriptions; unify response type Oct 11, 2025
@dolfim-ibm
Contributor

@FrigaZzz my proposal is to simplify even further and add this directly to the current class. Do you see any issue with it?

@cau-git any other thought?

@codecov

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 43.58974% with 22 lines in your changes missing coverage. Please review.

| File | Patch % | Missing lines |
| --- | --- | --- |
| docling/utils/api_image_request.py | 28.57% | 15 ⚠️ |
| docling/models/picture_description_vlm_model.py | 44.44% | 5 ⚠️ |
| docling/models/api_vlm_model.py | 0.00% | 1 ⚠️ |
| docling/models/picture_description_base_model.py | 83.33% | 1 ⚠️ |


@FrigaZzz
Author

FrigaZzz commented Oct 13, 2025

> @FrigaZzz my proposal is to simplify even further and add this directly to the current class. Do you see any issue with it?
>
> @cau-git any other thought?

Hi!

The main issue I see with moving the DescriptionAnnotation class from docling-core to the docling package (and adding the usage metadata there) is that it would break the serialization logic in docling-core, causing problems with the export functionality.

The export logic (HTML and Markdown serializers) relies on type checking to verify that annotations derive from the DescriptionAnnotation base type. If we moved this class to the docling package, the isinstance() checks in docling-core would fail unless we created an unwanted reverse dependency, with docling-core depending on docling.

These checks are implemented in several places in docling-core, and they all verify isinstance(annotation, DescriptionAnnotation).

I created DescriptionAnnotationWithUsage as an extension of the base DescriptionAnnotation (which remains in docling-core). Through inheritance, all the internal instance checks in docling-core continue to work seamlessly without any modifications.

It's actually not a bad idea to keep the fundamental DescriptionAnnotation type in the core package while having a more feature-rich, extended version directly in the docling package. This follows good separation of concerns: docling-core provides the basic primitives, and docling extends them with additional functionality (like usage tracking).

The trade-off: This does introduce some ambiguity for end users. When developing plugin extensions or working with PictureDescriptionBaseModel, they might encounter DescriptionAnnotation from docling-core or DescriptionAnnotationWithUsage from docling, which can be confusing.

The cleanest approach would still be to add the usage field directly to DescriptionAnnotation in docling-core. This would:

  • Avoid having docling-core depend on docling (maintaining proper dependency direction)
  • Eliminate ambiguity by having a single, authoritative annotation type
  • Provide a cleaner API for all users

However, it requires coordinated releases of both packages (docling-core first, then docling), which is why I opted for the current approach to unblock usage tracking as soon as possible.

@dolfim-ibm
Contributor

All correct. I initially missed that the usage is going in the docling-core DescriptionAnnotation class. Given that, I still think it is worth simplifying and just having one model runtime that is potentially aware of the token usage count.

I would anyway like to emphasize a bit more the fact that usage is some model metadata. What about storing it as

class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str

    inference_details: dict[str, Any] = {}  # add usage here, or we could make it a BaseModel with usage and potentially other details
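A minimal sketch of how a runtime could populate such a field (a dataclass stands in for the pydantic model above, and the token values are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class DescriptionAnnotation:
    # Dataclass stand-in for the pydantic model sketched above.
    text: str
    provenance: str = ""
    inference_details: dict[str, Any] = field(default_factory=dict)

ann = DescriptionAnnotation(
    text="A bar chart of quarterly revenue.",
    provenance="example-model",
)
# Usage becomes just one key of generic model metadata:
ann.inference_details["usage"] = {
    "prompt_tokens": 742,
    "completion_tokens": 31,
    "total_tokens": 773,
}
```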

@FrigaZzz
Author

FrigaZzz commented Oct 14, 2025

> All correct. I initially missed that the usage is going in the docling-core DescriptionAnnotation class. Given that, I still think it is worth simplifying and just having one model runtime that is potentially aware of the token usage count.
>
> I would anyway like to emphasize a bit more the fact that usage is some model metadata. What about storing it as
>
>     class DescriptionAnnotation(BaseAnnotation):
>         """DescriptionAnnotation."""
>
>         kind: Literal["description"] = "description"
>         text: str
>         provenance: str
>
>         inference_details: dict[str, Any] = {}  # add usage here, or we could make it a BaseModel with usage and potentially other details

Good point on making it a BaseModel. Actually, I think we should go with that approach because it gives us validation, type safety, and automatic serialization without losing the flexibility we need. We can define a base class that all providers conform to:

# file: docling-core/docling_core/types/doc/document.py
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""
    pass

class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""
    usage: dict[str, int]  # let's keep it a dict so we won't run into issues if OpenAI or other providers change their data model

class CustomProviderInferenceDetails(BaseInferenceDetails):
    """Custom provider inference details."""
    tokens_in: int
    tokens_out: int
    processing_time_ms: float
    # more fields

Then in DescriptionAnnotation:

# file: docling-core/docling_core/types/doc/document.py
class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str
    inference_details: BaseInferenceDetails = BaseInferenceDetails()

This way, end users still take responsibility for deserializing into their expected model (they define their own subclass), but we get validation and type safety out of the box. The callbacks remain the same: they just return the instantiated BaseModel instead of a raw dict.
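A sketch of that round-trip, assuming pydantic v2 (the subclass name and payload are illustrative):

```python
from pydantic import BaseModel

class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""

class OpenAiInferenceDetails(BaseInferenceDetails):
    """Provider-specific subclass, defined by the consumer."""
    usage: dict[str, int]

# The runtime hands back a generic payload; the consumer re-validates
# it into the subclass it expects:
raw = {"usage": {"prompt_tokens": 12, "completion_tokens": 4, "total_tokens": 16}}
details = OpenAiInferenceDetails.model_validate(raw)
```

Validation fails loudly if the payload doesn't match the expected shape, instead of silently passing a malformed dict downstream.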

Development Plan

First, we update the DescriptionAnnotation class in the docling-core project with the new BaseInferenceDetails structure.

Then, we attach this new version to the docling repository and update PictureDescriptionBaseModel (and their extensions) to populate the inference_details field with the appropriate provider-specific model instances.

Handling Inference Details Extraction

We add callback functions to PictureDescriptionApiOptions:

docling/docling/datamodel/pipeline_options.py
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
    kind: ClassVar[Literal["api"]] = "api"
    url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
    headers: Dict[str, str] = {}
    params: Dict[str, Any] = {}
    timeout: float = 20
    concurrency: int = 1
    prompt: str = "Describe this image in a few sentences."
    provenance: str = ""
    inference_details_process_callback: Callable[[Any], BaseInferenceDetails] | None = None
    text_content_extraction_callback: Callable[[Any], str] | None = None

The callbacks take the raw API response and return typed models. If not provided, we use default behavior for backward compatibility (OpenAI expected fields).

Updated api_image_request

# file: docling/utils/api_image_request.py
from typing import Any, Callable, Dict, Optional

from PIL import Image
from pydantic import AnyUrl, BaseModel

# it could be either empty or keep at least the usage field
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""
    usage: dict[str, int]


# note: maybe could drop the OpenAiInferenceDetails and keep only the BaseInferenceDetails
class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""
    # usage: dict[str, int] 
    pass


class ApiImageResponse(BaseModel):
    """Generic response from image-based API calls."""
    text: str
    inference_details: BaseInferenceDetails


# the current implementation
def _default_text_extraction(api_response: Any) -> str:
    """Default text extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    return api_resp.choices[0].message.content.strip()

# we assume the usage field
def _default_inference_details_extraction(api_response: Any) -> BaseInferenceDetails:
    """Default inference details extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    
    if api_resp.usage is None:
        usage = OpenAiResponseUsage(
            prompt_tokens=0, completion_tokens=0, total_tokens=0
        )
    else:
        usage = api_resp.usage
    
    return OpenAiInferenceDetails(usage=usage.model_dump())


def api_image_request(
    image: Image.Image,
    prompt: str,
    url: AnyUrl,
    timeout: float = 20,
    headers: Optional[Dict[str, str]] = None,
    text_content_extraction_callback: Optional[Callable[[Any], str]] = None,
    inference_details_process_callback: Optional[Callable[[Any], BaseInferenceDetails]] = None,
    **params,
) -> ApiImageResponse:
    """
    # updated code

# the same applies to api_image_request_streaming 
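A sketch of what a custom callback could look like under this proposal; the callback name and payload shape are hypothetical, and only the inference_details_process_callback hook itself comes from the options class above:

```python
import json
from dataclasses import dataclass

@dataclass
class BaseInferenceDetails:
    # Dataclass stand-in for the proposed pydantic base class.
    usage: dict

def anthropic_style_details(api_response: str) -> BaseInferenceDetails:
    # Hypothetical callback for a backend that reports
    # input_tokens/output_tokens instead of the OpenAI field names.
    body = json.loads(api_response)
    u = body.get("usage", {})
    return BaseInferenceDetails(usage={
        "prompt_tokens": u.get("input_tokens", 0),
        "completion_tokens": u.get("output_tokens", 0),
        "total_tokens": u.get("input_tokens", 0) + u.get("output_tokens", 0),
    })

# Would be wired up roughly as:
#   PictureDescriptionApiOptions(
#       inference_details_process_callback=anthropic_style_details,
#   )
raw = '{"content": "a chart", "usage": {"input_tokens": 9, "output_tokens": 3}}'
```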

Other impacted files:

  • docling/models/picture_description_base_model.py
  • docling/models/picture_description_vlm_model.py
  • docling/models/picture_description_api_model.py
  • docling/models/api_vlm_model.py

Does this direction work better? Should we proceed with the BaseModel approach for BaseInferenceDetails?

@vagenas
Contributor

vagenas commented Oct 21, 2025

It can indeed be very useful to expose such information when available.

Some relevant context here is that we are currently amidst a rework of the metadata modeling in docling-core ("metadata" being a new term for what has been "annotations", i.e. derived information, e.g. a summary):
docling-project/docling-core#408

As we are finalizing this work in the following couple days, we are going to account for this use case and ensure it is adequately accommodated by the new setup.
