feat(models): add API usage to picture descriptions; unify response type #2445

Conversation
- Introduce `ApiImageResponse` and `OpenAiResponseUsage` to carry usage metadata from image API calls
- Add `DescriptionAnnotationWithUsage` to store usage alongside description text
- Change `_annotate_images` to return `Iterable[ApiImageResponse]`; update API and VLM models to comply
- Fix `ApiVlmModel` to decode responses using `response.text` instead of the raw response object

Why: enables tracking/reporting of OpenAI/VLM token usage in picture description annotations.

BREAKING CHANGE: subclasses of `PictureDescriptionBaseModel` must update `_annotate_images()` to return `ApiImageResponse`

Signed-off-by: FrigaZzz <[email protected]>
✅ DCO Check Passed. Thanks @FrigaZzz, all your commits are properly signed off. 🎉
Hi! The main issue I see with moving the usage field into the annotation type is that the export logic (the HTML and Markdown serializers) relies on type checking to verify that annotations derive from the `DescriptionAnnotation` type. These checks are implemented in several places, and they all check the annotation's concrete type. That is why I created `DescriptionAnnotationWithUsage` as a subclass, so those checks keep passing.

It's actually not a bad idea to keep the fundamental `DescriptionAnnotation` type untouched. The trade-off: this does introduce some ambiguity for end users. When developing plugin extensions or working with annotations directly, it is not obvious which concrete type they will receive. The cleanest approach would still be to add the usage field to the canonical type in `docling_core`.
All correct. I initially missed that the usage is going in the docling-core `DescriptionAnnotation`. I would anyway like to emphasize a bit more the fact that:

```python
class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str
    inference_details: dict[str, Any] = {}  # add usage here, or we could make it a BaseModel with usage and potentially other details
```
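As a concrete illustration of the dict-based option above, here is a minimal standalone sketch of how usage could be stored and read back from such an `inference_details` dict (the annotation values and the `total_tokens` helper are hypothetical, shown only to make the shape concrete):

```python
# Hypothetical annotation payload mirroring the proposed dict-based field.
annotation = {
    "kind": "description",
    "text": "A cat sitting on a windowsill.",
    "provenance": "example-vlm",  # illustrative provenance string
    "inference_details": {
        # free-form dict, so provider schema changes don't break validation
        "usage": {"prompt_tokens": 211, "completion_tokens": 32, "total_tokens": 243}
    },
}

def total_tokens(ann: dict) -> int:
    """Sum of reported tokens; 0 when the backend reported no usage."""
    return ann.get("inference_details", {}).get("usage", {}).get("total_tokens", 0)

print(total_tokens(annotation))  # → 243
print(total_tokens({}))          # → 0
```

The upside of the plain dict is exactly this tolerance of missing or extra keys; the downside, as discussed next, is that consumers get no validation or type safety.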
Good point on making it a BaseModel. Actually, I think we should go with that approach because it gives us validation, type safety, and automatic serialization without losing the flexibility we need. We can define a base class that all providers conform to:

```python
# file: docling-core/docling_core/types/doc/document.py
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""

    pass


class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""

    usage: dict[str, int]  # keep it a dict so we won't run into issues if OpenAI or other providers change their data model


class CustomProviderInferenceDetails(BaseInferenceDetails):
    """Custom provider inference details."""

    tokens_in: int
    tokens_out: int
    processing_time_ms: float
    # more fields
```

Then in `document.py`:

```python
# file: docling-core/docling_core/types/doc/document.py
class DescriptionAnnotation(BaseAnnotation):
    """DescriptionAnnotation."""

    kind: Literal["description"] = "description"
    text: str
    provenance: str
    inference_details: BaseInferenceDetails = BaseInferenceDetails()
```

This way, end users still take responsibility for deserializing into their expected model (they define their own subclass), but we get validation and type safety out of the box. The callbacks remain the same: they just return the instantiated BaseModel instead of a raw dict.

Development Plan

First, we update the `docling-core` data model. Then, we attach this new version to the docling repository and update the affected picture description models.

Handling Inference Details Extraction

We add callback functions to `docling/docling/datamodel/pipeline_options.py`:
```python
class PictureDescriptionApiOptions(PictureDescriptionBaseOptions):
    kind: ClassVar[Literal["api"]] = "api"

    url: AnyUrl = AnyUrl("http://localhost:8000/v1/chat/completions")
    headers: Dict[str, str] = {}
    params: Dict[str, Any] = {}
    timeout: float = 20
    concurrency: int = 1

    prompt: str = "Describe this image in a few sentences."
    provenance: str = ""

    inference_details_process_callback: Callable[[Any], BaseInferenceDetails] | None = None
    text_content_extraction_callback: Callable[[Any], str] | None = None
```

The callbacks take the raw API response and return typed models. If they are not provided, we fall back to the default behavior (the fields expected from an OpenAI-style response) for backward compatibility.

Updated `api_image_request`
```python
# file: docling/utils/api_image_request.py
from typing import Any, Callable, Dict, Optional

from PIL import Image
from pydantic import AnyUrl, BaseModel


# it could be either empty or keep at least the usage field
class BaseInferenceDetails(BaseModel):
    """Base inference details that all providers should include."""

    usage: dict[str, int]


# note: maybe we could drop OpenAiInferenceDetails and keep only BaseInferenceDetails
class OpenAiInferenceDetails(BaseInferenceDetails):
    """OpenAI-specific inference details."""

    pass


class ApiImageResponse(BaseModel):
    """Generic response from image-based API calls."""

    text: str
    inference_details: BaseInferenceDetails


# the current implementation
def _default_text_extraction(api_response: Any) -> str:
    """Default text extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    return api_resp.choices[0].message.content.strip()


# we assume an OpenAI-style usage field
def _default_inference_details_extraction(api_response: Any) -> BaseInferenceDetails:
    """Default inference details extraction for OpenAI API responses."""
    api_resp = OpenAiApiResponse.model_validate_json(api_response)
    if api_resp.usage is None:
        usage = OpenAiResponseUsage(
            prompt_tokens=0, completion_tokens=0, total_tokens=0
        )
    else:
        usage = api_resp.usage
    return OpenAiInferenceDetails(usage=usage.model_dump())


def api_image_request(
    image: Image.Image,
    prompt: str,
    url: AnyUrl,
    timeout: float = 20,
    headers: Optional[Dict[str, str]] = None,
    text_content_extraction_callback: Optional[Callable[[Any], str]] = None,
    inference_details_process_callback: Optional[Callable[[Any], BaseInferenceDetails]] = None,
    **params,
) -> ApiImageResponse:
    """Send the image request and return a typed ApiImageResponse."""
    # updated code; the same applies to api_image_request_streaming
    ...
```

Other impacted files:
Does this direction work better? Should we proceed with the BaseModel approach for `inference_details`?
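To make the callback contract above concrete, here is a minimal standalone sketch of custom extraction callbacks for a hypothetical non-OpenAI backend (the response shape, `my_text_callback`, and `my_usage_callback` are all illustrative, not part of docling's API; the real callbacks would return the typed models described above rather than plain values):

```python
import json
from typing import Any

# Hypothetical backend whose responses look like
# {"result": {"caption": "...", "usage": {...}}} -- illustrative only.
def my_text_callback(api_response: Any) -> str:
    """Extract the generated caption text from the raw response."""
    return json.loads(api_response)["result"]["caption"].strip()

def my_usage_callback(api_response: Any) -> dict:
    """Extract token usage, falling back to zeroed counters when absent,
    mirroring the default OpenAI path sketched above."""
    payload = json.loads(api_response)
    return payload["result"].get("usage") or {
        "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0
    }

raw = '{"result": {"caption": "  A red bicycle.  ", "usage": {"prompt_tokens": 9, "completion_tokens": 4, "total_tokens": 13}}}'
raw_no_usage = '{"result": {"caption": "A dog."}}'
print(my_text_callback(raw))                      # → A red bicycle.
print(my_usage_callback(raw)["total_tokens"])     # → 13
print(my_usage_callback(raw_no_usage)["total_tokens"])  # → 0
```

Passing such callbacks via `PictureDescriptionApiOptions` would let any backend participate in usage tracking without docling knowing its response schema.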
It can indeed be very useful to expose such information when available. Some relevant context: we are currently amidst a rework of the metadata modeling in docling-core ("metadata" being a new term for what has been "annotations", i.e. derived information, e.g. a summary). As we are finalizing this work in the next couple of days, we are going to account for this use case and ensure it is adequately accommodated by the new setup.
This PR introduces standardized capture and propagation of usage metadata (token counters, etc.) from OpenAI/VLM-compatible picture description backends within docling.
Context and motivation
References: #2271, #2402, #2403
Currently, docling has no built-in way to track resource consumption (tokens, API calls) when using picture description models, which makes it difficult for users to monitor and report usage.

Initial work was validated as a third-party plugin (#2403) to test the end-to-end flow without modifying core. Based on positive feedback, this PR integrates usage tracking directly into docling's model runtime.

Long-term plan: move the `usage` field into `docling_core` so it becomes part of the canonical annotation data model. This PR focuses on the runtime wiring in docling to keep changes reviewable and unblock immediate usage capture needs.

What's changed
1. New response types for image API calls

- `ApiImageResponse`: carries both generated text and optional usage metadata from image APIs
- `OpenAiResponseUsage`: represents token usage (`input_tokens`, `output_tokens`, `total_tokens`, etc.) from OpenAI-compatible backends

2. Usage metadata storage

- `DescriptionAnnotationWithUsage`: temporary wrapper enabling the runtime to attach usage metadata to each `Description` annotation produced by picture description models

3. Runtime integration

- `PictureDescriptionBaseModel._annotate_images()` now returns `Iterable[ApiImageResponse]` (previously plain text strings)
- `ApiVlmModel` updated to decode using `response.text` instead of raw response objects

4. Backward compatibility

- The `usage` field remains `None` and pipeline output is unchanged when no usage data is available

Breaking changes

Subclasses of `PictureDescriptionBaseModel` must update `_annotate_images()` to return `ApiImageResponse` instead of `str`.

Migration example:
Documentation

- Follow-up `docling_core` integration to document the usage field on description annotations
Current limitation: The canonical
Descriptionannotation type lives indocling_core. In this PR, usage is temporarily attached viaDescriptionAnnotationWithUsagein docling to validate the wiring.Proposed follow-up PRs:
docling_core: Add optionalusagefield to the canonicalDescriptionAnnotationdocling: Adopt the new core release, remove temporaryDescriptionAnnotationWithUsage, and complete end-to-end integrationTesting
Checklist

- Items pending `docling_core` integration

References