Skip to content

feat: add Cohere Image Document Embedder #2190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Aug 18, 2025
Merged

feat: add Cohere Image Document Embedder #2190

merged 12 commits into from
Aug 18, 2025

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Aug 14, 2025

Related Issues

Proposed Changes:

  • implemented a new component which uses Cohere models to embed image-based documents (sync + async)

How did you test it?

CI, several new tests

Checklist

@github-actions github-actions bot added integration:cohere type:documentation Improvements or additions to documentation labels Aug 14, 2025
@anakin87 anakin87 changed the title Cohere Image Document Embedder - draft feat: add Cohere Image Document Embedder Aug 14, 2025
@anakin87 anakin87 marked this pull request as ready for review August 14, 2025 14:05
@anakin87 anakin87 requested a review from a team as a code owner August 14, 2025 14:05
@anakin87 anakin87 requested review from sjrl and Amnah199 and removed request for a team August 14, 2025 14:05
Comment on lines 311 to 312
async for doc, image in tqdm_async(
zip(documents, images_to_embed), desc="Embedding images", disable=not self.progress_bar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this raise a TypeError as zip is a sync iterator. Maybe we can use await inside the loop and remove async with for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right. No need for async for here

:raises RuntimeError:
If the conversion of some documents fails.
"""
if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if user passes string somewhere in the middle, this break in the mid of processing.

Suggested change
if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)):
if not isinstance(documents, list) or not all(isinstance(d, Document) for d in documents):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's a good point. In other places we opted to only do a soft check here b/c users could theoretically pass a large amount of docs and doing a isinstance check on all of them could be uncessarily expensive (although tbh not sure how expensive it would be)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @sjrl is right about the motivation, but I extended the check


embedding = getattr(response.embeddings, self.embedding_type.value)[0]
except Exception as e:
msg = f"Error embedding Document {doc.id}. The Document will be skipped. \nException: {e}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make it more clear that we are returning the document with embedding=None. Skipped is a little ambiguous. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anakin87 I may advocate for just letting the exception raise here since that how most of our Document Embedders work (only exception is AzureDocumentEmbedder). If an embedding fails then the API error is raised so users can decide what to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now letting the exception raise


embedding = getattr(response.embeddings, self.embedding_type.value)[0]
except Exception as e:
msg = f"Error embedding Document {doc.id}. The Document will be skipped. \nException: {e}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now letting the exception raise

api_key: Secret = Secret.from_env_var(["COHERE_API_KEY", "CO_API_KEY"]),
model: str = "embed-v4.0",
api_base_url: str = "https://api.cohere.com",
timeout: int = 120,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout is a float type in the ClientV2 object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, also in other Cohere embedders. (I think that mypy was not complaining because int and float are compatible)

@anakin87 anakin87 requested review from sjrl and Amnah199 August 18, 2025 13:27
Comment on lines +10 to +13
_batch_convert_pdf_pages_to_images,
_encode_image_to_base64,
_extract_image_sources_info,
_PDFPageInfo,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing to do now, but let's keep an eye on how often these are used in integrations. If we find that these are necessary for users to use to make custom image embedders we may want to consider making them public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or create some sort of public interface that uses methods internally.

Copy link
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just one minor comment about an additional integration test.

Copy link
Contributor

@Amnah199 Amnah199 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG!

@anakin87 anakin87 merged commit 129f948 into main Aug 18, 2025
11 checks passed
@anakin87 anakin87 deleted the cohere-img-embedder branch August 18, 2025 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration:cohere type:documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants