-
Notifications
You must be signed in to change notification settings - Fork 181
feat: add Cohere Image Document Embedder #2190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
async for doc, image in tqdm_async( | ||
zip(documents, images_to_embed), desc="Embedding images", disable=not self.progress_bar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this raise a TypeError as zip is a sync iterator. Maybe we can use await inside the loop and remove async with for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right. No need for async for
here
:raises RuntimeError: | ||
If the conversion of some documents fails. | ||
""" | ||
if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if user passes string somewhere in the middle, this break in the mid of processing.
if not isinstance(documents, list) or (documents and not isinstance(documents[0], Document)): | |
if not isinstance(documents, list) or not all(isinstance(d, Document) for d in documents): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's a good point. In other places we opted to only do a soft check here b/c users could theoretically pass a large amount of docs and doing a isinstance check on all of them could be uncessarily expensive (although tbh not sure how expensive it would be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @sjrl is right about the motivation, but I extended the check
|
||
embedding = getattr(response.embeddings, self.embedding_type.value)[0] | ||
except Exception as e: | ||
msg = f"Error embedding Document {doc.id}. The Document will be skipped. \nException: {e}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can make it more clear that we are returning the document with embedding=None. Skipped is a little ambiguous. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@anakin87 I may advocate for just letting the exception raise here since that how most of our Document Embedders work (only exception is AzureDocumentEmbedder). If an embedding fails then the API error is raised so users can decide what to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now letting the exception raise
...ions/cohere/src/haystack_integrations/components/embedders/cohere/document_image_embedder.py
Show resolved
Hide resolved
|
||
embedding = getattr(response.embeddings, self.embedding_type.value)[0] | ||
except Exception as e: | ||
msg = f"Error embedding Document {doc.id}. The Document will be skipped. \nException: {e}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now letting the exception raise
...ions/cohere/src/haystack_integrations/components/embedders/cohere/document_image_embedder.py
Show resolved
Hide resolved
api_key: Secret = Secret.from_env_var(["COHERE_API_KEY", "CO_API_KEY"]), | ||
model: str = "embed-v4.0", | ||
api_base_url: str = "https://api.cohere.com", | ||
timeout: int = 120, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout
is a float type in the ClientV2
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, also in other Cohere embedders. (I think that mypy was not complaining because int
and float
are compatible)
_batch_convert_pdf_pages_to_images, | ||
_encode_image_to_base64, | ||
_extract_image_sources_info, | ||
_PDFPageInfo, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing to do now, but let's keep an eye on how often these are used in integrations. If we find that these are necessary for users to use to make custom image embedders we may want to consider making them public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or create some sort of public interface that uses methods internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just one minor comment about an additional integration test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG!
Related Issues
CohereDocumentImageEmbedder
#2134Proposed Changes:
How did you test it?
CI, several new tests
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.