wip feat(Documents): add support for classification #711
base: feat/documents/add-support-for-digitization
@@ -186,3 +186,54 @@ class DigitizationResult(BaseModel):
    document_text: str = Field(alias="documentText")
    project_id: str = Field(alias="projectId")
    project_type: ProjectType = Field(alias="projectType")


class Reference(BaseModel):
    model_config = ConfigDict(
        serialize_by_alias=True,
        validate_by_alias=True,
    )

    text_start_index: int = Field(alias="TextStartIndex")
    text_length: int = Field(alias="TextLength")
    tokens: List[str] = Field(alias="Tokens")


class DocumentBounds(BaseModel):
    model_config = ConfigDict(
        serialize_by_alias=True,
        validate_by_alias=True,
    )

    start_page: int = Field(alias="StartPage")
    page_count: int = Field(alias="PageCount")
    text_start_index: int = Field(alias="TextStartIndex")
    text_length: int = Field(alias="TextLength")
    page_range: int = Field(alias="PageRange")


class ClassificationResult(BaseModel):
    model_config = ConfigDict(
        serialize_by_alias=True,
        validate_by_alias=True,
    )

    document_id: str = Field(alias="DocumentId")
    document_type_id: str = Field(alias="DocumentTypeId")
    confidence: float = Field(alias="Confidence")
    ocr_confidence: float = Field(alias="OcrConfidence")
    reference: Reference = Field(alias="Reference")
    document_bounds: DocumentBounds = Field(alias="DocumentBounds")
    classifier_name: str = Field(alias="ClassifierName")
    project_id: str = Field(alias="ProjectId")


class ClassificationResponse(BaseModel):
This is not used. I guess the purpose of this class was to be used as response type for the

Yes, I forgot to delete it. At first, I thought about returning the entire response, but I think it’s more intuitive to just return a list of classification results, to keep it consistent with the other functions.
    model_config = ConfigDict(
        serialize_by_alias=True,
        validate_by_alias=True,
    )

    classification_results: List[ClassificationResult] = Field(
        alias="classificationResults"
    )
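For readers following the review, here is a minimal sketch of how alias-based models like the ones in this diff could parse a PascalCase payload and serialize it back. The payload values are invented for illustration, `ClassificationResult` is trimmed to four fields, and the `model_config` flags from the diff are left out: validation by alias is pydantic's default, and serialization by alias is requested explicitly with `by_alias=True`. It assumes pydantic v2 is installed.

```python
from typing import List

from pydantic import BaseModel, Field


class Reference(BaseModel):
    # Same fields and aliases as the Reference model in the diff.
    text_start_index: int = Field(alias="TextStartIndex")
    text_length: int = Field(alias="TextLength")
    tokens: List[str] = Field(alias="Tokens")


class ClassificationResult(BaseModel):
    # Trimmed to four fields for brevity; the diff defines eight.
    document_id: str = Field(alias="DocumentId")
    document_type_id: str = Field(alias="DocumentTypeId")
    confidence: float = Field(alias="Confidence")
    reference: Reference = Field(alias="Reference")


# Hypothetical payload: keys follow the aliases, values are made up.
payload = {
    "DocumentId": "doc-1",
    "DocumentTypeId": "invoice",
    "Confidence": 0.92,
    "Reference": {
        "TextStartIndex": 0,
        "TextLength": 7,
        "Tokens": ["Invoice"],
    },
}

# Input keys are matched against the aliases; attributes use snake_case.
result = ClassificationResult.model_validate(payload)

# Dumping by alias restores the PascalCase keys of the original payload.
round_trip = result.model_dump(by_alias=True)
```

After parsing, fields are read through their snake_case names (for example `result.reference.tokens`), while the dumped dictionary keeps the wire-format keys.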
Painful to read through this. We really need to split this into more granular functionality per project type. But leaving that aside, isn't it missing some invalid cases, e.g. passing digitization with an IXP project and also a classification result?
Yes, I refined these checks in a later PR, so I also need to update them here. But as you said, there are lots of conditions and branches, and it's even worse trying to describe them in words, like in docstrings.
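One way to keep such checks readable is to express each invalid combination as a named rule rather than a nested branch. The sketch below is not the PR's implementation: `Inputs`, `Rule`, `RULES`, `find_violations`, and the `OTHER` enum member are hypothetical names, and the only rule encoded is the invalid case mentioned in the review (digitization passed with an IXP project together with a classification result).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class ProjectType(Enum):
    IXP = "IXP"
    OTHER = "OTHER"  # hypothetical placeholder for any non-IXP project type


@dataclass(frozen=True)
class Inputs:
    # Hypothetical summary of what the caller passed in.
    project_type: ProjectType
    has_digitization_result: bool
    has_classification_result: bool


@dataclass(frozen=True)
class Rule:
    # The description doubles as documentation for the check.
    description: str
    is_violated: Callable[[Inputs], bool]


RULES: List[Rule] = [
    Rule(
        "IXP project: digitization and classification result passed together",
        lambda i: i.project_type is ProjectType.IXP
        and i.has_digitization_result
        and i.has_classification_result,
    ),
]


def find_violations(inputs: Inputs) -> List[str]:
    """Return the description of every rule the inputs violate."""
    return [rule.description for rule in RULES if rule.is_violated(inputs)]
```

Because each rule carries its own description, the rule list can serve as the written specification, which may ease the docstring problem raised above. Adding a case means appending one `Rule` instead of extending a branch.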