-
Notifications
You must be signed in to change notification settings - Fork 3
Vector based topics tagging for videos #2649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
OpenAPI ChangesShow/hide No detectable change.Unexpected changes? Ensure your branch is up-to-date with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works great, just a couple minor suggestions
| mocker.patch( | ||
| "learning_resources_search.plugins.get_similar_topics_qdrant", | ||
| return_value=["topic1", "topic2"], | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpick, this could be a fixture since it returns the same 2 topics in all the above tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call. I made it a fixture.
learning_resources_search/api.py
Outdated
| from vector_search.encoders.utils import dense_encoder | ||
| from vector_search.utils import qdrant_client, vector_point_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to top of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the dense_encoder import to the top but the imports from vector_search.utils were causing a circular import error
| list of learning resources | ||
| """ | ||
| hits = _qdrant_similar_results(value_doc, num_resources) | ||
| from vector_search.utils import vector_point_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to the top of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment about circular import above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/4981
Description (What does it do?)
This PR switches us to using vector embeddings to tag videos with topics.
The changes also include:
How can this be tested?
python manage.py sync_topic_embeddingsto create the topics collection in qdrantpython manage.py backpopulate_youtube_data --fetchand then inspect the tags it assigns via the following:Additional Context
the topics look much better than the previously used opensearch topic assignment however - there is a minor issue (which manifests differently depending on the embedding model) where sometimes broader topics end up being preferred over more specific ones - "hubness problem" or "centroid bias" in embeddings - we can consider a follow up change to blacklist/de-rank more general topics within the subtopics or some other solution. You may see topics like "philosophy" or "pedagogy" being preferred over a more specific options.