@shanbady
Contributor

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/4981

Description (What does it do?)

This PR switches us to using vector embeddings to tag videos with topics.
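As a rough illustration of the idea (not the code in this PR), embedding-based tagging amounts to comparing a resource's vector against each topic's vector and keeping the closest matches. The toy vectors, topic names, and the `nearest_topics` helper below are all hypothetical; in the real system the vectors come from an embedding model and the search is done by qdrant.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_topics(resource_vector, topic_vectors, limit=2):
    """Return the names of the `limit` topics most similar to the resource."""
    scored = [
        (name, cosine_similarity(resource_vector, vector))
        for name, vector in topic_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:limit]]


# Hypothetical 3-dimensional embeddings, for illustration only.
TOPIC_VECTORS = {
    "Linear Algebra": [0.9, 0.1, 0.0],
    "Organic Chemistry": [0.0, 0.2, 0.9],
    "Probability": [0.7, 0.6, 0.1],
}
video_vector = [0.8, 0.3, 0.0]

tags = nearest_topics(video_vector, TOPIC_VECTORS, limit=2)
print(tags)
```

A vector database performs the same nearest-neighbour lookup at scale, which is why the topics need to be stored in their own collection first.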

The changes also include:

  • the creation of a new "topics" collection in qdrant that stores subtopics
  • the addition of a "sync_topic_embeddings" management command that syncs topics from the database into qdrant
  • a periodic (daily) vector_search.tasks.sync_topics celery task to keep topics in sync
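For the daily task in the last bullet, a Celery beat entry might look roughly like the sketch below. Only the task path is taken from the description above; the `CELERY_BEAT_SCHEDULE` setting name (the usual convention when Celery is configured with a `CELERY` namespace in Django) and the entry key are assumptions, and the PR may well schedule the task differently, for example with a crontab.

```python
from datetime import timedelta

# Hypothetical settings fragment: run the topic sync once every 24 hours.
CELERY_BEAT_SCHEDULE = {
    "sync-topic-embeddings-daily": {  # entry name is an assumption
        "task": "vector_search.tasks.sync_topics",
        "schedule": timedelta(days=1),
    },
}
```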

How can this be tested?

  1. check out this branch and restart celery
  2. run python manage.py sync_topic_embeddings to create the topics collection in qdrant
  3. verify in the qdrant dashboard that the topics collection exists and is populated (it should contain only sub-topics, i.e. topics that have a parent)
  4. you can test the topic tagging for any resource manually by running the following in the Django shell:
from learning_resources.models import ContentFile, LearningResource
from learning_resources.utils import similar_topics_action

# Learning resources that have at least one transcript file (.srt or .vtt)
transcript_resource_ids = ContentFile.objects.filter(
    file_extension__in=[".srt", ".vtt"]
).values_list("run__learning_resource__id", flat=True)
lrs = LearningResource.objects.filter(id__in=transcript_resource_ids)
for lr in lrs:
    print(f"topics for '{lr.title}' --- ", [topic["name"] for topic in similar_topics_action(lr)])
  5. additionally, you can set the YOUTUBE_DEVELOPER_KEY and YOUTUBE_CONFIG_URL settings, fetch youtube videos via python manage.py backpopulate_youtube_data --fetch, and then inspect the tags it assigns via the following:
from learning_resources.models import Video

# Inspect the 100 most recently created videos
for video in Video.objects.all().order_by("-created_on")[:100]:
    lr = video.learning_resource
    print(f"topics for '{lr.title}' --- ", [topic.name for topic in lr.topics.all()], "\n\n")

Additional Context

The topics look much better than those from the previously used opensearch topic assignment. However, there is a minor issue (which manifests differently depending on the embedding model): broader topics sometimes end up being preferred over more specific ones, the "hubness problem" or "centroid bias" in embeddings. You may see topics like "philosophy" or "pedagogy" being preferred over more specific options. We can consider a follow-up change to blacklist or de-rank the more general topics within the subtopics, or some other solution.
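One possible shape for the de-ranking follow-up mentioned here is a score penalty applied to topics flagged as overly general. This is a sketch of the idea only, not something implemented in this PR: the `rerank_with_penalty` helper, the penalty value, the "Ethics" and "Logic" topics, and all scores are hypothetical.

```python
def rerank_with_penalty(hits, broad_topics, penalty=0.1):
    """Subtract a fixed penalty from broad topics, then re-sort by score.

    `hits` is a list of (topic_name, similarity_score) pairs.
    """
    adjusted = []
    for name, score in hits:
        if name in broad_topics:
            score -= penalty
        adjusted.append((name, score))
    return sorted(adjusted, key=lambda hit: hit[1], reverse=True)


# Hypothetical search results in which broad topics edge out specific ones.
hits = [
    ("Philosophy", 0.62),
    ("Ethics", 0.60),
    ("Pedagogy", 0.58),
    ("Logic", 0.55),
]
BROAD_TOPICS = {"Philosophy", "Pedagogy"}

reranked = rerank_with_penalty(hits, BROAD_TOPICS, penalty=0.1)
print([name for name, _ in reranked])
```

A penalty keeps the broad topics available as a fallback when nothing more specific scores well, whereas an outright blacklist would remove them entirely.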

@shanbady shanbady changed the title from "Shanbady/video topics" to "Vector bases topics tagging for videos" Oct 28, 2025
@github-actions

github-actions bot commented Oct 28, 2025

OpenAPI Changes

No detectable change.

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

@shanbady shanbady marked this pull request as ready for review October 28, 2025 15:42
@shanbady shanbady added the "Needs Review" label (An open Pull Request that is ready for review) Oct 28, 2025
@shanbady shanbady changed the title from "Vector bases topics tagging for videos" to "Vector based topics tagging for videos" Oct 28, 2025
@mbertrand mbertrand self-assigned this Oct 29, 2025
Member

@mbertrand mbertrand left a comment

Works great, just a couple minor suggestions

Comment on lines 1917 to 1920
mocker.patch(
    "learning_resources_search.plugins.get_similar_topics_qdrant",
    return_value=["topic1", "topic2"],
)
Member

Minor nitpick, this could be a fixture since it returns the same 2 topics in all the above tests

Contributor Author

good call. I made it a fixture.

Comment on lines 839 to 840
from vector_search.encoders.utils import dense_encoder
from vector_search.utils import qdrant_client, vector_point_id
Member

Move to top of the file

Contributor Author

I moved the dense_encoder import to the top, but the imports from vector_search.utils were causing a circular import error.
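For readers unfamiliar with the pattern, a function-level (deferred) import is a common way to sidestep a circular import, and the sketch below shows why it works. The `mod_a` and `mod_b` modules are hypothetical stand-ins written to a temporary directory; they are not the modules in this PR.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# mod_a imports mod_b at the top of the file.
MODULE_A = '''
from mod_b import helper_b

def helper_a():
    return "a"

def use_b():
    return helper_b()
'''

# mod_b needs mod_a too, but defers the import until the function is called,
# by which time mod_a has finished initializing.
MODULE_B = '''
def helper_b():
    from mod_a import helper_a
    return helper_a() + "b"
'''


def run_demo():
    """Write both modules to a temp dir, import mod_a, and call through the cycle."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "mod_a.py").write_text(MODULE_A)
        Path(tmp, "mod_b.py").write_text(MODULE_B)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            mod_a = importlib.import_module("mod_a")
            return mod_a.use_b()
        finally:
            # Clean up so the demo leaves no trace in the interpreter state.
            sys.path.remove(tmp)
            sys.modules.pop("mod_a", None)
            sys.modules.pop("mod_b", None)


result = run_demo()
print(result)
```

Had `mod_b` imported `helper_a` at module level instead, importing `mod_a` would fail because `helper_a` would not yet be defined when `mod_b` runs.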

list of learning resources
"""
hits = _qdrant_similar_results(value_doc, num_resources)
from vector_search.utils import vector_point_id
Member

Move to the top of the file

Contributor Author

see comment about circular import above

@mbertrand mbertrand added the "Waiting on author" label and removed the "Needs Review" label Oct 29, 2025
@shanbady shanbady added the "Needs Review" label Oct 29, 2025
@shanbady shanbady requested a review from mbertrand October 29, 2025 15:06
Member

@mbertrand mbertrand left a comment

👍

@shanbady shanbady merged commit e1de652 into main Oct 29, 2025
13 checks passed
@shanbady shanbady deleted the shanbady/video-topics branch October 29, 2025 15:54
@odlbot odlbot mentioned this pull request Oct 29, 2025
4 tasks
Labels

Needs Review An open Pull Request that is ready for review

3 participants