Pipeline refactor #573

rtibbles · 2025-02-20T00:55:36Z

Summary

Refactors the vast majority of file handling into a staged pipeline that can take any path as input and run appropriate processing on the file.
Three stages at the moment are Download, Convert/Validate, and Extract Metadata.
Adds validation for audio and video files and zip based file types.
Adds a mostly unused extract metadata stage that mostly infers presets for now, but could extract additional metadata in future.
The pipeline and stages are extensible, so thumbnail generation could be added later as an additional stage as well as integration with other tooling to generate descriptions, subtitles etc.

References

Prerequisite work for #558

Reviewer guidance

Do all the tests pass?
Do the places where the tests have been changed make sense?
Does the overall structure of the pipeline feel understandable?

N.B. I have deliberately skipped the HTML5 dependency test case, which currently fails, as the example zip does not have an index.html.

We will need some slightly more nuanced validation to allow this to pass in future - but as this is very rarely used this seems acceptable for now.

See: https://github.com/search?q=org%3Alearningequality%20HTML5_DEPENDENCY_ZIP&type=code

Only one instance is of actual active code in a content integration script - one other is from commented out code using the archived imscp project, and the others are either ricecooker, le-utils, or one of these in an inadvertently committed virtualenv.

Further work on this can happen when this PR is finalized: #468

bjester · 2025-02-20T17:08:56Z

ricecooker/utils/pipeline/__init__.py

+        DownloadStageHandler,
+        ConversionStageHandler,
+        ExtractMetadataStageHandler,
+    ]


I was thinking about this more and the workaround you explained for passing options to handlers like the video compression handler. I think that raises the idea that overall, the architecture could benefit from a looser design, which uses a factory to construct these, rather than statically defining them. That would allow for better composition and customization of different handlers.

For example, in Morango, something like that isn't really necessary for the operations because their role is very rigid. In curriculum automation, that flexibility is useful, and so I used a factory pattern to define the 'default' behavior and setup. The combination provides both the ready-to-run approach and the flexibility to adapt or add later on without explicitly modifying the existing code.

In an earlier iteration, I deferred instantiation of the handler classes, and iterated over the classes instead of the instantiated objects. Then each time a handler was instantiated it was with a specific path attribute and context object.

In that way, the pipeline was kind of its own factory.

I do think being able to add children dynamically to the StageHandler class and Pipeline class would be helpful (at init), so I can tweak that somewhat.

Have a factory function for each stage, with a default, and then an overall default pipeline that invokes those factory functions too.

I am also thinking about how I can surface the requisiteness of some parts of the context earlier in the pipeline, and I think I have a way to do it that should not be too onerous.

We already have the tree structure of each stage to play with, and for any particular path that gets input, if we can just report what kind of file will be outputted by it, then we can infer its entire path through the pipeline, and then warn early if arguments are missing.

I will implement both of these and do a force push.

Actually, thinking it through a bit more, the 'output prediction' is going to get squirrelly fast, especially when I add Google Drive download which could be hiding any file type behind its URL. So I'll just focus on the factory functions.

OK, I think I've done this in a fairly minimal way - the init methods can now optionally take a children argument, and the CHILDREN attribute is changed to DEFAULT_CHILDREN which is used in the case that the children init argument is not used.

Seems to work neatly, and provides some flexibility without having to subclass.

Will force push my local rebase once #574 is merged, as I have also rebased on that to fix youtube downloading issues I discovered with the refactor.

Copilot

Pull Request Overview

This PR refactors file handling into a staged pipeline that processes files through Download, Convert/Validate, and Extract Metadata stages. Key changes include updating tests to use new pipeline APIs (e.g. adding an ffmpeg_settings parameter and replacing some file class invocations), reorganizing caching and file transfer logic, and introducing new extract metadata functionality for various file types.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/test_youtube.py	Added new fixtures and tests for YouTube subtitle processing.
tests/test_videos.py	Updated video file creation to pass ffmpeg_settings consistently.
tests/test_thumbnails.py	Replaced DocumentFile with EPubFile for EPUB tests.
tests/test_files.py	Refactored tests to patch new compress paths and updated cache key tests.
ricecooker/utils/zip.py	Removed try/except around zipfile.ZipFile, affecting error handling.
ricecooker/utils/pipeline/*	Added new pipeline stages and refactored file handler implementations.
ricecooker/config.py and chefs.py	Integrated the new FilePipeline into the chef commands.

Comments suppressed due to low confidence (2)

tests/test_files.py:891

The error message contains a typo ('mising'); please correct it to 'missing'.

assert os.path.exists(localpath), "Error mising local test file " + localpath

ricecooker/utils/zip.py:36

The removal of the try/except block means that invalid zip archives will now raise uncaught exceptions; consider reintroducing error handling to provide a clearer error message.

inputzip = zipfile.ZipFile(path)

tests/test_youtube.py

marcellamaki

Co-reviewed with Richard and then re-reviewed separately.

Overall, the approach seems like a good plan in terms of the management of responsibilities. Richard will address Blaine's remaining feedback about ffmpeg/flexibility in follow up.

Since there is pre-existing work adding in additional test coverage, we can be fairly confident that the behavior has been retained in the refactor.

marcellamaki · 2025-04-09T14:19:32Z

tests/pipeline/test_transfer.py

+
+
+content_disposition_filename_cases = [
+    ('Content-Disposition: attachment; filename="example.jpg"', "example.jpg"),


is the content disposition always either an attachment or inline?

Probably? This test was pre-existing, I just moved it.

Tweak existing tests and fixtures to make sure correct File classes are used.

rtibbles · 2025-04-09T19:56:38Z

Follow up issue to make the pipeline more easily configurable: #592

rtibbles force-pushed the pipeline branch 2 times, most recently from 97a1d15 to 3ea7647 Compare February 20, 2025 02:34

bjester reviewed Feb 20, 2025

View reviewed changes

rtibbles mentioned this pull request Feb 20, 2025

Add vcr for youtube test #574

Merged

rtibbles force-pushed the pipeline branch from 3ea7647 to f08f78a Compare February 20, 2025 23:39

rtibbles mentioned this pull request Feb 21, 2025

Google Drive handler integration #577

Merged

rtibbles assigned marcellamaki Mar 11, 2025

rtibbles requested a review from Copilot March 26, 2025 20:17

Copilot AI reviewed Mar 26, 2025

View reviewed changes

tests/test_youtube.py Outdated Show resolved Hide resolved

marcellamaki approved these changes Apr 9, 2025

View reviewed changes

rtibbles added 3 commits April 9, 2025 12:49

Move most youtube specific tests into the test_youtube.py file.

51b9864

Restrict graphie files to only be web based.

6e3a1ef

Tweak existing tests and fixtures to make sure correct File classes are used.

Refactor file processing to primarily happen through a single pipeline.

41dad80

rtibbles force-pushed the pipeline branch from f08f78a to 41dad80 Compare April 9, 2025 19:50

rtibbles merged commit 5502dff into learningequality:develop Apr 9, 2025
16 checks passed

rtibbles deleted the pipeline branch April 9, 2025 20:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pipeline refactor #573

Pipeline refactor #573

Uh oh!

rtibbles commented Feb 20, 2025 •

edited

Loading

Uh oh!

bjester Feb 20, 2025

Uh oh!

rtibbles Feb 20, 2025

Uh oh!

rtibbles Feb 20, 2025

Uh oh!

rtibbles Feb 20, 2025

Uh oh!

rtibbles Feb 20, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

marcellamaki left a comment

Uh oh!

marcellamaki Apr 9, 2025

Uh oh!

rtibbles Apr 9, 2025

Uh oh!

rtibbles commented Apr 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants



		content_disposition_filename_cases = [
		('Content-Disposition: attachment; filename="example.jpg"', "example.jpg"),

Pipeline refactor #573

Pipeline refactor #573

Uh oh!

Conversation

rtibbles commented Feb 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

References

Reviewer guidance

Uh oh!

bjester Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

marcellamaki left a comment

Choose a reason for hiding this comment

Uh oh!

marcellamaki Apr 9, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles Apr 9, 2025

Choose a reason for hiding this comment

Uh oh!

rtibbles commented Apr 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rtibbles commented Feb 20, 2025 •

edited

Loading