Skip to content

Conversation

@rtibbles
Copy link
Member

@rtibbles rtibbles commented Feb 20, 2025

Summary

  • Refactors the vast majority of file handling into a staged pipeline that can take any path as input and run appropriate processing on the file.
  • Three stages at the moment are Download, Convert/Validate, and Extract Metadata.
  • Adds validation for audio and video files and zip based file types.
  • Adds a mostly unused extract metadata stage that mostly infers presets for now, but could extract additional metadata in future.
  • The pipeline and stages are extensible, so thumbnail generation could be added later as an additional stage as well as integration with other tooling to generate descriptions, subtitles etc.

References

Prerequisite work for #558

Reviewer guidance

Do all the tests pass?
Do the places where the tests have been changed make sense?
Does the overall structure of the pipeline feel understandable?

N.B. I have deliberately skipped the HTML5 dependency test case, which currently fails, as the example zip does not have an index.html.

We will need some slightly more nuanced validation to allow this to pass in future - but as this is very rarely used this seems acceptable for now.

See: https://github.com/search?q=org%3Alearningequality%20HTML5_DEPENDENCY_ZIP&type=code

Only one instance is of actual active code in a content integration script - one other is from commented out code using the archived imscp project, and the others are either ricecooker, le-utils, or one of these in an inadvertently committed virtualenv.

Further work on this can happen when this PR is finalized: #468

@rtibbles rtibbles force-pushed the pipeline branch 2 times, most recently from 97a1d15 to 3ea7647 Compare February 20, 2025 02:34
DownloadStageHandler,
ConversionStageHandler,
ExtractMetadataStageHandler,
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this more and the workaround you explained for passing options to handlers like the video compression handler. I think that raises the idea that overall, the architecture could benefit from a looser design, which uses a factory to construct these, rather than statically defining them. That would allow for better composition and customization of different handlers.

For example, in Morango, something like that isn't really necessary for the operations because their role is very rigid. In curriculum automation, that flexibility is useful, and so I used a factory pattern to define the 'default' behavior and setup. The combination provides both the ready-to-run approach and the flexibility to adapt or add later on without explicitly modifying the existing code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier iteration, I deferred instantiation of the handler classes, and iterated over the classes instead of the instantiated objects. Then each time a handler was instantiated it was with a specific path attribute and context object.

In that way, the pipeline was kind of its own factory.

I do think being able to add children dynamically to the StageHandler class and Pipeline class would be helpful (at init), so I can tweak that somewhat.

Have a factory function for each stage, with a default, and then an overall default pipeline that invokes those factory functions too.

I am also thinking about how I can surface the requisiteness of some parts of the context earlier in the pipeline, and I think I have a way to do it that should not be too onerous.

We already have the tree structure of each stage to play with, and for any particular path that gets input, if we can just report what kind of file will be outputted by it, then we can infer its entire path through the pipeline, and then warn early if arguments are missing.

I will implement both of these and do a force push.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, thinking it through a bit more, the 'output prediction' is going to get squirrelly fast, especially when I add Google Drive download which could be hiding any file type behind its URL. So I'll just focus on the factory functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I've done this in a fairly minimal way - the init methods can now optionally take a children argument, and the CHILDREN attribute is changed to DEFAULT_CHILDREN which is used in the case that the children init argument is not used.

Seems to work neatly, and provides some flexibility without having to subclass.

Will force push my local rebase once #574 is merged, as I have also rebased on that to fix youtube downloading issues I discovered with the refactor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors file handling into a staged pipeline that processes files through Download, Convert/Validate, and Extract Metadata stages. Key changes include updating tests to use new pipeline APIs (e.g. adding an ffmpeg_settings parameter and replacing some file class invocations), reorganizing caching and file transfer logic, and introducing new extract metadata functionality for various file types.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/test_youtube.py Added new fixtures and tests for YouTube subtitle processing.
tests/test_videos.py Updated video file creation to pass ffmpeg_settings consistently.
tests/test_thumbnails.py Replaced DocumentFile with EPubFile for EPUB tests.
tests/test_files.py Refactored tests to patch new compress paths and updated cache key tests.
ricecooker/utils/zip.py Removed try/except around zipfile.ZipFile, affecting error handling.
ricecooker/utils/pipeline/* Added new pipeline stages and refactored file handler implementations.
ricecooker/config.py and chefs.py Integrated the new FilePipeline into the chef commands.
Comments suppressed due to low confidence (2)

tests/test_files.py:891

  • The error message contains a typo ('mising'); please correct it to 'missing'.
assert os.path.exists(localpath), "Error mising local test file " + localpath

ricecooker/utils/zip.py:36

  • The removal of the try/except block means that invalid zip archives will now raise uncaught exceptions; consider reintroducing error handling to provide a clearer error message.
inputzip = zipfile.ZipFile(path)

Copy link
Member

@marcellamaki marcellamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Co-reviewed with Richard and then re-reviewed separately.

Overall, the approach seems like a good plan in terms of the management of responsibilities. Richard will address Blaine's remaining feedback about ffmpeg/flexibility in follow up.

Since there is pre-existing work adding in additional test coverage, we can be fairly confident that the behavior has been retained in the refactor.



content_disposition_filename_cases = [
('Content-Disposition: attachment; filename="example.jpg"', "example.jpg"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the content disposition always either an attachment or inline?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably? This test was pre-existing, I just moved it.

@rtibbles
Copy link
Member Author

rtibbles commented Apr 9, 2025

Follow up issue to make the pipeline more easily configurable: #592

@rtibbles rtibbles merged commit 5502dff into learningequality:develop Apr 9, 2025
16 checks passed
@rtibbles rtibbles deleted the pipeline branch April 9, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants