Conversation


andybonnetto commented Sep 3, 2025

Description

We added evaluation support for LEMONADE (hosted on Hugging Face), a motion-centered MCQ benchmark extracted from the EPFL-Smart-Kitchen-30 dataset (ArXiv and Zenodo). The benchmark was evaluated with lmms-eval in the publication.

Modifications

  • Added a lemonade folder in tasks containing lemonade.yaml and utils.py
  • utils.py handles data loading from Hugging Face, video loading, answer parsing, and metric evaluation
  • lemonade.yaml contains the default prompts

Summary by CodeRabbit

  • New Features
    • Added support for the LEMONADE multimodal evaluation task with a ready-to-use evaluation configuration.
    • Automatic download/extraction and caching of required video assets on first run.
    • Frame sampling from videos for consistent visual inputs.
    • Prompt templates and generation settings for deterministic multiple-choice outputs.
    • Robust answer parsing, structured result processing, and aggregated accuracy reporting.

Contributor

coderabbitai bot commented Sep 3, 2025

Walkthrough

Adds a new LEMONADE evaluation task: a YAML config and a utilities module that download/extract videos, sample frames, build prompts, parse multiple-choice responses, process per-example results, and aggregate accuracy.

Changes

  • LEMONADE task config (lmms_eval/tasks/lemonade/lemonade.yaml): New task config pointing to amathislab/LEMONADE; sets task: lemonade, test_split: test, output_type: generate_until; wires doc_to_visual, doc_to_text, doc_to_target: "Correct Answer", and process_results; sets generation_kwargs (max_new_tokens: 128, temperature: 0, do_sample: false) and an accuracy metric using lemonade_aggregate_results; provides default pre/post prompts.
  • LEMONADE utilities (lmms_eval/tasks/lemonade/utils.py): New module adding constants (MAX_NUM_FRAMES, LEMONADE_ZIP_NAMES, HF cache paths), functions to download/extract video zips, load_video (frame sampling → PIL images), parse_options, lemonade_doc_to_visual, lemonade_doc_to_text, multi-choice helpers (get_multi_choice_info, parse_multi_choice_response), lemonade_process_results, and lemonade_aggregate_results (grouped accuracy and summary).
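For orientation, here is a minimal sketch of what the two multi-choice helpers named above typically look like; the exact bodies in the PR may differ, and the lettering scheme shown here is an assumption.

import string


def parse_options(options: list[str]) -> str:
    # Format raw answer options as lettered lines: "A. ...", "B. ...", one per line.
    letters = string.ascii_uppercase
    return "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))


def get_multi_choice_info(options: list[str]) -> tuple[list[str], dict[str, str]]:
    # Return the available choice letters and a letter -> answer-text mapping.
    letters = string.ascii_uppercase
    all_choices = [letters[i] for i in range(len(options))]
    index2ans = {letters[i]: opt for i, opt in enumerate(options)}
    return all_choices, index2ans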

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as LMMS Eval Runner
  participant Config as lemonade.yaml
  participant Utils as lemonade/utils.py
  participant Store as Local Cache (HF_HOME/.../videos)
  participant Model as Model

  Runner->>Config: load task config
  loop each doc
    Runner->>Utils: lemonade_doc_to_visual(doc)
    alt videos missing
      Utils->>Store: download_and_extract_lemonade_videos()
      Store-->>Utils: videos ready
    end
    Utils->>Utils: load_video(clip, start, end, max_frames)
    Utils-->>Runner: frames[]
    Runner->>Utils: lemonade_doc_to_text(doc, lmms_eval_specific_kwargs)
    Utils-->>Runner: prompt_text
    Runner->>Model: generate(prompt_text + frames, generation_kwargs)
    Model-->>Runner: raw_prediction
    Runner->>Utils: lemonade_process_results(doc, raw_prediction)
    Utils-->>Runner: parsed_result (qid, correct, pred, meta)
  end
  Runner->>Utils: lemonade_aggregate_results(all_results)
  Utils-->>Runner: overall_accuracy + breakdown

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
  • Title Check (✅ Passed): The title succinctly and accurately summarizes the primary change—adding evaluation support for the Lemonade benchmark—without extraneous detail or ambiguity.
  • Description Check (✅ Passed): The description provides a clear overview of the Lemonade dataset, cites its sources, and details the added files and their functionality, fulfilling the template’s core requirement for a detailed description.
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.

Poem

A carrot of code in lemony light,
I hop through frames, eight at a bite.
Options lined A–Z, I sniff and see,
I choose the letter, tidy as can be.
Zip, unzip, tally—hooray! 🥕✨



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (3)
lmms_eval/tasks/lemonade/utils.py (2)

91-91: Use more specific error message

The error message should be more specific to help with debugging.

Apply this diff:

-        raise FileNotFoundError(f"Video file not found: {video_path}")
+        raise FileNotFoundError(
+            f"Video file not found: {video_path}. "
+            f"Expected video for clip '{doc['Clip']}' at {video_path}"
+        )

127-201: Add type hints and simplify complex parsing logic

This function has complex parsing logic that could be simplified and needs type hints.

Apply this diff to add type hints:

-def parse_multi_choice_response(response, all_choices, index2ans):
+def parse_multi_choice_response(
+    response: str,
+    all_choices: list[str],
+    index2ans: dict[str, str]
+) -> str:
     """
     Parse the prediction from the generated response.
     Return the predicted index e.g., A, B, C, D.
     """

Consider refactoring this function to reduce complexity by extracting helper methods for each parsing strategy.
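One possible shape for that refactor, as a hedged sketch assuming the current heuristics (bracketed letter, bare letter, then answer-text match); the helper names below are illustrative, not part of the PR:

import re
from typing import Optional


def _match_bracketed_letter(response: str, all_choices: list[str]) -> Optional[str]:
    # Match patterns like "(A)" or "[B]" in the model output.
    for choice in all_choices:
        if f"({choice})" in response or f"[{choice}]" in response:
            return choice
    return None


def _match_bare_letter(response: str, all_choices: list[str]) -> Optional[str]:
    # Match a standalone letter such as "A." or " B " in the output.
    for choice in all_choices:
        if re.search(rf"\b{choice}\b", response):
            return choice
    return None


def _match_answer_text(response: str, index2ans: dict[str, str]) -> Optional[str]:
    # Fall back to matching the full answer text, case-insensitively.
    for letter, answer in index2ans.items():
        if answer.lower() in response.lower():
            return letter
    return None


def parse_multi_choice_response(
    response: str, all_choices: list[str], index2ans: dict[str, str]
) -> str:
    for strategy in (_match_bracketed_letter, _match_bare_letter):
        pred = strategy(response, all_choices)
        if pred is not None:
            return pred
    pred = _match_answer_text(response, index2ans)
    return pred if pred is not None else all_choices[0]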

lmms_eval/tasks/lemonade/lemonade.yaml (1)

1-23: Add newline at end of file

YAML files should end with a newline character per linting rules.

Apply this diff:

   default:
     pre_prompt: "Answer the following multiple-choice question using the given images.\n"
     post_prompt: "\nRespond only with the letter of the correct answer."
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 83e78e2 and c8c8d3c.

📒 Files selected for processing (2)
  • lmms_eval/tasks/lemonade/lemonade.yaml (1 hunks)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/lemonade/lemonade.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/lemonade/lemonade.yaml

[error] 23-23: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

91-91: Avoid specifying long messages outside the exception class

(TRY003)


104-104: Use of possibly insecure function; consider using ast.literal_eval

(S307)


115-115: Use of assert detected

(S101)


206-206: Use of possibly insecure function; consider using ast.literal_eval

(S307)

Comment on lines 23 to 36
def download_and_extract_lemonade_videos(data_dir):
    os.makedirs(data_dir, exist_ok=True)
    videos_dir = os.path.join(data_dir, "videos")
    os.makedirs(videos_dir, exist_ok=True)

    for zip_name in LEMONADE_ZIP_NAMES:
        print(f"Downloading {zip_name} from Hugging Face...")
        zip_path = hf_hub_download(repo_id="amathislab/LEMONADE", filename=zip_name, repo_type="dataset", cache_dir=os.path.join(data_dir, "cache"))

        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(videos_dir)

    print("All videos downloaded and extracted successfully.\n")

Contributor

🛠️ Refactor suggestion

Add type hints to all functions per coding guidelines

Type hints are required for all Python code according to your guidelines. This function and all others in the file lack proper type annotations.

Apply this diff to add type hints:

-def download_and_extract_lemonade_videos(data_dir):
+def download_and_extract_lemonade_videos(data_dir: str) -> None:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def download_and_extract_lemonade_videos(data_dir):
+def download_and_extract_lemonade_videos(data_dir: str) -> None:
     os.makedirs(data_dir, exist_ok=True)
     videos_dir = os.path.join(data_dir, "videos")
     os.makedirs(videos_dir, exist_ok=True)
     for zip_name in LEMONADE_ZIP_NAMES:
         print(f"Downloading {zip_name} from Hugging Face...")
-        zip_path = hf_hub_download(repo_id="amathislab/LEMONADE", filename=zip_name, repo_type="dataset", cache_dir=os.path.join(data_dir, "cache"))
+        zip_path = hf_hub_download(
+            repo_id="amathislab/LEMONADE",
+            filename=zip_name,
+            repo_type="dataset",
+            cache_dir=os.path.join(data_dir, "cache"),
+        )
         with zipfile.ZipFile(zip_path, "r") as zip_ref:
             zip_ref.extractall(videos_dir)
     print("All videos downloaded and extracted successfully.\n")
🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 23 to 36, the function
download_and_extract_lemonade_videos lacks type annotations; update the
signature to download_and_extract_lemonade_videos(data_dir: str) -> None, add
explicit local variable annotations where useful (e.g., videos_dir: str,
zip_path: str, zip_name: str), and ensure any module-level constants like
LEMONADE_ZIP_NAMES are annotated as List[str] (import List from typing); also
import PathLike or Union[str, PathLike] if you prefer allowing Path-like inputs
and annotate hf_hub_download return as str when used; keep behavior unchanged.

Comment on lines 75 to 94
def lemonade_doc_to_visual(doc):
    videos_dir = os.path.join(data_dir, "videos")

    if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
        print("Videos directory is empty — downloading and extracting...\n")
        download_and_extract_lemonade_videos(data_dir)

    video_filename = doc["Clip"] + "_hololens.mp4"

    video_path = os.path.join(videos_dir, video_filename)

    if os.path.exists(video_path):
        start = int(doc["Start"])
        end = int(doc["End"])
        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
    else:
        raise FileNotFoundError(f"Video file not found: {video_path}")

    return frames

Contributor

🛠️ Refactor suggestion

Add type hints and improve error handling

The function needs type hints and has multiple issues with error handling.

Apply this diff to add type hints and use more specific imports:

-def lemonade_doc_to_visual(doc):
+from typing import Any
+
+def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[Image.Image]:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

91-91: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 75 to 94, add type hints and
tighten error handling: annotate the function signature (e.g., def
lemonade_doc_to_visual(doc: Dict[str, Any]) -> List[Any]) and import specific
types (from typing import Dict, Any, List; from pathlib import Path); replace
os.path joins with Path objects for data_dir/videos and video_path, validate
required keys ("Clip","Start","End") exist in doc and raise KeyError with clear
message if missing, validate that Start/End are integers and that Start < End
(raise ValueError if not), wrap download_and_extract_lemonade_videos and
load_video calls in try/except to re-raise more informative exceptions, and on
missing video raise FileNotFoundError with the Path shown; ensure the function
returns the appropriately typed frames list.

Comment on lines 213 to 271
def lemonade_aggregate_results(results):
    def compute_accuracy(grouped_results):
        acc_dict = {}
        for key, samples in grouped_results.items():
            correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
            total = len(samples)
            acc = round(correct / total, 5) if total > 0 else 0.0
            stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
            acc_dict[key] = {
                "num": total,
                "acc": acc,
                "acc_stderr": stderr,
            }
        return acc_dict

    qid_results = defaultdict(list)
    category_results = defaultdict(list)
    subcategory_results = defaultdict(list)
    difficulty_results = defaultdict(list)

    valid_results = [r for r in results if r["parsed_pred"] != "API Error"]

    for r in valid_results:
        qid_results[r["QID"]].append(r)
        category_results[r["category"]].append(r)
        subcategory_results[r["subcategory"]].append(r)
        difficulty_results[r["difficulty"]].append(r)

    qid_acc = compute_accuracy(qid_results)
    category_acc = compute_accuracy(category_results)
    subcategory_acc = compute_accuracy(subcategory_results)
    difficulty_acc = compute_accuracy(difficulty_results)

    total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
    total = len(valid_results)
    overall_acc = round(total_correct / total, 5) if total > 0 else 0.0
    overall_stderr = round(np.sqrt(overall_acc * (1 - overall_acc) / total), 5) if total > 0 else 0.0

    print("\nResults:")

    print("\nAccuracy per QID:")
    for k, v in qid_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Category:")
    for k, v in category_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Subcategory:")
    for k, v in subcategory_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Difficulty:")
    for k, v in difficulty_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print(f"\nOverall Accuracy: {overall_acc} ± {overall_stderr} ({total} examples)")

    return overall_acc
Contributor

🛠️ Refactor suggestion

Add type hints to function and nested function

Both the main function and the nested compute_accuracy function need type hints.

Apply this diff to add type hints:

+from typing import Any
+
-def lemonade_aggregate_results(results):
-    def compute_accuracy(grouped_results):
+def lemonade_aggregate_results(results: list[dict[str, Any]]) -> float:
+    def compute_accuracy(grouped_results: dict[str, list[dict[str, Any]]]) -> dict[str, dict[str, Any]]:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+from typing import Any
+
-def lemonade_aggregate_results(results):
-    def compute_accuracy(grouped_results):
+def lemonade_aggregate_results(results: list[dict[str, Any]]) -> float:
+    def compute_accuracy(grouped_results: dict[str, list[dict[str, Any]]]) -> dict[str, dict[str, Any]]:
         acc_dict = {}
         for key, samples in grouped_results.items():
             correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
             total = len(samples)
             acc = round(correct / total, 5) if total > 0 else 0.0
             stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
             acc_dict[key] = {
                 "num": total,
                 "acc": acc,
                 "acc_stderr": stderr,
             }
         return acc_dict
     qid_results = defaultdict(list)
     category_results = defaultdict(list)
     subcategory_results = defaultdict(list)
     difficulty_results = defaultdict(list)
     valid_results = [r for r in results if r["parsed_pred"] != "API Error"]
     for r in valid_results:
         qid_results[r["QID"]].append(r)
         category_results[r["category"]].append(r)
         subcategory_results[r["subcategory"]].append(r)
         difficulty_results[r["difficulty"]].append(r)
     qid_acc = compute_accuracy(qid_results)
     category_acc = compute_accuracy(category_results)
     subcategory_acc = compute_accuracy(subcategory_results)
     difficulty_acc = compute_accuracy(difficulty_results)
     total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
     total = len(valid_results)
     overall_acc = round(total_correct / total, 5) if total > 0 else 0.0
     overall_stderr = round(np.sqrt(overall_acc * (1 - overall_acc) / total), 5) if total > 0 else 0.0
     print("\nResults:")
     print("\nAccuracy per QID:")
     for k, v in qid_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Category:")
     for k, v in category_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Subcategory:")
     for k, v in subcategory_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Difficulty:")
     for k, v in difficulty_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print(f"\nOverall Accuracy: {overall_acc} ± {overall_stderr} ({total} examples)")
     return overall_acc
🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 213 to 271, the top-level
function lemonade_aggregate_results and its nested compute_accuracy lack type
hints; add typing imports and annotate: import List, Dict, Any, Union from
typing; annotate lemonade_aggregate_results(results: List[Dict[str, Any]]) ->
float and compute_accuracy(grouped_results: Dict[Any, List[Dict[str, Any]]]) ->
Dict[Any, Dict[str, Union[int, float]]]; also annotate local variables/returns
where helpful (e.g., acc_dict: Dict[Any, Dict[str, Union[int, float]]]) so the
function signature and nested helper are fully typed.

Comment on lines 9 to 20

MAX_NUM_FRAMES = 8

LEMONADE_ZIP_NAMES = [
    "videos_batch_0.zip",
    "videos_batch_1.zip",
    "videos_batch_2.zip",
    "videos_batch_3.zip",
    "videos_batch_4.zip",
]

data_dir = "./data/lemonade"
Collaborator

The data path for video evaluation should not be hardcoded

Comment on lines 75 to 93
def lemonade_doc_to_visual(doc):
    videos_dir = os.path.join(data_dir, "videos")

    if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
        print("Videos directory is empty — downloading and extracting...\n")
        download_and_extract_lemonade_videos(data_dir)

    video_filename = doc["Clip"] + "_hololens.mp4"

    video_path = os.path.join(videos_dir, video_filename)

    if os.path.exists(video_path):
        start = int(doc["Start"])
        end = int(doc["End"])
        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
    else:
        raise FileNotFoundError(f"Video file not found: {video_path}")

    return frames
Collaborator

The doc_to_visual for a video task should not return frames. Instead, it should return the list of video paths.

Author

Due to the long duration of the videos and the repetitiveness/overlap of the video segments, it is wiser to avoid chunking the video prior to access, to prevent memory redundancy. This framework (outputting frames) was inspired by another task in your repository: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/tasks/plm_videobench/rcap/rcap_utils.py.

Collaborator

It would be okay if you always want to hand-sample your video frames for every model under test, and we would accept the merge like that. But we would suggest returning a video path instead of video frames as multiple images, so that video chunking can be done in the model generation itself (for example using the general API we provide) and the FPS and max frames can be set by the user.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

20-21: Do not hardcode data path; respect env override.

Use an environment variable so users can point to pre-downloaded data (requested previously).

-DEFAULT_DATA_DIR = "./data/lemonade"
+DEFAULT_DATA_DIR = os.getenv("LEMONADE_DATA_DIR", "./data/lemonade")

105-133: doc_to_visual should return video paths, not decoded frames (integration mismatch).

Per prior review and typical lmms-eval video interface, pass paths to the runner/model instead of materializing frames here. This also avoids heavy I/O and memory use.

-def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[Image.Image]:
-    """
-    Load video frames for a given entry in the LEMONADE dataset.
- 
-    Args:
-        doc: A dictionary representing an entry in the dataset.
-    Returns:
-        frames: List of PIL Image objects representing sampled frames
-    """
-
-    videos_dir = os.path.join(DEFAULT_DATA_DIR, "videos")
+def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[str]:
+    """
+    Resolve the local video path(s) for the given LEMONADE example.
+
+    Args:
+        doc: One dataset record (expects keys: "Clip", "Start", "End").
+    Returns:
+        A list with a single video path string.
+    """
+
+    videos_dir = os.path.join(DEFAULT_DATA_DIR, "videos")
     if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
         print("Videos directory is empty — downloading and extracting...\n")
         download_and_extract_lemonade_videos(DEFAULT_DATA_DIR)
 
     video_filename = doc["Clip"] + "_hololens.mp4"
     video_path = os.path.join(videos_dir, video_filename)
 
-    if os.path.exists(video_path):
-        start = int(doc["Start"])
-        end = int(doc["End"])
-        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
-    else:
-        raise FileNotFoundError(
-            f"Video file not found: {video_path}. "
-            f"Expected video for clip '{doc['Clip']}' at {video_path}"
-        )
-    return frames
+    if not os.path.exists(video_path):
+        raise FileNotFoundError(f"Video not found: {video_path}")
+    # Return path(s); downstream components handle decoding/sampling.
+    return [video_path]

To confirm the expected return type across tasks, run:

#!/bin/bash
# Inspect other video tasks' doc_to_visual signatures/returns.
rg -nP --type=py -C2 '\bdef\s+.*doc_to_visual\s*\('
🧹 Nitpick comments (5)
lmms_eval/tasks/lemonade/utils.py (5)

1-11: Clean up and sort imports; remove unused yaml (Ruff I001/unused-imports).

Keeps the file compliant with the repo's linting rules.

-import ast
-import os
-import cv2
-import numpy as np
-import yaml
-import zipfile
-from collections import defaultdict
-from PIL import Image
-from typing import Any, Optional
-from huggingface_hub import hf_hub_download
+import ast
+import os
+import zipfile
+from collections import defaultdict
+from typing import Any, Optional
+
+import cv2
+import numpy as np
+from PIL import Image
+from huggingface_hub import hf_hub_download

38-43: Allow HF cache override via env var.

Enables shared cache on clusters without modifying code.

-        zip_path = hf_hub_download(
+        zip_path = hf_hub_download(
             repo_id="amathislab/LEMONADE",
             filename=zip_name,
             repo_type="dataset",
-            cache_dir=os.path.join(data_dir, "cache") 
+            cache_dir=os.getenv("HF_HUB_CACHE", os.path.join(data_dir, "cache")),
         )

135-143: Wrap long function signature to respect 88-char limit.

-def lemonade_doc_to_text(doc: dict[str, Any], lmms_eval_specific_kwargs: Optional[dict[str, Any]] = None) -> str:
+def lemonade_doc_to_text(
+    doc: dict[str, Any],
+    lmms_eval_specific_kwargs: Optional[dict[str, Any]] = None,
+) -> str:

170-171: Shorten exception message to satisfy Ruff TRY003.

-    if not isinstance(options, list):
-        raise TypeError(f"Expected list of options, got {type(options)}: {options}")
+    if not isinstance(options, list):
+        raise TypeError(f"options must be list[str], got {type(options)}")

303-315: Type-hint nested helper and remove unnecessary list allocations.

Improves readability and minor perf in aggregations.

-    def compute_accuracy(grouped_results):
-        acc_dict = {}
-        for key, samples in grouped_results.items():
-            correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
+    def compute_accuracy(
+        grouped_results: dict[str, list[dict[str, Any]]],
+    ) -> dict[str, dict[str, float | int]]:
+        acc_dict: dict[str, dict[str, float | int]] = {}
+        for key, samples in grouped_results.items():
+            correct = sum(r["parsed_pred"] == r["answer"] for r in samples)
             total = len(samples)
             acc = round(correct / total, 5) if total > 0 else 0.0
             stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
             acc_dict[key] = {
                 "num": total,
                 "acc": acc,
                 "acc_stderr": stderr,
             }
         return acc_dict
@@
-    total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
+    total_correct = sum(r["parsed_pred"] == r["answer"] for r in valid_results)

Optional: gate the many print()s behind an env flag (e.g., LEMONADE_VERBOSE) or switch to logging.

Also applies to: 335-338

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8c8d3c and 060935d.

📒 Files selected for processing (1)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

128-131: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
lmms_eval/tasks/lemonade/utils.py (1)

151-155: Good security hardening: replaced eval() with ast.literal_eval.

This removes code-execution risk on dataset-provided strings.

Collaborator

kcz358 commented Sep 9, 2025

Hi, thank you for the PR. I think the major issue is that we would like your downloaded data to be placed in the snapshot cache, with the cache folder always settable via HF_HOME, instead of directly using a relative ./data path. Can you please check other video tasks to see how to achieve that? Thanks!
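For reference, the pattern used by other video tasks looks roughly like this (a minimal sketch; the lemonade_videos subfolder name is an assumption):

import os

# Resolve the base cache dir from HF_HOME, falling back to the default HF cache.
hf_home = os.getenv("HF_HOME", "~/.cache/huggingface")
base_cache_dir = os.path.expanduser(hf_home)

# Task-specific subfolder under the shared cache (name is illustrative).
videos_dir = os.path.join(base_cache_dir, "lemonade_videos")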

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

47-80: Zero-division and unreadable-video pitfalls in load_video (duplicate of prior feedback)

Division by zero when num_frames == 0 and missing checks for cap.isOpened() / total_frames <= 0 can crash runs on bad inputs. This was flagged earlier and is still present.

Apply this diff to harden the function and satisfy robustness guidelines:

-def load_video(video_file: str, start_frame: int, end_frame: int, max_num_frames: int = MAX_NUM_FRAMES) -> list[Image.Image]:
+def load_video(
+    video_file: str,
+    start_frame: int,
+    end_frame: int,
+    max_num_frames: int = MAX_NUM_FRAMES,
+) -> list[Image.Image]:
@@
-    cap = cv2.VideoCapture(video_file)
-    try: 
-        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
-        start_frame = max(0, start_frame)
-        end_frame = min(end_frame, total_frames - 1)
-        total_valid_frames = end_frame - start_frame + 1
-        num_frames = min(max_num_frames, total_valid_frames)
-        step = total_valid_frames / num_frames
+    cap = cv2.VideoCapture(video_file)
+    try:
+        if not cap.isOpened():
+            raise FileNotFoundError(f"Cannot open video: {video_file}")
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+        if total_frames <= 0:
+            return []
+        start_frame = max(0, start_frame)
+        end_frame = min(end_frame, total_frames - 1)
+        if end_frame < start_frame:
+            raise ValueError(f"Invalid frame range: start={start_frame}, end={end_frame}")
+        total_valid_frames = end_frame - start_frame + 1
+        num_frames = min(max_num_frames, total_valid_frames)
+        if num_frames <= 0:
+            return []
+        step = total_valid_frames / float(num_frames)

Also reformatted the signature to respect the 88-char limit.


289-311: Add type hints to nested compute_accuracy (still missing)

Guidelines require type hints everywhere; nested function currently lacks them.

-    def compute_accuracy(grouped_results):
+    def compute_accuracy(
+        grouped_results: dict[str, list[dict[str, Any]]]
+    ) -> dict[str, dict[str, float | int]]:
🧹 Nitpick comments (6)
lmms_eval/tasks/lemonade/utils.py (6)

102-129: Return video path(s) instead of decoded frames to let models control FPS/chunking

Maintainer requested returning the video path so model backends can chunk/sample consistently with the general API. Current code eagerly decodes frames, which fixes FPS/max-frames policy in the task layer and duplicates memory.

Option A (preferred): Add a parallel helper and switch lemonade.yaml to it:

def lemonade_doc_to_video_paths(doc: dict[str, Any]) -> list[str]:
    if not os.path.exists(videos_dir) or not os.listdir(videos_dir):
        download_and_extract_lemonade_videos("")  # path unused with snapshot cache
    video_filename = doc["Clip"] + "_hololens.mp4"
    video_path = os.path.join(videos_dir, video_filename)
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")
    return [video_path]

Option B: Gate behavior with an env flag (backwards compatible):

# if os.getenv("LEMONADE_RETURN_VIDEO_PATHS") == "1": return [video_path]
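Spelled out, Option B might look like this (a sketch using the flag name suggested above; the surrounding logic mirrors the existing function and is assumed, not final):

def lemonade_doc_to_visual(doc: dict[str, Any]) -> list:
    video_path = os.path.join(videos_dir, doc["Clip"] + "_hololens.mp4")
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video not found: {video_path}")
    # Opt-in: return the raw path and let the model backend sample frames.
    if os.getenv("LEMONADE_RETURN_VIDEO_PATHS") == "1":
        return [video_path]
    # Default: keep current behavior and return decoded frames.
    start, end = int(doc["Start"]), int(doc["End"])
    return load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)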

Would you like me to update lemonade.yaml accordingly?


124-127: Shorten exception message to satisfy TRY003 and the 88-char limit

The multi-line f-string in the exception violates style rules and is verbose.

-        raise FileNotFoundError(
-            f"Video file not found: {video_path}. "
-            f"Expected video for clip '{doc['Clip']}' at {video_path}"
-        )
+        raise FileNotFoundError(f"Missing video for clip '{doc['Clip']}': {video_path}")

131-152: Docstring OK; consider guarding literal_eval and normalizing whitespace

Literal parsing may fail if Answers is malformed; also trimming leading/trailing whitespace improves robustness.

-    parsed_options = parse_options(ast.literal_eval(doc["Answers"]))
+    try:
+        answers = ast.literal_eval(str(doc["Answers"]).strip())
+    except (ValueError, SyntaxError) as exc:
+        raise ValueError(f"Malformed Answers field for QID={doc.get('QID')}") from exc
+    parsed_options = parse_options(answers)

179-260: Heuristics OK; minor safety on argmax when candidates have only unfound matches

If all start_indexes are -1 (no substring actually found), np.argmax will pick the first entry arbitrarily.

-        pred_index = candidates[np.argmax(start_indexes)]
+        best = int(np.argmax(start_indexes))
+        if start_indexes[best] < 0:
+            return "A"
+        pred_index = candidates[best]

336-355: Prefer logging over prints for metrics summaries

Printing inside aggregation can spam stdout for programmatic runs. Route through logging so callers can control verbosity.

+import logging
+logger = logging.getLogger(__name__)
@@
-    print("\nResults:")
+    logger.info("Results:")
@@
-        print(f"  {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
+        logger.info("  %s: %s ± %s (%s examples)", k, v["acc"], v["acc_stderr"], v["num"])

112-115: Ensure download() return is used; avoid stale global videos_dir

If download_and_extract_lemonade_videos returns a path (see earlier diff), use it here to avoid relying on a precomputed global.

-        download_and_extract_lemonade_videos(base_cache_dir)
+        extracted_dir = download_and_extract_lemonade_videos(base_cache_dir)
+        # Prefer the returned path in case cache location changed.
+        global videos_dir
+        videos_dir = extracted_dir
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 060935d and 50d70da.

📒 Files selected for processing (1)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

124-127: Avoid specifying long messages outside the exception class

(TRY003)


167-167: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

154-177: LGTM: typed and replaces assert with explicit error handling

Mapping options to labels is straightforward and compliant with guidelines.


18-46: Make data_dir parameter functional and extract into HF snapshot cache

  • The data_dir argument isn’t used beyond creating the folder; extraction always writes to the module-global videos_dir. Refactor to call snapshot_download(repo_id="amathislab/LEMONADE"), set videos_dir = os.path.join(snapshot_dir, "videos"), extract zips there, and return the path.
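A rough sketch of that refactor, assuming huggingface_hub.snapshot_download and the videos_batch_*.zip layout described earlier; the glob pattern and return handling are assumptions:

import os
import zipfile

from huggingface_hub import snapshot_download


def download_and_extract_lemonade_videos() -> str:
    # Fetch only the video zips into the HF snapshot cache (respects HF_HOME).
    snapshot_dir = snapshot_download(
        repo_id="amathislab/LEMONADE",
        repo_type="dataset",
        allow_patterns=["videos_batch_*.zip"],
    )
    videos_dir = os.path.join(snapshot_dir, "videos")
    os.makedirs(videos_dir, exist_ok=True)
    # Extract every downloaded zip next to the snapshot, then return the path.
    for name in sorted(os.listdir(snapshot_dir)):
        if name.endswith(".zip"):
            with zipfile.ZipFile(os.path.join(snapshot_dir, name), "r") as zf:
                zf.extractall(videos_dir)
    return videos_dir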

@TashkovskaMatea

Hi, thank you for the feedback. We've updated the code so that the downloaded data is now placed in the snapshot cache, following the same approach used in other video tasks. Looking forward to your feedback on this update.

Comment on lines 18 to 45
def download_and_extract_lemonade_videos(data_dir: str) -> None:
    """
    Download and extract LEMONADE files from Hugging Face into a local data directory.

    Args:
        data_dir: Directory that stores the files.
    Returns:
        None
    """

    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(videos_dir, exist_ok=True)
    print(f"Creating videos directory at {videos_dir}...")

    for zip_name in LEMONADE_ZIP_NAMES:
        print(f"Downloading {zip_name} from Hugging Face...")
        zip_path = hf_hub_download(
            repo_id="amathislab/LEMONADE",
            filename=zip_name,
            repo_type="dataset",
            local_dir=os.path.join(base_cache_dir, "lemonade_zips"),
            local_dir_use_symlinks=False,
            resume_download=True,
        )
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(videos_dir)

    print("All videos downloaded and extracted successfully.\n")
Collaborator

I would suggest this should not be included, as the download part is already being done in api/task.py; we handle the snapshot download from the Hub there.


Thank you for the suggestion; we removed the custom download function. The dataset now relies on the existing download/unzip logic in api/task.py.

Comment on lines 11 to 12
MAX_NUM_FRAMES = 8
LEMONADE_ZIP_NAMES = [f"videos_batch_{i}.zip" for i in range(5)]
Collaborator

Suggest not hardcoding this; otherwise it would be really strange for a video task to use only 8 frames. Or is this intended?


We removed the hardcoded value, and max_num_frames is now configurable through the task YAML. We can't set this at model generation time since frame extraction happens during video loading (due to the long duration of the videos and the overlap/repetitiveness of segments).
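For context, a hedged sketch of the module-level YAML read that several lmms-eval tasks use to expose such settings; the exact key under which max_num_frames lives in lemonade.yaml is an assumption here:

from pathlib import Path

import yaml

# Read the task YAML at import time, skipping the !function constructor lines
# that plain yaml.safe_load cannot parse.
with open(Path(__file__).parent / "lemonade.yaml", "r") as f:
    raw_lines = f.readlines()

config = yaml.safe_load("".join(line for line in raw_lines if "!function" not in line))
MAX_NUM_FRAMES = (
    config.get("lmms_eval_specific_kwargs", {}).get("default", {}).get("max_num_frames", 8)
)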

Collaborator

@kcz358 kcz358 left a comment

Hi, I would say it would be fine if you want to hardcode the download logic and the video loading logic for your benchmark, since it would not disturb the overall pipeline of the framework. But I would again suggest leaving this functionality to be handled by the other components.

Feel free to add a README in your dataset folder on how to use it, along with any comments on the implementation. If you feel good about all of your changes, I will merge it.

@TashkovskaMatea

Thanks again for the helpful suggestions - we’ve made the changes and added a README. All set for merge on our side!
