Conversation


andybonnetto commented Sep 3, 2025

Description

We added evaluation support for LEMONADE (hosted on Hugging Face), a motion-centered MCQ benchmark extracted from the EPFL-Smart-Kitchen-30 dataset (ArXiv and Zenodo). The benchmark was evaluated with lmms-eval in the publication.

Modifications

  • Added a lemonade folder in tasks containing lemonade.yaml and utils.py
  • utils.py handles data loading from Hugging Face, video loading, answer parsing, and metric evaluation
  • lemonade.yaml contains the default prompts

Summary by CodeRabbit

  • New Features
    • Added support for the LEMONADE multimodal evaluation task with a ready-to-use evaluation configuration.
    • Automatic download/extraction and caching of required video assets on first run.
    • Frame sampling from videos for consistent visual inputs.
    • Prompt templates and generation settings for deterministic multiple-choice outputs.
    • Robust answer parsing, structured result processing, and aggregated accuracy reporting.

Contributor

coderabbitai bot commented Sep 3, 2025

Walkthrough

Adds a new LEMONADE evaluation task: a YAML config and a utilities module that download/extract videos, sample frames, build prompts, parse multiple-choice responses, process per-example results, and aggregate accuracy.

Changes

  • LEMONADE task config (lmms_eval/tasks/lemonade/lemonade.yaml): New task config pointing to amathislab/LEMONADE; sets task: lemonade, test_split: test, output_type: generate_until; wires doc_to_visual, doc_to_text, doc_to_target: "Correct Answer", and process_results; sets generation_kwargs (max_new_tokens: 128, temperature: 0, do_sample: false) and an accuracy metric using lemonade_aggregate_results; provides default pre/post prompts.
  • LEMONADE utilities (lmms_eval/tasks/lemonade/utils.py): New module adding constants (MAX_NUM_FRAMES, LEMONADE_ZIP_NAMES, HF cache paths), functions to download/extract video zips, load_video (frame sampling → PIL images), parse_options, lemonade_doc_to_visual, lemonade_doc_to_text, multi-choice helpers (get_multi_choice_info, parse_multi_choice_response), lemonade_process_results, and lemonade_aggregate_results (grouped accuracy and summary).
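For orientation, here is a minimal sketch of what the two multi-choice helpers named above typically look like; the exact bodies in the PR may differ, and the lettering scheme shown here is an assumption.

import string


def parse_options(options: list[str]) -> str:
    # Format raw answer options as lettered lines: "A. ...", "B. ...", one per line.
    letters = string.ascii_uppercase
    return "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))


def get_multi_choice_info(options: list[str]) -> tuple[list[str], dict[str, str]]:
    # Return the available choice letters and a letter -> answer-text mapping.
    letters = string.ascii_uppercase
    all_choices = [letters[i] for i in range(len(options))]
    index2ans = {letters[i]: opt for i, opt in enumerate(options)}
    return all_choices, index2ans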

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as LMMS Eval Runner
  participant Config as lemonade.yaml
  participant Utils as lemonade/utils.py
  participant Store as Local Cache (HF_HOME/.../videos)
  participant Model as Model

  Runner->>Config: load task config
  loop each doc
    Runner->>Utils: lemonade_doc_to_visual(doc)
    alt videos missing
      Utils->>Store: download_and_extract_lemonade_videos()
      Store-->>Utils: videos ready
    end
    Utils->>Utils: load_video(clip, start, end, max_frames)
    Utils-->>Runner: frames[]
    Runner->>Utils: lemonade_doc_to_text(doc, lmms_eval_specific_kwargs)
    Utils-->>Runner: prompt_text
    Runner->>Model: generate(prompt_text + frames, generation_kwargs)
    Model-->>Runner: raw_prediction
    Runner->>Utils: lemonade_process_results(doc, raw_prediction)
    Utils-->>Runner: parsed_result (qid, correct, pred, meta)
  end
  Runner->>Utils: lemonade_aggregate_results(all_results)
  Utils-->>Runner: overall_accuracy + breakdown

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
  • Title Check (✅ Passed): The title succinctly and accurately summarizes the primary change—adding evaluation support for the Lemonade benchmark—without extraneous detail or ambiguity.
  • Description Check (✅ Passed): The description provides a clear overview of the Lemonade dataset, cites its sources, and details the added files and their functionality, fulfilling the template’s core requirement for a detailed description.
  • Docstring Coverage (✅ Passed): No functions found in the changes. Docstring coverage check skipped.

Poem

A carrot of code in lemony light,
I hop through frames, eight at a bite.
Options lined A–Z, I sniff and see,
I choose the letter, tidy as can be.
Zip, unzip, tally—hooray! 🥕✨



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (3)
lmms_eval/tasks/lemonade/utils.py (2)

91-91: Use more specific error message

The error message should be more specific to help with debugging.

Apply this diff:

-        raise FileNotFoundError(f"Video file not found: {video_path}")
+        raise FileNotFoundError(
+            f"Video file not found: {video_path}. "
+            f"Expected video for clip '{doc['Clip']}' at {video_path}"
+        )

127-201: Add type hints and simplify complex parsing logic

This function has complex parsing logic that could be simplified and needs type hints.

Apply this diff to add type hints:

-def parse_multi_choice_response(response, all_choices, index2ans):
+def parse_multi_choice_response(
+    response: str,
+    all_choices: list[str],
+    index2ans: dict[str, str]
+) -> str:
     """
     Parse the prediction from the generated response.
     Return the predicted index e.g., A, B, C, D.
     """

Consider refactoring this function to reduce complexity by extracting helper methods for each parsing strategy.
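One possible shape for that refactor, as a hedged sketch assuming the current heuristics (bracketed letter, bare letter, then answer-text match); the helper names below are illustrative, not part of the PR:

import re
from typing import Optional


def _match_bracketed_letter(response: str, all_choices: list[str]) -> Optional[str]:
    # Match patterns like "(A)" or "[B]" in the model output.
    for choice in all_choices:
        if f"({choice})" in response or f"[{choice}]" in response:
            return choice
    return None


def _match_bare_letter(response: str, all_choices: list[str]) -> Optional[str]:
    # Match a standalone letter such as "A." or " B " in the output.
    for choice in all_choices:
        if re.search(rf"\b{choice}\b", response):
            return choice
    return None


def _match_answer_text(response: str, index2ans: dict[str, str]) -> Optional[str]:
    # Fall back to matching the full answer text, case-insensitively.
    for letter, answer in index2ans.items():
        if answer.lower() in response.lower():
            return letter
    return None


def parse_multi_choice_response(
    response: str, all_choices: list[str], index2ans: dict[str, str]
) -> str:
    for strategy in (_match_bracketed_letter, _match_bare_letter):
        pred = strategy(response, all_choices)
        if pred is not None:
            return pred
    pred = _match_answer_text(response, index2ans)
    return pred if pred is not None else all_choices[0]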

lmms_eval/tasks/lemonade/lemonade.yaml (1)

1-23: Add newline at end of file

YAML files should end with a newline character per linting rules.

Apply this diff:

   default:
     pre_prompt: "Answer the following multiple-choice question using the given images.\n"
     post_prompt: "\nRespond only with the letter of the correct answer."
+
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 83e78e2 and c8c8d3c.

📒 Files selected for processing (2)
  • lmms_eval/tasks/lemonade/lemonade.yaml (1 hunks)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

  • lmms_eval/tasks/lemonade/lemonade.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 YAMLlint (1.37.1)
lmms_eval/tasks/lemonade/lemonade.yaml

[error] 23-23: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

91-91: Avoid specifying long messages outside the exception class

(TRY003)


104-104: Use of possibly insecure function; consider using ast.literal_eval

(S307)


115-115: Use of assert detected

(S101)


206-206: Use of possibly insecure function; consider using ast.literal_eval

(S307)

Comment on lines 23 to 36
def download_and_extract_lemonade_videos(data_dir):
    os.makedirs(data_dir, exist_ok=True)
    videos_dir = os.path.join(data_dir, "videos")
    os.makedirs(videos_dir, exist_ok=True)

    for zip_name in LEMONADE_ZIP_NAMES:
        print(f"Downloading {zip_name} from Hugging Face...")
        zip_path = hf_hub_download(repo_id="amathislab/LEMONADE", filename=zip_name, repo_type="dataset", cache_dir=os.path.join(data_dir, "cache"))

        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(videos_dir)

    print("All videos downloaded and extracted successfully.\n")

Contributor

🛠️ Refactor suggestion

Add type hints to all functions per coding guidelines

Type hints are required for all Python code according to your guidelines. This function and all others in the file lack proper type annotations.

Apply this diff to add type hints:

-def download_and_extract_lemonade_videos(data_dir):
+def download_and_extract_lemonade_videos(data_dir: str) -> None:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def download_and_extract_lemonade_videos(data_dir):
+def download_and_extract_lemonade_videos(data_dir: str) -> None:
     os.makedirs(data_dir, exist_ok=True)
     videos_dir = os.path.join(data_dir, "videos")
     os.makedirs(videos_dir, exist_ok=True)
     for zip_name in LEMONADE_ZIP_NAMES:
         print(f"Downloading {zip_name} from Hugging Face...")
-        zip_path = hf_hub_download(repo_id="amathislab/LEMONADE", filename=zip_name, repo_type="dataset", cache_dir=os.path.join(data_dir, "cache"))
+        zip_path = hf_hub_download(
+            repo_id="amathislab/LEMONADE",
+            filename=zip_name,
+            repo_type="dataset",
+            cache_dir=os.path.join(data_dir, "cache"),
+        )
         with zipfile.ZipFile(zip_path, "r") as zip_ref:
             zip_ref.extractall(videos_dir)
     print("All videos downloaded and extracted successfully.\n")
🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 23 to 36, the function
download_and_extract_lemonade_videos lacks type annotations; update the
signature to download_and_extract_lemonade_videos(data_dir: str) -> None, add
explicit local variable annotations where useful (e.g., videos_dir: str,
zip_path: str, zip_name: str), and ensure any module-level constants like
LEMONADE_ZIP_NAMES are annotated as List[str] (import List from typing); also
import PathLike or Union[str, PathLike] if you prefer allowing Path-like inputs
and annotate hf_hub_download return as str when used; keep behavior unchanged.

Comment on lines 75 to 94
def lemonade_doc_to_visual(doc):
    videos_dir = os.path.join(data_dir, "videos")

    if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
        print("Videos directory is empty — downloading and extracting...\n")
        download_and_extract_lemonade_videos(data_dir)

    video_filename = doc["Clip"] + "_hololens.mp4"

    video_path = os.path.join(videos_dir, video_filename)

    if os.path.exists(video_path):
        start = int(doc["Start"])
        end = int(doc["End"])
        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
    else:
        raise FileNotFoundError(f"Video file not found: {video_path}")

    return frames

Contributor

🛠️ Refactor suggestion

Add type hints and improve error handling

The function needs type hints and has multiple issues with error handling.

Apply this diff to add type hints and use more specific imports:

-def lemonade_doc_to_visual(doc):
+from typing import Any
+
+def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[Image.Image]:

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

91-91: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 75 to 94, add type hints and
tighten error handling: annotate the function signature (e.g., def
lemonade_doc_to_visual(doc: Dict[str, Any]) -> List[Any]) and import specific
types (from typing import Dict, Any, List; from pathlib import Path); replace
os.path joins with Path objects for data_dir/videos and video_path, validate
required keys ("Clip","Start","End") exist in doc and raise KeyError with clear
message if missing, validate that Start/End are integers and that Start < End
(raise ValueError if not), wrap download_and_extract_lemonade_videos and
load_video calls in try/except to re-raise more informative exceptions, and on
missing video raise FileNotFoundError with the Path shown; ensure the function
returns the appropriately typed frames list.

Comment on lines 213 to 271
def lemonade_aggregate_results(results):
    def compute_accuracy(grouped_results):
        acc_dict = {}
        for key, samples in grouped_results.items():
            correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
            total = len(samples)
            acc = round(correct / total, 5) if total > 0 else 0.0
            stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
            acc_dict[key] = {
                "num": total,
                "acc": acc,
                "acc_stderr": stderr,
            }
        return acc_dict

    qid_results = defaultdict(list)
    category_results = defaultdict(list)
    subcategory_results = defaultdict(list)
    difficulty_results = defaultdict(list)

    valid_results = [r for r in results if r["parsed_pred"] != "API Error"]

    for r in valid_results:
        qid_results[r["QID"]].append(r)
        category_results[r["category"]].append(r)
        subcategory_results[r["subcategory"]].append(r)
        difficulty_results[r["difficulty"]].append(r)

    qid_acc = compute_accuracy(qid_results)
    category_acc = compute_accuracy(category_results)
    subcategory_acc = compute_accuracy(subcategory_results)
    difficulty_acc = compute_accuracy(difficulty_results)

    total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
    total = len(valid_results)
    overall_acc = round(total_correct / total, 5) if total > 0 else 0.0
    overall_stderr = round(np.sqrt(overall_acc * (1 - overall_acc) / total), 5) if total > 0 else 0.0

    print("\nResults:")

    print("\nAccuracy per QID:")
    for k, v in qid_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Category:")
    for k, v in category_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Subcategory:")
    for k, v in subcategory_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print("\nAccuracy per Difficulty:")
    for k, v in difficulty_acc.items():
        print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")

    print(f"\nOverall Accuracy: {overall_acc} ± {overall_stderr} ({total} examples)")

    return overall_acc
Contributor

🛠️ Refactor suggestion

Add type hints to function and nested function

Both the main function and the nested compute_accuracy function need type hints.

Apply this diff to add type hints:

+from typing import Any
+
-def lemonade_aggregate_results(results):
-    def compute_accuracy(grouped_results):
+def lemonade_aggregate_results(results: list[dict[str, Any]]) -> float:
+    def compute_accuracy(grouped_results: dict[str, list[dict[str, Any]]]) -> dict[str, dict[str, Any]]:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+from typing import Any
+
-def lemonade_aggregate_results(results):
-    def compute_accuracy(grouped_results):
+def lemonade_aggregate_results(results: list[dict[str, Any]]) -> float:
+    def compute_accuracy(grouped_results: dict[str, list[dict[str, Any]]]) -> dict[str, dict[str, Any]]:
         acc_dict = {}
         for key, samples in grouped_results.items():
             correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
             total = len(samples)
             acc = round(correct / total, 5) if total > 0 else 0.0
             stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
             acc_dict[key] = {
                 "num": total,
                 "acc": acc,
                 "acc_stderr": stderr,
             }
         return acc_dict
     qid_results = defaultdict(list)
     category_results = defaultdict(list)
     subcategory_results = defaultdict(list)
     difficulty_results = defaultdict(list)
     valid_results = [r for r in results if r["parsed_pred"] != "API Error"]
     for r in valid_results:
         qid_results[r["QID"]].append(r)
         category_results[r["category"]].append(r)
         subcategory_results[r["subcategory"]].append(r)
         difficulty_results[r["difficulty"]].append(r)
     qid_acc = compute_accuracy(qid_results)
     category_acc = compute_accuracy(category_results)
     subcategory_acc = compute_accuracy(subcategory_results)
     difficulty_acc = compute_accuracy(difficulty_results)
     total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
     total = len(valid_results)
     overall_acc = round(total_correct / total, 5) if total > 0 else 0.0
     overall_stderr = round(np.sqrt(overall_acc * (1 - overall_acc) / total), 5) if total > 0 else 0.0
     print("\nResults:")
     print("\nAccuracy per QID:")
     for k, v in qid_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Category:")
     for k, v in category_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Subcategory:")
     for k, v in subcategory_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print("\nAccuracy per Difficulty:")
     for k, v in difficulty_acc.items():
         print(f" {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
     print(f"\nOverall Accuracy: {overall_acc} ± {overall_stderr} ({total} examples)")
     return overall_acc
🤖 Prompt for AI Agents
In lmms_eval/tasks/lemonade/utils.py around lines 213 to 271, the top-level
function lemonade_aggregate_results and its nested compute_accuracy lack type
hints; add typing imports and annotate: import List, Dict, Any, Union from
typing; annotate lemonade_aggregate_results(results: List[Dict[str, Any]]) ->
float and compute_accuracy(grouped_results: Dict[Any, List[Dict[str, Any]]]) ->
Dict[Any, Dict[str, Union[int, float]]]; also annotate local variables/returns
where helpful (e.g., acc_dict: Dict[Any, Dict[str, Union[int, float]]]) so the
function signature and nested helper are fully typed.

Comment on lines 9 to 20

MAX_NUM_FRAMES = 8

LEMONADE_ZIP_NAMES = [
    "videos_batch_0.zip",
    "videos_batch_1.zip",
    "videos_batch_2.zip",
    "videos_batch_3.zip",
    "videos_batch_4.zip",
]

data_dir = "./data/lemonade"
Collaborator

The data path for video evaluation should not be hardcoded

Comment on lines 75 to 93
def lemonade_doc_to_visual(doc):
    videos_dir = os.path.join(data_dir, "videos")

    if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
        print("Videos directory is empty — downloading and extracting...\n")
        download_and_extract_lemonade_videos(data_dir)

    video_filename = doc["Clip"] + "_hololens.mp4"

    video_path = os.path.join(videos_dir, video_filename)

    if os.path.exists(video_path):
        start = int(doc["Start"])
        end = int(doc["End"])
        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
    else:
        raise FileNotFoundError(f"Video file not found: {video_path}")

    return frames
Collaborator

The doc_to_visual for a video task should not return frames. Instead, it should return the list of video paths.

Author

Due to the long duration of the videos and the repetitiveness/overlap of the video segments, it is wiser to avoid chunking the video prior to access, to prevent memory redundancy. This framework (outputting frames) was inspired by another task in your repository: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/tasks/plm_videobench/rcap/rcap_utils.py.

Collaborator

It would be okay if you always want to hand-sample your video frames for every model under test, and we would accept the merge like that. But we would suggest returning a video path instead of video frames as multiple images, so that video chunking can be done in the model generation itself (for example using the general API we provide) and the FPS and max frames can be set by the user.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

20-21: Do not hardcode data path; respect env override.

Use an environment variable so users can point to pre-downloaded data (requested previously).

-DEFAULT_DATA_DIR = "./data/lemonade"
+DEFAULT_DATA_DIR = os.getenv("LEMONADE_DATA_DIR", "./data/lemonade")

105-133: doc_to_visual should return video paths, not decoded frames (integration mismatch).

Per prior review and typical lmms-eval video interface, pass paths to the runner/model instead of materializing frames here. This also avoids heavy I/O and memory use.

-def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[Image.Image]:
-    """
-    Load video frames for a given entry in the LEMONADE dataset.
- 
-    Args:
-        doc: A dictionary representing an entry in the dataset.
-    Returns:
-        frames: List of PIL Image objects representing sampled frames
-    """
-
-    videos_dir = os.path.join(DEFAULT_DATA_DIR, "videos")
+def lemonade_doc_to_visual(doc: dict[str, Any]) -> list[str]:
+    """
+    Resolve the local video path(s) for the given LEMONADE example.
+
+    Args:
+        doc: One dataset record (expects keys: "Clip", "Start", "End").
+    Returns:
+        A list with a single video path string.
+    """
+
+    videos_dir = os.path.join(DEFAULT_DATA_DIR, "videos")
     if not os.path.exists(videos_dir) or len(os.listdir(videos_dir)) == 0:
         print("Videos directory is empty — downloading and extracting...\n")
         download_and_extract_lemonade_videos(DEFAULT_DATA_DIR)
 
     video_filename = doc["Clip"] + "_hololens.mp4"
     video_path = os.path.join(videos_dir, video_filename)
 
-    if os.path.exists(video_path):
-        start = int(doc["Start"])
-        end = int(doc["End"])
-        frames = load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)
-    else:
-        raise FileNotFoundError(
-            f"Video file not found: {video_path}. "
-            f"Expected video for clip '{doc['Clip']}' at {video_path}"
-        )
-    return frames
+    if not os.path.exists(video_path):
+        raise FileNotFoundError(f"Video not found: {video_path}")
+    # Return path(s); downstream components handle decoding/sampling.
+    return [video_path]

To confirm the expected return type across tasks, run:

#!/bin/bash
# Inspect other video tasks' doc_to_visual signatures/returns.
rg -nP --type=py -C2 '\bdef\s+.*doc_to_visual\s*\('
🧹 Nitpick comments (5)
lmms_eval/tasks/lemonade/utils.py (5)

1-11: Clean up and sort imports; remove unused yaml (Ruff I001/unused-imports).

Keeps the file compliant with the repo's linting rules.

-import ast
-import os
-import cv2
-import numpy as np
-import yaml
-import zipfile
-from collections import defaultdict
-from PIL import Image
-from typing import Any, Optional
-from huggingface_hub import hf_hub_download
+import ast
+import os
+import zipfile
+from collections import defaultdict
+from typing import Any, Optional
+
+import cv2
+import numpy as np
+from PIL import Image
+from huggingface_hub import hf_hub_download

38-43: Allow HF cache override via env var.

Enables shared cache on clusters without modifying code.

-        zip_path = hf_hub_download(
+        zip_path = hf_hub_download(
             repo_id="amathislab/LEMONADE",
             filename=zip_name,
             repo_type="dataset",
-            cache_dir=os.path.join(data_dir, "cache") 
+            cache_dir=os.getenv("HF_HUB_CACHE", os.path.join(data_dir, "cache")),
         )

135-143: Wrap long function signature to respect 88-char limit.

-def lemonade_doc_to_text(doc: dict[str, Any], lmms_eval_specific_kwargs: Optional[dict[str, Any]] = None) -> str:
+def lemonade_doc_to_text(
+    doc: dict[str, Any],
+    lmms_eval_specific_kwargs: Optional[dict[str, Any]] = None,
+) -> str:

170-171: Shorten exception message to satisfy Ruff TRY003.

-    if not isinstance(options, list):
-        raise TypeError(f"Expected list of options, got {type(options)}: {options}")
+    if not isinstance(options, list):
+        raise TypeError(f"options must be list[str], got {type(options)}")

303-315: Type-hint nested helper and remove unnecessary list allocations.

Improves readability and minor perf in aggregations.

-    def compute_accuracy(grouped_results):
-        acc_dict = {}
-        for key, samples in grouped_results.items():
-            correct = sum([r["parsed_pred"] == r["answer"] for r in samples])
+    def compute_accuracy(
+        grouped_results: dict[str, list[dict[str, Any]]],
+    ) -> dict[str, dict[str, float | int]]:
+        acc_dict: dict[str, dict[str, float | int]] = {}
+        for key, samples in grouped_results.items():
+            correct = sum(r["parsed_pred"] == r["answer"] for r in samples)
             total = len(samples)
             acc = round(correct / total, 5) if total > 0 else 0.0
             stderr = round(np.sqrt(acc * (1 - acc) / total), 5) if total > 0 else 0.0
             acc_dict[key] = {
                 "num": total,
                 "acc": acc,
                 "acc_stderr": stderr,
             }
         return acc_dict
@@
-    total_correct = sum([r["parsed_pred"] == r["answer"] for r in valid_results])
+    total_correct = sum(r["parsed_pred"] == r["answer"] for r in valid_results)

Optional: gate the many print()s behind an env flag (e.g., LEMONADE_VERBOSE) or switch to logging.

Also applies to: 335-338

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8c8d3c and 060935d.

📒 Files selected for processing (1)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

128-131: Avoid specifying long messages outside the exception class

(TRY003)


171-171: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
lmms_eval/tasks/lemonade/utils.py (1)

151-155: Good security hardening: replaced eval() with ast.literal_eval.

This removes code-execution risk on dataset-provided strings.

Collaborator

kcz358 commented Sep 9, 2025

Hi, thank you for the PR. I think the major issue is that we would like your downloaded data to be placed in the snapshot cache, with the cache folder always settable via HF_HOME, instead of directly using a relative ./data path. Can you please check other video tasks to see how to achieve that? Thanks!
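For reference, the pattern used by other video tasks looks roughly like this (a minimal sketch; the lemonade_videos subfolder name is an assumption):

import os

# Resolve the base cache dir from HF_HOME, falling back to the default HF cache.
hf_home = os.getenv("HF_HOME", "~/.cache/huggingface")
base_cache_dir = os.path.expanduser(hf_home)

# Task-specific subfolder under the shared cache (name is illustrative).
videos_dir = os.path.join(base_cache_dir, "lemonade_videos")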

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

47-80: Zero-division and unreadable-video pitfalls in load_video (duplicate of prior feedback)

Division by zero when num_frames == 0 and missing checks for cap.isOpened() / total_frames <= 0 can crash runs on bad inputs. This was flagged earlier and is still present.

Apply this diff to harden the function and satisfy robustness guidelines:

-def load_video(video_file: str, start_frame: int, end_frame: int, max_num_frames: int = MAX_NUM_FRAMES) -> list[Image.Image]:
+def load_video(
+    video_file: str,
+    start_frame: int,
+    end_frame: int,
+    max_num_frames: int = MAX_NUM_FRAMES,
+) -> list[Image.Image]:
@@
-    cap = cv2.VideoCapture(video_file)
-    try: 
-        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
-        start_frame = max(0, start_frame)
-        end_frame = min(end_frame, total_frames - 1)
-        total_valid_frames = end_frame - start_frame + 1
-        num_frames = min(max_num_frames, total_valid_frames)
-        step = total_valid_frames / num_frames
+    cap = cv2.VideoCapture(video_file)
+    try:
+        if not cap.isOpened():
+            raise FileNotFoundError(f"Cannot open video: {video_file}")
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+        if total_frames <= 0:
+            return []
+        start_frame = max(0, start_frame)
+        end_frame = min(end_frame, total_frames - 1)
+        if end_frame < start_frame:
+            raise ValueError(f"Invalid frame range: start={start_frame}, end={end_frame}")
+        total_valid_frames = end_frame - start_frame + 1
+        num_frames = min(max_num_frames, total_valid_frames)
+        if num_frames <= 0:
+            return []
+        step = total_valid_frames / float(num_frames)

Also reformatted the signature to respect the 88-char limit.


289-311: Add type hints to nested compute_accuracy (still missing)

Guidelines require type hints everywhere; nested function currently lacks them.

-    def compute_accuracy(grouped_results):
+    def compute_accuracy(
+        grouped_results: dict[str, list[dict[str, Any]]]
+    ) -> dict[str, dict[str, float | int]]:
🧹 Nitpick comments (6)
lmms_eval/tasks/lemonade/utils.py (6)

102-129: Return video path(s) instead of decoded frames to let models control FPS/chunking

Maintainer requested returning the video path so model backends can chunk/sample consistently with the general API. Current code eagerly decodes frames, which fixes FPS/max-frames policy in the task layer and duplicates memory.

Option A (preferred): Add a parallel helper and switch lemonade.yaml to it:

def lemonade_doc_to_video_paths(doc: dict[str, Any]) -> list[str]:
    if not os.path.exists(videos_dir) or not os.listdir(videos_dir):
        download_and_extract_lemonade_videos("")  # path unused with snapshot cache
    video_filename = doc["Clip"] + "_hololens.mp4"
    video_path = os.path.join(videos_dir, video_filename)
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video file not found: {video_path}")
    return [video_path]

Option B: Gate behavior with an env flag (backwards compatible):

# if os.getenv("LEMONADE_RETURN_VIDEO_PATHS") == "1": return [video_path]
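Spelled out, Option B might look like this (a sketch using the flag name suggested above; the surrounding logic mirrors the existing function and is assumed, not final):

def lemonade_doc_to_visual(doc: dict[str, Any]) -> list:
    video_path = os.path.join(videos_dir, doc["Clip"] + "_hololens.mp4")
    if not os.path.exists(video_path):
        raise FileNotFoundError(f"Video not found: {video_path}")
    # Opt-in: return the raw path and let the model backend sample frames.
    if os.getenv("LEMONADE_RETURN_VIDEO_PATHS") == "1":
        return [video_path]
    # Default: keep current behavior and return decoded frames.
    start, end = int(doc["Start"]), int(doc["End"])
    return load_video(video_path, start, end, max_num_frames=MAX_NUM_FRAMES)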

Would you like me to update lemonade.yaml accordingly?


124-127: Shorten exception message to satisfy TRY003 and the 88-char limit

The multi-line f-string in the exception violates style rules and is verbose.

-        raise FileNotFoundError(
-            f"Video file not found: {video_path}. "
-            f"Expected video for clip '{doc['Clip']}' at {video_path}"
-        )
+        raise FileNotFoundError(f"Missing video for clip '{doc['Clip']}': {video_path}")

131-152: Docstring OK; consider guarding literal_eval and normalizing whitespace

Literal parsing may fail if Answers is malformed; also trimming leading/trailing whitespace improves robustness.

-    parsed_options = parse_options(ast.literal_eval(doc["Answers"]))
+    try:
+        answers = ast.literal_eval(str(doc["Answers"]).strip())
+    except (ValueError, SyntaxError) as exc:
+        raise ValueError(f"Malformed Answers field for QID={doc.get('QID')}") from exc
+    parsed_options = parse_options(answers)

179-260: Heuristics OK; minor safety on argmax when candidates have only unfound matches

If all start_indexes are -1 (no substring actually found), np.argmax will pick the first entry arbitrarily.

-        pred_index = candidates[np.argmax(start_indexes)]
+        best = int(np.argmax(start_indexes))
+        if start_indexes[best] < 0:
+            return "A"
+        pred_index = candidates[best]

336-355: Prefer logging over prints for metrics summaries

Printing inside aggregation can spam stdout for programmatic runs. Route through logging so callers can control verbosity.

+import logging
+logger = logging.getLogger(__name__)
@@
-    print("\nResults:")
+    logger.info("Results:")
@@
-        print(f"  {k}: {v['acc']} ± {v['acc_stderr']} ({v['num']} examples)")
+        logger.info("  %s: %s ± %s (%s examples)", k, v["acc"], v["acc_stderr"], v["num"])

112-115: Ensure download() return is used; avoid stale global videos_dir

If download_and_extract_lemonade_videos returns a path (see earlier diff), use it here to avoid relying on a precomputed global.

-        download_and_extract_lemonade_videos(base_cache_dir)
+        extracted_dir = download_and_extract_lemonade_videos(base_cache_dir)
+        # Prefer the returned path in case cache location changed.
+        global videos_dir
+        videos_dir = extracted_dir
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 060935d and 50d70da.

📒 Files selected for processing (1)
  • lmms_eval/tasks/lemonade/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

  • lmms_eval/tasks/lemonade/utils.py
🪛 Ruff (0.12.2)
lmms_eval/tasks/lemonade/utils.py

124-127: Avoid specifying long messages outside the exception class

(TRY003)


167-167: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (2)
lmms_eval/tasks/lemonade/utils.py (2)

154-177: LGTM: typed and replaces assert with explicit error handling

Mapping options to labels is straightforward and compliant with guidelines.


18-46: Make data_dir parameter functional and extract into HF snapshot cache

  • The data_dir argument isn’t used beyond creating the folder; extraction always writes to the module-global videos_dir. Refactor to call snapshot_download(repo_id="amathislab/LEMONADE"), set videos_dir = os.path.join(snapshot_dir, "videos"), extract zips there, and return the path.
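A rough sketch of that refactor, assuming huggingface_hub.snapshot_download and the videos_batch_*.zip layout described earlier; the glob pattern and return handling are assumptions:

import os
import zipfile

from huggingface_hub import snapshot_download


def download_and_extract_lemonade_videos() -> str:
    # Fetch only the video zips into the HF snapshot cache (respects HF_HOME).
    snapshot_dir = snapshot_download(
        repo_id="amathislab/LEMONADE",
        repo_type="dataset",
        allow_patterns=["videos_batch_*.zip"],
    )
    videos_dir = os.path.join(snapshot_dir, "videos")
    os.makedirs(videos_dir, exist_ok=True)
    # Extract every downloaded zip next to the snapshot, then return the path.
    for name in sorted(os.listdir(snapshot_dir)):
        if name.endswith(".zip"):
            with zipfile.ZipFile(os.path.join(snapshot_dir, name), "r") as zf:
                zf.extractall(videos_dir)
    return videos_dir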

@TashkovskaMatea

Hi, thank you for the feedback. We've updated the code so that the downloaded data is now placed in the snapshot cache, following the same approach used in other video tasks. Looking forward to your feedback on this update.

Comment on lines 18 to 45
def download_and_extract_lemonade_videos(data_dir: str) -> None:
    """
    Download and extract LEMONADE files from Hugging Face into a local data directory.

    Args:
        data_dir: Directory that stores the files.
    Returns:
        None
    """

    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(videos_dir, exist_ok=True)
    print(f"Creating videos directory at {videos_dir}...")

    for zip_name in LEMONADE_ZIP_NAMES:
        print(f"Downloading {zip_name} from Hugging Face...")
        zip_path = hf_hub_download(
            repo_id="amathislab/LEMONADE",
            filename=zip_name,
            repo_type="dataset",
            local_dir=os.path.join(base_cache_dir, "lemonade_zips"),
            local_dir_use_symlinks=False,
            resume_download=True,
        )
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(videos_dir)

    print("All videos downloaded and extracted successfully.\n")
Collaborator

I would suggest this should not be included, as the download part is already being done in api/task.py; we handle the snapshot download from the Hub there.


Thank you for the suggestion; we removed the custom download function. The dataset now relies on the existing download/unzip logic in api/task.py.

Comment on lines 11 to 12
MAX_NUM_FRAMES = 8
LEMONADE_ZIP_NAMES = [f"videos_batch_{i}.zip" for i in range(5)]
Collaborator

Suggest not hardcoding this; otherwise it would be really strange for a video task to use only 8 frames. Or is this intended?


We removed the hardcoded value, and max_num_frames is now configurable through the task YAML. We can't set this at model generation time since frame extraction happens during video loading (due to the long duration of the videos and the overlap/repetitiveness of segments).
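For context, a hedged sketch of the module-level YAML read that several lmms-eval tasks use to expose such settings; the exact key under which max_num_frames lives in lemonade.yaml is an assumption here:

from pathlib import Path

import yaml

# Read the task YAML at import time, skipping the !function constructor lines
# that plain yaml.safe_load cannot parse.
with open(Path(__file__).parent / "lemonade.yaml", "r") as f:
    raw_lines = f.readlines()

config = yaml.safe_load("".join(line for line in raw_lines if "!function" not in line))
MAX_NUM_FRAMES = (
    config.get("lmms_eval_specific_kwargs", {}).get("default", {}).get("max_num_frames", 8)
)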

Collaborator

@kcz358 kcz358 left a comment

Hi, I would say it would be fine if you want to hardcode the download logic and the video loading logic for your benchmark, since it would not disturb the overall pipeline of the framework. But I would again suggest leaving this functionality to be handled by the other components.

Feel free to add a README in your dataset folder on how to use it, along with any comments on the implementation. If you feel good about all of your changes, I will merge it.

@TashkovskaMatea

Thanks again for the helpful suggestions - we’ve made the changes and added a README. All set for merge on our side!
