[Cookbook Recipe of Using Rubric] #166
Conversation
Claude finished @ruiqi-zhong's task in 1m 9s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured example demonstrating rubric-based LLM grading for RL training on an addition task. The code is pedagogically clear and follows the cookbook's established patterns. Below are my observations:
✅ Strengths
🔧 Suggestions for Improvement
1. Duplicate import in
Claude finished @ruiqi-zhong's task in 1m 21s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured pedagogical example demonstrating rubric-based LLM grading for RL training. The code follows the cookbook's established patterns (builder pattern,
🐛 Bug Found
```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            data = json.loads(line)  # data is a dict
            datapoints.append(RubricBasedDatapoint.from_json(data))  # but from_json expects str
    return datapoints
```
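A minimal fix, assuming `from_json` expects the raw JSON string as the review describes, is to pass each line through directly:

```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            # from_json parses the JSON string itself, so pass the raw line
            datapoints.append(RubricBasedDatapoint.from_json(line))
    return datapoints
```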
✅ Strengths
🔧 Minor Suggestions
📝 Notes
Verdict
The PR is a solid pedagogical addition. The type mismatch bug in the JSONL loader shown above should be fixed.
```python
    \n\nChatbot: ...
    """
    return "\n\n".join(
        [f"{self.__convert_role(message['role'])}: {message['content']}" for message in convo]
    )
```
double underscore methods have a special meaning in Python (name mangling) -- why not use a single underscore?
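For context, a short sketch of the difference (the `Formatter` class is illustrative, not the PR's code): a leading double underscore triggers name mangling, while a single underscore is only a naming convention.

```python
class Formatter:
    def __convert_role(self, role: str) -> str:
        # Name-mangled: accessible externally only as _Formatter__convert_role
        return {"user": "Human", "assistant": "Chatbot"}.get(role, role)

    def _convert_role(self, role: str) -> str:
        # Single underscore: plain "internal use" convention, no mangling
        return {"user": "Human", "assistant": "Chatbot"}.get(role, role)

f = Formatter()
f._convert_role("user")             # "Human"
f._Formatter__convert_role("user")  # works, but f.__convert_role raises AttributeError
```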
```python
    RubricDatapointListBuilder,
)

# ANSI color codes
```
you can use the termcolor library
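For example, a minimal sketch of the suggestion (termcolor is a third-party package, installable with `pip install termcolor`):

```python
from termcolor import colored

# Replaces hand-maintained ANSI escape codes
print(colored("policy response", "green"))
print(colored("grader output", "cyan", attrs=["bold"]))
```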
```python
for i in range(1, 6):
    rubric_text += f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}\n"

rubric_text += f"\nHere is a reference response that achieved a score of 5: {item['orig_reference_answer']}\n"
```
some inconsistency in whether you put the newline before or after each line. how about just keeping a list of lines, and joining them at the end?
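A sketch of that refactor, reusing the field names from the diff above:

```python
lines = []
for i in range(1, 6):
    lines.append(f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}")
lines.append("")  # blank line separating the scores from the reference response
lines.append(f"Here is a reference response that achieved a score of 5: {item['orig_reference_answer']}")
rubric_text = "\n".join(lines)
```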
```python
service_client = tinker.ServiceClient()
policy = TinkerTokenCompleter(
    sampling_client=service_client.create_sampling_client(base_model=policy_name),
    max_tokens=64,
)
```
nit: pull this to the top of the function, so arbitrary parameter values are all together
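For example (a sketch; the wrapping function and model name are our assumptions, not the PR's code):

```python
import tinker  # TinkerTokenCompleter comes from the cookbook's completers module

def build_policy():
    # Arbitrary parameter values, grouped at the top of the function
    policy_name = "Qwen/Qwen3-8B"  # hypothetical model name
    max_tokens = 64

    service_client = tinker.ServiceClient()
    return TinkerTokenCompleter(
        sampling_client=service_client.create_sampling_client(base_model=policy_name),
        max_tokens=max_tokens,
    )
```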
```python
    Create a prompt for the grader to grade the conversation based on the rubric. The prompt
    should contain 1) the conversation to be graded, and 2) the rubric.
    """

    prompt = "I will show you 1) a conversation between a human and a chatbot, and 2) a rubric for grading the conversation. Please grade the conversation based on the rubric."
```
this prompt doesn't say anything indicating that the grading should focus on the last assistant message, right? how about separating the prompt convo (which is fixed) from the completion (which the grader should look at)?
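A sketch of what that separation might look like (the wording and the helper function are illustrative, not the PR's actual code):

```python
def make_grader_prompt(convo_prefix: str, completion: str, rubric_text: str) -> str:
    # Illustrative only: tell the grader explicitly which part to grade
    return (
        "I will show you 1) the fixed conversation context, 2) the chatbot's final "
        "response to be graded, and 3) a rubric. Grade ONLY the final response.\n\n"
        f"Conversation context:\n{convo_prefix}\n\n"
        f"Response to grade:\n{completion}\n\n"
        f"Rubric:\n{rubric_text}"
    )
```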
```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs up steadily.
it would be nice to show that the model is learning something meaningful
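One way to check this (a sketch under our own assumptions, not part of the PR): compare exact-match accuracy on the held-out test split before and after training, independent of the grader.

```python
import json
import re

def exact_match_accuracy(test_jsonl_path: str, generate_fn) -> float:
    """Fraction of held-out problems the policy answers exactly right.

    generate_fn is a hypothetical prompt -> response sampling wrapper;
    the "prompt"/"answer" field names are assumptions about the data format.
    """
    correct = total = 0
    with open(test_jsonl_path) as f:
        for line in f:
            dp = json.loads(line)
            match = re.search(r"-?\d+", generate_fn(dp["prompt"]))
            correct += bool(match and match.group() == str(dp["answer"]))
            total += 1
    return correct / total
```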
Rubric-based Grading for LLMs
- `data.py` contains the definition of the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- `generate_data.py` generates some example datapoints if you want to run our demo on addition.
- `env.py` determines what each rollout will do. It will let the policy read the prefix, generate a response, ask a grader LLM to grade based on a list of rubric items, and finally provide a reward by summing the responses of the graders.
- `train.py` allows you to train LLMs on any dataset saved in our format (specified in `data.py`). The default script will train on the addition task, whose data is generated by `generate_data.py`.
- `prometheus_experimental.py` contains a script to train LLMs based on the rubrics from the `prometheus-eval/Feedback-Collection` dataset. It is experimental, though -- even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point, and more research is needed.

A simple example of using a grader LLM with rubrics
We show how to use a rubric-based grader LLM to provide reward on an addition task.
Usually, the response could be graded by matching the number against the ground truth 333 without needing an LLM. However, for pedagogical purposes we will grade the response using a language model with a rubric, i.e. we will ask a language model "Does the assistant answer 333?"
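To make this concrete, here is a hypothetical datapoint in the format described above (field names and operands are our invention; see the generated files for the actual schema):

```python
# Hypothetical: a conversation prefix plus a list of rubric items
datapoint = {
    "convo_prefix": [{"role": "user", "content": "What is 150 + 183?"}],
    "rubric_items": ["Does the assistant answer 333?"],
}
```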
Generate an example dataset
To run this, first generate a dataset:
Then you will see two `jsonl` files generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of a conversation prefix and a list of rubric items.
Debugging and Printing What Happens During Rollouts
Run
You can see the message that the policy sees, its response, the grader input, and the grader output.
An example training run
To train the LLM on addition using a rubric-based grader LLM, run
You can see the reward quickly goes up.
A more realistic dataset
We take the `prometheus-eval/Feedback-Collection` dataset from Hugging Face, which contains rubrics for grading general chat responses. Run `python -m tinker_cookbook.recipes.rubric.prometheus_experimental` to kick off training. We can see that the reward climbs up steadily.
Note that this training recipe is experimental -- to make the performance better, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for you to improve rubric-based grading for training LLMs!