[Cookbook Recipe of Using Rubric] #166
Conversation
Claude finished @ruiqi-zhong's task in 1m 9s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured example demonstrating rubric-based LLM grading for RL training on an addition task. The code is pedagogically clear and follows the cookbook's established patterns. Below are my observations:
✅ Strengths
🔧 Suggestions for Improvement
1. Duplicate import in
Claude finished @ruiqi-zhong's task in 1m 21s. Reviewing PR #166: Very simple example of using rubric
Summary
This PR adds a well-structured pedagogical example demonstrating rubric-based LLM grading for RL training. The code follows the cookbook's established patterns (builder pattern,
🐛 Bug Found
```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            data = json.loads(line)  # data is a dict
            datapoints.append(RubricBasedDatapoint.from_json(data))  # but from_json expects str
    return datapoints
```
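A minimal fix, assuming `from_json` expects the raw JSON string as the review describes, is to pass each line through directly:

```python
def __call__(self) -> Sequence[RubricBasedDatapoint]:
    datapoints = []
    with open(self.jsonl_path, "r") as f:
        for line in f:
            # from_json parses the JSON string itself, so pass the raw line
            datapoints.append(RubricBasedDatapoint.from_json(line))
    return datapoints
```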
✅ Strengths
🔧 Minor Suggestions
📝 Notes
Verdict
The PR is a solid pedagogical addition. The type mismatch bug in the JSONL loader shown above should be fixed.
```python
    \n\nChatbot: ...
    """
    return "\n\n".join(
        [f"{self.__convert_role(message['role'])}: {message['content']}" for message in convo]
    )
```
double underscore methods have a special meaning in Python (name mangling) -- why not use a single underscore?
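For context, a short sketch of the difference (the `Formatter` class is illustrative, not the PR's code): a leading double underscore triggers name mangling, while a single underscore is only a naming convention.

```python
class Formatter:
    def __convert_role(self, role: str) -> str:
        # Name-mangled: accessible externally only as _Formatter__convert_role
        return {"user": "Human", "assistant": "Chatbot"}.get(role, role)

    def _convert_role(self, role: str) -> str:
        # Single underscore: plain "internal use" convention, no mangling
        return {"user": "Human", "assistant": "Chatbot"}.get(role, role)

f = Formatter()
f._convert_role("user")             # "Human"
f._Formatter__convert_role("user")  # works, but f.__convert_role raises AttributeError
```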
```python
    RubricDatapointListBuilder,
)

# ANSI color codes
```
you can use the termcolor library
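For example, a minimal sketch of the suggestion (termcolor is a third-party package, installable with `pip install termcolor`):

```python
from termcolor import colored

# Replaces hand-maintained ANSI escape codes
print(colored("policy response", "green"))
print(colored("grader output", "cyan", attrs=["bold"]))
```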
```python
for i in range(1, 6):
    rubric_text += f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}\n"

rubric_text += f"\nHere is a reference response that achieved a score of 5: {item['orig_reference_answer']}\n"
```
some inconsistency in whether you put the newline before or after each line. how about just keeping a list of lines, and joining them at the end?
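A sketch of that refactor, reusing the field names from the diff above:

```python
lines = []
for i in range(1, 6):
    lines.append(f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}")
lines.append("")  # blank line separating the scores from the reference response
lines.append(f"Here is a reference response that achieved a score of 5: {item['orig_reference_answer']}")
rubric_text = "\n".join(lines)
```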
```python
service_client = tinker.ServiceClient()
policy = TinkerTokenCompleter(
    sampling_client=service_client.create_sampling_client(base_model=policy_name),
    max_tokens=64,
)
```
nit: pull this to the top of the function, so arbitrary parameter values are all together
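For example (a sketch; the wrapping function and model name are our assumptions, not the PR's code):

```python
import tinker  # TinkerTokenCompleter comes from the cookbook's completers module

def build_policy():
    # Arbitrary parameter values, grouped at the top of the function
    policy_name = "Qwen/Qwen3-8B"  # hypothetical model name
    max_tokens = 64

    service_client = tinker.ServiceClient()
    return TinkerTokenCompleter(
        sampling_client=service_client.create_sampling_client(base_model=policy_name),
        max_tokens=max_tokens,
    )
```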
```python
    Create a prompt for the grader to grade the conversation based on the rubric. The prompt
    should contain 1) the conversation to be graded, and 2) the rubric.
    """

    prompt = "I will show you 1) a conversation between a human and a chatbot, and 2) a rubric for grading the conversation. Please grade the conversation based on the rubric."
```
this prompt doesn't say anything indicating that the grading should focus on the last assistant message, right? how about separating the prompt convo (which is fixed) from the completion (which the grader should look at)?
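A sketch of what that separation might look like (the wording and the helper function are illustrative, not the PR's actual code):

```python
def make_grader_prompt(convo_prefix: str, completion: str, rubric_text: str) -> str:
    # Illustrative only: tell the grader explicitly which part to grade
    return (
        "I will show you 1) the fixed conversation context, 2) the chatbot's final "
        "response to be graded, and 3) a rubric. Grade ONLY the final response.\n\n"
        f"Conversation context:\n{convo_prefix}\n\n"
        f"Response to grade:\n{completion}\n\n"
        f"Rubric:\n{rubric_text}"
    )
```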
```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs up steadily.
it would be nice to show that the model is learning something meaningful
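One way to check this (a sketch under our own assumptions, not part of the PR): compare exact-match accuracy on the held-out test split before and after training, independent of the grader.

```python
import json
import re

def exact_match_accuracy(test_jsonl_path: str, generate_fn) -> float:
    """Fraction of held-out problems the policy answers exactly right.

    generate_fn is a hypothetical prompt -> response sampling wrapper;
    the "prompt"/"answer" field names are assumptions about the data format.
    """
    correct = total = 0
    with open(test_jsonl_path) as f:
        for line in f:
            dp = json.loads(line)
            match = re.search(r"-?\d+", generate_fn(dp["prompt"]))
            correct += bool(match and match.group() == str(dp["answer"]))
            total += 1
    return correct / total
```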
Rubric-based Grading for LLMs
- `data.py` contains the definition of the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- `generate_data.py` generates some example datapoints if you want to run our demo on addition.
- `env.py` determines what each rollout will do. It will let the policy read the prefix, generate a response, ask a grader LLM to grade based on a list of rubric items, and finally provide a reward by summing the responses of the graders.
- `train.py` allows you to train LLMs on any dataset saved in our format (specified in `data.py`). The default script will train on the addition task, whose data is generated by `generate_data.py`.
- `prometheus_experimental.py` contains a script to train LLMs based on the rubrics from the `prometheus-eval/Feedback-Collection` dataset. It is experimental, though -- even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point, and more research is needed.

A simple example of using a grader LLM with rubrics
We show how to use a rubric-based grader LLM to provide reward on an addition task.
Usually, the response could be graded by matching the number against the ground truth 333 without needing an LLM. However, for pedagogical purposes we will grade the response using a language model with a rubric, i.e. we will ask a language model "Does the assistant answer 333?"
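To make this concrete, here is a hypothetical datapoint in the format described above (field names and operands are our invention; see the generated files for the actual schema):

```python
# Hypothetical: a conversation prefix plus a list of rubric items
datapoint = {
    "convo_prefix": [{"role": "user", "content": "What is 150 + 183?"}],
    "rubric_items": ["Does the assistant answer 333?"],
}
```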
Generate an example dataset
To run this, first generate a dataset:
Then you will see two `jsonl` files generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of a conversation prefix and a list of rubric items.
Debugging and Printing What Happens During Rollouts
Run
You can see the message that the policy sees, its response, the grader input, and the grader output.
An example training run
To train the LLM on addition using a rubric-based grader LLM, run
You can see the reward quickly goes up.
A more realistic dataset
We take the `prometheus-eval/Feedback-Collection` dataset from Hugging Face, which contains rubrics for grading general chat responses. Run `python -m tinker_cookbook.recipes.rubric.prometheus_experimental` to kick off training. We can see that the reward climbs up steadily.
Note that this training recipe is experimental -- to make the performance better, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for you to improve rubric-based grading for training LLMs!