
Conversation

mgmorgan23

This PR updates how safety evaluations are invoked in submit_eval_jobs.py by:

  1. Rerouting them through oe-eval for consistency
  2. Adding a --run_safety_evaluations_reasoning option that runs the thinker safety eval suite through oe-eval
  3. Adding HF_TOKEN to the gantry args passed in oe-eval.sh (needed to use allenai/wildguard as a classifier in the safety evals; see the sketch below)

The Beaker image maliam/merge-safety-evals-0804-2 is compatible with this script; it was built from this branch: https://github.com/allenai/oe-eval-internal/tree/maliam-add-safety-eval and this fork: https://github.com/mgmorgan23/safety-eval-fork
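
As a hedged illustration of item 3 above, here is a minimal Python sketch of the idea, assuming the launcher collects gantry arguments before rendering the oe-eval command; the dict shape and the workspace-secret name are assumptions, not the actual oe-eval.sh interface:

# Illustrative only: the real change lives in scripts/eval/oe-eval.sh.
# The point is that the Hugging Face token must reach the eval job's
# environment so allenai/wildguard can be pulled from the Hub as the
# safety classifier.
gantry_args = {
    # Assumption: a Beaker workspace secret named HF_TOKEN holds the token;
    # gantry's --env-secret flag takes NAME=SECRET_NAME pairs.
    "env-secret": "HF_TOKEN=HF_TOKEN",
}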

Example call:
python scripts/submit_eval_jobs.py --model_name hf-open-thoughts-open-thinker3-7B --location open-thoughts/OpenThinker3-7B --is_tuned --evaluate_on_weka --workspace "tulu-3-results" --priority low --preemptible --beaker_image maliam/merge-safety-evals-0804-2 --use_hf_tokenizer_template --oe_eval_tasks "ifeval::tulu" --use_alternate_safety_image maliam/merge-safety-evals-0804-2 --skip_oi_evals --run_safety_evaluations_reasoning

@hamishivi (Collaborator)

If we are consolidating to oe-eval, is it possible to just remove the existing safety eval code and rely on the pre-existing oe-eval code instead? I understand the desire not to touch the old logic, but I think it's cleaner if we have one single oe-eval setup and don't do something special for safety evals.

And maybe you can add a SAFETY_EVAL preset like this: https://github.com/allenai/open-instruct/blob/main/scripts/eval/oe-eval.sh#L142
and edit here accordingly: https://github.com/allenai/open-instruct/blob/main/scripts/submit_eval_jobs.py#L113
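
For the preset, a hedged Python sketch of what the suite-to-tasks mapping could look like (the task ids and helper below are placeholders, not the actual contents of oe-eval.sh or submit_eval_jobs.py):

# Hypothetical preset mapping; the real definitions would live in
# scripts/eval/oe-eval.sh, with submit_eval_jobs.py dispatching on the name.
TASK_SUITES = {
    "SAFETY_EVAL": [
        "wildguardtest",   # placeholder task ids, not real oe-eval task names
        "harmbench",
    ],
    "SAFETY_EVAL_REASONING": [
        "wildguardtest:thinker",
        "harmbench:thinker",
    ],
}

def resolve_tasks(suite_or_task: str) -> list[str]:
    # A known suite name expands to its task list; anything else is
    # treated as a single task spec (e.g. "ifeval::tulu").
    return TASK_SUITES.get(suite_or_task, [suite_or_task])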

@mgmorgan23 (Author)

I updated the logic to use oe-eval task suites. The call for a reasoning model is now:

python scripts/submit_eval_jobs.py --model_name hf-open-thoughts-open-thinker3-7B --location open-thoughts/OpenThinker3-7B --is_tuned --evaluate_on_weka --workspace "tulu-3-results" --priority low --preemptible --beaker_image maliam/merge-safety-evals-0806 --use_hf_tokenizer_template --run_oe_eval_experiments --oe_eval_task_suite "SAFETY_EVAL_REASONING" --skip_oi_evals

And for a non-reasoning model:

python scripts/submit_eval_jobs.py --model_name hf-allenai-llama-3-tulu-2-8b --location allenai/llama-3-tulu-2-8b --is_tuned --evaluate_on_weka --workspace "tulu-3-results" --priority low --preemptible --beaker_image maliam/merge-safety-evals-0806 --use_hf_tokenizer_template --run_oe_eval_experiments --oe_eval_task_suite "SAFETY_EVAL" --skip_oi_evals

@hamishivi (Collaborator) left a comment
One minor comment. We should wait to merge this until the oe-eval side of things is done (and then test before merging).

# tested reasonably extensively with 70B
if num_gpus > 1:
    num_gpus *= 2
if args.oe_eval_task_suite == 'SAFETY_EVAL' or args.oe_eval_task_suite == 'SAFETY_EVAL_REASONING':
@hamishivi (Collaborator)

Discussed offline.
Let's remove the custom logic here and bring it in line with everything else, but for SAFETY_EVAL_REASONING specifically just double the num_gpus.
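
A minimal sketch of that suggestion, reusing the names from the snippet above (the helper is hypothetical, not the merged code):

def maybe_double_gpus(num_gpus: int, oe_eval_task_suite: str) -> int:
    # Hypothetical helper: the shared GPU sizing stays the same for every
    # suite, and only SAFETY_EVAL_REASONING doubles the count.
    if oe_eval_task_suite == "SAFETY_EVAL_REASONING":
        num_gpus *= 2
    return num_gpus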

@hamishivi (Collaborator) left a comment

Happy to merge this once the oe-eval PR is merged! It would also be super useful if you could do a quick test run with submit_eval_jobs before merging and link the successful running jobs (one regular, one reasoner).
