Draft
178 commits
bb00d91
add README
NicolasAG Jun 16, 2025
dc81770
increase env session inactivity timout
NicolasAG Jun 17, 2025
e60d4c1
update readme
NicolasAG Jun 17, 2025
f9e45c2
move miniwob to domains/
NicolasAG Jun 18, 2025
8cdbd06
fix
NicolasAG Jul 7, 2025
5510982
fix path
NicolasAG Jul 7, 2025
07e858c
return RuntimeError instead of HTTPException because not pickable
NicolasAG Jul 7, 2025
5e56896
add env_call_timeout
NicolasAG Jul 8, 2025
c06b768
update gpu fractions
NicolasAG Jul 8, 2025
b1ad285
set kl coef to 0
NicolasAG Jul 8, 2025
6bbe977
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Jul 8, 2025
c8ac64d
update max seq len
NicolasAG Jul 8, 2025
b87a6d1
revert to json instead of tool use agent
NicolasAG Jul 9, 2025
824d841
update README
NicolasAG Jul 9, 2025
8d170ec
debug overflow counter
NicolasAG Jul 10, 2025
21a1b2a
fix prompts
NicolasAG Jul 10, 2025
05b6794
update readme
NicolasAG Jul 11, 2025
ef6b2b0
flag tape as invalid instead of raising http errors
NicolasAG Jul 21, 2025
0abc2b0
use redis
NicolasAG Jul 21, 2025
d3f6889
track task names instead of data splits
NicolasAG Jul 21, 2025
9c319e3
fix
NicolasAG Jul 21, 2025
92c8a93
remove unused var in new tapeagent remote_env
NicolasAG Jul 22, 2025
edf4d00
use BaseMetrics
NicolasAG Jul 23, 2025
28749e0
fix
NicolasAG Jul 23, 2025
a4f9f79
keep track of time taken
NicolasAG Jul 23, 2025
8a6120f
send per step times to wandb
ollmer Jul 24, 2025
3d57d2e
processed_entries_queue_popped_data
AlexPiche Jul 25, 2025
4fbc5c7
faster preprocess
AlexPiche Jul 25, 2025
91acbc4
more logging
AlexPiche Jul 25, 2025
fb5a0bd
better namming
AlexPiche Jul 25, 2025
8c78c45
clean up
AlexPiche Jul 25, 2025
d1d1836
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Jul 25, 2025
5eb3a4e
use all miniwob tasks
NicolasAG Jul 25, 2025
1b90a4b
add groups_in_progress
AlexPiche Jul 26, 2025
3c8f338
raise when finetune is done
AlexPiche Jul 26, 2025
f88dceb
cte lr
AlexPiche Jul 27, 2025
75d3c9c
default save checkpoints
NicolasAG Jul 28, 2025
6b97c7b
update vllm max tokens
NicolasAG Jul 28, 2025
d3cf30b
assert group size is as expected
NicolasAG Jul 28, 2025
4c50f1f
assert finetuning length is as much as vllm max length
NicolasAG Jul 28, 2025
ff61d73
update finetuning & vllm max lengths
NicolasAG Jul 28, 2025
a00e6e6
debug agent
NicolasAG Jul 28, 2025
6f149c8
use ppo & upd config
NicolasAG Aug 8, 2025
2ae2dd8
update readme
NicolasAG Aug 8, 2025
913c8e2
stop training after 1k steps
NicolasAG Aug 11, 2025
812aafc
first mcp
AlexPiche Aug 15, 2025
ca8516b
fix the env server
AlexPiche Aug 16, 2025
f3af1bc
tweak prompt
AlexPiche Aug 16, 2025
5b10c33
upd
AlexPiche Aug 16, 2025
d2e6d09
clean up
AlexPiche Aug 18, 2025
228cb42
hard code dino
AlexPiche Aug 18, 2025
fdf3c83
less envs
AlexPiche Aug 18, 2025
1165397
less envs
AlexPiche Aug 18, 2025
40a144a
longer timeout
AlexPiche Aug 18, 2025
2d25d88
longer seq length
AlexPiche Aug 18, 2025
2036167
more envs
AlexPiche Aug 18, 2025
664b539
more llms per actor
AlexPiche Aug 18, 2025
4b0db03
even more envs
AlexPiche Aug 18, 2025
63d4092
longer timeout and revert prompt
AlexPiche Aug 18, 2025
6d81456
retry task
AlexPiche Aug 18, 2025
373b0ac
pid deno module
AlexPiche Aug 18, 2025
e2de768
diff deno tmp dir
AlexPiche Aug 18, 2025
763b594
none node modules
AlexPiche Aug 18, 2025
0783570
bigger timeout
AlexPiche Aug 19, 2025
b284fcb
diff temp dir for each mcp
AlexPiche Aug 19, 2025
eb48d90
0.0.0.0
AlexPiche Aug 19, 2025
efa2717
filter based on port
AlexPiche Aug 19, 2025
402eeb2
scale up env servers by llm_servers
NicolasAG Aug 20, 2025
58f31cc
reweight actor/trainer
NicolasAG Aug 20, 2025
4101d77
add massimo miniwob split
NicolasAG Aug 20, 2025
b00e476
cleanup
NicolasAG Aug 20, 2025
3d86a28
change port to 7778
AlexPiche Aug 21, 2025
0b56125
update agent reflection node
NicolasAG Aug 21, 2025
96a75c1
mcp and verify server
AlexPiche Aug 21, 2025
0b4c992
use custom parser
AlexPiche Aug 21, 2025
471d28d
relative path
AlexPiche Aug 21, 2025
8e0eeff
test apth
AlexPiche Aug 21, 2025
f93d756
typo
AlexPiche Aug 21, 2025
32e3eb6
clean up
AlexPiche Aug 21, 2025
5a3ab0e
clean up
AlexPiche Aug 21, 2025
436e233
rename domain to mcp
AlexPiche Aug 22, 2025
366263b
more envs
AlexPiche Aug 22, 2025
9b0a74c
towards massimo setup
NicolasAG Aug 22, 2025
e6e735d
Merge remote-tracking branch 'origin/main' into debug_miniwob
NicolasAG Aug 22, 2025
371be6e
less env replicas
AlexPiche Aug 22, 2025
1045868
Merge remote-tracking branch 'origin/debug_miniwob' into mcp_tir
AlexPiche Aug 22, 2025
05f7667
Merge remote-tracking branch 'origin/debug_miniwob' into mcp_tir
AlexPiche Aug 22, 2025
46b39d1
clean up tmp
AlexPiche Aug 22, 2025
af63f51
change mcp dir
AlexPiche Aug 22, 2025
55a96e5
bigger model len
AlexPiche Aug 22, 2025
dd0ea2b
typo
AlexPiche Aug 22, 2025
dc4052d
typo
AlexPiche Aug 23, 2025
bb4d0c5
clean up
AlexPiche Aug 26, 2025
ccdcd32
center reward
AlexPiche Aug 26, 2025
7f5ed95
running avg reward
AlexPiche Aug 26, 2025
88a0ee7
start from real mean
AlexPiche Aug 26, 2025
ef46f39
upd configs
NicolasAG Aug 28, 2025
1274748
upd
NicolasAG Aug 28, 2025
66bcfbd
Fix paths
rafapi Aug 28, 2025
3fcb847
Use relative path
rafapi Aug 28, 2025
b16d45c
revert reward calculation
NicolasAG Aug 28, 2025
9f239c6
Fix path
rafapi Aug 28, 2025
9e61c35
update massimo cfg to grpo
NicolasAG Aug 28, 2025
020a021
revert mktemp changes
rafapi Aug 28, 2025
ef884f2
test with ppo
NicolasAG Aug 28, 2025
4323f57
Fix deno paths
rafapi Aug 29, 2025
2b5e9f5
udt
rafapi Aug 29, 2025
565d25c
make the cache tag stable across all processes
rafapi Aug 29, 2025
e39ff7b
remove running avg
AlexPiche Aug 29, 2025
fc17df7
fix
rafapi Aug 30, 2025
115f629
Merge branch 'clean_up_running_avg' into mcp_tir
rafapi Sep 2, 2025
537ec7a
update configs
NicolasAG Sep 2, 2025
7a4e73f
add retry mechanism for agent loop
NicolasAG Sep 2, 2025
42e811e
add 30min timeout to rollout function
NicolasAG Sep 3, 2025
a4e8f5f
upd configs
NicolasAG Sep 5, 2025
95b735b
upd
NicolasAG Sep 5, 2025
8616303
upd configs
NicolasAG Sep 5, 2025
f4d8e0d
Avoid hot-spotting env; add extra metrics
rafapi Sep 5, 2025
23decf7
Print correct policy info
rafapi Sep 5, 2025
29118b7
Add aime2025
rafapi Sep 5, 2025
8882859
Test on aime2025
rafapi Sep 5, 2025
923cf6a
reduce n_env
NicolasAG Sep 6, 2025
44a033f
boost preprocess power
NicolasAG Sep 6, 2025
2918d1f
pop old data
NicolasAG Sep 6, 2025
dacaa1f
do not save playwright traces & screenshots
NicolasAG Sep 7, 2025
fcee5ee
return empty aggregate stats if empty stats
NicolasAG Sep 7, 2025
631389f
increase preprocessor power
NicolasAG Sep 7, 2025
f791211
better error handling
NicolasAG Sep 8, 2025
c54d900
fix
NicolasAG Sep 8, 2025
ea4918a
reduce timeouts
NicolasAG Sep 9, 2025
e5fca10
log number of groups done so far
NicolasAG Sep 12, 2025
df66a88
log everything if populate_rl_data fails
NicolasAG Sep 12, 2025
c8d0171
monitor env servers and reset if needed
NicolasAG Sep 12, 2025
981cd85
better health message
NicolasAG Sep 12, 2025
9c755ed
small fix
NicolasAG Sep 13, 2025
ea2d393
kl new old
AlexPiche Sep 22, 2025
eb7eb0d
loo
AlexPiche Sep 25, 2025
0b8a24d
better logs
NicolasAG Sep 26, 2025
1247360
Add new metrics
rafapi Sep 26, 2025
8cb5ef3
Merge remote-tracking branch 'origin/new_metrics' into mcp_tir
rafapi Sep 26, 2025
cd27e30
always check the worker before launching the agent on it + more detai…
NicolasAG Sep 26, 2025
f9ce99e
log stack trace
NicolasAG Sep 29, 2025
60fb042
small cleanup
NicolasAG Sep 29, 2025
61c91c7
Embedded envs
rafapi Sep 30, 2025
bd46a7d
Remove imports
rafapi Sep 30, 2025
724f318
sketch of new actor loop class, reuse most of the current one
ollmer Oct 1, 2025
b5c8d89
seq len 32k fits 1 h100, use qwen3-8b
ollmer Oct 1, 2025
b2fbc2b
debug entrypoint
ollmer Oct 1, 2025
550cb63
Increase shared_memory_entry_size
rafapi Oct 2, 2025
c13a71b
synchronous rollout policy
ollmer Oct 2, 2025
44d6fd4
fix import
ollmer Oct 2, 2025
053a532
llm benchmarking scripts
ollmer Oct 3, 2025
0f9bf6a
move to vllm 0.8.5 to support qwen3
ollmer Oct 6, 2025
44d0de4
launch mode to run inference llm only
ollmer Oct 6, 2025
81675bd
updated ray-based actor loop
ollmer Oct 6, 2025
d16222a
rollout debug
ollmer Oct 6, 2025
74857cf
llm benchmark scripts update
ollmer Oct 6, 2025
1332fc2
flag to control ray usage
ollmer Oct 6, 2025
2163313
mcp config with ray and local envs
ollmer Oct 6, 2025
e6f2329
update debug entrypoint
ollmer Oct 6, 2025
3dcdf09
better timing logging
ollmer Oct 8, 2025
ec567cc
fixes
ollmer Oct 8, 2025
72c04a6
fixes
ollmer Oct 8, 2025
68c9534
faster mcp server startup, significant speedup
ollmer Oct 9, 2025
872059d
fix training texts metadata
ollmer Oct 9, 2025
b702489
fixes
ollmer Oct 9, 2025
dea891a
fix
ollmer Oct 9, 2025
e672312
make exp dir with all my scripts
ollmer Oct 9, 2025
4ad27f9
move personal scripts out
ollmer Oct 10, 2025
41a080d
Remove test reward shaping
rafapi Oct 10, 2025
ab83a2e
Merge branch 'mcp_tir' into envs_speed_debug
rafapi Oct 10, 2025
bd23411
Fix imports
rafapi Oct 10, 2025
50f3ff9
Fix conflicts
rafapi Oct 10, 2025
6c9d5a5
Merge branch 'mcp_tir' into envs_speed_debug
rafapi Oct 10, 2025
d9a65b3
Fix
rafapi Oct 10, 2025
f0fb5db
Merge branch 'mcp_tir' into envs_speed_debug
rafapi Oct 10, 2025
10565e2
Merge branch 'debug_miniwob' into mcp_tir
rafapi Oct 10, 2025
76bbd91
Merge branch 'main' into mcp_tir
rafapi Oct 17, 2025
3 changes: 2 additions & 1 deletion .gitignore
@@ -120,6 +120,7 @@ celerybeat.pid
 
 # SageMath parsed files
 *.sage.py
+node_modules/
 
 # Environments
 .env
@@ -185,4 +186,4 @@ results
 results/
 data/
 cache/
-dump.rdb
+dump.rdb
13 changes: 9 additions & 4 deletions conf/base.yaml
@@ -5,6 +5,7 @@ defaults:
   - _self_
 
 seed: 42
+use_ray: false
 
 finetune:
   seed: ${..seed}
@@ -23,9 +24,9 @@ preprocess:
   input: actor
   output: training_data
   n_workers: 8
-  chunk_n_groups: 2
+  chunk_n_groups: 8
   # queue for loaded raw groups
-  raw_queue_size: 8
+  raw_queue_size: 128
   # queue for processed chunks of multiple groups
   input_queue_size: 32
   # queue for ready chunks for multiple groups
@@ -47,7 +48,7 @@ llm:
     temperature: 1.0
 test_llm:
   parameters:
-    max_tokens: 16000
+    max_tokens: 8192
     temperature: 1.0
     top_p: 0.95
     top_k: 50
@@ -67,6 +68,7 @@ vllm_config:
     tensor-parallel-size: 1
     pipeline-parallel-size: 1
     generation-config: vllm
+    max_model_len: 16000
 
 world:
   replicas: 1
@@ -75,10 +77,13 @@ world:
   preprocessor_fraction: 0
   finetune_fraction: 4
 
-  env_replicas: 2
+  # Number of environment servers per actor VLLM server
+  env_replicas_per_actor: 1
 
   actor_group_port: 9000
   environment_start_port: 7777
+  # Remote vs embedded environment execution strategy
+  environment_mode: remote
 # this will be autocreated based on the config
 jobs: []
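
The hunk above replaces a fixed `env_replicas` count with `env_replicas_per_actor`, so the number of environment servers now scales with the number of actor vLLM servers. The launcher code is not part of this diff, so the following is only a minimal sketch of how such a layout could be derived. The helper name `plan_env_ports` and the consecutive-port allocation are assumptions for illustration; only `environment_start_port: 7777` and the per-actor semantics come from the config.

```python
def plan_env_ports(
    n_actor_llms: int,
    env_replicas_per_actor: int,
    start_port: int = 7777,  # environment_start_port in conf/base.yaml
) -> list[int]:
    """Hypothetical helper: assign one consecutive port per environment server.

    Assumes the total number of environment servers is
    n_actor_llms * env_replicas_per_actor, as the config comment suggests.
    """
    if n_actor_llms < 0 or env_replicas_per_actor < 0:
        raise ValueError("counts must be non-negative")
    total = n_actor_llms * env_replicas_per_actor
    return [start_port + i for i in range(total)]


# Two actor LLM servers with the base default of one replica each.
print(plan_env_ports(n_actor_llms=2, env_replicas_per_actor=1))
```

With the `conf/mcp.yaml` override (`env_replicas_per_actor: 8`), the same sketch would yield eight environment servers per actor LLM server.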

2 changes: 1 addition & 1 deletion conf/finetune/base.yaml
@@ -36,7 +36,7 @@ learning_rate: 1e-6
 # How much to clip the gradient (no clipping if null)
 gradient_clipping_threshold: 0.3
 # Learning rate scheduler type (indexed by completed_steps).
-lr_scheduler_type: cosine # could be cosine, constant_with_warmup
+lr_scheduler_type: constant # could be cosine, constant_with_warmup
 # Number of warmup (completed) steps in the learning rate schedule.
 num_warmup_steps: 50
 # Number of gradient accumulation steps.
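
To make the effect of switching `lr_scheduler_type` from `cosine` to `constant` concrete, here is a small sketch that contrasts the two schedules. The scheduler implementation is not shown in this diff, so the cosine formula (linear warmup followed by a half-cosine decay, as in common trainer libraries) and the 1000-step horizon are assumptions. Whether a plain `constant` schedule honors `num_warmup_steps` also depends on the scheduler in use; this sketch assumes it does not.

```python
import math

BASE_LR = 1e-6       # learning_rate, from the hunk header context
WARMUP_STEPS = 50    # num_warmup_steps, from the hunk context
TOTAL_STEPS = 1000   # hypothetical training length, for illustration only


def constant_lr(step: int) -> float:
    """Constant schedule: the same learning rate at every step (no warmup assumed)."""
    return BASE_LR


def cosine_lr(step: int) -> float:
    """Linear warmup to BASE_LR, then half-cosine decay to zero at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 50, 525, 1000):
    print(step, constant_lr(step), cosine_lr(step))
```

Under these assumptions the constant schedule keeps the full learning rate late in training, while the cosine schedule has decayed to about half by the midpoint and to zero at the end.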
155 changes: 155 additions & 0 deletions conf/mcp.yaml
@@ -0,0 +1,155 @@
+defaults:
+  - base
+  - override finetune: grpo
+  - _self_
+
+use_ray: true
+
+llm:
+  use_cache: false
+  parameters:
+    max_tokens: 8192
+
+test_llm:
+  parameters:
+    max_tokens: 8192
+
+rewards:
+  correct_answer_not_finished: 0.0
+  buffer_tokens: 2000
+
+actor:
+  rollout_policy: pipelinerl.domains.mcp.generate_mcp_rollout_with_local_env
+  system_prompt: Please reason step by step, and put your final answer within \boxed{{}}.
+  rollout_workers: 64
+  llm_max_rollouts: 256
+  problem_queue_size: 256
+  task_template: |-
+    {task}
+  shared_memory_entry_size: 200000000
+
+preprocess:
+  shared_memory_entry_size: 2000000000
+
+finetune:
+  seq_length: 32000
+  seq_parallel: 8
+
+dataset_loader: pipelinerl.domains.math.load_datasets
+train_dataset_names:
+  - open_reasoner_zero_57k
+  - open_reasoner_zero_extended_72k
+test_dataset_names:
+  - aime_2025
+
+vllm_config:
+  use_v1: true
+  vllm_kwargs:
+    enable-auto-tool-choice: ""
+    tool-call-parser: rl_tool
+    tool-parser-plugin: ${hydra:runtime.cwd}/pipelinerl/rl_tool_parser_plugin.py
+    max-num-seqs: 256
+    max-num-batched-tokens: 32000
+    max_model_len: 32000
+    gpu-memory-utilization: 0.9
+
+environment:
+  _target_: tapeagents.mcp.MCPEnvironment
+  config_path: ${hydra:runtime.cwd}/conf/mcp/python.json
+  tools_whitelist:
+    - run_python_code
+  read_timeout_seconds: 600
+  use_cache: false
+
+
+world:
+  env_replicas_per_actor: 8
+  environment_mode: embedded
+
+agent_max_loops: 3
+agent:
+  _target_: tapeagents.agent.Agent
+  name : mcp_agent
+  max_iterations: 3
+  store_llm_calls: true
+  templates:
+    system_prompt: |
+      You are a math-focused AI Agent. Solve problems by combining clear symbolic reasoning
+      with short, deterministic Python code.
+      Keep your replies concise and direct. Prioritize clarity and avoid over-elaboration.
+      Always present the final answer in LaTeX \boxed{{}}.
+      Do not express emotions or opinions about user questions.
+
+      Workflow:
+      1. Draft a brief plan in plain text.
+      2. Execute one run_python_code call to compute or verify the result.
+      3. Finalize by calling MathAnswer with the LaTeX-formatted answer.
+
+      Python execution policy (run_python_code):
+      - Use Python strictly for pure computation to verify and validate the final answer.
+      - No network, file system, OS or environment access.
+      - Keep snippets minimal and self-contained; avoid large outputs and long-running loops; print only the final result.
+
+      Validation:
+      - Cross-check results (alternative derivation, invariants, higher precision) before finalizing.
+      - If execution fails, propose the minimal fix and retry.
+      Keep replies direct and avoid unnecessary text.
+    allowed_tools: |
+      You can call the following tools:
+      {tools_description}
+      - run_python_code: deterministic math code; print only the final value.
+      - MathAnswer: return the LaTeX \boxed{{}} answer when the solution is verified.
+      Always verify with run_python_code before invoking MathAnswer.
+    thought_format: |
+      Important! Respond with the plain text, do not include any JSON or code.
+      Do not output anything besides what I asked in this message.
+    allowed_steps: |
+      Workflow summary:
+      - Plan briefly in plain text.
+      - Call run_python_code exactly once per loop to compute/verify.
+      - Finish with a single MathAnswer tool call carrying the \boxed{{}} result.
+    format: |
+      For finalization, reply with a single short sentence that ends in the \boxed{{}} answer,
+      immediately followed by the MathAnswer function call containing the same \boxed{{}} value.
+      Never emit unrelated JSON wrappers or duplicate the final thought.
+
+
+  nodes:
+    - _target_: tapeagents.nodes.StandardNode
+      name: plan
+      system_prompt: ${agent.templates.system_prompt}
+      guidance: |
+        Produce a concise math plan (formulas/checks). You will ALWAYS verify by executing Python code.
+        ${agent.templates.thought_format}
+      steps_prompt: ${agent.templates.allowed_tools}
+      trim_obs_except_last_n: 2
+
+    - _target_: tapeagents.nodes.StandardNode
+      name: code
+      system_prompt: ${agent.templates.system_prompt}
+      guidance: |
+        ALWAYS call run_python_code once to compute/verify the result.
+        Use exact, deterministic code; print only the final scalar or tuple.
+        If code fails, fix minimally and call run_python_code again after reviewing the error.
+      use_known_actions: true
+      use_function_calls: true
+      trim_obs_except_last_n: 2
+
+    - _target_: tapeagents.nodes.StandardNode
+      name: finalize
+      system_prompt: ${agent.templates.system_prompt}
+      guidance: |
+        Read the last Python stdout value. First, state the answer in one short sentence that ends with LaTeX \boxed{{}}.
+        Immediately after that sentence, call the MathAnswer tool exactly once with:
+        name: MathAnswer
+        arguments: {"answer": "<final answer in LaTeX \\boxed{}>"}
+        Do not add any extra text around the tool call. Once the sentence is emitted, return only the MathAnswer function call.
+      steps:
+        - pipelinerl.domains.mcp.steps.MathAnswer
+      use_known_actions: true
+      use_function_calls: true
+      trim_obs_except_last_n: 2
+      next_node: code
+
+model_path: Qwen/Qwen3-8B
+# model_path: /mnt/llmd/base_models/ServiceNow-AI/7_9_25_14b_text_reasoning_sft
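
Several length limits in `conf/mcp.yaml` have to agree with each other: `finetune.seq_length` (32000), the vLLM `max_model_len` (32000), and the generation budget `llm.parameters.max_tokens` (8192). The commit list mentions an assertion that the finetuning length is at least the vLLM maximum length; the code for it is not part of the diff shown here, so the following is only a sketch of such a check over a plain dictionary. The function name `check_lengths` is hypothetical, and placing `max_model_len` under `vllm_config.vllm_kwargs` is an assumption, since the page scrape lost the YAML indentation.

```python
# Values copied from conf/mcp.yaml; the nesting is an assumption (see lead-in).
CFG = {
    "llm": {"parameters": {"max_tokens": 8192}},
    "finetune": {"seq_length": 32000},
    "vllm_config": {"vllm_kwargs": {"max_model_len": 32000}},
}


def check_lengths(cfg: dict) -> None:
    """Hypothetical sanity check: trainer and sampler length limits must be compatible."""
    seq_length = cfg["finetune"]["seq_length"]
    max_model_len = cfg["vllm_config"]["vllm_kwargs"]["max_model_len"]
    max_tokens = cfg["llm"]["parameters"]["max_tokens"]

    # The trainer must be able to fit the longest sequence the sampler can emit.
    if seq_length < max_model_len:
        raise ValueError(
            f"finetune.seq_length ({seq_length}) < vLLM max_model_len ({max_model_len})"
        )
    # A single completion cannot exceed the model's context window.
    if max_tokens > max_model_len:
        raise ValueError(
            f"llm max_tokens ({max_tokens}) > vLLM max_model_len ({max_model_len})"
        )


check_lengths(CFG)  # passes silently for the values in this config
```

Running a check like this at startup would surface a mismatch before any rollouts are generated, rather than as truncated training sequences later.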
11 changes: 11 additions & 0 deletions conf/mcp/python.json
@@ -0,0 +1,11 @@
+{
+  "mcpServers": {
+    "python_exec": {
+      "command": "bash",
+      "args": [
+        "-c",
+        "deno run -N -R=node_modules -W=node_modules --node-modules-dir=auto jsr:@pydantic/mcp-run-python stdio"
+      ]
+    }
+  }
+}
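
The JSON above declares one MCP server, `python_exec`, started over stdio through `bash -c`. In Deno's permission flags, `-N` allows network access, while `-R=node_modules` and `-W=node_modules` restrict file reads and writes to the `node_modules` directory. As a rough sketch of how a client might turn such an entry into a process command line, the snippet below parses the config and assembles the argument vector. The helper `server_argv` is hypothetical; `tapeagents.mcp.MCPEnvironment` has its own loading logic, which this diff does not show.

```python
import json

# Same content as conf/mcp/python.json, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
  "mcpServers": {
    "python_exec": {
      "command": "bash",
      "args": [
        "-c",
        "deno run -N -R=node_modules -W=node_modules --node-modules-dir=auto jsr:@pydantic/mcp-run-python stdio"
      ]
    }
  }
}
"""


def server_argv(config: dict, name: str) -> list[str]:
    """Hypothetical helper: build the argv for one configured MCP server."""
    server = config["mcpServers"][name]
    return [server["command"], *server.get("args", [])]


argv = server_argv(json.loads(CONFIG_TEXT), "python_exec")
print(argv[0], argv[1])

# Launching is left as a comment because it requires Deno to be installed:
#   subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
```

Because the whole Deno command is passed as one string to `bash -c`, shell features such as environment variable expansion remain available in that string.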