WIP: Jupyter Agent 2 #3037

baptistecolle · 2025-08-20T11:12:14Z

This is a WIP blog post for the Jupyter Agent 2 project.

There are still a few graphs to be added, and maybe release some artifacts for it

@lvwerra @ayukh let me know what you think?

I modified the name from data-agent to jupter-agent-2, as I thought that could be more impactful

baptistecolle · 2025-08-20T11:13:57Z

jupyter-agent-2.md

+title: "😎 Creating a Data Science Agent from Scratch"
+thumbnail: /blog/assets/jupyter-agent-2/thumbnail.png
+authors:
+- user: baptistecolle


Let me know the order you guys prefer

baptistecolle · 2025-08-20T11:14:17Z

jupyter-agent-2.md

+
+---
+
+## ⚙️ Processing Pipeline


here i think adding a graph could be cool

ayukh · 2025-08-20T12:36:16Z

jupyter-agent-2.md

+
+*Challenge:* Many datasets were unavailable.  
+*Trick:* Since LLMs are strong at code and have a decent world model, we prompted them to **act as a code interpreter** when the dataset was missing.  
+


I would also clarify that not just many datasets were not available, but also incorrectly mapped in metadata or were not specific in the metadata

I can expand that section later if needed

Yes, feel free to commit to the branch! I wanted this to be like a common pull request, so feel free to modify and reformat a lot. Worst case, ping me on Slack for some questions if you have. So I will let you handle this section as you know it best

pcuenca

Did a quick pass. Cool work. I'd suggest to maybe contextualize more how the demo works, and how it relates with the training process we describe.

_blog.yml

jupyter-agent-2.md

pcuenca · 2025-08-20T15:59:53Z

jupyter-agent-2.md

+We built a pipeline to automatically fetch these datasets, ensuring the code inside notebooks could actually run. The goal was to later train the model on actual code execution.
+
+### 3. Edu scoring
+We scored notebooks based on educational quality. We saw that using the whole notebook was not optimal, as many contained trivial or broken code.  


How did we score them? Did we train/use a separate model?

jupyter-agent-2.md

pcuenca · 2025-08-20T16:11:19Z

jupyter-agent-2.md

+This is a follow-up to our earlier work on [jupyter-agent (v1)](https://huggingface.co/spaces/lvwerra/jupyter-agent).
+
+The **Jupyter Agent** is a data science agent that can execute code directly inside a Jupyter notebook. Think of it like *Cursor*, but living natively inside your data science workflow.  
+For this demo we use **QwenCoder**, currently one of the strongest coding models.


What's the relationship with the Qwen 4B model that is discussed in the rest of the post?

merveenoyan

thanks a lot! 💗
IMHO one big thing missing is instructions on how to use it locally, we only see the link to the Space.

merveenoyan · 2025-08-28T08:46:23Z

jupyter-agent-2.md

+
+# Creating a Data Science Agent from Scratch
+
+Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).  


Suggested change

Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).

Check out our new demo [here](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).

merveenoyan · 2025-08-28T08:46:47Z

jupyter-agent-2.md

+This is a follow-up to our earlier work on [jupyter-agent (v1)](https://huggingface.co/spaces/lvwerra/jupyter-agent).
+
+The **Jupyter Agent** is a data science agent that can execute code directly inside a Jupyter notebook. Think of it like *Cursor*, but living natively inside your data science workflow.  
+For this demo we use **QwenCoder**, currently one of the strongest coding models.


might be nice to link to the ckpt

merveenoyan · 2025-08-28T08:49:48Z

jupyter-agent-2.md

+
+We set out to **train a small data agent model** that could perform better on DABStep.  
+
+Our first choice was **Qwen-4B**: extremely small (fast to iterate with, easy to run), yet strong enough to act in agentic scenarios.  


might be nice to link to checkpoint, which Qwen is it? Qwen3?

merveenoyan · 2025-08-28T08:51:09Z

jupyter-agent-2.md

+| For the year 2023, focusing on the merchant *Crossfit Hanna*, if we incentivize users to switch to a different Authorization Characteristics Indicator, which option would be the most cost-effective? | E:346.49 |
+
+This benchmark remains challenging for today’s LLMs — especially for smaller models.  
+You can explore the live leaderboard here: [huggingface.co/spaces/adyen/DABstep](https://huggingface.co/spaces/adyen/DABstep).


Suggested change

You can explore the live leaderboard here: [huggingface.co/spaces/adyen/DABstep](https://huggingface.co/spaces/adyen/DABstep).

You can explore the live leaderboard [here] (https://huggingface.co/spaces/adyen/DABstep).

you can also embed this to the blog, and it would be nice to link to that blog too either here or at the end of the blog

merveenoyan · 2025-08-28T08:52:22Z

jupyter-agent-2.md

+- Rich metadata for each notebook (authors, datasets used, etc.).  
+
+
+## ⚙️ Processing Pipeline


I think putting multiple subheaders with little text breaks readability a bit

merveenoyan · 2025-08-28T08:53:29Z

jupyter-agent-2.md

+Some training steps were particularly interesting:  
+
+- For trace generation, we used LLMs to generate QA pairs, which gave us a **verifiable environment**.  
+- Finally, we fine-tuned **Qwen-4B** with [TRL](https://huggingface.co/docs/trl).  


would be nice to link to the dataset

also where can people find the fine-tuned checkpoint?

also where can people find the fine-tuned checkpoint?

It's still WIP 😄 I will upload everything later today with that section in the blog updated with more information+links

merveenoyan · 2025-08-28T08:54:17Z

jupyter-agent-2.md

+- *Distillation:* Investigate knowledge distillation, which has shown strong results for improving small models.  
+- *Reinforcement Learning (RL):* Build an RL environment, which has been shown to achieve state-of-the-art performance on agentic tasks. Since our QA setup already provides a verifiable environment, we could leverage it directly for RL training.
+
+Maybe this will lead to… **Jupyter-Agent 3.** 😉  


would be nice to end with a call for action here on people trying it out

baptistecolle added 2 commits August 20, 2025 13:07

v0 jupyer-agent-2 blog

038f063

v0 jupyer-agent-2 blog

1925623

baptistecolle commented Aug 20, 2025

View reviewed changes

jupyter-agent-2.md

---

## ⚙️ Processing Pipeline

Copy link

Contributor Author

baptistecolle Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here i think adding a graph could be cool

ayukh reviewed Aug 20, 2025

View reviewed changes

pcuenca reviewed Aug 20, 2025

View reviewed changes

baptistecolle added 2 commits August 21, 2025 14:14

made changes for review

8b6a72f

made changes for review

fb22209

merveenoyan reviewed Aug 28, 2025

View reviewed changes

baptistecolle marked this pull request as draft August 29, 2025 05:26


		Challenge: Many datasets were unavailable.
		Trick: Since LLMs are strong at code and have a decent world model, we prompted them to act as a code interpreter when the dataset was missing.


		# Creating a Data Science Agent from Scratch

		Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).

	Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).
	Check out our new demo [here](https://huggingface.co/spaces/lvwerra/jupyter-agent-2).


		We set out to train a small data agent model that could perform better on DABStep.

		Our first choice was Qwen-4B: extremely small (fast to iterate with, easy to run), yet strong enough to act in agentic scenarios.

	You can explore the live leaderboard here: [huggingface.co/spaces/adyen/DABstep](https://huggingface.co/spaces/adyen/DABstep).
	You can explore the live leaderboard [here] (https://huggingface.co/spaces/adyen/DABstep).

		- Rich metadata for each notebook (authors, datasets used, etc.).


		## ⚙️ Processing Pipeline

WIP: Jupyter Agent 2 #3037

Are you sure you want to change the base?

WIP: Jupyter Agent 2 #3037

Uh oh!

Conversation

baptistecolle commented Aug 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pcuenca left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merveenoyan left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!