-
Notifications
You must be signed in to change notification settings - Fork 903
WIP: Jupyter Agent 2 #3037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
WIP: Jupyter Agent 2 #3037
Conversation
title: "😎 Creating a Data Science Agent from Scratch" | ||
thumbnail: /blog/assets/jupyter-agent-2/thumbnail.png | ||
authors: | ||
- user: baptistecolle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know the order you guys prefer
|
||
--- | ||
|
||
## ⚙️ Processing Pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here i think adding a graph could be cool
|
||
*Challenge:* Many datasets were unavailable. | ||
*Trick:* Since LLMs are strong at code and have a decent world model, we prompted them to **act as a code interpreter** when the dataset was missing. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also clarify that not just many datasets were not available, but also incorrectly mapped in metadata or were not specific in the metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can expand that section later if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, feel free to commit to the branch! I wanted this to be like a common pull request, so feel free to modify and reformat a lot. Worst case, ping me on Slack for some questions if you have. So I will let you handle this section as you know it best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick pass. Cool work. I'd suggest to maybe contextualize more how the demo works, and how it relates with the training process we describe.
jupyter-agent-2.md
Outdated
We built a pipeline to automatically fetch these datasets, ensuring the code inside notebooks could actually run. The goal was to later train the model on actual code execution. | ||
|
||
### 3. Edu scoring | ||
We scored notebooks based on educational quality. We saw that using the whole notebook was not optimal, as many contained trivial or broken code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did we score them? Did we train/use a separate model?
This is a follow-up to our earlier work on [jupyter-agent (v1)](https://huggingface.co/spaces/lvwerra/jupyter-agent). | ||
|
||
The **Jupyter Agent** is a data science agent that can execute code directly inside a Jupyter notebook. Think of it like *Cursor*, but living natively inside your data science workflow. | ||
For this demo we use **QwenCoder**, currently one of the strongest coding models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship with the Qwen 4B model that is discussed in the rest of the post?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot! 💗
IMHO one big thing missing is instructions on how to use it locally, we only see the link to the Space.
|
||
# Creating a Data Science Agent from Scratch | ||
|
||
Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check out our new demo here: [huggingface.co/spaces/lvwerra/jupyter-agent-2](https://huggingface.co/spaces/lvwerra/jupyter-agent-2). | |
Check out our new demo [here](https://huggingface.co/spaces/lvwerra/jupyter-agent-2). |
This is a follow-up to our earlier work on [jupyter-agent (v1)](https://huggingface.co/spaces/lvwerra/jupyter-agent). | ||
|
||
The **Jupyter Agent** is a data science agent that can execute code directly inside a Jupyter notebook. Think of it like *Cursor*, but living natively inside your data science workflow. | ||
For this demo we use **QwenCoder**, currently one of the strongest coding models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be nice to link to the ckpt
|
||
We set out to **train a small data agent model** that could perform better on DABStep. | ||
|
||
Our first choice was **Qwen-4B**: extremely small (fast to iterate with, easy to run), yet strong enough to act in agentic scenarios. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be nice to link to checkpoint, which Qwen is it? Qwen3?
| For the year 2023, focusing on the merchant *Crossfit Hanna*, if we incentivize users to switch to a different Authorization Characteristics Indicator, which option would be the most cost-effective? | E:346.49 | | ||
|
||
This benchmark remains challenging for today’s LLMs — especially for smaller models. | ||
You can explore the live leaderboard here: [huggingface.co/spaces/adyen/DABstep](https://huggingface.co/spaces/adyen/DABstep). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can explore the live leaderboard here: [huggingface.co/spaces/adyen/DABstep](https://huggingface.co/spaces/adyen/DABstep). | |
You can explore the live leaderboard [here] (https://huggingface.co/spaces/adyen/DABstep). |
you can also embed this to the blog, and it would be nice to link to that blog too either here or at the end of the blog
- Rich metadata for each notebook (authors, datasets used, etc.). | ||
|
||
|
||
## ⚙️ Processing Pipeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think putting multiple subheaders with little text breaks readability a bit
Some training steps were particularly interesting: | ||
|
||
- For trace generation, we used LLMs to generate QA pairs, which gave us a **verifiable environment**. | ||
- Finally, we fine-tuned **Qwen-4B** with [TRL](https://huggingface.co/docs/trl). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to link to the dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also where can people find the fine-tuned checkpoint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also where can people find the fine-tuned checkpoint?
It's still WIP 😄 I will upload everything later today with that section in the blog updated with more information+links
- *Distillation:* Investigate knowledge distillation, which has shown strong results for improving small models. | ||
- *Reinforcement Learning (RL):* Build an RL environment, which has been shown to achieve state-of-the-art performance on agentic tasks. Since our QA setup already provides a verifiable environment, we could leverage it directly for RL training. | ||
|
||
Maybe this will lead to… **Jupyter-Agent 3.** 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to end with a call for action here on people trying it out
This is a WIP blog post for the Jupyter Agent 2 project.
There are still a few graphs to be added, and maybe release some artifacts for it
@lvwerra @ayukh let me know what you think?
I modified the name from data-agent to jupter-agent-2, as I thought that could be more impactful