This repository contains the implementation and resources for the Tiny Language Models Framework project. The project's aim is to develop small-scale language models that facilitate detailed research into various aspects of transformer-based language models, with the goal of unveiling properties that may also arise in larger language models (LLMs).
- The project is structured into research projects (RPs). Each RP is self-contained in its respective top-level folder `research_project_X`.
- Additionally, a top-level `datasets` folder is shared by the different research projects.
- This release delivers research projects `research_project_1` to `research_project_4`.
- In all of these RPs, we study small transformer models trained to perform a custom task named TinyPy-CodeTracing: the language model takes a Python snippet as input and produces its execution trace by duplicating the code snippet at each execution step and annotating the relevant execution information, as represented by the following figure (a hypothetical example is also sketched after this list):
- The input Python snippets and their corresponding execution traces are synthetically generated using an original tool named TinyPy-CodeTracing Generator, illustrated by the figure below:
- More details can be found in the dedicated manuscript of this release.
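To make the TinyPy-CodeTracing task concrete, the block below sketches a hypothetical input snippet together with a plausible traced output. The exact trace and annotation syntax is defined by the generator and described in the manuscript; the `->`-style annotations shown here are an assumption used purely for illustration.

```python
# Hypothetical TinyPy-CodeTracing example. The real trace format is produced
# by 2_tinypy_code_tracer.py and described in the manuscript; the annotation
# style below is an illustrative assumption, not the actual output.

# --- model input: a tiny Python snippet ---
a = 2
b = a + 3
print(b)

# --- model output: the snippet duplicated step by step, with the relevant
# --- execution information annotated at each step ---
# a = 2         -> a == 2
# b = a + 3     -> b == 5
# print(b)      -> output: 5
```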
- The top-level `tinypy_code_tracing_demo` folder contains a demonstration of the base research pipeline used during this release:
  - `1_data_generation/`: Contains the scripts used for generating data and preparing it for training:
    - `1_tinypy_generator_2.0.py`: First stage of the TinyPy-CodeTracing Generator. Used to synthesize Python snippets with user-defined properties.
    - `2_tinypy_code_tracer.py`: Second stage of the TinyPy-CodeTracing Generator. Used to create the execution trace of a Python snippet.
    - `3_determinism_filtering.py`: Third stage of the TinyPy-CodeTracing Generator. Used to filter out code snippets that do not meet a certain condition that can impact determinism during model inference (see the manuscript for details).
    - `4_data_preparation.py`: Script to split the data into train-test-validation sets and tokenize the data (a minimal illustrative sketch follows this list).
    - `tinypy_code_tracer_tokenizer.py`: A custom tokenizer for our data, used by `4_data_preparation.py`.
  - `2_model_training/train.py`: Our PyTorch training script with support for multi-GPU data parallelism.
  - `3_inference/`:
    - `eval.py`: Script for evaluating a trained model with support for multi-inference parallelism.
    - `tinypy_code_tracer_tokenizer.py`: The same custom tokenizer used for data preparation; required by `eval.py`.
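As a rough, self-contained illustration of the final data-preparation stage, the sketch below splits traced examples into train/validation/test sets and tokenizes them at the character level. It is not the code from `4_data_preparation.py` or `tinypy_code_tracer_tokenizer.py`; the class name, split ratios, and character-level vocabulary are assumptions made only for this example.

```python
# Minimal sketch of a split-and-tokenize step (illustrative only; the actual
# logic lives in 4_data_preparation.py and tinypy_code_tracer_tokenizer.py,
# and may differ substantially).
import random


class CharTokenizer:
    """A simple character-level tokenizer (assumed here for illustration)."""

    def __init__(self, corpus: str):
        vocab = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids) -> str:
        return "".join(self.itos[i] for i in ids)


def split_examples(examples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split traced examples into train/val/test (ratios assumed)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


if __name__ == "__main__":
    # Each example is a snippet concatenated with its (hypothetical) trace.
    examples = [
        "a = 1\nprint(a)\n# a = 1    -> a == 1\n# print(a) -> output: 1\n",
        "b = 2 + 3\n# b = 2 + 3 -> b == 5\n",
    ]
    tokenizer = CharTokenizer("".join(examples))
    train_set, val_set, test_set = split_examples(examples)
    print([tokenizer.encode(x) for x in train_set])
```

In the actual demo pipeline, the prepared splits are then consumed by `2_model_training/train.py` and the resulting model is evaluated with `3_inference/eval.py`.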
Contacts:
- Younes Boukacem: [email protected]
- Hodhaifa Benouaklil: [email protected]
This project is licensed under the MIT License.
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.


