📘 Math LLM Training Repository

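⚠️ This repository is still being organized and developed, and its contents will be updated and improved over time. Parts of the file layout and documentation are still untidy; they will be cleaned up and made easier to follow.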

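This repository is a project, undertaken as part of a school course, to build a lightweight math LLM (large language model). It starts with basic operations such as the four arithmetic operations and aims to raise the model's mathematical ability step by step.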

🔧 Overview of Contents

The repository contains the following kinds of files:

📂 Training datasets (JSONL format). Example: data_set/math_model_data_expr_answer.jsonl
🛠 Scripts that convert datasets to .txt for tokenizer training, and their outputs. Example: jsonl_to_txt.py, math_tokenizer_corpus_expr_answer.txt, data_set/math_tokenizer_train_data3.txt
🧠 Tokenizer training scripts using SentencePiece. Example: tokenizer_train.py
Tokenizer files. Example: math_tokenizer_v3.model
🐍 LLM training scripts using PyTorch. Example: train_math_llm.py
🔢 Dataset generation and formatting scripts using SymPy. Example: scripts/generate_data_sympy1.py, data_formatting1.py, combine_data1.py
Dependency management. Example: requirements.txt

All training datasets are generated automatically with SymPy. Some of the trained models are uploaded to the Hugging Face Hub.

🔋 Python Version and Environment

Because this project uses sentencepiece, it is developed and tested with Python 3.12.7.
sentencepiece does not support the Python 3.13 series, so Python 3.12.x is recommended.
The other libraries may also work with different Python versions.

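Setting Up a Python Virtual Environment and Installing Dependencies

Running this project inside a virtual environment is recommended. It can be set up with the following steps: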

🔧 Create and activate a virtual environment (Windows)

```
python -m venv .venv
.venv\Scripts\activate
```

🔧 Create and activate a virtual environment (macOS / Linux)

```
python3 -m venv .venv
source .venv/bin/activate
```

📦 Install the dependencies

```
pip install -r requirements.txt
```

Python 3.12.x is recommended, mainly because sentencepiece does not support Python 3.13.
The commands above name the virtual environment .venv, but any name works.

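🔗 Links

📄 Data generation script (Google Colab)
https://colab.research.google.com/drive/1_QzZUL_T2HfS5_iuz7xG23h5hf7p2fC7?usp=sharing
Note: the generation scripts are also included in this repository.
📦 Model files (Hugging Face)
https://huggingface.co/nobuchiyo345/my-math-llm-large-no-latex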

🌐 Web UI (Demo)

The UI is built with Flask and is started by running app2.py or one of the similar app files.
The HTML / CSS / JavaScript files are in static/ and templates/.
In app2.py, the paths to the model and tokenizer files are hardcoded as follows:
```
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
MODEL_DIR_PATH = os.path.join(PROJECT_ROOT, "output_model", "lightweight_math_llm")
TOKENIZER_FILE_PATH = os.path.join(PROJECT_ROOT, "output_model", "math_tokenizer_v3.model")
```
If your model or tokenizer file names or folder layout differ, edit these paths by hand.
How to start (for example, in the VSCode terminal):
```
python app2.py
```
Once it is running, open http://127.0.0.1:5000/ in your browser.

Note: the files and code are not yet fully organized, and many parts are hard to follow. For unclear points or usage questions, asking an AI assistant may get you an explanation.

📙 Lightweight Math LLM Project

This repository is part of a school project focused on building a lightweight mathematical language model (LLM). The training begins with basic arithmetic operations and gradually expands to more advanced topics.

🧾 Repository Structure

This repository includes the following components:

📂 Training datasets in JSONL format. Example: data_set/math_model_data_expr_answer.jsonl
🔄 Scripts for converting datasets into .txt format for tokenizer training, and their outputs. Example: jsonl_to_txt.py, math_tokenizer_corpus_expr_answer.txt, data_set/math_tokenizer_train_data3.txt
🧠 Tokenizer training scripts using SentencePiece. Example: tokenizer_train.py
Tokenizer files. Example: math_tokenizer_v3.model
🐍 LLM training scripts using PyTorch. Example: train_math_llm.py
🔢 Dataset generation and formatting scripts using SymPy. Example: scripts/generate_data_sympy1.py, data_formatting1.py, combine_data1.py
Dependency management. Example: requirements.txt

All training datasets are automatically generated using SymPy. Some trained models are uploaded to Hugging Face Hub.
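
As a rough illustration of the JSONL-to-.txt conversion step listed above, here is a minimal sketch. It is not the actual jsonl_to_txt.py: the field names `expr` and `answer` are guessed from the dataset file name, and the `expr = answer` line format and the function names are assumptions made for this example.

```python
import json
from pathlib import Path


def jsonl_to_corpus_lines(jsonl_text, expr_key="expr", answer_key="answer"):
    """Turn JSONL records into plain-text lines, one 'expr = answer' per record.

    The key names and the output format are assumptions for illustration.
    """
    lines = []
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        record = json.loads(raw)
        lines.append(f"{record[expr_key]} = {record[answer_key]}")
    return lines


def convert_file(src_path, dst_path):
    """Read a JSONL file and write the corpus lines to a .txt file."""
    text = Path(src_path).read_text(encoding="utf-8")
    lines = jsonl_to_corpus_lines(text)
    Path(dst_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Small in-memory sample so the sketch runs without any repository files.
sample = '{"expr": "1 + 2", "answer": "3"}\n{"expr": "6 / 4", "answer": "3/2"}\n'
print(jsonl_to_corpus_lines(sample))
```

With real data, `convert_file` would be called with a JSONL path such as data_set/math_model_data_expr_answer.jsonl and an output .txt path of your choice.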

🔋 Python Version and Environment

This project is developed and tested with Python 3.12.7 because it depends on sentencepiece.
sentencepiece does not support the Python 3.13 series, so Python 3.12.x is recommended.
Other libraries may work with different Python versions.
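
If you want a script to warn early when it runs on a different interpreter, a small check like the following could be placed at the top of it. This is only a sketch; the helper name `is_recommended_python` is hypothetical and is not part of this repository.

```python
import sys


def is_recommended_python(version_info=None):
    """Return True when the interpreter is a Python 3.12 release."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) == (3, 12)


if not is_recommended_python():
    # Only a warning: the other libraries may still work on this version.
    print(
        f"Warning: running Python {sys.version_info.major}.{sys.version_info.minor}; "
        "this project recommends 3.12.x."
    )
```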

Setting up Python Virtual Environment and Installing Dependencies

It is recommended to run this project inside a Python virtual environment. Follow these steps:

🔧 Create and activate virtual environment (Windows)

```
python -m venv .venv
.venv\Scripts\activate
```

🔧 Create and activate virtual environment (macOS / Linux)

```
python3 -m venv .venv
source .venv/bin/activate
```

📦 Install required dependencies

```
pip install -r requirements.txt
```

Recommended Python version: 3.12.x (sentencepiece does not yet support 3.13).
You can use any name for the virtual environment; .venv is just a convention.

🔗 Links

📄 Google Colab for dataset generation
https://colab.research.google.com/drive/1_QzZUL_T2HfS5_iuz7xG23h5hf7p2fC7?usp=sharing
📦 Model files (Hugging Face)
https://huggingface.co/nobuchiyo345/my-math-llm-large-no-latex

🌐 Web UI (Demo)

The UI is built with Flask and can be started by running app2.py or similar files.
HTML / CSS / JavaScript files are included in the static/ and templates/ folders.

In app2.py, the model and tokenizer file paths are hardcoded as follows:

```
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
MODEL_DIR_PATH = os.path.join(PROJECT_ROOT, "output_model", "lightweight_math_llm")
TOKENIZER_FILE_PATH = os.path.join(PROJECT_ROOT, "output_model", "math_tokenizer_v3.model")
```

Therefore, if your model or tokenizer filenames or folder structure differ, please manually edit these paths.
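
One possible way to avoid editing the file each time is to let environment variables override the defaults. The sketch below is not what app2.py currently does: the variable names `MATH_LLM_MODEL_DIR` and `MATH_LLM_TOKENIZER_FILE` and the helper `resolve_paths` are hypothetical, and only the default folder and file names come from the excerpt above.

```python
import os


def resolve_paths(project_root, env=None):
    """Return (model_dir, tokenizer_file), preferring environment overrides.

    The environment variable names are hypothetical, chosen for this example.
    """
    env = os.environ if env is None else env
    model_dir = env.get("MATH_LLM_MODEL_DIR") or os.path.join(
        project_root, "output_model", "lightweight_math_llm"
    )
    tokenizer_file = env.get("MATH_LLM_TOKENIZER_FILE") or os.path.join(
        project_root, "output_model", "math_tokenizer_v3.model"
    )
    return model_dir, tokenizer_file


# With no overrides set, the defaults match the hardcoded layout.
print(resolve_paths(os.getcwd(), env={}))
```

Passing `env` explicitly keeps the function easy to test; in the app itself it would be called without that argument so that `os.environ` is used.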

How to start (in VSCode terminal or command line):
```
python app2.py
```
Then open your browser and go to http://127.0.0.1:5000/.

Note: The files and code are not yet fully organized, and some parts may be hard to follow. If you have questions or need help using the code, an AI assistant may be able to explain it.

🌟 Demo Screenshot

Below is an example of the screen when the demo is actually running.
