- 📌 CAG vs RAG: Centralized Repository for NACE Revision
This repository is dedicated to the revision of the Nomenclature statistique des Activités économiques dans la Communauté Européenne (NACE).
It provides tools for automated classification and evaluation of business activity codes using Large Language Models (LLMs) and vector-based retrieval systems.
Ensure you have Python 3.12+ and uv or pip installed, then install the required dependencies:
uv sync
or
uv pip install -r pyproject.toml
Set up linting and formatting checks using pre-commit:
uv run pre-commit autoupdate
uv run pre-commit install
Before running the script, download the model into the local cache with the Hugging Face CLI, which is faster than letting vLLM download it.
export MODEL_NAME=Qwen/Qwen2.5-0.5B
uv run huggingface-cli download $MODEL_NAME
To create a searchable database of NACE 2025 codes:
uv run src/build_vector_db.py
For unambiguous classification:
uv run src/encode_unambiguous.py
For ambiguous classification using an LLM:
uv run src/encode_ambiguous.py --experiment_name NACE2025_DATASET --llm_name Qwen/Qwen3-0.6B
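The split between unambiguous and ambiguous classification can be pictured with a small sketch. It assumes, without confirmation from the scripts themselves, that each old code maps to a list of candidate NACE 2025 codes, and that a case is unambiguous when exactly one candidate exists. The table, the codes, and the function name `split_by_ambiguity` are hypothetical placeholders.

```python
# Hypothetical correspondence table: old code -> candidate NACE 2025 codes.
# The codes are placeholders, not real NACE entries.
CORRESPONDENCE = {
    "OLD_A": ["NEW_A"],
    "OLD_B": ["NEW_B1", "NEW_B2"],
    "OLD_C": ["NEW_C"],
}


def split_by_ambiguity(old_codes, correspondence):
    """Return (unambiguous, ambiguous) mappings for the given old codes.

    unambiguous maps an old code to its single candidate;
    ambiguous maps an old code to the candidates a model would choose from.
    """
    unambiguous, ambiguous = {}, {}
    for code in old_codes:
        candidates = correspondence.get(code, [])
        if len(candidates) == 1:
            unambiguous[code] = candidates[0]
        else:
            # Zero or several candidates: something else has to decide.
            ambiguous[code] = candidates
    return unambiguous, ambiguous


unambiguous, ambiguous = split_by_ambiguity(
    ["OLD_A", "OLD_B", "OLD_C"], CORRESPONDENCE
)
print(unambiguous)
print(ambiguous)
```

Under this assumption, only the entries in `ambiguous` would need an LLM call, which keeps the expensive step limited to cases that a lookup cannot resolve.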
⚠️ TO BE UPDATED
Compare different classification models:
uv run src/evaluate_strategies.py
Once all unique ambiguous cases have been recoded using the best strategy, you can rebuild the entire dataset with NACE 2025 labels:
uv run src/build_nace2025_sirene4.py
This repository leverages Large Language Models (LLMs) to assist in classifying business activities. Any open-source model available on Hugging Face and compatible with vLLM can be used.
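As a rough illustration of the dataset rebuild step mentioned earlier, the sketch below propagates labels decided once per unique case to every row that shares that case. It assumes a unique case is identified by the pair (old code, activity text); the field names, the sample rows, and `rebuild_dataset` are hypothetical and do not reflect the actual schema used by the script.

```python
def rebuild_dataset(rows, unique_labels):
    """Attach a NACE 2025 label to every row, using labels decided per unique case."""
    rebuilt = []
    for row in rows:
        key = (row["old_code"], row["activity"])
        # None marks rows whose unique case has not been recoded yet.
        rebuilt.append({**row, "nace2025": unique_labels.get(key)})
    return rebuilt


# Placeholder rows: two share the same unique case, one does not.
rows = [
    {"id": 1, "old_code": "OLD_B", "activity": "placeholder activity text"},
    {"id": 2, "old_code": "OLD_B", "activity": "placeholder activity text"},
    {"id": 3, "old_code": "OLD_B", "activity": "another placeholder text"},
]
unique_labels = {("OLD_B", "placeholder activity text"): "NEW_B1"}

rebuilt = rebuild_dataset(rows, unique_labels)
for row in rebuilt:
    print(row["id"], row["nace2025"])
```

Labelling unique cases once and joining them back means the model is called once per distinct case rather than once per row.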
This project supports automated workflows via Argo Workflows. To trigger a workflow, execute:
argo submit argo-workflows/relabel-naf08-to-naf25.yaml
Or use the Argo Workflow UI.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Version the prompts saved in S3 more rigorously. Ideally, the version would combine the database collection and the version of the Langfuse RAG prompt.
- Improve the embeddings (should the explanatory notes be kept in full for the similarity search?).
- Implement the reranker (probably the Qwen one).
- Include business rules in the prompts.
- Include code-specific rules in the CAG case. (For LMNP, explain what distinguishes the two codes; see @Nathan's file.) Should auxiliary variables therefore be included to help decide between codes in some cases?
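
The embedding and reranker items in the list above both concern the similarity search step. As a minimal sketch of that step, the example below ranks candidate codes by cosine similarity between a query vector and one stored vector per code. The three-dimensional vectors and the codes are toy placeholders; a real setup would use vectors produced by an embedding model and stored in the vector database, and nothing here reflects the repository's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (code, score) pairs most similar to the query, best first."""
    scored = [(code, cosine_similarity(query_vec, vec)) for code, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: placeholder code -> placeholder embedding.
INDEX = {
    "NEW_A": [1.0, 0.0, 0.0],
    "NEW_B1": [0.0, 1.0, 0.0],
    "NEW_B2": [0.0, 0.8, 0.6],
}

results = top_k([0.0, 0.9, 0.1], INDEX, k=2)
for code, score in results:
    print(code, round(score, 3))
```

A reranker would take the `top_k` output and reorder it with a more expensive model, so the choice of what text to embed (titles only or full explanatory notes) mainly affects which candidates reach that second stage.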