This repository contains the implementation for "From Gloss to Meaning: Evaluating Pre-trained Language Models for Bidirectional Sign Language Translation" - a comprehensive study comparing fine-tuned pre-trained language models against transformer models trained from scratch for sign language gloss translation.
Our research demonstrates that fine-tuning large pre-trained language models (PLMs) significantly outperforms training Transformers from scratch for bidirectional sign language gloss translation. We evaluate multiple PLMs across three benchmark datasets and achieve state-of-the-art results.
asl-translation/
├── base_pipeline.py # Base class with common functionality
├── preprocessors.py # Text and gloss preprocessing utilities
│
├── gloss_to_text_data.py # Data processing for gloss→text
├── gloss_to_text_model.py # Model handling for gloss→text
├── gloss_to_text_pipeline.py # Complete gloss→text pipeline
│
├── text_to_gloss_data.py # Data processing for text→gloss
├── text_to_gloss_model.py # Model handling for text→gloss
├── text_to_gloss_pipeline.py # Complete text→gloss pipeline
│
├── example_usage.py # Multiple usage examples
├── requirements.txt # Python dependencies
├── __init__.py # Package initialization
├── setup.py # Package installation
└── README.md # This file
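The two translation directions mirror each other: each has a data module, a model module, and a pipeline module that share functionality from base_pipeline.py and preprocessors.py. A minimal sketch of the two entry points (the gloss→text class name appears in the quick start below; the text→gloss class name is assumed by symmetry and may differ):

```python
# Sketch of the two pipeline entry points. GlossToTextTranslationPipeline is used
# in the quick start below; TextToGlossTranslationPipeline is an assumed name for
# the mirrored text→gloss class and may differ in text_to_gloss_pipeline.py.
from gloss_to_text_pipeline import GlossToTextTranslationPipeline
from text_to_gloss_pipeline import TextToGlossTranslationPipeline  # assumed class name

g2t_pipeline = GlossToTextTranslationPipeline()   # gloss → spoken-language text
t2g_pipeline = TextToGlossTranslationPipeline()   # spoken-language text → gloss
```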
- Python 3.8+
- CUDA-capable GPU (8GB+ VRAM recommended, 16GB+ for LLaMA)
- 16GB+ system RAM (32GB+ recommended for LLaMA)
git clone https://github.com/imics-lab/gloss2text.git
cd gloss2text
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

from gloss_to_text_pipeline import GlossToTextTranslationPipeline
# Initialize
pipeline = GlossToTextTranslationPipeline()
# Step by step
raw_ds = pipeline.load_dataset()
df, gloss_col, text_col = pipeline.preprocess_data(raw_ds)
ds = pipeline.prepare_data_for_training(df, gloss_col, text_col)
tokenizer, _ = pipeline.load_model_and_tokenizer()
tok_ds = pipeline.tokenize_data(ds)
trainer = pipeline.train_model(tok_ds, output_dir="./gloss_to_text_t5")

Supported models:
- t5-base: T5-small (220M params)
- flan-t5-base: Flan-T5-small (220M params)
- mbart: mBART-small (125M params)
- llama-8b: LLaMA 3.1 8B (8B params)

Supported datasets:
- signum: SIGNUM dataset (DGS ↔ German)
- phoenix: RWTH-PHOENIX-14T (DGS ↔ German)
- aslg: ASLG-PC12 (ASL ↔ English)

Translation directions:
- g2t: Gloss-to-Text translation
- t2g: Text-to-Gloss translation
- both: Train both directions sequentially
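Once training finishes and the checkpoint has been saved to the output directory (for example via trainer.save_model()), the fine-tuned model can be loaded for inference with plain Hugging Face transformers. A minimal sketch, assuming a T5-style seq2seq checkpoint in ./gloss_to_text_t5; note that if the pipeline prepends a task prefix or applies other preprocessing during training (see preprocessors.py), the same preprocessing should be mirrored at inference time:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumes the fine-tuned checkpoint was saved to ./gloss_to_text_t5,
# e.g. via trainer.save_model("./gloss_to_text_t5").
tokenizer = AutoTokenizer.from_pretrained("./gloss_to_text_t5")
model = AutoModelForSeq2SeqLM.from_pretrained("./gloss_to_text_t5")

gloss = "JOHN FUTURE FINISH BUY HOUSE"  # illustrative ASL gloss sequence
inputs = tokenizer(gloss, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```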
- Fine-tuned PLMs significantly outperform baseline Transformers across all benchmarks.
- G2T is consistently easier than T2G: BLEU-4 is 30–60% higher and WER substantially lower.
- LLaMA 8B achieves the best results overall, especially on large-scale ASLG-PC12 (83.10 BLEU-4 G2T, 55.21 BLEU-4 T2G).
- mBART-small excels on low-resource datasets like SIGNUM and PHOENIX-14T due to its multilingual denoising pre-training.
| Model | Min VRAM | Recommended VRAM | Training Time* |
|---|---|---|---|
| T5-small | 4GB | 8GB | ~2 hours |
| Flan-T5-small | 4GB | 8GB | ~2 hours |
| mBART-small | 6GB | 8GB | ~2.5 hours |
| LLaMA 8B | 12GB | 16GB+ | ~8 hours |
*Approximate times for 1000 samples, 5 epochs on an RTX 4090
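Fitting LLaMA 8B into 12–16GB of VRAM typically requires parameter-efficient fine-tuning. The sketch below shows one common recipe (4-bit quantization with bitsandbytes plus LoRA adapters via peft); it illustrates the general technique and is not necessarily the exact configuration used by this repository's pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # illustrative; use the checkpoint configured in the pipeline

# Load the base model in 4-bit NF4 so the quantized weights fit in a few GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small low-rank adapters instead of all 8B parameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```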
- BLEU-1/2/3/4: N-gram precision scores
- ROUGE-L: Longest common subsequence
- METEOR: Alignment-based semantic evaluation
- WER: Word Error Rate
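These metrics can be reproduced with standard libraries. A minimal sketch using the Hugging Face evaluate package (the exact tokenization, casing, and aggregation behind the reported numbers may differ):

```python
import evaluate

predictions = ["the man buys a house"]        # model outputs
references  = ["the man is buying a house"]   # gold translations

bleu = evaluate.load("bleu")      # reports BLEU plus 1- to 4-gram precisions
rouge = evaluate.load("rouge")    # reports ROUGE-L among others
meteor = evaluate.load("meteor")
wer = evaluate.load("wer")

bleu_scores = bleu.compute(predictions=predictions,
                           references=[[r] for r in references], max_order=4)
print("BLEU-4:", bleu_scores["bleu"], "n-gram precisions:", bleu_scores["precisions"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
print("WER:", wer.compute(predictions=predictions, references=references))
```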
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
This project is licensed under the MIT License - see the LICENSE file for details.