Production backend serving a fine-tuned Llama 3.1 8B model for cancer-specific information extraction (IE) via vLLM on AWS EC2
This repository contains the backend infrastructure for a medical information extraction system that uses a fine-tuned Llama 3.1 8B model (QLoRA, 4-bit quantization) to extract structured cancer-related entities from clinical text.
The system performs cancer-specific named entity recognition (NER) and information extraction (IE) on unstructured clinical notes, returning structured data across 7 medical fields: cancer type, stage, gene mutations, biomarkers, treatments, treatment responses, and metastasis sites.
🌐 Live Application: https://medical-extraction.vercel.app
📱 Frontend Repository: slm-ft-serving-frontend
```mermaid
flowchart TB
    subgraph Browser["🌐 User Browser"]
        UI["medical-extraction.vercel.app"]
    end
    subgraph Vercel["☁️ Vercel"]
        Frontend["Next.js Frontend<br/>• React UI components<br/>• Server-side API routes<br/>• Input validation"]
    end
    subgraph EC2["🖥️ EC2 g6.2xlarge (us-east-1)"]
        subgraph Gateway["FastAPI Gateway :8080"]
            API["REST API endpoints<br/>• Request validation (Pydantic)<br/>• CORS configuration"]
        end
        subgraph vLLM["vLLM Server :8000"]
            Model["Llama 3.1 8B + LoRA<br/>• GPU inference (L4)<br/>• Model caching on EBS"]
        end
        Gateway -->|HTTP| vLLM
    end
    Browser -->|HTTPS| Vercel
    Vercel -->|HTTP proxied| EC2
```
⚠️ Note: This is an experimental project. The frontend is always live on Vercel, but the backend EC2 server is not kept running at all times in order to save costs (the GPU instance runs ~$1/hour). If extraction requests fail, the backend is likely stopped.
| Component | Technology | Port | Description |
|---|---|---|---|
| Frontend | Next.js 16 + React | N/A | User interface on Vercel (separate repo) |
| Gateway | FastAPI + Pydantic | 8080 | REST API layer with validation |
| Inference | vLLM + Llama 3.1 8B | 8000 | LLM serving with LoRA adapter |
| Infrastructure | EC2 + Docker Compose | N/A | Container orchestration |
| CI/CD | GitHub Actions + ECR | N/A | Automated builds and deployments |
The model was fine-tuned on synthetic cancer clinical data using QLoRA (4-bit quantization) for parameter-efficient training.
Base Model: meta-llama/Llama-3.1-8B
Fine-tuned Adapter: loghoag/llama-3.1-8b-medical-ie
```json
{
  "instruction": "Extract all cancer-related entities from the text.",
  "input": "70-year-old man with widely metastatic cutaneous melanoma...",
  "output": {
    "cancer_type": "melanoma (cutaneous)",
    "stage": "IV",
    "gene_mutation": null,
    "biomarker": "PD-L1 5%; TMB-high",
    "treatment": "nivolumab and ipilimumab; stereotactic radiosurgery",
    "response": "mixed response",
    "metastasis_site": "brain"
  }
}
```

The model extracts 7 structured fields:
| Field | Description | Example |
|---|---|---|
| `cancer_type` | Type of cancer | melanoma, breast cancer, NSCLC |
| `stage` | Cancer stage | III, IV, metastatic |
| `gene_mutation` | Genetic mutations | EGFR exon 19, KRAS G12D, BRCA1 |
| `biomarker` | Biomarker status | HER2+, PD-L1 5%, TMB-high |
| `treatment` | Treatments given | nivolumab, chemotherapy, surgery |
| `response` | Treatment response | complete response, stable disease |
| `metastasis_site` | Metastasis locations | brain, liver, bone |
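For illustration, here is a minimal sketch of how this schema could be expressed as a Pydantic model on the gateway side. The class name and field handling below are assumptions, not the repository's actual models; see `gateway/routers/extraction.py` for the real implementation.

```python
from typing import Optional
from pydantic import BaseModel


class ExtractionResult(BaseModel):
    """Hypothetical response schema mirroring the 7 extracted fields.

    Every field is optional because the model returns null when an
    entity is not mentioned in the clinical note.
    """

    cancer_type: Optional[str] = None
    stage: Optional[str] = None
    gene_mutation: Optional[str] = None
    biomarker: Optional[str] = None
    treatment: Optional[str] = None
    response: Optional[str] = None
    metastasis_site: Optional[str] = None


# Validate a raw model output (Pydantic v2 API assumed)
result = ExtractionResult.model_validate({
    "cancer_type": "melanoma (cutaneous)",
    "stage": "IV",
    "gene_mutation": None,
    "biomarker": "PD-L1 5%; TMB-high",
    "treatment": "nivolumab and ipilimumab; stereotactic radiosurgery",
    "response": "mixed response",
    "metastasis_site": "brain",
})
print(result.model_dump_json(indent=2))
```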
- AWS account with EC2, ECR, SSM, Secrets Manager access
- GitHub account with Actions enabled
- Poetry installed locally (`brew install poetry`)
- AWS CLI configured with credentials
- HuggingFace account with Llama 3.1 access
| Secret | Purpose | Location |
|---|---|---|
| `HF_TOKEN` | HuggingFace access token | AWS Secrets Manager |
| `AWS_ACCESS_KEY_ID` | AWS credentials | GitHub Secrets |
| `AWS_SECRET_ACCESS_KEY` | AWS credentials | GitHub Secrets |
```bash
# Clone and install
git clone https://github.com/longhoag/slm-ft-serving.git
cd slm-ft-serving && poetry install

# Deploy to EC2
poetry run python scripts/deploy.py

# Verify deployment
curl http://<ec2-ip>:8080/health
```

- Push to main → GitHub Actions triggers
- Parallel builds → vLLM + Gateway Docker images
- Push to ECR → Cache-optimized registry
- Manual deploy → `poetry run python scripts/deploy.py`
| Category | Technology |
|---|---|
| Inference Engine | vLLM (optimized LLM serving) |
| API Framework | FastAPI + Pydantic |
| Language | Python 3.11+ |
| Dependency Management | Poetry |
| Containerization | Docker + Docker Compose |
| Cloud Infrastructure | AWS EC2 (g6.2xlarge, L4 GPU) |
| Container Registry | AWS ECR |
| Remote Execution | AWS Systems Manager (SSM) |
| Secrets Management | AWS Secrets Manager + SSM Parameter Store |
| CI/CD | GitHub Actions |
| Logging | Loguru + CloudWatch Logs |
Frontend (Separate Repository)
| Category | Technology |
|---|---|
| Framework | Next.js 16 (App Router) |
| Language | TypeScript (strict mode) |
| Styling | TailwindCSS v4 |
| UI Components | ShadcnUI (Radix primitives) |
| Deployment | Vercel (serverless) |
```mermaid
flowchart TB
    subgraph vLLM["vLLM Server (Port 8000)"]
        subgraph Models["Model Loading"]
            Base["Base Model<br/>Llama 3.1 8B<br/>(32.1 GB)"]
            LoRA["LoRA Adapter<br/>medical-ie<br/>(71.8 MB)"]
            Base --> LoRA
        end
        subgraph GPU["NVIDIA L4 GPU (24GB VRAM)"]
            Batch["Continuous batching"]
            Paged["PagedAttention"]
            API["OpenAI-compatible API"]
        end
        Models --> GPU
        Cache[("Docker Volume<br/>huggingface-cache<br/>(EBS persistent)")]
    end
```
Key Features:
- LoRA hot-loading
- Model persistence on EBS
- Health endpoint for orchestration
- Custom chat template
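Because the server exposes an OpenAI-compatible API on port 8000, it can be queried directly (for example from the EC2 host) with the standard `openai` Python client. A minimal sketch, assuming the LoRA adapter is served under the name `medical-ie` and that the prompt mirrors the training format; check `GET /v1/models` for the identifiers actually registered:

```python
# Query the OpenAI-compatible vLLM endpoint directly (bypassing the gateway).
# The model name "medical-ie" and prompt format are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="medical-ie",  # assumed LoRA adapter name registered with vLLM
    prompt=(
        "Extract all cancer-related entities from the text.\n\n"
        "70-year-old man with widely metastatic cutaneous melanoma..."
    ),
    max_tokens=256,
    temperature=0.0,
)
print(completion.choices[0].text)
```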
```mermaid
flowchart TB
    subgraph Gateway["FastAPI Gateway (Port 8080)"]
        subgraph Endpoints["API Endpoints"]
            Health["/health<br/>Health check"]
            Docs["/docs<br/>Swagger UI"]
            Extract["/api/v1/extract<br/>Main endpoint"]
        end
        subgraph Pipeline["Request Processing Pipeline"]
            direction TB
            P1["1. Pydantic validation"]
            P2["2. Prompt construction"]
            P3["3. vLLM API call"]
            P4["4. JSON parsing"]
            P5["5. Structured response"]
            P1 --> P2 --> P3 --> P4 --> P5
        end
        Extract --> Pipeline
        CORS["CORS: Vercel domains only"]
    end
```
Endpoints:
- `GET /health` - Health check
- `GET /docs` - Swagger UI
- `POST /api/v1/extract` - Main extraction endpoint (example call below)
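As a sketch of how a client could call the extraction endpoint with `requests`; the request body field name (`text`) and the exact response shape are assumptions here, so treat the Swagger UI at `/docs` as the authoritative contract:

```python
# Hypothetical client call to the gateway's extraction endpoint.
# The "text" field name is assumed; consult /docs for the real schema.
import requests

GATEWAY_URL = "http://<ec2-ip>:8080"  # replace with the running EC2 address

note = "70-year-old man with widely metastatic cutaneous melanoma..."
resp = requests.post(f"{GATEWAY_URL}/api/v1/extract", json={"text": note}, timeout=60)
resp.raise_for_status()
print(resp.json())  # structured entities across the 7 medical fields
```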
```mermaid
flowchart TB
    subgraph Compose["Docker Compose Architecture"]
        subgraph Network["Docker Network"]
            GW["Gateway<br/>:8080<br/>depends_on: vllm:healthy"]
            VLLM["vLLM<br/>:8000<br/>GPU: L4"]
            GW -->|HTTP| VLLM
        end
        Volume[("Named Volume<br/>huggingface-cache<br/>Persistent on EBS")]
        VLLM --> Volume
    end
    Client["External Client"] -->|Port 8080| GW
```
Orchestration:
- Health check dependencies
- 6-minute startup window allowed for model loading (readiness example below)
- GPU reservation
- Persistent volumes
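Since the gateway only becomes healthy after the vLLM container finishes loading the model, which can take several minutes, a small readiness poll is convenient after a deploy. A minimal sketch; the endpoint placeholder is the same `<ec2-ip>` used above:

```python
# Poll the gateway's /health endpoint until the stack is ready.
import time

import requests


def wait_for_health(url: str, timeout_s: int = 600, interval_s: int = 15) -> bool:
    """Return True once the health endpoint answers 200, False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # still starting, or the backend is stopped to save costs
        time.sleep(interval_s)
    return False


if wait_for_health("http://<ec2-ip>:8080/health"):
    print("Backend is up")
else:
    print("Backend did not become healthy in time")
```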
```mermaid
flowchart TB
    Push["Push to main"] --> Actions["GitHub Actions"]
    Actions --> Build1["Build vLLM Image"]
    Actions --> Build2["Build Gateway Image"]
    Build1 --> ECR["AWS ECR<br/>• slm-ft-serving-vllm<br/>• slm-ft-serving-gateway<br/>• :buildcache layer"]
    Build2 --> ECR
    ECR --> Deploy["Manual: deploy.py<br/>(SSM → EC2)"]
```
Optimizations:
- Parallel builds
- ECR layer caching
- Disk cleanup before builds
```mermaid
flowchart LR
    subgraph Local["Local Mac"]
        Script["deploy.py<br/>• Start EC2<br/>• Wait OK<br/>• Send commands"]
    end
    subgraph AWS["AWS"]
        SSM["SSM<br/>Run Command"]
        subgraph EC2["EC2 g6.2xlarge"]
            Agent["SSM Agent<br/>• Fetch HF token<br/>• ECR login<br/>• Pull images<br/>• docker compose up"]
        end
        SSM --> EC2
    end
    Local -->|"AWS SSM API"| SSM
```
Security:
- No SSH / `.pem` keys (deployment runs through SSM Run Command; see the sketch below)
- Secrets from AWS Secrets Manager
- SSM Parameter Store for config
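For illustration, a minimal boto3 sketch of this SSM-driven flow; the instance ID, account placeholder, and command list are illustrative stand-ins, not the values used by `scripts/deploy.py`:

```python
# Sketch of SSM-based deployment: start the instance, then run shell
# commands through AWS Systems Manager instead of SSH.
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder EC2 instance ID

ec2 = boto3.client("ec2", region_name=REGION)
ssm = boto3.client("ssm", region_name=REGION)

# Start the GPU instance and wait until it is running
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

# Execute the deployment commands on the instance via SSM Run Command
response = ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [
        "aws ecr get-login-password --region us-east-1 | "
        "docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com",
        "docker compose pull",
        "docker compose up -d",
    ]},
)
print("Command ID:", response["Command"]["CommandId"])
```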
- ✅ High-Performance Inference - vLLM optimizations for fast LLM serving
- ✅ GPU Acceleration - NVIDIA L4 GPU for efficient inference
- ✅ LoRA Adapter Support - Load fine-tuned adapters without full model retraining
- ✅ Model Caching - Persistent storage on EBS (survives container restarts)
- ✅ Health Checks - Automated container health monitoring
- ✅ Input Validation - Pydantic models for request/response validation
- ✅ CORS Security - Restricted to Vercel domains (see the sketch after this list)
- ✅ API Documentation - Interactive Swagger UI at `/docs`
- ✅ Structured Output - 7 medical fields in JSON format
- ✅ Error Handling - Proper HTTP status codes and error messages
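A minimal sketch of how the Vercel-only CORS restriction might be wired up in FastAPI; the allowed origins and app setup in `gateway/main.py` may differ:

```python
# Sketch: restrict CORS to the Vercel frontend in a FastAPI gateway.
# The allowed-origin list is an assumption; the real configuration may differ.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Medical IE Gateway")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://medical-extraction.vercel.app"],  # Vercel domain only
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```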
Frontend Features (View Frontend Repo)
- ✨ Real-time Extraction - Extract medical entities in 2-3 seconds
- 🔒 Secure Architecture - EC2 backend IP hidden via server-side proxy
- 📱 Responsive Design - Works seamlessly on mobile and desktop
- 🎯 Type-safe - Full TypeScript coverage with strict mode
- ⚡ Fast & Modern - Built with Next.js 16 and TailwindCSS v4
- 🚀 Auto-deploy - Push to main → live on Vercel instantly
- SSM-only access: No SSH, no `.pem` keys for EC2 access
- Secrets Manager: All secrets stored securely, never in `.env` files
- Poetry for Python: No raw `pip install` commands
- Loguru for logging: No `print()` statements in production code (see the sketch after this list)
- Staged development: Complete each stage before moving forward
- Fail-safe execution: Commands execute with error handling and retries
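For instance, a minimal Loguru pattern consistent with the logging principle above; the sink path and rotation settings are illustrative, not the project's actual configuration:

```python
# Structured, leveled logging with Loguru instead of print().
import sys

from loguru import logger

logger.remove()                       # drop the default handler
logger.add(sys.stderr, level="INFO")  # console sink
logger.add("logs/gateway.log", rotation="10 MB", retention="7 days")  # illustrative file sink

note = "70-year-old man with widely metastatic cutaneous melanoma..."
logger.info("Received extraction request ({} characters)", len(note))
```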
```
slm-ft-serving/
├── .github/
│   ├── workflows/deploy.yml        # CI/CD pipeline
│   └── copilot-instructions.md     # AI assistant context
├── config/deployment.yml           # Deployment configuration
├── docs/STAGE-3.md                 # Stage 3 documentation
├── gateway/
│   ├── routers/extraction.py       # Extraction endpoint
│   ├── main.py                     # FastAPI app
│   └── Dockerfile                  # Gateway Docker image
├── scripts/deploy.py               # SSM deployment script
├── Dockerfile                      # vLLM Docker image
├── docker-compose.yml              # Container orchestration
├── pyproject.toml                  # Poetry dependencies
└── README.md
```
This project follows a staged development approach:
| Stage | Status | Description |
|---|---|---|
| 1 | ✅ Complete | vLLM server with LoRA adapter on EC2 g6.2xlarge |
| 2 | ✅ Complete | FastAPI gateway with Docker Compose orchestration |
| 3 | ✅ Complete | Next.js frontend on Vercel |
| 4 | 🔮 Planned | CloudWatch monitoring & observability |
- Live Application: https://medical-extraction.vercel.app
- Frontend Repository: slm-ft-serving-frontend
- Base Model: meta-llama/Llama-3.1-8B
- Fine-tuned Adapter: loghoag/llama-3.1-8b-medical-ie
MIT License - see LICENSE file for details.
- vLLM Team - High-performance LLM inference engine
- Meta AI - Llama 3.1 base model
- HuggingFace - Model hosting and fine-tuning infrastructure
- FastAPI - Modern Python web framework
- Vercel - Frontend hosting platform