
Medical Information Extraction - Backend Server

Production backend serving a fine-tuned Llama 3.1 8B model for cancer-specific information extraction (IE) via vLLM on AWS EC2



🎯 Overview

This repository contains the backend infrastructure for a medical information extraction system that uses a fine-tuned Llama 3.1 8B model (QLoRA 4-bit quantization) to extract structured cancer-related entities from clinical text.

The system performs cancer-specific named entity recognition (NER) and information extraction (IE) on unstructured clinical notes, returning structured data across 7 medical fields: cancer type, stage, gene mutations, biomarkers, treatments, treatment responses, and metastasis sites.

๐ŸŒ Live Application: https://medical-extraction.vercel.app

๐Ÿ“ฑ Frontend Repository: slm-ft-serving-frontend


๐Ÿ—๏ธ Architecture

flowchart TB
    subgraph Browser["🌐 User Browser"]
        UI["medical-extraction.vercel.app"]
    end

    subgraph Vercel["☁️ Vercel"]
        Frontend["Next.js Frontend<br/>• React UI components<br/>• Server-side API routes<br/>• Input validation"]
    end

    subgraph EC2["🖥️ EC2 g6.2xlarge (us-east-1)"]
        subgraph Gateway["FastAPI Gateway :8080"]
            API["REST API endpoints<br/>โ€ข Request validation (Pydantic)<br/>โ€ข CORS configuration"]
        end
        subgraph vLLM["vLLM Server :8000"]
            Model["Llama 3.1 8B + LoRA<br/>โ€ข GPU inference (L4)<br/>โ€ข Model caching on EBS"]
        end
        Gateway -->|HTTP| vLLM
    end

    Browser -->|HTTPS| Vercel
    Vercel -->|HTTP proxied| EC2

โš ๏ธ Note: This is an experimental project. The frontend is always live on Vercel, but the backend EC2 server may not be running at all times to save costs (~$1/hour for GPU instance). If extraction requests fail, the backend is likely stopped.

Component Details

| Component | Technology | Port | Description |
| --- | --- | --- | --- |
| Frontend | Next.js 16 + React | N/A | User interface on Vercel (separate repo) |
| Gateway | FastAPI + Pydantic | 8080 | REST API layer with validation |
| Inference | vLLM + Llama 3.1 8B | 8000 | LLM serving with LoRA adapter |
| Infrastructure | EC2 + Docker Compose | N/A | Container orchestration |
| CI/CD | GitHub Actions + ECR | N/A | Automated builds and deployments |

🧬 The Model

Fine-tuning Details

The model was fine-tuned on synthetic cancer clinical data using the QLoRA (4-bit quantization) technique for parameter-efficient training.

Base Model: meta-llama/Llama-3.1-8B
Fine-tuned Adapter: loghoag/llama-3.1-8b-medical-ie

Training Data Format

{
  "instruction": "Extract all cancer-related entities from the text.",
  "input": "70-year-old man with widely metastatic cutaneous melanoma...",
  "output": {
    "cancer_type": "melanoma (cutaneous)",
    "stage": "IV",
    "gene_mutation": null,
    "biomarker": "PD-L1 5%; TMB-high",
    "treatment": "nivolumab and ipilimumab; stereotactic radiosurgery",
    "response": "mixed response",
    "metastasis_site": "brain"
  }
}
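
For illustration only, a record like this can be flattened into an instruction-style prompt/completion pair before QLoRA training. The template below is a generic sketch, not the project's actual (custom) chat template:

```python
import json

def build_example(record: dict) -> dict:
    """Hypothetical formatter: one training record -> prompt/completion pair."""
    prompt = (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        "### Output:\n"
    )
    # The model learns to emit the 7 structured fields as a JSON object.
    completion = json.dumps(record["output"], ensure_ascii=False)
    return {"prompt": prompt, "completion": completion}
```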

Extraction Fields

The model extracts 7 structured fields:

| Field | Description | Example |
| --- | --- | --- |
| cancer_type | Type of cancer | melanoma, breast cancer, NSCLC |
| stage | Cancer stage | III, IV, metastatic |
| gene_mutation | Genetic mutations | EGFR exon 19, KRAS G12D, BRCA1 |
| biomarker | Biomarker status | HER2+, PD-L1 5%, TMB-high |
| treatment | Treatments given | nivolumab, chemotherapy, surgery |
| response | Treatment response | complete response, stable disease |
| metastasis_site | Metastasis locations | brain, liver, bone |

🚀 Getting Started

Prerequisites

  • AWS account with EC2, ECR, SSM, Secrets Manager access
  • GitHub account with Actions enabled
  • Poetry installed locally (brew install poetry)
  • AWS CLI configured with credentials
  • HuggingFace account with Llama 3.1 access

Required Secrets

| Secret | Purpose | Location |
| --- | --- | --- |
| HF_TOKEN | HuggingFace access token | AWS Secrets Manager |
| AWS_ACCESS_KEY_ID | AWS credentials | GitHub Secrets |
| AWS_SECRET_ACCESS_KEY | AWS credentials | GitHub Secrets |
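
At deploy time, HF_TOKEN is read from AWS Secrets Manager rather than a .env file. A minimal boto3 sketch, assuming a secret name and region that may differ from the project's actual identifiers:

```python
import boto3

def get_hf_token(secret_id: str = "hf-token", region: str = "us-east-1") -> str:
    """Fetch the HuggingFace token from AWS Secrets Manager (hypothetical secret name)."""
    client = boto3.client("secretsmanager", region_name=region)
    return client.get_secret_value(SecretId=secret_id)["SecretString"]
```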

Quick Start

# Clone and install
git clone https://github.com/longhoag/slm-ft-serving.git
cd slm-ft-serving && poetry install

# Deploy to EC2
poetry run python scripts/deploy.py

# Verify deployment
curl http://<ec2-ip>:8080/health
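
Beyond the health check, the extraction endpoint can be called directly once the containers are up. The snippet below is a sketch: the paths match the gateway's documented endpoints, but the request field name ("text") is an assumption, so confirm the exact schema in the Swagger UI at /docs.

```python
import requests

BASE = "http://<ec2-ip>:8080"  # replace <ec2-ip> with the instance's public IP

# Health check (same as the curl command above)
print(requests.get(f"{BASE}/health", timeout=10).json())

# Extraction request; the "text" field name is an assumption -- verify via /docs
note = "70-year-old man with widely metastatic cutaneous melanoma..."
resp = requests.post(f"{BASE}/api/v1/extract", json={"text": note}, timeout=60)
resp.raise_for_status()
print(resp.json())  # expected: the 7 structured fields described above
```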

CI/CD Workflow

  1. Push to main → GitHub Actions triggers
  2. Parallel builds → vLLM + Gateway Docker images
  3. Push to ECR → Cache-optimized registry
  4. Manual deploy → poetry run python scripts/deploy.py

๐Ÿ› ๏ธ Tech Stack

Backend (This Repository)

| Category | Technology |
| --- | --- |
| Inference Engine | vLLM (optimized LLM serving) |
| API Framework | FastAPI + Pydantic |
| Language | Python 3.11+ |
| Dependency Management | Poetry |
| Containerization | Docker + Docker Compose |
| Cloud Infrastructure | AWS EC2 (g6.2xlarge, L4 GPU) |
| Container Registry | AWS ECR |
| Remote Execution | AWS Systems Manager (SSM) |
| Secrets Management | AWS Secrets Manager + SSM Parameter Store |
| CI/CD | GitHub Actions |
| Logging | Loguru + CloudWatch Logs |
Frontend (Separate Repository)

| Category | Technology |
| --- | --- |
| Framework | Next.js 16 (App Router) |
| Language | TypeScript (strict mode) |
| Styling | TailwindCSS v4 |
| UI Components | ShadcnUI (Radix primitives) |
| Deployment | Vercel (serverless) |

🔧 Backend Deep Dive

vLLM Inference Server

flowchart TB
    subgraph vLLM["vLLM Server (Port 8000)"]
        subgraph Models["Model Loading"]
            Base["Base Model<br/>Llama 3.1 8B<br/>(32.1 GB)"]
            LoRA["LoRA Adapter<br/>medical-ie<br/>(71.8 MB)"]
            Base --> LoRA
        end
        
        subgraph GPU["NVIDIA L4 GPU (24GB VRAM)"]
            Batch["Continuous batching"]
            Paged["PagedAttention"]
            API["OpenAI-compatible API"]
        end
        
        Models --> GPU
        
        Cache[("Docker Volume<br/>huggingface-cache<br/>(EBS persistent)")]
    end

Key Features:

  • LoRA hot-loading
  • Model persistence on EBS
  • Health endpoint for orchestration
  • Custom chat template
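
Because vLLM exposes an OpenAI-compatible API and loads the LoRA adapter next to the base model, the gateway's call to the inference server can be sketched roughly as follows. The adapter name, internal URL, and prompt format here are illustrative assumptions, not the project's actual values:

```python
import requests

VLLM_URL = "http://vllm:8000/v1/completions"  # assumed Docker-network hostname

def extract_entities(clinical_text: str) -> str:
    """Send one extraction prompt to the vLLM server and return the raw completion."""
    payload = {
        # vLLM routes the request to the LoRA adapter registered under this name;
        # "medical-ie" is a placeholder for the project's actual adapter name.
        "model": "medical-ie",
        "prompt": (
            "Extract all cancer-related entities from the text.\n\n"
            f"{clinical_text}\n\nOutput:"
        ),
        "max_tokens": 256,
        "temperature": 0.0,
    }
    response = requests.post(VLLM_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["text"]
```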

FastAPI Gateway

flowchart TB
    subgraph Gateway["FastAPI Gateway (Port 8080)"]
        subgraph Endpoints["API Endpoints"]
            Health["/health<br/>Health check"]
            Docs["/docs<br/>Swagger UI"]
            Extract["/api/v1/extract<br/>Main endpoint"]
        end
        
        subgraph Pipeline["Request Processing Pipeline"]
            direction TB
            P1["1. Pydantic validation"]
            P2["2. Prompt construction"]
            P3["3. vLLM API call"]
            P4["4. JSON parsing"]
            P5["5. Structured response"]
            P1 --> P2 --> P3 --> P4 --> P5
        end
        
        Extract --> Pipeline
        
        CORS["CORS: Vercel domains only"]
    end

Endpoints:

  • GET /health - Health check
  • GET /docs - Swagger UI
  • POST /api/v1/extract - Main extraction endpoint
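
A minimal sketch of the Pydantic models behind /api/v1/extract. Class and field names are assumptions for illustration; the actual schemas live in gateway/routers/extraction.py:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ExtractionRequest(BaseModel):
    # Raw clinical note to extract from; field name and limits are assumptions.
    text: str = Field(..., min_length=1, max_length=8000)

class ExtractionResponse(BaseModel):
    # The 7 structured fields; null when an entity is absent from the note.
    cancer_type: Optional[str] = None
    stage: Optional[str] = None
    gene_mutation: Optional[str] = None
    biomarker: Optional[str] = None
    treatment: Optional[str] = None
    response: Optional[str] = None
    metastasis_site: Optional[str] = None
```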

Container Orchestration

flowchart TB
    subgraph Compose["Docker Compose Architecture"]
        subgraph Network["Docker Network"]
            GW["Gateway<br/>:8080<br/>depends_on: vllm:healthy"]
            VLLM["vLLM<br/>:8000<br/>GPU: L4"]
            GW -->|HTTP| VLLM
        end
        
        Volume[("Named Volume<br/>huggingface-cache<br/>Persistent on EBS")]
        VLLM --> Volume
    end
    
    Client["External Client"] -->|Port 8080| GW

Orchestration:

  • Health check dependencies
  • 6-min startup for model loading
  • GPU reservation
  • Persistent volumes

CI/CD Pipeline

flowchart TB
    Push["Push to main"] --> Actions["GitHub Actions"]
    
    Actions --> Build1["Build vLLM Image"]
    Actions --> Build2["Build Gateway Image"]
    
    Build1 --> ECR["AWS ECR<br/>• slm-ft-serving-vllm<br/>• slm-ft-serving-gateway<br/>• :buildcache layer"]
    Build2 --> ECR
    
    ECR --> Deploy["Manual: deploy.py<br/>(SSM → EC2)"]

Optimizations:

  • Parallel builds
  • ECR layer caching
  • Disk cleanup before builds

Remote Deployment (SSM)

flowchart LR
    subgraph Local["Local Mac"]
        Script["deploy.py<br/>โ€ข Start EC2<br/>โ€ข Wait OK<br/>โ€ข Send commands"]
    end
    
    subgraph AWS["AWS"]
        SSM["SSM<br/>Run Command"]
        subgraph EC2["EC2 g6.2xlarge"]
            Agent["SSM Agent<br/>โ€ข Fetch HF token<br/>โ€ข ECR login<br/>โ€ข Pull images<br/>โ€ข docker compose up"]
        end
        SSM --> EC2
    end
    
    Local -->|"AWS SSM API"| SSM

Security:

  • No SSH/.pem keys
  • Secrets from AWS Secrets Manager
  • SSM Parameter Store for config
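
A rough boto3 approximation of this flow (the instance ID and shell commands below are placeholders, not the contents of the project's deploy.py):

```python
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder instance ID

ec2 = boto3.client("ec2", region_name="us-east-1")
ssm = boto3.client("ssm", region_name="us-east-1")

# 1. Start the GPU instance and wait until status checks pass.
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_status_ok").wait(InstanceIds=[INSTANCE_ID])

# 2. Run the deployment steps on the instance via SSM Run Command (no SSH).
ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [
        "aws ecr get-login-password | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com",
        "docker compose pull && docker compose up -d",
    ]},
)
```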

📊 Features & Capabilities

Backend Features

  • ✅ High-Performance Inference - vLLM optimizations for fast LLM serving
  • ✅ GPU Acceleration - NVIDIA L4 GPU for efficient inference
  • ✅ LoRA Adapter Support - Load fine-tuned adapters without full model retraining
  • ✅ Model Caching - Persistent storage on EBS (survives container restarts)
  • ✅ Health Checks - Automated container health monitoring
  • ✅ Input Validation - Pydantic models for request/response validation
  • ✅ CORS Security - Restricted to Vercel domains
  • ✅ API Documentation - Interactive Swagger UI at /docs
  • ✅ Structured Output - 7 medical fields in JSON format
  • ✅ Error Handling - Proper HTTP status codes and error messages

Frontend Features (View Frontend Repo)

  • ✨ Real-time Extraction - Extract medical entities in 2-3 seconds
  • 🔒 Secure Architecture - EC2 backend IP hidden via server-side proxy
  • 📱 Responsive Design - Works seamlessly on mobile and desktop
  • 🎯 Type-safe - Full TypeScript coverage with strict mode
  • ⚡ Fast & Modern - Built with Next.js 16 and TailwindCSS v4
  • 🔄 Auto-deploy - Push to main → live on Vercel instantly

🔧 Development Notes

Design Principles

  • SSM-only access: No SSH, no .pem keys for EC2 access
  • Secrets Manager: All secrets stored securely, never in .env files
  • Poetry for Python: No raw pip install commands
  • Loguru for logging: No print() statements in production code
  • Staged development: Complete each stage before moving forward
  • Fail-safe execution: Commands execute with error handling and retries
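
As a trivial example of the logging rule, modules log through Loguru's shared logger instead of print():

```python
from loguru import logger

logger.info("vLLM health check passed; starting gateway")
logger.error("Extraction failed: {}", "upstream vLLM returned 503")
```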

Project Structure

slm-ft-serving/
├── .github/
│   ├── workflows/deploy.yml        # CI/CD pipeline
│   └── copilot-instructions.md     # AI assistant context
├── config/deployment.yml           # Deployment configuration
├── docs/STAGE-3.md                 # Stage 3 documentation
├── gateway/
│   ├── routers/extraction.py       # Extraction endpoint
│   ├── main.py                     # FastAPI app
│   └── Dockerfile                  # Gateway Docker image
├── scripts/deploy.py               # SSM deployment script
├── Dockerfile                      # vLLM Docker image
├── docker-compose.yml              # Container orchestration
├── pyproject.toml                  # Poetry dependencies
└── README.md

📋 Project Stages

This project follows a staged development approach:

| Stage | Status | Description |
| --- | --- | --- |
| 1 | ✅ Complete | vLLM server with LoRA adapter on EC2 g6.2xlarge |
| 2 | ✅ Complete | FastAPI gateway with Docker Compose orchestration |
| 3 | ✅ Complete | Next.js frontend on Vercel |
| 4 | 🔮 Planned | CloudWatch monitoring & observability |

๐ŸŒ Related Links


๐Ÿ“ License

MIT License - see LICENSE file for details.


๐Ÿ™ Acknowledgments

  • vLLM Team - High-performance LLM inference engine
  • Meta AI - Llama 3.1 base model
  • HuggingFace - Model hosting and fine-tuning infrastructure
  • FastAPI - Modern Python web framework
  • Vercel - Frontend hosting platform
