# AutoScaleAI: A Lightweight Inference Server with Resource-Aware Scheduling
AutoScaleAI is a modular Python-based microservice that performs real-time machine learning inference while monitoring system performance. It simulates intelligent auto-scaling behavior by tracking latency, CPU usage, and memory consumption per request — mimicking how infrastructure-aware systems like Nutanix dynamically adapt to workload pressure.
## Features

- ✅ FastAPI-based RESTful server for ML inference
- ✅ Pre-trained RandomForest model on the digits dataset
- ✅ System monitoring using `psutil` to track CPU and memory usage
- ✅ Latency analysis for every prediction
- ✅ Auto-scaling logic simulated via conditional triggers in `autoscaler.py` (sketched after this list)
- ✅ Modular architecture with clear separation of concerns
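The README does not document the exact trigger conditions, so the following is only a minimal sketch of what a latency-based trigger in `autoscaler.py` could look like; the threshold values and function name are assumptions, not the repository's actual implementation:

```python
# Hypothetical sketch of a conditional scaling trigger; thresholds and
# names are assumptions, not the actual autoscaler.py code.
LATENCY_THRESHOLD = 0.05  # seconds per request
CPU_THRESHOLD = 80.0      # percent utilization

def should_scale_up(latency: float, cpu: float) -> bool:
    """Flag a simulated scale-up when a request breaches either threshold."""
    return latency > LATENCY_THRESHOLD or cpu > CPU_THRESHOLD
```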
## Tech Stack

- Backend: Python 3.11, FastAPI, Uvicorn
- ML Model: scikit-learn `RandomForestClassifier`
- Monitoring: `psutil` (CPU + memory; see the snippet after this list)
- Packaging: `joblib`
- Serving: REST API via Uvicorn
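For reference, the two `psutil` calls that cover the metrics listed above are standard library APIs of that package and need no extra setup:

```python
import psutil

# System-wide CPU utilization since the previous call, as a percentage.
cpu = psutil.cpu_percent(interval=None)
# Percentage of physical RAM currently in use.
memory = psutil.virtual_memory().percent
```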
## Project Structure

```
AutoScaleAI/
├── app/
│   ├── main.py           # FastAPI API endpoints
│   ├── model.py          # Model loading and prediction
│   ├── autoscaler.py     # Latency-based scaling logic
│   ├── utils.py          # Helper functions (monitoring, formatting)
│   └── train_model.py    # One-time model training script (sketched below)
├── requirements.txt
└── README.md
```
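`train_model.py` itself is not reproduced here, but a script matching its description (train a RandomForest on the digits dataset and persist it with `joblib`) could be as small as the sketch below; the hyperparameters and artifact path are assumptions:

```python
# Plausible sketch of app/train_model.py; the hyperparameters and the
# model.joblib filename are assumptions, not the repository's actual values.
import joblib
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)          # 1,797 samples, 64 features each
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
joblib.dump(clf, "app/model.joblib")         # loaded later by app/model.py
```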
## Getting Started

```bash
git clone https://github.com/shubh-garg/AutoScaleAI.git
cd AutoScaleAI
pip install -r requirements.txt
python app/train_model.py
uvicorn app.main:app --reload
```

Visit http://127.0.0.1:8000/docs for the Swagger UI.
## API Usage

**Endpoint:** `POST /predict/`

**Sample Payload:**

```json
{
  "features": [0.0, 0.0, 10.0, 5.0, 8.0, 3.0, ..., 0.0]
}
```

**Sample Response:**

```json
{
  "prediction": 4,
  "latency": 0.0075,
  "cpu": 13.6,
  "memory": 83.6
}
```
## Highlights

- Developed a modular FastAPI microservice for AI inference that dynamically reports latency and system usage to simulate resource-aware scheduling.
- Integrated lightweight model serving (`RandomForestClassifier`) with live system telemetry via `psutil`, achieving <10 ms latency on local CPU with scaling-logic triggers (handler sketched below).
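Putting the pieces together, a handler that combines prediction with per-request telemetry could look like the sketch below. This is an illustration, not the repository's actual `app/main.py`; the request model, artifact path, and rounding are assumptions:

```python
# Hypothetical sketch of the /predict/ handler; structure and names are
# assumptions based on this README, not the actual app/main.py.
import time

import joblib
import psutil
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("app/model.joblib")  # assumed artifact path

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict/")
def predict(req: PredictRequest):
    start = time.perf_counter()
    prediction = int(model.predict([req.features])[0])
    return {
        "prediction": prediction,
        "latency": round(time.perf_counter() - start, 4),
        "cpu": psutil.cpu_percent(interval=None),
        "memory": psutil.virtual_memory().percent,
    }
```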
## Future Work

- Add async queueing / worker threads
- Deploy with Docker and Gunicorn
- Integrate GPU usage metrics (if available)
- Replace RF with ONNX or quantized Transformer model
- Live dashboard with Plotly or Grafana
## License

MIT © 2025 Shubh Garg