High-performance emotion recognition pipeline achieving 90+ FPS on CPU.
The original motivation was a low-power option for retail analytics, though the same pipeline fits several domains:
- Retail: Customer engagement analysis
- Education: Student attention monitoring
- Healthcare: Patient mood tracking
- Gaming: Adaptive difficulty based on player emotion
- Content: Audience reaction analysis
Performance:
- Face Detection: 150+ FPS
- Emotion Classification: 90+ FPS
- End-to-end Latency: ~15ms
- Memory Usage: ~50MB
- Tested on: Intel i5-3320M, 8GB RAM
The end-to-end pipeline was also tested on a Docker-simulated edge device, constraining host resources as follows:
- 1 vCPU
- 512 MB RAM
- Throttled I/O
| Stage | Latency (ms) | Throughput (FPS) | CPU Usage (%) | Memory (MB) | Threads (PIDs) |
|---|---|---|---|---|---|
| Frame Reader | ~1 | ~5 | 2 | | |
| Face Detector | ~6.1 | ~150 | ~3 | ~7 | 3 |
| Emotion Detector | ~9.1 | ~90 | ~2 | ~6 | 3 |
| Display | ~0.03 | negligible | ~0.5 | ~2 | 2 |
| End-to-End | ~15 | ~90 (avg) | ~6–8 | ~20–50 | ~10–12 |
The pipeline sustains ~90 FPS end-to-end with <10% CPU usage and ~50 MB RAM on a 1 vCPU/512 MB simulated IoT node.
Methodology Notes
- FPS and latency values were recorded under real-time, multi-threaded scheduling.
- Docker constraints were applied explicitly (`--cpus=1 --memory=512m --blkio-weight=100`) to simulate an IoT node.
- All numbers are averages across 5 runs. Reproducibility can be verified by running the included benchmarking scripts under the same flags.
Why CPU-only:
- No GPU required = lower hardware costs
- Works on edge devices (Raspberry Pi, etc.)
- Scalable to 100s of cameras without expensive servers
- Deploy anywhere (even offline/remote locations)
Install dependencies:

```
sudo apt install libopencv-dev cmake
```

- Optionally, you can build an image from the provided `Dockerfile` instead.
Build and run:

```
git clone https://github.com/fw7th/emotion.git
cd emotion/src
mkdir build && cd build
cmake .. && make -j$(nproc)
./emotion 0  # 0 is the webcam ID
```

Pipeline architecture:

```mermaid
graph LR
A[Camera Input] --> B[Reader Thread<br/>Frame Skip: Every 3rd]
B --> C[Face Detection Thread<br/>UltraFace NCNN<br/>150+ FPS]
C --> D[Emotion Thread<br/>ROI Crop + Classify<br/>90+ FPS]
D --> E[Display Thread<br/>Bounding Boxes + Labels<br/>600+ FPS]
B -.-> F[Queue 1]
C -.-> G[Queue 2]
D -.-> H[Queue 3]
F --> C
G --> D
H --> E
```
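The real implementation is C++ with a custom queue, but the wiring pattern is language-agnostic. Below is a minimal Python sketch of the same producer/consumer layout; `detect_faces`/`classify_emotion` are hypothetical stand-ins and the queue sizes are made up, not the shipped values.

```python
import queue
import threading

STOP = object()  # sentinel to shut the pipeline down cleanly

def reader(out_q, n_frames=30, frame_skip=3):
    """Reader thread: forwards every 3rd frame, as in the diagram."""
    for i in range(n_frames):
        frame = {"id": i}  # stand-in for a captured frame
        if i % frame_skip == 0:
            out_q.put(frame)
    out_q.put(STOP)

def stage(in_q, out_q, work):
    """Generic worker: pop from one queue, push results to the next."""
    while (item := in_q.get()) is not STOP:
        out_q.put(work(item))
    out_q.put(STOP)  # propagate shutdown downstream

def detect_faces(frame):      # hypothetical stand-in for UltraFace/ncnn
    return {**frame, "face": (10, 10, 64, 64)}

def classify_emotion(frame):  # hypothetical stand-in for the classifier
    return {**frame, "emotion": "neutral"}

q1, q2, q3 = queue.Queue(4), queue.Queue(4), queue.Queue(4)
threads = [
    threading.Thread(target=reader, args=(q1,)),
    threading.Thread(target=stage, args=(q1, q2, detect_faces)),
    threading.Thread(target=stage, args=(q2, q3, classify_emotion)),
]
for t in threads:
    t.start()
while (result := q3.get()) is not STOP:  # main thread plays "Display"
    print(result["id"], result["emotion"])
for t in threads:
    t.join()
```

Bounded queues give natural backpressure: if a stage stalls, upstream `put` calls block instead of letting frames pile up.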
- Build system: CMake
- Implementation details:
  - Custom anti-copy (move-only) queue.
  - `print_type` function used to print custom types.
  - Single-person smoothing classes: a hysteresis stabilizer that only switches labels on sustained high confidence (see the sketch after this list).
  - Memory management, move semantics, and efficient vector usage.
  - Multi-threaded architecture.
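The shipped stabilizer is a C++ class; the sketch below only illustrates the hysteresis idea, and `CONF_THRESHOLD`/`HOLD_FRAMES` are invented values, not the real ones.

```python
# Hysteresis stabilizer sketch: the displayed label switches only after a
# challenger wins several consecutive frames at high confidence.
# CONF_THRESHOLD and HOLD_FRAMES are illustrative, not the shipped values.
CONF_THRESHOLD = 0.7
HOLD_FRAMES = 5

class HysteresisStabilizer:
    def __init__(self):
        self.current = None    # label currently shown
        self.candidate = None  # label trying to take over
        self.streak = 0        # consecutive confident frames for candidate

    def update(self, label, confidence):
        if self.current is None:
            self.current = label               # first frame: adopt directly
        elif label == self.current or confidence < CONF_THRESHOLD:
            self.streak = 0                    # agreement or weak evidence
        elif label == self.candidate:
            self.streak += 1
            if self.streak >= HOLD_FRAMES:     # sustained high confidence
                self.current, self.streak = label, 0
        else:
            self.candidate, self.streak = label, 1
        return self.current
```

This trades a few frames of switching latency for a display that does not flicker between adjacent emotions.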
Available models:
- MobileNetV2 (ImageNet-V2 pretrained), fine-tuned for the task: 79.3% accuracy, 9.1 ms inference
- EfficientNet-Lite0 (considered, then dropped: experiments showed the accuracy/speed trade-off was not favorable)
The only change made to the model architecture was in conv1, allowing it to accept grayscale (single-channel) inputs. Metrics are reported in the benchmarking details below.
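For reference, a sketch of that change with torchvision (the layer path assumes the stock `mobilenet_v2`; the 7-class head swap is a reconstruction, not copied from the repo):

```python
import torch.nn as nn
from torchvision import models

# Stock MobileNetV2 with ImageNet IMAGENET1K_V2 weights.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V2)

# The first conv normally takes 3-channel RGB input; replace it with a
# 1-channel (grayscale) conv, keeping every other hyperparameter identical.
model.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2,
                                 padding=1, bias=False)

# Assumed here: the classifier head is resized for the 7 emotion classes.
model.classifier[1] = nn.Linear(model.last_channel, 7)
```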
The model was fine-tuned on a mix of RAF-DB and FER2013 in two phases:
- Phase 1: all layers unfrozen, high learning rate.
- Phase 2: same as Phase 1, but with a lower learning rate.
- While this may seem unorthodox, it produced the best results across rigorous tests.
- Models were allowed to train for as many epochs as needed, until validation loss plateaued and early stopping triggered.
- PyTorch's ReduceLROnPlateau was used as the LR annealing strategy (a condensed sketch follows the Training list below).
- Models were trained to detect 7 emotions: [angry, disgust, fear, happy, neutral, sad, surprise]
Data:
- Dataset: FER2013 + RAF-DB (~48k training / ~10k test / ~10k val images).
- Augmentations (a PyTorch sketch follows this list):
  - Grayscaling
  - Resize to 64x64
  - ColorJitter (brightness=0.2, contrast=0.3)
  - RandomHorizontalFlip (p=0.3)
  - RandomErasing (p=0.3, value=0.0)
- Tensors were normalized first to the range [0, 1], then to [-1, 1].
- Weighted class sampler to balance under-represented classes like disgust and mitigate bias.
- Batch size: 192
- Data loader num_workers: 4
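A sketch of that input pipeline in PyTorch (the `make_sampler` helper and `train_ds` are placeholders; the transform values mirror the list above):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((64, 64)),
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.ToTensor(),                       # pixels -> [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]), # [0, 1] -> [-1, 1]
    transforms.RandomErasing(p=0.3, value=0.0),  # tensor-space augmentation
])

def make_sampler(targets, num_classes=7):
    """Draw each sample with probability inverse to its class frequency,
    so rare classes like disgust are seen more often during training."""
    targets = torch.as_tensor(targets)
    counts = torch.bincount(targets, minlength=num_classes).float()
    sample_weights = (1.0 / counts)[targets]
    return WeightedRandomSampler(sample_weights,
                                 num_samples=len(targets),
                                 replacement=True)

# Hypothetical usage, assuming a dataset exposing per-sample labels:
# loader = DataLoader(train_ds, batch_size=192, num_workers=4,
#                     sampler=make_sampler(train_ds.targets))
```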
Training:
- Loss function: CrossEntropyLoss
- Optimizer:
  - Phase 1: AdamW(lr=1e-3, weight_decay=1e-2, betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
  - Phase 2: AdamW(lr=1e-5; all other parameters as above)
- Norm-based gradient clipping: grad_clip_norm=1.0
- Allowed epochs: 100
- Scheduler patience: 3
- Scheduler factor: 0.5
- Early stopping patience: 8
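Condensed into code, the configuration above looks roughly like this (model and dataloader construction omitted; `run_phase` and its wiring are a reconstruction of the described procedure, not the repo's actual script):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

criterion = torch.nn.CrossEntropyLoss()

def evaluate(model, loader):
    """Mean validation loss, used for both LR annealing and early stopping."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += criterion(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def run_phase(model, loaders, lr, max_epochs=100, es_patience=8):
    opt = AdamW(model.parameters(), lr=lr, weight_decay=1e-2,
                betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
    sched = ReduceLROnPlateau(opt, factor=0.5, patience=3)
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in loaders["train"]:
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            # Norm-based gradient clipping, grad_clip_norm=1.0
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
        val_loss = evaluate(model, loaders["val"])
        sched.step(val_loss)          # anneal LR when val loss plateaus
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= es_patience:  # early stopping
                break

# Phase 1 (all layers unfrozen, high LR), then Phase 2 (lower LR):
# run_phase(model, loaders, lr=1e-3)
# run_phase(model, loaders, lr=1e-5)
```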
For nerds: more information about training and model performance is available in src/python/README.md.
Cleaned datasets are available at: https://drive.google.com/file/d/1kDnWsOLdptVEOWoFhfFTSmv7sU0vP_bM/view?usp=drive_link
Benchmarks were averaged over 5 runs.
Device note: the Reader module is capped at the webcam framerate (30 fps on the test device).
| Model | Dataset | Accuracy | Weighted F1 | Inference Time | Memory |
|---|---|---|---|---|---|
| MobileNetV2 | FER2013 + RAF-DB | 79.3% | 0.793 | 9.1 ms | 50 MB |
| FER2013 SOTA (ResNet) | FER2013 | ~73% | - | ~50 ms | ~500 MB |
| RAF-DB SOTA | RAF-DB | ~88-90% | - | - | - |
| Baseline (random) | - | 14.3% | - | - | - |
| Emotion | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Angry | 0.744 | 0.778 | 0.761 | 1361 |
| Disgust | 0.908 | 0.943 | 0.923 | 801 |
| Fear | 0.777 | 0.737 | 0.756 | 1474 |
| Happy | 0.888 | 0.852 | 0.870 | 2063 |
| Neutral | 0.706 | 0.742 | 0.723 | 1643 |
| Sad | 0.704 | 0.683 | 0.694 | 1586 |
| Surprise | 0.860 | 0.878 | 0.869 | 1359 |
| Macro Avg | 0.798 | 0.802 | 0.800 | 10287 |
| Weighted Avg | 0.793 | 0.793 | 0.793 | 10287 |
| Module | Metric | Avg Value | Notes |
|---|---|---|---|
| Reader | Frame Rate | 30 fps (cap) | Limited by webcam, not algorithm |
| Face Detector | Detection Time | 6.1 ms | ~163 fps effective |
| Face Detector | Avg Frame Proc. | 6.1 ms | |
| Emotion Det. | Avg Frame Proc. | 9.1 ms | ~106 fps effective |
| Emotion Det. | Total Emotion Loop | 9.6 ms | |
| Display | Avg Frame Proc. | 0.03 ms | ~29k fps (negligible) |
- Gaze tracking and blink rate
- Head pose estimation
- Micro-expression detection
- Attention/engagement scoring
- Graceful degradation under load (unlikely with optimization)
- Multi-camera support
- N-face emotion smoothing and tracking with SORT (planned).
- This project uses ncnn, a high-performance neural network inference framework (BSD 3-Clause License).
- Face detection based on Ultra-Light-Fast-Generic-Face-Detector-1MB by Linzaer (MIT License).


