High-performance emotion recognition pipeline achieving 90+ FPS on CPU.
The original motivation was a low-power option for retail analytics, though the same pipeline fits several domains:
- Retail: Customer engagement analysis
- Education: Student attention monitoring
- Healthcare: Patient mood tracking
- Gaming: Adaptive difficulty based on player emotion
- Content: Audience reaction analysis
Performance:
- Face Detection: 150+ FPS
- Emotion Classification: 90+ FPS
- End-to-end Latency: ~15ms
- Memory Usage: ~50MB
- Tested on: Intel i5-3320M, 8GB RAM
The end-to-end pipeline was also tested on a Docker-simulated edge device, constraining host resources as follows:
- 1 vCPU
- 512 MB RAM
- Throttled I/O
| Stage | Latency (ms) | Throughput (FPS) | CPU Usage (%) | Memory (MB) | Threads (PIDs) |
|---|---|---|---|---|---|
| Frame Reader | ~1 | ~5 | 2 | | |
| Face Detector | ~6.1 | ~150 | ~3 | ~7 | 3 |
| Emotion Detector | ~9.1 | ~90 | ~2 | ~6 | 3 |
| Display | ~0.03 | negligible | ~0.5 | ~2 | 2 |
| End-to-End | ~15 | ~90 (avg) | ~6–8 | ~20–50 | ~10–12 |
The pipeline sustains ~90 FPS end-to-end with <10% CPU usage and ~50 MB RAM on a 1 vCPU/512 MB simulated IoT node.
Methodology Notes
- FPS and latency values were recorded under real-time, multi-threaded scheduling.
- Docker constraints were applied explicitly (`--cpus=1 --memory=512m --blkio-weight=100`) to simulate an IoT node.
- All numbers are averages across 5 runs. Reproducibility can be verified by running the included benchmarking scripts under the same flags.
Why CPU-only:
- No GPU required = lower hardware costs
- Works on edge devices (Raspberry Pi, etc.)
- Scalable to 100s of cameras without expensive servers
- Deploy anywhere (even offline/remote locations)
Install dependencies:

```
sudo apt install libopencv-dev cmake
```

- Optionally, you can build an image from the provided `Dockerfile` instead.
Build and run:

```
git clone https://github.com/fw7th/emotion.git
cd emotion/src
mkdir build && cd build
cmake .. && make -j$(nproc)
./emotion 0  # 0 is the webcam ID
```

Pipeline architecture:

```mermaid
graph LR
A[Camera Input] --> B[Reader Thread<br/>Frame Skip: Every 3rd]
B --> C[Face Detection Thread<br/>UltraFace NCNN<br/>150+ FPS]
C --> D[Emotion Thread<br/>ROI Crop + Classify<br/>90+ FPS]
D --> E[Display Thread<br/>Bounding Boxes + Labels<br/>600+ FPS]
B -.-> F[Queue 1]
C -.-> G[Queue 2]
D -.-> H[Queue 3]
F --> C
G --> D
H --> E
```
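The real implementation is C++ with a custom queue, but the wiring pattern is language-agnostic. Below is a minimal Python sketch of the same producer/consumer layout; `detect_faces`/`classify_emotion` are hypothetical stand-ins and the queue sizes are made up, not the shipped values.

```python
import queue
import threading

STOP = object()  # sentinel to shut the pipeline down cleanly

def reader(out_q, n_frames=30, frame_skip=3):
    """Reader thread: forwards every 3rd frame, as in the diagram."""
    for i in range(n_frames):
        frame = {"id": i}  # stand-in for a captured frame
        if i % frame_skip == 0:
            out_q.put(frame)
    out_q.put(STOP)

def stage(in_q, out_q, work):
    """Generic worker: pop from one queue, push results to the next."""
    while (item := in_q.get()) is not STOP:
        out_q.put(work(item))
    out_q.put(STOP)  # propagate shutdown downstream

def detect_faces(frame):      # hypothetical stand-in for UltraFace/ncnn
    return {**frame, "face": (10, 10, 64, 64)}

def classify_emotion(frame):  # hypothetical stand-in for the classifier
    return {**frame, "emotion": "neutral"}

q1, q2, q3 = queue.Queue(4), queue.Queue(4), queue.Queue(4)
threads = [
    threading.Thread(target=reader, args=(q1,)),
    threading.Thread(target=stage, args=(q1, q2, detect_faces)),
    threading.Thread(target=stage, args=(q2, q3, classify_emotion)),
]
for t in threads:
    t.start()
while (result := q3.get()) is not STOP:  # main thread plays "Display"
    print(result["id"], result["emotion"])
for t in threads:
    t.join()
```

Bounded queues give natural backpressure: if a stage stalls, upstream `put` calls block instead of letting frames pile up.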
- Build system: CMake
- Implementation details:
  - Custom anti-copy (move-only) queue.
  - `print_type` function used to print custom types.
  - Single-person smoothing classes: a hysteresis stabilizer that only switches labels on sustained high confidence (see the sketch after this list).
  - Memory management, move semantics, and efficient vector usage.
  - Multi-threaded architecture.
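The shipped stabilizer is a C++ class; the sketch below only illustrates the hysteresis idea, and `CONF_THRESHOLD`/`HOLD_FRAMES` are invented values, not the real ones.

```python
# Hysteresis stabilizer sketch: the displayed label switches only after a
# challenger wins several consecutive frames at high confidence.
# CONF_THRESHOLD and HOLD_FRAMES are illustrative, not the shipped values.
CONF_THRESHOLD = 0.7
HOLD_FRAMES = 5

class HysteresisStabilizer:
    def __init__(self):
        self.current = None    # label currently shown
        self.candidate = None  # label trying to take over
        self.streak = 0        # consecutive confident frames for candidate

    def update(self, label, confidence):
        if self.current is None:
            self.current = label               # first frame: adopt directly
        elif label == self.current or confidence < CONF_THRESHOLD:
            self.streak = 0                    # agreement or weak evidence
        elif label == self.candidate:
            self.streak += 1
            if self.streak >= HOLD_FRAMES:     # sustained high confidence
                self.current, self.streak = label, 0
        else:
            self.candidate, self.streak = label, 1
        return self.current
```

This trades a few frames of switching latency for a display that does not flicker between adjacent emotions.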
Available models:
- MobileNetV2 (ImageNet-V2 pretrained), fine-tuned for the task: 79.3% accuracy, 9.1 ms inference
- EfficientNet-Lite0 (considered, then dropped: experiments showed the accuracy/speed trade-off was not favorable)
The only change made to the model architecture was in conv1, allowing it to accept grayscale (single-channel) inputs. Metrics are reported in the benchmarking details below.
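For reference, a sketch of that change with torchvision (the layer path assumes the stock `mobilenet_v2`; the 7-class head swap is a reconstruction, not copied from the repo):

```python
import torch.nn as nn
from torchvision import models

# Stock MobileNetV2 with ImageNet IMAGENET1K_V2 weights.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V2)

# The first conv normally takes 3-channel RGB input; replace it with a
# 1-channel (grayscale) conv, keeping every other hyperparameter identical.
model.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2,
                                 padding=1, bias=False)

# Assumed here: the classifier head is resized for the 7 emotion classes.
model.classifier[1] = nn.Linear(model.last_channel, 7)
```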
The model was fine-tuned on a mix of RAF-DB and FER2013 in two phases:
- Phase 1: all layers unfrozen, high learning rate.
- Phase 2: same as Phase 1, but with a lower learning rate.
- While this may seem unorthodox, it produced the best results across rigorous tests.
- Models were allowed to train for as many epochs as needed, until validation loss plateaued and early stopping triggered.
- PyTorch's ReduceLROnPlateau was used as the LR annealing strategy (a condensed sketch follows the Training list below).
- Models were trained to detect 7 emotions: [angry, disgust, fear, happy, neutral, sad, surprise]
Data:
- Dataset: FER2013 + RAF-DB (~48k training / ~10k test / ~10k val images).
- Augmentations (a PyTorch sketch follows this list):
  - Grayscaling
  - Resize to 64x64
  - ColorJitter (brightness=0.2, contrast=0.3)
  - RandomHorizontalFlip (p=0.3)
  - RandomErasing (p=0.3, value=0.0)
- Tensors were normalized first to the range [0, 1], then to [-1, 1].
- Weighted class sampler to balance under-represented classes like disgust and mitigate bias.
- Batch size: 192
- Data loader num_workers: 4
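A sketch of that input pipeline in PyTorch (the `make_sampler` helper and `train_ds` are placeholders; the transform values mirror the list above):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((64, 64)),
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.ToTensor(),                       # pixels -> [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]), # [0, 1] -> [-1, 1]
    transforms.RandomErasing(p=0.3, value=0.0),  # tensor-space augmentation
])

def make_sampler(targets, num_classes=7):
    """Draw each sample with probability inverse to its class frequency,
    so rare classes like disgust are seen more often during training."""
    targets = torch.as_tensor(targets)
    counts = torch.bincount(targets, minlength=num_classes).float()
    sample_weights = (1.0 / counts)[targets]
    return WeightedRandomSampler(sample_weights,
                                 num_samples=len(targets),
                                 replacement=True)

# Hypothetical usage, assuming a dataset exposing per-sample labels:
# loader = DataLoader(train_ds, batch_size=192, num_workers=4,
#                     sampler=make_sampler(train_ds.targets))
```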
Training:
- Loss function: CrossEntropyLoss
- Optimizer:
  - Phase 1: AdamW(lr=1e-3, weight_decay=1e-2, betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
  - Phase 2: AdamW(lr=1e-5; all other parameters as above)
- Norm-based gradient clipping: grad_clip_norm=1.0
- Allowed epochs: 100
- Scheduler patience: 3
- Scheduler factor: 0.5
- Early stopping patience: 8
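Condensed into code, the configuration above looks roughly like this (model and dataloader construction omitted; `run_phase` and its wiring are a reconstruction of the described procedure, not the repo's actual script):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

criterion = torch.nn.CrossEntropyLoss()

def evaluate(model, loader):
    """Mean validation loss, used for both LR annealing and early stopping."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += criterion(model(x), y).item() * len(y)
            n += len(y)
    return total / n

def run_phase(model, loaders, lr, max_epochs=100, es_patience=8):
    opt = AdamW(model.parameters(), lr=lr, weight_decay=1e-2,
                betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
    sched = ReduceLROnPlateau(opt, factor=0.5, patience=3)
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in loaders["train"]:
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            # Norm-based gradient clipping, grad_clip_norm=1.0
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            opt.step()
        val_loss = evaluate(model, loaders["val"])
        sched.step(val_loss)          # anneal LR when val loss plateaus
        if val_loss < best:
            best, stale = val_loss, 0
        else:
            stale += 1
            if stale >= es_patience:  # early stopping
                break

# Phase 1 (all layers unfrozen, high LR), then Phase 2 (lower LR):
# run_phase(model, loaders, lr=1e-3)
# run_phase(model, loaders, lr=1e-5)
```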
For nerds: more information about training and model performance is available in src/python/README.md.
Cleaned datasets are available at: https://drive.google.com/file/d/1kDnWsOLdptVEOWoFhfFTSmv7sU0vP_bM/view?usp=drive_link
Benchmarks were averaged over 5 runs.
Device note: the Reader module is capped at the webcam framerate (30 fps on the test device).
| Model | Dataset | Accuracy | Weighted F1 | Inference Time | Memory |
|---|---|---|---|---|---|
| MobileNetV2 | FER2013 + RAF-DB | 79.3% | 0.793 | 9.1 ms | 50 MB |
| FER2013 SOTA (ResNet) | FER2013 | ~73% | - | ~50 ms | ~500 MB |
| RAF-DB SOTA | RAF-DB | ~88-90% | - | - | - |
| Baseline (random) | - | 14.3% | - | - | - |
| Emotion | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Angry | 0.744 | 0.778 | 0.761 | 1361 |
| Disgust | 0.908 | 0.943 | 0.923 | 801 |
| Fear | 0.777 | 0.737 | 0.756 | 1474 |
| Happy | 0.888 | 0.852 | 0.870 | 2063 |
| Neutral | 0.706 | 0.742 | 0.723 | 1643 |
| Sad | 0.704 | 0.683 | 0.694 | 1586 |
| Surprise | 0.860 | 0.878 | 0.869 | 1359 |
| Macro Avg | 0.798 | 0.802 | 0.800 | 10287 |
| Weighted Avg | 0.793 | 0.793 | 0.793 | 10287 |
| Module | Metric | Avg Value | Notes |
|---|---|---|---|
| Reader | Frame Rate | 30 fps (cap) | Limited by webcam, not algorithm |
| Face Detector | Detection Time | 6.1 ms | ~163 fps effective |
| Face Detector | Avg Frame Proc. | 6.1 ms | |
| Emotion Det. | Avg Frame Proc. | 9.1 ms | ~106 fps effective |
| Emotion Det. | Total Emotion Loop | 9.6 ms | |
| Display | Avg Frame Proc. | 0.03 ms | ~29k fps (negligible) |
- Gaze tracking and blink rate
- Head pose estimation
- Micro-expression detection
- Attention/engagement scoring
- Graceful degradation under load (unlikely with optimization)
- Multi-camera support
- N-face emotion smoothing and tracking with SORT (planned).
- This project uses ncnn, a high-performance neural network inference framework (BSD 3-Clause License).
- Face detection based on Ultra-Light-Fast-Generic-Face-Detector-1MB by Linzaer (MIT License).


