
Real-Time Multi-Face Emotion Detection

High-performance emotion recognition pipeline achieving 90+ FPS on CPU.
The motivation was a low-power option for retail analysis.

Demo

Use Cases

  • Retail: Customer engagement analysis
  • Education: Student attention monitoring
  • Healthcare: Patient mood tracking
  • Gaming: Adaptive difficulty based on player emotion
  • Content: Audience reaction analysis

Performance

  • Face Detection: 150+ FPS
  • Emotion Classification: 90+ FPS
  • End-to-end Latency: ~15ms
  • Memory Usage: ~50MB
  • Tested on: Intel i5-3320M, 8GB RAM

The end-to-end pipeline is also tested on a Docker-simulated edge device by constraining PC resources as follows:

  • 1 vCPU
  • 512 MB RAM
  • Throttled I/O

IoT Node Benchmark (Docker simulated)

| Stage | Latency (ms) | Throughput (FPS) | CPU Usage (%) | Memory (MB) | Threads (PIDs) |
| --- | --- | --- | --- | --- | --- |
| Frame Reader | ~1 | - | - | ~5 | 2 |
| Face Detector | ~6.1 | ~150 | ~3 | ~7 | 3 |
| Emotion Detector | ~9.1 | ~90 | ~2 | ~6 | 3 |
| Display | ~0.03 | negligible | ~0.5 | ~2 | 2 |
| End-to-End | ~15 | ~90 (avg) | ~6–8 | ~20–50 | ~10–12 |

The pipeline sustains ~90 FPS end-to-end with <10% CPU usage and ~50 MB RAM on a 1 vCPU/512 MB simulated IoT node.

Methodology Notes

  • FPS and latency values were recorded under real-time, multi-threaded scheduling.
  • Docker constraints were applied explicitly (--cpus=1 --memory=512m --blkio-weight=100) to simulate an IoT node.
  • All numbers are averages across 5 runs. Reproducibility can be verified by running the included benchmarking scripts under the same flags.

IoT docker status dump

Why 90 FPS on CPU?

  • No GPU required = lower hardware costs
  • Works on edge devices (Raspberry Pi, etc.)
  • Scalable to 100s of cameras without expensive servers
  • Deploy anywhere (even offline/remote locations)

Quick Start

Install dependencies

sudo apt install libopencv-dev cmake
  • Alternatively, you can build an image from the provided Dockerfile.

Clone and build

git clone https://github.com/fw7th/emotion.git
cd emotion/src
mkdir build && cd build
cmake .. && make -j$(nproc)

Run with webcam

./emotion 0 ## 0 is the webcam ID.

Architecture

graph LR
    A[Camera Input] --> B[Reader Thread<br/>Frame Skip: Every 3rd]
    B --> C[Face Detection Thread<br/>UltraFace NCNN<br/>150+ FPS]
    C --> D[Emotion Thread<br/>ROI Crop + Classify<br/>90+ FPS]
    D --> E[Display Thread<br/>Bounding Boxes + Labels<br/>600+ FPS]
    
    B -.-> F[Queue 1]
    C -.-> G[Queue 2] 
    D -.-> H[Queue 3]
    F --> C
    G --> D
    H --> E
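
For illustration only, the same four-stage topology can be sketched in Python (the real pipeline is C++ with a custom anti-copy queue; the queue sizes, the stub detect_faces/classify_emotion helpers, and the exact frame-skip rule are assumptions, not the project's API):

```python
import queue
import threading

import cv2

# One bounded queue between each pair of adjacent stages (Queue 1-3 in the diagram).
q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(3))

def detect_faces(frame):
    return []            # stub standing in for the UltraFace/NCNN detector

def classify_emotion(frame, box):
    return "neutral"     # stub standing in for the MobileNetV2 classifier

def reader(cam_id=0):
    cap = cv2.VideoCapture(cam_id)
    idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        if idx % 3 == 0:             # "Frame Skip: Every 3rd" from the diagram
            q1.put(frame)
    cap.release()
    q1.put(None)                     # poison pill shuts the downstream stages down

def face_stage():
    while (frame := q1.get()) is not None:
        q2.put((frame, detect_faces(frame)))
    q2.put(None)

def emotion_stage():
    while (item := q2.get()) is not None:
        frame, boxes = item
        q3.put((frame, boxes, [classify_emotion(frame, b) for b in boxes]))
    q3.put(None)

def display_stage():
    while (item := q3.get()) is not None:
        frame, boxes, labels = item   # boxes/labels would be drawn onto the frame here
        cv2.imshow("emotion", frame)
        cv2.waitKey(1)
    cv2.destroyAllWindows()

if __name__ == "__main__":
    threads = [threading.Thread(target=s)
               for s in (reader, face_stage, emotion_stage, display_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```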

Technical Details

Pipeline

  • Build system: CMake
  • Implementation details:
    • Custom anti-copy queue.
    • print_type function used to print custom types.
    • Single-person smoothing classes: a hysteresis stabilizer that switches the displayed label only on sustained high confidence (see the sketch after this list).
    • Memory management, move semantics, and efficient vector usage.
    • Multi-threaded architecture.
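
For clarity, here is a minimal Python sketch of the hysteresis idea (the actual stabilizer is a C++ class; the 0.7 confidence threshold and 5-frame hold used here are hypothetical values, not the project's settings): the displayed label only switches once a new label has been predicted with high confidence for several consecutive frames.

```python
class HysteresisStabilizer:
    """Single-person label smoothing: switch only on sustained high confidence."""

    def __init__(self, conf_threshold=0.7, hold_frames=5):  # hypothetical values
        self.conf_threshold = conf_threshold
        self.hold_frames = hold_frames
        self.current = None       # label currently displayed
        self.candidate = None     # label attempting to take over
        self.streak = 0           # consecutive high-confidence frames for the candidate

    def update(self, label, confidence):
        if self.current is None:
            self.current = label                     # first frame: adopt immediately
        elif label == self.current or confidence < self.conf_threshold:
            self.candidate, self.streak = None, 0    # no change, or not confident enough
        else:
            if label == self.candidate:
                self.streak += 1
            else:
                self.candidate, self.streak = label, 1
            if self.streak >= self.hold_frames:      # sustained high confidence: switch
                self.current, self.candidate, self.streak = label, None, 0
        return self.current
```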

Model Details

Available models:

  • MobileNetV2 (ImageNet-V2 pretrained), fine-tuned for the task - 79.3% accuracy, 9.1 ms inference
  • EfficientNet-Lite0 (previously planned; experiments showed the accuracy/speed trade-off was not favorable)

The only change made to the model architecture was in conv1, to allow grayscale (single-channel) inputs.
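
For illustration, with torchvision's MobileNetV2 that change looks roughly like the sketch below (assuming the torchvision layer layout; re-initialising the new filter from the mean of the pretrained RGB filters is a common option, not something this README states):

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-V2 pretrained MobileNetV2 from torchvision.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V2)

# The stock first conv (conv1) expects 3-channel RGB input; swap in a 1-channel version.
old_conv = model.features[0][0]
model.features[0][0] = nn.Conv2d(
    in_channels=1,
    out_channels=old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    stride=old_conv.stride,
    padding=old_conv.padding,
    bias=False,
)

# Optional (assumption): seed the grayscale filter with the mean of the RGB filters.
with torch.no_grad():
    model.features[0][0].weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
```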

Metrics are reported in the Benchmarking section below.

The model was fine-tuned on a mix of RAF-DB and FER2013 in two phases (a combined training sketch appears after the Training section below):

  • Phase 1: all layers unfrozen, high learning rate
  • Phase 2: same setup, but with a lower learning rate
  • While this may seem unorthodox, it produced the best results across rigorous tests.

  • Models were trained for as many epochs as needed, until validation loss plateaued and early stopping triggered.

  • PyTorch's ReduceLROnPlateau was used as the LR annealing strategy.

  • Models were trained to detect 7 emotions:

    [angry, disgust, fear, happy, neutral, sad, surprise]

Data:

  • Dataset: FER2013 + RAF-DB [~48k training images] [~10k test images] [~10k val images].
  • Augmentations (see the sketch after this list):

    Grayscaling
    Image resizing (64x64)
    ColorJitter: brightness=0.2, contrast=0.3
    RandomHorizontalFlip: p=0.3
    RandomErasing: p=0.3, value=0.0
    Tensors were normalized first to the range [0, 1], then to [-1, 1]

  • Weighted class sampler to balance under-represented classes like disgust and mitigate bias.
  • Batch size: 192
  • Data loader num_workers: 4
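
A rough sketch of the transform and sampling setup described above (torchvision transforms assumed; `train_dataset` and the exact transform ordering are placeholders rather than the project's scripts):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import transforms

train_tfms = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((64, 64)),
    transforms.ColorJitter(brightness=0.2, contrast=0.3),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.ToTensor(),                          # pixels scaled to [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),    # then mapped from [0, 1] to [-1, 1]
    transforms.RandomErasing(p=0.3, value=0.0),     # RandomErasing operates on tensors
])

# Weighted sampling to balance under-represented classes such as "disgust".
# `train_dataset` is a placeholder for the actual FER2013 + RAF-DB dataset object.
labels = torch.tensor([y for _, y in train_dataset])
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))

train_loader = DataLoader(train_dataset, batch_size=192, sampler=sampler, num_workers=4)
```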

Training:

  • Loss Function: CrossEntropyLoss
  • Optimizer:
    • First phase: AdamW(lr=1e-3, weight_decay=1e-2, betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
    • Second phase: AdamW(lr=1e-5, other settings as above)
  • Norm-based gradient clipping: grad_clip_norm=1.0
  • Allowed epochs: 100
  • Scheduler patience: 3
  • Scheduler factor: 0.5
  • Early stopping patience: 8
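
Put together, one phase of the configuration above looks roughly like this (a sketch only; `model`, `train_loader`, and `validate` are placeholders, and the loop structure is an assumption rather than the project's actual training script):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_phase(model, train_loader, validate, lr, max_epochs=100, es_patience=8):
    # Phase 1 is called with lr=1e-3; phase 2 reuses the same settings with lr=1e-5.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2,
                                  betas=(0.9, 0.999), eps=1e-8, amsgrad=True)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad_clip_norm=1.0
            optimizer.step()

        val_loss = validate(model)       # placeholder validation helper
        scheduler.step(val_loss)         # anneal the LR when val loss plateaus
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= es_patience:   # early stopping (patience 8)
                break
```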

For nerds: more information about training and model performance is available in src/python/README.md.

Cleaned datasets available at: https://drive.google.com/file/d/1kDnWsOLdptVEOWoFhfFTSmv7sU0vP_bM/view?usp=drive_link

Benchmarking

Benchmarks were averaged over 5 runs.
Device note: Reader module is capped at webcam framerate (30 fps on test device).

Model Accuracy (Validation Set - 10k images)

| Model | Dataset | Accuracy | Weighted F1 | Inference Time | Memory |
| --- | --- | --- | --- | --- | --- |
| MobileNetV2 | FER2013 + RAF-DB | 79.3% | 0.793 | 9.1 ms | 50 MB |
| FER2013 SOTA (ResNet) | FER2013 | ~73% | - | ~50 ms | ~500 MB |
| RAF-DB SOTA | RAF-DB | ~88-90% | - | - | - |
| Baseline (random) | - | 14.3% | - | - | - |

Per-Class Performance (MobileNetV2)

| Emotion | Precision | Recall | F1-Score | Support |
| --- | --- | --- | --- | --- |
| Angry | 0.744 | 0.778 | 0.761 | 1361 |
| Disgust | 0.908 | 0.943 | 0.923 | 801 |
| Fear | 0.777 | 0.737 | 0.756 | 1474 |
| Happy | 0.888 | 0.852 | 0.870 | 2063 |
| Neutral | 0.706 | 0.742 | 0.723 | 1643 |
| Sad | 0.704 | 0.683 | 0.694 | 1586 |
| Surprise | 0.860 | 0.878 | 0.869 | 1359 |
| Macro Avg | 0.798 | 0.802 | 0.800 | 10287 |
| Weighted Avg | 0.793 | 0.793 | 0.793 | 10287 |

Confusion Matrix

Speed Benchmarks

Model 1 — MobileNetV2

| Module | Metric | Avg Value | Notes |
| --- | --- | --- | --- |
| Reader | Frame Rate | 30 fps (cap) | Limited by webcam, not algorithm |
| Face Detector | Detection Time | 6.1 ms | ~163 fps effective |
| Face Detector | Avg Frame Proc. | 6.1 ms | |
| Emotion Det. | Avg Frame Proc. | 9.1 ms | ~106 fps effective |
| Emotion Det. | Total Emotion Loop | 9.6 ms | |
| Display | Avg Frame Proc. | 0.03 ms | ~29k fps (negligible) |

Planned Upgrades

  • Gaze tracking and blink rate
  • Head pose estimation
  • Micro-expression detection
  • Attention/engagement scoring
  • Graceful degradation under load (unlikely with optimization)
  • Multi-camera support

Limitations

Emotion smoothing currently applies to a single face; N-face smoothing and tracking with SORT are planned.

Acknowledgements

Third-Party Licenses