EmoLabs

Framework and Datasets for CNN Emotion Recognition

Winter 2025/26, Heidelberg

This project started as a Bachelor’s Thesis and has been continued beyond it.

At the core of this project lies the classification of short speech recordings by emotion. To this end, the key technical concepts are related, more than 30 pipelines were implemented, and a well-founded paradigm for competitive architectures was derived. In this context, EmoCorpus is introduced, an altered aggregation of widely used emotion-labeled audio datasets. Its counterpart, EmoBench, is distinctly partitioned to enable reproducible evaluations. Pipelines based on Low-Level Descriptors (LLDs) achieved strong performance on low-variance data after being augmented with either global attention mechanisms or Long Short-Term Memory (LSTM), reaching accuracies of up to 99.5%. Notably, LSTM exhibited the opposite effect for abstract representations (HSFs). These representations, however, attained the highest accuracy on high-variance data (up to 73.5%) once encoder fine-tuning was enabled. The Human Baseline Study (HBS) complemented the SER results by assessing the performance of participants from seven countries on data drawn from EmoCorpus and EmoBench.

Implications in Healthcare

(1) Embodied Conversational Agents (ECA)
(2) Multimodal Condition Mapping (e.g., on ICD-11)

Leveraged Audio Representations

Modular SER-Framework

(1) Sample Standardization

Project-wide starting point

Unify audio input formats, promote consistency.

 

Algorithmic Procedure:

1. Global Indexing

Scan the available datasets; store paths and labels as pairs, along with further metadata.

2. Deterministic manifest-split

Generate a Base62 key from each sample's unique project path. Assign samples to a 9:1 training/evaluation split by applying mod 10 (see the sketch after this list).

3. Waveform normalization

Run a standardized algorithm that unifies channel count, scales volume to a level coherent with the human range of perception, and more.
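
A minimal sketch of the deterministic manifest split (step 2), assuming the Base62 key is derived from an MD5 digest of the project-relative path and that bucket 0 of mod 10 goes to evaluation; the project's actual hash choice may differ.

    import hashlib
    import string

    BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

    def base62_encode(value: int) -> str:
        """Encode a non-negative integer in Base62."""
        if value == 0:
            return BASE62[0]
        digits = []
        while value > 0:
            value, rem = divmod(value, 62)
            digits.append(BASE62[rem])
        return "".join(reversed(digits))

    def manifest_entry(project_path: str) -> dict:
        """Build one manifest row with a deterministic 9:1 train/eval split."""
        digest = int(hashlib.md5(project_path.encode("utf-8")).hexdigest(), 16)
        return {
            "path": project_path,
            "id": base62_encode(digest),                       # Base62 key stored in the manifest
            "split": "eval" if digest % 10 == 0 else "train",  # mod 10 -> 9:1 split
        }

    # Usage (hypothetical path):
    # manifest_entry("EmoCorpus/actor_01/utterance_0001.wav")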

(2) Feature Extraction

Illustrated for Log-Mel spectrograms

Generate structured pipeline inputs.

 

Different branches for:

A. Waveforms

Raw signal amplitudes (preserve all initial information; abstraction is left to inference).

B. wav2vec / HuBERT

Contextualized latent embeddings (learned, taking surrounding segments into account).

C. Spectrograms

Energy distribution across frequency bands (a log-Mel extraction sketch follows below).
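
A minimal sketch of the spectrogram branch (C); sample rate, FFT size, hop length, and Mel-band count are illustrative assumptions, not the project's exact configuration.

    import torch
    import torchaudio

    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,   # after sample standardization
        n_fft=400,            # 25 ms window at 16 kHz
        hop_length=160,       # 10 ms hop
        n_mels=64,
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    def log_mel(waveform: torch.Tensor) -> torch.Tensor:
        """(channels, samples) -> log-Mel tensor of shape (channels, n_mels, frames)."""
        return to_db(melspec(waveform))

    # Usage:
    # waveform, sr = torchaudio.load("sample.wav")
    # features = log_mel(waveform)   # structured CNN input (add a batch dimension as needed)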

(3) Convolutional Feature Modeling

Illustrated for Log-Mel spectrograms

Learn hierarchical patterns from structured inputs.

 

Different branches for:

1. Classic CNN-Route

A 3-layer network adjusted to the preceding stages and their tensor shapes. Isolated experiments cover different network depths and input dimensionalities.

2. ResNet Ablation Study

Addresses the vanishing-gradient problem and introduces both increased network depth and residual skip connections; each is examined in isolation. A minimal sketch of both routes follows below.
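
A minimal sketch of the classic route and the residual building block, assuming single-channel log-Mel inputs; layer widths and kernel sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ClassicCNN(nn.Module):
        """3-layer convolutional route over (batch, 1, n_mels, frames) inputs."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            )

        def forward(self, x):
            # resulting feature maps are handed to stages (4)/(5)
            return self.features(x)

    class ResidualBlock(nn.Module):
        """Skip connection as examined in the ResNet ablation."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            # identity path keeps gradients flowing through deeper stacks
            return self.relu(x + self.conv2(self.relu(self.conv1(x))))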

(4) Temporal Sequence Modeling

Extend hierarchical patterns with long-term dependencies.

 

Addresses another bottleneck of classic CNNs: the temporal scope is limited by the sliding convolution windows.

  • Mechanism: gated memory cells.
  • Key idea: control which information is retained across temporal operations.
  • Trade-offs: increased capacity, risk of overfitting.

Pipelines examined with and without LSTM.
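
A minimal sketch of the optional temporal stage, assuming CNN feature maps of shape (batch, channels, freq, time); the hidden size is an illustrative assumption.

    import torch
    import torch.nn as nn

    class TemporalLSTM(nn.Module):
        """Reads CNN feature maps as a sequence over time."""
        def __init__(self, in_channels: int = 64, hidden: int = 128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=in_channels, hidden_size=hidden, batch_first=True)

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            x = feat.mean(dim=2)      # collapse the frequency axis
            x = x.transpose(1, 2)     # -> (batch, time, channels)
            out, _ = self.lstm(x)     # gated memory cells retain information across frames
            return out                # handed to the aggregation stage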

(5) Feature Aggregation

Reduce features to a fixed-size representation.

 

Different branches for:

1. Flattening

Preserves the full range of information, but passes elements of negligible incremental validity to the final fully connected layer and is therefore likely penalized by lower accuracy.

2. Global Average Pooling

Collapse each channel to its mean over time.

3. Global Attention Pooling

Average each channel over time, with each time step weighted by an importance score learned through the backpropagation feedback mechanism (sketched below).
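
A minimal sketch of global attention pooling (branch 3), assuming frame-wise features of shape (batch, time, dim); the feature dimension is an illustrative assumption.

    import torch
    import torch.nn as nn

    class GlobalAttentionPooling(nn.Module):
        def __init__(self, dim: int = 128):
            super().__init__()
            self.score = nn.Linear(dim, 1)   # importance scores learned via backpropagation

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            weights = torch.softmax(self.score(x), dim=1)   # (batch, time, 1), sums to 1 over time
            return (weights * x).sum(dim=1)                 # fixed-size (batch, dim) representation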

SER-Datasets

EmoCorpus dataset for training
EmoBench dataset for evaluation

Human Baseline Study

Overall Human Accuracy (UAR, unweighted average recall): 0.56

Procedure

1. Participant recruitment

A dozen participants, proficient in English, mixed by gender and background.

2. Randomized sample distribution

60 randomized samples (from EmoCorpus and EmoBench), balanced by class and source.

3. Meta-analysis of responses

Convert responses into a unified format; compute recall metrics and trace them back by source (see the UAR sketch below).
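
A minimal sketch of the recall metric underlying the reported human baseline: UAR (unweighted average recall) averages per-class recall, so rare classes count as much as frequent ones.

    from collections import defaultdict

    def unweighted_average_recall(true_labels, predicted_labels):
        """Mean of per-class recalls, independent of class frequency."""
        correct, total = defaultdict(int), defaultdict(int)
        for t, p in zip(true_labels, predicted_labels):
            total[t] += 1
            correct[t] += int(t == p)
        return sum(correct[c] / total[c] for c in total) / len(total)

    # Usage (hypothetical labels):
    # unweighted_average_recall(["angry", "sad", "happy"], ["angry", "happy", "happy"])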

 

Statistical Significance

  • Constrained by sample size and age distribution, but in line with comparable reports.
  • Interpreted as an approximate anchor point for sample difficulty.

Evaluation of SER-Pipelines

(1) Impact of Input Abstraction
(2) Mitigating Class Stability Degradation on LLDs
(3) Mitigating Low Recall on LLDs
(4) Attention and Memory Mechanics
(5) ResNet Ablation Study
(6) Fine-tuning HSF Encoders

The attached references contain far more evaluations and, above all, the corresponding interpretations and key learnings.

Associated resources

Highly esteemed supervisor