EmoLabs

Framework and Datasets for CNN Emotion Recognition

Winter 2025/26, Heidelberg

This project started as a Bachelor’s Thesis and has been continued beyond it.

At the core of this project lies the classification of short speech recordings by emotion. To this end, the key technical concepts are related, more than 30 pipelines were implemented, and a well-founded paradigm for competitive architectures was derived. In this context, EmoCorpus is introduced, an altered aggregation of widely used emotion-labeled audio datasets. Its counterpart, EmoBench, is distinctly partitioned to enable reproducible evaluations. Pipelines based on Low-Level Descriptors (LLDs) achieved strong performance on low-variance data after being augmented with either global attention mechanisms or Long Short-Term Memory (LSTM), reaching accuracies of up to 99.5%. Notably, LSTM exhibited the opposite effect for abstract representations (HSFs). These representations, however, attained the highest accuracy on high-variance data (up to 73.5%) once encoder fine-tuning was enabled. The Human Baseline Study (HBS) complemented the SER results by assessing the performance of participants from seven countries on data drawn from EmoCorpus and EmoBench.

Implications in Healthcare

(1) Embodied Conversational Agents (ECA)
(2) Multimodal Condition Mapping (e.g., on ICD-11)

Leveraged Audio Representations

Modular SER-Framework

(1) Sample Standardization

Project-wide starting point

Unify audio input formats, promote consistency.

 

Algorithmic Procedure:

1. Global Indexing

Scan the available datasets; store paths and labels as pairs, along with further metadata.

2. Deterministic manifest-split

Generate a Base62 key from each sample's unique project path. Assign samples to a 9:1 training/evaluation split by applying mod 10 (see the sketch after this list).

3. Waveform normalization

Run a standardized algorithm that unifies channel count, scales volume to a level coherent with the human range of perception, and more.
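
A minimal sketch of the deterministic manifest split (step 2), assuming the Base62 key is derived from an MD5 digest of the project-relative path and that bucket 0 of mod 10 goes to evaluation; the project's actual hash choice may differ.

    import hashlib
    import string

    BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

    def base62_encode(value: int) -> str:
        """Encode a non-negative integer in Base62."""
        if value == 0:
            return BASE62[0]
        digits = []
        while value > 0:
            value, rem = divmod(value, 62)
            digits.append(BASE62[rem])
        return "".join(reversed(digits))

    def manifest_entry(project_path: str) -> dict:
        """Build one manifest row with a deterministic 9:1 train/eval split."""
        digest = int(hashlib.md5(project_path.encode("utf-8")).hexdigest(), 16)
        return {
            "path": project_path,
            "id": base62_encode(digest),                       # Base62 key stored in the manifest
            "split": "eval" if digest % 10 == 0 else "train",  # mod 10 -> 9:1 split
        }

    # Usage (hypothetical path):
    # manifest_entry("EmoCorpus/actor_01/utterance_0001.wav")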

(2) Feature Extraction

Illustrated for Log-Mel spectrograms

Generate structured pipeline inputs.

 

Different branches for:

A. Waveforms

Raw signal amplitudes (preserve all initial information; abstraction is left to inference).

B. wav2vec / HuBERT

Contextualized latent embeddings (learned, taking surrounding segments into account).

C. Spectrograms

Energy distribution across frequency bands (a log-Mel extraction sketch follows below).
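
A minimal sketch of the spectrogram branch (C); sample rate, FFT size, hop length, and Mel-band count are illustrative assumptions, not the project's exact configuration.

    import torch
    import torchaudio

    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,   # after sample standardization
        n_fft=400,            # 25 ms window at 16 kHz
        hop_length=160,       # 10 ms hop
        n_mels=64,
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    def log_mel(waveform: torch.Tensor) -> torch.Tensor:
        """(channels, samples) -> log-Mel tensor of shape (channels, n_mels, frames)."""
        return to_db(melspec(waveform))

    # Usage:
    # waveform, sr = torchaudio.load("sample.wav")
    # features = log_mel(waveform)   # structured CNN input (add a batch dimension as needed)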

(3) Convolutional Feature Modeling

Illustrated for Log-Mel spectrograms

Learn hierarchical patterns from structured inputs.

 

Different branches for:

1. Classic CNN-Route

A 3-layer network adjusted to the preceding stages and their tensor shapes. Isolated experiments cover different network depths and input dimensionalities.

2. ResNet Ablation Study

Addresses the vanishing-gradient problem and introduces both increased network depth and residual skip connections; each is examined in isolation. A minimal sketch of both routes follows below.
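
A minimal sketch of the classic route and the residual building block, assuming single-channel log-Mel inputs; layer widths and kernel sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ClassicCNN(nn.Module):
        """3-layer convolutional route over (batch, 1, n_mels, frames) inputs."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            )

        def forward(self, x):
            # resulting feature maps are handed to stages (4)/(5)
            return self.features(x)

    class ResidualBlock(nn.Module):
        """Skip connection as examined in the ResNet ablation."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            # identity path keeps gradients flowing through deeper stacks
            return self.relu(x + self.conv2(self.relu(self.conv1(x))))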

(4) Temporal Sequence Modeling

Extend hierarchical patterns with long-term dependencies.

 

Addresses another bottleneck of classic CNNs: the temporal scope is limited by the sliding convolution windows.

  • Mechanism: gated memory cells.
  • Key idea: control which information is retained across temporal operations.
  • Trade-offs: increased capacity, risk of overfitting.

Pipelines examined with and without LSTM.
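
A minimal sketch of the optional temporal stage, assuming CNN feature maps of shape (batch, channels, freq, time); the hidden size is an illustrative assumption.

    import torch
    import torch.nn as nn

    class TemporalLSTM(nn.Module):
        """Reads CNN feature maps as a sequence over time."""
        def __init__(self, in_channels: int = 64, hidden: int = 128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=in_channels, hidden_size=hidden, batch_first=True)

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            x = feat.mean(dim=2)      # collapse the frequency axis
            x = x.transpose(1, 2)     # -> (batch, time, channels)
            out, _ = self.lstm(x)     # gated memory cells retain information across frames
            return out                # handed to the aggregation stage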

(5) Feature Aggregation

Reduce features to a fixed-size representation.

 

Different branches for:

1. Flattening

Preserves the full range of information, but passes elements of negligible incremental validity to the final fully connected layer and is therefore likely penalized by lower accuracy.

2. Global Average Pooling

Collapse each channel to its mean over time.

3. Global Attention Pooling

Average each channel over time, with each time step weighted by an importance score learned through the backpropagation feedback mechanism (sketched below).
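
A minimal sketch of global attention pooling (branch 3), assuming frame-wise features of shape (batch, time, dim); the feature dimension is an illustrative assumption.

    import torch
    import torch.nn as nn

    class GlobalAttentionPooling(nn.Module):
        def __init__(self, dim: int = 128):
            super().__init__()
            self.score = nn.Linear(dim, 1)   # importance scores learned via backpropagation

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            weights = torch.softmax(self.score(x), dim=1)   # (batch, time, 1), sums to 1 over time
            return (weights * x).sum(dim=1)                 # fixed-size (batch, dim) representation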

SER-Datasets

EmoCorpus dataset for training
EmoBench dataset for evaluation

Human Baseline Study

Overall Human Accuracy (UAR, unweighted average recall): 0.56

Procedure

1. Participant recruitment

A dozen participants, proficient in English, mixed by gender and background.

2. Randomized sample distribution

60 randomized samples (from EmoCorpus and EmoBench), balanced by class and source.

3. Meta-analysis of responses

Convert responses into a unified format; compute recall metrics and trace them back by source (see the UAR sketch below).
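
A minimal sketch of the recall metric underlying the reported human baseline: UAR (unweighted average recall) averages per-class recall, so rare classes count as much as frequent ones.

    from collections import defaultdict

    def unweighted_average_recall(true_labels, predicted_labels):
        """Mean of per-class recalls, independent of class frequency."""
        correct, total = defaultdict(int), defaultdict(int)
        for t, p in zip(true_labels, predicted_labels):
            total[t] += 1
            correct[t] += int(t == p)
        return sum(correct[c] / total[c] for c in total) / len(total)

    # Usage (hypothetical labels):
    # unweighted_average_recall(["angry", "sad", "happy"], ["angry", "happy", "happy"])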

 

Statistical Significance

  • Constrained by sample size and age distribution, but in line with comparable reports.
  • Interpreted as an approximate anchor point for sample difficulty.

Evaluation of SER-Pipelines

(1) Impact of Input Abstraction
(2) Mitigating Class Stability Degradation on LLDs
(3) Mitigating Low Recall on LLDs
(4) Attention and Memory Mechanics
(5) ResNet Ablation Study
(6) Fine-tuning HSF Encoders

The attached references contain far more evaluations and, above all, the corresponding interpretations and key learnings.

Associated resources

Highly esteemed supervisor