Audio Signal Restoration using Convolutional Autoencoders
1. Introduction and Multimedia System Context
The reference script implements a Multimedia Artificial Intelligence system focused on Digital Signal Processing (DSP) of audio. The core of this system is a Convolutional Autoencoder (CAE), a Deep Learning architecture designed to learn efficient representations of structured data.
The fundamental objective is to solve a complex regression problem: mapping a noisy audio signal ($x_{noisy}$) to its clean counterpart ($y_{clean}$).
2. Architecture: Convolutional Autoencoder (CAE)
The chosen architecture is a Deep Symmetric Convolutional Autoencoder. Unlike Dense Neural Networks (MLP), Convolutional Neural Networks (CNN) are better suited to audio signals, as they exploit the temporal correlation and the topology of the input data.
A. Encoder
The encoder extracts hierarchical features and performs dimensionality reduction:
- 1D Convolution (Conv1D): Performs a discrete cross-correlation operation between the input x[n] (audio chunks) and a kernel or filter k. The filter is applied along the entire temporal signal, conferring translation equivariance (the same pattern is detected regardless of its position in time) and locality (capturing short-term dependencies).
- Non-Linear Activation (ReLU): Applies the function f(x)=max(0,x). Non-linearity is essential, allowing the network to learn the complex, non-linear mapping required for restoration. ReLU is preferred over historical activations such as the sigmoid because it is computationally cheaper and mitigates the vanishing-gradient problem.
- Pooling Layer (MaxPooling1D): Reduces spatial dimensionality (subsampling) by selecting the maximum value in a window. This is key to increasing the receptive field of subsequent layers, allowing deep layers to capture longer-duration temporal structures (global context).
B. Latent Space
This is the "bottleneck" of the architecture. The autoencoder compresses the signal forcing the model to learn the Manifold Hypothesis, i.e., the low-dimensional manifold where essential information of clean audio resides, discarding noise.
C. Decoder
The decoder inverts the function, projecting the latent vector back to the signal space:
- Upsampling (UpSampling1D): Increases temporal dimension, reversing the pooling action. In more sophisticated architectures, Transposed Convolution could be used as a more controlled form of upsampling.
- Convolution (Post-Upsampling): Applies standard convolution to "smooth" the signal and recover fine details.
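The following is a minimal Keras sketch of such a symmetric CAE; the layer counts, filter widths, and channel sizes are illustrative assumptions, not the reference notebook's exact configuration.

from tensorflow import keras
from tensorflow.keras import layers

CHUNK_SIZE = 2048  # window length, see Section 3.A

def build_cae():
    inp = keras.Input(shape=(CHUNK_SIZE, 1))
    # Encoder: Conv1D + ReLU feature extraction, MaxPooling1D subsampling
    x = layers.Conv1D(16, 9, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(32, 9, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)              # latent space: (CHUNK_SIZE/4, 32)
    # Decoder: UpSampling1D restores the temporal dimension, Conv1D smooths
    x = layers.UpSampling1D(2)(x)
    x = layers.Conv1D(32, 9, padding="same", activation="relu")(x)
    x = layers.UpSampling1D(2)(x)
    x = layers.Conv1D(16, 9, padding="same", activation="relu")(x)
    out = layers.Conv1D(1, 9, padding="same", activation="tanh")(x)
    return keras.Model(inp, out)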
3. Training Dynamics and Optimization
A. Preprocessing: Windowing
The audio signal is segmented into fixed-length windows or chunks (CHUNK_SIZE = 2048). This allows assuming short-term stationarity and facilitates batch processing on GPU.
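As a sketch, assuming a mono float signal x, non-overlapping windowing reduces to truncating the signal to a whole number of chunks and reshaping it to the (batch, time, channels) layout expected by Conv1D:

import numpy as np

CHUNK_SIZE = 2048

def make_chunks(x, chunk_size=CHUNK_SIZE):
    # Drop trailing samples that do not fill a complete window
    n_chunks = len(x) // chunk_size
    x = x[:n_chunks * chunk_size]
    # Shape (n_chunks, chunk_size, 1) for batched Conv1D processing on GPU
    return x.reshape(n_chunks, chunk_size, 1).astype(np.float32)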
B. Loss Function
Training minimizes the Mean Squared Error (MSE):
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| y^{(i)} - \hat{y}^{(i)} \right\|^2$$
MSE is the standard metric for regression and quadratically penalizes large deviations, forcing the model to prioritize correction of the largest amplitude distortions introduced by quantization.
C. Optimization Algorithm: Adam
Adam (Adaptive Moment Estimation) is used, which is the default optimizer in contemporary Deep Learning due to its robustness and convergence speed. Adam accelerates training by combining two key techniques:
- Momentum: Uses an exponentially weighted moving average of the gradient (first moment) to accelerate convergence.
- RMSProp: Uses a moving average of the squared gradient (second moment) to provide a per-parameter adaptive learning rate.
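In Keras this reduces to compiling the model with MSE as the loss and Adam as the optimizer. The sketch below assumes the build_cae model above and chunked noisy/clean arrays named x_noisy_chunks and y_clean_chunks; the learning rate, epoch count, and batch size are illustrative defaults, not values taken from the reference notebook.

from tensorflow import keras

model = build_cae()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
history = model.fit(x_noisy_chunks, y_clean_chunks,
                    epochs=50, batch_size=64, validation_split=0.1)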
D. Backpropagation
Backpropagation is the algorithmic method to efficiently calculate the loss function gradient, which is crucial for optimization.
- Forward Pass: Noisy audio traverses the network, calculating activations in each layer.
- Gradient Calculation: Systematically applies the chain rule of multivariable calculus, propagating error from the output layer backward, calculating the impact of each weight on the error.
- Weight Update: The Adam optimizer uses the resulting gradient ($\nabla_{\theta} L(\theta)$) to adjust the weights ($\theta$) in the steepest-descent direction to minimize the error.
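model.fit performs these three steps automatically; as an illustration, the sketch below (assuming TensorFlow/Keras and one batch of noisy/clean chunks) makes the forward pass, gradient calculation, and Adam update explicit with tf.GradientTape:

import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)   # forward pass
        loss = loss_fn(y_batch, y_pred)          # MSE on the batch
    # Backpropagation: chain rule applied from the output layer backward
    grads = tape.gradient(loss, model.trainable_variables)
    # Adam update of the weights using the resulting gradient
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss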
4. Performance Evaluation Metrics
Objective evaluation of restoration quality is performed in the temporal domain:
- MSE (Mean Squared Error): The loss function used during training. Sensitive to outliers.
- MAE (Mean Absolute Error): Calculates the average of absolute error magnitudes. Less sensitive to outliers than MSE.
- RMSE (Root Mean Squared Error): The square root of MSE. Its advantage is being expressed in the same units as signal amplitude (physical units).
- SNR (Signal-to-Noise Ratio): The logarithmic ratio (in dB) between the desired signal power and the noise power (or reconstruction error power). A higher SNR indicates a cleaner reconstruction.
- Pearson Correlation Coefficient (ρ): Measures the linear dependence between the original and reconstructed signals. A value close to 1 indicates a strong positive correlation, i.e., high waveform and phase similarity.
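A NumPy sketch of the five metrics, assuming y (clean reference) and y_hat (reconstruction) are aligned 1-D float arrays:

import numpy as np

def evaluate(y, y_hat, eps=1e-12):
    err = y - y_hat
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    # SNR in dB: signal power relative to residual (error) power
    snr_db = 10.0 * np.log10(np.sum(y ** 2) / (np.sum(err ** 2) + eps))
    # Pearson correlation between reference and reconstruction
    rho = np.corrcoef(y, y_hat)[0, 1]
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "SNR_dB": snr_db, "rho": rho}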
5. Dataset Construction and Augmentation
- Chunking Strategy: Use overlapping windows (e.g., hop = CHUNK_SIZE/2) to increase sample count and smooth boundaries in reconstruction.
- Normalization: Normalize per-chunk by RMS or globally to avoid amplitude bias; ensure consistent scaling between noisy and clean pairs (see the sketch after this list).
- Augmentations: Add small gain variations, random EQ tilts, or mild time-stretch (±2%) to improve generalization, keeping labels aligned.
- Train/Validation/Test Split: Stratify by source content (speech, music, percussive) to avoid leakage and overfitting to timbre.
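A minimal sketch of the overlapping-window pairing, assuming aligned mono arrays x_noisy and y_clean; the per-chunk RMS gain is applied identically to both members of each pair so the noisy/clean scaling stays consistent:

import numpy as np

def make_pairs(x_noisy, y_clean, chunk_size=2048, hop=1024, eps=1e-8):
    X, Y = [], []
    for start in range(0, len(x_noisy) - chunk_size + 1, hop):
        xn = x_noisy[start:start + chunk_size]
        yc = y_clean[start:start + chunk_size]
        gain = 1.0 / (np.sqrt(np.mean(xn ** 2)) + eps)  # per-chunk RMS normalization
        X.append(xn * gain)
        Y.append(yc * gain)  # same gain keeps the pair aligned in amplitude
    X = np.asarray(X, dtype=np.float32)[..., None]  # (n, chunk_size, 1)
    Y = np.asarray(Y, dtype=np.float32)[..., None]
    return X, Y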
6. Deliverables
- We propose an additional milestone focused on quantization-noise removal using Convolutional Autoencoders. This milestone is offered as an optional activity to improve grades.
- "Basic" Notebook: You will receive a simple notebook demonstrating quantization noise (bitcrushing) reduction with a basic model. https://colab.research.google.com/drive/1pPKqn3EPJSJB07Jj-Ht8Qqd4U6Ee6oi2?usp=sharing
- Expected Submission: You must adapt the notebook to reduce the noise generated after Milestone 6, when the Time of Hearing (ToH) model is already integrated. Goal: train the denoiser using ToH-processed audio as input and clean audio as the target. Deliverables: the adapted notebook, a short report comparing before/after results (MSE/SNR and spectrograms), and example audios (noisy, ToH-processed, denoised).
Resources
- Notebook Environment: The reference notebook uses Colab magics (!pip, !gdown) and IPython playback (Audio, display). For local runs, install packages via your shell and replace gdown with local files or direct URLs. Keep environment setup outside training code for reproducibility.
- Librosa Loading: librosa.load(path, sr=SR) resamples to SR and returns a mono signal by default. If stereo is required, load with mono=False and handle channel dimensions consistently.
- SoundFile Export: sf.write(filename, audio, SR) expects floating-point arrays in [-1, 1]; normalize before export to avoid clipping. Preserve the original bit depth when comparing PSNR.
- Model Saving: Saving with the .keras format preserves the full Keras model. Record SR, CHUNK_SIZE, architecture name, and loss configuration alongside the file for reproducibility and compatibility.
- Visualization Best Practices: Plot training loss per epoch to monitor convergence; complement waveform plots with spectrograms (STFT or log-mel) to reveal band-limited artifacts.
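A sketch tying these resources together (librosa for loading, soundfile for export, a log-magnitude STFT plot for inspection); SR and the file names are placeholders, not values from the reference notebook:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf

SR = 22050  # assumed sampling rate

x, _ = librosa.load("noisy.wav", sr=SR)              # mono float32 in [-1, 1]
sf.write("denoised.wav", np.clip(x, -1.0, 1.0), SR)  # clip/normalize before export

# Log-magnitude spectrogram reveals band-limited artifacts that waveform plots hide
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(x)), ref=np.max)
librosa.display.specshow(S_db, sr=SR, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-magnitude STFT")
plt.show()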
Overlap-Add Inference (Boundary Artifact Reduction)
- Windowing: Use a Hann window w[n] = 0.5(1 − cos(2πn/(N−1))) over the chunk length N.
- Procedure:
  - Frame the noisy signal into overlapping chunks (e.g., 50% overlap; hop = N/2).
  - Predict each framed chunk with the denoiser.
  - Multiply each prediction by w[n] and accumulate it at the corresponding positions.
  - Accumulate the squared window w[n]^2 for normalization; divide the summed signal by the summed window squares to reconstruct.
import numpy as np

def overlap_add_denoise(x_noisy, model, chunk_size=2048, hop=None):
    """Denoise a 1-D signal with windowed overlap-add reconstruction."""
    if hop is None:
        hop = chunk_size // 2  # 50% overlap by default
    w = np.hanning(chunk_size)  # Hann analysis/synthesis window
    n_frames = (len(x_noisy) - chunk_size) // hop + 1
    out = np.zeros(len(x_noisy), dtype=np.float32)
    norm = np.zeros(len(x_noisy), dtype=np.float32)
    for i in range(n_frames):
        start = i * hop
        frame = x_noisy[start:start + chunk_size]
        # Model expects input of shape (batch, chunk_size, 1)
        y = model.predict(frame.reshape(1, chunk_size, 1), verbose=0).flatten()
        out[start:start + chunk_size] += y * w      # weighted accumulation
        norm[start:start + chunk_size] += w * w     # window-power accumulation
    norm[norm == 0] = 1.0  # avoid division by zero at uncovered tail samples
    return out / norm
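Usage sketch, assuming a trained model saved as denoiser.keras and a mono input loaded at the training sample rate (file names and SR are placeholders):

from tensorflow import keras
import librosa
import soundfile as sf

SR = 22050  # must match the rate used during training
model = keras.models.load_model("denoiser.keras")
x_noisy, _ = librosa.load("noisy.wav", sr=SR)
x_denoised = overlap_add_denoise(x_noisy, model, chunk_size=2048)
sf.write("denoised.wav", x_denoised, SR)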
Summary of Provided Model Variants
- Autoencoder 1 (Pooling + Upsampling, tanh): Compressive; larger receptive field via pooling; may lose micro-detail; good at smoothing quantization artifacts.
- Autoencoder 2 (Deep, No Pooling, tanh): Preserves temporal resolution; captures fine detail; higher compute cost and risk of overfitting without regularization.
- Autoencoder 3 (Simple, No Pooling, linear): Baseline; faster inference; consider clipping the output to [-1, 1] and potential post-filtering.