Audio Signal Restoration using Convolutional Autoencoders

Table of Contents

  1. Introduction and Multimedia System Context
  2. Architecture: Convolutional Autoencoder (CAE)
  3. Training Dynamics and Optimization
  4. Performance Evaluation Metrics
  5. Dataset Construction and Augmentation
  6. Deliverables
  7. Resources

1. Introduction and Multimedia System Context

The reference script implements a Multimedia Artificial Intelligence system focused on Digital Signal Processing (DSP) of audio. The core of this system is a Convolutional Autoencoder (CAE), a Deep Learning architecture designed to learn efficient representations of structured data.

The fundamental objective is to solve a complex regression problem: mapping a noisy audio signal ($x_{noisy}$) to its clean counterpart ($y_{clean}$).

2. Architecture: Convolutional Autoencoder (CAE)

The chosen architecture is a Deep Symmetric Convolutional Autoencoder. Unlike Dense Neural Networks (MLP), Convolutional Neural Networks (CNN) are well suited to audio signals because they exploit the temporal correlation and topology of the input data.

A. Encoder

The encoder extracts hierarchical features and performs dimensionality reduction:

  1. 1D Convolution (Conv1D): Performs a discrete cross-correlation operation between the input x[n] (audio chunks) and a kernel or filter k. The filter is applied along the entire temporal signal, conferring translation invariance (detecting patterns regardless of their position in time) and locality (capturing short-term dependencies).
  2. Non-Linear Activation (ReLU): Applies the function f(x)=max(0,x). Non-linearity is essential, allowing the network to learn the complex and non-linear mapping required for restoration. The ReLU function is preferred over historical functions like Sigmoid for being computationally less expensive.
  3. Pooling Layer (MaxPooling1D): Reduces spatial dimensionality (subsampling) by selecting the maximum value in a window. This is key to increasing the receptive field of subsequent layers, allowing deep layers to capture longer-duration temporal structures (global context).
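The three encoder operations above can be illustrated with a minimal NumPy sketch (the kernel and pooling size here are illustrative toy choices, not learned parameters):

```python
import numpy as np

def conv1d_valid(x, k):
    """Discrete cross-correlation of signal x with kernel k ('valid' padding)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def relu(x):
    """Element-wise non-linearity f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def maxpool1d(x, pool=2):
    """Subsample by keeping the maximum of each non-overlapping window."""
    n = len(x) // pool
    return x[:n * pool].reshape(n, pool).max(axis=1)

# One encoder stage: convolution -> ReLU -> pooling
x = np.array([0.0, 1.0, -2.0, 3.0, -1.0, 2.0, 0.5, -0.5])
k = np.array([1.0, -1.0])   # toy difference kernel (illustrative, not learned)
h = maxpool1d(relu(conv1d_valid(x, k)), pool=2)   # h = [3.0, 4.0, 1.5]
```

Note how the output is shorter than the input: each stage halves the temporal resolution, which is what lets deeper layers see a wider receptive field.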

B. Latent Space

This is the "bottleneck" of the architecture. By compressing the signal through it, the autoencoder relies on the manifold hypothesis: the essential information of clean audio lies on a low-dimensional manifold, so the constrained representation retains that structure while discarding noise.

C. Decoder

The decoder inverts the function, projecting the latent vector back to the signal space:

  1. Upsampling (UpSampling1D): Increases temporal dimension, reversing the pooling action. In more sophisticated architectures, Transposed Convolution could be used as a more controlled form of upsampling.
  2. Convolution (Post-Upsampling): Applies standard convolution to "smooth" the signal and recover fine details.
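A toy NumPy sketch of the two decoder steps (the fixed averaging kernel below is an illustrative stand-in for a learned post-upsampling filter):

```python
import numpy as np

def upsample1d(x, factor=2):
    """Repeat each sample `factor` times, reversing MaxPooling's subsampling."""
    return np.repeat(x, factor)

# Post-upsampling convolution: a short averaging kernel smooths the
# staircase artifacts that sample repetition introduces.
kernel = np.array([0.25, 0.5, 0.25])   # illustrative fixed filter, not learned
z = np.array([1.0, 3.0, 2.0])          # toy latent activations
y = np.convolve(upsample1d(z), kernel, mode="same")
```

In the real decoder the smoothing filter is learned jointly with the rest of the network; a Transposed Convolution would fuse both steps into one trainable layer.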

3. Training Dynamics and Optimization

A. Preprocessing: Windowing

The audio signal is segmented into fixed-length windows or chunks (CHUNK_SIZE = 2048). This allows assuming short-term stationarity and facilitates batch processing on GPU.
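A minimal sketch of this windowing step, assuming non-overlapping chunks with zero-padding of the final partial window (the helper name `make_chunks` is hypothetical):

```python
import numpy as np

CHUNK_SIZE = 2048

def make_chunks(signal, chunk_size=CHUNK_SIZE):
    """Segment a 1-D signal into fixed-length, non-overlapping windows,
    zero-padding the final partial window."""
    n_chunks = int(np.ceil(len(signal) / chunk_size))
    padded = np.zeros(n_chunks * chunk_size, dtype=np.float32)
    padded[:len(signal)] = signal
    # Shape (n_chunks, chunk_size, 1): a batch of single-channel windows,
    # ready for Conv1D-style batch processing on GPU.
    return padded.reshape(n_chunks, chunk_size, 1)

x = np.random.randn(5000).astype(np.float32)
batch = make_chunks(x)   # shape (3, 2048, 1)
```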

B. Loss Function

Training minimizes the Mean Squared Error (MSE):

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \|y^{(i)} - \hat{y}^{(i)}\|^2$$

MSE is the standard metric for regression and quadratically penalizes large deviations, forcing the model to prioritize correction of the largest amplitude distortions introduced by quantization.

C. Optimization Algorithm: Adam

Adam (Adaptive Moment Estimation) is used, which is the default optimizer in contemporary Deep Learning due to its robustness and convergence speed. Adam accelerates training by combining two key techniques:

  1. Momentum: Uses a moving average of the gradient to accelerate convergence.
  2. RMSProp: Uses a moving average of the squared gradient to provide an adaptive learning rate for each parameter.
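How these two moving averages combine into a single parameter update can be sketched directly from the Adam formulas (using the standard default hyperparameters):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus RMSProp-style scaling (v)."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step on f(theta) = theta**2, whose gradient is 2*theta
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, 2 * theta, m, v, t=1, lr=0.01)
```

The bias correction matters because m and v start at zero and would otherwise underestimate the true moments during the first iterations.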

D. Backpropagation

Backpropagation is the algorithmic method to efficiently calculate the loss function gradient, which is crucial for optimization.

  1. Forward Pass: Noisy audio traverses the network, calculating activations in each layer.
  2. Gradient Calculation: Systematically applies the chain rule of multivariable calculus, propagating error from the output layer backward, calculating the impact of each weight on the error.
  3. Weight Update: The Adam optimizer uses the resulting gradient ($\nabla_\theta L(\theta)$) to adjust the weights ($\theta$) in the steepest descent direction to minimize error.
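The chain rule of step 2 can be verified on the smallest possible network, a single weight, by comparing the analytic gradient against a finite-difference estimate (a standard gradient check):

```python
# Forward pass: a single weight w, input x, target y; prediction y_hat = w * x
x, y, w = 2.0, 3.0, 0.5
y_hat = w * x
loss = (y - y_hat) ** 2

# Backward pass via the chain rule:
# dL/dw = dL/dy_hat * dy_hat/dw = -2 * (y - y_hat) * x
grad_analytic = -2 * (y - y_hat) * x

# Finite-difference check that the chain-rule gradient is correct
h = 1e-6
grad_numeric = (((y - (w + h) * x) ** 2) - ((y - (w - h) * x) ** 2)) / (2 * h)
```

Backpropagation applies exactly this decomposition layer by layer, reusing intermediate results so the full gradient costs roughly one extra pass through the network.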

4. Performance Evaluation Metrics

Objective evaluation of restoration quality is performed in the temporal domain:

  1. MSE (Mean Squared Error): The loss function used during training. Sensitive to outliers.
  2. MAE (Mean Absolute Error): Calculates the average of absolute error magnitudes. Less sensitive to outliers than MSE.
  3. RMSE (Root Mean Squared Error): The square root of MSE. Its advantage is being expressed in the same units as signal amplitude (physical units).
  4. SNR (Signal-to-Noise Ratio): The logarithmic relationship (in dB) between desired signal power and noise power (or reconstruction error). Higher SNR indicates cleaner reconstruction.
  5. Pearson Correlation Coefficient (ρ): Measures linear dependence between original and reconstructed signals. A value of exactly 1 indicates perfect positive correlation; values close to 1 indicate strong phase and waveform similarity.
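All five metrics can be computed from the two signals in a few lines of NumPy (the helper name `evaluate` is an illustrative choice):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Temporal-domain restoration metrics for a reconstructed signal."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)                 # same units as signal amplitude
    # SNR in dB: desired-signal power over reconstruction-error power
    snr_db = 10 * np.log10(np.sum(y_true ** 2) / np.sum(err ** 2))
    rho = np.corrcoef(y_true, y_pred)[0, 1]   # Pearson correlation
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "SNR_dB": snr_db, "rho": rho}

scores = evaluate(np.array([1.0, 2.0, 3.0, 4.0]),
                  np.array([1.0, 2.0, 3.0, 5.0]))
```

Note the SNR expression divides by the error power, so a perfect reconstruction would need a guard against division by zero in production code.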

5. Dataset Construction and Augmentation

6. Deliverables


7. Resources

Overlap-Add Inference (Boundary Artifact Reduction)

import numpy as np

def overlap_add_denoise(x_noisy, model, chunk_size=2048, hop=None):
    """Denoise a long signal by running the model over overlapping windows
    and recombining them with a Hann window (overlap-add)."""
    if hop is None:
        hop = chunk_size // 2                 # 50% overlap by default
    if len(x_noisy) < chunk_size:
        raise ValueError("signal shorter than chunk_size")
    w = np.hanning(chunk_size)
    n_frames = (len(x_noisy) - chunk_size) // hop + 1
    out = np.zeros(len(x_noisy), dtype=np.float32)
    norm = np.zeros(len(x_noisy), dtype=np.float32)
    for i in range(n_frames):
        start = i * hop
        frame = x_noisy[start:start + chunk_size]
        # Model expects shape (batch, chunk_size, channels)
        y = model.predict(frame.reshape(1, chunk_size, 1), verbose=0).flatten()
        out[start:start + chunk_size] += y * w    # windowed accumulation
        norm[start:start + chunk_size] += w * w   # window-energy normalizer
    norm[norm == 0] = 1.0                         # avoid division by zero at edges
    return out / norm

Summary of Provided Model Variants