Transform Coding for Redundancy Removal

Vicente González Ruiz & Savins Puertas Martín & Marcos Lupión Lorente

November 18, 2024

Contents

 1 Spatial (stereo) decorrelation with the MST (Mid/Side Transform)
  1.1 Analysis transform
  1.2 Synthesis transform
  1.3 Orthogonality of the transform
  1.4 Quantization of the subbands
 2 Temporal decorrelation using the DWT (Discrete Wavelet Transform)
  2.1 About temporal redundancy in audio
  2.2 Subband Coding
  2.3 Multichannel filter banks and psychoacoustic frequency resolution
  2.4 The Discrete Wavelet Transform
  2.5 Implementation of the DWT
  2.6 Example of a DWT using the MST filters
  2.7 Wavelets and filter banks
  2.8 Example of a DWT using “high”-order filters
  2.9 Quantization of the DWT subbands
 3 Overlapped block transforms to minimize distortion
 4 Reducing the data overhead
 5 Deliverables
 6 Resources
 References

1 Spatial (stereo) decorrelation with the MST (Mid/Side Transform)

1.1 Analysis transform

InterCom transmits a stereo (two channels) PCM signal. In most cases, the channels are highly correlated, which means that we can find a more efficient representation. To perform this inter-channel decorrelation [4] we can use the linear transform [8] \begin {equation} {\mathbf w} = {\mathbf K}{\mathbf x} = \begin {bmatrix} \mathbf {K}_0 \\ \mathbf {K}_1 \end {bmatrix}{\mathbf x} = \begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix} {\mathbf x}, \label {eq:forward_transform_matrix_form} \end {equation} that can also be written as \begin {equation} \begin {bmatrix} {\mathbf w}_0 \\ {\mathbf w}_1 \end {bmatrix} = \begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix} \begin {bmatrix} {\mathbf x}_0 \\ {\mathbf x}_1 \end {bmatrix}, \label {eq:forward_transform_matrix_form2} \end {equation} where \({\mathbf x}\in \mathbb {Z}^2\) is a stereo frame, \(\mathbf K\) is the forward (or analysis) transform matrix, and \({\mathbf w}=\begin {bmatrix} {\mathbf w}_0 & {\mathbf w}_1\end {bmatrix}^{\text T}\) is the corresponding decomposition. In this particular transform, the decomposition has two subbands \({\mathbf w}_0\) and \({\mathbf w}_1\), and each subband has only one coefficient per frame. Notice that \(\mathbb {Z}^2\), together with the required operations, forms a vector space.
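As a quick numerical illustration, Eq. \eqref{eq:forward_transform_matrix_form} can be applied to a chunk of stereo frames with a couple of lines of NumPy. This is only a sketch; the function name is hypothetical and not part of InterCom:

```python
import numpy as np

# Analysis (forward) MST matrix K of Eq. (1).
K = np.array([[1,  1],
              [1, -1]])

def MST_analyze(x):
    """Apply w = Kx to each stereo frame (each row of x)."""
    return x @ K.T  # each row [x0, x1] -> [x0 + x1, x0 - x1]

chunk = np.array([[3, 1],
                  [-2, 4]])       # two stereo frames
w = MST_analyze(chunk)
# column 0 holds the w0 (mid) coefficients, column 1 the w1 (side) ones
```

Each row of the result contains the single coefficient that every subband contributes per frame.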

The proposed matrix \(\mathbf K\) corresponds to the transform used in Mid/Side (M/S) stereo coding [2] that we will call MST (Mid/Side Transform). It is similar to the \(2\times 2\) KLT (Karhunen-Loève Transform), the Haar Transform, and the \(2\times 2\) Discrete Walsh-Hadamard Transform [7, 10].

In general (for all linear transforms), Eqs. \eqref{eq:forward_transform_matrix_form} and \eqref{eq:forward_transform_matrix_form2} can also be expressed as \begin {equation} {\mathbf w}_u = \sum _i {\mathbf K}_{u,i}{\mathbf x}_i, \label {eq:forward_transform_linear_combination_form} \end {equation} where \({\mathbf K}_{u,i}\) denotes the \(i\) -th element of the \(u\)-th row of the matrix \(\mathbf K\).

A major difference between the transformed data \(\mathbf w\) and the original data \(\mathbf x\) is that the characteristics of the elements of \(\mathbf w\) are determined by their position within the decomposition \(\mathbf w\) [7]. Thus, as a consequence of how the matrix has been defined, the subband \({\mathbf w}_0\) represents (very roughly) the low frequencies of (the sequence) \(\mathbf x\), and \({\mathbf w}_1\) the high frequencies. Therefore, the values of \({\mathbf K}_0\) (the row 0 of \(\mathbf K\)) describe a low-pass filter, the values of \({\mathbf K}_1\) describe a high-pass filter, and \(\mathbf K\) represents the filters of a filter bank (FB) with two filters. This can also be seen in the notebook A RD-comparison of “Stereo” Transforms.

1.2 Synthesis transform

The inverse (or synthesis) transform \begin {equation} {\mathbf x} = {\mathbf K}^{-1}{\mathbf w} \label {eq:inverse_transform} \end {equation} can be deduced from Eq. \eqref{eq:forward_transform_matrix_form}, where we have that \begin {equation} \begin {array}{rcl} {\mathbf w}_0 & = & {\mathbf x}_0 + {\mathbf x}_1\\ {\mathbf w}_1 & = & {\mathbf x}_0 - {\mathbf x}_1. \end {array} \end {equation} By solving for \({\mathbf x}_0\) (adding) and \({\mathbf x}_1\) (subtracting) in these equations, we obtain \begin {equation} \begin {array}{rcl} {\mathbf x}_0 & = & \frac {1}{2}({\mathbf w}_0 + {\mathbf w}_1)\\ {\mathbf x}_1 & = & \frac {1}{2}({\mathbf w}_0 - {\mathbf w}_1), \end {array} \end {equation} which in matrix form becomes \begin {equation} \begin {bmatrix} {\mathbf x}_0 \\ {\mathbf x}_1 \end {bmatrix} = \frac {1}{2} \begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix} \begin {bmatrix} {\mathbf w}_0 \\ {\mathbf w}_1 \end {bmatrix}. \end {equation} Therefore, \begin {equation} {\mathbf x} = {\mathbf K}^{-1}{\mathbf w} = \frac {1}{2}{\mathbf K}^{\text T}{\mathbf w} = \frac {1}{2}{\mathbf K}{\mathbf w} = \frac {1}{2}\begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix}{\mathbf w} = \begin {bmatrix} \frac {1}{2} & \frac {1}{2} \\ \frac {1}{2} & -\frac {1}{2} \end {bmatrix}{\mathbf w}. \label {eq:inverse_transform_matrix_form} \end {equation}
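Eq. \eqref{eq:inverse_transform_matrix_form} can be checked numerically with a minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

K = np.array([[1, 1], [1, -1]])

def MST_synthesize(w):
    """x = K^{-1} w = (1/2) K w, applied to each frame (row) of w."""
    return (w @ K.T) / 2

x = np.array([[3, 1], [-2, 4]])            # two stereo frames
w = x @ K.T                                # analysis transform (Eq. 1)
assert np.allclose(MST_synthesize(w), x)   # perfect reconstruction
```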

1.3 Orthogonality of the transform

As can be seen (if we ignore the \(\frac {1}{2}\) scale factor), the inverse transform is the transpose of the forward transform (\({\mathbf K}^{-1}={\mathbf K}^{\text T}\)). This is a characteristic of all orthogonal transforms [7]. For the MST, specifically, it also holds that \({\mathbf K}^{\text T}={\mathbf K}\) because \(\mathbf K\) is symmetric.

In addition to verifying that \({\mathbf K}^{-1}={\mathbf K}^{\text T}\) (up to the scale factor), we can check that \(\mathbf K\) is orthogonal because the inner product between the different filters (rows of the matrix) of \(\mathbf K\) is \(0\). In our case \({\mathbf K}_0=\begin {bmatrix}1 & 1\end {bmatrix}\) and \({\mathbf K}_1=\begin {bmatrix} 1 & -1\end {bmatrix}\), and as we can see \begin {equation} \langle {\mathbf K}_0,{\mathbf K}_1 \rangle = \langle \begin {bmatrix} 1 & 1 \end {bmatrix} , \begin {bmatrix} 1 & -1 \end {bmatrix} \rangle = \begin {bmatrix} 1 & 1 \end {bmatrix} \cdot \begin {bmatrix} 1 & -1 \end {bmatrix} = 1\times 1 + 1\times (-1) = 0, \end {equation} which means that the filters \({\mathbf K}_0\) and \({\mathbf K}_1\) are linearly independent.
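Both properties are easy to verify numerically; a minimal NumPy check (only a sketch):

```python
import numpy as np

K = np.array([[1, 1], [1, -1]])

# The rows (filters) of K are mutually orthogonal: <K0, K1> = 0.
dot = K[0] @ K[1]

# Up to the 1/2 scale factor, the inverse equals the transpose.
Kinv = np.linalg.inv(K)
same = np.allclose(Kinv, K.T / 2)
```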

Notice also that \begin {equation} {\mathbf w}_i = \langle {\mathbf x}, {\mathbf K}_i\rangle , \end {equation} which basically means that \({\mathbf w}_i\) is proportional to the similarity between the input signal \(\mathbf x\) and the coefficients of the filter \({\mathbf K}_i\). These slides can help you with this key idea.

Orthogonality is important in compression applications because the statistical correlation between the subbands is 0, and therefore the contributions of the subbands to the reconstruction of the original signal \(\mathbf x\) are independent. Another interesting property, satisfied by many famous transforms (such as the Fourier Transform), is orthonormality, which means that the transform is energy preserving [7] (or that Parseval’s theorem is satisfied by both the analysis and the synthesis transforms).

The MST is not orthonormal, because \begin {equation} \sum _i {{\mathbf w}_i}^2 = ({\mathbf x}_0 + {\mathbf x}_1)^2 + ({\mathbf x}_0 - {\mathbf x}_1)^2 = ({\mathbf x}_0^2 + 2{\mathbf x}_0{\mathbf x}_1+{\mathbf x}_1^2) + ({\mathbf x}_0^2-2{\mathbf x_0}{\mathbf x}_1+{\mathbf x}_1^2) = 2({\mathbf x}_0^2+{\mathbf x}_1^2) = 2\sum _i {{\mathbf x}_i}^2. \label {eq:No_Parseval} \end {equation} For this reason, we must divide the synthesized samples by \(2\) (see Eq. \eqref{eq:inverse_transform_matrix_form}). Otherwise, we would get \(2{\mathbf x}\) as the reconstructed signal instead of \(\mathbf x\).
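Eq. \eqref{eq:No_Parseval} can also be verified numerically; a small NumPy sketch (the random frame is arbitrary):

```python
import numpy as np

K = np.array([[1, 1], [1, -1]])
rng = np.random.default_rng(0)
x = rng.integers(-1 << 15, 1 << 15, size=2)   # one random 16-bit stereo frame
w = K @ x                                     # analysis transform

# Eq. (11): the energy of the decomposition doubles.
assert np.sum(w**2) == 2 * np.sum(x**2)
```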

1.4 Quantization of the subbands

Ideally, the QSS (Quantization Step Size) \(\Delta _i\) used for a subband \({\mathbf w}_i\) must operate on the RD curve \(f_i\) with the same slope as the rest of the subbands [11, 7] (this is the same as saying that we must satisfy \(f'_0(x)=f'_1(x)\), where \(f'\) denotes the derivative of \(f\)). The main drawback of this approach is that finding \(f_i\) is computationally expensive (we must analyze, quantize, compress, decompress, dequantize, synthesize, and compute the distortion of the data for a sufficiently high number of quantization steps), and usually we cannot do that in real time. Notice that we can use this optimal quantization scheme because the subbands are independent of one another, which is true because the transform is orthogonal.

For this reason, in the notebook A RD-comparison of “Stereo” Transforms we explore a different solution based on the idea that the contribution (in terms of energy) of the subbands to the reconstruction of the signal \(\mathbf x\) should be proportional to the gain of each synthesis filter of \({\mathbf K}^{-1}\) (recall that we work with orthogonal transforms and, therefore, the contributions of the subbands are independent). Thus, if the filters had different gains, the QSSs should take this into account, using a smaller QSS where the gain is higher.

By definition, the contribution of the subband \({\mathbf w}_i\) to the reconstruction of the frame is proportional to the L\(^2\) norm of the (synthesis) filter \({\mathbf K}_i^{-1}\). Thus \begin {equation} \begin {array}{l} \left \| {\mathbf K}_0^{-1} \right \|_2 := \sqrt {\langle \begin {bmatrix} \frac {1}{2} & \frac {1}{2} \end {bmatrix}, \begin {bmatrix} \frac {1}{2} & \frac {1}{2} \end {bmatrix} \rangle } = \sqrt {\begin {bmatrix}\frac {1}{2} & \frac {1}{2} \end {bmatrix} \cdot \begin {bmatrix} \frac {1}{2} & \frac {1}{2} \end {bmatrix}} = \frac {1}{\sqrt {2}},\\ \left \| {\mathbf K}_1^{-1} \right \|_2 := \sqrt {\langle \begin {bmatrix} \frac {1}{2} & -\frac {1}{2} \end {bmatrix}, \begin {bmatrix} \frac {1}{2} & -\frac {1}{2} \end {bmatrix} \rangle } = \sqrt {\begin {bmatrix} \frac {1}{2} & -\frac {1}{2} \end {bmatrix}\cdot \begin {bmatrix} \frac {1}{2} & -\frac {1}{2} \end {bmatrix}} = \frac {1}{\sqrt {2}}, \end {array} \end {equation} so both subbands \({\mathbf w}_0\) and \({\mathbf w}_1\) have the same gain (\(1/\sqrt {2}\)). This result tells us that both subbands can use the same quantization step size (\(\Delta _0=\Delta _1\)). In the notebook A RD-comparison of "Stereo" Transforms there is some evidence of this.

Unfortunately, most transforms are not implemented using matrix-vector operations, but using faster algorithms based on a lattice of computational butterflies or filter convolutions (and therefore, we do not know \(\mathbf K\)). Fortunately, we can determine \({\mathbf K}_i^{-1}\) (and therefore, \(\mathbf K\)) by simply computing the inverse transform of the decomposition \(\begin {bmatrix} 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end {bmatrix}^{\text T}\), where the \(1\) value is in the position \(i\) (only the subband \({\mathbf w}_i=1\), the rest are “zeroed”). In our example, we get that

\begin {equation} \begin {array}{l} {\mathbf K}_0^{-1} = \frac {1}{2} \begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix} \begin {bmatrix} 1 \\ 0 \end {bmatrix} = \frac {1}{2} \begin {bmatrix} 1 & 1 \end {bmatrix}, \\ {\mathbf K}_1^{-1} = \frac {1}{2} \begin {bmatrix} 1 & 1 \\ 1 & -1 \end {bmatrix} \begin {bmatrix} 0 \\ 1 \end {bmatrix} = \frac {1}{2} \begin {bmatrix} 1 & -1 \end {bmatrix}. \end {array} \end {equation}
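The procedure just described (inverse-transforming “one-hot” decompositions) can be sketched in NumPy; here np.linalg.inv stands in for the black-box backward transform:

```python
import numpy as np

K = np.array([[1, 1], [1, -1]])
Kinv = np.linalg.inv(K)            # stands in for a black-box inverse transform

# Inverse-transform a decomposition with a single 1 in position i
# to recover the i-th synthesis filter.
K0_inv = Kinv @ np.array([1, 0])   # [1/2,  1/2]
K1_inv = Kinv @ np.array([0, 1])   # [1/2, -1/2]

gains = np.linalg.norm(K0_inv), np.linalg.norm(K1_inv)
# both gains equal 1/sqrt(2): the subbands can share the same QSS
```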

As a final remark, notice that any alternative other than \(\Delta _0=\Delta _1\) will affect the quality and the spatial perception of the audio to a different degree. The notebook A RD-comparison of "Stereo" Transforms gives more information about this.

2 Temporal decorrelation using the DWT (Discrete Wavelet Transform)

2.1 About temporal redundancy in audio

After exploiting the spatial (stereo) redundancy, the next natural step in the development of InterCom is to remove the temporal redundancy that can be found inside each subband. As can be seen in the notebook Audio Viewer, most audio signals show “patterns” of samples that tend to repeat, especially locally. Another clear source of temporal redundancy is that neighboring audio samples usually show similar amplitude values.

Several techniques can be used to remove the temporal redundancy of an audio sequence. One of the most straightforward is Differential Pulse Code Modulation (DPCM) [7]. However, there are more efficient decorrelation algorithms based on Transform Coding, such as the one described in the previous section and the one described here.

As explained before, Transform Coding is based on the idea that we can decompose the input signal into a set of subbands, and if the filters used are appropriate to remove the (in this case, temporal) redundancy, we can achieve a high Transform Coding Gain [7], accumulating most of the signal energy (and presumably most of the information) in a small number of subbands. When this happens, the quantization of the subbands will basically remove the least significant information (usually electronic noise), allowing better compression ratios than those obtained when the same quantization is applied to the original samples.

2.2 Subband Coding

Subband Coding is a particular case of Transform Coding where the rows of the transform matrix are the coefficients of digital filters that are used without splitting the signal into blocks. In this context, our analysis transform matrix \(\mathbf K\) (see the previous section) represents the coefficients of a 2-channel analysis Filter Bank (FB) [10], and the forward transform is in fact decomposing \(\mathbf x\) into two subbands \({\mathbf w}_0\) and \({\mathbf w}_1\) (see the Figure 1, and the notebook A Perfect Reconstruction Filter Bank (PRFB)). On the other hand, the synthesis transform matrix \({\mathbf K}^{-1}\) denotes the coefficients of the corresponding synthesis FB that allows recovering \(\mathbf x\) from \(\{{\mathbf w}_i\}\) (notice that in the figure, \({\mathbf x}={\mathbf l}^i\), \({\mathbf w}_0={\mathbf l}^{i+1}\), \({\mathbf w}_1={\mathbf h}^{i+1}\), \(\tilde \phi ={\mathbf K}_0\), \(\tilde \psi ={\mathbf K}_1\), \(\phi ={\mathbf K}^{-1}_0\), and \(\psi ={\mathbf K}^{-1}_1\)).

Figure 1: A 2-channels PRFB (Perfect Reconstruction Filter Bank).

Let us suppose now that the analysis filters (represented by the coefficients of) \({\mathbf K}_0\) and \({\mathbf K}_1\) are applied to the input signal \(\mathbf x\) (now a sequence of \(N\) samples) using a convolution (without splitting \(\mathbf {x}\) into blocks, as happens in Transform Coding). Let us also suppose (as happens in the MST) that \({\mathbf K}_0\) is a low-pass filter and \({\mathbf K}_1\) is a high-pass filter, and that their frequency responses are complementary. Under these assumptions, the complete (analysis/synthesis) transform is called a (2-channel) Perfect Reconstruction Filter Bank (PRFB), and \(\mathbf x\) can be recovered (perfectly) from a subsampled version (in this case decimating by 2, because we have two channels in the FB) of \({\mathbf w}_0\) and \({\mathbf w}_1\) (see the notebook A Perfect Reconstruction Filter Bank (PRFB)). To achieve this, the frequency response of \({\mathbf K}_0\) must be equal to the mirrored frequency response of \({\mathbf K}_1\), and obviously both filters must have the same bandwidth [7]. In this situation, in which \({\mathbf K}_0\) and \({\mathbf K}_1\) are mirror filters, we say that they form a Quadrature Mirror Filter (QMF) bank.

2.3 Multichannel filter banks and psychoacoustic frequency resolution

Using suitable filters, it is possible to build \(M\)-channel PRFBs. These filters can analyze (and synthesize) the signal \(\mathbf x\), decomposing it into (almost surely) overlapping frequency subbands with different bandwidths. The question here is how many filters should be used and what pass-band width each of them should have. At this design point, we must also consider that the accuracy of the human perception of sound depends on the frequency (as can be checked with the notebook Tonal Generator): we are more sensitive to frequency variations when the frequency of the sound is low. This fact is related to the way in which the critical bands are distributed in the Bark Scale.

2.4 The Discrete Wavelet Transform

As can be seen, the Bark Scale divides the audible spectrum into 24 subbands of (a priori) “whimsical” bandwidths. However, it is clear that a dyadic partition of the audible spectrum fits them better than a linear partition. For this reason, of all the families of transforms designed to date, the most suitable one, from a frequency-partitioning point of view, is the Discrete Wavelet Transform (DWT).

The DWT has also other interesting features:

  1. It is fast (\(O(N)\), where \(N\) is the number of “transformed” samples).
  2. It can efficiently represent transient signals, which occur frequently in audio.
  3. Although we are not going to take advantage of this characteristic (for now), one of the most interesting features of the DWT is that it can be used to find a multiresolution representation of the signal.

2.5 Implementation of the DWT

The DWT can be implemented in different ways:

  1. Defining the transform matrix \(\mathbf K\) (see these slides) and computing matrix-vector multiplications, which requires \(O(N^2)\) calculation time. However, the main problem of this type of implementation is the amount of memory that \(\mathbf K\) requires, which is also proportional to \(N^2\).
  2. Cascading PRFBs (see the Figure 2). Considering that the convolution is an \(O(N\log _2N)\) operation (if it is implemented in the frequency domain), and that the number of levels in the cascade is generally small (5, for example), this implementation is faster than the one based on matrix-vector arithmetic. And most importantly, we do not need to store \(\mathbf K\), but only the coefficients of the different filters that are used.

    Figure 2: A dyadic 2-levels cascade of PRFBs.
  3. Using lifting [9], which provides an additional speed-up factor of 2 compared to the FB implementation. DWTs implemented with lifting do not need to downsample and upsample the subbands, an operation that wastes the computation of half of the wavelet coefficients at each level of the cascade.

2.6 Example of a DWT using the MST filters

In order to clarify the concepts introduced above, let us build a DWT using the MST filters and lifting.

  1. Lifting is based on the concept of dyadic multiresolution analysis, and is also related to the so-called polyphase representation of signals. To do that, we can rewrite the MST filter equations (our \({\mathbf K}_0\) and \({\mathbf K}_1\) filters in the previous section) as \begin {equation} \begin {array}{rcl} {\mathbf l}^1_i & = & {\mathbf x}_{2i} + {\mathbf x}_{2i+1} \\ {\mathbf h}^1_i & = & {\mathbf x}_{2i+1} - {\mathbf x}_{2i}, \end {array} \label {eq:1dwt} \end {equation} where a level-\(l\) subband \({\mathbf w}^l=\{{\mathbf w}_i^l~|~0\le i< 2^{n-l}\}\) has \(2^{n-l}\) coefficients, \(N=2^n\) is the number of samples in \(\mathbf x\), and, by definition, \({\mathbf l}^0={\mathbf x}\) is the original resolution level of the signal. The subbands \({\mathbf l}^1\) and \({\mathbf h}^1\) computed by Eq. \eqref{eq:1dwt} are the same as the decimated subbands computed by the corresponding 1-level PRFB, and we say, therefore, that Eq. \eqref{eq:1dwt} computes the 1-level DWT.

    Based on the 1-level DWT, we define the 2-levels DWT as \begin {equation} \begin {array}{rcl} {\mathbf l}^2_i & = & {\mathbf l}^1_{2i} + {\mathbf l}^1_{2i+1} \\ {\mathbf h}^2_i & = & {\mathbf l}^1_{2i+1} - {\mathbf l}^1_{2i}, \end {array} \label {eq:2dwt} \end {equation} that, as we can see, uses as input the output of Eq. \eqref{eq:1dwt}.

    In general, for a \(l\)-levels DWT, we get \begin {equation} \begin {array}{rcl} {\mathbf l}^l_i & = & {\mathbf l}^{l-1}_{2i} + {\mathbf l}^{l-1}_{2i+1} \\ {\mathbf h}^l_i & = & {\mathbf l}^{l-1}_{2i+1} - {\mathbf l}^{l-1}_{2i}. \end {array} \label {eq:ldwt} \end {equation}

    The \(l\)-levels DWT splits the signal spectrum into \(l+1\) subbands. If \(l=n\), where \(N=2^n\), we have the spectrum partition \begin {equation*} | \mathbf {l}^l_0 | \mathbf {h}^l_0 | \mathbf {h}^{l-1}_0 \mathbf {h}^{l-1}_1 | \mathbf {h}^{l-2}_0 \mathbf {h}^{l-2}_1 \mathbf {h}^{l-2}_2 \mathbf {h}^{l-2}_3 | \cdots | \mathbf {h}^1_0 \mathbf {h}^1_1 \cdots \mathbf {h}^1_{2^{n-1}-1} |, \end {equation*} holding that \begin {equation} 1+\sum _{j=1}^l 2^{j-1}=2^n, \end {equation} i.e., the number of DWT coefficients is also \(N\).

  2. Every DWT performs a number of lifting steps, each one with 2 (sub)steps:

    1. A predict step, which computes the \(\mathbf h\) subbands as a prediction error (that in general should be minimized) between the even samples (usually, the values used to predict) and the odd samples (usually, the values predicted). For the MST filters, we have that (see Eq. \eqref{eq:ldwt}) \begin {equation} {\mathbf h}^l_i = {\mathbf l}^{l-1}_{2i+1} - {\mathbf l}^{l-1}_{2i}. \end {equation}
    2. An update step, which computes the \(\mathbf l\) subband considering (only) the even samples and the prediction errors. For the MST, we have that (see also Eq. \eqref{eq:ldwt}) \begin {equation} {\mathbf l}^l_i = 2{\mathbf l}^{l-1}_{2i} + {\mathbf h}^l_i. \end {equation}

    Notice that these steps are invertible: \begin {equation} \begin {array}{rcl} {\mathbf l}^{l-1}_{2i} & = & \frac {1}{2}({\mathbf l}^l_i - {\mathbf h}^l_i)\\ {\mathbf l}^{l-1}_{2i+1} & = & {\mathbf l}^{l-1}_{2i} + {\mathbf h}^l_i. \end {array} \end {equation}
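The predict/update steps and their inverses can be cascaded into a complete transform. The following is a minimal integer-arithmetic NumPy sketch (the function names are hypothetical), assuming the number of samples is a power of two:

```python
import numpy as np

def mst_dwt(x, levels):
    """l-levels DWT with the MST filters via lifting."""
    l, subbands = x, []
    for _ in range(levels):
        even, odd = l[0::2], l[1::2]
        h = odd - even          # predict step
        l = 2 * even + h        # update step (equals even + odd)
        subbands.append(h)
    return l, subbands

def mst_idwt(l, subbands):
    """Invert the lifting steps in reverse order."""
    for h in reversed(subbands):
        even = (l - h) // 2     # undo the update step
        odd = even + h          # undo the predict step
        l = np.empty(2 * len(even), dtype=l.dtype)
        l[0::2], l[1::2] = even, odd
    return l

x = np.arange(8)
l, hs = mst_dwt(x, 3)
assert (mst_idwt(l, hs) == x).all()   # perfect (integer) reconstruction
```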

2.7 Wavelets and filter banks

In the context of wavelet theory [3], the response of the low-pass analysis filter (\({\mathbf K}_0\) in the MST) to the unit impulse is known as the scaling function and is usually denoted by \(\tilde \phi \); the response of the high-pass analysis filter (\({\mathbf K}_1\)) is known as the wavelet function and is usually denoted by \(\tilde \psi \); the response of the low-pass synthesis filter (\({\mathbf K}^{-1}_0\)) is denoted by \(\phi \), and the response of the high-pass synthesis filter (\({\mathbf K}^{-1}_1\)) by \(\psi \); these last two are the dual scaling and wavelet functions.

For the MST it holds that \(\tilde \phi \bot \tilde \psi \), that the frequency response of \(\tilde \phi \) is equal to the mirror of \(\phi \), and that the frequency response of \(\tilde \psi \) is equal to the mirror of \(\psi \); this is also true for all orthogonal DWTs. Another important characteristic of orthogonal DWTs is that their filters cannot be symmetric.

2.8 Example of a DWT using “high”-order filters

The previous MST-based DWT is similar to other transforms, such as the Haar transform, in which we use a first-order predictor to remove the temporal redundancy. Let us extend the idea of lifting to a prediction of order two. For that, we define the predict step as \begin {equation} {\mathbf h}^l_i = {\mathbf l}^{l-1}_{2i+1} - \frac {1}{2}({\mathbf l}^{l-1}_{2i} + {\mathbf l}^{l-1}_{2i+2}) \label {eq:linear_prediction} \end {equation} and the update step as \begin {equation} {\mathbf l}^l_i = {\mathbf l}^{l-1}_{2i} + \frac {1}{4}({\mathbf h}^l_{i-1} + {\mathbf h}^l_i), \label {eq:linear_update} \end {equation} where the factor \(1/4\) is used to preserve the energy [9]. This transform is known as the Cohen-Daubechies-Feauveau biorthogonal (2,2) transform. Biorthogonal filters can be easily recognized because they are always symmetric. When the PRFB filters are biorthogonal, they also satisfy \(\psi \bot \tilde \phi \) and \(\phi \bot \tilde \psi \).

This linear transform is also invertible by simply reversing the steps: \begin {equation} \begin {array}{rcl} {\mathbf l}^{l-1}_{2i} & = & {\mathbf l}^l_i - \frac {1}{4}({\mathbf h}^l_{i-1} + {\mathbf h}^l_i)\\ {\mathbf l}^{l-1}_{2i+1} & = & {\mathbf h}^l_i + \frac {1}{2}({\mathbf l}^{l-1}_{2i} + {\mathbf l}^{l-1}_{2i+2}). \end {array} \end {equation}
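One level of this biorthogonal (2,2) lifting can be sketched in NumPy as follows; note that the periodic signal extension at the borders is an assumption of this example (symmetric extension is also common), and the function names are hypothetical:

```python
import numpy as np

def cdf22_forward(x):
    """One level of the CDF (2,2) DWT via lifting (Eqs. 18-19)."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    # predict: h_i = odd_i - (even_i + even_{i+1}) / 2  (periodic extension)
    h = odd - (even + np.roll(even, -1)) / 2
    # update: l_i = even_i + (h_{i-1} + h_i) / 4
    l = even + (np.roll(h, 1) + h) / 4
    return l, h

def cdf22_backward(l, h):
    """Reverse the lifting steps (Eq. 20)."""
    even = l - (np.roll(h, 1) + h) / 4
    odd = h + (even + np.roll(even, -1)) / 2
    x = np.empty(2 * len(even))
    x[0::2], x[1::2] = even, odd
    return x

x = np.array([1., 3., 2., 2., 5., 1., 0., 4.])
l, h = cdf22_forward(x)
assert np.allclose(cdf22_backward(l, h), x)   # perfect reconstruction
```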

2.9 Quantization of the DWT subbands

The same theory developed for the quantization of the MST subbands (see Section 1.4) is applicable here: we should select the different quantization step sizes depending on the slope generated in the rate/distortion curve (see the notebook A RD-comparison of DWTs). In practice, when the dynamic range of the DWT coefficients depends on the gain of the synthesis filters (as occurs with PyWavelets), a constant QSS strategy, where \begin {equation} \Delta _0=\Delta _i~\text {for}~i=1,\cdots ,l, \end {equation} usually works fine.
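With PyWavelets, the DWT, a constant-QSS uniform quantization of every subband, and the inverse DWT can be sketched as follows (the wavelet name, the number of levels, and the QSS are arbitrary choices for illustration):

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
x = rng.normal(size=1024)

levels, delta = 5, 0.1                 # hypothetical settings
coeffs = pywt.wavedec(x, 'bior2.2', level=levels)

# Constant QSS: quantize every subband with the same step size.
quantized = [np.round(c / delta) * delta for c in coeffs]

y = pywt.waverec(quantized, 'bior2.2')
# y approximates x within the (small) quantization error
```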

3 Overlapped block transforms to minimize distortion

Transform Coding implies splitting the signal into blocks of data (chunks) and computing the transform of each chunk. When the output coefficients are quantized, it is possible that significant (and unpleasant) distortions appear in the border frames of the chunks (see Fig. 3). This is a consequence of the prediction step computed by the DWT at the limits of the chunks (where the samples of the adjacent chunks are needed), which generates different predictions at the beginning and the end of adjacent chunks.

Figure 3: On the top-left, three consecutive chunks of a real mono audio sequence. On the top-right, the reconstruction of the chunks without overlapping. On the bottom-left, the extended central chunk. On the bottom-right, the reconstruction of the extended chunk. See the notebook Quantization in the DWT domain. Remember that only the orange samples of the extended chunk will be used to reconstruct the original signal!

One solution to avoid signal discontinuities between chunks is to overlap the content of the blocks of data processed by the DWT. Thus, the current (\(i\)-th) “chunk” uses also the last frames of the previous (\((i-1)\)-th) chunk and the first frames of the next (\((i+1)\)-th) chunk to compute the transform of the current extended (\(i\)-th) chunk (see the Fig. 4). This has been described in the following algorithm:

Encoder:

  1. \({\mathbf C}_{-1}\leftarrow {\mathbf 0}\), a zero chunk.
  2. Input \({\mathbf C}_0\).
  3. For \(i\in \{0,1,\cdots \}\):

    1. Input \({\mathbf C}_{i+1}\).
    2. Build the extended chunk \({\mathbf E}={\mathbf C}_{i-1}[-o:]|{\mathbf C}_i|{\mathbf C}_{i+1}[:o]\), where \(\cdot |\cdot \) denotes the concatenation of chunks, \(o\) is the overlapped area size in frames, \({\mathbf C}_{i-1}[-o:]\) the last \(o\) frames of chunk \({\mathbf C}_{i-1}\), and \({\mathbf C}_{i+1}[:o]\) are the first \(o\) frames of the chunk \({\mathbf C}_{i+1}\).
    3. Compute the decomposition \({\mathbf D}_i \leftarrow \text {DWT}^l({\mathbf E})\), where \(l\) is the number of levels of the DWT (\(l=2\) in Fig. 4).
    4. Output the decomposition \({\mathbf D}_i\).
    5. \({\mathbf C}_{i-1}\leftarrow {\mathbf C}_i\) (we can assign the pointers, not the contents).
    6. \({\mathbf C}_i\leftarrow {\mathbf C}_{i+1}\).

Notice that we are following the NumPy [1, 5] slicing notation.

Decoder:

  1. For \(i\in \{0,1,\cdots \}\):

    1. Input decomposition \({\mathbf D}_i\).
    2. Compute extended chunk \({\mathbf E}\leftarrow \text {DWT}^{-l}({\mathbf D}_i)\).
    3. Output chunk \({\mathbf C}_i={\mathbf E}[o:-o]\).
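The encoder/decoder pair above can be sketched with PyWavelets; the chunk size, overlap, wavelet, and the zero-padding of the first and last chunks are assumptions of this example:

```python
import numpy as np
import pywt  # PyWavelets

def encode(chunks, o, levels, wavelet='bior2.2'):
    """Transform each chunk extended with o frames of its neighbors."""
    chunks = [np.zeros(o)] + list(chunks) + [np.zeros(o)]  # pad the stream
    for prev, cur, nxt in zip(chunks, chunks[1:], chunks[2:]):
        E = np.concatenate([prev[-o:], cur, nxt[:o]])      # extended chunk
        yield pywt.wavedec(E, wavelet, level=levels)       # decomposition D_i

def decode(decompositions, o, wavelet='bior2.2'):
    for D in decompositions:
        E = pywt.waverec(D, wavelet)
        yield E[o:-o]        # keep only the central (original) frames

chunk_size, o, levels = 256, 16, 2
x = np.random.default_rng(1).normal(size=4 * chunk_size)
chunks = x.reshape(4, chunk_size)
y = np.concatenate(list(decode(encode(chunks, o, levels), o)))
assert np.allclose(y, x)     # lossless without quantization
```

After quantizing the decompositions, the overlapped frames hide the boundary artifacts shown in Fig. 3.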

Figure 4: Structure in the DWT domain of an extended chunk for \(l=2\) (upper subfigure, without chunk extension). \(o\) is the number of overlapped frames between adjacent chunks. \({\mathbf C}_{i-1}[-o:]\) represents the last \(o\) frames of chunk \({\mathbf C}_{i-1}\), and \({\mathbf C}_{i+1}[:o]\) the first \(o\) frames of the chunk \({\mathbf C}_{i+1}\).

This idea has been implemented in the notebook Overlapped DWT, and the result can be seen in Fig. 3.

4 Reducing the data overhead

Unfortunately, the previous algorithm sends twice the DWT coefficients of the overlapped areas (in Fig. 4, \(\{{\mathbf D}_i.{\mathbf l}^2[-o/4:], {\mathbf D}_i.{\mathbf l}^2[:o/4], {\mathbf D}_i.{\mathbf h}^2[-o/4:], {\mathbf D}_i.{\mathbf h}^2[:o/4], {\mathbf D}_i.{\mathbf h}^1[-o/2:], {\mathbf D}_i.{\mathbf h}^1[:o/2]\}\)). To avoid this waste of bandwidth, we can reuse the received coefficients of the overlapped areas. This procedure is depicted in Fig. 5, and, as can be seen, the encoding algorithm is identical to the previous one, except that only the central (stereo) coefficients are sent. The rest of the coefficients needed to compute the inverse transform are extracted from the neighboring chunks (represented in the DWT domain). Notice that now the number of sent coefficients is \(\text {len}({\mathbf C}_i)\), the number of samples in \({\mathbf C}_i\).
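A quick arithmetic check confirms that the transmitted subset contains exactly \(\text{len}({\mathbf C}_i)\) coefficients, assuming that the chunk length and the overlap are divisible by \(2^l\) (the concrete numbers are arbitrary):

```python
# Size of the transmitted decomposition subset (encoder step 4),
# assuming the chunk length L and the overlap o are divisible by 2^l.
L, o, l = 1024, 32, 2

extended = L + 2 * o                            # frames in the extended chunk E
sent = extended // 2**l - 2 * o // 2**l         # trimmed l^l subband
sent += sum(extended // 2**j - 2 * o // 2**j    # trimmed h^j subbands
            for j in range(1, l + 1))
assert sent == L    # no overhead: exactly len(C_i) coefficients are sent
```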

Figure 5: Block overlapping in the DWT domain for \(l=2\). Only the shaded coefficients are transmitted. Notice that, to be reconstructed, each chunk depends on some coefficients of the adjacent blocks (only some dependencies have been indicated).

The codec can now be described by:

Encoder:

  1. \({\mathbf C}_{-1}\leftarrow {\mathbf 0}\), a zero chunk.
  2. Input \({\mathbf C}_0\).
  3. For \(i\in \{0,1,\cdots \}\):

    1. Input \({\mathbf C}_{i+1}\).
    2. Build the extended chunk \({\mathbf E} = {\mathbf C}_{i-1}[-o:]|{\mathbf C}_i|{\mathbf C}_{i+1}[:o]\).
    3. Compute the decomposition \({\mathbf D}_i \leftarrow \text {DWT}^l({\mathbf E})\).
    4. Output the decomposition subset \(\Big \{{\mathbf D}_i.{\mathbf l}^l[\frac {o}{2^l}:-\frac {o}{2^l}], {\mathbf D}_i.{\mathbf h}^l[\frac {o}{2^l}:-\frac {o}{2^l}], {\mathbf D}_i.{\mathbf h}^{l-1}[\frac {o}{2^{l-1}}:-\frac {o}{2^{l-1}}], \cdots , {\mathbf D}_i.{\mathbf h}^1[\frac {o}{2^1}:-\frac {o}{2^1}]\Big \}\).
    5. \({\mathbf C}_{i-1}\leftarrow {\mathbf C}_i\).
    6. \({\mathbf C}_i\leftarrow {\mathbf C}_{i+1}\).

Decoder (ignores overlapping):

This decoder ignores the adjacent chunks in the DWT domain, but notice that it uses the right coefficients (those computed using overlapping chunks). This should provide reconstructions of the chunks with a higher quality than in the previous decoding algorithm.

  1. For \(i\in \{0,1,\cdots \}\):

    1. Input the decomposition subset \({\mathbf D}_i\).
    2. Compute the chunk \({\mathbf C}_i\leftarrow \text {DWT}^{-l}({\mathbf D}_i)\).
    3. Output \({\mathbf C}_i\).

Decoder (uses overlapping):

Finally, this is the definitive decoder, which minimizes the reconstruction error and avoids the glitches at the borders of the chunks.

  1. \({\mathbf D}_{-1}\leftarrow {\mathbf 0}\).
  2. Input decomposition \({\mathbf D}_0\).
  3. For \(i\in \{0,1,\cdots \}\):

    1. Input decomposition \({\mathbf D}_{i+1}\).
    2. Build the extended decomposition \({\mathbf E}_i = {\mathbf D}_{i-1}.{\mathbf l}^l[-\frac {o}{2^l}:]|{\mathbf D}_i.{\mathbf l}^l|{\mathbf D}_{i+1}.{\mathbf l}^l[:\frac {o}{2^l}]|{\mathbf D}_{i-1}.{\mathbf h}^l[-\frac {o}{2^l}:]|{\mathbf D}_i.{\mathbf h}^l|{\mathbf D}_{i+1}.{\mathbf h}^l[:\frac {o}{2^l}]|\cdots |{\mathbf D}_{i-1}.{\mathbf h}^1[-\frac {o}{2^1}:]|{\mathbf D}_i.{\mathbf h}^1|{\mathbf D}_{i+1}.{\mathbf h}^1[:\frac {o}{2^1}].\)
    3. Compute the extended chunk \({\mathbf C}_i\leftarrow \text {DWT}^{-l}({\mathbf E}_i)\).
    4. Output \({\mathbf C}_i[o:-o]\).
    5. \({\mathbf D}_{i-1} \leftarrow {\mathbf D}_i\).
    6. \({\mathbf D}_i \leftarrow {\mathbf D}_{i+1}\).

5 Deliverables

  1. Determine the RD curves for the MST:

    As we did in the previous milestone, generate the RD curves for a set of simulated transmission contexts. Use the modules stereo_MST_coding{_16|_32}.py. As you can see, they differ in how the transform has been implemented. Notice that you can use the --minimal_quantization_step parameter to generate the different points of the RD curves (it is not necessary to use a bit-rate control algorithm, nor tc).

  2. Determine the RD curves for the DWT:

    Rebuild the RD curves considering also the removal of the temporal decorrelation. Use temporal_no_overlapped_DWT_coding.py. Note that the number of levels \(l\) of the DWT (computed using PyWavelets [6]) can have a high impact on the amount of energy concentration achieved by the DWT, and therefore on the efficiency of the coding system. Show such an impact. Experiment also with the wavelet name (try to optimize –minimize– the RD curve). Again, use the --minimal_quantization_step parameter to generate the different points of the RD curves.

  3. Determine the RD curves for the overlapped DWT:

    Finally, redo the curves considering now the block overlapping (temporal_overlapped_DWT_coding.py). It is a good idea to put all the RD curves together (in the same graph), to compare them easily.

  4. Answer the questions:

    1. What has been the gain of the filters used in your experiments? You can plot the gains (as shown in this notebook), but also give a written explanation.
    2. Does the computation of the DWT using chunk overlapping increase the latency of the whole system? What about the jitter? By how much?
    3. Which other transform(s) are used in audio encoding systems (such as MP3) to exploit temporal redundancy? Enumerate the systems and the transform(s) used.

6 Resources

[1]   S. Berg et al. The NumPy project.

[2]   M. Bosi and R.E. Goldberg. Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.

[3]   C.S. Burrus, R. Gopinath, and H. Guo. Wavelets and Wavelet Transforms. Rice University, 2013.

[4]   A.B. Downey. Think Stats: Probability and Statistics for Programmers. O’Reilly, 2011.

[5]   C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.

[6]   G. Lee, R. Gommers, F. Waselewski, K. Wohlfahrt, and A. O’Leary. PyWavelets: A Python package for wavelet analysis. Journal of Open Source Software, 4(36):1237, 2019.

[7]   K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.

[8]   G. Strang. Linear Algebra and Its Applications. Belmont, CA: Thomson, Brooks/Cole, 2006.

[9]   W. Sweldens and P. Schröder. Building Your Own Wavelets at Home. Wavelets in Computer Graphics, 1997.

[10]   M. Vetterli and J. Kovačević. Wavelets and Subband Coding. Prentice-Hall, 1995.

[11]   M. Vetterli, J. Kovačević, and V.K. Goyal. Foundations of Signal Processing. Cambridge University Press, 2014.

1Especially when the microphone is mono, because both channels are identical.

2Adding two vectors in the plane produces a third vector also in the plane; multiplying a vector by a real scalar produces a second vector also in the plane. These two facts make the integer/real plane a vector space. [11]

3The inner product between two vectors is in some sense a measure of how “similar” they are [7]. In fact, the inner product of a vector with itself yields its squared norm (a measure of its length) [11]. Notice also that, when we work with real signals, the inner product is also called the dot product or the scalar product. [11]

4When we are working with discrete signals, we usually talk about vectors instead of functions. These vectors are sampled versions of the corresponding functions or, as happens in our case, the coefficients of the filters, each one representing a basis vector.

5If a set of vectors is linearly independent, then the set is called a basis for the subspace generated by the linear combinations of this set. The basis contains the smallest number of linearly independent vectors required to represent each element of the vector (sub)space. The number of basis vectors required to generate the space is called the dimension of the vector space [7]. In our case, for the MST, we have two basis vectors.

6In terms of orthogonality, this means that we cannot derive one from the other using the operations that define a vector space, and therefore these vectors can be part of a basis (set) [8].

7Remember that for the MST a subband has only one coefficient. For other transforms, \({\mathbf w}_i\) can be made up of more than one coefficient and therefore we would be speaking of the subband coefficients, instead of only one coefficient.

8The total distortion is the sum of the distortion contribution of each subband [7].

9This would solve the problem of controlling the bit-rate because, using the RD curves, we know how many bits each subband will require.

10Notice that the quantization error is generated in the transform domain and perceived in the signal domain after applying the inverse transform.

11Notice that what matters here is the relative gain of each subband. For example, if the gain of \({\mathbf K}_0^{-1}\) were \(2\) and the gain of \({\mathbf K}_1^{-1}\) were \(1\), we should use \(\Delta _1=2\Delta _0\) to minimize the distortion, because a quantization error of a coefficient in \(\mathbf {w}_0\) is perceived as twice the quantization error of a coefficient in \(\mathbf {w}_1\).

12L\(_2\) is the set of all functions with finite energy and constitutes a vector space [7]. \(L_2({\mathbb R})\), or simply \(L_2\), is the space of all functions \(f(t)\) with a well defined integral of the square of the modulus of the function. The \(L\) signifies a Lebesgue integral, the “2” denotes the integral of the square of the modulus of the function, and \(\mathbb R\) states that the independent variable of integration ranges over the whole real line. For a function \(g(t)\), membership in that space is denoted \(g\in L^2({\mathbb R})\) or simply \(g\in L^2\) [3]. The computation of the L\(^2\) norm is equivalent to computing the Euclidean distance in \(N\)-dimensional spaces (in our case, \(N=2\)).

13Notice that this operation will “extract” the \(i\)-th column from \({\mathbf K}^{-1}\), which is equivalent to “extracting” the \(i\)-th row of \(\mathbf K\), \({\mathbf K}_i\) (remember that for orthogonal transforms, \({\mathbf K}^{-1}={\mathbf K}^{\text T}\)).

14Here it is assumed that the MST has been applied before. Notice that, because the MST and the transform used in this milestone are both linear, the order in which the transforms are applied is irrelevant. For this reason, we could also have applied the temporal transform to each channel of samples first, and then removed the spatial redundancy.

15Notice that if we dead-zone quantize a decomposition in which most of the coefficients are close to zero, the information removed from the signal will be that with the smallest energy.

16In signal processing, sub-band coding (SBC) is any form of transform coding that breaks a signal into a number of different frequency bands.

17In theory, such rows could be any mathematical operation, not necessarily a filter.

18The response (in the frequency domain) of the filter to the unit impulse.

19Notice that this subsampling is possible because the aliasing generated in the low-pass subband is attenuated by the aliasing generated in the high-pass subband. Notice also that the filters are not ideal: the bandwidth of the filtered signals \({\mathbf w}_0\) and \({\mathbf w}_1\) is bigger than half of the bandwidth of \(\mathbf x\). Therefore, by subsampling at a ratio of one out of each two coefficients, we are generating aliasing. See the sampling theorem.

20Notice that our matrix \({\mathbf K}\) would have \(M\) rows in this case, and also \(M\) columns, to satisfy that \({\mathbf K}^{-1}={\mathbf K}^{\text T}\) if we are implementing an orthogonal transform.

21You need to generate tonal sounds with different frequencies and amplitudes.

22And also, we are more sensitive to low frequencies than to high ones.

23Notice that in a typical cascade, the filters are always the same. Therefore, we only need to store in memory one copy of each different filter.

24The wavelet coefficient \({\mathbf l}^l_0\) is called the DC (Direct Current) coefficient, and the rest of \(\mathbf h\) coefficients are called AC (Alternating Current) coefficients.

25The response of a filter to the unit impulse characterizes the filter because, for a unit impulse input, the output of the filter is precisely the set of coefficients of the filter.

26In the \(Z\)-domain, it holds that \({\mathcal Z}\{\tilde \phi \}(z)={\mathcal Z}\{\phi \}(-z)\).

27The symmetry of the filters is important to produce the same type of artifacts at the boundaries of the signal and also to avoid phase-shifting of the coefficients in the wavelet domain.

28All transforms express a change of basis. When the bases are not orthogonal, the synthesis transform is not the transpose of the analysis transform. When the synthesis filters are orthogonal to their corresponding “dual” analysis filters, the transform is said to be biorthogonal. [11]

29If Eq. \eqref{eq:linear_prediction} is used in Eq. \eqref{eq:linear_update}, we can obtain the coefficients of the low-pass filter that computes the low-frequency subband: \({\mathbf l}^l_i=-1/8{\mathbf l}^{l-1}_{2i-2}+1/4{\mathbf l}^{l-1}_{2i-1}+3/4{\mathbf l}^{l-1}_{2i}+1/4{\mathbf l}^{l-1}_{2i+1}-1/8{\mathbf l}^{l-1}_{2i+2}\).
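This substitution can be checked numerically. The sketch below assumes the standard 5/3 (LeGall) prediction and update steps for Eqs. \eqref{eq:linear_prediction} and \eqref{eq:linear_update} (an assumption, since the exact lifting steps live elsewhere in the text), and verifies, for interior samples only, that predict-then-update yields the same low-pass coefficient as the composite 5-tap filter of this footnote.

```python
def lowpass_lifting(x, i):
    # l_i via lifting: predict the two neighboring odd samples, then update
    # the even sample x[2i] (valid for interior i only; no boundary handling).
    h_prev = x[2 * i - 1] - (x[2 * i - 2] + x[2 * i]) / 2  # prediction step
    h_curr = x[2 * i + 1] - (x[2 * i] + x[2 * i + 2]) / 2
    return x[2 * i] + (h_prev + h_curr) / 4                # update step

def lowpass_direct(x, i):
    # The composite filter (-1/8, 1/4, 3/4, 1/4, -1/8) of this footnote.
    return (-x[2 * i - 2] / 8 + x[2 * i - 1] / 4 + 3 * x[2 * i] / 4
            + x[2 * i + 1] / 4 - x[2 * i + 2] / 8)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0]
same = all(abs(lowpass_lifting(x, i) - lowpass_direct(x, i)) < 1e-12
           for i in range(1, 5))
```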