Psychoacoustics (see The Sound, the Human Auditory System, and the Human Sound Perception) has determined that the sensitivity of the HAS (Human Auditory System) depends on the frequency of the sound, a dependence described by the so-called ToH (Threshold of (Human) Hearing). This basically means that some subbands (intervals of frequencies) can be quantized with a larger quantization step than others without a noticeable increase (from a perceptual perspective) in the quantization noise [2].
A good approximation of the ToH for a 20-year-old person can be obtained with [1] \begin{equation} T(f)\text{[dB]} = 3.64(f\text{[kHz]})^{-0.8} - 6.5e^{-0.6(f\text{[kHz]}-3.3)^2} + 10^{-3}(f\text{[kHz]})^4. \label{eq:ToH} \end{equation} This equation has been plotted in Fig. 1.
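For reference, the equation above is easy to evaluate numerically; the following is a minimal sketch that reproduces the curve of Fig. 1 with NumPy and Matplotlib (the sampling of the frequency axis is an arbitrary choice).
\begin{verbatim}
# A minimal sketch that evaluates and plots the ToH approximation.
import numpy as np
import matplotlib.pyplot as plt

def ToH(f_kHz):
    '''Terhardt's approximation of the Threshold of Hearing [dB SPL].'''
    return (3.64 * f_kHz**-0.8
            - 6.5 * np.exp(-0.6 * (f_kHz - 3.3)**2)
            + 1e-3 * f_kHz**4)

f = np.linspace(0.02, 20, 1000)   # 20 Hz .. 20 kHz, expressed in kHz
plt.semilogx(f * 1000, ToH(f))
plt.xlabel("Frequency [Hz]")
plt.ylabel("SPL [dB]")
plt.title("Threshold of (Human) Hearing")
plt.show()
\end{verbatim}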
The number of DWT subbands is \begin{equation} N_{\text{sb}} = N_{\text{levels}} + 1, \end{equation} where \(N_{\text{levels}}\) is the number of levels of the DWT [3]. Except for the \({\mathbf l}^{N_{\text{levels}}}\) subband (the lowest-frequency subband of the decomposition), it holds that \begin{equation} W({\mathbf w}_s) = \frac{1}{2}W({\mathbf w}_{s-1}), \end{equation} where \(W(\cdot)\) denotes the bandwidth of the corresponding subband. Therefore, considering that the bandwidth of the audio signal is \(22050\) Hz, \(W({\mathbf w}_1)=22050/2=11025\) Hz, \(W({\mathbf w}_2)=22050/4=5512.5\) Hz, etc. It also holds that \begin{equation} W({\mathbf l}^{N_{\text{levels}}}) = W({\mathbf w}_{N_{\text{levels}}}). \end{equation}
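These relations can be turned into a small helper that returns the frequency interval covered by each subband; the following sketch (the function name is hypothetical) assumes an audio bandwidth of 22050 Hz.
\begin{verbatim}
# Frequency intervals of a dyadic DWT decomposition (hypothetical helper).
def subband_intervals(n_levels, bandwidth=22050):
    '''Return (low, high) pairs in Hz, from l^n_levels up to w_1.'''
    intervals = []
    high = bandwidth
    for _ in range(n_levels):
        intervals.append((high / 2, high))   # w_1, w_2, ...
        high /= 2
    intervals.append((0, high))              # the l^n_levels subband
    return list(reversed(intervals))

print(subband_intervals(5))
# [(0, 689.0625), (689.0625, 1378.125), ..., (11025.0, 22050)]
\end{verbatim}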
The idea is, knowing the frequencies represented in each DWT subband and the ToH curve (see InterCom: a Real-Time Digital Audio Full-Duplex Transmitter/Receiver), to decide the QSS (Quantization Step Size) that should be applied to each subband.
This idea is already implemented in the module basic_ToH.py.
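As an illustration only (basic_ToH.py may proceed differently, and the helper below is hypothetical), one possible mapping evaluates the ToH at the center frequency of each subband and converts the threshold from dB into a linear gain that scales a base quantization step.
\begin{verbatim}
# Hypothetical mapping from the ToH curve to one QSS per subband.
import numpy as np

def ToH(f_kHz):  # the approximation plotted in Fig. 1
    return (3.64 * f_kHz**-0.8
            - 6.5 * np.exp(-0.6 * (f_kHz - 3.3)**2)
            + 1e-3 * f_kHz**4)

def qss_per_subband(intervals, base_qss=1):
    '''intervals: list of (low, high) pairs in Hz (see above).'''
    steps = []
    for low, high in intervals:
        # Clamp very low frequencies (the approximation diverges at f -> 0).
        center_kHz = max((low + high) / 2, 20) / 1000
        steps.append(base_qss * 10**(ToH(center_kHz) / 20))
    return steps
\end{verbatim}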
The frequency resolution of the dyadic subband partition generated by the DWT may not be high enough to map the ToH curve accurately.1 To overcome this, we can use a decomposition with more subbands; Wavelet Packets are a good tool for this purpose. Notice that PyWavelets provides an implementation.
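For example, a full Wavelet Packet decomposition of depth \(n\) splits the signal into \(2^n\) subbands of equal bandwidth. A minimal sketch using PyWavelets (the wavelet and the depth are arbitrary choices):
\begin{verbatim}
# Uniform subband partition with Wavelet Packets (PyWavelets).
import numpy as np
import pywt

x = np.random.randn(1024)  # a stand-in for a chunk of audio
level = 5                  # 2**5 = 32 subbands (> the 24 Bark bands)
wp = pywt.WaveletPacket(data=x, wavelet="db5", maxlevel=level)
subbands = [node.data for node in wp.get_level(level, order="freq")]
print(len(subbands))       # 32
\end{verbatim}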
So far, this technique has not been implemented.
The ToH plotted in Fig. 1 can differ from your current “perceptual hearing capabilities”.2 An optimal, user-specific ToH should take into consideration the quantization noise that you can actually notice in each subband, defining a set of QSSs (one per subband). To find such a set, the following algorithm can be used:
Starting with the lowest-frequency subband (at the first iteration, the rest of the subbands are zero), and supposing that the quantization noise follows a uniform distribution: while the noise is imperceptible, increase the QSS of the current subband. When the noise becomes noticeable, keep the last imperceptible QSS and repeat the procedure for the next subband (see the sketch after this paragraph).
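A sketch of this calibration loop, under the assumptions above (the play() helper, which would quantize, dequantize, and render the chosen subband, then ask the listener for feedback, is hypothetical):
\begin{verbatim}
# Hypothetical calibration loop: one QSS per subband.
def calibrate(subband_indices, play):
    '''play(sb, step) -> True while the noise is imperceptible.'''
    qss = {}
    for sb in subband_indices:       # lowest-frequency subband first
        step = 1
        while play(sb, step):        # noise still imperceptible?
            step *= 2                # try a coarser quantization
        qss[sb] = max(step // 2, 1)  # last imperceptible QSS
    return qss
\end{verbatim}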
Notice that the QSSs are determined for the sound that you are going to play (not for the audio that you are generating). Therefore, you should use your interlocutor’s QSSs, and vice versa. Also implement the transmission of this information.
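A possible (hypothetical) message layout for this exchange packs one 16-bit unsigned integer per subband:
\begin{verbatim}
# Hypothetical packing/unpacking of the QSSs for transmission.
import struct

def pack_qsss(qss_list):
    return struct.pack(f"!{len(qss_list)}H", *qss_list)

def unpack_qsss(payload):
    return list(struct.unpack(f"!{len(payload) // 2}H", payload))
\end{verbatim}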
Finally, this improvement should be optional, selectable through a command-line parameter (the “standard” ToH of Fig. 1 should be used by default).
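For example, using argparse (the flag name --calibrate is an assumption):
\begin{verbatim}
# Hypothetical command-line switch: the standard ToH by default.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--calibrate", action="store_true",
                    help="use user-specific QSSs instead of the standard ToH")
args = parser.parse_args()
\end{verbatim}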
Implement the functionality described in Sections 3 and 4 in a module advanced_ToH.py, and write a report showing how your proposal works, including a subjective performance comparison.
[1] M. Bosi and R.E. Goldberg. Introduction to Digital Audio Coding and Standards. Kluwer Academic Publishers, 2003.
[2] K. Sayood. Introduction to Data Compression (Slides). Morgan Kaufmann, 2017.
[3] M. Vetterli and J. Kovačević. Wavelets and Subband Coding. Prentice-Hall, 1995.
1For example, if \(N_{\text{levels}}=5\) we decompose the audio into 6 subbands, while the Bark scale has 24 subbands. Remember that when the Wavelet transform is dyadic, the Wavelet space is analyzed by octaves, and therefore the subbands double their size as the frequency increases (see InterCom: a Real-Time Digital Audio Full-Duplex Transmitter/Receiver). In any case, the higher the number of subbands, the better the approximation to the ToH curve.
2For example, your speakers might not have a flat frequency response, or your room could attenuate some frequencies.