In this milestone, we’ll address the feedback problem that arises when the signal emitted by our speakers is recorded by our microphone. First, we’ll formalize the problem and then look at different solutions, with varying effectiveness and computational requirements.
One of the first problems we encounter when using the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s (loud)speaker reaches our mic(rophone) some time later, and some time after that, this sound reaches our interlocutor (the “far-end” ... in the system we are the “near-end”) in the form of an echo (signal) of his/her own voice. This echo is reproduced by his/her speaker, can be captured again (some time later) by his/her mic, and is sent back to us ... and so on, generating a rather unpleasant feedback signal.
To formalize the problem, let’s define:
In this situation, we have that
where \(m(t)\) is the analog audio signal that makes the membrane of our microphone oscillate.
Our problem here is to minimize the energy of \(s(t)\), i.e., to make
Use a headset. In this case,
Decrease the gain of the amplifier of your speaker to make (the energy of) \(s(t)\) as small as possible. Unfortunately, this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/
Let \(\mathbf m\) be the digital version of \(m(t)\), and \({\mathbf m}[t]\) its \(t\)-th sample3. In this solution, we send
where \(a\) is an attenuation (scalar) value, and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound waves from our speaker to our mic. We define
as the estimated5 feedback signal that reaches our microphone at the same instant of time that the sample \({\mathbf n}[t]\) would have been captured in the absence of the feedback.
Notice that the notation \(\tilde {\cdot }\) has been used to highlight that \(\tilde {\mathbf n}\) is an approximation of \(\mathbf n\) (our sampled true-voice signal), and the notation \(\hat {\cdot }\) to emphasize that \(\hat {\mathbf s}\) is a registered6 prediction of \(s\) reaching our microphone. Notice also that, if \({\mathbf s}[0]\) is the first sample of a chunk (the \(c\)-th chunk), the sample \({\mathbf s}[-d]\) could belong to a previous chunk (the \((c-1)\)-th chunk).
Finally, \(a\) should be chosen considering that, in the absence of voice at each end, \(s(t)\approx 0\). For example, Skype finds \(d\) and \(a\) using a “call-signal” (a sequence of more-or-less tonal sounds). \(d\) is determined by measuring the propagation time of the call-signal between our speaker and our mic.
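For illustration purposes only, the following sketch shows how \(d\) and \(a\) could be estimated with NumPy from a played call-signal chunk and the corresponding recorded chunk, using cross-correlation (the function name and the least-squares estimate of \(a\) are our own choices, not part of InterCom):
\begin{verbatim}
import numpy as np

def estimate_delay_and_attenuation(s, m):
    # s: played call-signal chunk, m: recorded chunk (1-D arrays, same length),
    # captured while the call-signal is the only source of sound.
    corr = np.correlate(m, s, mode='full')        # cross-correlation
    lags = np.arange(-len(s) + 1, len(m))         # lag of each correlation value
    valid = lags >= 0                             # the mic cannot anticipate the speaker
    d = int(lags[valid][np.argmax(corr[valid])])  # delay = lag of the maximum
    # With m[t] ~= a * s[t - d] (no voice), a follows from least squares.
    s_delayed = s[:len(s) - d]
    m_aligned = m[d:]
    a = np.dot(m_aligned, s_delayed) / np.dot(s_delayed, s_delayed)
    return d, a
\end{verbatim}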
We can improve the performance of the previous feedback cancellation solution if we also take into consideration that the feedback signal that finally reaches our microphone is the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, ...) to an impulse signal \(\delta (t)\).7 In other words, we can modify Eq. \eqref{eq:simplest} to compute
where \(\ast \) represents the (digital) convolution between (in our case) digital signals, and \(\mathbf h\) is the digitized (sampled + quantized) version of \(h(t)\), the response (echo signal) of the near-end (our) audioset to the impact of \(\delta (t)\).
The convolution of digital signals in the time domain can be expensive (with computational complexity \(O^2\), where \(O\) is the number of elements to process) if the number of samples and/or filter coefficients is high. Fortunately, thanks to the convolution theorem [3, 4], the convolution can be replaced by an element-wise product (with complexity \(O\)) when we consider the signals in the frequency domain. Thanks to this, we can rewrite Eq. \eqref{eq:using_convolution} as
where \(\mathbf S\) is the (digital) Fourier transform8 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform9 of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse (digital) Fourier transform. Notice that all these transforms are applied to digital signals, and there exist fast algorithms to compute them (with complexity \(O\log _2O\)).
Finally, notice that this feedback suppression technique could ease the calculation of \(d\), because if correlation is used to estimate it, the maximum should be clearly identifiable.
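As a minimal sketch (assuming that \(\mathbf h\) has already been measured and that we work chunk-by-chunk), the previous equation could be computed with numpy.fft as follows; notice that the FFT implements a circular convolution, so in practice the chunks should be zero-padded (e.g., using overlap-add) to obtain a linear convolution:
\begin{verbatim}
import numpy as np

def estimate_feedback_fft(s, h):
    # s: played chunk, h: measured impulse response (1-D arrays).
    # Computes s_hat = F^-1(S * H) using a real FFT of the chunk length.
    # Note: this is a circular convolution; zero-pad (overlap-add) for a linear one.
    N = len(s)
    S = np.fft.rfft(s, n=N)            # Fourier transform of the played chunk
    H = np.fft.rfft(h, n=N)            # Fourier transform of the impulse response
    return np.fft.irfft(S * H, n=N)    # element-wise product + inverse transform
\end{verbatim}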
Unfortunately, none of the previous solutions is “dynamic”, in the sense that if we modify the audioset, some parameters must be modified in real-time:
Filters can be implemented in the frequency domain (as happens in Eq. \eqref{eq:faster}) or in the signal (time, in our case) domain. Signal-domain (digital) convolutions are affordable if the length10 of the (digital) filters is small.11
Eq. \eqref{eq:minimal_filter} represents the minimal expression of a FIR filter (whose spectral shape is completely flat because it only has a single coefficient and, therefore, it attenuates all the frequencies equally). In general, the expression of a FIR filter in the time domain (applied to a signal that has previously been delayed \(d\) samples) is
where \(K\) is the number of coefficients of the filter \(\mathbf h\), and \({\mathbf h}_k\) is its \(k\)-th coefficient. When the (coefficients of the) filter are updated for each input sample, Eq. \eqref{eq:FIR_filter} can be rewritten as
In general, the filter does not need to be updated so often (e.g., once per second could be fast enough). In this case, Eq. \eqref{eq:adaptive_FIR_filter} becomes
where \(i\) represents the update iteration. If we decide to work with impulses, \({\mathbf h}^{(i)}\) is determined by the samples recorded by our soundcard \(d\) sample-times after the generation of each impulse, which should be the only source of sound. If there were other sources of sound (such as, for example, our voice), the updates should be discarded or averaged using some moving-average technique.
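A minimal (and deliberately unoptimized) sketch of how the FIR filtering of Eq. \eqref{eq:FIR_filter} could be computed, assuming that the filter coefficients \(\mathbf h\) and the delay \(d\) are already known (the function name is our own choice):
\begin{verbatim}
import numpy as np

def apply_fir(s, h, d):
    # Computes s_hat[t] = sum_k h[k] * s[t - d - k] for every sample of s.
    # Samples with a negative index (previous chunk) are taken as zero here.
    K = len(h)
    s_hat = np.zeros(len(s))
    for t in range(len(s)):
        for k in range(K):
            idx = t - d - k
            if idx >= 0:
                s_hat[t] += h[k] * s[idx]
    return s_hat
\end{verbatim}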
Finally, notice that we must send
The LMS algorithm [2, 1] was invented by professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural “network”.12 Using LMS, ADALINE is able to distinguish two (different) patterns, even using (only a part of13) a single neuron.
The LMS algorithm adapts the coefficients of \(\mathbf {h}\) to minimize the error signal defined by
Unfortunately, we don’t know \(\mathbf {n}\) except when we are in silence, where \(\mathbf {n}[t]=0\;\forall t\). When we are in silence, the error is
That said, LMS defines the update rule at the \((i+1)\)-th iteration as
where \(\mu \) is the learning rate14. These equations can be derived15 using the (steepest) gradient descent algorithm.
Obviously, the filter should only be updated when we are in silence :-/
Finally, notice that to use LMS, the number of coefficients of the filter, \(K\), must be the same as the number of input samples considered in the adaptation. Therefore, if we are working with chunks, at the moment we decide to update the filter we could consider, for example, only the \(K\) samples of the currently played chunk with indexes in \([d, d+K-1]\), if \(d+K\) is smaller than the chunk size. Otherwise (if \(d+K > \text {chunk~size}\)), at least two chunks should be concatenated.
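The following sketch applies the textbook LMS rule over one chunk; it assumes near-end silence (so the error is \({\mathbf m}[t]-\hat {\mathbf s}[t]\)), that \(\mathbf s\) is aligned with \(\mathbf m\) and keeps enough past samples, and that the update is the standard steepest-descent one (the function name and the chunk-wise organization are our own choices):
\begin{verbatim}
import numpy as np

def lms_update(h, s, m, d, mu):
    # h : current filter coefficients (length K).
    # s : played samples (aligned with m; s[t] was played when m[t] was recorded).
    # m : recorded chunk (assumed to contain only feedback: n[t] ~= 0).
    # d : feedback delay in samples; mu : learning rate.
    K = len(h)
    for t in range(len(m)):
        start = t - d - K + 1
        if start < 0:
            continue                       # not enough past played samples yet
        x = s[start:t - d + 1][::-1]       # s[t-d], s[t-d-1], ..., s[t-d-K+1]
        e = m[t] - np.dot(h, x)            # error = recorded - estimated feedback
        h = h + mu * e * x                 # steepest-descent (LMS) update
    return h
\end{verbatim}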
Ultimately, feedback in InterCom occurs because the audio signal we are generating is reproduced through our speakers (with a certain time lag). In this feedback situation, the spectrum of a newly received chunk is similar to the spectrum of a newly sent chunk. Therefore, if the chunk we want to send is pre-filtered (before being sent) using the opposite filter to that described by the spectrum of the played chunk, we should cancel the feedback.
Let \({\mathbf M} = {\mathcal F}({\mathbf m})\) be the Fourier transform of the \(i\)-th chunk captured by the soundcard, and let \({\mathbf S} = {\mathcal F}({\mathbf s})\) be the Fourier transform of the \(i\)-th played chunk. Considering the reasoning followed in the previous paragraph, and considering also Eq. \eqref{eq:simplest}, it can be concluded that
where \(\hat {\mathbf {N}} = \mathcal {F}(\hat {\mathbf {n}})\) is the Fourier transform of the chunk that should be sent in the \(i\)-th iteration (chunk).
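The exact expression depends on the adopted model; the sketch below assumes the simplest reading, in which the estimated feedback spectrum is \(a\,e^{-j2\pi k d/N}\,{\mathbf S}[k]\) (the frequency-domain counterpart of Eq. \eqref{eq:simplest}), and subtracts it from \(\mathbf M\) before sending (the function name is our own choice):
\begin{verbatim}
import numpy as np

def prefilter_chunk(m, s, a, d):
    # m: recorded chunk, s: played chunk (same length N).
    # Subtracts the assumed feedback spectrum a * exp(-j*2*pi*k*d/N) * S[k]
    # from M[k] and returns the time-domain chunk to send.
    N = len(m)
    M = np.fft.rfft(m)
    S = np.fft.rfft(s)
    k = np.arange(len(S))
    phase = np.exp(-2j * np.pi * k * d / N)   # a delay of d samples, in frequency
    N_hat = M - a * phase * S                 # estimated clean (near-end) spectrum
    return np.fft.irfft(N_hat, n=N)
\end{verbatim}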
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.
[3] J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.
[4] Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.
1And obviously, any other parent version of buffer.py.
2I.e., the same signal that would be captured by our mic if we were using a headset.
3Or frame, if we work in stereo.
4In a digital signal, the sample index indicates the position of the sample in the sequence of samples. If we also know the sample-time, i.e., the cadence of the sampler, we can also compute the time at which a sample was taken.
5We use the word “estimated” because in this model we are ignoring several factors, such as echoes, non-flat frequency responses, etc., that modify the version of \(\mathbf {s}\) that is received by the microphone.
6One that matches in time, i.e., that is synchronized.
7This technique is similar to that carried out by submarines when they use sonar to perform echo-location, or by bats to fly in the dark.
8The Fourier transform is a special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by the \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our near-end audioset because it can be considered a FIR (Finite Impulse Response) system (in the absence of an audio signal, the echo signal always decays with time). Finally, notice that \({\mathbf S}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf S\)) is a complex number (the Fourier coefficients are complex numbers).
9To compute the Fourier transform of digital signals, use a library such as numpy.fft.
10The number of coefficients.
11Keep in mind that convolution is an \(O^2\) operation and, therefore, we can only handle small filters in real time with our computers.
12See https://www.youtube.com/watch?v=hc2Zj55j1zU
13If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.
14High \(\mu \) values speed up the adaptation process, but can generate worse \(\mathbf {h}\) coefficients.
15Again, please, watch https://www.youtube.com/watch?v=hc2Zj55j1zU!
16The more working implementations, the higher the grade.