Feedback suppression

Vicente González Ruiz & Savins Puertas Martín & Marcos Lupión Lorente

March 23, 2025

1 The problem

One of the first problems we encounter with the use of the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s speaker reaches our mic(rophone) some time later and, some time after that, reaches our interlocutor (the “far-end” ... in this system we are the “near-end”) in the form of an echo (signal), which is reproduced by his/her speaker, captured again (some time later) by his/her mic, and sent back to us ... and so on, generating a rather unpleasant signal. In other words, if \(\mathbf s\) is the (analog) signal played by our (loud)speaker that reaches our mic, \(\mathbf n\) is the signal emitted by the near-end person (I) that reaches our mic2, and \(\mathbf m\) is the (mixed) signal recorded by our microphone, we have that

\begin{equation} m(t) = n(t) + s(t), \label {eq:echo_problem} \end{equation}

where \(m(t)\) is the analog audio signal that hits the membrane of our mic.

Our problem here is to minimize the energy of \(s(t)\), i.e., to make

\begin{equation} s(t)\approx 0. \end{equation}

2 The trivial (and definitive) solution

Use a headset. In this case,

\begin{equation} m(t) \approx n(t) \label {eq:headset_solution} \end{equation}

because \(s(t)\approx 0\).

3 The trivial (but limited) solution

Decrease the gain of the amplifier of your speaker to make (the energy of) \(s(t)\) as small as possible. Unfortunately, this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/

4 The “simplest” subtract solution

Let \(\mathbf m\) be the digital version of \(m(t)\) and \({\mathbf m}[t]\) its \(t\)-th sample3. In this solution, we send

\begin{equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - a{\mathbf s}[t-d], \label {eq:simplest} \end{equation}

where \(a\) is an attenuation (scalar) value, and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound from our speaker to our mic. We define

\begin{equation} \hat {\mathbf f}[t] = a{\mathbf s}[t-d] \end{equation}

as the estimated5 feedback signal.

Notice that we have used the symbol \(\tilde {}\) to highlight that \(\tilde {\mathbf n}\) will be an approximation of \(\mathbf n\).
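For example, a minimal NumPy sketch of Eq. \eqref{eq:simplest} could look as follows (the function name and the way the played samples are stored are illustrative, and \(a\) and \(d\) are assumed to be already estimated; see the Deliverables section):

    import numpy as np

    def delay_and_subtract(m, s_history, a, d):
        # m: recorded chunk (1-D array of L samples).
        # s_history: previously played samples (at least L + d of them),
        #            s_history[-1] being the most recently played one.
        # a, d: estimated attenuation and delay (in sample-times).
        L, N = len(m), len(s_history)
        f_hat = a * s_history[N - L - d : N - d]  # estimated feedback f^[t]
        return m - f_hat                          # n~[t] = m[t] - a s[t-d]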

5 Considering the frequency response of the near-end to estimate the feedback signal

We can improve the performance of the feedback cancellation process if we also take into consideration that the feedback signal that finally reaches our mic is the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, etc.) to an impulse signal \(\delta (t)\).6 In other words, we can compute

\begin{equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathbf s}*{\mathbf h})[t-d], \label {eq:using_convolution} \end{equation}

where \(*\) represents the convolution (in our case, of digital signals), and \(\mathbf h\) is the digitized (discrete + quantized) version of \(h(t)\), the response (echo) of the near-end audioset to the impact of \(\delta (t)\).

The convolution of digital signals in the time domain can be expensive (with computational complexity \(\mathcal{O}(N^2)\), where \(N\) is the number of elements to process) if the number of samples and/or filter coefficients is high. Fortunately, thanks to the convolution theorem [3, 4], the convolution can be replaced by an element-wise product (with complexity \(\mathcal{O}(N)\)) when we consider the signals in the frequency domain. Thanks to this, we can rewrite Eq. \eqref{eq:using_convolution} as

\begin{equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathcal F}^{-1}\{{\mathbf S}{\mathbf H}\})[t-d], \label {eq:faster} \end{equation}

where \(\mathbf S\) is the (digital) Fourier transform7 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse Fourier transform. Notice that all these transforms are applied to digital signals, and there exist fast algorithms to compute them (with complexity \(\mathcal{O}(N\log _2 N)\)).
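As an illustration, the following sketch (with hypothetical names, assuming a previously measured impulse response \(\mathbf h\)) estimates the feedback of a chunk in the frequency domain using numpy.fft. Zero-padding to \(L+K-1\) samples makes the circular convolution implemented by the transform-domain product behave like the linear one:

    import numpy as np

    def estimate_feedback_freq(s_chunk, h):
        # s_chunk: played samples (length L); h: impulse response (length K).
        n = len(s_chunk) + len(h) - 1                 # zero-padded length
        S = np.fft.rfft(s_chunk, n)                   # S = F{s}
        H = np.fft.rfft(h, n)                         # H = F{h}
        return np.fft.irfft(S * H, n)[:len(s_chunk)]  # (s*h) = F^{-1}{SH}

A real implementation should also handle the chunk boundaries (for example, with overlap-add), because the tail of each convolution spills into the next chunk.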

6 Estimation of the feedback signal using LMS (Least Mean Squares)

We have just seen how it is possible to find better estimations of the feedback signal \(\hat {\mathbf f}\) using convolutions. By definition, convolutions are performed by filters.

As we have seen, filters can be implemented in the frequency domain or in the signal (in our case, time) domain. Signal-domain (digital) convolutions are efficient when the length8 of the (digital) filters is small.9 For estimating the feedback signal at sample-time \(t\), \({\mathbf f}[t]\), the length of a FIR filter should be at least \(d\), because we need at least \(d\) samples to capture the feedback signal. Therefore, we have that

\begin{equation} \hat {\mathbf f}[t] = \sum _{k=0}^{d-1}{\mathbf h}_k^{(t)}{\mathbf s}[t-k], \label {eq:LMS_feedback} \end{equation}

where \(\mathbf h\) is the near-end impulse response in the time domain.

Also, as we have seen in the previous section, it is possible to adapt the filter to the acoustic conditions, measuring the echo generated by the impulse signal. In the time domain, one of the most widely used techniques for computing the coefficients of a FIR filter is the LMS (Least Mean Squares) algorithm [1, 2], among other reasons because the filter (coefficients) can be adapted to variations in the signal to filter (the filter can learn).

LMS was invented by Professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural network.10 Using LMS, ADALINE is able to distinguish between patterns, even using only part of a single neuron.11

LMS can be used to compute the coefficients of a filter to provide a desired output for a given input. In our context, the input signal is \(\mathbf m\) (the digital signal recorded by our soundcard) and the desired output signal is \(\mathbf n\) (the digital version of our voice). LMS iteratively computes

\begin{align} {\mathbf h}^{(i+1)}_k &= {\mathbf h}^{(i)}_k + 2\mu \tilde {\mathbf n}[i]{\mathbf s}[i-k] \label {eq:update} \\ \tilde {\mathbf n}[i] &= {\mathbf m}[i] - \hat {\mathbf f}[i], \end{align}

where \(i\) represents the iteration number, and \(\mu \) is the learning rate12. These equations can be found13 using the (steepest) gradient descent algorithm. Notice that we process the signals sample-by-sample: at iteration \(i\) we compute the \(i\)-th sample of \(\tilde {\mathbf n}\) (the signal without the feedback, supposedly containing only our voice), and this is the signal that we will send to the far-end in the next chunk.
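The following is a minimal sample-by-sample sketch of this process (Eqs. \eqref{eq:LMS_feedback} and \eqref{eq:update}); the function name and the value of \(\mu \) are illustrative, and the signals are assumed to be aligned floats normalized to \([-1,1]\):

    import numpy as np

    def lms_cancel(m, s, d, mu=1e-3):
        # m: recorded signal; s: played signal; d: FIR filter length.
        h = np.zeros(d)                          # adaptive FIR coefficients
        n_tilde = np.zeros(len(m))
        for i in range(d, len(m)):
            window = s[i - d + 1 : i + 1][::-1]  # s[i], s[i-1], ..., s[i-d+1]
            f_hat = h @ window                   # Eq. (eq:LMS_feedback)
            n_tilde[i] = m[i] - f_hat            # the "cleaned" sample
            h += 2 * mu * n_tilde[i] * window    # Eq. (eq:update)
        return n_tilde

In practice, \(\mu \) is usually normalized by the energy of the window (the NLMS variant) so that the adaptation speed does not depend on the volume of \(\mathbf s\).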

7 Filtering in the frequency domain

Assume that \(s\) (the signal received by our microphone from our speaker) has essentially the same frequency spectrum as the audio signal being played (i.e., \(h(t)=\delta (t)\), and therefore the environment does not modify the spectrum of the reproduced signal). We denote the signal reproduced by our speaker as \(f(t)\) (\(f\) for far-end). In this case, we can propose a feedback reduction system that applies to the captured signal \(\mathbf {s}\) a filter \(\mathbf {F}^{-1}\), equal to the “inverse” of the spectrum of \(\mathbf {f}\) (the signal we are reproducing). Specifically, this would involve:

  1. Calculating \(\mathbf {F}\), the Fourier transform of the reproduced audio signal \(\mathbf {f}\), chunk by chunk.
  2. Calculating \(\mathbf {F}^{-1}\), the “inverse” of \(\mathbf {F}\). If \(w_{\text {max}}\) is the frequency component of \(\mathbf {F}\) with the highest energy \(g_{\text {max}}\), i.e., \(\mathbf {F}[w_{\text {max}}]=g_{\text {max}}\), and \(w_{\text {min}}\) is the component with the lowest energy, \(\mathbf {F}[w_{\text {min}}]=g_{\text {min}}\), normalize its spectrum with:

    \begin{equation} \mathbf {F}_{\text {normalized}} = \frac {\mathbf {F} - g_{\text {min}}}{g_{\text {max}} - g_{\text {min}}}, \end{equation}
    and finally calculate:
    \begin{equation} \mathbf {F}^{-1} = 1 - \mathbf {F}_{\text {normalized}}. \end{equation}
  3. Filtering the signal captured by the microphone using \(\mathbf {F}^{-1}\), taking into account the attenuation factor \(a\), which relates how much of the reproduced signal is captured by the microphone (how much of \(\mathbf {f}\) is in \(\mathbf {s}\)), i.e., calculate:

    \begin{equation} \hat {\mathbf {N}} = a\mathbf {F}^{-1}\mathbf {S}, \end{equation}
    where \(\mathcal {F}(\mathbf {s})=\mathbf {S}\).
  4. Estimating the attenuation factor \(a\) for the next chunk. Initially, we assume that \(a=1\) (and therefore, that there is no attenuation). If it is an overestimation, then some frequency components in the sent audio could disappear, and if we underestimate, then some frequency components could be amplified. Therefore, the attenuation factor should be high enough so that the energy of \(\mathbf {s}\) does not increase progressively, but low enough so that some frequency components do not become 0.

The signal to send (without the feedback) is:

\begin{equation} \tilde {\mathbf n} = \mathcal {F}^{-1}(\hat {\mathbf N}) \end{equation}
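A chunk-by-chunk sketch of this procedure (with hypothetical names, using the magnitude of the Fourier coefficients as the energy of each component) could be:

    import numpy as np

    def spectral_suppress(f_chunk, s_chunk, a=1.0):
        # f_chunk: chunk being played; s_chunk: chunk captured by the mic.
        F = np.abs(np.fft.rfft(f_chunk))     # energy of each component of f
        S = np.fft.rfft(s_chunk)             # S = F{s}
        F_norm = (F - F.min()) / (F.max() - F.min() + 1e-12)  # steps 1-2
        F_inv = 1.0 - F_norm                 # the "inverse" filter
        N_hat = a * F_inv * S                # step 3: N^ = a F^{-1} S
        return np.fft.irfft(N_hat, len(s_chunk))  # the signal to send

(The \(10^{-12}\) term only avoids a division by zero in a silent chunk.)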

8 Deliverables

A Python module called feedback_cancellation.py that extends buffer.py (inheriting from its classes) and that implements at least14 one of the previously described solutions. More concretely:

  1. If you implement the “\(s(t)\) delay and subtract solution”, you must estimate \(a\) and \(d\), and perform the signal subtraction to remove the feedback signal (\(s(t)\)). For example, Skype finds \(d\) and \(a\) using a “call-signal” (a sequence of more-or-less tonal sounds): \(d\) is determined by measuring the propagation time of the call-signal between our speaker and our mic, and \(a\) by measuring the ratio between the energy of the call-signal played and the energy of the call-signal recorded.
  2. In a “\(s(t)\) delay and subtract solution”, if you also consider the frequency response of the near-end audioset to estimate a better feedback signal, first you will need to find \(\mathbf H\) (the discrete frequency response of your audioset). For this, the near-end speaker should generate an impulse signal \(\mathbf \delta \) and, in the absence of any other sound signal, record the echo and compute its Fourier transform (it is a good idea to repeat this process several times to obtain a better estimation of \(\mathbf H\); see the sketch after this list). Finally, notice that \({\mathbf H}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf H\)) is a complex number (the Fourier coefficients are complex numbers).

    Notice that the frequency characterization of the far-end audioset can also be used in a “\(n(t)\) delay and subtract solution”. Remember that filtering operations must be implemented with convolutions in the temporal domain, but are element-wise products in the frequency domain.

  3. Use LMS to find an estimate of the echo signal and perform the echo cancellation. In this case, consider the use of a thread to compute the coefficients of the LMS filter (Equation \eqref{eq:update} can run with a cadence much longer than a sample-time) and compute the estimated feedback signal (Equation \eqref{eq:LMS_feedback}) for every chunk. Notice that, to do that, we will require the first \(k\) samples of the next chunk.
  4. For computing the Fourier transform of digital signals use a library such as numpy.fft.
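For instance, assuming that several echoes of the impulse \(\mathbf \delta \) have already been recorded (all with the same length), the averaging and the transform mentioned in item 2 could be sketched as:

    import numpy as np

    def estimate_H(echo_recordings, n_fft):
        # echo_recordings: list of equal-length 1-D arrays, each one the
        # mic capture obtained after playing an impulse in a silent room.
        h = np.mean(np.stack(echo_recordings), axis=0)  # average the echoes
        return np.fft.rfft(h, n_fft)                    # H = F{h} (complex)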

Optionally (but interesting for your mark), use any other technique (for example, an artificial neural network15) for estimating the echo, and use it for removing the echo (obviously, in real time). Take also into consideration that the parameters that determine the estimation of the echo signal should be continuously16 monitored, because the physical composition of the audiosets can be dynamic (for example, the inclination of the screen of our laptop can be modified at any moment).

Finally, notice that the correlation operation can help to fine-tune \(d\), as the sketch below shows.
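For example (the function name and max_delay are illustrative), \(d\) can be estimated as the lag that maximizes the cross-correlation between the played and the recorded call-signal:

    import numpy as np

    def estimate_delay(played, recorded, max_delay):
        corr = np.correlate(recorded, played, mode="full")
        zero_lag = len(played) - 1           # index of lag 0 in "full" mode
        lags = corr[zero_lag : zero_lag + max_delay]  # non-negative lags
        return int(np.argmax(np.abs(lags)))  # the best candidate for d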

Mark: 10 points.

9 Resources

[1]   Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2]   S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.

[3]   J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.

[4]   Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.

1And obviously, any other parent version of buffer.py.

2I.e., the same signal that would be captured by our mic if I were using a headset.

3Or frame, if we work in stereo.

4In a digital signal, the sample index indicates the position of the sample in the sequence of samples. If we also know the sample-time, i.e., the cadence of the sampler, we can also compute the time at which a sample was taken.

5We use the word “estimated” because in this model we are ignoring several factors, such as echoes, non-flat frequency responses, etc., that modify the version of \(\mathbf {s}\) that is received by the microphone.

6This technique is similar to that carried out by submarines when they use sonar to perform echo-location, or by bats to fly in the darkness.

7The Fourier transform is a special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by the \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our near-end audioset because it can be considered a FIR (Finite Impulse Response) system (in the absence of an audio signal, the echo signal always decays with time).

8The number of coefficients.

9Keep in mind that convolution is an \(\mathcal{O}(N^2)\) operation and, therefore, our computers can only handle small filters in real time.

10See https://www.youtube.com/watch?v=hc2Zj55j1zU

11If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.

12High \(\mu \) values speed up the adaptation process, but can generate worse \(\mathbf {h}\) coefficients.

13Again, see https://www.youtube.com/watch?v=hc2Zj55j1zU!

14More working implementations, higher grade.

15ADALINE is the simplest ANN ever developed!

16A 1-second cadence should be enough.