Feedback suppression

Vicente González Ruiz & Savins Puertas Martín & Juan José Moreno Riado

February 21, 2026

In this milestone, we’ll address the problem that arises when the signal emitted by our speakers is captured by our microphone, generating an unpleasant feedback that hinders our conversation (using InterCom). First, we’ll formalize the problem and then we’ll explore different solutions, with varying effectiveness and computational requirements.

1 The problem

One of the first problems we encounter with the use of the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s (loud)speaker reaches our mic(rophone) some time later and, some time after that, it reaches our interlocutor (the “far-end” ... in the system, we are the “near-end”) in the form of an echo (signal) of his/her own voice. That echo is reproduced by his/her speaker, can be captured again (some time later) by his/her mic and sent back to us ... and so on, generating a rather unpleasant feedback signal.

To formalize the problem, let’s define:

  1. \(s\) the (analog) signal played by our speaker and that reaches our mic,
  2. \(n\) the signal emitted by the near-end person (that’s me) that reaches my mic2, and
  3. \(m\) the (mixed) signal recorded by my microphone.

In this situation, we have that

\begin{equation} m(t) = n(t) + s(t), \label {eq:echo_problem} \end{equation}

where \(m(t)\) is the analog audio signal that makes the membrane of our microphone oscillate.

Our problem here is to minimize the energy of \(s(t)\), i.e., to make

\begin{equation} s(t) = 0. \end{equation}

2 The trivial (and most effective) solution

Use a headset. In this case,

\begin{equation} m(t) \approx n(t) \label {eq:headset_solution} \end{equation}

because \(s(t)\approx 0\).

3 The trivial (but impractical) solution

Decrease the gain of your speaker’s amplifier to make (the energy of) \(s(t)\) as small as possible. Unfortunately, this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/

4 The “simplest” (but ineffective) solution

Let \(\mathbf m\) be the digital version of \(m(t)\), and \({\mathbf m}[t]\) its \(t\)-th sample3. In this solution, we send

\begin{equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - a{\mathbf s}[t-d], \label {eq:simplest} \end{equation}

where \(a\) is an attenuation (scalar) value, and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound waves from our speaker to our mic. We define

\begin{equation} \hat {\mathbf s}[t] = a{\mathbf s}[t-d] \label {eq:minimal_filter} \end{equation}

as the estimated5 feedback signal that reaches our microphone at the same instant of time at which the sample \({\mathbf n}[t]\) would have been captured in the absence of the feedback.

Notice that the notation \(\tilde {\cdot }\) has been used to highlight that \(\tilde {\mathbf n}\) is an approximation of \(\mathbf n\) (our sampled true-voice signal), and the notation \(\hat {\cdot }\) to emphasize that \(\hat {\mathbf s}\) is a registered6 prediction of \(s\) reaching our microphone. Notice also that, if \({\mathbf s}[0]\) is the first sample of a chunk (the \(c\)-th chunk), the sample \({\mathbf s}[-d]\) could belong to the previous chunk (the \((c-1)\)-th chunk).

Finally, \(a\) should be chosen considering that, in the absence of voice at each end, \(s(t)\approx 0\). For example, Skype estimates \(d\) and \(a\) using a “call-signal” (a sequence of more-or-less tonal sounds): \(d\) is determined by measuring the propagation time of the call-signal between our speaker and our mic.
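A minimal sketch of Eq. \eqref{eq:simplest} in NumPy (the function name suppress_feedback and the values of \(a\) and \(d\) are illustrative assumptions, not part of InterCom):

```python
import numpy as np

def suppress_feedback(m, s, a=0.8, d=480):
    """Subtract an attenuated and delayed copy of the speaker signal s
    from the mic signal m (Eq. eq:simplest). a (attenuation) and d
    (delay in samples) are assumed, illustrative values."""
    # Build a*s[t-d]: prepend d zeros and truncate to the length of m.
    s_hat = a * np.concatenate((np.zeros(d), s))[:len(m)]
    return m - s_hat
```

If \(a\) and \(d\) were exact and the path were a pure delay-and-attenuation, the result would equal \(\mathbf n\); in practice (as discussed below) they are only approximations.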

Why is it difficult to make this work?

This algorithm is ineffective because:

  1. \(d\) can change over time (for example, if we’re using a laptop and we move the screen, \(d\) will vary). Re-estimating \(d\) can be costly because, even using correlation computed in the Fourier domain, it is a heavy operation to perform in real-time.
  2. Assuming that the signal captured by the microphone is \(a{\mathbf s}[t-d]\) is an oversimplification: the captured signal is only similar to it (it has been filtered and likely contains echoes).
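For illustration, \(d\) could be estimated from a cross-correlation computed in the Fourier domain. The following sketch (the function name estimate_delay is an assumption) shows the idea for a single, static delay; a real implementation would have to track \(d\) over time:

```python
import numpy as np

def estimate_delay(s, m):
    """Estimate the delay d (in samples) between the played signal s and
    the mic signal m, via FFT-based cross-correlation: the lag with the
    maximum correlation is taken as d."""
    N = len(s) + len(m) - 1                  # zero-pad to avoid wrap-around
    S = np.fft.rfft(s, N)
    M = np.fft.rfft(m, N)
    xcorr = np.fft.irfft(M * np.conj(S), N)  # cross-correlation of m and s
    return int(np.argmax(xcorr))             # lag of the correlation peak
```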

5 Considering the frequency response of the near-end to estimate the feedback signal

We can improve the performance of the previous feedback cancellation solution if we also take into consideration that the feedback signal that finally reaches our microphone is (at least in part) the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, ...) to an impulse signal \(\delta (t)\).7 In other words, we can modify Eq. \eqref{eq:simplest} to compute

\begin{equation} \tilde {\mathbf {n}}[t] = \mathbf {m}[t] - (\mathbf {h}\ast \mathbf {s}[t-d]) = \mathbf {m}[t] - (\mathbf {h}\ast \mathbf {s})[t-d], \label {eq:using_convolution} \end{equation}

where \(\ast \) represents the (digital) convolution between digital signals, and \(\mathbf h\) is the digitized version of \(h(t)\).

The convolution of digital signals in the time domain can be expensive (with computational complexity \(O^2\), where \(O\) is the number of elements to process) if the number of samples and/or filter coefficients is high. Fortunately, thanks to the convolution theorem [3, 4], the convolution can be replaced by an element-wise product (with complexity \(O\)) when we consider the signals in the frequency domain. Thanks to this, we can rewrite Eq. \eqref{eq:using_convolution} as

\begin{equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathcal F}^{-1}\{{\mathbf S}{\mathbf H}\})[t-d], \label {eq:faster} \end{equation}

where \(\mathbf S\) is the (digital) Fourier transform8 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform9 of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse (digital) Fourier transform. Notice that all these transforms are applied to digital signals, and there exist fast algorithms (with complexity \(O\log _2O\)).
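As a quick illustrative check of this equivalence, the following sketch compares time-domain convolution with frequency-domain element-wise multiplication (the toy impulse response \(\mathbf h\) is made up, not a real room response):

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(1024)          # speaker signal (one chunk)
h = np.array([0.6, 0.3, 0.1])          # toy (assumed) impulse response

# Time-domain convolution: quadratic cost in the signal/filter lengths.
slow = np.convolve(s, h)

# Frequency domain: transform, multiply element-wise, inverse transform.
N = len(s) + len(h) - 1                # zero-pad to avoid circular wrap
fast = np.fft.irfft(np.fft.rfft(s, N) * np.fft.rfft(h, N), N)

assert np.allclose(slow, fast)         # both methods agree
```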

6 Practical issues

Unfortunately, even when we expect this improved feedback suppression algorithm to perform better than the previous one, the computation of the filter weights \(\mathbf h\) requires emitting impulses that can be heard by the user.

7 Adaptive filtering using LMS (Least Mean Squares)

The LMS algorithm [2, 1] was invented by professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural “network”.10 Using LMS, ADALINE is able to distinguish between two (different) patterns, even using (only a part of11) a single neuron.

LMS finds \(\mathbf {h}\) without using impulses. Considering that the signal that we should send to the far-end is

\begin{equation} \tilde {\mathbf {n}}[t] = \mathbf {m}[t] - \sum _{k=0}^{K-1}{\mathbf h}_k{\mathbf s}[t-k-d], \label {eq:FIR_filter} \end{equation}

LMS minimizes the (expected squared) error

\begin{equation} \mathbf {e}[t] = \tilde {\mathbf {n}}[t] - \mathbf {n}[t] \end{equation}

using the update rule

\begin{equation} \mathbf {h}^{(i+1)}_k = \mathbf {h}^{(i)}_k + 2\mu \mathbf {e}[t]{\mathbf {s}}[t-k-d], \label {eq:update} \end{equation}

where \(\mu \) is the learning rate12 and \(i\) represents the \(i\)-th update iteration. These equations can be derived13 using the (steepest) gradient descent algorithm.
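A per-sample sketch of this adaptive filter follows (the function name lms_cancel and the values of K, d and mu are illustrative assumptions). Note that the update is only valid while the near-end is silent (\(\mathbf n \approx \mathbf 0\)), in which case \(\mathbf e[t] \approx \tilde{\mathbf n}[t]\):

```python
import numpy as np

def lms_cancel(m, s, K=8, d=0, mu=0.01):
    """Plain LMS sketch of Eqs. (eq:FIR_filter) and (eq:update).
    m: mic signal, s: speaker signal, K: number of filter taps,
    d: (known) delay in samples, mu: learning rate.
    Returns the residual n_tilde and the learned coefficients h.
    Assumes near-end silence, so the error e[t] equals n_tilde[t]."""
    h = np.zeros(K)
    n_tilde = np.zeros(len(m))
    for t in range(len(m)):
        # Input vector (s[t-d], s[t-1-d], ..., s[t-K+1-d]), zero-padded.
        x = np.array([s[t - k - d] if t - k - d >= 0 else 0.0
                      for k in range(K)])
        n_tilde[t] = m[t] - h @ x       # Eq. (eq:FIR_filter)
        h += 2 * mu * n_tilde[t] * x    # Eq. (eq:update), e[t] = n_tilde[t]
    return n_tilde, h
```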

Unfortunately

Yes, unfortunately, we can only compute \(\mathbf {h}\) if we have \(\hat {\mathbf {s}}\) (the signal played by the speaker that is recorded by our mic), i.e., we must be silent while updating. Therefore, the performance of this algorithm depends on the performance of an auxiliary technique that determines whether there is some source of sound at the near-end.

8 Taking turns speaking is the smartest thing to do

Let’s recap. Our problem is that, without a headset, the signal that we send is

\begin{equation} \mathbf {m}[t] = \mathbf {n}[t] + \hat {\mathbf {s}}[t] \end{equation}

where \(\hat {\mathbf {s}}\) is an approximated-and-delayed version of \(\mathbf {s}\), the signal played by our speaker(s). If we were able to make \(\hat {\mathbf {s}}=\mathbf {0}\), our feedback problem would vanish ... at least, theoretically.

But wait ... as a polite person, I don’t speak when I am listening to my interlocutor, so I could set \(\hat {\mathbf {s}}=\mathbf {0}\) when I am speaking, because my interlocutor should do the same (be silent while I speak)!

Therefore, this solution (like the previous one) needs some technique to detect when I’m speaking. The advantage here is that we only need to switch off (or at least sufficiently decrease) the volume of our speaker, something that barely consumes computational resources. Notice that the “voice detector” algorithm can run in a separate thread/process.
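A minimal sketch of such a detector, assuming a simple energy threshold (the names near_end_active and playback_gain, and the threshold value, are made up for illustration; real detectors are usually more elaborate):

```python
import numpy as np

def near_end_active(chunk, threshold=0.01):
    """Energy-based voice detection sketch: report near-end activity
    when the average power of the mic chunk exceeds a hand-tuned
    threshold (an assumed value; it depends on the mic and the gain)."""
    return np.mean(chunk.astype(np.float64) ** 2) > threshold

def playback_gain(chunk, threshold=0.01):
    """Mute the speaker (gain 0.0) while we are talking; otherwise
    play the far-end signal at full gain (1.0)."""
    return 0.0 if near_end_active(chunk, threshold) else 1.0
```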

9 Deliverables

  1. Write a Python module, called feedback_supression.py, that inherits from buffer.py and implements the last proposed solution (decrease \(s(t)\) when we are speaking). Define a new command-line parameter to switch on this functionality, which should be disabled (off) by default.
  2. Insert the code of feedback_supression.py in a Jupyter notebook cell and generate the file feedback_supression.py using %%writefile feedback_supression.py. The notebook should also include the names of the members of the working group and an example of how to use your implementation.
  3. Notice that, for testing your implementation, it is essential that your operating system does not cancel the echo. If this is not the case, you should perform the tests in Linux.

10 Resources

[1]   Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2]   S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.

[3]   J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.

[4]   Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.

1And obviously, any other parent version of buffer.py.

2I.e., the same signal that would be captured by our mic if I were using a headset.

3Or frame, if we work in stereo.

4In a digital signal, the sample index indicates the position of the sample in the sequence of samples. If we also know the sample-time, i.e., the cadence of the sampler, we can also compute at which time each sample was taken.

5We use the word “estimated” because in this model we are ignoring several factors (echoes, non-flat frequency responses, etc.) that modify the version of \(\mathbf {s}\) that is received by the microphone.

6That is, matching in time; synchronized.

7This technique is similar to that carried out by submarines when they use sonar to perform echo-location, or by bats to fly in the darkness.

8The Fourier transform is a special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by the \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our near-end audioset because it can be considered a FIR (Finite Impulse Response) system (in the absence of an audio signal, the echo signal always decays with time). Finally, notice that \({\mathbf S}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf S\)) is a complex number (the Fourier coefficients are complex numbers).

9For computing the Fourier transform of digital signals use a library such as numpy.fft.

10See https://www.youtube.com/watch?v=hc2Zj55j1zU

11If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.

12High \(\mu \) values speed up the adaptation process, but can generate worse \(\mathbf {h}\) coefficients.

13Again, please, watch https://www.youtube.com/watch?v=hc2Zj55j1zU!