Echo Cancellation

Vicente González Ruiz & Savins Puertas Martín & Marcos Lupión Lorente

November 1, 2024

1 The problem

One of the first problems we encounter with the use of the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s speaker reaches our mic(rophone) some time later and, some time after that, reaches our interlocutor (the “far-end” ... in the system we are the “near-end”) in the form of an echo (signal), which is reproduced by his/her speaker, can be captured again (some time later) by his/her mic and sent back to us ... and so on, generating a rather unpleasant signal. In other words, if \(\mathbf s\) is the (analog) signal played by our (loud)speaker that reaches our mic, \(\mathbf n\) is the signal emitted by the near-end person (me) that reaches our mic2, and \(\mathbf m\) is the (mixed) signal recorded by our microphone, we have that \begin {equation} m(t) = n(t) + s(t), \label {eq:echo_problem} \end {equation} where \(m(t)\) is the signal that hits the membrane of our mic.

Our problem here is to minimize the energy of \(s(t)\).

2 The trivial (and definitive) solution

Use a headset. In this case, \begin {equation} m(t) \approx n(t) \label {eq:headset_solution} \end {equation} because \(s(t)\approx 0\).

3 The trivial (but limited) solution

Decrease the gain of the amplifier of your speaker to make (the energy of) \(s(t)\) as low as possible. Unfortunately, this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/

4 The simplest subtract solution

Let \(\mathbf m\) be the digital version of \(m(t)\) and \({\mathbf m}[t]\) its \(t\)-th sample3. In this solution, we send \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - a{\mathbf s}[t-d], \label {eq:simplest} \end {equation} where \(a\) is an attenuation (scalar) value and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound from our speaker to our mic. We define \begin {equation} \hat {\mathbf e}[t] = a{\mathbf s}[t-d] \end {equation} as the estimated echo signal.
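A minimal sketch of this solution, assuming NumPy signals and made-up values for \(a\) and \(d\) (their estimation is discussed in the Deliverables section; the function name is hypothetical):

import numpy as np

def cancel_echo_subtract(m, s, a, d):
    # Estimate the echo as a*s[t-d] and subtract it from m.
    # m: recorded (mixed) signal; s: signal played by the speaker;
    # a: attenuation (scalar); d: delay measured in samples.
    e_hat = np.zeros_like(m)
    e_hat[d:] = a * s[:len(s) - d]  # delayed and attenuated copy of s
    return m - e_hat                # tilde{n}[t] = m[t] - a*s[t-d]

# Example with synthetic signals (a and d are purely illustrative).
s = np.random.randn(1024)
n = np.random.randn(1024)
a, d = 0.6, 40
m = n.copy()
m[d:] += a * s[:len(s) - d]         # simulate the echo path
n_tilde = cancel_echo_subtract(m, s, a, d)
print(np.allclose(n_tilde, n))      # True: the echo is removed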

5 Considering the frequency response of the near-end to estimate the echo signal

We can improve the performance of the echo cancellation process if we also take into consideration that the echo signal that finally reaches our mic is the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, etc.) to an impulse signal \(\delta (t)\).5 In other words, we can compute \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathbf s}*{\mathbf h})[t-d], \label {eq:using_convolution} \end {equation} where \(*\) represents the convolution (in our case, between digital signals), and \(\mathbf h\) is the digitized (discrete + quantized) version of \(h(t)\), the response of the near-end audioset to the impact of \(\delta (t)\).

The convolution of digital signals in the time domain can be expensive (with computational complexity \(O(N^2)\), where \(N\) is the number of samples) if the number of samples is high. Fortunately, thanks to the convolution theorem [3, 4], the convolution can be replaced by a sample-wise (element-by-element) product (with complexity \(O(N)\)) when we consider the signals in the frequency domain. Thanks to this, we can rewrite Eq. \eqref{eq:using_convolution} as \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathcal F}^{-1}\{{\mathbf S}{\mathbf H}\})[t-d], \label {eq:faster} \end {equation} where \(\mathbf S\) is the (digital) Fourier transform6 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse Fourier transform. Notice that all these transforms are applied to digital signals, and that there exist fast algorithms to compute them (\(O(N\log _2 N)\)).
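To illustrate why the frequency domain pays off, a sketch of the frequency-domain convolution using NumPy’s FFT (note the zero-padding to len(s) + len(h) - 1 samples, which turns the circular convolution implemented by the DFT into the linear one we need):

import numpy as np

def fft_convolve(s, h):
    # Linear convolution of s and h computed in the frequency domain.
    N = len(s) + len(h) - 1
    S = np.fft.rfft(s, N)           # Fourier transform of s (zero-padded)
    H = np.fft.rfft(h, N)           # Fourier transform of h (zero-padded)
    return np.fft.irfft(S * H, N)   # F^{-1}{S.H}

s = np.random.randn(4096)
h = np.random.randn(128)            # impulse response of the near-end audioset
print(np.allclose(fft_convolve(s, h), np.convolve(s, h)))  # True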

6 Estimation of the echo signal using LMS (Least Mean Squares)

We have just seen how it is possible to find better estimations of the echo signal \(\mathbf e\) using convolutions. By definition, convolutions are performed by filters.

Filters can be implemented in the frequency domain (see the previous section) or in the signal (time, in our case) domain. Signal-domain (digital) convolutions are efficient when the length7 of the (digital) filters is small.8 For estimating the echo signal at the sample-time \(t\), \(\hat {\mathbf e}^{(t)}\), the length of a FIR filter should be at least \(d\), because we need at least \(d\) samples to detect the echo signal. Therefore, we have that \begin {equation} \hat {\mathbf e}^{(t)} = \sum _{k=0}^{d-1}{\mathbf h}_k^{(t)}{\mathbf s}[t-k], \end {equation} where \(\mathbf h\) is the near-end impulse response in the time domain.
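The sum above is simply a dot product between the filter coefficients and the last \(d\) speaker samples; a direct transcription in NumPy (the names are hypothetical):

import numpy as np

def estimate_echo_sample(h, s, t):
    # hat{e}^{(t)} = sum_{k=0}^{d-1} h[k]*s[t-k], with d = len(h)
    # and t >= len(h) - 1.
    d = len(h)
    window = s[t - d + 1 : t + 1][::-1]  # s[t], s[t-1], ..., s[t-d+1]
    return np.dot(h, window)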

Also, as we have seen in the previous section, it is possible to adapt the filter to the acoustic conditions, measuring the echo generated by the impulse signal. In the time domain, one of the most used techniques for computing the coefficients of a FIR filter is the LMS (Least Mean Squares) algorithm [2, 1], among other reasons because the filter (coefficients) can be adapted to variations in the signal to filter (the filter can learn).

LMS was invented by professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural network.9 Using LMS, ADALINE is able to distinguish between patterns, even using only part of a single neuron10.

LMS can be used to compute the coefficients of a filter that provides a desired output for a given input. In our context, the input signal is \(\mathbf m\) (the digital signal recorded by our soundcard) and the desired output signal is \(\mathbf n\) (the digital version of our voice). LMS iteratively computes

\begin {eqnarray} {\mathbf h}^{(i+1)}_k & = & {\mathbf h}^{(i)}_k + 2\mu \tilde {\mathbf n}[i]{\mathbf s}[i-k] \\ \tilde {\mathbf n}[i] & = & {\mathbf m}[i] - \hat {\mathbf e}^{(i)}, \end {eqnarray}

where \(i\) represents the iteration number, and \(\mu \) is the learning rate11. These equations can be derived12 using the (steepest) gradient descent algorithm. Notice that we process the signals sample-by-sample: at iteration \(i\) we compute the \(i\)-th sample of \(\tilde {\mathbf n}\) (the signal without the echo, supposedly containing only our voice), and this is the signal that we will send to the far-end in the next chunk.
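A minimal sample-by-sample sketch of this adaptive scheme, assuming NumPy signals; the filter length and \(\mu \) values are purely illustrative:

import numpy as np

def lms_echo_cancel(m, s, d=64, mu=0.01):
    # m: recorded (mixed) signal; s: signal played by the speaker.
    # Returns tilde{n}, the estimation of our voice without the echo.
    h = np.zeros(d)                          # filter coefficients, adapted on the fly
    n_tilde = np.zeros_like(m)
    for i in range(d, len(m)):
        window = s[i - d + 1 : i + 1][::-1]  # s[i], s[i-1], ..., s[i-d+1]
        e_hat = np.dot(h, window)            # hat{e}^{(i)}: estimated echo
        n_tilde[i] = m[i] - e_hat            # tilde{n}[i] = m[i] - hat{e}^{(i)}
        h += 2 * mu * n_tilde[i] * window    # h_k += 2*mu*tilde{n}[i]*s[i-k]
    return n_tilde

With a suitable \(\mu \) and a stationary echo path, \(\mathbf h\) converges towards the near-end impulse response and \(\tilde {\mathbf n}\) towards \(\mathbf n\).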

7 Deliverables

A Python module called echo_cancellation.py that inherits from buffer.py and that implements at least13 one of the previously described solutions. More concretely:

  1. If you implement the “\(s(t)\) delay and subtract solution”, you must estimate \(a\) and \(d\), and perform the signal subtraction to remove the far-end echo signal (\(s(t)\)). For example, Skype finds \(d\) and \(a\) using a “call signal” (a sequence of more-or-less tonal sounds): \(d\) is determined by measuring the propagation time of the call signal between our speaker and our mic, and \(a\) by measuring the ratio between the energy of the played call signal and the energy of the recorded call signal.

    At the same time you can also implement an “\(n(t)\) delay and subtract solution”, where you have to estimate \(a\) for your voice \(n(t)\), delayed by the buffering time + the average RTT + \(d\), at the far-end. The average RTT can be estimated using14 a local clock, storing in a list when the last \(N\) chunks were sent, and sending with each chunk a new field in the packet header with the number of the last chunk that was received. Thus, when we receive the \(n\)-th chunk15, we can copy this number into the next sent chunk, and the receiver will be able to find the RTT by subtracting the corresponding sending time stored in the list from the reception time (see the sketch after this list).

  2. In an “\(s(t)\) delay and subtract solution”, if you also consider the (discrete) frequency response of the near-end audioset to estimate a better echo signal, you will first need to find \(\mathbf H\) (the discrete frequency response of your audioset). For this, the near-end speaker should generate an impulse signal \(\mathbf \delta \) and, in the absence of any other sound signal, we record the echo and compute its Fourier transform (it is a good idea to repeat this process several times to obtain a better estimation of \(\mathbf H\)). Finally, notice that \({\mathbf H}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf H\)) is a complex number (the Fourier coefficients are complex numbers).

    Notice that the frequency characterization of the far-end audioset can also be used in an “\(n(t)\) delay and subtract solution”. Remember that filtering operations must be implemented as convolutions in the temporal domain, but as products in the frequency domain.

  3. Use LMS to find an estimation of the echo signal and perform the echo cancellation. In this case, only an “\(s(t)\) delay and subtract solution” should be considered, because the required filters could be too long to run in real-time in the time domain.
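Regarding the RTT estimation described in the first deliverable, a minimal sketch, assuming that each chunk header carries its own chunk number and the number of the last chunk received (all names are hypothetical):

import time

class RTTEstimator:
    # Estimates the average RTT from echoed chunk numbers: each sent
    # chunk is time-stamped, and when the far-end echoes a chunk number
    # back, the RTT is the time elapsed since that chunk was sent.
    def __init__(self, N=32):
        self.sent_times = {}   # chunk number -> local sending time
        self.rtts = []         # last measured RTTs
        self.N = N             # number of RTT samples to average

    def chunk_sent(self, chunk_number):
        self.sent_times[chunk_number] = time.time()

    def chunk_echoed(self, echoed_chunk_number):
        sent = self.sent_times.pop(echoed_chunk_number, None)
        if sent is not None:
            self.rtts = (self.rtts + [time.time() - sent])[-self.N:]

    def average_RTT(self):
        return sum(self.rtts) / len(self.rtts) if self.rtts else 0.0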

Optionally, but interesting for your mark, use any other technique (for example, an artificial neural network16) for estimating the echo, and use it for removing the echo (obviously, in real-time). Take also into consideration that the parameters that determine the estimation of the echo signal should be continuously17 monitored, because the physical composition of the audiosets can be dynamic (for example, the inclination of the screen of our laptop can be modified at any moment).

Finally, notice that the correlation between signals can help to fine-tune \(d\), as sketched below.
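For example, the lag that maximizes the cross-correlation between the recorded and the played signals is an estimation of \(d\); a sketch using NumPy:

import numpy as np

def estimate_delay(m, s):
    # The lag that maximizes the cross-correlation of m and s.
    correlation = np.correlate(m, s, mode="full")  # lags -(len(s)-1)..len(m)-1
    return int(np.argmax(correlation)) - (len(s) - 1)

s = np.random.randn(1000)
m = np.zeros(1100)
m[40:1040] = 0.6 * s         # simulate an echo delayed by 40 samples
print(estimate_delay(m, s))  # 40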

8 Resources

[1]   Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[2]   S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.

[3]   J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.

[4]   Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.

1And obviously, any other parent version of buffer.py.

2I.e., the same signal that would be captured by our mic if I were using a headset.

3Or frame, if we work in stereo.

4In a digital signal, the sample index indicates the position of the sample in the sequence of samples. If we also know the sampling period, i.e., the cadence of the sampler, we can also compute at which time each sample was taken.

5This action is similar to that carried out by submarines when they use the sonar to perform echo-location, or by bats.

6The Fourier transform is a special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by the \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our local-end audioset because it can be considered a FIR (Finite Impulse Response) system (in the absence of an audio signal, the echo always decays with time).

7The number of coefficients.

8Keep in mind that convolution is an \(O(N^2)\) operation and, therefore, we can only handle small filters in real-time with our computers.

9See https://www.youtube.com/watch?v=hc2Zj55j1zU

10If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.

11High \(\mu \) values speed up the adaptation process, but can generate worse \(\mathbf h\) coefficients.

12Again, see https://www.youtube.com/watch?v=hc2Zj55j1zU!

13The more working implementations, the higher the mark.

14Remember that ping only works between public devices.

15Remember that the chunks have a chunk number in the packet header.

16ADALINE is the simplest ANN ever developed!

17A 1-second cadence should be enough.