One of the first problems we encounter with the use of the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s speaker reaches our mic(rophone) some time later, and more some time later, that sound reaches our interlocutor (the “far-end” ... in the system we are the “near-end”) in the form of an echo (signal), which is reproduced by his/her speaker, which can be captured again (some time later) by his/her mic and sent it back to us ... and so on, generating a rather unpleasant signal. In other words, if \(\mathbf s\) is the (analog) signal played by our (loud)speaker and that reaches our mic, \(\mathbf n\) is the signal emited by the near-end person (me) that reaches our mic2, and \(\mathbf m\) is the (mixed) signal recorded by our microphone, we have that \begin {equation} m(t) = n(t) + s(t), \label {eq:echo_problem} \end {equation} where \(m(t)\) is the signal that hits the membrane of our mic.
Our problem here is to minimize the energy of \(s(t)\).
Use a headset. In this case, \begin {equation} m(t) \approx n(t) \label {eq:headset_solution} \end {equation} because \(s(t)\approx 0\).
Decrease the gain of the amplifier of your speaker to do (the energy of) \(s(t)\) as low as possible. Unfortunately this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/
Lets \(\mathbf m\) the digital version of \(m(t)\) and \({\mathbf m}[t]\) it’s \(t\)-th sample3. In this solution, we send \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - a{\mathbf s}[t-d], \label {eq:simplest} \end {equation} where \(a\) is an attenuation (scalar) value and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound from our speaker to our mic. We define \begin {equation} \hat {\mathbf e}[t] = a{\mathbf s}[t-d] \end {equation} as the estimated echo signal.
We can improve the performance of the echo cancellation process if we take also into consideration that the echo signal that finally reaches our mic is the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, etc.) to a impulse signal \(\delta (t)\).5 In other words, we can compute \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathbf s}*{\mathbf h})[t-d], \label {eq:using_convolution} \end {equation} where \(*\) represents the convolution between (in our case of) digital signals, and \(\mathbf h\) is the digitalized (discrete + quantized) version of \(h(t)\), the response of the near-end audioset to the impact of \(\delta (t)\).
The convolution of digital signals in the time domain can be expensive (with computational complexity \(O^2\)) if the number of samples is high. Fortunately, thanks to the theorem of the convolution of signals in the frequency domain [3, 4], the convolution can be replaced by the dot product (with \(O\)), when we consider the signals in the frequency domain. Thanks to this, we can rewrite the Eq. \eqref{eq:using_convolution} as \begin {equation} \tilde {\mathbf n}[t] = {\mathbf m}[t] - ({\mathcal F}^{-1}\{{\mathbf S}{\mathbf H}\})[t-d], \label {eq:faster} \end {equation} where \(\mathbf S\) is the (digital) Fourier transform6 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse Fourier transform. Notice that all these transforms are applied to digital signals, and there exist fast algorithms (\(O\log _2O\)).
We just have seen how it is possible to find better estimations of the echo signal \(\mathbf e\) using convolutions. By definition, convolutions are performed by filters.
Filters can be implemented in the frequency domain (see the previous section) or in the signal (time in our case) domain. Signal domain (digital) convolutions are efficient when the length7 of the (digital) filters is small.8 For estimating the echo signal at the sample-time \(t\), \({\mathbf e}^{(t)}\), the length of a FIR filter should be at least \(d\), because at least we need \(d\) samples to detect the echo signal. Therefore, we have that \begin {equation} \hat {\mathbf e}^{(t)} = \sum _{k=0}^{d-1}{\mathbf h}_k^{(t)}{\mathbf s}[t-k], \end {equation} where \(\mathbf h\) is the near-end impulse response in the time domain.
Also, as we have seen in the previous section, it is possible to adapt the filter to the acoustic conditions, measuring the echo generated by the impulse signal. In the time domain, one of the most used techniques for computing the coefficients of a FIR filter is the LMS (Least Mean Squares) algorithm [2, 1], among other reasons because the filter (coefficients) can be adapted to variations in the signal to filter (the filter can learn).
LMS was invented by professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural network.9 Using LMS, ADALINE is able to distinguish between patterns, even using a part of a single neuron10.
LMS can be used to compute the coefficients of a filter to provide a desired output for a given input. In our context, the input signal is \(\mathbf m\) (the digital signal recorded by our soundcard) and the desired output signal is \(\mathbf n\) (the digital version of our voice). LMS, iteratively computes
where \(i\) represents the iteration number, and \(\mu \) is the learning rate11. These equations can be found12 using the (steepest) gradient descend algorithm. Notice that we process the signals sample-by-sample (at iteration \(i\) we compute the \(i\)-th sample of \(\tilde {\mathbf n}\) (the signal without the echo, supposely containing only our voice), and this is the signal that we will send to the far-end in the next chunk).
A Python module called echo_cancellation.py that inherits from buffer.py and that implements at least13 one of the previously described solutions. More concretely:
If you implement the “\(s(t)\) delay and subtract solution”, you must estimate \(a\), \(d\) and perform the signals subtraction to remove the echo far-end signal (\(s(t)\)). For example, Skype finds \(d\) and \(a\) using a “call signal” (a sequence of more-or-less tonal sounds). \(d\) is determined measuring the propagation time of the call signal between our speaker and our mic, and \(a\) measuring the ratio between the energy of the call signal played and the energy of the call signal recorded.
At the same time you can also implement a “\(n(t)\) delay and subtract solution”, where you have to estimate \(a\) for your voice \(n(t)\) delayed the buffering time + average RTT + \(d\) at the far-end. The average RTT can be estimated using14 a local clock, storing in a list when the last \(N\) chunks were sent, and sending with each chunk a new field in the packet header with the last chunk number that was received. Thus, when we receive the chunk \(n\)-th15, we can copy this number in the next sent chunk and the receiver will be able to find the RTT subtracting to the time reception the corresponding time in the list.
In a “\(s(t)\) delay and subtract solution”, if you also consider the (discrete) frequency response of the near-end audioset to estimate a better echo signal, first you will need to find \(\mathbf H\) (the discrete frequency response of your audioset). For this, the near-end speaker should generate an impulse signal \(\mathbf \delta \), and in absence of any other sound signal, record the echo and compute its Fourier transform (it is a good idea to repeat this process several times to obtain a better estimation of \(\mathbf H\)). Finally, notice that \({\mathbf H}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf H\)) is a complex number (the Fourier coefficients are complex numbers).
Notice that the frequency characterization of the far-end audioset can be also used in a “\(n(t)\) delay and subtract solution”. Remember that filtering operations must be implemented with convolutions in the temporal domain, but are products in the frequency domain.
Optionally, but interesting for your mark, use any other technique (for example, an artificial neural network16) for estimating the echo, and use it for removing the echo (obviously, in real-time). Take also into consideration that the parameters that determine the estimation of the echo signal should be continously17 monitored becase the physical composition of the audiosets can be dynamic (for example, the inclination of the screen of our laptop can be modified at any moment).
Finally, notice that signals correlation operation can help to fine-tune \(d\).
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.
[3] J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.
[4] Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.
1And obviously, any other parent version of buffer.py.
2I.e., the same signal that would be captured by our mic if I were using a headset.
3Or frame if we work in stereo
4In a digital signal the sample index indicates the position of the sample in the sequence of samples. If we also know the sample-time, i.e., the cadency of the sampler, we can also compute at which time was taken a sample.
5This action is similar to the carried out by submarines when they user the sonar to perform echo-location, or by bats.
6The Fourier transform is an special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our local-end audioset because it can be considered a FIR (Finite Impulse Response) system (in ausence of an audio signal, the echo always decays with the time).
7The number of coefficients.
8Take in mind that convolution is a \(O^2\) operation, and therefore, we can only handle in real-time with our computers small filter.
9See https://www.youtube.com/watch?v=hc2Zj55j1zU
10If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.
11High \(\mu \) values spped-up the adaption process, but can generate worse \(\mathbf h\) coefficients.
12Again, see https://www.youtube.com/watch?v=hc2Zj55j1zU!
13More working implementations, higher mark.
14Remember that ping only works between public devices.
15Remember that the chunks have a chunk number in the packet header.
16ADALINE is the simplest AAN ever developed!
17A 1-seconds cadence should be enought.