In this milestone, we’ll address the feedback problem that arises when the signal emitted by our speakers is recorded by our microphone. First, we’ll formalize the problem and then look at different solutions, with varying effectiveness and computational requirements.
One of the first problems we encounter when using the buffer.py module1 is that, if we don’t use headphones, the sound that comes out of our PC’s (loud)speaker reaches our mic(rophone) some time later, and some time after that, this sound reaches our interlocutor (the “far-end” ... in the system we are the “near-end”) in the form of an echo (signal) of his/her own voice. This echo is reproduced by his/her speaker, can be captured again (some time later) by his/her mic, and is sent back to us ... and so on, generating a rather unpleasant feedback signal.
To formalize the problem, let’s define:
In this situation, we have that
where \(m(t)\) is the analog audio signal that makes the membrane of our microphone oscillate.
Our problem here is to minimize the energy of \(s(t)\), i.e., to make
Use a headset. In this case,
Decrease the gain of the amplifier of your speaker to make (the energy of) \(s(t)\) as small as possible. Unfortunately, this also decreases the volume of the far-end signal (the voice of our interlocutor) :-/
Let \(\mathbf m\) be the digital version of \(m(t)\), and \({\mathbf m}[t]\) its \(t\)-th sample3. In this solution, we send
where \(a\) is an attenuation (scalar) value, and \(d\) represents the delay4 (measured in sample-times) required to propagate the sound waves from our speaker to our mic. We define
as the estimated5 feedback signal that reaches our microphone at the same instant of time that the sample \({\mathbf n}[t]\) would have been captured in the absence of the feedback.
Notice that the notation \(\tilde {\cdot }\) has been used to highlight that \(\tilde {\mathbf n}\) is an approximation of \(\mathbf n\) (our sampled true-voice signal), and the notation \(\hat {\cdot }\) to emphasize that \(\hat {\mathbf s}\) is a registered6 prediction of \(s\) reaching our microphone. Notice also that, if \({\mathbf s}[0]\) is the first sample of a chunk (the \(c\)-th chunk), the sample \({\mathbf s}[-d]\) could belong to a previous chunk (the \((c-1)\)-th chunk).
Finally, \(a\) should be chosen considering that, in the absence of voice at each end, \(s(t)\approx 0\). For example, Skype finds \(d\) and \(a\) using a “call-signal” (a sequence of more-or-less tonal sounds). \(d\) is determined by measuring the propagation time of the call-signal between our speaker and our mic.
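For illustration purposes only, the following sketch shows how \(d\) and \(a\) could be estimated with NumPy from a played call-signal chunk and the corresponding recorded chunk, using cross-correlation (the function name and the least-squares estimate of \(a\) are our own choices, not part of InterCom):
\begin{verbatim}
import numpy as np

def estimate_delay_and_attenuation(s, m):
    # s: played call-signal chunk, m: recorded chunk (1-D arrays, same length),
    # captured while the call-signal is the only source of sound.
    corr = np.correlate(m, s, mode='full')        # cross-correlation
    lags = np.arange(-len(s) + 1, len(m))         # lag of each correlation value
    valid = lags >= 0                             # the mic cannot anticipate the speaker
    d = int(lags[valid][np.argmax(corr[valid])])  # delay = lag of the maximum
    # With m[t] ~= a * s[t - d] (no voice), a follows from least squares.
    s_delayed = s[:len(s) - d]
    m_aligned = m[d:]
    a = np.dot(m_aligned, s_delayed) / np.dot(s_delayed, s_delayed)
    return d, a
\end{verbatim}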
We can improve the performance of the previous feedback cancellation solution if we also take into consideration that the feedback signal that finally reaches our microphone is the convolution of \(s(t)\) and a signal \(h(t)\) that represents the echo response of our local audioset (speaker, mic, walls, monitor, keyboard, our body, ...) to an impulse signal \(\delta (t)\).7 In other words, we can modify Eq. \eqref{eq:simplest} to compute
where \(\ast \) represents the (digital) convolution between (in our case) digital signals, and \(\mathbf h\) is the digitized (sampled + quantized) version of \(h(t)\), the response (echo signal) of the near-end (our) audioset to the impact of \(\delta (t)\).
The convolution of digital signals in the time domain can be expensive (with computational complexity \(O^2\), where \(O\) is the number of elements to process) if the number of samples and/or filter coefficients is high. Fortunately, thanks to the convolution theorem [3, 4], the convolution can be replaced by an element-wise product (with complexity \(O\)) when we consider the signals in the frequency domain. Thanks to this, we can rewrite Eq. \eqref{eq:using_convolution} as
where \(\mathbf S\) is the (digital) Fourier transform8 of \(\mathbf s\), \(\mathbf H\) is the Fourier transform9 of \(\mathbf h\), and \({\mathcal F}^{-1}\) represents the inverse (digital) Fourier transform. Notice that all these transforms are applied to digital signals, and there exist fast algorithms to compute them (with complexity \(O\log _2O\)).
Finally, notice that this feedback suppression technique could ease the calculation of \(d\), because if correlation is used to estimate it, the maximum should be clearly identifiable.
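As a minimal sketch (assuming that \(\mathbf h\) has already been measured and that we work chunk-by-chunk), the previous equation could be computed with numpy.fft as follows; notice that the FFT implements a circular convolution, so in practice the chunks should be zero-padded (e.g., using overlap-add) to obtain a linear convolution:
\begin{verbatim}
import numpy as np

def estimate_feedback_fft(s, h):
    # s: played chunk, h: measured impulse response (1-D arrays).
    # Computes s_hat = F^-1(S * H) using a real FFT of the chunk length.
    # Note: this is a circular convolution; zero-pad (overlap-add) for a linear one.
    N = len(s)
    S = np.fft.rfft(s, n=N)            # Fourier transform of the played chunk
    H = np.fft.rfft(h, n=N)            # Fourier transform of the impulse response
    return np.fft.irfft(S * H, n=N)    # element-wise product + inverse transform
\end{verbatim}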
Unfortunately, none of the previous solutions is “dynamic”, in the sense that if we modify the audioset, some parameters must be modified in real-time:
Filters can be implemented in the frequency domain (as happens in Eq. \eqref{eq:faster}) or in the signal (time, in our case) domain. Signal-domain (digital) convolutions are affordable if the length10 of the (digital) filters is small.11
Eq. \eqref{eq:minimal_filter} represents the minimal expression of a FIR filter (whose spectral shape is completely flat because it only has a single coefficient and, therefore, it attenuates all the frequencies equally). In general, the expression of a FIR filter in the time domain (applied to a signal that has previously been delayed \(d\) samples) is
where \(K\) is the number of coefficients of the filter \(\mathbf h\), and \({\mathbf h}_k\) is its \(k\)-th coefficient. When the (coefficients of the) filter are updated for each input sample, Eq. \eqref{eq:FIR_filter} can be rewritten as
In general, the filter does not need to be updated so often (e.g., once per second could be fast enough). In this case, Eq. \eqref{eq:adaptive_FIR_filter} becomes
where \(i\) represents the update iteration. If we decide to work with impulses, \({\mathbf h}^{(i)}\) is determined by the samples recorded by our soundcard \(d\) sample-times after the generation of each impulse, which should be the only source of sound. If there were other sources of sound (such as, for example, our voice), the updates should be discarded or averaged using some moving-average technique.
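A minimal (and deliberately unoptimized) sketch of how the FIR filtering of Eq. \eqref{eq:FIR_filter} could be computed, assuming that the filter coefficients \(\mathbf h\) and the delay \(d\) are already known (the function name is our own choice):
\begin{verbatim}
import numpy as np

def apply_fir(s, h, d):
    # Computes s_hat[t] = sum_k h[k] * s[t - d - k] for every sample of s.
    # Samples with a negative index (previous chunk) are taken as zero here.
    K = len(h)
    s_hat = np.zeros(len(s))
    for t in range(len(s)):
        for k in range(K):
            idx = t - d - k
            if idx >= 0:
                s_hat[t] += h[k] * s[idx]
    return s_hat
\end{verbatim}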
Finally, notice that we must send
The LMS algorithm [2, 1] was invented by professor Bernard Widrow and his first Ph.D. student, Ted Hoff, to train the ADALINE artificial neural “network”.12 Using LMS, ADALINE is able to distinguish two (different) patterns, even using (only a part of13) a single neuron.
The LMS algorithm adapts the coefficients of \(\mathbf {h}\) to minimize the error signal defined by
Unfortunately, we don’t know \(\mathbf {n}\) except when we are in silence, where \(\mathbf {n}[t]=0\;\forall t\). When we are in silence, the error is
That said, LMS defines the update rule at the \((i+1)\)-th iteration as
where \(\mu \) is the learning rate14. These equations can be derived15 using the (steepest) gradient descent algorithm.
Obviously, the filter should only be updated when we are in silence :-/
Finally, notice that to use LMS, the number of coefficients of the filter, \(K\), must be the same as the number of input samples considered in the adaptation. Therefore, if we are working with chunks, at the moment we decide to update the filter we could consider, for example, only the \(K\) samples of the currently played chunk with indexes in \([d, d+K-1]\), if \(d+K\) is smaller than the chunk size. Otherwise (if \(d+K > \text {chunk~size}\)), at least two chunks should be concatenated.
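The following sketch applies the textbook LMS rule over one chunk; it assumes near-end silence (so the error is \({\mathbf m}[t]-\hat {\mathbf s}[t]\)), that \(\mathbf s\) is aligned with \(\mathbf m\) and keeps enough past samples, and that the update is the standard steepest-descent one (the function name and the chunk-wise organization are our own choices):
\begin{verbatim}
import numpy as np

def lms_update(h, s, m, d, mu):
    # h : current filter coefficients (length K).
    # s : played samples (aligned with m; s[t] was played when m[t] was recorded).
    # m : recorded chunk (assumed to contain only feedback: n[t] ~= 0).
    # d : feedback delay in samples; mu : learning rate.
    K = len(h)
    for t in range(len(m)):
        start = t - d - K + 1
        if start < 0:
            continue                       # not enough past played samples yet
        x = s[start:t - d + 1][::-1]       # s[t-d], s[t-d-1], ..., s[t-d-K+1]
        e = m[t] - np.dot(h, x)            # error = recorded - estimated feedback
        h = h + mu * e * x                 # steepest-descent (LMS) update
    return h
\end{verbatim}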
Ultimately, feedback in InterCom occurs because the audio signal we are generating is reproduced through our speakers (with a certain time lag). In this feedback situation, the spectrum of a newly received chunk is similar to the spectrum of a newly sent chunk. Therefore, if the chunk we want to send is pre-filtered (before being sent) using the opposite filter to that described by the spectrum of the played chunk, we should cancel the feedback.
Let \({\mathbf M} = {\mathcal F}({\mathbf m})\) be the Fourier transform of the \(i\)-th chunk captured by the soundcard, and let \({\mathbf S} = {\mathcal F}({\mathbf s})\) be the Fourier transform of the \(i\)-th played chunk. Considering the reasoning followed in the previous paragraph, and considering also Eq. \eqref{eq:simplest}, it can be concluded that
where \(\hat {\mathbf {N}} = \mathcal {F}(\hat {\mathbf {n}})\) is the Fourier transform of the chunk that should be sent in the \(i\)-th iteration (chunk).
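The exact expression depends on the adopted model; the sketch below assumes the simplest reading, in which the estimated feedback spectrum is \(a\,e^{-j2\pi k d/N}\,{\mathbf S}[k]\) (the frequency-domain counterpart of Eq. \eqref{eq:simplest}), and subtracts it from \(\mathbf M\) before sending (the function name is our own choice):
\begin{verbatim}
import numpy as np

def prefilter_chunk(m, s, a, d):
    # m: recorded chunk, s: played chunk (same length N).
    # Subtracts the assumed feedback spectrum a * exp(-j*2*pi*k*d/N) * S[k]
    # from M[k] and returns the time-domain chunk to send.
    N = len(m)
    M = np.fft.rfft(m)
    S = np.fft.rfft(s)
    k = np.arange(len(S))
    phase = np.exp(-2j * np.pi * k * d / N)   # a delay of d samples, in frequency
    N_hat = M - a * phase * S                 # estimated clean (near-end) spectrum
    return np.fft.irfft(N_hat, n=N)
\end{verbatim}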
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] S. Haykin. Adaptive Filter Theory (3rd edition). Prentice Hall, 1995.
[3] J. Kovačević, V.K. Goyal, and M. Vetterli. Fourier and Wavelet Signal Processing. http://www.fourierandwavelets.org/, 2013.
[4] Alan V. Oppenheim, Alan S. Willsky, and S. Hamid Nawab. Signals and Systems (2nd edition). Prentice Hall, 1997.
1And obviously, any other parent version of buffer.py.
2I.e., the same signal that would be captured by our mic if we were using a headset.
3Or frame, if we work in stereo.
4In a digital signal, the sample index indicates the position of the sample in the sequence of samples. If we also know the sample-time, i.e., the cadence of the sampler, we can also compute the time at which a sample was taken.
5We use the word “estimated” because in this model we are ignoring several factors, such as echoes, non-flat frequency responses, etc., that modify the version of \(\mathbf {s}\) that is received by the microphone.
6One that matches in time, i.e., that is synchronized.
7This technique is similar to that carried out by submarines when they use sonar to perform echo-location, or by bats to fly in the dark.
8The Fourier transform is a special case of the Laplace transform where \(\sigma =0\) in the complex (Laplace) domain represented by the \(s=\sigma +j\omega \) frequencies. This simplification can be used for the characterization of our near-end audioset because it can be considered a FIR (Finite Impulse Response) system (in the absence of an audio signal, the echo signal always decays with time). Finally, notice that \({\mathbf S}[\omega ]\) (the \(\omega \)-th frequency component of \(\mathbf S\)) is a complex number (the Fourier coefficients are complex numbers).
9To compute the Fourier transform of digital signals, use a library such as numpy.fft.
10The number of coefficients.
11Keep in mind that convolution is an \(O^2\) operation and, therefore, we can only handle small filters in real time with our computers.
12See https://www.youtube.com/watch?v=hc2Zj55j1zU
13If we do not consider the activation function, an artificial neuron and a FIR filter perform the same computation.
14High \(\mu \) values speed up the adaptation process, but can generate worse \(\mathbf {h}\) coefficients.
15Again, please, watch https://www.youtube.com/watch?v=hc2Zj55j1zU!
16The more working implementations, the higher the grade.