6.1 Theory behind LPC
If you have reached this point, your real-time granular synthesis pitch shifter should be working, so congrats! Still, you may not be completely satisfied with the results. Indeed, the output may sound a bit unnatural, like the voice of someone talking with a hot potato in their mouth!
This unnatural sound is due to the fact that the pitch shifter we implemented earlier does not preserve the energy envelope of the signal.
In the figure above, generated with this script, we can see the energy envelope of a ms voiced speech snippet (orange) and that of the speech processed with the granular synthesis effect (green). As can be clearly seen, the granular synthesis effect does not preserve the original energy envelope. We can certainly see the spectrum shifting down; however, such a shift does not maintain the structure of harmonic signals, i.e. harmonics that are multiples of a fundamental frequency. In this section, we present a concrete method to preserve this energy envelope while still performing the desired pitch shift.
A common model for speech production is that of a source followed by a resonator. The source can be a pitched sound produced by the vocal cords, or a noise-like excitation produced by a flow of air. The resonator is the transfer function of the speech apparatus (e.g. mouth and head), which is independent of the source. Think of when you whisper: you are replacing the pitched excitation with air "noise" but you preserve the resonances of normal speech. A schematic view of the model is shown below.
Mathematically, we can express the speech production mechanism in the $z$-domain by:
$$X(z) = A(z)E(z),$$
where
$A(z)$ is the resonance transfer function,
$E(z)$ is the excitation.
In the context of spoken speech, both the resonance filter $A(z)$ and the excitation $E(z)$ will certainly change over time. However, we will assume that, over short segments in time, we are operating on a stationary signal.
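The source/resonator model can be illustrated with a minimal Python sketch. Everything numeric here is an assumption made for illustration and not taken from the text: the sampling rate, the two resonance frequencies, the pole radius, and the pitch. The same all-pole resonator is driven once by an impulse train (voiced) and once by white noise (whispered).

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sampling rate in Hz (assumed)
n_samples = fs // 10            # 100 ms, short enough to treat as stationary

# Hypothetical resonator: two resonances, each a complex-conjugate pole pair.
formants_hz = [700.0, 1200.0]   # illustrative values, not from the text
radius = 0.97                   # pole radius < 1 keeps the filter stable
denom = np.array([1.0])
for f in formants_hz:
    theta = 2 * np.pi * f / fs
    # (1 - r e^{j theta} z^-1)(1 - r e^{-j theta} z^-1)
    denom = np.convolve(denom, [1.0, -2 * radius * np.cos(theta), radius ** 2])

# Voiced excitation: impulse train at the pitch period.
pitch_hz = 125.0
voiced = np.zeros(n_samples)
voiced[:: int(fs / pitch_hz)] = 1.0

# Unvoiced ("whispered") excitation: white noise.
rng = np.random.default_rng(0)
whisper = rng.standard_normal(n_samples)

# Same resonator, two different sources.
voiced_speech = lfilter([1.0], denom, voiced)
whispered_speech = lfilter([1.0], denom, whisper)
```

Both outputs share the same resonances; only the excitation differs, which is the property the rest of this section relies on.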
Our goal is to estimate the coefficients $a_k$ of the filter $A(z)$, which are called the linear predictive coding (LPC) coefficients.
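In the time domain, we can express the produced speech $x[n]$ as:
$$x[n] = \sum_{k=1}^{p} a_k\, x[n-k] + e[n],$$
where $e[n]$ is the excitation, $a_k$ are the filter coefficients, and $p$ represents the lag, namely how many past samples are used to model the speech. With this convention, the resonator is the all-pole filter $A(z) = 1 / \left(1 - \sum_{k=1}^{p} a_k z^{-k}\right)$.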
As both the filter $A(z)$ and the excitation $E(z)$ are unknown, we attempt to minimize the energy of the following signal:
$$e[n] = x[n] - \sum_{k=1}^{p} a_k\, x[n-k].$$
Note: The above equation is identical to the expression for the prediction error in the standard AR linear prediction problem.
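One way to see how this energy is minimized is to set its derivative with respect to each coefficient to zero. The following is a sketch that assumes the sign convention $e[n] = x[n] - \sum_{k=1}^{p} a_k x[n-k]$, a segment that is zero outside its $N$ samples, and writes $r_m$ for the autocorrelation of $x[n]$ at lag $m$ (any common normalization factor cancels on both sides):

$$
\begin{aligned}
J &= \sum_n e[n]^2 = \sum_n \Big( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Big)^2, \\
\frac{\partial J}{\partial a_j} &= -2 \sum_n \Big( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Big)\, x[n-j] = 0, \qquad j = 1, \dots, p, \\
\sum_{k=1}^{p} a_k \sum_n x[n-k]\, x[n-j] &= \sum_n x[n]\, x[n-j], \\
\sum_{k=1}^{p} a_k\, r_{|j-k|} &= r_j .
\end{aligned}
$$

These are $p$ linear equations in the $p$ unknowns $a_1, \dots, a_p$.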
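In practice, we solve this minimization problem by first defining the (symmetric) autocorrelation matrix $R$ of the input signal, whose entries are given by:
$$r_m = \frac{1}{N} \sum_{k=0}^{N-m-1} x[k]\, x[k+m],$$
where $N$ is the number of samples in the segment and $m$ is the absolute difference between the row and column indices of $R$. This matrix has a Toeplitz structure, yielding the following system of equations (known as Yule-Walker) in order to minimize the energy of $e[n]$:
$$\begin{bmatrix} r_0 & r_1 & \cdots & r_{p-1} \\ r_1 & r_0 & \cdots & r_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}.$$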
The above system of equations can be solved for the filter coefficients $a_k$ by using the Levinson-Durbin algorithm, which exploits the Toeplitz structure of the autocorrelation matrix.
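As a rough illustration, the recursion can be sketched in a few lines of Python. The function names `autocorrelation` and `levinson_durbin` are hypothetical helpers written for this example, and the sketch assumes the sign convention $x[n] \approx \sum_k a_k x[n-k]$. The result is cross-checked against `scipy.linalg.solve_toeplitz`, which also solves Toeplitz systems with a Levinson recursion.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def autocorrelation(x, p):
    """Biased autocorrelation estimates r_0 ... r_p of the segment x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[: n - m], x[m:]) / n for m in range(p + 1)])


def levinson_durbin(r, p):
    """Solve the Yule-Walker system R a = [r_1 ... r_p] in O(p^2).

    Returns a_1 ... a_p (convention: x[n] ~ sum_k a_k x[n-k]) and the
    final prediction-error energy.
    """
    a = np.zeros(p)
    err = r[0]
    for i in range(p):
        # Reflection coefficient for model order i + 1.
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err
        # Update the lower-order coefficients, then append the new one.
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= 1.0 - k ** 2
    return a, err


# Toy check on a synthetic AR(2) signal with known coefficients.
rng = np.random.default_rng(0)
true_a = np.array([1.3, -0.6])
x = lfilter([1.0], np.concatenate(([1.0], -true_a)), rng.standard_normal(20000))

r = autocorrelation(x, p=2)
a, err = levinson_durbin(r, p=2)
a_ref = solve_toeplitz(r[:2], r[1:])  # reference solution of the same system
```

On this toy signal, `a` should match `a_ref` up to rounding and land close to `true_a`. The recursion divides by the running error energy, so an all-zero segment would need a guard in real code.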
In order to preserve the energy envelope with the granular synthesis effect, we need to perform the following operations on each grain/buffer:
1. Compute the LPC coefficients for the input speech $x[n]$ to obtain the filter coefficients $a_k$.
2. Inverse-filter the raw samples in order to estimate the excitation $e[n]$ from $x[n]$ and $a_k$.
3. Apply pitch-shifting on the excitation signal (i.e. apply the resampling on $e[n]$) to obtain $\tilde{e}[n]$.
4. Forward-filter the modified grain $\tilde{e}[n]$ with $a_k$ to obtain the pitch-shifted version of the input, $\tilde{x}[n]$.
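The steps above can be sketched offline in Python for a single grain. This is only an illustration under stated assumptions: `lpc`, `resample_linear`, and `lpc_pitch_shift_grain` are hypothetical names, the order `p=20` is an arbitrary choice, the sign convention is $x[n] \approx \sum_k a_k x[n-k]$, and the linear-interpolation resampler is a stand-in for the granular synthesis resampling implemented earlier.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def lpc(x, p):
    """LPC coefficients a_1..a_p (convention: x[n] ~ sum_k a_k x[n-k])."""
    n = len(x)
    r = np.array([np.dot(x[: n - m], x[m:]) / n for m in range(p + 1)])
    if r[0] == 0.0:
        return np.zeros(p)  # silent grain: nothing to model
    return solve_toeplitz(r[:p], r[1:])


def resample_linear(grain, shift_factor):
    """Stand-in pitch shift: read the grain at a different speed with
    linear interpolation, keeping the output length unchanged."""
    n = len(grain)
    positions = np.arange(n) * shift_factor
    # Positions past the end of the grain are filled with zeros.
    return np.interp(positions, np.arange(n), grain, right=0.0)


def lpc_pitch_shift_grain(x, shift_factor, p=20):
    a = lpc(x, p)                                  # 1. LPC coefficients
    fir = np.concatenate(([1.0], -a))              # 1 - sum_k a_k z^-k
    e = lfilter(fir, [1.0], x)                     # 2. inverse filter -> excitation
    e_shifted = resample_linear(e, shift_factor)   # 3. pitch-shift the excitation
    return lfilter([1.0], fir, e_shifted)          # 4. forward filter


# Toy grain: noise shaped by a simple resonator.
rng = np.random.default_rng(0)
grain = lfilter([1.0], [1.0, -1.3, 0.6], rng.standard_normal(512))
shifted = lpc_pitch_shift_grain(grain, shift_factor=0.8)
```

With `shift_factor=1.0` the inverse and forward filters cancel and the grain comes back unchanged, which is a handy sanity check before plugging in an actual pitch shift.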
With this procedure, we can see below that the pitch-shifted speech has an energy envelope much closer to that of the original input samples. The figure below is generated with this script.
If you want to learn more about LPC coefficients, we recommend checking the Wikipedia page and this website.
Now let's go to the next section to implement this feature!