6.1 Theory behind LPC


Last updated 6 years ago


If you have reached this point, your real-time granular synthesis pitch shifter should be working, so congrats! Still, you may not be completely satisfied with the results. Indeed, the output may sound a bit unnatural, like the voice of someone talking with a hot potato in their mouth!

This unnatural sound is due to the fact that the pitch shifter we implemented earlier does not preserve the energy envelope of the signal.

In the figure above, generated with this script, we can see the energy envelope of a 40 ms voiced speech snippet (orange) and that of the speech processed with the granular synthesis effect (green). As can be clearly seen, the granular synthesis effect does not preserve the original energy envelope. We can certainly see the spectrum shifting down; however, such a shift does not maintain the structure of harmonic signals, i.e. harmonics at integer multiples of a fundamental frequency. In this section, we present a concrete method to preserve the energy envelope while still performing the desired pitch shift.

Modeling speech production

A common model for speech production is that of a source followed by a resonator. The source can be a pitched sound produced by the vocal cords, or a noise-like excitation produced by a flow of air. The resonator is the transfer function of the speech apparatus (e.g. mouth and head), which is independent of the source. Think of when you whisper: you are replacing the pitched excitation with air "noise" but you preserve the resonances of normal speech. A schematic view of the model is shown below.
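To make the source-filter idea concrete, here is a small sketch (assuming NumPy and SciPy are available; all parameter values are purely illustrative) that feeds both a pitched impulse train and white noise through the same all-pole resonator:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                          # illustrative sample rate in Hz
n_samples = fs // 2                # half a second of audio

# Source 1: pitched excitation, an impulse train at 100 Hz (vocal cords)
pitched = np.zeros(n_samples)
pitched[::fs // 100] = 1.0

# Source 2: noise-like excitation, as when whispering
noise = np.random.default_rng(0).standard_normal(n_samples)

# Resonator: a two-pole filter with a resonance near 500 Hz, standing in
# for the transfer function of the speech apparatus
f_res, r = 500.0, 0.97
theta = 2 * np.pi * f_res / fs
a = np.poly([r * np.exp(1j * theta), r * np.exp(-1j * theta)]).real

# Same resonator, two different sources: "voiced" vs. "whispered" output
voiced = lfilter([1.0], a, pitched)
whispered = lfilter([1.0], a, noise)
```

Swapping the source changes the pitched/unpitched character of the output, while the resonances, and hence the "vowel" quality, stay the same.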

Mathematically, we can express the speech production mechanism in the z-domain by:

X(z) = A(z) E(z),

where

  • A(z) is the resonance transfer function,

  • E(z) is the excitation.

In the context of spoken speech, both E(z) and A(z) will certainly change over time. However, we will assume that, over short time segments, we are operating on a stationary signal.

Time domain equations

In the time domain, we can express the produced speech as:

x[n] = e[n] + \sum_{k=1}^{p} a_k x[n-k],

where p represents the lag, namely how many past samples are used to model the speech.

As both E(z) and A(z) are unknown, we attempt to minimize the energy of the following signal:

e[n] = x[n] - \sum_{k=1}^{p} a_k x[n-k].

Note: The above equation is identical to the expression for the prediction error in the standard AR linear prediction problem.
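To make this error signal concrete, suppose for a moment that the coefficients a_k were already known. The error is then obtained by running x[n] through the FIR filter with taps [1, -a_1, ..., -a_p], as in this short sketch (illustrative coefficient values; assumes NumPy and SciPy):

```python
import numpy as np
from scipy.signal import lfilter

# Hypothetical order-2 coefficients (illustrative values, stable model)
a = np.array([1.2, -0.8])                       # a_1, a_2
fir = np.concatenate(([1.0], -a))               # [1, -a_1, -a_2]

# Synthesize x[n] from a known excitation e[n] with the AR recursion
e = np.random.default_rng(0).standard_normal(1000)
x = lfilter([1.0], fir, e)

# Filtering x[n] with the FIR taps recovers the prediction error,
# which here is exactly the original excitation
e_hat = lfilter(fir, [1.0], x)
print(np.allclose(e, e_hat))                    # True
```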

Our goal is to estimate the coefficients of the filter A(z); these are called the linear predictive coding (LPC) coefficients.

In practice, we solve this system by first defining the (symmetric) autocorrelation matrix R of the input signal, whose entries are given by:

r_m = \frac{1}{N} \sum_{k=0}^{N-m-1} x[k] x[k+m],

where m is the absolute difference between the row and column indices of R. This matrix R has a Toeplitz structure, yielding the following system of equations (known as the Yule-Walker equations) in order to minimize the energy of e[n]:

R a = r, where a = [a_1, ..., a_p]^T and r = [r_1, ..., r_p]^T.

The above system of equations can be solved for the filter coefficients a_k by using the Levinson-Durbin algorithm.

Combining with Granular Synthesis

In order to preserve the energy envelope with the granular synthesis effect, we need to perform the following operations on each grain/buffer:

  1. Compute the LPC coefficients of the input speech x[n] to obtain the filter coefficients a_k.

  2. Inverse-filter the raw samples in order to estimate the excitation e[n] from a_k and x[n].

  3. Apply pitch shifting to the excitation signal (i.e. apply the resampling to e[n]) to obtain ẽ[n].

  4. Forward-filter the modified grain ẽ[n] with a_k to obtain the pitch-shifted version of the input x̃[n].
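The four per-grain operations above can be sketched in Python as follows. This is only an illustration, not the labs' actual implementation: scipy.linalg.solve_toeplitz stands in for the Levinson-Durbin recursion, and a simple linear-interpolation resampler stands in for the granular pitch shift.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(x, p):
    """Estimate order-p LPC coefficients a_k of one grain (Yule-Walker)."""
    n = len(x)
    r = np.array([np.dot(x[:n - m], x[m:]) / n for m in range(p + 1)])
    # Solve the Toeplitz system R a = [r_1, ..., r_p]; solve_toeplitz
    # plays the role of the Levinson-Durbin recursion here
    return solve_toeplitz(r[:p], r[1:p + 1])

def pitch_shift_grain(x, shift=1.5, p=12):
    a = lpc(x, p)                                   # 1. LPC coefficients
    fir = np.concatenate(([1.0], -a))
    e = lfilter(fir, [1.0], x)                      # 2. inverse filter -> e[n]
    # 3. crude pitch shift of the excitation by resampling (reads past the
    #    grain end are zero-padded; a real granular resampler differs)
    idx = np.arange(len(e)) * shift
    e_shift = np.interp(idx, np.arange(len(e)), e, right=0.0)
    return lfilter([1.0], fir, e_shift)             # 4. forward filter

# Toy grain: a 200 Hz tone at 8 kHz plus a little noise (keeps the
# autocorrelation matrix well conditioned for the Toeplitz solve)
t = np.arange(512)
grain = np.sin(2 * np.pi * 200 / 8000 * t)
grain = grain + 0.01 * np.random.default_rng(0).standard_normal(len(t))
out = pitch_shift_grain(grain)
```

Because the forward filter in step 4 reuses the a_k estimated from the original grain, the resonances of the output match those of the input even though the excitation has been resampled.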

With this procedure, we can see below that the pitch-shifted x̃[n] has an energy envelope much closer to that of the original input samples. The figure below is generated with this script.

If you want to learn more about LPC coefficients, we recommend checking the Wikipedia page and this website.

Now let's go to the next section to implement this feature!
