6.1 Theory behind LPC


Last updated 6 years ago


If you have reached this point, your real-time granular synthesis pitch shifter should be working, so congrats! Still, you may not be completely satisfied with the results. Indeed, the output may sound a bit unnatural, like the voice of someone talking with a hot potato in their mouth!

This unnatural sound is due to the fact that the pitch shifter we implemented earlier does not preserve the energy envelope of the signal.

In the figure above, generated with this script, we can see the energy envelope of a 40 ms voiced speech snippet (orange) and that of the speech processed with the granular synthesis effect (green). As can be clearly seen, the granular synthesis effect does not preserve the original energy envelope. We can certainly see the spectrum shifting down; however, such a shift does not maintain the structure of harmonic signals, i.e. harmonics at integer multiples of a fundamental frequency. In this section, we present a concrete method to preserve the energy envelope while still performing the desired pitch shift.

Modeling speech production

A common model for speech production is that of a source followed by a resonator. The source can be a pitched sound produced by the vocal cords, or a noise-like excitation produced by a flow of air. The resonator is the transfer function of the speech apparatus (e.g. mouth and head), which is independent of the source. Think of when you whisper: you are replacing the pitched excitation with air "noise" but you preserve the resonances of normal speech. A schematic view of the model is shown below.
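To make the source-filter idea concrete, here is a small sketch (assuming NumPy and SciPy are available; all parameter values are purely illustrative) that feeds both a pitched impulse train and white noise through the same all-pole resonator:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                          # illustrative sample rate in Hz
n_samples = fs // 2                # half a second of audio

# Source 1: pitched excitation, an impulse train at 100 Hz (vocal cords)
pitched = np.zeros(n_samples)
pitched[::fs // 100] = 1.0

# Source 2: noise-like excitation, as when whispering
noise = np.random.default_rng(0).standard_normal(n_samples)

# Resonator: a two-pole filter with a resonance near 500 Hz, standing in
# for the transfer function of the speech apparatus
f_res, r = 500.0, 0.97
theta = 2 * np.pi * f_res / fs
a = np.poly([r * np.exp(1j * theta), r * np.exp(-1j * theta)]).real

# Same resonator, two different sources: "voiced" vs. "whispered" output
voiced = lfilter([1.0], a, pitched)
whispered = lfilter([1.0], a, noise)
```

Swapping the source changes the pitched/unpitched character of the output, while the resonances, and hence the "vowel" quality, stay the same.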

Mathematically, we can express the speech production mechanism in the z-domain by:

X(z) = A(z) E(z),

where

  • A(z) is the resonance transfer function,

  • E(z) is the excitation.

In the context of spoken speech, both E(z) and A(z) will certainly change over time. However, we will assume that, over short time segments, we are operating on a stationary signal.

Time domain equations

In the time domain, we can express the produced speech as:

x[n] = e[n] + \sum_{k=1}^{p} a_k x[n-k],

where p represents the lag, namely how many past samples are used to model the speech.

As both E(z) and A(z) are unknown, we attempt to minimize the energy of the following signal:

e[n] = x[n] - \sum_{k=1}^{p} a_k x[n-k].

Note: The above equation is identical to the expression for the prediction error in the standard AR linear prediction problem.
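To make this error signal concrete, suppose for a moment that the coefficients a_k were already known. The error is then obtained by running x[n] through the FIR filter with taps [1, -a_1, ..., -a_p], as in this short sketch (illustrative coefficient values; assumes NumPy and SciPy):

```python
import numpy as np
from scipy.signal import lfilter

# Hypothetical order-2 coefficients (illustrative values, stable model)
a = np.array([1.2, -0.8])                       # a_1, a_2
fir = np.concatenate(([1.0], -a))               # [1, -a_1, -a_2]

# Synthesize x[n] from a known excitation e[n] with the AR recursion
e = np.random.default_rng(0).standard_normal(1000)
x = lfilter([1.0], fir, e)

# Filtering x[n] with the FIR taps recovers the prediction error,
# which here is exactly the original excitation
e_hat = lfilter(fir, [1.0], x)
print(np.allclose(e, e_hat))                    # True
```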

Our goal is to estimate the coefficients of the filter A(z); these are called the linear predictive coding (LPC) coefficients.

In practice, we solve this system by first defining the (symmetric) autocorrelation matrix R of the input signal, whose entries are given by:

r_m = \frac{1}{N} \sum_{k=0}^{N-m-1} x[k] x[k+m],

where m is the absolute difference between the row and column indices of R. This matrix R has a Toeplitz structure, yielding the following system of equations (known as the Yule-Walker equations) in order to minimize the energy of e[n]:

R a = r, where a = [a_1, ..., a_p]^T and r = [r_1, ..., r_p]^T.

The above system of equations can be solved for the filter coefficients a_k by using the Levinson-Durbin algorithm.

Combining with Granular Synthesis

In order to preserve the energy envelope with the granular synthesis effect, we need to perform the following operations on each grain/buffer:

  1. Compute the LPC coefficients of the input speech x[n] to obtain the filter coefficients a_k.

  2. Inverse-filter the raw samples in order to estimate the excitation e[n] from a_k and x[n].

  3. Apply pitch shifting to the excitation signal (i.e. apply the resampling to e[n]) to obtain ẽ[n].

  4. Forward-filter the modified grain ẽ[n] with a_k to obtain the pitch-shifted version of the input x̃[n].
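The four per-grain operations above can be sketched in Python as follows. This is only an illustration, not the labs' actual implementation: scipy.linalg.solve_toeplitz stands in for the Levinson-Durbin recursion, and a simple linear-interpolation resampler stands in for the granular pitch shift.

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(x, p):
    """Estimate order-p LPC coefficients a_k of one grain (Yule-Walker)."""
    n = len(x)
    r = np.array([np.dot(x[:n - m], x[m:]) / n for m in range(p + 1)])
    # Solve the Toeplitz system R a = [r_1, ..., r_p]; solve_toeplitz
    # plays the role of the Levinson-Durbin recursion here
    return solve_toeplitz(r[:p], r[1:p + 1])

def pitch_shift_grain(x, shift=1.5, p=12):
    a = lpc(x, p)                                   # 1. LPC coefficients
    fir = np.concatenate(([1.0], -a))
    e = lfilter(fir, [1.0], x)                      # 2. inverse filter -> e[n]
    # 3. crude pitch shift of the excitation by resampling (reads past the
    #    grain end are zero-padded; a real granular resampler differs)
    idx = np.arange(len(e)) * shift
    e_shift = np.interp(idx, np.arange(len(e)), e, right=0.0)
    return lfilter([1.0], fir, e_shift)             # 4. forward filter

# Toy grain: a 200 Hz tone at 8 kHz plus a little noise (keeps the
# autocorrelation matrix well conditioned for the Toeplitz solve)
t = np.arange(512)
grain = np.sin(2 * np.pi * 200 / 8000 * t)
grain = grain + 0.01 * np.random.default_rng(0).standard_normal(len(t))
out = pitch_shift_grain(grain)
```

Because the forward filter in step 4 reuses the a_k estimated from the original grain, the resonances of the output match those of the input even though the excitation has been resampled.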

With this procedure, we can see below that the pitch-shifted x̃[n] has an energy envelope much closer to that of the original input samples. The figure below is generated with this script.

If you want to learn more about LPC coefficients, we recommend checking the Wikipedia page and this website.

Now let's go to the next section to implement this feature!
