Unit – 7
Code Excited Linear Prediction
Q1) What are the basic operational principles of low delay CELP?
Ans: Basic operational principle of Low Delay CELP
The basic operational concept is the same as the traditional CELP algorithm, with the exception that only the index to the excitation codebook is sent. The LD-CELP algorithm's operation can be summarized as follows:
● Speech samples are partitioned into frames and subdivided into subframes, much as in traditional CELP. Each frame in LD-CELP contains 20 samples, divided into four subframes of five samples each. Because LP analysis is performed in a backward adaptive manner, there is no need to buffer an entire frame before processing. Only one subframe (five samples) must be buffered before the encoding process starts.
● The perceptual weighting filter is built from ten linear prediction coefficients obtained from the original speech data. The filter is updated once per frame. The current frame's coefficients are calculated using samples from previous frames.
● The synthesis filter describes a 50th-order AR process. Its coefficients are derived from the synthetic speech of previous frames. The filter is updated once per frame. The current frame's zero-input response can be calculated using the known initial conditions.
● The excitation gain is updated once per subframe using a tenth-order adaptive linear predictor operating in the logarithmic-gain domain. This predictor's coefficients are updated once per frame, with LP analysis performed on gain values from previous subframes.
● The excitation sequence is searched once per subframe, with the search process involving the generation of an ensemble of filtered sequences, with each excitation sequence serving as an input to the formant synthesis filter to produce an output sequence. The excitation sequence with the lowest final error is chosen.
● The initial conditions are restored in the decoder. Synthetic speech is produced by filtering the indicated excitation sequence through the synthesis filter, without any perceptual weighting. Because the excitation gain and the LPCs are backward adaptive, there is no need to transmit them. A postfilter can be used to improve the output speech quality even further.
Q2) Describe CELP speech production model?
Ans: CELP speech production model
The CELP coder uses both long-term and short-term linear prediction models. The speech production model's block diagram shows how an excitation sequence is extracted from the codebook through an index. To produce synthetic speech, the extracted excitation is scaled to the required level and filtered by a cascade connection of the pitch synthesis filter and the formant synthesis filter. The formant synthesis filter produces the spectral envelope, while the pitch synthesis filter introduces the periodicity associated with the fundamental pitch frequency.
The CELP coder's speech production model is essentially a white noise source exciting the synthesis filters. What are the benefits of using this model? Compared to other coders, the following can be noted.
● Unlike the LPC coder, a strict voiced/unvoiced classification is no longer needed.
One of the key drawbacks of the LPC coder is its rigid classification of each speech frame, which is responsible for significant artifacts in synthetic speech. The cascaded use of the two synthesis filters allows effective and accurate modeling of transition frames with smoothness and continuity, resulting in synthetic speech that sounds much more natural.
● Partial phase information of the original signal is preserved.
The phase information of the original signal is not retained by the LPC model. The CELP model uses a closed-loop analysis-by-synthesis method to capture some phase information. In this method, the best excitation sequence from the codebook is chosen, where "best" means the sequence that produces synthetic speech as close as possible to the original, and "closeness" is often measured using time-domain techniques such as signal-to-noise ratio.
The synthetic speech is thus matched not only in the magnitude spectrum domain, but also in the time domain, where a phase difference is significant. Even though a human listener is relatively insensitive to phase, keeping some phase information in the synthetic speech adds naturalness and improves quality.
Q3) What is the Excitation codebook search – state-save method?
Ans: Excitation codebook search – state-save method
The excitation codebook search is the most computationally intensive part of CELP coding. Many ideas have been proposed and tested over the years, all with the goal of speeding up the search process without compromising the output quality. The search procedure is defined here using a precise mathematical framework.
Preliminaries
A step-by-step method for searching an excitation codebook is detailed below.
1. Filter the input speech subframe with the perceptual weighting filter.
2. For each code vector in the excitation codebook:
● Calculate the optimal gain (described later) and use that value to scale the code vector.
● Apply the pitch synthesis filter to the scaled excitation code vector.
● Apply the modified formant synthesis filter to the output of the pitch synthesis filter.
● Subtract the perceptually filtered input speech from the output of the modified formant synthesis filter; the result is a sequence of errors.
● Calculate the error sequence's energy.
3. The index of the excitation code vector associated with the lowest error energy is retained as the information for the input subframe.
The process is repeated for each input subframe. Exploiting the redundancies in the search loop helps to increase computational efficiency. Significant computational savings can be achieved by decomposing the filters' responses into zero-state and zero-input components.
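The search loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular standard: the filter sign conventions, the helper names (`pitch_synthesis`, `formant_synthesis`, `search_excitation_codebook`), and the use of a precomputed `target` (the perceptually weighted input with the filters' zero-input contribution already removed) are all assumptions made for the example. Because the filters are linear and start from zero state here, filtering the unscaled code vector and then applying the optimal gain is equivalent to scaling first.

```python
import numpy as np

def pitch_synthesis(x, b, T):
    """Zero-state pitch synthesis filter, assumed form d[n] = x[n] - b*d[n-T]."""
    d = np.zeros(len(x))
    for n in range(len(x)):
        past = d[n - T] if n >= T else 0.0
        d[n] = x[n] - b * past
    return d

def formant_synthesis(d, a):
    """Zero-state all-pole filter, assumed form y[n] = d[n] - sum_i a[i-1]*y[n-i]."""
    y = np.zeros(len(d))
    for n in range(len(d)):
        acc = d[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y[n] = acc
    return y

def search_excitation_codebook(target, codebook, a, b, T):
    """Return (index, gain, error energy) of the code vector with the lowest error."""
    best_index, best_gain, best_error = -1, 0.0, np.inf
    for index, code_vector in enumerate(codebook):
        # Filter the unit-gain code vector through the cascade of both filters.
        y = formant_synthesis(pitch_synthesis(code_vector, b, T), a)
        energy = float(np.dot(y, y))
        # Least-squares optimal gain for this code vector.
        gain = float(np.dot(target, y)) / energy if energy > 0.0 else 0.0
        # Energy of the error sequence for the scaled, filtered code vector.
        error = float(np.sum((target - gain * y) ** 2))
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain, best_error
```

The cost of this brute-force loop grows with the codebook size, which is why the decomposition into zero-state and zero-input components matters in practice.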
The system's difference equations are
Where,
M = Prediction order (short-term),
ai = LPC (short-term),
b = Long-term gain,
T = Pitch period
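As a hedged reconstruction consistent with the symbols listed above, the two difference equations of the cascade can be written as shown below. The sign convention (prediction-error filters of the form 1 + b z^-T and 1 + Σ a_i z^-i) is an assumption, as is the reading of x as the scaled excitation, d as the pitch synthesis filter output, and y as the formant synthesis filter output; for the modified formant synthesis filter, a_i would be the bandwidth-expanded coefficients.

```latex
\begin{align}
d[n] &= x[n] - b\,d[n-T], \\
y[n] &= d[n] - \sum_{i=1}^{M} a_i\,y[n-i].
\end{align}
```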
State-Save Method
The difference equations are applied in the following way (r > 0) to each of the finite-length input subframes xr[n], n = 0 to N - 1:
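A minimal Python sketch of the state-save idea follows: each subframe is filtered with the filter memories left over from the previous subframe, and the memories are saved again afterwards. The filter sign conventions and all function names are assumptions for illustration. The history buffers must hold at least T past samples of d and M past samples of y, with M ≥ 1.

```python
import numpy as np

def pitch_synthesis(x, b, T, d_hist):
    """Assumed form d[n] = x[n] - b*d[n-T]; d_hist holds at least T past outputs."""
    off = len(d_hist)
    buf = np.concatenate([d_hist, np.zeros(len(x))])
    for n in range(len(x)):
        buf[off + n] = x[n] - b * buf[off + n - T]
    return buf[off:]

def formant_synthesis(d, a, y_hist):
    """Assumed form y[n] = d[n] - sum_i a[i-1]*y[n-i]; y_hist holds at least M past outputs."""
    off = len(y_hist)
    buf = np.concatenate([y_hist, np.zeros(len(d))])
    for n in range(len(d)):
        acc = d[n]
        for i in range(1, len(a) + 1):
            acc -= a[i - 1] * buf[off + n - i]
        buf[off + n] = acc
    return buf[off:]

def state_save_filter(x, a, b, T, N):
    """Filter x subframe by subframe (length N), carrying filter states across subframes."""
    M = len(a)
    d_hist = np.zeros(T)   # pitch filter memory, zero before the first subframe
    y_hist = np.zeros(M)   # formant filter memory, zero before the first subframe
    out = []
    for start in range(0, len(x), N):
        xr = x[start:start + N]
        dr = pitch_synthesis(xr, b, T, d_hist)
        yr = formant_synthesis(dr, a, y_hist)
        # Save the final states so the next subframe starts where this one ended.
        d_hist = np.concatenate([d_hist, dr])[-T:]
        y_hist = np.concatenate([y_hist, yr])[-M:]
        out.append(yr)
    return np.concatenate(out)
```

Because the states are carried over, processing the signal in subframes gives the same output as filtering it in one pass.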
Q4) What do you mean by analysis-by-synthesis?
Ans: In a speech coder, the speech signal is represented by a set of parameters such as gain, filter coefficients, voicing strengths, and so on. In an open-loop system, the parameters are extracted from the input signal, quantized, and then used for synthesis.
A more effective approach is to use the parameters to synthesize the signal during encoding and fine-tune them so as to produce the most accurate reconstruction. This is a closed-loop optimization technique in which the aim is to select the parameters that match the synthetic speech as closely as possible to the original speech.
The block diagram of a closed-loop encoder is shown in the figure. The principle is known as analysis-by-synthesis since the signal is synthesized during encoding for analysis purposes.
Fig 1: Block diagram showing an encoder based on the analysis-by-synthesis principle
Theoretically, all of the speech coder's parameters can be optimized jointly to achieve the best result. However, the computation required makes this approach impractical. In practice, only a subset of parameters is chosen for closed-loop optimization, with the remaining parameters determined using an open-loop method. The CELP coder is based on the analysis-by-synthesis principle: excitation sequences are selected from a codebook using a closed-loop process, while other parameters, such as the filter coefficients, are determined in an open-loop fashion.
Q5) Describe Adaptive Codebook search in detail?
Ans: Adaptive Codebook search
Here, we study the procedure to search for the optimal adaptive code vector, the use of fractional delay, and the encoding of related parameters.
Integer Pitch Period
The adaptive codebook is searched in a closed-loop analysis-by-synthesis manner, where the index of the best code vector, t = Tmin, Tmin + 1, ..., Tmax, is located. This index is encoded with seven bits, resulting in 2^7 = 128 values ranging from Tmin = 20 to Tmax = 147. The method for searching the adaptive codebook follows the same principles as the excitation codebook search. Only the highlights are provided; readers are left to verify the equations on their own.
The modified formant synthesis filter's zero-state response is given by
H denotes the synthesis filter's impulse response matrix. The equation can be rewritten as
which is the cost function to be minimized. Defining
u3 = u - y3;
we have
It can be shown that the adaptive codebook gain that minimizes the cost is
and the optimal codebook index is the one that maximizes
for t = Tmin to Tmax. Recursive convolution can be used to achieve efficient computation, since the adaptive codebook is essentially an overlapping codebook with unit shift.
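The steps of this search can be written out as a hedged reconstruction, assuming the cost is the squared error between the target u3 = u - y3 and the scaled, filtered adaptive code vector. Here d2o(t) denotes the candidate adaptive code vector for index t and b(t) its gain; the exact notation is an assumption. The gain and index expressions are the standard least-squares results for a cost of this form.

```latex
\begin{align}
J^{(t)} &= \left\| \mathbf{u}_3 - b^{(t)}\,\mathbf{H}\,\mathbf{d}_{2o}^{(t)} \right\|^2, \\
b^{(t)} &= \frac{\mathbf{u}_3^{T}\,\mathbf{H}\,\mathbf{d}_{2o}^{(t)}}
               {\left\| \mathbf{H}\,\mathbf{d}_{2o}^{(t)} \right\|^2}, \\
t_{\mathrm{opt}} &= \arg\max_{T_{\min} \le t \le T_{\max}}
    \frac{\left( \mathbf{u}_3^{T}\,\mathbf{H}\,\mathbf{d}_{2o}^{(t)} \right)^2}
         {\left\| \mathbf{H}\,\mathbf{d}_{2o}^{(t)} \right\|^2}.
\end{align}
```

Substituting the optimal gain into the cost leaves the energy of u3 minus the ratio in the last line, which is why maximizing that ratio minimizes the error.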
Fractional Pitch Period
The use of a fractional pitch period eliminates both the reverberant distortion caused by pitch multiples and the roughness heard for speakers with short pitch periods. As long-term prediction improves, the noisy stochastic excitation component shrinks, which reduces noise in the synthetic speech.
To obtain the adaptive code vector with fractional delay, the FS1016 specifies an interpolation technique based on a 40-point Hamming windowed sinc function, defined as follows:
for n = -20, -19, ..., 19, with f being the fractional part of the pitch period.
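The idea of windowed-sinc interpolation can be sketched as below. This is an illustration under assumptions, not the exact FS1016 formula, which is not reproduced here: the window expression 0.54 + 0.46 cos(πn/20) and the function names are hypothetical choices that produce a 40-point Hamming-shaped window centered on n = 0.

```python
import numpy as np

def interpolation_weights(f, half_length=20):
    """Return (n, w) for n = -half_length, ..., half_length - 1.

    w[n] is a Hamming-shaped window (assumed form) times sinc(n + f),
    where np.sinc(x) = sin(pi*x) / (pi*x).
    """
    n = np.arange(-half_length, half_length)
    window = 0.54 + 0.46 * np.cos(np.pi * n / half_length)
    return n, window * np.sinc(n + f)

def interpolate(x, m, f):
    """Estimate x at the non-integer position m + f from its integer-spaced samples."""
    n, w = interpolation_weights(f)
    # Ideal band-limited interpolation: x(m + f) = sum_n x[m - n] * sinc(n + f).
    return float(np.dot(w, x[m - n]))
```

With f = 0 the weights reduce to a unit impulse at n = 0, so integer delays pass through unchanged; a nonzero f blends neighboring samples to approximate the value between them.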
Subframe Encoding Strategy
The pitch period is encoded by FS1016 using a different number of bits depending on the location of the subframe. 8 bits are assigned to the first and third subframes, and the encoding scheme is the same as before. Only 6 bits are used in the second and fourth subframes. These 6 bits represent a relative change in pitch period with respect to the previous subframe, ranging from 31 codes lower to 32 codes higher, with both integer and fractional values taken into account.
Since the pitch periods of adjacent subframes do not vary dramatically in most cases, the scheme provides higher encoding efficiency (a smaller number of bits) and a lower computational load (a shorter search range); the benefits come with a marginal loss in synthetic speech quality.
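The relative encoding for the second and fourth subframes can be illustrated with a short sketch. The mapping of the offset onto the 6-bit code (adding 31) and the function names are assumptions made for the example; the indices are taken to be positions in the table of pitch codes used by the first and third subframes.

```python
def encode_pitch_delta(prev_index, cur_index):
    """Encode cur_index relative to prev_index as a 6-bit code (0..63).

    The allowed offset runs from 31 codes lower to 32 codes higher.
    The offset-to-code mapping (offset + 31) is a hypothetical choice.
    """
    delta = cur_index - prev_index
    if not -31 <= delta <= 32:
        raise ValueError("offset outside the 6-bit relative range")
    return delta + 31

def decode_pitch_delta(prev_index, code):
    """Recover the pitch table index from the previous index and the 6-bit code."""
    if not 0 <= code <= 63:
        raise ValueError("code must fit in 6 bits")
    return prev_index + (code - 31)
```

For example, with a previous index of 100, an index of 95 is sent as code 26, and the decoder recovers 95 from the pair (100, 26).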
Q6) What is algebraic CELP?
Ans: Algebraic CELP, or ACELP, was another effort to reduce the computational cost of standard CELP coders. Many details of ACELP-based standards build on the work of previous CELP coders. The word "algebraic" refers to the generation of excitation code vectors using basic algebra or mathematical rules, such as addition and shifting.
With this approach, there is no need to physically store the entire codebook, which saves a lot of memory. Adoul and Lamblin [1987] and Adoul et al. [1987] were the first to present the original ideas, two years after Schroeder and Atal [1985]. Many researchers have since refined it, resulting in at least five standardized coders:
● ITU-T G.723.1 Multipulse Maximum Likelihood Quantization (MP-MLQ)/ ACELP (1995).
● ITU-T G.729 Conjugate Structure (CS)–ACELP (1995).
● TIA IS641 ACELP (1996)
● ETSI GSM Enhanced Full Rate (EFR) ACELP (1996).
● ETSI Adaptive Multirate (AMR) ACELP (1999).
The ACELP concept has had a significant impact on the direction of speech coding development; understanding its structure is necessary to comprehend some of the field's most influential trends.
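To illustrate why no stored codebook is needed, the sketch below builds a code vector directly from a few pulse positions and signs. The layout is hypothetical and does not reproduce any of the standards listed above: it assumes a 40-sample subframe with four interleaved tracks, where track k may place one signed unit pulse at sample k + 4p for p = 0, ..., 9.

```python
import numpy as np

def algebraic_code_vector(positions, signs, subframe_length=40, tracks=4):
    """Build a sparse code vector from one pulse per track.

    positions[k] selects the slot p on track k (sample index k + tracks*p);
    signs[k] is +1 or -1. The track layout is a hypothetical example.
    """
    slots_per_track = subframe_length // tracks
    vector = np.zeros(subframe_length)
    for track, (slot, sign) in enumerate(zip(positions, signs)):
        if not 0 <= slot < slots_per_track:
            raise ValueError("slot outside the track")
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        # The sample index follows from simple arithmetic on the track and slot.
        vector[track + tracks * slot] += sign
    return vector
```

In this layout each pulse needs only a slot number and a sign bit, so the index alone determines the whole code vector.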
Q7) Explain CELP based on an adaptive codebook?
Ans: The operation of the pitch synthesis filter is complicated by the way its zero-input response d2 is defined.
The adaptive codebook concept was created as a modification to reduce complexity while maintaining the same principle of minimizing the weighted difference through a closed-loop analysis-by-synthesis approach. In other words, some "cheating" is used to get around the computational barrier. In this proposal, the operation of the pitch synthesis filter is redefined.
For the typical case of T < N, the effect is shown in Fig 2. The adaptive codebook definition clearly differs from that of the original pitch synthesis filter. The two definitions are equivalent for T ≥ N; for T < N, however, the adaptive codebook response is generated by a periodic extension of the past, multiplied by the long-term gain.
Fig 2: Extracting a code vector from the adaptive codebook when T < N
By "cheating" in this way, the current subframe's zero-input response (with unit scaling) is completely determined from the past, removing the burden associated with the higher powers of b found in the standard pitch synthesis filter definition. When b = 1, the two definitions are, of course, identical. With this reformulation, b can be solved for using simple procedures while searching through the values of T.
When T < N, the adaptive codebook approach has the drawback of keeping the amplitude of the pitch pulses constant from one pulse to the next within a subframe. However, it has been noted that the technique is subjectively very close to the original formulation of the pitch synthesis filter. In other words, using the modified approach has little impact on subjective quality.
Operation of the Adaptive Codebook
The formulation is called an adaptive codebook because the search for the optimum pitch period can be thought of as the evaluation of a codebook with overlapping code vectors. That is, neighboring code vectors share samples, and the code vectors are indexed by the pitch period T. The codebook is adaptive because it changes from subframe to subframe.
Consider the adaptive codebook's structure, as shown in Fig 3. Tmax is the maximum value of the pitch period, so the codebook is defined as a 1-D array of Tmax elements. The elements of the codebook are denoted as d[-Tmax] to d[-1], and the subframe index r has been dropped.
Code vectors derived from this codebook are denoted by d2o(t), where the index t is the candidate pitch period.
Fig 3: Illustration of the structure of an adaptive codebook
The elements of the code vector can now be defined as follows:
d2o(t)[n] = d[mod(n, t) - t]
if t < N, and
d2o(t)[n] = d[n - t]
if t ≥ N, for 0 ≤ n ≤ N - 1. As a result, the codebook can be searched over all indices t in order to find the best code vector d2o(t) and the best long-term gain b. Once these variables are identified, the scaled code vector
b d2o(t)[n]
represents the optimal zero-input response of the modified pitch synthesis filter.
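Extracting a code vector from the adaptive codebook can be sketched as follows, assuming the codebook is held in an array whose last element is d[-1]. The function name and the array layout are assumptions for illustration; the periodic extension for t < N follows the description given above.

```python
import numpy as np

def adaptive_code_vector(history, t, N):
    """Extract the length-N adaptive code vector for index (pitch period) t.

    history stores past excitation samples with history[-1] = d[-1],
    history[-2] = d[-2], and so on; it must hold at least t samples.
    """
    history = np.asarray(history, dtype=float)
    if not 1 <= t <= len(history):
        raise ValueError("t must lie between 1 and the codebook length")
    n = np.arange(N)
    if t < N:
        # Periodic extension: repeat the last t samples of the past.
        idx = (n % t) - t
    else:
        # Plain delay: the vector lies entirely in the past.
        idx = n - t
    # All indices are negative, so they address samples counted from the end.
    return history[idx]
```

With a codebook holding d[-6], ..., d[-1] = 1, ..., 6 and N = 5, the index t = 2 yields 5, 6, 5, 6, 5, while t = 6 yields 1, 2, 3, 4, 5.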
Q8) Describe Generic CELP encoders and decoders?
Ans: Generic CELP encoder and decoder
Encoder operation
Fig 4 depicts a block diagram of a generic CELP encoder. This encoder is very basic and is meant only as an illustration. The encoder functions as follows:
● The input speech signal is segmented into frames and subframes. A common scheme is to use four subframes in one frame. The frame is normally 20 to 30 milliseconds long, while the subframe is 5 to 7.5 milliseconds long.
● Each frame is subjected to short-term LP analysis to calculate the LPC. After that, each subframe is subjected to long-term LP analysis. The original speech, or pre-emphasized speech, is usually the input to short-term LP analysis; the (short-term) prediction error is often the input to long-term LP analysis. After this step, the coefficients of the perceptual weighting filter, pitch synthesis filter, and modified formant synthesis filter are known.
● The excitation sequence can now be determined. Each excitation code vector has the same length as the subframe, so an excitation codebook search is done once per subframe. The search procedure starts with the generation of an ensemble of filtered excitation sequences with corresponding gains; for each sequence, the mean-squared error (or sum of squared errors) is computed, and the code vector and gain associated with the lowest error are chosen.
● The index of the excitation codebook, the gain, the long-term LP parameters, and the LPC are encoded, packed, and transmitted as the CELP bit-stream.
Fig 4: Block diagram of a generic CELP encoder
Decoder Operation
Fig 5 shows a block diagram of the CELP decoder. It unpacks and decodes the various parameters from the bit-stream, which are then directed to the appropriate blocks in order to synthesize speech. To improve the quality of the resulting signal, a postfilter is applied at the end.
Fig 5: Block diagram of a generic CELP decoder
Q9) Explain the zero-input zero-state method?
Ans: This approach calculates the total response by computing the zero-input and zero-state responses separately. The total response is the sum of these two responses. Fig 6 depicts the signals that are involved.
Fig 6: Block diagram showing the signals involved in the zero-input zero-state method for the computation of output signal from a cascade connection of pitch synthesis filter and modified formant synthesis filter
The following sections go into the specifics of each signal.
● Zero-state response of pitch synthesis filter (d1) :
Because of the zero-state condition, the first T samples can be computed without the product term in the above equation. As a result, the equation can be rewritten as
T < N is assumed in this case. If T ≥ N, the zero-state response is equal to the input signal.
● Zero-input response of pitch synthesis filter (d2):
● Total response of pitch synthesis filter (d ):
dr[n] = d1r[n] + d2r[n], 0 ≤ n ≤ N - 1.
● Total zero-state response: zero-state response of the pitch synthesis filter, filtered by the formant synthesis filter in zero-state (y1):
● Zero-input response of pitch synthesis filter, filtered by the formant synthesis filter in zero-state (y2):
● Zero-input response of formant synthesis filter (y3):
● Total response (formant synthesis filter):
yr[n] = y1r[n] + y2r[n] + y3r[n], 0 ≤ n ≤ N - 1.
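The decomposition can be checked numerically with a small sketch. The filter sign conventions, the function names, and the history-buffer layout (at least T past samples of d and M past samples of y) are assumptions for illustration; the signal names d1, d2, y1, y2, y3 follow the list above.

```python
import numpy as np

def pitch_synthesis(x, b, T, d_hist):
    """Assumed form d[n] = x[n] - b*d[n-T]; d_hist holds at least T past outputs."""
    off = len(d_hist)
    buf = np.concatenate([d_hist, np.zeros(len(x))])
    for n in range(len(x)):
        buf[off + n] = x[n] - b * buf[off + n - T]
    return buf[off:]

def formant_synthesis(d, a, y_hist):
    """Assumed form y[n] = d[n] - sum_i a[i-1]*y[n-i]; y_hist holds at least M past outputs."""
    off = len(y_hist)
    buf = np.concatenate([y_hist, np.zeros(len(d))])
    for n in range(len(d)):
        acc = d[n]
        for i in range(1, len(a) + 1):
            acc -= a[i - 1] * buf[off + n - i]
        buf[off + n] = acc
    return buf[off:]

def zero_input_zero_state(x, a, b, T, d_hist, y_hist):
    """Return (d1, d2, y1, y2, y3) for one subframe x."""
    N, M = len(x), len(a)
    silence = np.zeros(N)
    d1 = pitch_synthesis(x, b, T, np.zeros(T))       # zero-state: input only
    d2 = pitch_synthesis(silence, b, T, d_hist)      # zero-input: memory only
    y1 = formant_synthesis(d1, a, np.zeros(M))       # d1 through zero-state filter
    y2 = formant_synthesis(d2, a, np.zeros(M))       # d2 through zero-state filter
    y3 = formant_synthesis(silence, a, y_hist)       # formant filter memory only
    return d1, d2, y1, y2, y3
```

Since both filters are linear, d1 + d2 and y1 + y2 + y3 match the outputs obtained by filtering the subframe directly with the saved states. Only y1 depends on the excitation code vector, so y2 and y3 can be computed once per subframe and reused across the whole codebook search.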
Q10) Define low delay CELP?
Ans: Low Delay CELP
Delay is a significant consideration for real-time two-way conversations. It is defined as the time it takes for sound to travel from the speaker to the listener. The ability to maintain a conversation is compromised when the delay is excessively long, such as more than 150 milliseconds.
Because of the time it takes to realize that the other party is speaking, the parties tend to interrupt or "talk over" each other. When the delay is high enough, conversations degrade to half-duplex mode, with only one direction of communication taking place at a time; hence, the lower the delay, the better.
Low delay is also desirable in traditional telephone networks because delay exacerbates echo problems: the longer an echo is delayed, the more noticeable and distracting it becomes to the talker. Even though echo cancelers are commonly used, high delay makes echo cancellation more difficult.
The most distinguishing features of LD-CELP follow from these requirements. Only five samples are needed to begin the encoding process in LD-CELP. In addition, only the excitation signal is transmitted; the predictor coefficients are updated using LP analysis on previously quantized speech.
As a result, the LD-CELP coder is essentially a backward adaptive variant of the CELP coder. The essence of CELP, namely the analysis-by-synthesis codebook search, is preserved. The LD-CELP encoder and decoder have the following general structures.
Fig 7: LD-CELP encoder (top) and decoder (bottom)