• Khirod Behera

Feature Extraction From Audio

Updated: Aug 17, 2020

Audio is a complex kind of data for a machine to understand, and audio-related AI applications are now widely used in the market, so an article on this topic can be both useful and informative.

As data scientists, we know that feature extraction is a significant step in almost every machine learning task.

In this article, we will talk about feature extraction techniques for audio and speech.

Most of the energy in speech lies at frequencies from 0 to 5 kHz.

The objective of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity.

--> A speaker recognition system contains two main modules:

  1. Feature Matching

  2. Feature Extraction


Feature matching is the procedure of identifying an unknown speaker by comparing the features extracted from his/her voice input with the ones from a set of known speakers.

Now let us see about feature extraction in a precise manner.


Feature extraction pulls a small amount of data out of the voice signal that can later be used to represent each speaker.

The extraction of features is an essential part of analyzing audio and finding relations between different features. Models cannot understand raw audio data directly, so feature extraction is used to convert it into an understandable form: a representation that explains most of the data in a compact way.

Feature extraction is required for classification, prediction, and recommendation algorithms.

Here are some feature extraction techniques.

1. Zero crossing rate

2. Energy

3. Entropy of energy

4. Spectral centroid

5. Spectral spread

6. Spectral entropy

7. Spectral flux

8. Spectral roll-off

9. MFCCs

10. Chroma

a. Chroma vector

b. Chroma deviation

11. Perceptual Linear Prediction (PLP)

12. Relative spectral filtering of log domain coefficients PLP (RASTA-PLP)

13. Linear predictive coding (LPC)

1. Zero crossing rate:-

It is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive to zero to negative or from negative to zero to positive.

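$$\mathrm{zcr} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}_{\mathbb{R}_{<0}}(s_t \, s_{t-1})$$

where s is a signal of length T and 1_{R<0} is an indicator function that equals 1 when its argument is negative and 0 otherwise.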

This feature is highly recommended for speech recognition and music information retrieval.



The zero-crossing rate is also used for voice activity detection (VAD), i.e., finding whether human speech is present in an audio segment or not.
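As a minimal sketch of the definition above, the zero-crossing rate of one frame can be computed in plain Python. The function name `zero_crossing_rate` is chosen here for illustration, and a crossing is counted only when two adjacent samples have strictly opposite signs.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose product is negative."""
    if len(frame) < 2:
        return 0.0
    # A pair (s[t-1], s[t]) counts as a crossing when the signs differ.
    crossings = sum(1 for prev, cur in zip(frame, frame[1:]) if prev * cur < 0)
    return crossings / (len(frame) - 1)


# A frame that flips sign at every sample crosses zero at every step.
print(zero_crossing_rate([0.5, -0.5, 0.5, -0.5, 0.5]))  # 1.0
```

Noisy or unvoiced frames tend to give higher values than voiced frames, which is one reason the feature is useful for VAD.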

2. Energy:-

Energy is another parameter for classifying the voiced and unvoiced parts of speech: the voiced part has high energy because of its periodicity, and the unvoiced part has low energy.

It is computed as the sum of squares of the signal values, normalized by the respective frame length.
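A short sketch of this computation, assuming the frame is a plain list of samples (the name `short_term_energy` is illustrative):

```python
def short_term_energy(frame):
    """Sum of squared samples divided by the frame length."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


print(short_term_energy([1.0, -1.0, 1.0, -1.0]))  # 1.0
```

Comparing this value against a threshold is a simple way to separate high-energy (voiced) frames from low-energy ones; the threshold itself has to be tuned for the recording.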

3. Entropy of energy:-

Entropy is a measure of how the energy is dispersed, and changes in entropy can help separate voiced from unvoiced speech. Speech with clear formants tends to have lower entropy, a flat distribution corresponds to silence, and high entropy values indicate noise in the speech.

The feature is the entropy of the normalized energies of a frame's sub-frames, and it can be interpreted as a measure of abrupt changes.
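The idea can be sketched as follows: split a frame into sub-frames, normalize their energies so they sum to 1, and take the Shannon entropy. The number of sub-frames (10) and the use of base-2 logarithms are assumptions made for this example.

```python
import math


def energy_entropy(frame, n_sub_frames=10):
    """Entropy of the normalized energies of equally sized sub-frames."""
    sub_len = len(frame) // n_sub_frames
    if sub_len == 0:
        raise ValueError("frame is shorter than the number of sub-frames")
    energies = [
        sum(x * x for x in frame[i * sub_len:(i + 1) * sub_len])
        for i in range(n_sub_frames)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0
    probs = [e / total for e in energies]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in probs if p > 0)


steady = [1.0] * 100               # energy spread evenly over the frame
burst = [0.0] * 90 + [1.0] * 10    # all energy in the last sub-frame
print(energy_entropy(steady) > energy_entropy(burst))  # True
```

An abrupt burst concentrates the energy in few sub-frames and lowers the entropy, while an even distribution gives the maximum value.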

4. Spectral centroid:-

The spectral centroid indicates where the center of mass of a sound's spectrum is located. It is calculated as the weighted mean of the frequencies present in the signal, determined using a Fourier transform, with their magnitudes as the weights. If the frequencies in a piece of music are the same throughout, the spectral centroid stays around a center; if there are high frequencies at the end of the sound, the centroid rises toward the end.

The spectral centroid is usually computed for each frame, so an implementation returns an array with as many columns as there are frames in the sample.

The spectral centroid is a good predictor of the brightness of a sound, which is why it is widely used in digital audio and music processing as an automatic measure of musical timbre.
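A minimal sketch of the weighted mean, assuming the magnitude spectrum of one frame and the frequency of each bin are already available as lists (in practice they would come from an FFT of a windowed frame):

```python
def spectral_centroid(freqs, magnitudes):
    """Magnitude-weighted mean of the bin frequencies, in the units of freqs."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, magnitudes)) / total


freqs = [100.0, 200.0, 300.0]
print(spectral_centroid(freqs, [1.0, 1.0, 1.0]))  # 200.0
print(spectral_centroid(freqs, [0.0, 0.0, 1.0]))  # 300.0
```

Moving energy toward the higher bins raises the centroid, which matches the interpretation of the feature as brightness.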

5. Spectral spread:-

The spectral spread is the second central moment of the spectrum: it measures how widely the spectrum is distributed around the spectral centroid, so a pure tone has a small spread and a noise-like sound has a large one.

The feature should not be confused with spread spectrum, a communications technique in which a signal generated with a particular bandwidth is deliberately spread in the frequency domain. Spread spectrum is used for several reasons, including the establishment of secure communications, increasing resistance to natural interference, noise, and jamming, preventing detection, and limiting power flux density.

As the name suggests, spread spectrum transmits a signal by spreading it. Signals modulated with these techniques are hard to interfere with and hard to jam, and an intruder without authorized access cannot easily recover them, which is why these techniques are used for military purposes.

Following are the advantages of spread spectrum −

  • Cross-talk elimination

  • Better output with data integrity

  • Reduced effect of multipath fading

  • Better security

  • Reduction in noise

  • Co-existence with other systems

  • Longer operative distances

  • Hard to detect

  • Not easy to demodulate/decode

  • Difficult to jam the signals

Although spread spectrum techniques were originally designed for military uses, they are now being used widely for commercial purposes.
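Coming back to the spectral spread as an audio feature (the second central moment of the spectrum), a minimal sketch could look like this. It assumes the bin frequencies and magnitudes of one frame are given as lists and returns the square root of the second central moment, so the result has the same units as the frequencies.

```python
import math


def spectral_spread(freqs, magnitudes):
    """Square root of the magnitude-weighted variance around the centroid."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    centroid = sum(f * m for f, m in zip(freqs, magnitudes)) / total
    variance = sum(((f - centroid) ** 2) * m for f, m in zip(freqs, magnitudes)) / total
    return math.sqrt(variance)


print(spectral_spread([100.0, 300.0], [1.0, 1.0]))  # 100.0
print(spectral_spread([100.0, 300.0], [1.0, 0.0]))  # 0.0
```

A spectrum with a single peak has zero spread, and the value grows as energy moves away from the centroid.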

6. Spectral entropy:-

The entropy of the normalized spectral energies for a set of sub-frames.

A related but distinct idea is maximum entropy spectral estimation. This method of estimating a spectrum chooses the spectrum that corresponds to the most random, or most unpredictable, time series whose autocorrelation function agrees with the known values. It draws on both statistical mechanics and information theory.

It is simply the application of maximum entropy modeling to any type of spectrum and is used in all fields where data is presented in spectral form.

In spectral analysis, the expected peak shape is often known, but in a noisy spectrum the center of the peak may not be clear. In such a case, inputting the known information allows the maximum entropy model to derive a better estimate of the center of the peak, thus improving spectral accuracy.
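The spectral entropy feature itself (the entropy of the normalized spectral energies) can be sketched in a few lines. The example assumes a magnitude spectrum given as a list, splits it into 10 equally sized sub-bands, and uses base-2 logarithms; all three choices are assumptions for illustration.

```python
import math


def spectral_entropy(magnitudes, n_sub_bands=10):
    """Entropy of the normalized energies of equally sized spectral sub-bands."""
    band_len = len(magnitudes) // n_sub_bands
    if band_len == 0:
        raise ValueError("spectrum is shorter than the number of sub-bands")
    energies = [
        sum(m * m for m in magnitudes[i * band_len:(i + 1) * band_len])
        for i in range(n_sub_bands)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0
    probs = [e / total for e in energies]
    return -sum(p * math.log2(p) for p in probs if p > 0)


flat = [1.0] * 20            # noise-like: energy in every band
peaky = [0.0] * 19 + [1.0]   # tone-like: energy in a single band
print(spectral_entropy(flat) > spectral_entropy(peaky))  # True
```

A flat, noise-like spectrum gives a high entropy, while a spectrum dominated by one band gives a value near zero.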

7. Spectral flux:-

The squared difference between the normalized magnitudes of the spectra of two successive frames.

It is a measure of how quickly the power spectrum of a signal is changing.

It is calculated by comparing the power spectrum of one frame against the power spectrum of the previous frame, usually as the Euclidean distance between the two normalized spectra.

It can be used to determine the timbre of an audio signal or in onset detection among other things.
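A minimal sketch of the squared-difference form described above, assuming the magnitude spectra of two successive frames are given as equally long lists. Each spectrum is normalized to sum to 1 before the comparison; the helper name `_normalize` is illustrative.

```python
def _normalize(magnitudes):
    """Scale a spectrum so that its magnitudes sum to 1 (all zeros if silent)."""
    total = sum(magnitudes)
    if total == 0:
        return [0.0] * len(magnitudes)
    return [m / total for m in magnitudes]


def spectral_flux(prev_magnitudes, cur_magnitudes):
    """Sum of squared differences between two normalized magnitude spectra."""
    prev_norm = _normalize(prev_magnitudes)
    cur_norm = _normalize(cur_magnitudes)
    return sum((c - p) ** 2 for p, c in zip(prev_norm, cur_norm))


print(spectral_flux([1.0, 0.0], [0.0, 1.0]))  # 2.0
print(spectral_flux([1.0, 0.0], [1.0, 0.0]))  # 0.0
```

Because of the normalization, a frame that is only louder than the previous one (same spectral shape) produces a flux of about zero; taking the square root of the result gives the Euclidean distance.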

8. Spectral roll-off:-

The frequency below which 85% of the magnitude distribution of the spectrum is concentrated.

This is a measure of the amount of the right skewness of the power spectrum.

The roll-off frequency is defined for each frame as the center frequency of a spectrogram bin such that at least roll_percent (0.85 by default) of the energy of the spectrum in this frame is contained in this bin and the bins below. This can be used to, e.g., approximate the maximum (or minimum) frequency by setting roll_percent to a value close to 1 (or 0).
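The definition can be sketched as a cumulative sum over the bins of one frame. This is a plain-Python illustration that takes the bin frequencies and magnitudes as lists; it borrows the parameter name `roll_percent` from the text but is not a library implementation.

```python
def spectral_rolloff(freqs, magnitudes, roll_percent=0.85):
    """Lowest bin frequency at or below which roll_percent of the total lies."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    threshold = roll_percent * total
    cumulative = 0.0
    for f, m in zip(freqs, magnitudes):
        cumulative += m
        if cumulative >= threshold:
            return f
    return freqs[-1]


freqs = [100.0, 200.0, 300.0, 400.0]
print(spectral_rolloff(freqs, [1.0, 1.0, 1.0, 1.0]))        # 400.0
print(spectral_rolloff(freqs, [1.0, 1.0, 1.0, 1.0], 0.5))   # 200.0
print(spectral_rolloff(freqs, [10.0, 0.0, 0.0, 0.0]))       # 100.0
```

A spectrum concentrated in the low bins has a low roll-off frequency, while a flat spectrum pushes it toward the highest bin.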


9. MFCCs:-

Mel frequency cepstral coefficients (MFCCs) form a cepstral representation in which the frequency bands are not linear but distributed according to the mel scale. They are commonly used to represent the texture or timbre of a sound.

-> The main purpose of the MFCC processor is to mimic the behavior of the human ear.

-> MFCCs are less susceptible to variations.

-> Speech input is typically recorded at a sampling rate above 10000 Hz.

-> This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion.

-> Sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans.
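To make the mel scale more concrete, the sketch below uses one widely used approximation, mel = 2595 * log10(1 + f / 700). This particular formula and the helper names are assumptions for illustration; other variants of the mel scale exist.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 approximation)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_filters + 1)]


centers = mel_center_frequencies(10, 0.0, 5000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
print(all(later > earlier for earlier, later in zip(gaps, gaps[1:])))  # True
```

Filters that are evenly spaced in mels end up closely spaced at low frequencies and increasingly far apart at high frequencies, which is how a mel filter bank mimics the resolution of human hearing.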


The MFCC feature extraction technique basically includes windowing the signal, applying the DFT, taking the log of the magnitude, and then warping the frequencies on a mel scale, followed by applying the inverse DCT. The steps involved in MFCC feature extraction are described below.

>Pre-emphasis: Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its purpose is to balance the spectrum of voiced sounds that have a steep roll-off in the high-frequency region. For voiced sounds, the glottal source has an approximately −12 dB/octave slope. However, when the acoustic energy radiates from the lips, this causes a roughly +6 dB/octave boost to the spectrum.

As a result, a speech signal when recorded with a microphone from a distance has approximately a −6 dB/octave slope downward compared to the true spectrum of the vocal tract. Therefore, pre-emphasis removes some of the glottal effects from the vocal tract parameters. The most commonly used pre-emphasis filter is given by the following transfer function:

$$H(z) = 1 - b z^{-1} \tag{B.1}$$

where the value of b controls the slope of the filter and is usually between 0.4 and 1.0.
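In the time domain, the transfer function H(z) = 1 − b z^{-1} corresponds to y[n] = x[n] − b·x[n−1]. A minimal sketch follows; the default b = 0.97 is an assumed value inside the range given above, and the first sample is passed through unchanged because it has no predecessor.

```python
def pre_emphasis(signal, b=0.97):
    """Apply the first-order filter y[n] = x[n] - b * x[n-1]."""
    if not signal:
        return []
    output = [signal[0]]  # no previous sample exists for n = 0
    for n in range(1, len(signal)):
        output.append(signal[n] - b * signal[n - 1])
    return output


print(pre_emphasis([1.0, 1.0, 1.0], b=0.5))    # [1.0, 0.5, 0.5]
print(pre_emphasis([1.0, -1.0, 1.0], b=0.5))   # [1.0, -1.5, 1.5]
```

The slowly varying (low-frequency) signal is attenuated while the rapidly alternating (high-frequency) one is amplified, which is the intended high-frequency emphasis.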

>Windowing: The speech signal is a slowly time-varying or quasi-stationary signal. For stable acoustic characteristics, speech needs to be examined over a sufficiently short period of time. Therefore, speech analysis must always be carried out on short segments across which the speech signal is assumed to be stationary. Short-term spectral measurements are typically carried out over 20 ms windows, and advanced every 10 ms [2, 3]. Advancing the time window every 10 ms enables the temporal characteristics of individual speech sound to be tracked and the 20 ms analysis window is usually sufficient to provide good spectral resolution of these sounds, and at the same time short enough to resolve

significant temporal characteristics. The purpose of the overlapping analysis is that each speech sound of the input sequence would be approximately centered at some frame.
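The framing described above (20 ms windows advanced every 10 ms) can be sketched as follows. The text does not name a window function; the Hamming window used here is an assumption, chosen because it is a common choice for speech analysis.

```python
import math


def frame_signal(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len < 2 or hop_len < 1:
        raise ValueError("sample rate too low for the requested frame sizes")
    # Hamming window (assumed): tapers the frame edges to reduce spectral leakage.
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# 0.1 s of a constant signal sampled at 10 kHz.
frames = frame_signal([1.0] * 1000, sample_rate=10000)
print(len(frames), len(frames[0]))  # 9 200
```

With a 10 ms hop and a 20 ms frame, consecutive frames overlap by half, so every part of the signal away from the edges is covered by two frames.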

> FFT spectrum: Each windowed frame is converted into a magnitude spectrum by applying FFT.

>Mel-spectrum: Mel-Spectrum is computed by passing the Fourier transformed signal through a set of band-pass filters known as mel-filter banks. Mel is a unit of measure based on the human ears perceived frequency. It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch linearly. The mel scale is approximately a linear frequency spacing below 1 kHz, and a logarithmic spacing above 1 kHz


10. Chroma:-

The chroma feature describes the quality of a pitch class, the “color” of a musical pitch. A pitch can be decomposed into an octave-invariant value called “chroma” and a “pitch height” that indicates the octave of the pitch.

>Chroma Vector:

A chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class, {C, C#, D, D#, E, ..., B}, is present in the signal. The chroma vector is a perceptually motivated feature vector. It uses the concept of chroma in the cyclic helix representation of musical pitch perception. The chroma vector thus represents magnitudes in twelve pitch classes in a standard chromatic scale.

>Chroma deviation:

The standard deviation of the 12 chroma coefficients.
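Both quantities can be sketched in plain Python. The example assumes the bin frequencies and magnitudes of one frame are given as lists, maps every bin to the nearest equal-tempered pitch class relative to A = 440 Hz, and accumulates squared magnitudes as energy; the function names and the normalization to a sum of 1 are illustrative choices.

```python
import math


def chroma_vector(freqs, magnitudes, ref_hz=440.0):
    """12-element vector of normalized energy per pitch class (index 0 = C)."""
    chroma = [0.0] * 12
    for f, m in zip(freqs, magnitudes):
        if f <= 0:
            continue  # the DC bin has no pitch
        semitones_from_a = round(12 * math.log2(f / ref_hz))
        pitch_class = (semitones_from_a + 9) % 12  # A is 9 semitones above C
        chroma[pitch_class] += m * m
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma


def chroma_deviation(chroma):
    """Standard deviation of the 12 chroma coefficients."""
    mean = sum(chroma) / len(chroma)
    return math.sqrt(sum((c - mean) ** 2 for c in chroma) / len(chroma))


# Two octaves of A (440 Hz and 880 Hz) fall into the same pitch class.
chroma = chroma_vector([440.0, 880.0], [1.0, 1.0])
print(chroma.index(max(chroma)))  # 9
```

Because octaves are folded together, the vector captures which notes are present regardless of the octave in which they are played; a low deviation means the energy is spread evenly over the pitch classes.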


In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as “pitch class profiles”, are a powerful tool for analyzing music whose pitches can be meaningfully categorized and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.

The main aim of chroma features is to represent the harmonic content of a short-time window of audio. The feature vector is extracted from the magnitude spectrum, which can be obtained with a short-time Fourier transform (STFT) or a constant-Q transform (CQT); chroma energy normalized statistics (CENS) are a further variant.

The harmonic pitch class profile (HPCP), one such chroma feature, can be computed with the following steps:

1. Input the musical signal.

2. Do spectral analysis to obtain the frequency components of the music signal.

3. Use the Fourier transform to convert the signal into a spectrogram.

4. Do frequency filtering. A frequency range between 100 and 5000 Hz is used.

5. Do peak detection: only the local maxima of the spectrum are considered.

6. Do the reference frequency computation: estimate the deviation with respect to 440 Hz.

7. Do pitch class mapping with respect to the estimated reference frequency. This is a procedure for determining the pitch class value from frequency values. A weighting scheme with a cosine function is used. It considers the presence of harmonic frequencies, taking into account a total of 8 harmonics for each frequency. To map the values at a resolution of one-third of a semitone, the size of the pitch class distribution vectors must be equal to 36.

8. Normalize the feature frame by frame, dividing by the maximum value to eliminate dependency on global loudness. The result is an HPCP sequence.

11. Perceptual Linear Prediction (PLP):-

The idea of a perceptual front end for determining Linear Prediction Cepstral Coefficients has been applied in different ways to improve speech detection and coding, as well as noise reduction, reverberation suppression, and echo cancellation. In so doing, we improve their performance while simultaneously reducing their computational load. Linear prediction of a signal is done via Autoregressive Moving Average (ARMA) modeling of the time series. In an ARMA model, we express the current sample as:
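A common textbook form of the ARMA model is sketched below. The orders p and q and the coefficient names a_k and b_k are notation assumed here for illustration, and sign conventions differ between references.

```latex
y[n] = \sum_{k=1}^{p} a_k \, y[n-k] + \sum_{k=0}^{q} b_k \, x[n-k]
```

Dropping all input terms except the current one leaves the autoregressive (AR) part, in which the current output is predicted from the previous p outputs only.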

Here x[n] is the current input signal and y[n] is the current output. In speech processing, we do not have access to the input signal x, so we only perform autoregressive modeling. This is fortunate because we can solve these equations easily with the Levinson-Durbin recursion. The perceptual linear prediction coefficients are created from the linear prediction coefficients by performing perceptual processing before performing the autoregressive modeling. This perceptual front-end takes the following form:

After this processing, we perform the cepstral conversion. This is because linear prediction coefficients are very sensitive to frame synchronization and numerical error. In other words, the linear prediction cepstral coefficients are much more stable than the linear prediction coefficients themselves. To do this, we run the following recursion to compute the perceptual linear prediction cepstral coefficients:

12. Relative Spectral Analysis - Perceptual Linear Prediction (RASTA-PLP):-

A special band-pass filter was added to each frequency subband in the traditional PLP algorithm in order to smooth out short-term noise variations and to remove any constant offset in the speech channel. The following figure shows the main processes involved in RASTA-PLP, which include calculating the critical-band power spectrum as in PLP, transforming the spectral amplitude through a compressing static nonlinear transformation, filtering the time trajectory of each transformed spectral component by the band-pass filter using the equation as given below, transforming the filtered speech via an expanding static nonlinear transformation, simulating the power law of hearing, and finally computing an all-pole model of the spectrum.

13. Linear predictive coding (LPC):-

For estimating the basic parameters of a speech signal, linear prediction cepstral coefficients (LPCC) have become one of the predominant techniques. The basic idea behind this method is that the speech sample at the current time can be predicted as a linear combination of past speech samples.
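The linear-combination idea can be illustrated with a small sketch that estimates the predictor coefficients from the autocorrelation of a frame using the Levinson-Durbin recursion. It is written in plain Python with illustrative function names, skips pre-emphasis and windowing, and uses the convention s[n] ≈ a[1]·s[n−1] + ... + a[p]·s[n−p].

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Predictor coefficients a[1..order] via the Levinson-Durbin recursion."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    if error == 0:
        return a[1:]
    for i in range(1, order + 1):
        # Reflection coefficient for the step from order i-1 to order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k  # remaining prediction error energy
    return a[1:]


# A decaying exponential obeys s[n] = 0.9 * s[n-1] exactly.
signal = [0.9 ** n for n in range(200)]
print(round(lpc_coefficients(signal, order=1)[0], 6))  # 0.9
```

For the test signal a first-order predictor is enough, so the recursion recovers the coefficient 0.9; real speech frames typically need a higher order.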

The input signal is first pre-emphasized using a first-order high-pass filter. Because the energy in a speech signal is concentrated more in the lower frequencies than in the higher frequencies, pre-emphasis is applied to boost the energy in the high frequencies. The transfer function of this filter in the z-domain is expressed as

$$H(z) = 1 - b z^{-1}$$

which is the same first-order filter used for pre-emphasis in the MFCC front end.


Researched by

Khirod Kumar Behera and Siva Sravya Veeravalli
