Audio analysis feature extraction
10 min read · Rohit Pruthi

We started working on a Kaggle competition that deals with audio data. This notebook collects our learnings from the field of audio analytics.

1.0 Basics of Audio Analysis

1.1 What is sound?

  • Audio signals are longitudinal pressure fluctuations (compressions and rarefactions of air pressure).

  • When someone talks, their voice generates these pressure waves in the air.

  • The number of compression/rarefaction cycles per second is the frequency of the sound wave (a small synthetic example follows this list).

  • The figure below shows how the pressure signal changes with time.
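To make the frequency idea concrete, here is a minimal sketch (not part of the competition code) that synthesizes a one-second 440 Hz tone with NumPy; the number of pressure cycles per second is exactly the frequency we hear.

import numpy as np
from IPython.display import Audio

sr_demo = 22050                                      # samples per second
t = np.linspace(0, 1.0, sr_demo, endpoint=False)     # one second of time stamps
tone = 0.5 * np.sin(2 * np.pi * 440 * t)             # 440 Hz pure tone (concert A)

Audio(tone, rate=sr_demo)                            # play it back in the notebook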

# ! pip install librosa
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import librosa ## for working with audio data (https://librosa.org/doc/latest/tutorial.html)
import librosa.display ## display functions with matplotlib

from pathlib import Path ## for reading the files into the notebook
from IPython.display import Audio ## play audio file

import matplotlib.pyplot as plt ## visualization
audio, sr = librosa.load('C:/Users/Home/Documents/Audio_data/00204008d.flac') 

## reading a sample file for trial
waveplot = librosa.display.waveplot(audio, sr=sr)

## Generating a pressure amplitude plot (newer librosa versions rename this to librosa.display.waveshow)

[Figure: waveform plot of pressure amplitude vs. time for the sample clip]

As expected, this is high-frequency pressure-variation data: a one-minute clip at a sampling rate of 22050 Hz contains about 1.3 million data points.

Human ears can hear roughly 20 Hz to 20,000 Hz, so by the Nyquist theorem a sampling rate of about 40 kHz is enough to capture everything we can perceive.
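Note that librosa.load resamples to 22050 Hz by default; if the file's native rate should be preserved instead, sr=None can be passed. A small sketch, reusing the same sample path as above:

## Load at the file's native sampling rate instead of the 22050 Hz default
audio_native, sr_native = librosa.load('C:/Users/Home/Documents/Audio_data/00204008d.flac', sr=None)
print(sr_native, len(audio_native))   ## native rate and number of samples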

1.2 How do we hear sound?

  • Signal processing happens via cochlea, a fluid-filled part of the ear with thousands of tiny hairs that are connected to nerves.

  • Some of the hairs are short, and some are relatively longer.

  • The shorter hairs resonate with higher sound frequencies, and the longer hairs resonate with lower sound frequencies.

  • Therefore, the ear is like a natural Fourier transform analyzer!

  • A power spectrogram is a common way of visualizing this Fourier decomposition, as shown in the figure below.

def visualize_sound_data_fft(i):

    try:
        fig, (ax1, ax2) = plt.subplots(2, 1)
        fig.suptitle('Sound Data Waveform and Power Spectrogram (Fourier transform in dB)')

        ## read the audio data
        ## audio, sr = librosa.load(trainfiles[i])

        waveplot = librosa.display.waveplot(audio, sr=sr, ax=ax1)

        D = np.abs(librosa.stft(audio))               ## short-time Fourier transform magnitudes
        db = librosa.amplitude_to_db(D, ref=np.max)   ## convert amplitudes to decibels
        librosa.display.specshow(db, sr=sr, y_axis='log', x_axis='time', ax=ax2)

    except Exception as e:
        print("Error encountered while parsing file: ", i)
        return None
visualize_sound_data_fft(3)

[Figure: waveform (top) and log-frequency power spectrogram (bottom) for the sample clip]

This is a reasonable representation, though difficult to interpret directly. One thing is clear, though:

  • There is a consistently strong component near 4 kHz throughout the clip, visible as the bright horizontal line at about 4096 Hz.
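One way to sanity-check that bright line is to average the STFT magnitude over time and look for the strongest frequency bin; a rough sketch, reusing the clip already loaded above:

D = np.abs(librosa.stft(audio))            ## magnitude spectrogram
freqs = librosa.fft_frequencies(sr=sr)     ## frequency value of each STFT row
mean_mag = D.mean(axis=1)                  ## average magnitude per frequency bin
print(freqs[np.argmax(mean_mag)])          ## frequency with the highest average energy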

1.3 How does the brain perceive sound?

  • The signal from the cochlea helps the brain recognize that the input is speech, and memory enables it to understand what someone (or something) is saying.

  • The human ear has a preferred range of frequencies it can pick out, which effectively acts as a filter on the signal.

  • GFCCs (gammatone frequency cepstral coefficients) and MFCCs (mel-frequency cepstral coefficients) are close simulations of how the ear perceives sound.

  • Typically 13 cepstral coefficients are sufficient to capture the phonemes. See the figure below for the filtered spectral representation.

def visualize_sound_data_waveplot_mfcc(i, num_mfcc):

    try:
        fig, (ax1, ax3) = plt.subplots(2, 1)
        fig.suptitle('Sound Data Waveform and MFCC Spectrogram')

        ## read the audio data
        ## audio, sr = librosa.load(trainfiles[i])

        waveplot = librosa.display.waveplot(audio, sr=sr, ax=ax1)

        mfcc = librosa.feature.mfcc(audio,
                                    sr=sr,
                                    n_mfcc=num_mfcc)   ## cepstral coefficients per frame

        mfccplot = librosa.display.specshow(mfcc, x_axis='time', y_axis='log', ax=ax3)

    except Exception as e:
        print("Error encountered while parsing file: ", i)
        return None
visualize_sound_data_waveplot_mfcc(3,13)

[Figure: waveform (top) and 13-coefficient MFCC heatmap (bottom) for the sample clip]

1.4 How do you mathematically transform sound data?

Since sound in its raw form is essentially high-frequency time-domain data, time series features can be used for its assessment. It can also be converted to the frequency domain for spectral analysis. This is not unlike general time series characterisation, just a bit more specialized for audio signals.

1.4.1 Time series features

  • Amplitude-derived features - sum, length, mean, median, etc.
  • Power - amplitude transformed with a time weighting
  • Autoregression features (a small LPC sketch follows this list)
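The minimal tsfresh settings used below do not include autoregression terms; one lightweight way to get AR-style coefficients for audio is linear predictive coding, sketched here with librosa.lpc (order 4 chosen arbitrarily):

## Linear prediction coefficients approximate an autoregressive model of the waveform
lpc_coeffs = librosa.lpc(audio, order=4)
print(lpc_coeffs)   ## first value is always 1.0, the rest are the AR coefficients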
#!pip install tsfresh
import tsfresh

Only a minimal feature set is selected here for demonstration; a wider list can be chosen and added in later iterations.
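If a richer set is needed later, tsfresh ships broader presets; swapping the settings object in extract_ts_features below is the only change required (shown as a pointer, not run here):

## Drop-in replacement for MinimalFCParameters, computes a much larger feature set
settings = tsfresh.feature_extraction.EfficientFCParameters()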

def extract_ts_features(i):
    # read file
    # audio, sr = librosa.load(trainfiles[i])

    df = pd.DataFrame(audio).reset_index()
    df.columns = ['time', 'amplitude']   # sample index acts as the time column
    df['col_id'] = i                     # tsfresh groups rows by this id

    settings = tsfresh.feature_extraction.MinimalFCParameters()
    return tsfresh.feature_extraction.extract_features(df, default_fc_parameters=settings, column_id='col_id', column_sort='time')
extract_ts_features(3).T
Feature Extraction: 100%|████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.21s/it]

3
amplitude__sum_values 1.891127e+01
amplitude__median -4.939159e-05
amplitude__mean 1.429423e-05
amplitude__length 1.323000e+06
amplitude__standard_deviation 8.452455e-03
amplitude__variance 7.144399e-05
amplitude__maximum 1.081416e-01
amplitude__minimum -1.041415e-01

1.4.2 Spectral features

  • Zero crossing rate - number of times the signal crosses the 0-axis
  • Spectral roll-off - the frequency below which 85% of the accumulated spectral magnitude is concentrated (the threshold is configurable, see the note after this list)
  • Spectral centroid and its derivatives
  • Spectral flux and its derivatives
  • Spectral bandwidth and its derivatives
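The 85% threshold for the roll-off is librosa's default; it can be changed with the roll_percent argument if a different cutoff is wanted, e.g.:

## Roll-off frequency below which 95% of the spectral energy sits (default roll_percent is 0.85)
rolloff_95 = librosa.feature.spectral_rolloff(y=audio, sr=sr, roll_percent=0.95)
print(np.mean(rolloff_95))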
def extract_spectral_features(i):
    try:

        # audio, sr = librosa.load(trainfiles[i])
        # read audio file

        zcr = sum(librosa.zero_crossings(audio)) / len(audio)
        ## 1. zero crossing rate

        spectral_centroids = librosa.feature.spectral_centroid(audio, sr=sr)[0]
        spectral_centroids_delta = librosa.feature.delta(spectral_centroids)
        spectral_centroids_accelerate = librosa.feature.delta(spectral_centroids, order=2)
        ## 2. spectral centroid and its first/second derivatives

        spectral_rolloff = librosa.feature.spectral_rolloff(audio, sr=sr)[0]
        ## 3. spectral roll-off

        onset_env = librosa.onset.onset_strength(y=audio, sr=sr)
        ## 4. spectral flux (onset strength envelope)

        spectral_bandwidth_2 = librosa.feature.spectral_bandwidth(audio, sr=sr)[0]
        spectral_bandwidth_3 = librosa.feature.spectral_bandwidth(audio, sr=sr, p=3)[0]
        spectral_bandwidth_4 = librosa.feature.spectral_bandwidth(audio, sr=sr, p=4)[0]
        ## 5. spectral bandwidth with p = 2, 3, 4

        ## data frame structure definition (frame-level features reduced to their means)
        spectral_features = {
            "zero_crossing_rate": zcr,
            "spectral_centroids": np.mean(spectral_centroids),
            "spectral_centroids_delta": np.mean(spectral_centroids_delta),
            "spectral_centroids_accelerate": np.mean(spectral_centroids_accelerate),
            "spectral_rolloff": np.mean(spectral_rolloff),
            "spectral_flux": np.mean(onset_env),
            "spectral_bandwidth_2": np.mean(spectral_bandwidth_2),
            "spectral_bandwidth_3": np.mean(spectral_bandwidth_3),
            "spectral_bandwidth_4": np.mean(spectral_bandwidth_4),
        }

        df = pd.DataFrame.from_records(data=[spectral_features])

        return df

    except Exception:
        print("Error encountered while parsing file: ", i)
        return None
extract_spectral_features(3).T

0
zero_crossing_rate 0.350822
spectral_centroids 3919.560069
spectral_centroids_delta 0.116211
spectral_centroids_accelerate -0.000484
spectral_rolloff 5591.041192
spectral_flux 0.911508
spectral_bandwidth_2 2250.891289
spectral_bandwidth_3 2746.466933
spectral_bandwidth_4 3119.184909

1.4.3 Chroma and spectrogram features

  • Pitch class profiles - used to characterize melody and harmony
  • Normalized amplitude power spectrograms
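The extraction function in section 1.4.4 below includes a chromagram; as a standalone illustration, a chroma matrix has one row per pitch class (12 rows) and one column per frame:

## 12 pitch classes (C, C#, ..., B) x number of frames
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
print(chroma.shape)               ## (12, n_frames)
print(np.mean(chroma, axis=1))    ## average strength of each pitch class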

1.4.4 Cepstral coefficients

What is a cepstrum?

  • The starting point is a Fourier transform of the signal.
  • A cepstrum is then formed by taking the log magnitude of that spectrum,
  • followed by an inverse Fourier transform.
  • The result is neither in the frequency domain (because we took an inverse Fourier transform)
  • nor in the time domain (because we took the log magnitude before the inverse Fourier transform).
  • The domain of the resulting signal is called the quefrency domain (a small NumPy sketch follows this list).
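To make the quefrency idea concrete, here is a minimal NumPy sketch of a real cepstrum for a single frame (log magnitude of the spectrum, then an inverse FFT). This is illustrative only; the librosa MFCC call used below additionally applies mel filtering and a DCT.

frame = audio[:2048]                                 ## one short analysis frame
spectrum = np.fft.rfft(frame)                        ## frequency domain
log_magnitude = np.log(np.abs(spectrum) + 1e-10)     ## log magnitude (avoid log(0))
cepstrum = np.fft.irfft(log_magnitude)               ## back-transform -> quefrency domain
print(cepstrum[:13])                                 ## low-quefrency coefficients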

Cepstral features

  • Mel-frequency cepstral coefficients (MFCC)
  • Gammatone frequency cepstral coefficients (GFCC)
def extract_cepstral_features(i, n_fft=2048, hop_length=512, num_mfcc=13):
    try:

        # audio, sr = librosa.load(trainfiles[i])
        # read audio file

        d_audio = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length))
        ## 1. short-time Fourier transform

        db_audio = librosa.amplitude_to_db(d_audio, ref=np.max)
        ## 2. spectrogram in decibels

        s_audio = librosa.feature.melspectrogram(audio, sr=sr)
        s_db_audio = librosa.amplitude_to_db(s_audio, ref=np.max)
        ## 3. mel spectrogram

        y_harm, y_perc = librosa.effects.hpss(audio)
        ## 4. harmonic and percussive components

        chromagram = librosa.feature.chroma_stft(audio, sr=sr, hop_length=hop_length)
        ## 5. chromagram

        tempo_y, _ = librosa.beat.beat_track(audio, sr=sr)
        ## 6. tempo (BPM)

        mfcc_alt = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=num_mfcc)
        delta = librosa.feature.delta(mfcc_alt)
        accelerate = librosa.feature.delta(mfcc_alt, order=2)
        ## 7. MFCCs with their delta and acceleration coefficients

        ## data frame structure definition
        cepstral_features = {
            "spectrogram": np.mean(db_audio[0]),
            "mel_spectrogram": np.mean(s_db_audio[0]),
            "harmonics": np.mean(y_harm),
            "perceptual_shock_wave": np.mean(y_perc),
            "chroma1": np.mean(chromagram[0]),
            "chroma2": np.mean(chromagram[1]),
            "chroma3": np.mean(chromagram[2]),
            "chroma4": np.mean(chromagram[3]),
            "chroma5": np.mean(chromagram[4]),
            "chroma6": np.mean(chromagram[5]),
            "chroma7": np.mean(chromagram[6]),
            "chroma8": np.mean(chromagram[7]),
            "chroma9": np.mean(chromagram[8]),
            "chroma10": np.mean(chromagram[9]),
            "chroma11": np.mean(chromagram[10]),
            "chroma12": np.mean(chromagram[11]),
            "tempo_bpm": tempo_y,
        }

        for n in range(0, num_mfcc):   ## use n so the file index i is not shadowed
            key_name = "".join(['mfcc', str(n)])
            mfcc_value = np.mean(mfcc_alt[n])
            cepstral_features.update({key_name: mfcc_value})

            # mfcc delta coefficient
            key_name = "".join(['mfcc_delta_', str(n)])
            mfcc_value = np.mean(delta[n])
            cepstral_features.update({key_name: mfcc_value})

            # mfcc accelerate coefficient
            key_name = "".join(['mfcc_accelerate_', str(n)])
            mfcc_value = np.mean(accelerate[n])
            cepstral_features.update({key_name: mfcc_value})

        df = pd.DataFrame.from_records(data=[cepstral_features])

        return df

    except Exception:
        print("Error encountered while parsing file: ", i)
        return None
extract_cepstral_features(3).T

0
spectrogram -44.306530
mel_spectrogram -46.476318
harmonics 0.000004
perceptual_shock_wave 0.000011
chroma1 0.858836
chroma2 0.341035
chroma3 0.089673
chroma4 0.055772
chroma5 0.051046
chroma6 0.045707
chroma7 0.044243
chroma8 0.054434
chroma9 0.085203
chroma10 0.163520
chroma11 0.360089
chroma12 0.828482
tempo_bpm 117.453835
mfcc0 -361.731079
mfcc_delta_0 0.008489
mfcc_accelerate_0 0.000680
mfcc1 33.958717
mfcc_delta_1 -0.006295
mfcc_accelerate_1 -0.000676
mfcc2 -4.292808
mfcc_delta_2 -0.000619
mfcc_accelerate_2 -0.000915
mfcc3 37.913280
mfcc_delta_3 0.009392
mfcc_accelerate_3 -0.001926
mfcc4 -24.322163
mfcc_delta_4 -0.004232
mfcc_accelerate_4 0.001942
mfcc5 10.843858
mfcc_delta_5 -0.000264
mfcc_accelerate_5 0.000652
mfcc6 20.269360
mfcc_delta_6 0.002119
mfcc_accelerate_6 0.001280
mfcc7 -18.988855
mfcc_delta_7 -0.005971
mfcc_accelerate_7 0.001584
mfcc8 14.028047
mfcc_delta_8 0.005159
mfcc_accelerate_8 -0.000931
mfcc9 14.278880
mfcc_delta_9 0.004275
mfcc_accelerate_9 -0.001039
mfcc10 -19.282475
mfcc_delta_10 -0.001716
mfcc_accelerate_10 -0.000845
mfcc11 20.883804
mfcc_delta_11 0.003940
mfcc_accelerate_11 -0.001970
mfcc12 -1.853068
mfcc_delta_12 -0.001870
mfcc_accelerate_12 0.000445

1.4.5 Deep learning approach

  • Spectrograms, chromagrams, and mel spectrograms can be converted to images and fed to a neural network, as in the plotting function below.
def generate_image_data(i, num_mfcc=13):

    try:
        y, sr = librosa.load('C:/Users/Home/Documents/Audio_data/00204008d.flac')
        # read audio file (hard-coded sample for now; would be trainfiles[i] in a loop)

        D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
        plt.subplot(4, 2, 1)
        librosa.display.specshow(D, y_axis='linear')
        plt.axis('off')

        plt.subplot(4, 2, 2)
        librosa.display.specshow(D, y_axis='log')
        plt.axis('off')

        CQT = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)), ref=np.max)
        plt.subplot(4, 2, 3)
        librosa.display.specshow(CQT, y_axis='cqt_note')
        plt.axis('off')

        plt.subplot(4, 2, 4)
        librosa.display.specshow(CQT, y_axis='cqt_hz')
        plt.axis('off')

        C = librosa.feature.chroma_cqt(y=y, sr=sr)
        plt.subplot(4, 2, 5)
        librosa.display.specshow(C, y_axis='chroma')
        plt.axis('off')

        plt.subplot(4, 2, 6)
        librosa.display.specshow(D, cmap='gray_r', y_axis='linear')
        plt.axis('off')

        plt.subplot(4, 2, 7)
        librosa.display.specshow(D, x_axis='time', y_axis='log')
        plt.axis('off')

        plt.subplot(4, 2, 8)
        Tgram = librosa.feature.tempogram(y=y, sr=sr)
        librosa.display.specshow(Tgram, x_axis='time', y_axis='tempo')
        plt.axis('off')

        plt.tight_layout()

        # Draw beat-synchronous chroma in natural time

        plt.figure()
        tempo, beat_f = librosa.beat.beat_track(y=y, sr=sr, trim=False)
        beat_f = librosa.util.fix_frames(beat_f, x_max=C.shape[1])
        Csync = librosa.util.sync(C, beat_f, aggregate=np.median)
        beat_t = librosa.frames_to_time(beat_f, sr=sr)

        ax1 = plt.subplot(3, 1, 1)
        librosa.display.specshow(C, y_axis='chroma', x_axis='time')
        plt.axis('off')

        ax2 = plt.subplot(3, 1, 2, sharex=ax1)
        librosa.display.specshow(Csync, y_axis='chroma', x_axis='time',
                                 x_coords=beat_t)
        plt.axis('off')

        ax3 = plt.subplot(3, 1, 3)
        mfcc = librosa.feature.mfcc(y, sr=sr, n_mfcc=num_mfcc)
        mfccplot = librosa.display.specshow(mfcc, x_axis='time', y_axis='log', ax=ax3)
        plt.axis('off')

        plt.tight_layout()

    except Exception as e:
        print("Error encountered while parsing file: ", i)
        return None
generate_image_data(3)

[Figure: 4x2 panel of linear/log spectrograms, CQT, chroma, grayscale spectrogram, and tempogram views of the sample clip]

[Figure: chromagram, beat-synchronous chromagram, and MFCC panels for the sample clip]
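For the neural-network route, such plots would normally be turned into fixed-size arrays rather than matplotlib figures. A rough sketch of one possible preprocessing step (mel spectrogram in dB, rescaled to 0-1 so it can be treated like a grayscale image); the final resize/crop depends on the chosen network:

S = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)
img = (S_db - S_db.min()) / (S_db.max() - S_db.min())   ## rescale to [0, 1]
print(img.shape)   ## (128, n_frames) -- resize/crop to the input size the network expects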

Exploration in progress

  • Parallel processing of features to make optimal use of Kaggle resources (see the sketch below).
  • Does tsfresh, with its extensive array of time series features, capture the essence of audio processing as well?
  • Would combining metadata derived from extracted features with spectrogram images lead to a better result?
  • Can multiple spectrograms be combined together?
  • Feed the features into a modeling pipeline and tune it.
  • Understand how the image data fits with existing network architectures.
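For the first point, one option under consideration is joblib, which can run the per-file feature extraction in parallel. This is only a sketch: it assumes a trainfiles list of paths and that the commented-out librosa.load(trainfiles[i]) line inside each extract_* function is re-enabled, since the functions currently reuse the single globally loaded clip.

from joblib import Parallel, delayed

## Hypothetical: run the spectral feature extraction over many files in parallel
results = Parallel(n_jobs=4)(delayed(extract_spectral_features)(i) for i in range(len(trainfiles)))
feature_df = pd.concat(results, ignore_index=True)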

Reference for reading

Given that we as a team don't normally deal with audio data, the reference list below was used to learn some of the facts listed above.