The goal of the project was to count the number of elephants in a sound file.
To do so, we detected whether rumbles belong to the same elephant or to different ones.
Poole, Joyce H. (1999). Signals and assessment in African elephants: evidence from playback experiments. Animal Behaviour, 58(1), 185-193.
Jarne, Cecilia (2019). A method for estimation of fundamental frequency for tonal sounds inspired on bird song studies. MethodsX, 6, 124-131.
Stoeger, Angela S. et al. (2012). Visualizing sound emission of elephant vocalizations: evidence for two rumble production types.
O'Connell-Rodwell, C. E. et al. (2000). Seismic properties of Asian elephant (Elephas maximus) vocalizations and locomotion. Journal of the Acoustical Society of America, 108(6), 3066-3072.
Heffner, R. S., & Heffner, H. E. (1982). Hearing in the elephant (Elephas maximus): Absolute sensitivity, frequency discrimination, and sound localization. Journal of Comparative and Physiological Psychology, 96(6), 926-944.
Sound files can be analysed by transforming them into a 2D image: a spectrogram of time (seconds) vs frequency (hertz). The third dimension is sound intensity (decibels), which can be shown as colour or grayscale.
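The spectrogram transformation can be sketched as follows. This is a minimal illustration on a synthetic signal; the sampling rate, segment length, and FFT window size are assumptions for the example, not the project's actual settings.

```python
import numpy as np
from scipy import signal

fs = 1000                            # assumed sampling rate (Hz)
t = np.arange(0, 4.0, 1 / fs)        # 4-second segment
rumble = np.sin(2 * np.pi * 25 * t)  # synthetic 25 Hz "rumble"

# Spectrogram: frequency bins (Hz) vs time frames (s), intensity in dB
f, times, Sxx = signal.spectrogram(rumble, fs=fs, nperseg=1024)
Sxx_db = 10 * np.log10(Sxx + 1e-12)

# The loudest frequency bin should sit near the 25 Hz tone
peak_hz = f[Sxx.mean(axis=1).argmax()]
print(peak_hz)
```

With a 1024-sample FFT window the frequency resolution is about 1 Hz, which is enough to separate rumbles in the 10-50 Hz band of interest.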
Elephants produce rumbles to communicate, typically at frequencies of 10-50 Hz and lasting 2-6 seconds.
One elephant rumble will have many harmonics: overtones at integer multiples of the fundamental frequency.
An elephant can be identified by the fundamental (base) frequency of its rumble. If two slightly overlapping or separated rumbles have different base frequencies, they probably belong to separate animals.
We received a set of sound files (.wav) and metadata that pointed us to the segments where elephants were likely to produce rumbles.
Big data set
Joining the files might be a challenge
Labels / annotations don't mention the number of elephants
Segmenting data: based on the metadata files, we create segments of a few seconds that contain the interesting information
Spectrograms: each data segment is transformed into a 2D image of time vs frequency (10-50 Hz), using an FFT and low-pass/high-pass frequency filters
Noise reduction: noise is removed from each spectrogram, which is then thresholded into a simple monochrome (black and white) image
Contour detection: each monochrome image is evaluated with a contour detection algorithm to distinguish the separate 'objects', which in our case are the elephant rumbles
Boxing: for each contour (potential elephant rumble), we calculate the size (height and width) by drawing a bounding box around the contour
Counting: we compare the boxes that identify the rumbles within each spectrogram. Based on a few business rules, we count the number of unique elephant rumbles in each image
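The noise-reduction, contour, boxing, and counting steps above can be sketched as follows. This is a minimal illustration on a synthetic binary spectrogram mask: connected-component labelling stands in for the actual contour-detection algorithm, and the frequency-separation rule is an assumed placeholder, not the project's real business rules.

```python
import numpy as np
from scipy import ndimage

# Synthetic monochrome spectrogram mask: rows = frequency bins,
# columns = time frames. Two "rumbles" at different base frequencies,
# overlapping in time.
mask = np.zeros((40, 100), dtype=bool)
mask[5:9, 10:60] = True    # rumble A: low base frequency
mask[20:24, 40:90] = True  # rumble B: higher base frequency

# "Contour detection" stand-in: label connected components
labels, n_objects = ndimage.label(mask)

# "Boxing": one bounding box (pair of slices) per detected object
boxes = ndimage.find_objects(labels)

# "Counting": treat boxes whose base-frequency rows are clearly apart
# as belonging to different elephants (assumed rule for illustration)
base_rows = sorted(box[0].start for box in boxes)
min_separation = 5  # assumed threshold in frequency bins
count = 1 + sum((b - a) > min_separation
                for a, b in zip(base_rows, base_rows[1:]))
print(n_objects, count)  # 2 objects, counted as 2 elephants
```

The bounding boxes give both the position (base frequency, start time) and the size (duration, bandwidth) of each candidate rumble, which is what the comparison rules operate on.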
Aim: to use the processed spectrogram data as input to a CNN that automatically categorises how many elephants are present
Why are we doing this?
To automate the workflow end to end
To improve accuracy by reducing human error
To save time, enabling researchers to focus their attention on complex problems
Our Approach Transfer learning takes advantage of models that have been pre-trained on large datasets, then fine-tunes them for our specific problem. This approach is becoming very popular for several reasons (quicker training, better performance, less need for large amounts of data), and we found it to work well.
Implemented using Keras with a TensorFlow backend.
To evaluate the performance of our models, we looked at the following measures for our two most promising architectures:
Model - ResNet50
The following configuration was found to be optimal when running the classification task on ResNet50:
Batch size: 100
Weights: "imagenet"
Intermediate dense layers:
  Nodes: 256, 128, 64 respectively
  Activation: 'relu'
  Dropout: 0.5
Final dense layer:
  Activation: 'softmax'
Optimizer: Adam with a learning rate of 0.001
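The configuration above can be assembled in Keras roughly as follows. This is a sketch, not the project's actual code: the input image size (224x224x3), the number of output classes, the frozen base, and the use of global average pooling are all assumptions; only the ResNet50 base, ImageNet weights, dense-layer sizes, activations, dropout, and optimizer settings come from the stated configuration.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

NUM_CLASSES = 4  # assumption: one class per possible elephant count

# Pre-trained ImageNet base, without its classification head
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3))
base.trainable = False  # freeze pre-trained weights for fine-tuning the head

x = GlobalAveragePooling2D()(base.output)
for units in (256, 128, 64):          # intermediate dense layers
    x = Dense(units, activation="relu")(x)
    x = Dropout(0.5)(x)
out = Dense(NUM_CLASSES, activation="softmax")(x)  # final dense layer

model = Model(base.input, out)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then proceed with `model.fit(...)` on batches of 100 spectrogram images, as per the batch size above.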
Machine learning on spectrograms using labelled data