Microsoft Project15 & University of Oxford Capstone Project with Elephant Listening Project Team 4
Published Apr 08 2021 01:47 PM 2,088 Views

Oxford's AI Group 4 Project 15 Writeup


Who are we?

  • Abhishekh Baskaran
  • Bas Geerdink
  • Chandan Konuri
  • Henrietta Ridley
  • Jay Padmanabhan
  • Paulo Campos
  • Vishweshwar Manthani

The goal of the project was to count the number of elephants in a sound file. To do so, we detected whether rumbles belong to the same elephant or not.



References

  • Poole, Joyce H. (1999). Signals and assessment in African elephants: evidence from playback experiments. Animal Behaviour, 58(1), 185-193
  • Jarne, Cecilia (2019). A method for estimation of fundamental frequency for tonal sounds inspired on bird song studies. MethodsX, 6, 124-131
  • Stoeger, Angela S. et al (2012). Visualizing Sound Emission of Elephant Vocalizations: Evidence for Two Rumble Production Types.
  • O'Connell-Rodwell, C.E. et al (2000). Seismic properties of Asian elephant (Elephas maximus) vocalizations and locomotion. Journal of the Acoustical Society of America, 108(6), 3066-3072
  • Heffner, R. S., & Heffner, H. E. (1982). Hearing in the elephant (Elephas maximus): Absolute sensitivity, frequency discrimination, and sound localization. Journal of Comparative and Physiological Psychology, 96(6), 926–944
  • Elephant Listening Project, Cornell University:
  • Project 15, Microsoft: 



  • Sound files can be analysed by transforming them into a 2D image: a spectrogram of time (seconds) vs frequency (Hertz). The third dimension is sound intensity (decibels), which can be shown as colour or grayscale.
  • Elephants produce rumbles to communicate, typically at a frequency of 10 – 50 Hz and lasting 2 – 6 seconds.
  • One elephant rumble has many harmonics: sound waves at integer multiples of the base frequency.
  • An elephant can be identified by its base frequency. If two slightly overlapping or separated rumbles have different base frequencies, they probably belong to separate animals.
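As an illustration of these ideas (not the project's own code), a synthetic rumble and its 10 – 50 Hz spectrogram band can be computed with SciPy; the sample rate and tone frequencies below are made up for the example:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic "rumble": a 4-second, 25 Hz tone with harmonics at 50 and 75 Hz
# (integer multiples of the base frequency), sampled at 1000 Hz.
# A real recording would be read from a .wav file instead.
sr = 1000
t = np.arange(0, 4, 1 / sr)
signal = (np.sin(2 * np.pi * 25 * t)
          + 0.5 * np.sin(2 * np.pi * 50 * t)
          + 0.25 * np.sin(2 * np.pi * 75 * t))

# Time vs frequency spectrogram; the intensity values can be mapped
# to grayscale to obtain the 2D image described above.
freqs, times, sxx = spectrogram(signal, fs=sr, nperseg=512)

# Keep only the 10-50 Hz band where elephant rumbles live.
band = (freqs >= 10) & (freqs <= 50)
rumble_band = sxx[band, :]
print(rumble_band.shape)
```

The strongest row of `rumble_band` sits near 25 Hz, the base frequency of the synthetic rumble.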


We received a set of sound files (.wav) and metadata that pointed us to the segments where elephants were likely to produce rumbles. Challenges included:

  • A large data set
  • Joining the files might be a challenge
  • The labels / annotations don't mention the number of elephants




Data Pipeline

  1. Segmenting data: based on the metadata files, we create segments of a few seconds that contain the interesting information
  2. Spectrograms: each data segment is transformed into a 2D image of time vs frequency (10-50 Hz), using an FFT, lowpass/highpass filters, and frequency filters
  3. Noise reduction: noise is removed from each spectrogram, which is then turned into a simple monochrome (black and white) image
  4. Contour detection: each monochrome image is evaluated with a contour detection algorithm to distinguish the separate 'objects', which in our case are the elephant rumbles
  5. Boxing: for each contour (potential elephant rumble) we calculate the size (height and width) by drawing a box around it
  6. Counting: we compare the boxes that identify the rumbles to each other in each spectrogram. Based on a few business rules, we count the number of unique elephant rumbles in each image





Source Code

  • The source code is made available at: 
  • All code is written in Python and runs on-premises or in the cloud (Azure)
  • We used the following frameworks to process and analyze the data:
    • boto3 for connecting to Amazon AWS
    • NumPy, Pandas, SciPy and Matplotlib for statistical analysis and visualization
    • Librosa for FFT
    • noisereduce for noise reduction
    • SoundFile for reading and writing .wav files
    • OpenCV for contour detection 
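Assuming the usual PyPI package names (note that `opencv-python` is the pip name for OpenCV), the stack above can be installed with:

```shell
pip install boto3 numpy pandas scipy matplotlib librosa noisereduce soundfile opencv-python
```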

Video Presentation


  • We analysed 3935 elephant sounds:
    • 112 spectrograms were identified as containing 0 elephants
    • 3277 spectrograms were identified as containing 1 elephant
    • 505 spectrograms were identified as containing 2 elephants
    • 40 spectrograms were identified as containing 3 elephants


Results of the Boxing algorithm

  • The boxing algorithm was evaluated by Liz Rowland of Cornell University
  • The reported accuracy of the model is:
    • 97.29 % for the training dataset (3180 cases)
    • 99.29 % for the testing dataset (758 cases)
    • This indicates that the model is useful for counting elephants
  • In combination with other models (elephant detection), many interesting use cases can be built on top of this model, for example visualizing elephant movements and detecting poaching


Project 15 Architecture



Building ML Models

  • Aim
    Use the processed spectrogram data as input to a CNN to automatically categorise how many elephants are present
  • Why are we doing this? 
    • To enable end-to-end automation of the workflow
    • To improve accuracy by reducing human error
    • To save time, enabling researchers to focus their attention on complex problems
  • Our Approach
    Transfer learning takes advantage of models that have been pre-trained on large datasets, then fine-tunes them for a specific problem. This approach has become very popular for several reasons (quicker training, better performance, no need for lots of data), and we found it to work well. 


Model Summary

  • Implemented using Keras with a TensorFlow backend. 
  • To evaluate the performance of our models we looked at the following measures of our two most promising architectures:
    • ResNet50
      • accuracy: 0.9620
      • loss: 0.1622
    • VGGNet
      • accuracy: 0.9477
      • loss: 0.3252




Model - ResNet50

  • The configuration below was found to be optimal for the classification task on ResNet50:
    • Epochs: 25
    • Batch size: 100
    • Weights: "imagenet"
    • Intermediate dense layers: 
      • Nodes: 256, 128 and 64 respectively
      • Activation: 'relu'
      • Dropout: 0.5
      • BatchNormalization()
    • Final dense layer:
      • Nodes: 3 
      • Activation: 'softmax'
    • Optimizer: Adam with a learning rate of 0.001
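A minimal Keras sketch of this configuration (not the team's exact code): `weights=None` is used here only to avoid downloading the ImageNet weights, three intermediate dense layers of 256, 128 and 64 nodes are assumed, and `train_images`/`train_labels` are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Backbone: ResNet50 without its classification head.
# The write-up uses weights="imagenet"; None avoids the download here.
base = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # transfer learning: freeze the pre-trained layers

model = models.Sequential([base, layers.GlobalAveragePooling2D()])
for nodes in (256, 128, 64):  # intermediate dense layers
    model.add(layers.Dense(nodes, activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.5))
model.add(layers.Dense(3, activation='softmax'))  # final dense layer

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=25, batch_size=100)
```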



Sound files


Further Research

  • Machine learning on spectrograms using labelled data 
  • Automatic classification and better acoustic analysis
  • Further fine-tuning of the boxing algorithm might lead to even better results, e.g.
    • Fixing the time axis in the spectrograms
    • Increasing the frequency range
    • Other (better) noise reduction techniques


Conclusions

  • Elephant counting based on base frequency analysis is possible
  • The team delivered a ready-to-use software library that counts elephants with high accuracy (97% on selected cases)
  • The software can be used in the IoT Hub (Project 15) or on-premises
  • The application can be integrated into other software
  • A machine learning model (VGG or ResNet50) could be used to count the elephants instead of the rule-based boxing algorithm
  • Further research is needed to improve the results, for example by broadening to other species



Acknowledgements

  • Many thanks to all the people who helped with the project by providing insights, performing reviews, and participating in meetings:
    • Peter Wrege (Cornell University)
    • Liz Rowland (Cornell University)
    • Lee Stott (Microsoft)
    • Sarah Maston (Microsoft)
  • Thanks to the organizers of the "Artificial Intelligence – Cloud and Edge Implementations" course:
    • Ajit Jaokar (University of Oxford)
    • Peter Holland (University of Oxford)




Version history
Last update: Apr 12 2021 06:20 AM