Guest blog by Amy Boyd, Cloud Advocate
In the UK, Love Island (https://www.itv.com/loveisland) has been a huge TV success. Over the last 4 series (2016–2019), hundreds of thousands have been watching, and Amy and her data science colleagues wanted to see what interesting insights could be found.
In this article, the team wants to share the Data Science approach and technologies used to create the findings in the article Are you Love Island’s “Type on Paper”.
The key steps to Data Science in the real world are:
1. Gathering the Data
2. Using Azure Notebooks to Explore the Data
3. Which Python Packages supported Data Analysis
4. How Power BI brought the Data to Life
5. Enhancing the Visuals with Azure Storage Support
First things first, I needed to ‘Crack on’ and gather some data relating to the Love Island UK series on ITV2. One option was to re-watch every episode on ITV Hub, but instead we found that Wikipedia was a great source of data about some of the major events that happened across every series of Love Island.
Each series has a Wikipedia page containing information about the contestants, show details, and the couplings and events such as dates, challenges and exits. We took this data and collated all public information into an Excel file, starting with a tab for each series and adding information as we found it.
I have created this collated dataset from public information and my own viewing of YouTube, news articles, Wikipedia and TV.
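As a rough sketch of how a workbook with one tab per series could be flattened into a single table, the snippet below stacks per-series tables and records which series each row came from. The contestant names and ages are invented placeholders, and the `sheets` dictionary stands in for what `pd.read_excel(path, sheet_name=None)` would return for a real file. This is an illustration, not necessarily how the published dataset was built.

```python
import pandas as pd

# Hypothetical stand-in for the workbook: one small table per series tab.
# For a real file, pd.read_excel(path, sheet_name=None) returns a dict like this.
sheets = {
    "Series 1": pd.DataFrame({"Name": ["Contestant A", "Contestant B"], "Age": [24, 27]}),
    "Series 2": pd.DataFrame({"Name": ["Contestant C"], "Age": [22]}),
}

# Tag every row with the tab it came from, then stack the tabs into one table.
frames = [df.assign(Series=name) for name, df in sheets.items()]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

From here, `combined.to_csv(...)` with `index=False` would write the single CSV file that the later steps read.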
What could you find in the data? Let us know and contribute to our GitHub repository.
Next, I wanted to explore what shape, format and contents the data has. I decided to use Microsoft Azure Notebooks, cloud notebooks powered by Jupyter and accessed through a web browser. Jupyter is a well-used tool for data science and is suited for Python usage. The REPL (read–eval–print loop) setup let me explore the data effectively and find interesting correlations.
I read the data from the CSV files into a Pandas data frame structure, so it was easy to start accessing rows and columns of the matrix given certain values contained in the data.
For example, accessing all the winners in the dataset:
winners = data[data['OUTCOME'].values == 'WINNER']
And building graphs to visually explore similarities and correlations in the data:
import matplotlib.pyplot as plt
areaofuk = data['Area of UK'].value_counts()
destination = data['From'].value_counts()
plt.pie(areaofuk, labels=areaofuk.index)
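To make the two operations above concrete, here is a minimal, self-contained sketch on a tiny made-up table. The column names `OUTCOME` and `Area of UK` come from the snippets above. The rows, and outcome values such as `RUNNER UP` and `DUMPED`, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real dataset has many more columns.
data = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "OUTCOME": ["WINNER", "RUNNER UP", "WINNER", "DUMPED"],
    "Area of UK": ["North", "South", "South", "South"],
})

# Boolean mask: True where the row's OUTCOME equals 'WINNER'.
is_winner = data["OUTCOME"] == "WINNER"
winners = data[is_winner]

# value_counts tallies each distinct value, most frequent first.
areaofuk = data["Area of UK"].value_counts()

print(winners["Name"].tolist())
print(areaofuk.to_dict())
```

The counts returned by `value_counts()` are indexed by the category label, which is why `areaofuk.index` can be passed straight to `plt.pie` as the slice labels.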
If you want to ‘Get to know’ the code a bit better, check out the full notebook here: https://github.com/amynic/love-island-project/tree/master/code
You can create a new project in Azure Notebooks easily by cloning a GitHub repository, so give this one a try and see what you can find within the historical dataset.
As this dataset has hundreds rather than millions of rows, I was able to use the Azure Notebooks Free Compute offering to run my Jupyter instance in the cloud. However, if I need more powerful compute in the future, for example GPU compute, I can create a Data Science Virtual Machine in the Azure Cloud and point my Jupyter instance towards it.
Whilst using Azure Notebooks, I was able to import all the useful data analysis tools and packages I would need to analyse the data.
The main Python packages to mention are Pandas and Matplotlib. These packages made loading in datasets, manipulating them and visualising them simpler.
For example, loading a simple CSV file into a Pandas Data frame structure for manipulation is an import statement and one line of code:
import pandas as pd
data = pd.read_csv('love-island-historical-dataset.csv')
The data frame is a tabular structure with labelled rows and columns. You can access columns/rows using the labels and apply operations to these dimensions. Within Python this is like manipulating a dictionary object. A data frame is a well-used data structure.
For more information on manipulating data frames, see this article by Analytics Vidhya (25/06/2019), which lists some good techniques people use to manipulate data.
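The dictionary comparison can be sketched as follows. The column names `Age` and `Number of Couples` match those used elsewhere in this post, while the values, the `Contestant A/B/C` row labels and the derived `Age Next Year` column are invented for illustration.

```python
import pandas as pd

# Hypothetical values for illustration; row labels are placeholders.
data = pd.DataFrame(
    {"Age": [24, 27, 22], "Number of Couples": [2, 1, 3]},
    index=["Contestant A", "Contestant B", "Contestant C"],
)

ages = data["Age"]               # column lookup by label, like a dict key
row = data.loc["Contestant B"]   # row lookup by label

# Operations apply to a whole column at once, and assigning to a new
# label adds a column, much like adding a key to a dictionary.
data["Age Next Year"] = data["Age"] + 1

print("Age" in data)       # membership test checks column labels
print(list(data.keys()))   # keys() returns the column labels
```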
The Matplotlib package is very useful. This package is accessible and well documented, with many examples produced by the community and shared online.
I created Python plots within my Azure Notebook experience and was able to view the distribution of columns in the dataset as well as explore possible correlations between columns (some showing positive correlation and other hypotheses not showing as strong a correlation as I may have expected).
import matplotlib.pyplot as plt
age = data['Age']
numcouples = data['Number of Couples']
plt.figure(figsize=(10,10))
plt.scatter(age, numcouples)
plt.xlabel("Age")
plt.ylabel("Number of couples")
plt.title('A graph to show how age relates to Number of couples across the show')
plt.show()
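A scatter plot shows a relationship visually. One way to put a number on it is the Pearson correlation coefficient, which pandas exposes as `Series.corr`. The sketch below uses made-up numbers purely to show the call; they are not results from the show and say nothing about what the real data contains.

```python
import pandas as pd

# Invented values for illustration only, not taken from the Love Island dataset.
data = pd.DataFrame({
    "Age": [21, 23, 25, 27, 29],
    "Number of Couples": [4, 3, 3, 2, 1],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, and values near 0 suggest no linear relationship.
r = data["Age"].corr(data["Number of Couples"])

print(round(r, 2))
```

A coefficient close to zero on the real data would be one sign that a hypothesis is not supported as strongly as expected.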
Data exploration is a key first step to understand your dataset and be able to analyse and build upon your hypothesis.
After exploring the data using Python, I switched to Microsoft Power BI to create reports and dashboards that could be shared with the project team.
Creating reports and dashboards using data visualisation/business intelligence services allows you to quickly create stories of your data. I was able to share my findings with my team (both technical and non-technical) and allow them to explore the data for themselves by selecting graphs to filter them and asking natural language questions of the dataset behind the visuals.
Handy Tip -> Who uses what tool?
• [End users, Technical and Non-Technical] Power BI Service, access via a web browser. View and explore reports
• [Report creator, Technical] Power BI Desktop tool, Build data models and reports/visuals that tell stories. Download for Free
You want every data visualisation you create to match the theme or style of the project, brand or product you’re working on. You can do this in Power BI by creating a Power BI Theme file. This is a simple JSON file, which I created using Visual Studio Code. Into the JSON I added a list of colour hex codes that relate to the bright summer colours in Love Island images.
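A minimal theme file can be sketched as a JSON object with a `name` and a `dataColors` list of hex codes. The Python snippet below builds one. The theme name and the four hex codes are placeholders, not the palette used in this project.

```python
import json

# Hypothetical palette: these hex codes are placeholders,
# not the colours used in the Love Island reports.
theme = {
    "name": "Love Island Summer",
    "dataColors": ["#FF4F81", "#FFC845", "#00B2A9", "#5BC2E7"],
}

# Serialise to the JSON text that would be saved as the theme file.
theme_json = json.dumps(theme, indent=2)
print(theme_json)
```

Saving this text to a `.json` file gives something that can be loaded through the theme options in Power BI Desktop, after which the colours in `dataColors` are used in order for chart series.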
At the foundation of all data science projects is a place to store the data. Datasets can be too large to process on your own machine and, like us, you may be using many different services to access the data, so choosing cloud data storage was a good next step for our project.
We accessed both the datasets and the accompanying images from an Azure Blob Storage account. Azure Blob Storage is great for storing unstructured data objects which you need access to. You can set up a storage account using the Azure Portal UI (user interface) or via the command line.
Beyond just storing the data, many Microsoft and other services have an Azure Blob Storage connector, meaning you can access your data within those tools or via a REST request (API). I was able to access contestant images to enhance my data visualisation reports within Power BI.
First, I created a new column in the dataset. I pointed the column calculation at the Azure Blob Storage account and appended the contestant name from another column in the dataset. Finally, I set the data type in Power BI to Image URL, so the service knows to render the image.
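The column calculation itself lives in Power BI, but the string it builds can be illustrated in Python. The sketch below assumes the usual blob URL layout of `https://<account>.blob.core.windows.net/<container>/<blob>`. The account name, container name and `.jpg` extension are all hypothetical, and the images are assumed to be readable at that URL.

```python
from urllib.parse import quote

# Hypothetical account and container names; substitute your own.
ACCOUNT = "examplestorageaccount"
CONTAINER = "contestant-images"


def image_url(contestant_name: str) -> str:
    """Build the blob URL for a contestant's image from their name."""
    # Percent-encode characters such as spaces so the URL stays valid.
    blob_name = quote(f"{contestant_name}.jpg")  # file extension is an assumption
    return f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{blob_name}"


print(image_url("Contestant A"))
```

In a pandas data frame the same function could fill a whole column with `data["Name"].map(image_url)`, assuming a `Name` column holds the contestant names.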
Join in the conversation. We have shared resources within a GitHub repository here: https://github.com/amynic/love-island-project and welcome your contributions, input and feedback.