
Using Spark as a cornerstone for an Analytic Initiative

ujvalgandhi, Microsoft
Oct 16, 2023

Set up 

The team at www.collegefootballdata.com has an excellent API (application programming interface) that allows you to query different endpoints and extract data for analysis.
To get started, head over to https://collegefootballdata.com/key and get your own API key via email. We will be using this key in the subsequent code modules.  

There are several endpoints that the team has published at https://api.collegefootballdata.com/api/docs/?url=/api-docs.json 
 

As part of this work, I want to take historical data on a ratio (Win Ratio: total wins divided by total games in a season) and predict this ratio for the upcoming season. We are going to use multiple endpoints to get relevant data (feel free to add or remove data!) and see what sort of model we can build.

Breakdown of code modules

Set up components 

  1. Note down the API key you generated from www.collegefootballdata.com because we will be using it extensively.
  2. (Optional) Navigate to your Microsoft Fabric tenant and create a new workspace with Fabric capacity enabled.
  3. If you already have a Fabric workspace and want to reuse it, you can create a new Lakehouse to keep these artifacts separate from your other work, or you can use an existing Lakehouse.
  4. Navigate to the Data Science section and start a new Notebook.

 

 

First Code Module: Win-Loss Records 

Code can be found HERE 
The code loops through the Win/Loss Records endpoint for the previous 20 years and creates a Pandas/Spark data frame, which you can then insert into your Lakehouse as a file or as a table.
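
Here is a minimal sketch of what that loop could look like. The /records endpoint path and the Bearer-token header follow the CFBD API docs; the fetch_endpoint helper, the exact year range, and the table name are my own illustrative choices, not the author's code.

import requests
import pandas as pd

CFBD_API_KEY = "<your-api-key>"  # the key you received by email
BASE_URL = "https://api.collegefootballdata.com"
HEADERS = {"Authorization": f"Bearer {CFBD_API_KEY}"}

def fetch_endpoint(path, **params):
    """Call a CFBD endpoint and flatten the JSON payload into a pandas DataFrame."""
    response = requests.get(f"{BASE_URL}{path}", headers=HEADERS, params=params)
    response.raise_for_status()
    df = pd.json_normalize(response.json())
    # Delta tables disallow dots in column names, so replace them with underscores
    df.columns = [c.replace(".", "_") for c in df.columns]
    return df

# Loop over the previous 20 seasons and stack the yearly win/loss records
records = pd.concat(
    (fetch_endpoint("/records", year=year) for year in range(2003, 2023)),
    ignore_index=True,
)

# "spark" is the session pre-defined in a Fabric notebook; persist as a Lakehouse table
spark.createDataFrame(records).write.mode("overwrite").saveAsTable("win_loss_records")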

 

Second Code Module: Talent Rankings 

Code can be found HERE 

An important aspect of any team is how much talent it has. Ranking agencies release a composite score for each team, and this could be a nice predictor of on-field performance.
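
Reusing the fetch_endpoint helper from the first module, the talent pull could look like the sketch below (the /talent endpoint is documented in the CFBD docs; note that composite talent scores only go back so far, so requests for earlier seasons may come back empty).

# Composite talent scores per team and season (may be empty for older years)
talent = pd.concat(
    (fetch_endpoint("/talent", year=year) for year in range(2003, 2023)),
    ignore_index=True,
)
spark.createDataFrame(talent).write.mode("overwrite").saveAsTable("talent_rankings")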


Third Code Module: Recruiting Rankings
 

Code can be found HERE 

Another important aspect of any team is how well the college is doing from a recruiting perspective. In theory, the higher the recruiting rank, the better the talent on the field, leading to superior performance. There is no objective way to score the coaching 😊 but we will try to see if recruiting has a place in the final model.
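
Along the same lines, here is a sketch of the recruiting pull, assuming the /recruiting/teams endpoint from the CFBD docs and the helper defined in the first module.

# Team recruiting ranks and points per season
recruiting = pd.concat(
    (fetch_endpoint("/recruiting/teams", year=year) for year in range(2003, 2023)),
    ignore_index=True,
)
spark.createDataFrame(recruiting).write.mode("overwrite").saveAsTable("recruiting_rankings")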


 

Fourth Code Module: Season Stats 

Code can be found HERE 

This is an interesting code block because it gives you the high-level, key statistics for a team at the end of the season.
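
A sketch of this pull, assuming the /stats/season endpoint from the CFBD docs and the same helper as above:

# Season-level team statistics
season_stats = pd.concat(
    (fetch_endpoint("/stats/season", year=year) for year in range(2003, 2023)),
    ignore_index=True,
)
spark.createDataFrame(season_stats).write.mode("overwrite").saveAsTable("season_stats")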

 

Fifth Code Module: Advanced Season Stats 

Code can be found HERE 

There are also some advanced statistics available, and it is worth bringing this data into the data engineering piece.
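
The same pattern works here too, assuming the /stats/season/advanced endpoint from the CFBD docs.

# Advanced season metrics (the payload is nested; fetch_endpoint flattens it)
advanced_stats = pd.concat(
    (fetch_endpoint("/stats/season/advanced", year=year) for year in range(2003, 2023)),
    ignore_index=True,
)
spark.createDataFrame(advanced_stats).write.mode("overwrite").saveAsTable("advanced_season_stats")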

 

Now that we have all the data as separate files or tables in our Lakehouse, we are ready to move on to the next phase, where we predict the Win Ratio.

 

Machine Learning Model 
First Step: Feature Engineering 

We first want to create an integrated data set with properly defined columns so that we can build an ML (machine learning) model.
Code can be found HERE 
 
There are some assumptions I have made which can be easily modified (see the sketch after this list):

  • I am using 19 years' worth of data to predict the prior year for model accuracy. This can be changed on either side: reduce the period of the data (giving more importance to recent years), or use the current year as your prediction year. College football typically runs from September through December, so you could extend the code to run once a week, picking up new data and constantly adjusting the Win Ratio prediction as current weekly data gets added to the model data.
  • I am only keeping records where we have talent scores. This is a big assumption and can be taken out of the model mix.
  • There are a lot more features that can be extracted from the Season Stats and Advanced Stats data frames.
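
To make the shape of the integrated data set concrete, here is a minimal sketch of the join, reusing the table names from the extraction sketches above. The column names (total_wins, total_games, school, rank, points) and join keys are illustrative assumptions based on the endpoint payloads; the actual code may name things differently.

from pyspark.sql import functions as F

records = spark.read.table("win_loss_records")
talent = spark.read.table("talent_rankings").withColumnRenamed("school", "team")
recruiting = spark.read.table("recruiting_rankings")

features = (
    records
    .withColumn("win_ratio", F.col("total_wins") / F.col("total_games"))
    # inner join keeps only records where we have talent scores (the assumption above)
    .join(talent, on=["year", "team"], how="inner")
    .join(recruiting, on=["year", "team"], how="left")
    .select("year", "team",
            F.col("talent").cast("double").alias("talent"),
            "rank", "points", "win_ratio")
)
features.write.mode("overwrite").saveAsTable("model_features")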

 
Second Step: ML Code 

I am predicting the Win Ratio using a set of algorithms (this can be refined to add or remove algorithms from the list I have) and the features we have from the Feature Engineering code. The evaluation metric I am using is R2, but again this can be changed based on your mix of algorithm(s).
Entire code can be found HERE 
The code should write the output to your Lakehouse but will also generate an output in the notebook.
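
A minimal sketch of that comparison loop, assuming the model_features table from the previous step; the algorithm list, feature names, and train/test split are illustrative and easy to swap.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score

df = spark.read.table("model_features").toPandas()
feature_cols = ["talent", "rank", "points"]  # illustrative feature names
df = df.dropna(subset=feature_cols + ["win_ratio"])

# Hold out the most recent season for evaluation; train on the earlier years
train = df[df["year"] < df["year"].max()]
test = df[df["year"] == df["year"].max()]

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    model.fit(train[feature_cols], train["win_ratio"])
    preds = model.predict(test[feature_cols])
    print(f"{name}: R2 = {r2_score(test['win_ratio'], preds):.3f}")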

A final optional step would be to plot the feature importances.
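
For the tree-based models, that could be as simple as the snippet below (again a sketch, reusing the names from the block above).

import matplotlib.pyplot as plt

importances = models["random_forest"].feature_importances_
plt.barh(feature_cols, importances)
plt.xlabel("Importance")
plt.title("Win Ratio feature importances")
plt.show()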

 

Power BI Dashboards

Once you have all the data in place, you can connect to the different Lakehouse tables via Power BI and build your own visualization layer.
A sample starter workbook can be found HERE 

Use this template file to fill in your Fabric details.

You can also adapt the code (minus the Lakehouse pieces) to run as a Spark notebook in native Azure Synapse.

 
