Using Spark as a cornerstone for an Analytic Initiative
Setup
The team at www.collegefootballdata.com has an excellent API (application programming interface) that allows you to query different endpoints and extract data for analysis.
To get started, head over to https://collegefootballdata.com/key and get your own API key via email. We will be using this key in the subsequent code modules.
There are several endpoints that the team has published at https://api.collegefootballdata.com/api/docs/?url=/api-docs.json
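To give you a feel for the mechanics, here is a minimal sketch of calling one of those endpoints from a notebook. The Bearer-token header, the /records path, and the year parameter are my assumptions based on the published docs, so verify them there before relying on this:

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"          # the key you received via email
BASE_URL = "https://api.collegefootballdata.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}    # assumed: the API expects a Bearer token

# Example: pull one season of win/loss records into a Pandas data frame
resp = requests.get(f"{BASE_URL}/records", params={"year": 2023}, headers=HEADERS)
resp.raise_for_status()
records_df = pd.DataFrame(resp.json())
print(records_df.head())
```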
As part of this work, I want to take historical data on a ratio (Win Ratio: total wins divided by total games in a season) and predict this ratio for the upcoming season. We will use multiple endpoints to get relevant data (feel free to add/remove data!) and see what sort of model we can build.
Breakdown of code modules
Setup components
- Note down the API key you generated from www.collegefootballdata.com because we will be using it pretty extensively
- (Optional): Navigate to your Microsoft Fabric tenant and create a new workspace with Fabric capacity enabled.
- If you already have a Fabric workspace and want to reuse it, you can create a new Lakehouse to keep all your artifacts separate from your other work, or you can use an existing Lakehouse
- Navigate to the Data Science section and start a new Notebook
First Code Module: Win-Loss Records
Code can be found HERE
The code loops through the Win/Loss Records endpoint for the previous 20 years and creates a Pandas/Spark data frame, which you can then insert into your Lakehouse as a file or as a table.
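If you want a feel for the shape of that module before opening the link, the sketch below shows the pattern: loop over the seasons, stack the responses in Pandas, then land a Spark table in the Lakehouse. The /records endpoint, the field names, and the table name are assumptions on my part; the linked code is the source of truth.

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"
BASE_URL = "https://api.collegefootballdata.com"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Loop over the previous 20 seasons and stack the win/loss records
frames = []
for year in range(2004, 2024):
    resp = requests.get(f"{BASE_URL}/records", params={"year": year}, headers=HEADERS)
    resp.raise_for_status()
    frames.append(pd.DataFrame(resp.json()))
records_pdf = pd.concat(frames, ignore_index=True)

# Assumed: season totals arrive as a nested "total" object; keep only the columns we need
totals = pd.json_normalize(records_pdf["total"]).add_prefix("total_")
flat_pdf = pd.concat([records_pdf[["year", "team", "conference"]], totals], axis=1)

# "spark" is the session pre-defined in a Fabric notebook; write to the attached Lakehouse
spark.createDataFrame(flat_pdf).write.mode("overwrite").saveAsTable("win_loss_records")
```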
Second Code Module: Talent Rankings
Code can be found HERE
An important aspect of any team is how much talent the team has. Ranking agencies release a composite score of each team, and this could be a nice predictor of on-field performance.
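A rough sketch of that pull, assuming the /talent endpoint and a year/school/talent response shape (check the docs for the exact fields):

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

frames = []
for year in range(2004, 2024):
    resp = requests.get("https://api.collegefootballdata.com/talent",
                        params={"year": year}, headers=HEADERS)
    resp.raise_for_status()
    frames.append(pd.DataFrame(resp.json()))
talent_pdf = pd.concat(frames, ignore_index=True)

# Cast the composite score to numeric in case it arrives as a string
talent_pdf["talent"] = pd.to_numeric(talent_pdf["talent"], errors="coerce")
spark.createDataFrame(talent_pdf).write.mode("overwrite").saveAsTable("talent_rankings")
```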
Third Code Module: Recruiting Rankings
Code can be found HERE
Another important aspect of any team is how well the college is doing from a recruiting perspective. In theory, the higher the recruiting rank, the better the talent on the field, leading to superior performance. There is no objective way to score the coaching 😊, but we will see whether recruiting has a place in the final model.
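Again as a hedged sketch, assuming the /recruiting/teams endpoint and its documented columns:

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

frames = []
for year in range(2004, 2024):
    resp = requests.get("https://api.collegefootballdata.com/recruiting/teams",
                        params={"year": year}, headers=HEADERS)
    resp.raise_for_status()
    frames.append(pd.DataFrame(resp.json()))
recruiting_pdf = pd.concat(frames, ignore_index=True)

# Expecting columns like year, rank, team and points -- verify against a sample response
spark.createDataFrame(recruiting_pdf).write.mode("overwrite").saveAsTable("recruiting_rankings")
```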
Fourth Code Module: Season Stats
Code can be found HERE
This is an interesting code block because it gives you the key, high-level statistics for a team at the end of the season.
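In my reading of the docs, the /stats/season endpoint returns one row per team/stat pair, so the sketch below pivots the long response into one row per team-season before saving it; the endpoint and field names are assumptions:

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

frames = []
for year in range(2004, 2024):
    resp = requests.get("https://api.collegefootballdata.com/stats/season",
                        params={"year": year}, headers=HEADERS)
    resp.raise_for_status()
    frames.append(pd.DataFrame(resp.json()))
stats_long = pd.concat(frames, ignore_index=True)

# One row per team/stat pair -> pivot so each team-season becomes a single wide row
stats_wide = (stats_long
              .pivot_table(index=["season", "team"], columns="statName", values="statValue")
              .reset_index())
spark.createDataFrame(stats_wide).write.mode("overwrite").saveAsTable("season_stats")
```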
Fifth Code Module: Advanced Season Stats
Code can be found HERE
Some advanced statistics are also available, and it is worth bringing this data into the data engineering piece.
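The advanced stats come back with nested offense and defense objects (again, my assumption from the docs for the /stats/season/advanced endpoint), so a small json_normalize step is handy before writing the table:

```python
import requests
import pandas as pd

API_KEY = "<your-collegefootballdata-key>"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

frames = []
for year in range(2004, 2024):
    resp = requests.get("https://api.collegefootballdata.com/stats/season/advanced",
                        params={"year": year}, headers=HEADERS)
    resp.raise_for_status()
    # Offense/defense metrics are nested objects, so flatten them into prefixed columns
    frames.append(pd.json_normalize(resp.json(), sep="_"))
advanced_pdf = pd.concat(frames, ignore_index=True)

spark.createDataFrame(advanced_pdf).write.mode("overwrite").saveAsTable("advanced_season_stats")
```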
Now that we have all the data as separate files or tables in our Lakehouse, we are ready to move on to the next phase, where we predict the Win Ratio.
Machine Learning Model
First Step: Feature Engineering
We first want to create an integrated data set with properly defined columns so that we can build an ML (machine learning) model.
Code can be found HERE
There are some assumptions I have made, which can easily be modified:
- I am using 19 years' worth of data to predict the prior year, for model accuracy. This window can be changed on either side: shorten the period of the data (giving more importance to recent years) or use the current year as your prediction year. College football typically runs from September through December, so you could extend the code to run once a week, picking up new data and continually adjusting the Win Ratio prediction as each week's data gets added to the model data
- I am only keeping records where we have Talent scores; this is a big assumption and can be taken out of the model mix
- There are a lot more features that can be extracted from the Season Stats and Advanced Stats data frames
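To make this step concrete, here is a rough sketch of the join and target construction I have in mind, using the table and column names assumed in the earlier sketches (swap in your own). The inner join on the talent table implements the second assumption above:

```python
from pyspark.sql import functions as F

# Read the tables written by the earlier modules (names assumed from the sketches above)
records  = spark.table("win_loss_records")
talent   = spark.table("talent_rankings")
recruits = spark.table("recruiting_rankings")

# Target variable: Win Ratio = total wins / total games for each team-season
records = records.withColumn("win_ratio", F.col("total_wins") / F.col("total_games"))

# Inner joins keep only team-seasons that have a Talent score (the assumption called out above)
features = (records
            .join(talent, (records.team == talent.school) & (records.year == talent.year))
            .join(recruits, (records.team == recruits.team) & (records.year == recruits.year))
            .select(records.year, records.team, "talent",
                    recruits.points.alias("recruiting_points"), "win_ratio"))

features.write.mode("overwrite").saveAsTable("model_features")
```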
Second Step: ML Code
I am predicting the Win Ratio using a set of algorithms (this can be refined by adding or removing algorithms from the list I have) and the features we built in the Feature Engineering code. The evaluation metric I am using is R², but again this can be changed based on your mix of algorithm(s).
Entire code can be found HERE
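As an illustration of the pattern (not the exact linked code), the sketch below reads the engineered features back from the Lakehouse, fits a small, arbitrary set of scikit-learn regressors, and compares them on R². The table and feature names come from the feature-engineering sketch above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score

# Pull the engineered features back out of the Lakehouse into Pandas
pdf = spark.table("model_features").toPandas()
X = pdf[["talent", "recruiting_points"]]   # add more engineered features here
y = pdf["win_ratio"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An example mix of algorithms -- add or remove entries to suit your experiment
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = r2_score(y_test, model.predict(X_test))

# Higher R2 is better; the winning model's predictions can then be written to the Lakehouse
print(pd.Series(results).sort_values(ascending=False))
```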
The code should write the output to your Lakehouse, and it will also display the output in the notebook
A final optional step would be to plot the feature importance
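For tree-based models this is a short matplotlib chart over feature_importances_. The sketch assumes the models dictionary and the feature matrix X from the previous sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ exists on tree-based models such as the Random Forest fitted above
fitted = models["RandomForest"]
importances = pd.Series(fitted.feature_importances_, index=X.columns).sort_values()

importances.plot(kind="barh", title="Feature importance for Win Ratio prediction")
plt.xlabel("Relative importance")
plt.tight_layout()
plt.show()
```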
Power BI Dashboards
Once you have all the data in, you can connect to the different Lakehouse tables via Power BI and build your own visualization layer.
A sample starter workbook can be found HERE
Use this template file to fill in your Fabric details
You can also adapt the code (minus the Lakehouse pieces) to run as a Spark notebook in native Azure Synapse.