Using Azure Machine Learning in winning the Microsoft & Oxford's Tale of Two Cities Hackathon
Published Aug 02 2022 03:52 PM

Hackathon Challenge:

Microsoft and Oxford University partnered for A Tale of Two Cities: Data & AI Hackathon. The challenge was for participants to use a provided dataset covering New York City and London to develop analytical insights and solutions that uncover behavioral patterns before, during, and after the pandemic. In addition, we had to use data & AI tools such as Azure Data services, Azure Machine Learning service, or Observable HQ.


Why I Chose Public Bike Share:

I grew up in Brooklyn, NY and have been a Citi Biker for as long as I can remember. I used to wake up every morning for high school and hope there was a Citi Bike available at the station at the bottom of my block so that I wouldn't be late for class. Now, almost 5 years later, I still use Citi Bike to get to class, only this time I ride from my apartment to Washington Square Park. The pandemic has changed almost everything, but I was curious how it affected the public bike sharing system. I also attended this hackathon a few days after getting back from a family vacation to London, where I was able to ride the Santander Cycles bike rentals!


My Data:

To explore this topic, we were given a dataset containing daily covid data for both NYC and London. I also searched for public data and found data on the Santander Cycles: their daily use for the last 10 years as well as a month-by-month average ride duration. The Citi Bike data, on the other hand, wasn't as pretty. Although it contained a lot more information (station name, latitude/longitude, etc.), the only consistent data I could find was data for May of 2020 - 2022.


Data Links:

London Bike Data

Citi Bike Data

Covid Data


Azure Usage:

I used the Microsoft Azure Machine Learning Studio to help me with this project. It gave me access to more compute power than I have on my personal laptop, which made my preprocessing steps (mainly feature engineering) on CSVs with millions of rows much faster. It also gave me a place to run Jupyter notebook servers, which are my preferred way of coding and doing my EDA!


After doing some data cleaning and feature engineering, I started exploring the relationship between the pandemic and bike share for both cities.




The first thing I did was create a quick helper function that computes the rolling mean of any given column to make smoother graphs. I then applied this function to my covid dataframe to get rolling means for several columns.









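def get_rolling_mean(col, window=7):
    # Trailing moving average; the first window-1 values come out as NaN
    return col.rolling(window=window).mean()

# Smooth the daily case and hospitalization counts for both cities
cols = ['nyc_cases', 'nyc_hospital', 'london_cases', 'london_hospital']
for col in cols:
    covid_df[f"rolling_{col}"] = get_rolling_mean(covid_df[col])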
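To make the effect of the window concrete, here is a minimal sketch on a made-up series (the numbers are illustrative, not from the hackathon data). It also shows `min_periods`, a standard pandas argument that the helper above does not use, which fills the start of the series with partial-window means instead of NaN.

```python
import pandas as pd

# Toy daily counts (made up for illustration)
daily = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Same call as the helper: a trailing 7-day mean
smooth = daily.rolling(window=7).mean()

# Variant: allow partial windows at the start instead of NaN
smooth_partial = daily.rolling(window=7, min_periods=1).mean()

print(smooth.tolist())
print(smooth_partial.tolist())
```

With a 7-day window, the first six smoothed values are missing unless partial windows are allowed, which is worth remembering when a smoothed line appears to start later than the raw data.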










Looking at London's bike rentals compared to covid cases, a few interesting things pop out.

(Apologies for not labeling the axes on some of these graphs; they are from my EDA Jupyter notebook.)

Screen Shot 2022-07-29 at 5.18.01 PM.png

Initially I only plotted from the start of covid. On that graph, it appears that every time covid spikes there is a dramatic decrease in bike rentals. Looking at the overall graph, however, there is clearly a confounding variable: time of year. Covid happened to spike right around the new year, which was also the time of year when bike rentals dropped. On the most recent covid spike (2022), the dip in bike rentals was actually one of the highest low points, which suggests covid was not causing the dip. Although there were slightly more bike rentals during the pandemic in London, the change does not appear to be significant.
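One way to separate a seasonal dip from a covid effect is to compare each month with the same calendar month in a pre-pandemic baseline year. The sketch below uses made-up monthly totals and hypothetical column names (`date`, `hires`); it is an illustration of the idea, not the notebook's code.

```python
import pandas as pd

# Made-up monthly hire totals; column names are hypothetical
hires = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-07-01",
                            "2021-01-01", "2021-07-01"]),
    "hires": [600, 1000, 630, 1100],
})
hires["year"] = hires["date"].dt.year
hires["month"] = hires["date"].dt.month

# One row per calendar month, one column per year
by_month = hires.pivot(index="month", columns="year", values="hires")

# Percent change against the same month of the baseline year (2019)
pct_change = (by_month[2021] / by_month[2019] - 1) * 100
print(pct_change.round(1).to_dict())
```

In this toy data, January is lower than July in both years, yet each month is up on its 2019 counterpart, which is the kind of pattern a raw time series plot can hide.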


However, ride length (in minutes) tells a very different story.




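# Assumes a pylab-style namespace (e.g. from matplotlib.pyplot import *) and pandas as pd
# Each series is rescaled so hires, ride length, and cases can share one y axis
fig = figure(figsize=(20,12))
plot(bike_dates, scale_col(bike_hire_means), label="hires", linewidth=1.3, linestyle=':')
plot(boris_ride_timecut['date'], scale_col(boris_ride_timecut['avg_ride_length']), label="ride_length")
plot(covid_df['date'], scale_col(covid_df['rolling_london_cases']), label="covid_cases", linewidth=2, linestyle='--')

# Mark the day the first London lockdown was announced
axvline(pd.to_datetime('2020-03-23'), color='red', label="First London Lockdown")

title("London Covid and Bike Data")
ylabel("scaled features")
legend()  # display the labels set above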
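The `scale_col` helper is not shown in the post. A minimal sketch, assuming it is a min-max scaler that maps each series to the range [0, 1] so that hires, ride length, and case counts can share one axis (a z-score would serve the same purpose):

```python
import pandas as pd

def scale_col(col):
    """Min-max scale a numeric series to [0, 1] (assumed behavior)."""
    col = pd.Series(col, dtype="float64")
    span = col.max() - col.min()
    if span == 0:
        # A constant series has no spread; map it to zeros
        return col * 0.0
    return (col - col.min()) / span

scaled = scale_col([5, 10, 15, 25])
print(scaled.tolist())
```

Scaling this way keeps the shape of each curve but discards its units, so the plot can show when the series move together, not by how much.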




Screen Shot 2022-07-29 at 5.22.47 PM.png

The red line marks the day that London announced the first lockdown. As you can see, although the number of rides does not appear to jump up, the ride length increases drastically. This suggests that people were starting to use the bikes for different reasons: they wanted to get outside and exercise rather than just use them as a means of transportation.



(For NYC, the best data I could get while avoiding confounding variables was May of each year.)


The first graph here is the bike rentals per day - citi_dfs is a list of dataframes where each dataframe is the data for one year:










# One subplot per year (May 2017 through May 2022), sharing the y axis
fig, axs = subplots(6,1, figsize=(16,10), sharey=True)
colors = plt.rcParams["axes.prop_cycle"]()
# Count the rides on each day of May for every yearly dataframe
rides_by_day = np.array([df.groupby('day')['year'].count().values for df in citi_dfs])
for ii, year in enumerate(rides_by_day):
    c = next(colors)['color']
    axs[ii].plot(year, color=c, label=str(2017+ii))
    axs[ii].legend(loc='upper right')
axs[0].set_title("Citibike Rides Per Day (May)")






(Y axis: number of rentals, X axis: day of the month)

Screen Shot 2022-07-29 at 5.24.56 PM.png

As you can see here, bike rentals actually increased significantly during covid and NYC was able to maintain that growth.




# Column 0 of data holds daily bike rentals, column 1 the matching covid cases
scatter(data[:,0], data[:, 1])
ylabel("Covid Cases")
xlabel("Bike Rentals")

title("Bike Rentals by Covid Cases in May for 6 years")




Screen Shot 2022-07-29 at 5.27.16 PM.png

Looking at the correlation between bike rentals and covid cases for 2017 - 2022 (with covid cases manually set to 0 for 2017-2019), there was a 0.409 correlation. Although this can be partially attributed to growth over time, it gives us a number to compare with Santander Cycles later on.


Another interesting thing I found: since Citi Bike gives data on the station a bike was taken from and the station where it was dropped off, I was able to feature engineer what I called "0 Distance Rides", or rides for fun!




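import geopy.distance
# (lat, lng) pairs for the start and end station of every ride
start_coords = citi_df[['start_lat', 'start_lng']].values
end_coords = citi_df[['end_lat', 'end_lng']].values
# Geodesic distance in miles between the start and end station
citi_df['distance_travelled'] = [geopy.distance.geodesic(sc, ec).mi for sc, ec in zip(start_coords, end_coords)]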
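With `distance_travelled` in place, the "0 distance" rides can be flagged and summarized per year. A minimal sketch on made-up rows, assuming a ride counts as 0 distance when it ends at the station it started from (distance exactly 0) and that `year` and `ride_length` columns are available; the notebook's actual filter is not shown in the post.

```python
import pandas as pd

# Made-up rides; distances in miles, ride_length in minutes
rides = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2020],
    "distance_travelled": [0.0, 1.2, 0.0, 0.0, 2.5],
    "ride_length": [12.0, 9.0, 25.0, 35.0, 14.0],
})

# A ride that returns to its starting station has zero distance
rides["zero_distance"] = rides["distance_travelled"] == 0

zero_rides = rides[rides["zero_distance"]]
count_by_year = zero_rides.groupby("year").size()
mean_length_by_year = zero_rides.groupby("year")["ride_length"].mean()

print(count_by_year.to_dict())
print(mean_length_by_year.to_dict())
```

The per-year count and the per-year mean ride length are the two quantities that the charts of 0 distance rides summarize.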






We can visualize that as well:




title("Number of 0 distance rides")






Screen Shot 2022-07-29 at 5.29.12 PM.png





fig, ax = subplots(figsize=(12,8))
title("Ride Length for 0 distance Rides")





Screen Shot 2022-07-29 at 5.29.20 PM.png

Looking at these 2 graphs, it is clear that May 2020 was the month of for-fun bike rides, after which they fell back down to pre-covid levels. You can also see that people were taking their bikes out for an average of 8 minutes longer per 0 distance ride, with many more outliers (people riding for over 1.5 hours) in 2020.



One important preface to this section is that covid followed a very similar pattern in NYC and London throughout this time, which allows for general analysis without worrying too much about conditioning on the number of cases.




plot(covid_df['date'], covid_df['rolling_nyc_cases'], label="nyc")
plot(covid_df['date'], covid_df['rolling_london_cases'], label="london")




Screen Shot 2022-07-29 at 5.31.40 PM.png


Looking at the average ride length in both cities, it is clear that when the pandemic first hit, the public bike shares were a go-to for getting outside and taking longer bike rides. However, both cities seemed to return to baseline "post-pandemic".




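# Keep only the May rows of the monthly London ride-length data
boris_ride_may = boris_ride[boris_ride['date'].dt.month == 5]

# Mean ride length for each May in the NYC data
nyc_avg_mean = [df['ride_length'].mean() for df in citi_dfs]

plot(boris_ride_may['date'], boris_ride_may['avg_ride_length'], label="London")
# Skip the first six London dates so the x values line up with the NYC years
plot(boris_ride_may['date'][6:], nyc_avg_mean, label="NYC")

title("Average ride length by year")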




Screen Shot 2022-07-29 at 5.32.46 PM.png

Looking at the rides per day for both London and NYC (in May 2017-2022) we see our first big difference.




fig, axs = subplots(6,1, figsize=(16,20), sharey=True)
# Daily ride counts for each May in the NYC data
rides_by_day = np.array([df.groupby('day')['year'].count().values for df in citi_dfs])
for ii, year in enumerate(rides_by_day):
    axs[ii].plot(year, color="blue", label="nyc")
    # London hires for the same May
    axs[ii].plot(boris_bike_may[boris_bike_may.year == 2017+ii]['num_boris_bikes'].values, color="red", label="london")
    axs[ii].legend(loc='upper right')
axs[0].set_title("Bike Rides Per Day (May) London/NYC")




Screen Shot 2022-07-29 at 5.33.48 PM.png

Although both spiked in 2020, NYC was able to hold on to this growth, whereas London reverted to its old biking habits (very clear in the May 2022 graph).


Bringing back that graph from earlier, we can look at Bike Rentals by Covid Cases and compare the correlations. To get the data to check this correlation for London, I needed to subset both the covid dataframe and the London bike dataframe so that they cover exactly the same days as my Citi Bike data. I also stacked in a np.zeros((3,31)), imputing 0 covid cases for the pre-covid years (2017-2019). This results in a (2,186) matrix.










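# May covid cases for 2020-2022, one row of 31 days per year
# (assumes covid_df['date'] is a datetime column, as in the plots above)
london_may_covid_values = np.array([covid_df[(covid_df['date'].dt.month == 5) & (covid_df['date'].dt.year == ii)]['london_cases'].values for ii in range(2020, 2023, 1)])
# May bike hires for 2017-2022
boris_bike_may_values = np.array([boris_bike_may[boris_bike_may.year == ii]['num_boris_bikes'].values for ii in range(2017, 2023, 1)])

# Impute 0 covid cases for 2017-2019
london_may_covid_values = np.vstack([np.zeros((3,31)),london_may_covid_values])

# Row 0: covid cases, row 1: bike hires
london_bike_covid = np.vstack([london_may_covid_values.flatten(), boris_bike_may_values.flatten()])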
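The shape bookkeeping can be checked on placeholder arrays. This sketch uses random numbers in place of the real case and hire counts (the variable names are illustrative), so only the shapes are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders: 3 covid years (2020-2022) and 6 bike years (2017-2022), 31 days each
covid_may = rng.integers(0, 1000, size=(3, 31))
bikes_may = rng.integers(10000, 40000, size=(6, 31))

# Prepend three rows of zeros for 2017-2019
covid_may_padded = np.vstack([np.zeros((3, 31)), covid_may])

# Flatten year by year and stack: row 0 is covid, row 1 is bikes
stacked = np.vstack([covid_may_padded.flatten(), bikes_may.flatten()])

print(covid_may_padded.shape, stacked.shape)
```

Six Mays of 31 days give 186 columns, and the first 93 entries of the covid row are the imputed zeros.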











Screen Shot 2022-07-29 at 5.34.52 PM.png





np.corrcoef(london_bike_covid[1], london_bike_covid[0])[0,1], np.corrcoef(data[:,0], data[:, 1])[0,1]




The correlations were 0.127 and 0.409 for London and NYC respectively. This suggests that the pandemic led to a larger (percentage-wise) increase in bike rentals in NYC than in London. Personally, I think this is because NYC has less outdoor space to explore when you leave your apartment compared to London, which leads people to hop on a bike and roam around the city.
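For reference, `np.corrcoef` returns the full correlation matrix, and the `[0,1]` entry is the Pearson coefficient between the two inputs. A short sketch on made-up numbers, comparing it with the textbook formula:

```python
import numpy as np

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r: co-variation divided by the product of the spreads
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

A coefficient like this measures linear association only. With zeros imputed for 2017-2019, it also picks up any growth in ridership between the pre-covid and covid years, so it is best read as a rough comparison between the two cities.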




I came up with 3 final conclusions from this analysis, which I believe are supported by my data:


1. Both London and NYC used bikes as a way to get out of the house during the pandemic
2. Exercise/For fun bike rides have slowly gone back to normal levels post-pandemic in both cities
3. Biking popularity grew significantly more post covid in New York, whereas London barely increased its overall bike usage post covid



My winning prize is free access to an Oxford online course, "Digital Twins: Enhancing Model-based Design with Augmented Reality, Virtual Reality, and Mixed Real...", sponsored by AI Innovations Lab. I have done Unity game development and some 3D modeling before, so I am super excited to learn how to incorporate mixed reality with AI and add to my data science skill set.


About Me:

I'm super passionate about Data Science and more specifically ML/DL. I'm looking for internship opportunities for next spring and my junior summer, so please feel free to reach out about that, for more information on the project, or just to connect!



Last update: Aug 03 2022 06:56 AM