Introduction & Profiles
Hi there everyone! We are the Mech Eng Defectors, 1st place prize winners of Imperial College London Data Science Society’s (ICDSS) 2021 AI Hackathon for the ‘Boston housing challenge’. As you may have inferred from our unapologetic, slightly tongue-in-cheek name, we are two mechanical engineers from Imperial College London looking to transition from the classical engineering scene into the realm of data science. Our names are Kyriacos and Yousef, and our bios are found below, along with relevant links for the project:
We encourage you to check out the links if you find the blog interesting, as you can find more detail about the preprocessing methods that we used. This blog will mainly focus on the methodology at a very high level and the results.
Although our academic backgrounds are somewhat unorthodox in the world of data science, our core philosophies and beliefs as engineers are no different from those of many data scientists: to use our skills, knowledge and creativity to drive positive societal change. Therefore, armed with our limited machine learning knowledge, bucketloads of enthusiasm and much coffee on standby, we embarked on a 24-hour odyssey to not only compete in the UK’s largest student-held hackathon, but hopefully shed some light on a societal problem and how it might be rectified.
The Boston Housing Market Challenge:
As far as datasets go, the Boston Housing dataset is well known, as it has been used extensively in the past to benchmark new algorithms. Despite containing only 506 rows and 14 columns, this dataset should not be underestimated, as with careful and calculated manipulations it can be unlocked to provide some fascinating deeper insights, hence its popularity.
Despite this, most projects focus on predicting house prices from the other features; many even employ external datasets to help with this. We felt that doing so would be quite fun, but not as intriguing or challenging as trying to answer a question with implications for policy. We therefore spent the first 4-6 hours of the datathon scoping a research question.
After plotting simple distributions of the variables as well as a correlation matrix (we used the amazing library pandas-profiling), we noticed a large disparity in the NOx levels present across the Boston area, ranging from non-existent to a major health hazard. To add to the confusion, other factors such as crime, house prices and the proportion of residents from lower socioeconomic backgrounds also showed huge disparities. We began to ask how these could all be interlinked. Are higher NOx levels directly correlated with one or more other variables? What kinds of areas have the highest NOx levels? Are there any exceptions? After much deliberation, we hypothesised that NOx levels can be linked to the socioeconomic status of each town. Moreover, we wanted to discover which factors could be changed in these neighbourhoods to produce the greatest decline in NOx levels.
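The kind of first-pass exploration described above can be sketched in a few lines of plain pandas (toy stand-in data here, not the real dataset; pandas-profiling’s ProfileReport builds this same correlation view automatically, alongside per-variable distributions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few of the Boston columns we looked at
# (values are random, for illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "NOX": rng.uniform(0.38, 0.87, 100),   # nitric oxide concentration
    "CRIM": rng.exponential(3.0, 100),     # per-capita crime rate
    "MEDV": rng.uniform(5, 50, 100),       # median home value ($1000s)
    "LSTAT": rng.uniform(1, 38, 100),      # % lower-status population
})

# Pairwise Pearson correlations: the same matrix pandas-profiling renders.
corr = df.corr()
print(corr.round(2))
```

On the real dataset, this is the matrix where the NOx disparities and their links to crime and socioeconomic variables first stood out to us.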
With that preliminary research, we decided to answer the following question:
How will developing low-income neighbourhoods in Boston affect NOx levels?
Step 1: Verifying that our dataset is sufficient to answer the research question
Unsurprisingly, our first step was to verify that the Boston dataset is informative enough to answer our question. An important assumption we make is that the towns in Boston can be clustered into socioeconomic classes. As a first step towards defining these boundaries, we turned to the tried-and-tested k-means clustering algorithm (via sklearn), setting k = 3 (for low-, medium- and high-income classes). But how could we trust the output of the algorithm? As a benchmark, we found a report by Boston University on socioeconomic disparity in downtown Boston and tweaked our algorithm until the output visually matched one of the figures in the report. To us, this verified that the dataset is informative enough to cluster the towns into 3 distinct socioeconomic classes. We plotted our results using folium.
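A minimal sketch of this clustering step, using sklearn’s KMeans on synthetic stand-in features (the exact columns and preprocessing we used differ; this only illustrates k = 3 with standardised inputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two socioeconomic features, e.g. LSTAT and MEDV
# (three artificial groups, so the clusters are recoverable).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([30, 10], 3, (30, 2)),   # high LSTAT, low MEDV
    rng.normal([15, 25], 3, (30, 2)),   # middle
    rng.normal([5, 45], 3, (30, 2)),    # low LSTAT, high MEDV
])

# Standardise, then cluster into k = 3 socioeconomic classes.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(sorted(set(labels.tolist())))  # the three cluster labels
```

In the datathon, the resulting labels were what we coloured the folium map by and compared against the Boston University figure.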
It’s worth noting that the latitude and longitude values were erroneous in our data (the same error can be found in other papers, for example, see this paper). We fixed these using Google’s geocoder API. We did so assuming that the offsets of the dwellings in each town from the town centre were consistent, and that the error in the plot came from the town centres being shifted. Effectively, if:
x_i is the coordinate (latitude or longitude) of a dwelling (i.e. an instance in the dataset),
x̄ is the average coordinate of all the dwellings in that town (we take this to be the ‘town centre’), and
d_i = x_i − x̄ is the offset of a dwelling from the town centre,
then our algorithm replaces x̄ with the coordinate found through the Geocoder API when the town name is searched, so each corrected coordinate becomes x̄_geo + d_i.
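The correction can be sketched as follows. The `geocoded` dictionary stands in for the town-centre coordinates that Google’s geocoder API would return when a town name is searched (the values here are made up, and so are the dwelling coordinates):

```python
import pandas as pd

# Toy dwellings with shifted coordinates, grouped by town.
df = pd.DataFrame({
    "town": ["A", "A", "B"],
    "LAT": [42.10, 42.14, 42.30],
    "LON": [-71.00, -71.04, -71.20],
})

# Hypothetical geocoder output: (lat, lon) per town name.
geocoded = {"A": (42.36, -71.06), "B": (42.25, -71.12)}

for coord, idx in [("LAT", 0), ("LON", 1)]:
    centre = df.groupby("town")[coord].transform("mean")  # old town centre x̄
    offset = df[coord] - centre                           # per-dwelling offset d_i
    # Replace x̄ with the geocoded centre, preserving each dwelling's offset.
    df[coord] = df["town"].map(lambda t: geocoded[t][idx]) + offset

print(df)
```

The key property is that the within-town spread of the dwellings is untouched; only the town centre moves to the geocoded location.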
Step 2: Predicting the NOx levels to reasonable accuracy using Regression
The second step in our approach was to predict the NOx levels. Through preliminary analysis, we found that the most informative features were as follows:
With this in mind, we tried a number of models to find the one that provides the best accuracy, following the methodology described by the flowchart on the right.
Our top 3 models are summarised below:
The SVR was the highest-performing model. This agreed with our intuition: one of our features (RAD) is an abstract index reported as an ordinal value rather than a continuous one. We suspected that standard linear regression techniques would fail to capture this, while a Support Vector method would not.
Step 3: Simulating ‘development’ of a town and measuring the effect on NOx levels
The final step of the project was to simulate development. This is where we felt that our work was ‘novel’ and smart in its approach. Looking closely at the data, we realised that we could split the attributes into two categories: geographically constrained and non-geographically constrained (i.e. socioeconomic).
For instance, the variable CRIM would be something that is not geographically constrained: it is possible to implement policies that decrease crime in neighbourhoods.
On the other hand, something like DIS is geographically fixed, since changing the distance to employment centres would mean magically moving either the towns or the employment centres. The former is impossible, and the latter is economically unfeasible.
As a result, we defined ‘development’ of a low-income town as replacing its non-geographically constrained features with values representative of high-income towns.
Effectively, this meant creating a bootstrapped high-income town dataset.
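This bootstrapping step can be sketched as follows, with made-up values and cluster labels (the real split between geographic and non-geographic features covered more columns than shown here):

```python
import numpy as np
import pandas as pd

# Toy towns with cluster labels from the earlier k-means step.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "cluster": ["low"] * 5 + ["high"] * 5,
    "CRIM":    [9.1, 8.7, 10.2, 7.5, 11.0, 0.1, 0.2, 0.1, 0.3, 0.2],
    "AGE":     [95, 90, 88, 97, 92, 30, 42, 35, 28, 40],
    "DIS":     [1.2, 1.5, 1.1, 1.8, 1.3, 6.0, 7.2, 5.5, 8.0, 6.6],
})

# 'Develop' the low-income towns: bootstrap-resample the non-geographic
# features from high-income towns, leaving geographic ones untouched.
developed = df[df.cluster == "low"].copy()
high = df[df.cluster == "high"]
for col in ["CRIM", "AGE"]:                 # non-geographically constrained
    developed[col] = rng.choice(high[col], size=len(developed), replace=True)
# DIS (geographically constrained) is deliberately left as-is;
# `developed` would then be fed to the trained NOx regressor.
print(developed)
```

The feature lists, column values and cluster labels above are illustrative; the point is that only the mutable, policy-addressable features are resampled.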
The image below shows the decrease in NOx levels when low income towns are improved (on the left), compared to the original (on the right).
Conclusion and Future Work
The evidence suggests that improving low-income neighbourhoods leads to decreased NOx levels, which gives even more incentive to develop them. One recommendation we make is that governments should provide incentives for companies to move their industrial buildings away from the city centre. However, we note that more research needs to be done to determine the socioeconomic impact this may have, as radical shifts could cause many residents of low-income neighbourhoods to lose their jobs.
In our analysis of the non-geographically constrained features (CRIM and AGE), we noticed that a decrease in either, keeping all other parameters constant, results in a decrease in NOx. This seems intuitive for AGE, since improving the infrastructure of old dwellings is likely to have positive environmental impacts, but less so for CRIM, where the link is neither clear nor intuitive. We therefore think CRIM may act as a predictor of changes in NOx levels, but not as a cause. One of our recommendations is therefore that future work should explore whether the relationships between CRIM, AGE and NOx are causal.