Educator Developer Blog


Data Analytics with PowerBI - Student project showcase - Impact on Air Quality


Feb 22, 2023

Introduction

I am Chukwuebuka (Malcom) Okonkwo, an aspiring Data Analyst and winner of the DataFest Africa 2022 Datathon. I am currently a Data Analyst Intern at Hamoye.com and a Chemical Engineering undergraduate who is passionate about data: transforming, exploring and analyzing it, and telling its story through charts, reports and dashboards to find insights that support informed decisions.

Project

Have you ever wondered what causes poor air quality? Or why the air feels fresh in one season but is hard to breathe in another? Has it ever occurred to you that particles in the atmosphere contribute to poor air quality, and that the concentration of these particles is affected by weather conditions?

This project aims to analyze the air quality of Beijing using data on its weather conditions from 2010–2014 and to explore how the weather affects air quality.

The tools I used to perform this analysis are Python and Power BI.

If you are new to Python or Power BI, Microsoft Learn has some excellent resources:

Get started with Power BI - Training | Microsoft Learn

Python for beginners - Training | Microsoft Learn

Case Study

A company in the environmental consulting industry is seeking to analyze the air quality in a specific city during hot and cold weather, during high-wind conditions and during precipitation. They are interested in making recommendations to the government and businesses in the region on how to mitigate the impact of weather conditions on air quality.

As a Data Analyst, you are expected to analyze the data provided, seek insights and make recommendations to achieve the set objectives.

Additionally, kindly use this dataset to analyze the historical impact of weather conditions on air quality, and make predictions on air quality during specific weather conditions. This information could be used to inform emergency response plans and prepare for potential air quality issues.

More details can be found here on the Microsoft blog.

Dataset Information

This dataset was sourced from the UC Irvine (University of California, Irvine) Machine Learning Repository, where it is published as the Beijing PM2.5 Data Set. The dataset and its field descriptions can both be found at archive.ics.uci.edu.

Data Analysis Process

To discover patterns in the raw data and draw valuable information from it, the following steps were crucial to completing this project:

Background Study
Data Gathering
Data Assessment and Cleaning
Exploratory Data Analysis
Data Visualization
Insights and Recommendation

Background Study

According to the New York State Department of Health, fine particulate matter (PM2.5) is an air pollutant that is a concern for people’s health when levels in the air are high. PM2.5 particles are tiny particles in the air that reduce visibility and cause the air to appear hazy when levels are elevated.

Air quality in this dataset is determined by the concentration (in µg/m³) of particulate matter (PM2.5) in the atmosphere. According to Breeze Technologies, PM2.5 levels over 55 µg/m³ indicate poor air quality and levels above 110 µg/m³ indicate severe air quality. For this analysis, a threshold of 100 µg/m³ was used to signal that the air quality is approaching the severe level.
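To make these cut-offs concrete, here is a minimal sketch of a helper that sorts a PM2.5 reading into bands. The three numbers come from the paragraph above; the function name and the band labels are my own illustrative choices and are not taken from Breeze Technologies or from the project notebook.

```python
POOR_LEVEL = 55.0       # µg/m³, readings above this are "poor" (figure quoted above)
ANALYSIS_LIMIT = 100.0  # µg/m³, the threshold chosen for this analysis
SEVERE_LEVEL = 110.0    # µg/m³, readings above this are "severe" (figure quoted above)


def classify_pm25(concentration: float) -> str:
    """Return an illustrative band label for a PM2.5 concentration in µg/m³."""
    if concentration > SEVERE_LEVEL:
        return "severe"
    if concentration > ANALYSIS_LIMIT:
        return "approaching severe"  # hypothetical label for the 100 µg/m³ limit
    if concentration > POOR_LEVEL:
        return "poor"
    return "below poor"  # hypothetical label for everything at or under 55 µg/m³


for reading in (30.0, 97.8, 105.0, 120.0):
    print(reading, classify_pm25(reading))
```

The checks run from the highest cut-off down, so each reading lands in the most serious band it qualifies for.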

The bulk of the analysis centers on how the concentration of PM2.5 changes as atmospheric conditions change.

Data Gathering

The dataset was downloaded from the UC Irvine Machine Learning Repository (archive.ics.uci.edu). The data was loaded into a Jupyter Notebook to begin the data assessment and cleaning process using Python.
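As a rough illustration of the loading step, the sketch below reads the hourly records with pandas. It is not the notebook code: the function name is hypothetical, the column names other than year, month, day, pm2.5, PRES and cbwd are assumed from the UCI field description, and the two sample rows are made up so the example runs without the real file.

```python
import io

import pandas as pd


def load_air_quality(source) -> pd.DataFrame:
    """Read the raw hourly records from a CSV path or file-like object."""
    return pd.read_csv(source)


# Tiny inline stand-in for the downloaded CSV (values are invented).
sample = io.StringIO(
    "No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir\n"
    "1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0\n"
    "2,2010,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0\n"
)
raw = load_air_quality(sample)

print(raw.shape)                  # rows and columns read
print(raw["pm2.5"].isna().sum())  # how many PM2.5 readings are missing
```

With the real dataset, the same function would be called with the path to the downloaded CSV file instead of the inline sample.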

Data Assessment and Cleaning

The dataset was assessed for issues with its quality and its structure. A snapshot of the data is shown below.

The dataset was assessed visually and programmatically (using code). Then the appropriate steps were taken to clean the data. The steps were:

Created a Date column from the year, month and day columns, then corrected its data type
Handled the null values in the pm2.5 column
In the cbwd column, replaced ‘cv’ with ‘SW’, representing South West
The PRES column, which represents atmospheric pressure, is in hectopascals (hPa). I converted it to atm (standard atmospheres) and saved it in a new column, atm_pressure
Classified the months into four seasons:

Winter — December, January and February
Autumn — September, October and November
Spring — March, April and May
Summer — June, July and August
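The cleaning steps listed above can be sketched in pandas roughly as follows. This is an illustration, not the code from my notebook: the function name and the Season column name are hypothetical, and dropping rows with a missing pm2.5 value is only one possible way of handling the nulls, chosen here as an assumption.

```python
import pandas as pd

HPA_PER_ATM = 1013.25  # one standard atmosphere expressed in hectopascals

SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a copy of the raw data."""
    df = raw.copy()
    # Build a datetime column from the year, month and day columns.
    df["Date"] = pd.to_datetime(df[["year", "month", "day"]])
    # Assumption: rows without a PM2.5 reading are dropped.
    df = df.dropna(subset=["pm2.5"]).reset_index(drop=True)
    # Relabel the 'cv' wind direction as 'SW'.
    df["cbwd"] = df["cbwd"].replace("cv", "SW")
    # Convert pressure from hPa to atm in a new column.
    df["atm_pressure"] = df["PRES"] / HPA_PER_ATM
    # Map each month number to its season.
    df["Season"] = df["month"].map(SEASON_BY_MONTH)
    return df


# Made-up rows, only to show the function running.
raw = pd.DataFrame({
    "year": [2010, 2010, 2010],
    "month": [1, 7, 10],
    "day": [1, 15, 31],
    "pm2.5": [None, 80.0, 120.0],
    "cbwd": ["NW", "cv", "SE"],
    "PRES": [1021.0, 1013.25, 1000.0],
})
cleaned = clean(raw)
print(cleaned[["Date", "cbwd", "atm_pressure", "Season"]])
```

Working on a copy keeps the raw data untouched, which makes it easy to compare the data before and after cleaning.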

The full data cleaning procedure is documented here on my GitHub Repo.

The cleaned dataset is shown below.

Exploratory Data Analysis

In this section, the “Question-Visualization-Observations” framework is used. This framework involves asking a question of the data, creating a visualization to find answers, and then recording observations.

The Questions asked of the data are

What can be observed about the PM2.5 concentration with respect to time?
Does the wind speed affect PM2.5? In which direction does the wind most often blow? Does the wind direction impact the concentration of PM2.5?
How many hours of precipitation (rainfall) and snowfall does the city get, especially in each season? Does this influence the concentration of PM2.5?
What are the atmospheric temperature and pressure of the city during each season? Do they affect the air quality?

The main purpose of exploring the data is to find patterns, identify anomalies, test hypotheses, and verify presumptions with the aid of summary statistics and graphical representations.

The exploratory data analysis process is extensively documented here on my GitHub Repo.

Data Visualization

The next step is to translate the information obtained from the data into a visual form, which makes it easier to communicate my findings.

This was done using Power BI after I exported my cleaned and pre-processed data from my Jupyter Notebook. The dashboard created using Power BI is shown below

Dashboard page 1

Dashboard page 2

Insights

After exploring the dataset, researching the case study and visualizing the data, I discovered the following insights:

The mean PM2.5 value across all years and all atmospheric conditions is 97.80 µg/m³, which is very close to the threshold (100 µg/m³). This shows that the city’s air quality is not at its best, with the average PM2.5 level falling in the poor range.

Looking at the PM2.5 level for each month and season, the level is highest during the winter season (December to February), at approximately 110 µg/m³ on average.

The average PM2.5 level during the autumn season also surpassed the threshold, at 101.58 µg/m³. The average PM2.5 levels during the spring and summer seasons are 88.24 µg/m³ and 91.74 µg/m³, respectively.

These observations show that the PM2.5 level is worst during the winter season, followed by the autumn season.
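Seasonal averages like the ones above can be computed with a groupby. The sketch below is a minimal example under a few assumptions: the cleaned data has a Season column (a name I am using for illustration), the function name is hypothetical, and the sample values are invented and are not the Beijing figures.

```python
import pandas as pd

THRESHOLD = 100.0  # µg/m³, the limit used in this analysis


def seasonal_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean PM2.5 per season, flagged against the threshold, worst season first."""
    summary = df.groupby("Season")["pm2.5"].mean().to_frame("mean_pm25")
    summary["above_threshold"] = summary["mean_pm25"] > THRESHOLD
    return summary.sort_values("mean_pm25", ascending=False)


# Invented readings, two per season.
sample = pd.DataFrame({
    "Season": ["Winter", "Winter", "Autumn", "Autumn",
               "Spring", "Spring", "Summer", "Summer"],
    "pm2.5": [120.0, 100.0, 95.0, 110.0, 80.0, 90.0, 90.0, 94.0],
})
summary = seasonal_summary(sample)
print(summary)
```

Sorting by the mean puts the worst season at the top, which mirrors how the seasons are ranked in the observations above.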

A wind rose plot was created to observe the direction in which the wind most often blows. It is shown below.

From the plot, we can see that the wind mostly blows in the South East direction, but the higher wind speeds occur in the North West direction.

Higher levels of PM2.5 occur more often when the wind is headed towards the South West (SW) than the South East. Levels get extremely high during the winter period when the wind is headed towards the South East and South West.
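The numbers behind a wind rose can also be summarised in a small table. The sketch below counts hours per wind direction and averages wind speed and PM2.5 for each. It assumes the wind speed column is named Iws, as in the UCI field description; the function name is hypothetical and the sample rows are invented.

```python
import pandas as pd


def wind_direction_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hours with a reading, mean wind speed and mean PM2.5 per wind direction."""
    summary = df.groupby("cbwd").agg(
        hours=("pm2.5", "count"),         # hours with a PM2.5 reading
        mean_wind_speed=("Iws", "mean"),  # Iws: assumed wind speed column
        mean_pm25=("pm2.5", "mean"),
    )
    return summary.sort_values("hours", ascending=False)


# Invented rows, one per hour.
sample = pd.DataFrame({
    "cbwd": ["SE", "SE", "SE", "NW", "NW", "SW"],
    "Iws": [2.0, 4.0, 6.0, 40.0, 60.0, 3.0],
    "pm2.5": [100.0, 120.0, 110.0, 30.0, 50.0, 150.0],
})
summary = wind_direction_summary(sample)
print(summary)
```

Splitting the same table by season would show whether the high readings cluster in winter for particular directions.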

Looking at the relationship between wind speed and the PM2.5 level, the lower the wind speed, the higher the PM2.5 level.
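One simple way to put a number on this relationship is a correlation coefficient. The sketch below uses the Pearson correlation from pandas; a negative value means PM2.5 tends to fall as wind speed rises. The Iws column name is assumed from the UCI field description, the function name is hypothetical, and the sample values are invented.

```python
import pandas as pd


def wind_speed_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between wind speed (Iws) and PM2.5."""
    return float(df["Iws"].corr(df["pm2.5"]))


# Invented readings: calmer hours paired with higher PM2.5.
sample = pd.DataFrame({
    "Iws": [1.0, 5.0, 20.0, 60.0],
    "pm2.5": [200.0, 150.0, 60.0, 20.0],
})
r = wind_speed_correlation(sample)
print(round(r, 2))
```

A correlation only measures a straight-line relationship, so a scatter plot is still worth checking alongside it.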

The average hours of precipitation (rainfall) and snowfall were observed for each month. From the charts, rainfall occurs most often during the summer season (June - August), while snowfall occurs most often during the winter season (December - February).

In periods with few hours of precipitation (rainfall), the PM2.5 levels are extremely high. When there are longer hours of rainfall, the PM2.5 levels are low in comparison to when there are fewer hours of rainfall.
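A quick way to check this kind of pattern is to compare the average PM2.5 for hours with and without rain. The sketch below assumes the rain column is named Ir (hours of rain, per the UCI field description); the function name, the "rain" and "dry" labels, and the sample values are all illustrative.

```python
import pandas as pd


def mean_pm25_by_rain(df: pd.DataFrame) -> pd.Series:
    """Mean PM2.5 for hours with rain versus hours without rain."""
    # Label each row by whether any rain was recorded (Ir: assumed column name).
    labelled = df.assign(rain=(df["Ir"] > 0).map({True: "rain", False: "dry"}))
    return labelled.groupby("rain")["pm2.5"].mean()


# Invented readings.
sample = pd.DataFrame({
    "Ir": [0, 0, 0, 2, 5, 8],
    "pm2.5": [150.0, 130.0, 140.0, 60.0, 40.0, 50.0],
})
result = mean_pm25_by_rain(sample)
print(result)
```

The same comparison could be repeated with the snow column to look at snowfall.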

Recommendations

After gathering insights from the data, I would like to make a few recommendations:

Since the city’s average PM2.5 level is normally high, I strongly advise the government to look into the United States’ National Action Plan on Pollutant & Control, which intends to reduce PM2.5 (respirable pollution particles) concentrations by 20% to 30% relative to 2017 annual levels in more than 100 cities. The plan promised to achieve these objectives by reducing reliance on coal, limiting car emissions, expanding the production of renewable energy sources, and strictly enforcing emissions regulations.

When the wind is blowing in the south-west and south-east directions, high levels of PM2.5 are seen. Tracking the sources of the pollutants and putting a stop to them will help lower the PM2.5 concentration.

The PM2.5 concentration was seen to rise quickly during the winter months. This may be due to the extensive use of coal and other fossil fuels for heating. Where possible, I advise that homes that cook and heat with coal or with biomass fuels, such as wood, animal dung and crop wastes, switch to cleaner fuels, including biogas (methane), liquefied petroleum gas (LPG), electricity, or solar cookers.

From the analysis, a high number of hours of rain and snow coincides with a low PM2.5 level. According to Davis Instruments, rain and snow can wash particulate matter out of the air and destroy dissolvable pollutants. While the pollutants are washed out or dispersed, they are not gone. They are just moved somewhere else. They end up in someone else’s lungs, or dropped into bodies of water for aquatic plants and animals to deal with. It is advised that citizens who are very sensitive health-wise refrain from excessive outdoor activities so that they are not affected.

When the PM2.5 concentration is above 110 µg/m³, the public is advised to use pollution masks and to use air purifiers indoors. PM2.5 concentrations above that level have serious health impacts.

For Further Exploration

The Case study for this Project stated:

“Additionally, kindly use this dataset to analyze the historical impact of weather conditions on air quality, and make predictions on air quality during specific weather conditions. This information could be used to inform emergency response plans and prepare for potential air quality issues.”

The analysis I carried out explored the historical impact of weather conditions on air quality. It was used to generate insights and make recommendations to curb the effect of weather conditions on air quality and reduce the concentration of PM2.5 in the atmosphere.

However, the analysis does not contain the predictive analysis needed to forecast the air quality in the coming days, months or years. Thus, the project could be continued from its current stage by building a model to perform predictive analysis.

The full analysis is documented on my GitHub Repository.

Conclusion

Documenting experiments and sharing reproducible code via GitHub is important for several reasons:

Reproducibility: Documenting experiments and sharing code allows other researchers to reproduce your work and validate your findings. This is essential for the advancement of science because it allows other researchers to build upon your work and test your conclusions.
Transparency: Documenting your experiments and sharing code promotes transparency, which is a fundamental principle of scientific research. By sharing your data, methods, and code, you make it possible for other researchers to evaluate your work and ensure that it is free from errors and biases.
Collaboration: Documenting your experiments and sharing code can facilitate collaboration with other researchers who may have complementary skills or resources. This can lead to more innovative research projects and faster progress in the field.
Efficiency: Documenting your experiments and sharing code can save time and effort by enabling other researchers to use and build upon your work, rather than starting from scratch. This can help to avoid duplication of effort and promote more efficient use of research resources.
Education: Documenting your experiments and sharing code can also be a valuable educational resource for other researchers who want to learn from your work. By providing detailed documentation and well-organized code, you can help to teach others best practices in research methodology and data analysis.

Overall, documenting your experiments and sharing reproducible code is essential for ensuring the credibility and transparency of scientific research and promoting collaboration and efficiency in the field. Students get free access to GitHub and GitHub resources such as Codespaces and Copilot, as well as an amazing global community; see http://education.github.com

Thank you for reading and I can be contacted on LinkedIn and Twitter.
