Educator Developer Blog

How to Scrape Twitter Data for Sentiment Analysis with Python and Power BI

Flora_Oladipupo

Aug 19, 2022

OUTLINE

Introduction: What is Sentiment Analysis?
Use Case: Twitter Data
Aim of the project
Tools used and workflow
Data Gathering
Data Wrangling/Preprocessing
Sentiment Analysis
Visualization

Introduction

As a data analyst, you will sometimes work with data from secondary sources, some of which has to be obtained through web scraping. A simple use case: suppose a business wants to understand how customers perceive and feel about its brand, based on their activity on Twitter.

This project is a collaboration between Abisola Agboola (Abisola_Agboola) and me. We are both Beta Microsoft Learn Student Ambassadors. This article contains embedded links that will lead to Part 2 of this work (Visualizing the Twitter Data with Microsoft Power BI) done by Abisola_Agboola.

What is Sentiment Analysis?

Sentiment analysis is an application of Natural Language Processing (NLP). It is used to identify the tone behind an opinion, text, or sentence, which simplifies the task of understanding how people feel about a subject. It is useful in many settings, such as gauging how a business is perceived, how well a product is accepted, and how customers rate a service.

Use Case: Twitter Data

To get the data for the analysis, you have to find a way to scrape this data first, clean it, analyze it, and then use a visualization tool to present it to the business.

Aim of the Project

Disclaimer

This analysis is not a prediction of the Nigeria 2023 election result. It is a use case to demonstrate Twitter data scraping, transformation, analysis, and visualization.

Through this project, we wish to tell a compelling story and make the public aware of the overall tone of their activity on Twitter towards the forthcoming general election in 2023.

Important Libraries Used

The necessary libraries and modules used in this project are listed in the Jupyter notebook containing the code. The major tools were snscrape, for scraping historical data, and TextBlob, for determining the polarity of text to get its sentiment.

Workflow

Data gathering
Data Preprocessing
Sentiment Analysis
Data Visualization
Communicating result

Data Gathering

The data was collected using snscrape because the library imposes few restrictions: it can scrape historical data and, unlike libraries such as Tweepy, does not require API keys. A total of 58,633 tweets were collected from 1 January 2022 to 30 July 2022. The yield differed from month to month: some months fell short of the 20,000-tweet limit set in the code, while others had more tweets than the limit allowed.

The query specifies the tweets one is interested in, and a for loop is then run over the search results.

The result of the query can be seen in a dataframe.
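
A minimal sketch of this query-and-loop pattern follows. It assumes the tweets come from snscrape's `TwitterSearchScraper(query).get_items()` generator; the search terms, the helper names `build_query` and `collect`, and the chosen fields are hypothetical, since the article does not state them. To keep the sketch runnable without network access, `collect` accepts any iterable of tweet-like objects.

```python
from datetime import date
from itertools import islice

MONTHLY_LIMIT = 20_000  # per-month cap mentioned in the article


def build_query(terms, since, until):
    """Return a search string restricted to the given date window."""
    return f"{terms} since:{since.isoformat()} until:{until.isoformat()}"


def collect(items, limit=MONTHLY_LIMIT):
    """Keep selected fields from at most `limit` tweet-like objects."""
    rows = []
    for tweet in islice(items, limit):
        rows.append({
            "date": tweet.date,
            "tweet": tweet.content,  # attribute name may differ by snscrape version
            "location": tweet.user.location,
        })
    return rows


# Hypothetical search terms; the article does not state the actual query.
query = build_query("Nigeria election 2023", date(2022, 1, 1), date(2022, 2, 1))

# With snscrape installed, the iterable would come from:
#   import snscrape.modules.twitter as sntwitter
#   rows = collect(sntwitter.TwitterSearchScraper(query).get_items())
# and the rows could then be loaded with pandas.DataFrame(rows).
```

Running one such query per month and concatenating the rows would give a per-month cap like the one described above.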

Data Wrangling/Pre-processing

The missing locations were filled with the word ‘Unknown’. Words with different spellings were replaced with a uniform spelling so that the analysis would be accurate. New columns were also created for the parties of the top three presidential candidates: the APC, PDP, and Labour Party. Another set of columns was created for the names of the top three candidates, to accurately count the number of times each name appeared in tweets.

To know the number of times each of the top three candidates' names and parties was mentioned in tweets, the names need to be extracted into separate columns by writing a function.
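
The first two cleaning steps can be sketched in plain Python. The spelling variants below are hypothetical, because the article does not list the ones it replaced, and the helper names `fill_location` and `unify_spelling` are illustrative only.

```python
import re

# Hypothetical variants; the article does not list the spellings it replaced.
SPELLING_FIXES = {
    "labor party": "Labour Party",
    "labour-party": "Labour Party",
}


def fill_location(location):
    """Replace a missing or blank location with 'Unknown'."""
    if location is None or not str(location).strip():
        return "Unknown"
    return location


def unify_spelling(text):
    """Replace each known variant (case-insensitive) with its uniform form."""
    for variant, uniform in SPELLING_FIXES.items():
        text = re.sub(re.escape(variant), uniform, text, flags=re.IGNORECASE)
    return text
```

On a pandas dataframe, the location step would more likely be a single call such as `df["location"].fillna("Unknown")`.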

The result of the above code can be seen below
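
A minimal sketch of such an extraction function is given here for illustration. The party names come from the article, but the alias lists, the sample tweets, and the name `extract_mentions` are assumptions. The same approach would apply to the candidate names.

```python
import re

# Party names come from the article; the alias lists are illustrative guesses.
PARTY_ALIASES = {
    "APC": ["apc"],
    "PDP": ["pdp"],
    "Labour Party": ["labour party", "lp"],
}


def extract_mentions(text, aliases=PARTY_ALIASES):
    """Return the canonical names whose aliases appear as whole words in text."""
    found = []
    for name, variants in aliases.items():
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Made-up sample tweets, used only to show the shape of the output.
tweets = [
    "APC and PDP are both campaigning hard",
    "The Labour Party rally was huge",
    "No politics today",
]
mentions = [extract_mentions(t) for t in tweets]

# Number of tweets that mention each party at least once.
counts = {name: sum(name in m for m in mentions) for name in PARTY_ALIASES}
```

Matching on word boundaries keeps a short alias from firing inside a longer word.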

Data preprocessing: the bulk of the project lies in this step. For the sentiment analysis to work, this stage needs to be done accurately. Preprocessing steps are not cast in stone; they depend on the nature of the data you are working with and what needs to be changed. However, some transformations are always required for sentiment analysis. These steps are listed in no particular order:

Converting the words to lower case: the tweet column is converted to lower case so that the words are uniform.
Removing URLs, digits, punctuation, emojis, and anything else that is not needed for the sentiment analysis.
Tokenizing the tweets column: that is, breaking each sentence down into individual words.
Removing stop words: these are words that add little meaning to a sentence, for example "is" and "the".
Lemmatizing words: this reduces each word to its base form; for example, the lemmatized form of "bags" is "bag".
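
The steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the stop-word list is a tiny hypothetical sample, the tokenizer simply splits on whitespace, and lemmatization is left out because it needs a library such as NLTK's `WordNetLemmatizer`.

```python
import re
import string

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in"}


def preprocess(tweet):
    """Lowercase, strip noise, tokenize, and drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z\s]", " ", text)  # drop emojis and other non a-z symbols
    tokens = text.split()                  # simple whitespace tokenizer
    return [t for t in tokens if t not in STOP_WORDS]
```

Note that the last regular expression also removes accented and non-Latin letters, which may or may not be acceptable depending on the language of the tweets.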

A new column called Processed tweets is created and can be seen in the data frame below.

A bit of data wrangling was then carried out on the Processed tweets column.

Sentiment Analysis

After data wrangling/pre-processing, the TextBlob library is used to get the polarity of the text, that is, a score between -1 and 1 indicating how negative, neutral, or positive the text is. A condition then assigns the sentiment: a polarity > 0 is positive, == 0 is neutral, and < 0 is negative.

The code for this project can be found on my GitHub page.
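
A minimal sketch of the labelling condition, assuming the usual convention that a polarity above zero is positive and below zero is negative. The function name `label_sentiment` is hypothetical, and the polarity values are passed in directly so the example runs without TextBlob installed.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# With TextBlob installed, the score would come from:
#   from textblob import TextBlob
#   polarity = TextBlob("processed tweet text").sentiment.polarity
labels = [label_sentiment(p) for p in (0.6, 0.0, -0.3)]
```

Applying the function to every row of the processed tweets column would give one sentiment label per tweet.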

Data Visualization

To visualize the data and tell a more compelling story, we will be using Microsoft Power BI.

Python can produce charts, but we found its default visuals less appealing than a Power BI dashboard for presenting results to an audience. The data used for this project was therefore saved in a file and sent to my partner for visualization.
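
As a rough sketch of the export step, the rows can be written to CSV with the standard library. The column names and sample rows below are made up for illustration, and an in-memory buffer stands in for a real file so the example has no side effects.

```python
import csv
import io


def write_csv(rows, handle):
    """Write a list of dicts to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample rows; the real data would come from the processed dataframe.
rows = [
    {"tweet": "great rally today", "location": "Unknown", "sentiment": "Positive"},
    {"tweet": "long queues again", "location": "Lagos", "sentiment": "Negative"},
]

buffer = io.StringIO()  # stands in for open("tweets.csv", "w", newline="")
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

With pandas, the equivalent would be a single call such as `df.to_csv("tweets.csv", index=False)`; Power BI can then import the file through its Text/CSV connector.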

This was carried out by my partner, Abisola_Agboola, and the result can be seen below. To see how this dashboard was built, check out Part II of this article.

Part II

Click the link here https://aka.ms/twitterdataanalysispart2 to see how this Power BI visual was built and follow through to create yours.

Summary

In this article, we made it clear that in several scenarios you will have to work with secondary data in your organization. One of the ways to get this data is through web scraping.

By following this article, you have learnt how to:

Scrape Twitter using the snscrape library.
Clean the data and transform it into tabular form.
Use the TextBlob library to calculate a sentiment score for each tweet.
Export this data to CSV/Excel so that it can be imported into Power BI for visualization.

Part II is available at https://aka.ms/twitterdataanalysispart2 where you will be able to build your own Power BI visualization and hone your skills.