Did you know it’s possible to turn a csv file into JSON? In this article, we'll delve into the intriguing realm of ETL. You’ll learn what it is, how it works, and see it in action with a sample that you can try with the cloud-based development environment GitHub Codespaces.
What is ETL?
ETL, which stands for Extract, Transform and Load is a common process that data professionals (like data engineers and analysts) use to collect data from various sources, refining and processing it into a format that makes it useful for gathering important information, and finally sending it to its final destination so it can be used for further analysis and reporting.
Orange juice factory - Image generated by Dall-E
You can compare ETL to the journey of an orange that becomes a refreshing glass of orange juice.
Extract is like hand-picking the ripest oranges from a tree. It's the initial gathering, where we choose the data sources we want and bring them into our environment. Just as we would want the best fruit an orchard has to offer, in ETL, we want to select the most relevant data.
Transform is where the magic happens. The oranges undergo various processes to transform into a golden glass of orange juice. Like the ETL process, raw data is cleansed, reshaped, and enriched during the transformation phase, turning it into insightful information.
Load is like pouring the juice into your cup, ready to drink. In ETL, this is where we store our processed data in a place (like a database or data lake) where it's ready to be consumed, analyzed, and enjoyed.
Real-World ETL Considerations
Before we dive into this demonstration, it's crucial to understand that ETL processes in a professional setting are often more complex than what we're covering here. In the real world, ETL involves handling large volumes of data from diverse sources. These sources can range from straightforward files, like CSVs, to complex, constantly updating databases. Data engineers frequently use robust ETL tools and platforms like SQL Server Integration Services (SSIS) and the cloud-based Azure Data Factory and Azure Synapse Analytics to manage these complexities efficiently. Real-world scenarios demand more robust, secure, and scalable solutions, often involving a team of experienced data professionals and a suite of specialized tools.
ETL in action: Tools of the trade
In this basic demonstration, we’ll be using Jupyter Notebooks to run our Python code and GitHub Codespaces to host our development environment. There are many ways data professionals and enthusiasts perform ETL operations. However, ETL with a notebook is typically used for data exploration purposes or other rare one-off instances.
Before getting started with a demo, let’s walk through the tools.
Jupyter notebooks are an interactive environment that combines, code, text, and other forms of visuals and media into a single document. Code, text, and media like pictures or visualizations are contained in cells. When you run a code cell by clicking the “play” button beside it, the result is displayed directly below it, making the process and results transparent and easily shareable. If you’re new to Python, it’s an excellent way to learn!
When you’re writing Python, you’ll often need access to libraries, tools, and runtimes; this can be a challenge to manage on a local machine. GitHub Codespaces is game-changer for Python developers because it is cloud-based development environment that doesn’t require local setup or installations, streamlining your coding process and ensuring you're always working with the latest resources seamlessly.
You can get started with a Codespace by selecting the Codespaces tab under the <> Code button in a GitHub repository.
First, we'll use a common Python library called Pandas to extract the data from the csv file, as seen in the following image.
Pandas represents this data in a 'dataframe', which is a tabular structure similar to a csv with rows and columns. Dataframes are a flexible tool for a wide range of data manipulation, analysis, and visualization tasks.
We've successfully extracted the data, but like a freshly picked orange that’s just waiting to be juiced we still have some work to do. In the next cell, we convert this data into JSON, a simple, text-based format used to represent structured data. It’s commonly used in web applications to exchange and store data.
Again, we rely on the Pandas library to take 3 random rows with the sample function, then convert it from a tabular format into JSON as seen in the gif.
The final resting place will be a JSON file in our Codespace directory. As seen in the gif below, the transformed data has been loaded into a local JSON file with Pandas.
Let’s see a more advanced example of loading with Azure Cosmos DB, which is a cloud-based database service that is great for storing different types of non-tabular data. This demo uses environment variables to store the secrets of the database and requires the Azure Cosmos DB library to access and load the data into the database.
The view below is from the Data Explorer in the Azure portal that is loaded with data that originates from the csv.
And just like that, our data's new home is ready! We’re done!
From raw data to insightful information, ETL is the journey and tools like Jupyter Notebooks and GitHub Codespaces are the vehicle. Now that you've had a taste, dive in yourself and explore the GitHub repo and learn more!