With the latest advances in ML, a lot of classification and regression problems can be solved with very little effort. Just feed the data to an algorithm (or multiple algorithms in parallel using AutoML), wait for training, and pick the best model. But IoT data is not that simple. Here’s why.
First, let’s look at what a regular classification problem looks like. Take the Wheat-Seeds dataset, for example. The problem is to determine a seed’s type, based on attributes like area, perimeter, and compactness, etc, which can be observed on the seed itself. Below is a sample of the dataset:
Notice how the data required to find out the type of seed iscontained in each row.
How does that compare to an IoT dataset? Take this activity recognition dataset, for example.
The problem is similar: predict the activity based on the accelerometer data. Butsomething’s different here.
Each row in this dataset represents a time instant - like a picture. Unlike the previous example,you can’t tell what the subject is doing based on a single data point. That would be like telling how fast a person is going based off of a picture!
So how do we handle this classification problem?
Feature engineering to the rescue
One way to work with IoT classification problems is to use feature engineering. Through this process, we will perform operations on top of the raw data in order to surface the meaning behind it, before moving on to modeling.
Back to our activity recognition dataset, how can we apply feature engineering?
For each of the original features, let’s take the mean and standard deviation over 10 samples. As for the activity, we’ll take the most common one in each sample range, as an activity shouldn’t change too frequently between samples. Here’s a sample of what the altered dataset might look like:
Now, for each range of samples, we have features thatcorrelate much more closely with the activity. It’s easier to tell what each subject is doing by analyzing the measurements over a time window than it is by a single time instant. Besides the mean and standard deviation, there are manywindow functionsthat can summarize time series data in different ways. This is what we call awindow aggregation, and is just one of many feature engineering techniques for time-series data.
The Fast Fourier transform
The Fast Fourier transform is a special window aggregation function that decomposes a time series into periodic components. Like taking a signal - in this case, acceleration - and representing it as the sum of multiple sine waves. The following diagram shows what this decomposition looks like:
Notice how the original signal can be represented as the sum of three sine waves. This means that, in that time window, the signal can be summarized into three coefficients: the amplitude of each of those sine waves. This, of course, assumes our signal is approximately periodic, at least within the time window under analysis. Lucky for us, accelerometer data in activity recognition is falls into that category, as do a lot of IoT datasets!
The decomposition operation itself is quite complicated, and involves a lot of mathematics. But open source libraries like numpy already have an optimized implementation of it, so we don’t have to get into the ugly details to take advantage of this technique, as long as we know its implications.
The conclusion is thatthe Fourier transform can efficiently summarize IoT time-series data. So it is a very powerful tool for machine learning with IoT data.
So what does this process look like in the real world? Using an Azure Synapse Spark notebook, we loaded the accelerometer dataset mentioned earlier into a Spark dataframe, and after some minor data cleaning operations, this is what it looks like:
Don’t worry about the timestamp being in 1970 - this is just our Unix time “zero”.
Next, we implemented a Transformer - that’s a Python class that converts a dataframe to a different format - that can perform the Fast Fourier transform to the defined input columns. After transformation, the dataframe looks like this:
Now we have quite a few more columns: eight for each axis of acceleration. We can actually parameterize into how many components we're splitting each time series, up to the number of samples in the window. The more components, the more accurate the decomposition will be - and the more data we’ll have to process, so be careful!
Following this transformation, we trained two Random Forest Classification models using Pyspark - one on the original data, and one on transformed data - with some interesting results.
As shown above, our model’s accuracy measure - meaning, how often it gets the subject’s activity right - went from 56.7% to 70.4%, just from applying FFTs for feature engineering. This is by no means the best possible model, but it highlights the value of properly formatting data before starting the modeling process.
Curious about how the whole modeling process works? Check out the full notebookhere!