As a data scientist, your role is to extract insights from data and train machine learning models for prediction or classification scenarios. Over time, a variety of tools have emerged for cleaning data and for training and deploying models. In this blog post, I will guide you through getting started as a data scientist with Microsoft Fabric.
Why Microsoft Fabric?
Your question might be: why Microsoft Fabric? Microsoft Fabric is a unified platform that lets you run your data science workloads and also work alongside other data specialists. It is a comprehensive solution that keeps data, notebooks and models all in one place, which makes collaboration easier and provides a seamless and efficient experience.
Milestone 1: Load the diabetes dataset from Azure Open Datasets
Azure Open Datasets provides a wide range of datasets you can explore. In this post we will focus on the Diabetes dataset. The dataset contains 442 patient samples with 10 features used to predict a quantitative measure of disease progression. The features include age in years, sex, body mass index (BMI), blood pressure (BP) and six serum measurements.
As we can load our dataset directly into our notebook, we will not need to create a Lakehouse. First we need to create a new Notebook. To do so, head over to Microsoft Fabric and click on the Fabric logo at the bottom left. A new sidebar will pop up; select Data Science. Lastly, click on Notebook and create a new Notebook.
In our newly created Notebook, we will load our dataset using PySpark, with the code provided on the Azure Open Datasets page. That code reads the data from Azure Blob Storage as a Parquet file, and we then display the first ten rows of our dataset.
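A minimal sketch of that loading step could look like the block below. The storage account, container and path values are the ones I recall from the Azure Open Datasets page for this dataset, so treat them as assumptions and check them against the page. `spark` and `display` are provided by the Fabric notebook session, and the helper names are my own.

```python
# Assumed public storage location of the Diabetes sample (verify on the dataset page).
BLOB_ACCOUNT_NAME = "azureopendatastorage"
BLOB_CONTAINER_NAME = "mlsamples"
BLOB_RELATIVE_PATH = "diabetes"
BLOB_SAS_TOKEN = ""  # assumption: the dataset is public, so no token is needed


def build_wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build the wasbs:// URL that Spark uses to reach Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def load_diabetes(spark):
    """Read the Parquet files into a Spark DataFrame using the notebook's session."""
    spark.conf.set(
        f"fs.azure.sas.{BLOB_CONTAINER_NAME}.{BLOB_ACCOUNT_NAME}.blob.core.windows.net",
        BLOB_SAS_TOKEN,
    )
    path = build_wasbs_path(BLOB_CONTAINER_NAME, BLOB_ACCOUNT_NAME, BLOB_RELATIVE_PATH)
    return spark.read.parquet(path)


# In a Fabric notebook cell, where `spark` and `display` already exist:
# df = load_diabetes(spark)
# display(df.limit(10))  # show the first ten rows
```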
Milestone 2: Clean and transform your data in a Notebook
Once our dataset is loaded, the next step is to clean and transform our data. This is essential not only for handling outliers and missing values but also for improving the accuracy of our model. First we will convert our dataset to a pandas DataFrame, which makes it easier to analyze.
Using `df.info()`, we can conclude that our data does not contain any missing values.
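The conversion and the missing-value check might be sketched as follows. The variable name `df` for the Spark DataFrame is an assumption, the helper `missing_value_report` is hypothetical, and the three toy rows are made up purely to show what a missing value looks like in the report; they are not taken from the dataset.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column; all zeros means nothing needs imputing."""
    return df.isnull().sum()


# In the notebook, after loading with Spark (assumed variable name `df`):
# df = df.toPandas()   # collect the Spark DataFrame into pandas
# df.info()            # column types and non-null counts
# print(missing_value_report(df))

# Made-up toy rows, only to illustrate the report on a frame with one gap.
toy = pd.DataFrame({"AGE": [50, 41, 63], "BMI": [30.2, 22.4, None]})
toy_report = missing_value_report(toy)
print(toy_report)
```

If every count in the report is zero, as the post finds for the diabetes data, no imputation step is needed.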
To further understand our data, we can create a visual correlation matrix using seaborn. A correlation matrix shows the strength of the linear relationship between each pair of columns. Other than AGE, S1 and S2, the remaining columns show a comparatively high correlation with our target variable.
Additionally, you can perform more analysis and transformation tasks, such as checking categorical columns for class imbalance using `value_counts()`, or checking for outliers using box plots or scatter plots.
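One way to produce such a matrix is sketched below. I am assuming the target column is named `Y` and that `df` is the pandas DataFrame from the previous step; both helper functions are hypothetical names. The plotting libraries are imported inside the plotting helper so that the numeric ranking can be used on its own.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "Y") -> pd.Series:
    """Rank numeric features by their Pearson correlation with the target column."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def plot_correlation_matrix(df: pd.DataFrame):
    """Draw the full correlation matrix as an annotated seaborn heatmap."""
    # Imported here so the ranking helper above works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    ax.set_title("Correlation matrix")
    return ax


# In the notebook (assumed variable name `df`):
# plot_correlation_matrix(df)
# print(correlation_with_target(df, target="Y"))
```

Reading the ranked series next to the heatmap makes it easier to see which columns are most strongly related to the target.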
Milestone 3: Train your model using MLflow and Linear Regression
The next step is training our model. First we define our features and target, then split our data into train and test sets. We do this so that we can evaluate the performance of our model on new, unseen data.
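A short sketch of the split using scikit-learn's `train_test_split` is shown here. The target column name `Y`, the 80/20 split and the fixed `random_state` are my assumptions, not values from the original notebook, and `split_features_target` is a hypothetical helper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(df: pd.DataFrame, target: str = "Y",
                          test_size: float = 0.2, random_state: int = 42):
    """Separate features from the target, then hold out a test set."""
    X = df.drop(columns=[target])  # every column except the target is a feature
    y = df[target]
    # Returns X_train, X_test, y_train, y_test in that order.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# In the notebook (assumed variable name `df`):
# X_train, X_test, y_train, y_test = split_features_target(df)
```

Fixing `random_state` keeps the split reproducible, so metrics from different runs can be compared fairly.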
In this demo, we will use MLflow to track our experiments and save our models. MLflow is an open source platform for managing the machine learning lifecycle, and one of the tools it provides is experiments. An experiment is a set of runs that you can use to compare different models. Our model will be trained on the diabetes data; once trained, it can be evaluated, deployed and used to make predictions.
Before we train our model, we first need to create an experiment using MLflow.
Next, we use MLflow to train a linear regression model and log its metrics.
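Creating the experiment can be sketched with `mlflow.set_experiment`, which activates the named experiment and creates it if it does not exist yet. The experiment name below is a hypothetical placeholder, and `use_experiment` is my own wrapper.

```python
EXPERIMENT_NAME = "diabetes-experiment"  # hypothetical name, pick your own


def use_experiment(name: str = EXPERIMENT_NAME) -> None:
    """Make `name` the active MLflow experiment, creating it if needed."""
    # Imported here because MLflow is assumed to come from the notebook runtime.
    import mlflow

    mlflow.set_experiment(name)


# In a Fabric notebook cell:
# use_experiment()
```

Any run started afterwards is recorded under this experiment.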
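The training step could be sketched as below, following the description in this post: fit a scikit-learn `LinearRegression`, compute MSE and R2 on the test set, then log and register the model with MLflow. The registered model name, the artifact path `"model"`, the logged parameter and both helper names are assumptions of mine rather than the original notebook's values.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

MODEL_NAME = "diabetes-linear-regression"  # hypothetical registered model name


def fit_and_score(X_train, X_test, y_train, y_test):
    """Fit a linear regression model and score it on the held-out test set."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    metrics = {
        "mse": mean_squared_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }
    return model, metrics


def train_and_log(X_train, X_test, y_train, y_test, model_name: str = MODEL_NAME) -> str:
    """Train inside an MLflow run, then log metrics, parameters and the model."""
    # Imported here because MLflow is assumed to come from the notebook runtime.
    import mlflow
    import mlflow.sklearn

    with mlflow.start_run() as run:
        model, metrics = fit_and_score(X_train, X_test, y_train, y_test)
        mlflow.log_param("model_type", "LinearRegression")
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")  # saved as a run artifact
        model_uri = f"runs:/{run.info.run_id}/model"
        mlflow.register_model(model_uri, model_name)

    print(f"Model URI: {model_uri}")
    print("Training run complete.")
    return model_uri


# In the notebook, after the train/test split:
# train_and_log(X_train, X_test, y_train, y_test)
```

Keeping the fitting and scoring in `fit_and_score` means the same numbers can be checked locally before anything is logged.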
The training code fits a linear regression model on our dataset and logs the model, metrics, and parameters to MLflow. First, it creates a new instance of the `LinearRegression` model and trains it on the training data. Then, it makes predictions on the test data and calculates the mean squared error and R-squared metrics. These metrics are logged using `mlflow.log_metrics()`, and the trained model is saved as an artifact using `mlflow.sklearn.log_model()`.
Finally, the model is registered under a name using `mlflow.register_model()`. The output prints the model URI along with a message indicating that the training run is complete.
Milestone 4: Evaluate your model
It is essential to evaluate our model to assess its effectiveness. To evaluate the performance of our model, we used two metrics: mean squared error (MSE) and the R-squared (R2) score. As we have logged the metrics to Microsoft Fabric using MLflow, we will use the UI to navigate to them. The MSE measures the average squared difference between the predicted and actual values, so lower is better. The R2 score, on the other hand, measures the proportion of variance in the target variable that is explained by the model, so values closer to 1 are better.
To get our model scores, we first head back to our workspace and then select Filter. Using the filter, select Experiment and Model. We will explore our experiment first, then the model.
Under our experiment, we can see that our model has been logged along with its metrics. Our MSE is 2908.55 and our R2 is 0.45, meaning the model explains roughly 45% of the variance in disease progression. Additionally, we have our model files ready to be deployed and consumed.
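To make the two definitions concrete, here is a small from-scratch sketch of both metrics in plain Python. The function names and the toy numbers are mine and are only illustrative; they are not results from the diabetes model.

```python
def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up toy values, not taken from the dataset.
toy_actual = [3, 5, 7, 9]
toy_predicted = [2, 5, 8, 9]
toy_mse = mse(toy_actual, toy_predicted)  # squared errors 1, 0, 1, 0 averaged over 4
toy_r2 = r2(toy_actual, toy_predicted)    # residual SS is 2, total SS is 20
```

Here the squared errors sum to 2, so the MSE is 0.5, and the model leaves 2 of the 20 units of total variation unexplained, giving an R2 of about 0.9.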
Conclusion
In conclusion, Microsoft Fabric is a powerful platform that unifies the data science process and allows for easy collaboration with other data specialists. By following the milestones outlined in this blog post, you can kickstart your data science journey and use the platform's capabilities to clean your data and to train and deploy models.
Curious to learn more about Data Science in Microsoft Fabric? You can use the resources below: