First published on MSDN on Mar 16, 2017
This write-up demonstrates a simple ML algorithm that picks out the characteristic components of the data to predict the family to which it belongs. In this particular example, the data set is a list of recipe ids, ingredients and dish types, with 20 types of dish in the data set. The data scientist attempts to predict the dish based on the available ingredients. This analysis could be extended to predict the family of a crop disease based upon the individual characteristics of the disease, such as weather type, soil alkalinity, pesticide application and seasonality. You could also feed the same kind of model sensor data (e.g. ADC readings) and predict the family of an engine defect.
The dataset is derived from kaggle.com (rather, from one of its competitions). The competition was to apply text-mining ML algorithms to predict the dish based upon the individual ingredients used in its preparation.
Why am I showcasing this? The problem may find application in the crop-sciences domain or in engine-problem classification, since the analyst has to make a similar kind of prediction: the crop disease from component factors such as weather conditions, soil acidity levels, fumigation, seasonality and other factors. Likewise, in engine-problem detection, the sensor dataset may be regressed to classify the problem class.
I basically employed text mining, XGBoost and ensemble modeling to get the final output. I also tried Naive Bayes algorithms, but with little success, as the MSE of those algorithms was far too high compared to that of the XGBoost ensemble model.
The dataset includes the recipe id, the type of dish, and the list of ingredients of each recipe (of variable length). The data is stored in JSON format. In the test file, the format of a recipe is the same, except that the dish type is removed, as it is the target variable you are going to predict. Two files are provided:
- the training set, containing the recipe id, type of dish, and list of ingredients
- the test set, containing the recipe id and list of ingredients
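For illustration, a recipe node might look like the following (the field names and values here are my own assumptions based on the description above, not copied from the actual files):

```python
import json

# Hypothetical recipe node in the training file: id, dish type, ingredients.
train_node = {
    "id": 10259,
    "dish": "greek",
    "ingredients": ["romaine lettuce", "black olives", "feta cheese"],
}

# In the test file the node has the same shape, minus the dish type
# (the target variable to be predicted).
test_node = {
    "id": 10259,
    "ingredients": ["romaine lettuce", "black olives", "feta cheese"],
}

print(json.dumps(train_node, indent=2))
```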
Yeah! You can already smell how the prediction process gets kick-started. An understanding of the following helps in comprehending the solution holistically:
Being comfortable with the statistics of models (and how to interpret their errors)
Why Boosting ML?
It works great on sparse matrices. A sparse matrix is one in which the large majority of elements are zero. Since I have a sparse matrix here, I expected it to give good results.
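To see why this data is sparse: if each ingredient becomes a column, any single recipe uses only a handful of them, so almost every cell is zero. A minimal sketch using scikit-learn's `CountVectorizer` (my choice for illustration; the original pipeline may have built the matrix differently):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each recipe is flattened into one "document" listing its ingredients.
recipes = [
    "romaine lettuce black olives feta",
    "plain flour sugar eggs milk",
    "soy sauce ginger garlic rice",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(recipes)  # scipy CSR sparse matrix

# Most entries are zero: each recipe touches only a few of the columns.
total_cells = X.shape[0] * X.shape[1]
print(X.shape, X.nnz, "non-zeros out of", total_cells)
```

With thousands of distinct ingredients and only a dozen or so per recipe, the fraction of non-zeros shrinks dramatically, which is exactly the regime where XGBoost's sparse-aware handling pays off.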
Let's start with the XGBoost model:
Now, I’ve created a sparse matrix from the train data set using xgb.DMatrix, keeping the set of independent variables and removing the dependent variable.