Along with the guidance in the Azure Machine Learning Algorithm Cheat Sheet, keep other requirements in mind when choosing a machine learning algorithm for your solution. Additional factors to consider include accuracy, training time, linearity, number of parameters, and number of features.
Additional requirements for a data science scenario
Once you know what you want to do with your data, you need to determine additional requirements for your solution.
Make choices, and possibly trade-offs, for the following requirements:
Accuracy
Training time
Linearity
Number of parameters
Number of features
Accuracy
Accuracy in machine learning measures the effectiveness of a model as the proportion of true results to total cases. In Machine Learning designer, the Evaluate Model module computes a set of industry-standard evaluation metrics. You can use this module to measure the accuracy of a trained model.
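As a minimal sketch of the definition above, accuracy can be computed as the number of correct predictions divided by the total number of cases. The `accuracy` function and the sample labels are hypothetical illustrations; they are not part of the Evaluate Model module.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 6 of the 8 predictions agree with the true labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)
print(score)  # 0.75
```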
Getting the most accurate answer possible isn’t always necessary. Sometimes an approximation is adequate, depending on what you want to use it for. If that is the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Approximate methods also naturally tend to avoid overfitting.
There are three ways to use the Evaluate Model module:
Generate scores over your training data in order to evaluate the model
Generate scores on the model, but compare those scores to scores on a reserved testing set
Compare scores for two different but related models, using the same set of data
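The three approaches above can be sketched in plain Python. This is only an illustration under stated assumptions: the data is made up, and the two toy models (a 1-nearest-neighbor rule and a fixed threshold rule) are hypothetical stand-ins for trained models, not designer modules.

```python
def predict_1nn(train_x, train_y, x):
    """Label x with the label of the closest training point (1-nearest neighbor)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def predict_threshold(x, cutoff=5.0):
    """A second, simpler model: predict 1 when x exceeds a fixed cutoff."""
    return 1 if x > cutoff else 0


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical 1-D data. The training point at x=3.0 is deliberately mislabeled.
train_x = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]
test_x = [1.5, 2.9, 3.2, 4.5, 5.5, 6.5, 7.5, 8.5]
test_y = [0, 0, 0, 0, 1, 1, 1, 1]

# 1. Score the training data itself.
train_acc = accuracy([predict_1nn(train_x, train_y, x) for x in train_x], train_y)

# 2. Score a reserved testing set and compare with the training score.
test_acc = accuracy([predict_1nn(train_x, train_y, x) for x in test_x], test_y)

# 3. Score a second model on the same testing set.
threshold_acc = accuracy([predict_threshold(x) for x in test_x], test_y)

print(train_acc, test_acc, threshold_acc)  # 1.0 0.75 1.0
```

In this sketch the nearest-neighbor model memorizes the noisy training point, so its training score looks perfect while its score on the reserved set is lower. That gap is why a score on training data alone can be misleading.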
For a complete list of metrics and approaches you can use to evaluate the accuracy of machine learning models, see Evaluate Model module.
Training time
In supervised learning, training means using historical data to build a machine learning model that minimizes errors. The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy; higher accuracy typically comes with longer training time.
In addition, some algorithms are more sensitive to the number of data points than others. You might choose a specific algorithm because you have a time limitation, especially when the data set is large.
In Machine Learning designer, creating and using a machine learning model is typically a three-step process:
Configure a model by choosing a particular type of algorithm and then defining its parameters or hyperparameters.
Provide a dataset that is labeled and has data compatible with the algorithm. Connect both the data and the model to the Train Model module.
After training is completed, use the trained model with one of the scoring modules to make predictions on new data.
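The same configure, train, and score sequence can be sketched in code. The `NearestCentroid` class below is a hypothetical toy classifier written for this illustration; its method names are assumptions and do not correspond to any designer module or SDK call.

```python
from statistics import mean


class NearestCentroid:
    """Toy 1-D classifier: assigns each value to the class with the closest mean."""

    def __init__(self, distance="absolute"):
        # Step 1: configure the model by choosing its parameters.
        if distance not in ("absolute", "squared"):
            raise ValueError("distance must be 'absolute' or 'squared'")
        self.distance = distance
        self.centroids = None

    def train(self, xs, ys):
        # Step 2: train on a labeled dataset (one mean per class label).
        self.centroids = {
            label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)
        }
        return self

    def score(self, xs):
        # Step 3: use the trained model to make predictions on new data.
        if self.centroids is None:
            raise RuntimeError("train the model before scoring")
        if self.distance == "absolute":
            dist = lambda a, b: abs(a - b)
        else:
            dist = lambda a, b: (a - b) ** 2
        return [
            min(self.centroids, key=lambda label: dist(x, self.centroids[label]))
            for x in xs
        ]


# Hypothetical labeled data and new, unlabeled values.
model = NearestCentroid(distance="absolute")
model.train([1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
            ["low", "low", "low", "high", "high", "high"])
predictions = model.score([2.5, 6.0, 4.9])
print(predictions)  # ['low', 'high', 'low']
```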
Linearity
Linearity in statistics and machine learning means that there is a linear relationship between the variables in your dataset: the output can be modeled as a weighted sum of the inputs plus a constant. For example, linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog).
Lots of machine learning algorithms make use of linearity. In Azure Machine Learning designer, they include:
Linear regression algorithms assume that data trends follow a straight line. This assumption isn't bad for some problems, but for others it reduces accuracy. Despite their drawbacks, linear algorithms are popular as a first strategy. They tend to be algorithmically simple and fast to train.
Nonlinear class boundary: Relying on a linear classification algorithm would result in low accuracy.
Data with a nonlinear trend: Using a linear regression method would generate much larger errors than necessary.
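To make the cost of a wrong linearity assumption concrete, the following sketch fits an ordinary least-squares line to made-up data that follows a quadratic trend, then fits the same kind of line to a squared feature. The data and function names are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data with a nonlinear (quadratic) trend: y = x^2.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]

# A straight line in x cannot follow the curve.
slope, intercept = fit_line(xs, ys)
linear_mse = mean_squared_error(xs, ys, slope, intercept)

# The same linear method applied to the transformed feature x^2 fits exactly.
squared = [x * x for x in xs]
slope_sq, intercept_sq = fit_line(squared, ys)
curved_mse = mean_squared_error(squared, ys, slope_sq, intercept_sq)

print(linear_mse, curved_mse)  # 12.0 0.0
```

The straight line settles on the mean of the targets and leaves a large error, while the model that captures the curvature leaves none on this data.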
Number of parameters
Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.
Alternatively, you can use the Tune Model Hyperparameters module in Machine Learning designer. The goal of this module is to determine the optimum hyperparameters for a machine learning model. The module builds and tests multiple models by using different combinations of settings. It compares metrics over all models to identify the best combination of settings.
While this is a great way to make sure you’ve spanned the parameter space, the time required increases exponentially with the number of parameters, because every combination of settings means training another model. The upside is that having many parameters typically indicates that an algorithm has greater flexibility. It can often achieve very good accuracy, provided you can find the right combination of parameter settings.
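The growth in search cost is easy to see by counting combinations in an exhaustive grid. In this sketch, the parameter names and candidate values are hypothetical, and each combination stands for one model that would have to be trained and evaluated.

```python
from itertools import product


def grid(param_values):
    """Return every combination of the candidate values as a list of dicts."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical parameters, each with three candidate values.
settings = {
    "learning_rate": [0.01, 0.1, 1.0],
    "iterations": [10, 100, 1000],
    "tolerance": [1e-3, 1e-4, 1e-5],
}

# Count the models to train when tuning the first 1, 2, and 3 parameters.
names = list(settings)
counts = [
    len(grid({name: settings[name] for name in names[:k]}))
    for k in range(1, len(names) + 1)
]
print(counts)  # [3, 9, 27]
```

With three values per parameter, each added parameter triples the number of models to train.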
Number of features
In machine learning, a feature is a quantifiable variable of the phenomenon you are trying to analyze. For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetics or textual data.
A large number of features can bog down some learning algorithms, making training time unfeasibly long. Support vector machines are particularly well suited to scenarios with a high number of features. For this reason, they have been used in many applications from information retrieval to text and image classification. Support vector machines can be used for both classification and regression tasks.
Feature selection refers to the process of applying statistical tests to inputs, given a specified output. The goal is to determine which columns are more predictive of the output. The Filter Based Feature Selection module in Machine Learning designer provides multiple feature selection algorithms to choose from. The module includes correlation methods such as Pearson correlation and chi-squared values.
You can also use the Permutation Feature Importance module to compute a set of feature importance scores for your dataset. You can then use these scores to help you determine the best features to use in a model.
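As a rough sketch of filter-based selection with Pearson correlation, the code below scores each column against the target and ranks the columns by the absolute value of the correlation. The column names and values are made up, and this is not the module's actual implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


# Hypothetical input columns and the output they should predict.
columns = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [8, 6, 9, 7, 8, 7],
    "absences": [5, 6, 3, 4, 1, 2],
}
target = [52, 58, 65, 70, 78, 85]

# Rank columns by the strength of their correlation with the target.
scores = {name: pearson(values, target) for name, values in columns.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
selected = ranked[:2]  # keep the two most predictive columns

print(ranked)  # ['hours_studied', 'absences', 'shoe_size']
```

Ranking by absolute value keeps strongly negative correlations, such as `absences` here, because they are as predictive as positive ones.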
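The general idea behind permutation feature importance can be sketched as follows: shuffle one feature column at a time and measure how much the model's score drops. The `model` function, the data, and the repeat count are hypothetical, and the designer module may differ in its details.

```python
import random


def model(row):
    """Hypothetical trained model that relies only on the first feature."""
    return 1 if row[0] > 0.5 else 0


def accuracy(rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(rows, labels, n_repeats=20, seed=0):
    """Average drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical dataset: feature 0 determines the label, feature 1 is noise.
rows = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.8], [0.4, 0.2],
        [0.6, 0.7], [0.7, 0.3], [0.8, 0.6], [0.9, 0.4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

importances = permutation_importance(rows, labels)
print(importances)
```

Shuffling the noise column leaves the score unchanged, so its importance is zero, while shuffling the informative column lowers the score. Features with scores near zero are candidates for removal.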