Python Lesson 8: Machine Learning
1. Lesson outline
- Getting started
- A first simple example.
- A second and more complex example.
- Supervised Machine Learning Algorithms: k-Nearest Neighbors, Linear Models, Decision Trees, Support Vector Machines, the multilayer perceptron.
- Dataset transformations.
- Exercises.
2. Getting started
2.1. Why machine learning
Machine learning is about extracting knowledge from data. It is a
research field at the intersection of statistics, artificial
intelligence, and computer science and is also known as predictive
analytics or statistical learning. The application of machine
learning methods has in recent years become ubiquitous in everyday
life. From automatic recommendations of movies and TV series to watch, to what
food to order or which products to buy, to personalized online radio or music
and automatic recognizing of your friends in your photos; many modern websites and
devices have machine learning algorithms at their core. When you look
at a complex website like HBO, Facebook, Amazon, or Netflix, it is very
likely that every part of the site contains multiple machine learning
models.
The most successful machine learning algorithms are those that
automate decision-making processes by generalizing from known
examples. In this setting, which is known as supervised machine learning (SML),
the user provides the algorithm with pairs of inputs and desired
outputs, and the algorithm finds a way to produce the desired output
given an input. In particular, once trained, supervised algorithms are able to create an
output for an input that it has never been seen before, without any human guidance.
Machine learning algorithms that learn from input/output pairs are
called supervised learning algorithms because a teacher provides
supervision to the algorithms in the form of the desired outputs for
each example that they learn from. Examples: identifying the zip code
from handwritten digits on an envelope, determining whether a tumor is
benign based on medical imaging, or detecting fraudulent activity in
bank transactions.
In unsupervised learning, an utterly different approach, only input
data are known, and no output data is supplied to the algorithm who
then tries to look for patterns in the data in a completely
independent way. Nevertheless thee many successful applications of these
methods, they are usually harder to understand and evaluate. Some examples are
identifying topics in a set of blog posts, segmenting customers into
groups with similar preferences, or detecting abnormal access patterns
to a website.
Concerning data, it is helpful to consider your data a table. Each
data point that you want to reason about (each email, each customer,
each transaction) is a table row, and each property that describes
that data point (say, the age of a customer or the amount or location
of a transaction) is a column. You might describe users by their age,
their gender, when they created an account, and how often they have
bought from your online shop. You might describe the image of a tumor
by the grayscale values of each pixel, or maybe by using the size,
shape, and color of the tumor. Each entity or row here is known as a
sample (or data point) in machine learning, while the columns —the
properties that describe such entities— are called features
2.2. The libraries
The
scikit-learn
library (homepage
[https://scikit-learn.org/stable/][https://scikit-learn.org/stable/])
is an open source machine learning library that supports supervised
and unsupervised learning. It also provides tools for model
fitting, data preprocessing, model selection, model evaluation, and
many other utilities. You can see the
available supervised methods in
https://scikit-learn.org/stable/supervised_learning.html or the
unsupervised ones in
https://scikit-learn.org/stable/unsupervised_learning.html
To install scikit-learn
simply type:
conda install scikit-learn The following packages will be downloaded: package | build ---------------------------|----------------- joblib-1.1.0 | pyhd8ed1ab_0 210 KB conda-forge scikit-learn-1.1.2 | py39he5e8d7e_0 8.2 MB conda-forge threadpoolctl-3.1.0 | pyh8a188c0_0 18 KB conda-forge ------------------------------------------------------------ Total: 8.5 MB The following NEW packages will be INSTALLED: joblib conda-forge/noarch::joblib-1.1.0-pyhd8ed1ab_0 scikit-learn conda-forge/linux-64::scikit-learn-1.1.2-py39he5e8d7e_0 threadpoolctl conda-forge/noarch::threadpoolctl-3.1.0-pyh8a188c0_0 Proceed ([y]/n)? Downloading and Extracting Packages scikit-learn-1.1.2 | 8.2 MB | ##################################### | 100% joblib-1.1.0 | 210 KB | ##################################### | 100% threadpoolctl-3.1.0 | 18 KB | ##################################### | 100% Preparing transaction: done Verifying transaction: doneAn alternative way is to use
pip install scikit-learn
Moreover, we will use also the libraries numpy
, pandas
, scipy
, and matplotlib
used in other course lessons.
3. A first simple example
First we import the necessary libraries
In this first case we define our own dataset. As we mentioned, in general, a datasets is made of two parts:
Now we define our estimator, the so called support vector classification
and separate our dataset in training and test sets
We can now train the model
Once the model is trained, we can assess the results
and
import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn import svm from sklearn import metrics from sklearn.model_selection import train_test_split
- The characteristics of the objects, or features.
- The class to which the objects belong.
[0,1]
belongs to the
class 0
, [2,3]
belongs to the class 1
, [4,5]
belongs to the
class 2
and so on.
X = np.arange(1000).reshape((500, 2)) y = range(500)
clf = svm.SVC(gamma=0.001, C=100.)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, shuffle=True)
clf.fit(X_train,y_train)
predicted= clf.predict(X_test) print( f"Classification report for classifier {clf}:\n" f"{metrics.classification_report(y_test, predicted)}\n" )
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted) disp.figure_.suptitle("Confusion Matrix") print(f"Confusion matrix:\n{disp.confusion_matrix}") plt.show()
4. A second and more complex example
4.1. Reading the data
First we read a set of standard databases included in the
See https://scikit-learn.org/stable/datasets.html for the complete list of datasets or go to https://archive-beta.ics.uci.edu/.
You can check what are the dataset
Or you can access the data directly
The
sckit-learn
package,
correspondig to a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
from sklearn import datasets digits = datasets.load_digits()
keys
or show the content of one the keys
print("Keys of digit dataset: \n{}".format(digits.keys())) print("Description of digit dataset: \n{}".format(digits['DESCR']))
print(digits.data) print(digits.data.shape) print(digits.target) print(digits.target.shape)
data
attribute provides the instances and target
provides
the class of the different instances. Note that data
elements are
2D 8x8
arrays. In this particular dataset, one can
use the images
attribute
print(digits.images[0]) digits.images.shape
4.2. Learning and predicting
The task is to
predict, given an image, which digit it represents. We are given
samples of each of the ten possible classes (the digits zero through
nine) on which we fit an estimator to be able to predict the classes
to which unseen samples belong.
In
The
However it is much more convenient to split our original dataset into two sets, the training and the test sets, selecting elements in a random way. In this case we use half the data for training and the other half for testing.
We now train our model and check the results
Now can analyze the performance of our learning procedure:
Note that the performace of the prediction can be measured in
different ways:
scikit-learn
, an estimator for classification is a Python object that implements the fit(X, y)
and predict(T)
methods.
An example of an estimator is the class sklearn.svm.SVC
, which implements the support vector classification. The estimator’s
constructor takes as arguments the model’s parameters.
# from sklearn import svm clf = svm.SVC() #clf = svm.SVC(gamma=0.001, C=100.) pred_data = -30 clf.fit(digits.data[:pred_data], digits.target[:pred_data]) print(clf.predict(digits.data[pred_data:])) print(digits.target[pred_data:])
clf
(for classifier) estimator instance is first fitted to the
model; that is, it must learn from the model. This is done by passing
our training set to the fit method. For the training set, we’ll use
all the images from our dataset, except for the last pred_data
images, reserved to check the fit quality. We select the training
set with the [:pred_data]
Python syntax, which produces a new array
that contains all but the last items from digits.data. You can now
predict new values from the corresponding matrices. In this case,
you’ll predict using the last images, excluded from the training. By
predicting, you’ll determine the image from the training set that best
matches the last image. In this example we are using the Support Vector Machines
which are a set of supervised learning methods used
for classification, regression and outliers detection.
Also you can plot the images and check visually the calculation performance
fig, ax = plt.subplots(6,5,figsize=(10,10)) i=0 for row in range(6): for col in range(5): ax[row,col].imshow(digits.images[pred_data+i], cmap=plt.cm.gray_r, interpolation="nearest") i+=1
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( digits.data, digits.target, test_size=0.5, shuffle=False )
# Train clf.fit(X_train, y_train) # Compute predictions predicted = clf.predict(X_test)
print( f"Classification report for classifier {clf}:\n" f"{metrics.classification_report(y_test, predicted)}\n" ) # disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted) disp.figure_.suptitle("Confusion Matrix") print(f"Confusion matrix:\n{disp.confusion_matrix}") plt.show()
precision= (true positives)/(true positives + true negatives)
recall= (true positives)/(true positives + false positives)
F1-score = 2 * (precision * recall) / (precision + recall)
5. Supervised Machine Learning Algorithms
5.1. k-Nearest Neighbors algorithm
The k-Nearest Neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm. Building
the model requires only of storing the training dataset. To make a prediction for a
new data point, the algorithm looks for its
nearest neighbors, i.e. the closest data points in the training dataset.
We will use in this case and in the coming section the famous Iris
dataset, first used by Sir R.A. Fisher. The dataset is taken from
Fisher’s paper Annual Eugenics, 7, Part II, 179-188 (1936). The dataset contains 3 classes with 50 instances each,
where each class refers to a type of iris plant, Setosa, Versicolour, and Virginica.
This database has four attributes: sepal length in cm, sepal width in
cm, petal length in cm, and petal width in cm. Its structure can be
easily visualized graphically, using different colors for the three
classes
To build the model
By default the
It is of interest to see how the number of neighbors affect the performace of the calculation, as required in the following exercise.
We now proceed with a different dataset, the Wisconsin Breast Cancer
dataset (
In this case we split the data including the
Overall the results are quite good, with accuracies always larger that
88%, but you can see how the test accuracy increases with the number
of neighbors until a maximum is reached.
One of the strengths of the k-NN algorithm is that the model is very
easy to understand, and it often achieves a reasonable performance
without a lot of adjustments. This algorithm offers a good baseline
method before considering more advanced techniques. Building the
nearest neighbors model is usually very fast, unless your training set
is very large (either in number of features or in number of samples)
as in this case the calculation of predicted values is computationally
expensive.
This approach often does not perform well on datasets with many features
(hundreds or more), and it badly fails when dealing with datasets where most features
are zero most of the time (the so called sparse datasets).
Now we can analyze also the regressor. In this case we will generate an artifitial set of data:
Load the library, train and predict
Note that the value for the test data cannot be much lower than the one for the trainig set.
#from sklearn import metrics #from sklearn.model_selection import train_test_split #from sklearn import datasets iris_dataset = datasets.load_iris() X, y = iris_dataset.data, iris_dataset.target print(iris_dataset.DESCR)
fig, ax = plt.subplots(4,4,figsize=(15,15)) for row in range(4): for col in range(4): ax[row,col].scatter(iris_dataset.data[:,row], iris_dataset.data[:,col], c=iris_dataset.target) ax[row,col].set_xlabel("Attrib {}".format(col)) ax[row,col].set_ylabel("Attrib {}".format(row))
from sklearn.neighbors import KNeighborsClassifier X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, shuffle=True ) clf = KNeighborsClassifier(n_neighbors=3) clf.fit(X_train, y_train)
train_test_split
function uses 25% of the data for testing and 75% of the data for training.
Note that the classifier option random_state = number
allows to separate the
data always in the same two sets making the calculations reproducible
across multiple function calls. The default value is None
and . The
most relevant parameter is n_neighbors
which default value is 5
.
The obtained results and the performance of the calculation is:
print("Test set predictions: {}".format(clf.predict(X_test))) print("Test set accuracy: {:.9f}".format(clf.score(X_test, y_test)))
Exercise 8.1 |
load_breast_cancer
dataset), which records clinical
measurements of breast tumors. Each tumor is labeled as
benign (for harmless tumors) or malignant (for cancerous tumors),
and the task is to learn to predict whether a tumor is malignant based
on the tissue measurements.
from sklearn.datasets import load_breast_cancer cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split( cancer.data, cancer.target, stratify=cancer.target, random_state=66)
stratify
argument. This argument
makes a split in such a way that the proportion of values in the sample
produced will be the same as the proportion of values provided to
parameter stratify.
In principle, there are two important knobs to control the KNeighbors classifier: the
number of nearest neighbors considered and how you measure the distance between data points. In practice,
using a small and odd (to avoid confussion between classes) number of neighbors, like three or five, often works well, but you should
certainly adjust this parameter to optimize the obtained results. It is also possible to define at will the distance metrics.
The default setup uses Euclidean distance, which works fine in many settings.
We proceed by training the model for different numbers of nearest neighbors and checking the accuracy obtained as a function of the neighbors considered
training_accuracy = [] test_accuracy = [] # try n_neighbors from 1 to 10 neighbors_settings = range(1, 11) for n_neighbors in neighbors_settings: # build the model clf = KNeighborsClassifier(n_neighbors=n_neighbors) clf.fit(X_train, y_train) # record training set accuracy training_accuracy.append(clf.score(X_train, y_train)) # record generalization accuracy test_accuracy.append(clf.score(X_test, y_test)) plt.plot(neighbors_settings, training_accuracy, label="training accuracy") plt.plot(neighbors_settings, test_accuracy, label="test accuracy") plt.ylabel("Accuracy") plt.xlabel("n_neighbors") plt.legend()
data_reg=datasets.make_regression(n_samples=1000) X=data_reg[0] y=data_reg[1] X_train, X_test, y_train, y_test = train_test_split( X, y, test_size= 0.2, random_state=0)
from sklearn.neighbors import KNeighborsRegressor reg = KNeighborsRegressor(n_neighbors=5) reg.fit(X_train, y_train) predicted=reg.predict(X_test)
print("Train set R^2: {:.9f}".format(reg.score(X_train, y_train))) print("Test set R^2: {:.9f}".format(reg.score(X_test, y_test)))
Exercise 8.2 |
6. Supervised Machine Learning Algorithms: Linear Models
Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features, which we will explain shortly.
6.1. Linear regression (aka ordinary least squares)
Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters \(w\) and \(b\) that minimize the mean squared error between predictions and the true regression targets, \(y\), on the training set.
\begin{equation}
y = w[0] x[0] + w[1] x[1] + \ldots + w[p] * x[p] + b
\end{equation}
We define our working dataset
We separate into training and test sets, train the model
We show the coefficients
We present graphically the results
This example can be extended easily to the multidimensional case.
As one can see there is no way to modify anything in the fitting process, for example in the case of overfitting. To this end there exists other alternatives, e.g.,
from sklearn import metrics from sklearn.model_selection import train_test_split from sklearn import datasets from sklearn.linear_model import LinearRegression data_reg=datasets.make_regression(n_features=1,n_samples=100, noise=10, bias=2) X=data_reg[0] y=data_reg[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) lr = LinearRegression().fit(X_train, y_train)
print("lr.coef_: {}".format(lr.coef_)) print("lr.intercept_: {}".format(lr.intercept_))
plt.plot(X_test,y_test,'.r', label='test') plt.plot(X_train,y_train,'.b', label='train') plt.plot(X,lr.predict(X),'-k', label='fit') plt.legend()
Ridge
and Lasso
.
6.2. Ridge regression
Ridge regression is also a linear model for regression, so the formula it uses to make
predictions is the same one used for ordinary least squares. In ridge regression,
though, the coefficients (\(w\)) are chosen not only so that they predict well on the training data, but also to fit an additional constraint. We also want the magnitude of coefficients to be as small as possible; in other words, all entries of \(w\) should be close to zero. This constraint is an example of what is called regularization. In the case of
First, we see the linear regression case
Now we explore the effect of alpha
We can see the effect of the
Finally, we see we
Lasso
as many as possible coefficients are taken as zero.
In general with linear regression we are overfitting the data. Ridge
is a more restricted model, so we are less likely to overfit. A less complex model means worse performance on the training
set, but better generalization. As we are only interested in generalization performance, we should choose the Ridge model over the LinearRegression model. How much importance the
model places on simplicity versus training set performance can be specified by the
user, using the alpha parameter. The optimum setting of alpha depends on the particular dataset we are using. Increasing alpha
forces coefficients to move more toward zero, which decreases training set performance but might help generalization.
from sklearn.linear_model import Ridge data_reg=datasets.make_regression(n_features=100, n_samples=1000, bias=3.5, noise=10) X=data_reg[0] y=data_reg[1]*np.random.rand(len(data_reg[1]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.7)
lr = LinearRegression().fit(X_train, y_train) print("Training set score: {:.9f}".format(lr.score(X_train, y_train))) print("Test set score: {:.9f}".format(lr.score(X_test, y_test)))
ridge = Ridge().fit(X_train, y_train) print("Training set score: {:.9f}".format(ridge.score(X_train, y_train))) print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
alpha
parameter
ridge = Ridge(alpha=1000).fit(X_train, y_train) print("Training set score: {:.9f}".format(ridge.score(X_train, y_train))) print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
alpha
affects the value of the coefficients.
coeffs= [] #alphas = np.linspace(0,10,50) alphas = np.logspace(-6, 6, 200) for i in alphas: ridge = Ridge(alpha=i).fit(X_train, y_train) coeffs.append(ridge.coef_)
plt.plot(alphas,coeffs) plt.xscale("log") plt.xlabel("alpha") plt.ylabel("weights") plt.title("Ridge coefficients as a function of the regularization")
6.3. Linear models for classification
Linear models are not only used for regression, but they are also extensively used for classification. In this case the prediction is done using the formula
\begin{equation}
y = w[0] x[0] + w[1] x[1] + \ldots + w[p] * x[p] + b > 0
\end{equation}
The formula looks very similar to the one for linear regression, but instead of just
returning the weighted sum of the features, we threshold the predicted value at zero.
If the function is smaller than zero, we predict the class –1; if it is larger than zero, we
predict the class +1. This prediction rule is common to all linear models for classification. Again, there are many different ways to find the coefficients (w) and the intercept (b).
The two most common linear classification algorithms are logistic regression, implemented in
We split into two sets and we train the model,
We can check how the two clases separate plotting the classes as a function of two features (we can choose other two to see how the behave),
Now we move into the case of linear models for multiclass classification. In this case we will use a dataset for wines
One can visualize the classes (for only part of the attributes)
In this case we will use
One can see that there are convergence problems.
linear_model.LogisticRegression
, and linear support vector machines (linear SVMs), implemented in svm.LinearSVC
(SVC stands for support vector classifier). Despite its name, LogisticRegression
is a classification algorithm and not a regression algorithm, and it should not be confused with LinearRegression
.
For LogisticRegression
and LinearSVC
there is a trade-off parameter that determines the strength of the regularization is called C
, and higher values of C
correspond to less regularization. In other words, when you use a high value for the parameter C
, LogisticRegression
and LinearSVC
try to fit the training set as best as possible, while with low values of the parameter C
, the models put more emphasis on finding a coefficient vector (w) that is close to zero.
We consider the database of breast cancer where only two classes exit,
from sklearn.datasets import load_breast_cancer from sklearn.linear_model import LogisticRegression from sklearn import datasets, svm, metrics from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( cancer.data, cancer.target, stratify=cancer.target, random_state=42) logreg = LogisticRegression(C=1).fit(X_train, y_train) print("Training set score: {:.3f}".format(logreg.score(X_train, y_train))) print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
plt.scatter(cancer.data[:,0],cancer.data[:,11],c=cancer.target) plt.xlabel('feature 0') plt.ylabel('feature 11')
wine=datasets.load_wine() print(wine.DESCR)
fig, ax = plt.subplots(8,8,figsize=(15,15)) for row in range(8): for col in range(8): ax[row,col].scatter(wine.data[:,row], wine.data[:,col], c=wine.target) ax[row,col].set_xlabel("{}".format(row)) ax[row,col].set_ylabel("{}".format(col))
LinearSVC
from sklearn.svm import LinearSVC
X_train, X_test, y_train, y_test = train_test_split( wine.data, wine.target, stratify=wine.target, random_state=42) SVC = LinearSVC(C=10.01,max_iter=10000).fit(X_train, y_train) print("Training set score: {:.3f}".format(SVC.score(X_train, y_train))) print("Test set score: {:.3f}".format(SVC.score(X_test, y_test)))
7. Supervised Machine Learning Algorithms: Decision Trees
Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision.
To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable. A recursive process yields a binary tree of decisions, with each node containing a test. Alternatively, you can think of each test as splitting the part of the data that is currently being considered along one axis. This yields a view of the algorithm as building a hierarchical partition. As each test concerns only a single feature, the regions in the resulting partition always have axis-parallel boundaries. The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value). A leaf of the tree that contains data points that all share the same target value is called pure.
It is also possible to use trees for regression tasks, using exactly the same technique. To make a prediction, we traverse the tree based on the tests in each node and find the leaf the new data point falls into. The output for this data point is the mean target of the training points in this leaf.
Decision trees in scikit-learn are implemented in the
If we don’t restrict the depth of a decision tree, the tree can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well to new data. One option is to stop building the tree after a certain depth has been reached. Here we set
It is possible easily visualize what the classification tree means:
Or to see the feature importance
Decision trees have two advantages over many of the algorithms we’ve discussed so far: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features that are on completely different scales, or a mix of binary and continuous features.
DecisionTreeRegressor
and DecisionTreeClassifier
classes. scikit-learn only implements pre-pruning, not post-pruning.
We can apply this method to the breast cancer dataset. As always, we import the dataset and split it into a training and a test part. Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure). We fix the randomstate in the tree, which is used for tie-breaking internally:
from sklearn.tree import DecisionTreeClassifier cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split( cancer.data, cancer.target, stratify=cancer.target, random_state=42) tree = DecisionTreeClassifier(random_state=0) tree.fit(X_train, y_train) print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train))) print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
max_depth=2
, meaning only two consecutive questions can be asked.
tree = DecisionTreeClassifier(max_depth=2, random_state=0) tree.fit(X_train, y_train) print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train))) print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
from sklearn.tree import export_graphviz export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"], feature_names=cancer.feature_names, impurity=False, filled=True) !pip install graphviz import graphviz with open("tree.dot") as f: dot_graph = f.read() graphviz.Source(dot_graph)
print("Feature importances:\n{}".format(tree.feature_importances_))
def plot_feature_importances_cancer(model): n_features = cancer.data.shape[1] plt.barh(range(n_features), model.feature_importances_, align='center') plt.yticks(np.arange(n_features), cancer.feature_names) plt.xlabel("Feature importance") plt.ylabel("Feature") plot_feature_importances_cancer(tree)
Exercise 8.3 |
8. Supervised Machine Learning Algorithms: Support Vector Machines
Kernelized support vector machines (often just referred to as SVMs) are an extension that allows for more complex models that are not defined simply by hyperplanes in the input space. While there are support vector machines for classification and regression, we will restrict ourselves to the classification case, as implemented in
The model overfits quite substantially, with a perfect score on the training set and only 63% accuracy on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the parameters and to the scaling of the data. This can be partially solved using the option gamma=’scale’
But, in general, to solve the problem, all the features should vary on a similar scale. To see the scale of the different features
The solution is to preprocess the data rescaling them:
Scaling the data made a huge difference! Now we are actually in an underfitting
regime, where training and test set performance are quite similar but less close to 100% accuracy. From here, we can try increasing either C or gamma to fit a more complex model. For example:
SVC
. Similar concepts apply to support vector
regression, as implemented in SVR
.
During training, the SVM
learns how important each of the training data points is to represent the decision boundary between the two classes. Typically only a subset of the training points matter for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors and give the support vector machine its name.
To make a prediction for a new point, the distance to each of the support vectors is measured. A classification decision is made based on the distances to the support vector, and the importance of the support vectors that was learned during training (stored in the dual_coef_
attribute of SVC
).
Let us apply the SVC
to the breast cancer dataset
from sklearn.svm import SVC X_train, X_test, y_train, y_test = train_test_split( cancer.data, cancer.target, random_state=0) svc = SVC(gamma='auto') svc.fit(X_train, y_train) print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train))) print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
svc = SVC(gamma='auto') svc.fit(X_train, y_train) print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train))) print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
plt.plot(X_train.min(axis=0), 'o', label="min") plt.plot(X_train.max(axis=0), '^', label="max") plt.legend(loc=4) plt.xlabel("Feature index") plt.ylabel("Feature magnitude") plt.yscale("log")
# compute the minimum value per feature on the training set min_on_training = X_train.min(axis=0) # compute the range of each feature (max - min) on the training set range_on_training = (X_train - min_on_training).max(axis=0) # subtract the min, and divide by range # afterward, min=0 and max=1 for each feature X_train_scaled = (X_train - min_on_training) / range_on_training print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0))) print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))
X_test_scaled = (X_test - min_on_training) / range_on_training svc = SVC() svc.fit(X_train_scaled, y_train) print("Accuracy on training set: {:.3f}".format( svc.score(X_train_scaled, y_train))) print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
svc = SVC(C=1000) svc.fit(X_train_scaled, y_train) print("Accuracy on training set: {:.3f}".format( svc.score(X_train_scaled, y_train))) print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
9. Supervised Machine Learning Algorithms: Neural Networks, the multilayer perceptron (MLP)
While deep learning shows great promise in many machine learning applications, deep learning algorithms are often tailored very carefully to a specific use case. Here, we will only discuss some relatively simple methods, namely multilayer perceptrons for classification and regression, that can serve as a starting point for more involved deep learning methods.
MLPs can be viewed as generalizations of linear models that perform multiple stages of processing to come to a decision.
\begin{equation}
y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b
\end{equation}
\(y\) can be understood as a weighted sum of the input features \(x\), weighted by the learned coefficients \(w\).
In an MLP this process of computing weighted sums is repeated multiple times, first computing hidden units that represent an intermediate processing step, which are again combined using weighted sums to yield the final result.
This model has a lot more coefficients (also called weights) to learn: there is one between every input and every hidden unit (which make up the hidden layer), and one between every unit in the hidden layer and the output. Note that computing a series of weighted sums is mathematically the same as computing just
one weighted sum, so to make this model truly more powerful than a linear model,
we need one extra trick. After computing a weighted sum for each hidden unit, a
nonlinear function is applied to the result—usually the rectifying nonlinearity (also known as rectified linear unit or
By default, the
Now we can analyze how the complexity of a neural network is controlled: the number of hidden layers, the number of units in each hidden layer, and the regularization (
While the
relu
) or the tangens hyperbolicus (tanh
). For the case of three input features a one hidden layer with three nodes
\begin{eqnarray}
h[0] &=& tanh(w[0, 0] x[0] + w[1, 0] x[1] + w[2, 0] x[2] + w[3, 0] x[3])\\
h[1] &=& tanh(w[0, 0] x[0] + w[1, 0] x[1] + w[2, 0] x[2] + w[3, 0] x[3])\\
h[2] &=& tanh(w[0, 0] x[0] + w[1, 0] x[1] + w[2, 0] x[2] + w[3, 0] x[3])\\
y &=& v[0] h[0] + v[1] h[1] + v[2] h[2]
\end{eqnarray}
As a first example let us consider the make_moons
dataset
from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moons X, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42) mlp = MLPClassifier( solver= 'lbfgs', random_state=0).fit(X_train, y_train) plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train) plt.xlabel("Feature 0") plt.ylabel("Feature 1")
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train))) print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
MLP
uses 100 hidden nodes, which is quite a lot for this small dataset.
We can reduce the number (which reduces the complexity of the model) and still get
a good result.
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10, 10], random_state=0).fit(X_train, y_train) print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train))) print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
alpha
). There are actually even more.
for n_hidden_nodes_1 in [10, 100]: for n_hidden_nodes_2 in [10, 100]: for alpha in [0.0001, 0.01, 0.1, 1]: mlp = MLPClassifier(solver= 'lbfgs', random_state=0, hidden_layer_sizes=[n_hidden_nodes_1, n_hidden_nodes_2],alpha=alpha) mlp.fit(X_train, y_train) print("n_hidden=[{}, {}], alpha={:.4f}".format(n_hidden_nodes_1, n_hidden_nodes_2, alpha)) print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train))) print("Accuracy on test set: {:.3f}\n".format(mlp.score(X_test, y_test)))
MLPClassifier
and MLPRegressor
provide easy-to-use interfaces for the
most common neural network architectures, they only capture a small subset of what
is possible with neural networks. If you are interested in working with more flexible
or larger models, we encourage you to look beyond scikit-learn
into the fantastic
deep learning libraries that are out there. For Python users, the most well-established
are keras
(https://keras.io/), tensor-flow
(https://www.tensorflow.org/), and pythorch
(https://pytorch.org/).
10. Dataset transformations
We saw in previous sections that some algorithms, like neural networks and SVMs, are very sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.
10.1. Rescaling
Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm. As an example, say we want to apply the kernel
We will start with
We then fit the scaler using the fit method, applied to the training data.
The transformed data has the same shape as the original data—the features are simply shifted and scaled. You can see that all of the features are now between 0 and 1, as desired.
To apply the SVM to the scaled data, we also need to transform the test set. This is again done by calling the transform method, this time on Xtest:
Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and maximum are not 0 and 1. Some of the features are even outside the 0–1 range! The problem is that we used the
The third panel shows what would happen if we scaled the training set and test set separately. In this case, the minimum and maximum feature values for both the training and the test set are 0 and 1. But now the dataset looks different. The test points moved incongruously to the training set, as they were scaled differently. We changed
the arrangement of the data in an arbitrary way. Clearly this is not what we want to do.
Note that, in genera, the
Other important scale methods are
and now the scaled ones
As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t involve any complicated math, it is good practice to use the scaling mechanisms.
SVM (SVC)
to the cancer dataset, and use MinMaxScaler
for preprocessing the data. We will use the cancer
dataset
from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1) print(X_train.shape) print(X_test.shape)
MinMaxScaler
from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler()
scaler.fit(X_train)
#transform data X_train_scaled = scaler.transform(X_train) # print dataset properties before and after scaling print("transformed shape: {}".format(X_train_scaled.shape)) print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0))) print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0))) print("per-feature minimum after scaling:\n {}".format( X_train_scaled.min(axis=0))) print("per-feature maximum after scaling:\n {}".format( X_train_scaled.max(axis=0)))
# transform test data X_test_scaled = scaler.transform(X_test) # print test data properties after scaling print("per-feature minimum after scaling:\n{}".format(X_test_scaled.min(axis=0))) print("per-feature maximum after scaling:\n{}".format(X_test_scaled.max(axis=0)))
training
data to fix the filter, but then we applied it to a different set of data.
It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set. Here, we show what happens if the scaling is not done properly
from sklearn.datasets import make_blobs # make synthetic data X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2) # split it into training and test sets X_train, X_test = train_test_split(X, random_state=5, test_size=.1) # plot the training and test sets fig, axes = plt.subplots(1, 3, figsize=(13, 4)) axes[0].scatter(X_train[:, 0], X_train[:, 1],c='blue', label="Training set", s=60) axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c='red', label="Test set", s=60) axes[0].legend(loc='upper left') axes[0].set_title("Original Data") # scale the data using MinMaxScaler scaler = MinMaxScaler() scaler.fit(X_train) X_train_scaled = scaler.transform(X_train) X_test_scaled = scaler.transform(X_test) # visualize the properly scaled data axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="Training set", s=60) axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',c='red', label="Test set", s=60) axes[1].set_title("Scaled Data") # rescale the test set separately # so test set min is 0 and test set max is 1 # DO NOT DO THIS! For illustration purposes only. test_scaler = MinMaxScaler() test_scaler.fit(X_test) X_test_scaled_badly = test_scaler.transform(X_test) # visualize wrongly scaled data axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="training set", s=60) axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1],marker='^', c='red', label="test set", s=60) axes[2].set_title("Improperly Scaled Data") for ax in axes: ax.set_xlabel("Feature 0") ax.set_ylabel("Feature 1")
fit
and the transform
can be done more efficiently in a single step using the fit_transform
method:
from sklearn.preprocessing import StandardScaler scaler = StandardScaler() # calling fit and transform in sequence (using method chaining) X_scaled = scaler.fit(X).transform(X) # same result, but more efficient computation X_scaled_d = scaler.fit_transform(X)
StandardScaler
and RobustScaler
. The StandardScaler
in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features. The RobustScaler
works similarly to the StandardScaler
in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler
uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). The Normalizer
does a very different kind of rescaling. It scales each data point such that the feature vector has a Euclidean length of 1. In other words, it projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1.
The effect of working with scaled data can be really important. First we show the performance just using raw data:
from sklearn.svm import SVC X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0) svm = SVC(C=100) svm.fit(X_train, y_train) print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))
scaler = MinMaxScaler() scaler.fit(X_train) X_train_scaled = scaler.transform(X_train) X_test_scaled = scaler.transform(X_test) # learning an SVM on the scaled training data svm.fit(X_train_scaled, y_train) # scoring on the scaled test set print("Scaled test set accuracy: {:.2f}".format( svm.score(X_test_scaled, y_test)))
10.2. Principal Component Analysis (PCA) for dimensionality reduction
Principal component analysis is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data.
We will work once more with the cancer dataset. First we scale the data,
and then we obtain the two first principal components,
We can see how look like the new X data
It is important to note that PCA is an unsupervised method, and does not use any class information when finding the rotation. It simply looks at the correlations in the data. You can see that the two classes separate quite well in this two-dimensional space. This leads us to believe that even a linear classifier (that would learn a line in this space) could do a reasonably good job at distinguishing the two classes.
A downside of PCA is that the two axes in the plot are often not very easy to interpret. The principal components correspond to directions in the original data, so they are combinations of the original features. However, these combinations are usually very complex, as we’ll see shortly.
Once the PCA has been performed, the fitting procedure can be conducted with the PCA components:
from sklearn.datasets import load_breast_cancer cancer = load_breast_cancer() scaler = StandardScaler() scaler.fit(cancer.data) X_scaled = scaler.transform(cancer.data)
from sklearn.decomposition import PCA # keep the first two principal components of the data pca = PCA(n_components=2) # fit PCA model to breast cancer data pca.fit(X_scaled) # transform data onto the first two principal components X_pca = pca.transform(X_scaled) print("Original shape: {}".format(str(X_scaled.shape))) print("Reduced shape: {}".format(str(X_pca.shape)))
# plot first vs. second principal component, colored by class plt.figure(figsize=(8, 8)) plt.scatter(X_pca[:, 0], X_pca[:, 1], cancer.target) plt.legend(cancer.target_names, loc="best") plt.gca().set_aspect("equal") plt.xlabel("First principal component") plt.ylabel("Second principal component")
print("PCA components:\n{}".format(pca.components_)) plt.matshow(pca.components_, cmap='viridis') plt.yticks([0, 1], ["First component", "Second component"]) plt.colorbar() plt.xticks(range(len(cancer.feature_names)), cancer.feature_names, rotation=60, ha='left') plt.xlabel("Feature") plt.ylabel("Principal components")
from sklearn.svm import SVC X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, cancer.target, random_state=10) # svm = SVC(C=100) svm.fit(X_train_pca, y_train) print("Test set accuracy: {:.2f}".format(svm.score(X_test_pca, y_test)))
10.3. Pipelines
Most machine learning applications require not only the application of a single algorithm, but the chaining together of many different processing steps and machine learning models. In this chapter, we will cover how to use the
Here, we created two steps: the first, called “scaler”, is an instance of
Calling the score method on the pipeline first transforms the test data using the scaler, and then calls the score method on the SVM using the scaled test data.
The
The pipeline objects pipelong and pipeshort do exactly the same thing, but pipeshort has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:
One can access the attributes at any step,
It is possible to change the value of any control parameters of the different steps
Note the double
Pipeline
class to simplify the process of building chains of transformations and models. In particular, we will see how we to use Pipeline
for all processing steps at once.
Let’s look at how we can use the Pipeline class to express the workflow for training an SVM after scaling the data with MinMaxScaler (for now without the grid search). First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of your choosing) and an instance of an estimator:
from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split cancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0) from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler, MinMaxScaler from sklearn.svm import SVC pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
MinMaxScaler
, and the second, called “svm”, is an instance of SVC
. One can access different steps with pipe[0]
or pipe[1]
. Now, we can fit the pipeline, like any other scikit-learn estimator:
pipe.fit(X_train, y_train)
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))
Pipeline
class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step.
Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline
, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline
is as follows:
from sklearn.pipeline import make_pipeline # standard syntax pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))]) # abbreviated syntax pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
print("Pipeline steps:\n{}".format(pipe_short.steps)) print("Pipeline steps:\n{}".format(pipe_long.steps))
from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler()) pipe.fit(cancer.data) print("Pipeline steps:\n{}".format(pipe.steps)) print(pipe.named_steps["pca"].components_) pipe.transform(cancer.data)
pipe.set_params(pca__n_components=3) pipe.fit(cancer.data) pipe.fit(cancer.data) pipe.transform(cancer.data).shape
_
.
Exercise 8.4 |
11. Exercises
Created: 2024-12-09 Mon 21:21