Python Lesson 8: Machine Learning

1. Lesson outline

  1. Getting started
  2. A first simple example
  3. A second and more complex example
  4. Supervised Machine Learning Algorithms: k-Nearest Neighbors, Linear Models, Decision Trees, Support Vector Machines, the multilayer perceptron
  5. Dataset transformations
  6. Exercises

2. Getting started

2.1. Why machine learning

Machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence, and computer science, and is also known as predictive analytics or statistical learning. The application of machine learning methods has in recent years become ubiquitous in everyday life: from automatic recommendations of movies and TV series to watch, to what food to order or which products to buy, to personalized online radio or music and automatic recognition of your friends in your photos, many modern websites and devices have machine learning algorithms at their core. When you look at a complex website like HBO, Facebook, Amazon, or Netflix, it is very likely that every part of the site contains multiple machine learning models.

The most successful machine learning algorithms are those that automate decision-making processes by generalizing from known examples. In this setting, known as supervised machine learning (SML), the user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input. In particular, once trained, a supervised algorithm is able to create an output for an input it has never seen before, without any human guidance. Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms because a teacher provides supervision to the algorithms in the form of the desired output for each example that they learn from. Examples: identifying the zip code from handwritten digits on an envelope, determining whether a tumor is benign based on medical imaging, or detecting fraudulent activity in bank transactions.

In unsupervised learning, an utterly different approach, only input data are known, and no output data are supplied to the algorithm, which then looks for patterns in the data in a completely independent way. Despite the many successful applications of these methods, they are usually harder to understand and evaluate. Some examples are identifying topics in a set of blog posts, segmenting customers into groups with similar preferences, or detecting abnormal access patterns to a website.

Concerning data, it is helpful to think of your data as a table. Each data point that you want to reason about (each email, each customer, each transaction) is a table row, and each property that describes that data point (say, the age of a customer or the amount or location of a transaction) is a column. You might describe users by their age, their gender, when they created an account, and how often they have bought from your online shop. You might describe the image of a tumor by the grayscale values of each pixel, or maybe by using the size, shape, and color of the tumor. Each entity or row is known as a sample (or data point) in machine learning, while the columns (the properties that describe such entities) are called features.
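As a minimal illustration of this table view, here is a toy dataset invented for the example (the feature names and values are arbitrary): samples are rows and features are columns of a 2D array.
import numpy as np

# Toy dataset invented for illustration: 4 samples (rows), 3 features (columns).
# Hypothetical features: age (years), account age (days), number of purchases.
X = np.array([[25,  365, 10],
              [47, 1200,  3],
              [33,   90, 25],
              [52, 2000,  1]])
# One class label per sample, e.g. 1 = frequent buyer, 0 = occasional buyer.
y = np.array([1, 0, 1, 0])
print(X.shape)  # (4, 3): 4 samples, 3 features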

2.2. The libraries

The scikit-learn library (homepage https://scikit-learn.org/stable/) is an open source machine learning library that supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. You can see the available supervised methods at https://scikit-learn.org/stable/supervised_learning.html and the unsupervised ones at https://scikit-learn.org/stable/unsupervised_learning.html. To install scikit-learn simply type:
conda install scikit-learn

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-1.1.0               |     pyhd8ed1ab_0         210 KB  conda-forge
    scikit-learn-1.1.2         |   py39he5e8d7e_0         8.2 MB  conda-forge
    threadpoolctl-3.1.0        |     pyh8a188c0_0          18 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         8.5 MB

The following NEW packages will be INSTALLED:

  joblib             conda-forge/noarch::joblib-1.1.0-pyhd8ed1ab_0
  scikit-learn       conda-forge/linux-64::scikit-learn-1.1.2-py39he5e8d7e_0
  threadpoolctl      conda-forge/noarch::threadpoolctl-3.1.0-pyh8a188c0_0


Proceed ([y]/n)? 


Downloading and Extracting Packages
scikit-learn-1.1.2   | 8.2 MB    | ##################################### | 100% 
joblib-1.1.0         | 210 KB    | ##################################### | 100% 
threadpoolctl-3.1.0  | 18 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done

An alternative is to use pip install scikit-learn. Moreover, we will also use the numpy, pandas, scipy, and matplotlib libraries used in other course lessons.

3. A first simple example

First we import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import metrics 
from sklearn.model_selection import train_test_split
In this first case we define our own dataset. As we mentioned, in general, a dataset is made of two parts:
  • The characteristics of the objects, or features.
  • The class to which the objects belong.
We consider 500 objects, each defined in terms of two attributes, and 500 class labels given as integers; i.e., [0,1] belongs to class 0, [2,3] belongs to class 1, [4,5] belongs to class 2, and so on.
X = np.arange(1000).reshape((500, 2))
y = range(500)
Now we define our estimator, the so-called support vector classifier,
clf = svm.SVC(gamma=0.001, C=100.)
and separate our dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, shuffle=True)
We can now train the model
clf.fit(X_train,y_train)
Once the model is trained, we can assess the results
predicted = clf.predict(X_test)
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)

and
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()

4. A second and more complex example

4.1. Reading the data

First we read one of the standard datasets included in the scikit-learn package, corresponding to a copy of the test set of the UCI ML hand-written digits dataset https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
from sklearn import datasets
digits = datasets.load_digits()
See https://scikit-learn.org/stable/datasets.html for the complete list of datasets or go to https://archive-beta.ics.uci.edu/. You can check what the dataset keys are or show the content of one of the keys
print("Keys of digit dataset: \n{}".format(digits.keys()))
print("Description of digit dataset: \n{}".format(digits['DESCR']))
Or you can access the data directly
print(digits.data)
print(digits.data.shape)
print(digits.target)
print(digits.target.shape)
The data attribute provides the instances and target provides the class of each instance. Note that each data element is an 8x8 image flattened into a 64-element vector; in this particular dataset, the original 2D 8x8 arrays are also available through the images attribute
print(digits.images[0])
digits.images.shape
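As a quick sanity check, each row of data is simply the corresponding image unraveled:
# Each row of digits.data is the corresponding 8x8 image flattened to 64 values.
print(digits.data[0].reshape(8, 8))                         # same as digits.images[0]
print((digits.data[0] == digits.images[0].ravel()).all())   # True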

4.2. Learning and predicting

The task is to predict, given an image, which digit it represents. We are given samples of each of the ten possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong. In scikit-learn, an estimator for classification is a Python object that implements the fit(X, y) and predict(T) methods. An example of an estimator is the class sklearn.svm.SVC, which implements the support vector classification. The estimator’s constructor takes as arguments the model’s parameters.
# from sklearn import svm
clf = svm.SVC()
#clf = svm.SVC(gamma=0.001, C=100.)
pred_data = -30
clf.fit(digits.data[:pred_data], digits.target[:pred_data])
print(clf.predict(digits.data[pred_data:]))
print(digits.target[pred_data:])
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. For the training set, we’ll use all the images from our dataset except for the last pred_data images, which are reserved to check the fit quality. We select the training set with the [:pred_data] Python slice syntax, which produces a new array that contains all but the last items from digits.data. You can now predict new values from the corresponding matrices. In this case, you’ll predict using the last images, which were excluded from the training; the prediction returns the digit class that the model assigns to each of these images. In this example we are using Support Vector Machines, which are a set of supervised learning methods used for classification, regression, and outlier detection. You can also plot the images and check the prediction quality visually
fig, ax = plt.subplots(6, 5, figsize=(10, 10))
i = 0
for row in range(6):
    for col in range(5):
        ax[row, col].imshow(digits.images[pred_data + i], cmap=plt.cm.gray_r, interpolation="nearest")
        i += 1
However it is much more convenient to split our original dataset into two sets, the training and the test sets, selecting elements in a random way. In this case we use half the data for training and the other half for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
We now train our model and check the results
# Train
clf.fit(X_train, y_train)
# Compute predictions
predicted = clf.predict(X_test)
Now we can analyze the performance of our learning procedure:
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
#
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()
Note that the performance of the prediction can be measured in different ways:
  • precision = (true positives)/(true positives + false positives)
  • recall = (true positives)/(true positives + false negatives)
  • F1-score = 2 * (precision * recall) / (precision + recall)
among other options. On the other hand, the confusion matrix provides a detailed view of the agreement between the predictions and the test labels. Note that for other choices of training and test sets, the obtained results might vary.
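To make these definitions concrete, here is a minimal sketch that computes precision, recall, and F1 by hand for the positive class of a made-up binary problem (the labels are invented for the example) and compares them with the values returned by scikit-learn:
import numpy as np
from sklearn import metrics

# Made-up labels for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Precision, recall, and F1 by hand for the positive class (1).
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same values computed by scikit-learn.
print(metrics.precision_score(y_true, y_pred),
      metrics.recall_score(y_true, y_pred),
      metrics.f1_score(y_true, y_pred))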

5. Supervised Machine Learning Algorithms

5.1. k-Nearest Neighbors algorithm

The k-Nearest Neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm. Building the model requires only storing the training dataset. To make a prediction for a new data point, the algorithm looks for its nearest neighbors, i.e. the closest data points in the training dataset. In this and the coming sections we will use the famous Iris dataset, first used by Sir R.A. Fisher in his 1936 paper (Annals of Eugenics, 7, Part II, 179-188). The dataset contains 3 classes with 50 instances each, where each class refers to a type of iris plant: Setosa, Versicolour, and Virginica.
#from sklearn import metrics 
#from sklearn.model_selection import train_test_split
#from sklearn import datasets
iris_dataset = datasets.load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(iris_dataset.DESCR)
This dataset has four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Its structure can be easily visualized graphically, using different colors for the three classes
fig, ax = plt.subplots(4, 4, figsize=(15, 15))
for row in range(4):
    for col in range(4):
        ax[row, col].scatter(iris_dataset.data[:, row], iris_dataset.data[:, col], c=iris_dataset.target)
        ax[row, col].set_xlabel("Attrib {}".format(row))
        ax[row, col].set_ylabel("Attrib {}".format(col))
To build the model
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, 
    random_state=0, shuffle=True )
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
By default the train_test_split function uses 25% of the data for testing and 75% for training. Note that the option random_state=number makes the function split the data into the same two sets every time, which makes the calculations reproducible across multiple function calls; its default value is None. The most relevant parameter of the classifier is n_neighbors, whose default value is 5. The obtained results and the performance of the calculation are:
print("Test set predictions: {}".format(clf.predict(X_test)))
print("Test set accuracy: {:.9f}".format(clf.score(X_test, y_test)))
It is of interest to see how the number of neighbors affects the performance of the calculation, as required in the following exercise.
Exercise 8.1
We now proceed with a different dataset, the Wisconsin Breast Cancer dataset (load_breast_cancer dataset), which records clinical measurements of breast tumors. Each tumor is labeled as benign (for harmless tumors) or malignant (for cancerous tumors), and the task is to learn to predict whether a tumor is malignant based on the tissue measurements.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
In this case we split the data including the stratify argument. This argument makes a split in such a way that the proportion of classes in the produced sets is the same as the proportion in the array passed to stratify. In principle, there are two important knobs to control the KNeighbors classifier: the number of nearest neighbors considered and how you measure the distance between data points. In practice, using a small and odd (to avoid ties between classes) number of neighbors, like three or five, often works well, but you should certainly adjust this parameter to optimize the obtained results. It is also possible to define the distance metric at will; the default setup uses the Euclidean distance, which works fine in many settings. We proceed by training the model for different numbers of nearest neighbors and checking the accuracy obtained as a function of the number of neighbors considered
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend() 
Overall the results are quite good, with accuracies always larger than 88%, but you can see how the test accuracy increases with the number of neighbors until a maximum is reached. One of the strengths of the k-NN algorithm is that the model is very easy to understand, and it often achieves a reasonable performance without a lot of adjustments. This algorithm offers a good baseline method before considering more advanced techniques. Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples) the calculation of predicted values becomes computationally expensive. This approach often does not perform well on datasets with many features (hundreds or more), and it fails badly on datasets where most features are zero most of the time (the so-called sparse datasets). Now we can also analyze the regressor variant. In this case we will generate an artificial set of data:
data_reg = datasets.make_regression(n_samples=1000)
X = data_reg[0]
y = data_reg[1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size= 0.2, random_state=0)
Load the library, train and predict
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)
print("Train set R^2: {:.9f}".format(reg.score(X_train, y_train)))
print("Test set R^2: {:.9f}".format(reg.score(X_test, y_test)))
Note that the R^2 value for the test data should not be much lower than the one for the training set; a large gap would indicate overfitting.
Exercise 8.2

6. Supervised Machine Learning Algorithms: Linear Models

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features, which we will explain shortly.

6.1. Linear regression (aka ordinary least squares)

Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters \(w\) and \(b\) that minimize the mean squared error between predictions and the true regression targets, \(y\), on the training set. \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b \end{equation} We define our working dataset
from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
data_reg = datasets.make_regression(n_features=1, n_samples=100, noise=10, bias=2)
X = data_reg[0]
y = data_reg[1]
We separate into training and test sets, train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
We show the coefficients
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
We present graphically the results
plt.plot(X_test,y_test,'.r', label='test')
plt.plot(X_train,y_train,'.b', label='train')
plt.plot(X,lr.predict(X),'-k', label='fit')
plt.legend()
This example can be extended easily to the multidimensional case, as in the sketch below. As one can see, there is nothing to tune in the fitting process, for example to fight overfitting. To this end there exist other alternatives, e.g., Ridge and Lasso.
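As a minimal sketch of the multidimensional case (the synthetic data and parameter values are our own arbitrary choices):
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Multidimensional linear regression on synthetic data.
X, y = datasets.make_regression(n_features=5, n_samples=200, noise=10, bias=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
print("Coefficients:", lr.coef_)    # one weight w[i] per feature
print("Intercept:", lr.intercept_)  # the bias term b
print("Test set R^2: {:.3f}".format(lr.score(X_test, y_test)))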

6.2. Ridge regression

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares. In ridge regression, though, the coefficients (\(w\)) are chosen not only so that they predict well on the training data, but also to fulfill an additional constraint: we want the magnitude of the coefficients to be as small as possible; in other words, all entries of \(w\) should be close to zero. This constraint is an example of what is called regularization. In the case of Lasso, as many coefficients as possible are set exactly to zero. With plain linear regression we risk overfitting the data. Ridge is a more restricted model, so we are less likely to overfit. A less complex model means worse performance on the training set, but better generalization. As we are mainly interested in generalization performance, we should often prefer the Ridge model over the LinearRegression model. How much importance the model places on simplicity versus training set performance can be specified by the user, using the alpha parameter. The optimum setting of alpha depends on the particular dataset we are using. Increasing alpha forces coefficients to move more toward zero, which decreases training set performance but might help generalization.
from sklearn.linear_model import Ridge
data_reg = datasets.make_regression(n_features=100, n_samples=1000, bias=3.5, noise=10)
X = data_reg[0]
y = data_reg[1] * np.random.rand(len(data_reg[1]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.7)
First, we see the linear regression case
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.9f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.9f}".format(lr.score(X_test, y_test)))
Now we explore the effect of alpha
ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.9f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
We can increase alpha to make the regularization stronger
ridge = Ridge(alpha=1000).fit(X_train, y_train)
print("Training set score: {:.9f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
Finally, we see how alpha affects the value of the coefficients.
coeffs= []
#alphas = np.linspace(0,10,50)
alphas = np.logspace(-6, 6, 200)
for i in alphas:
    ridge = Ridge(alpha=i).fit(X_train, y_train)
    coeffs.append(ridge.coef_)
plt.plot(alphas,coeffs)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("weights")
plt.title("Ridge coefficients as a function of the regularization")

6.3. Linear models for classification

Linear models are not only used for regression; they are also extensively used for classification. In this case the prediction is made using the formula \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b > 0 \end{equation} The formula looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we threshold the predicted value at zero. If the function is smaller than zero, we predict the class -1; if it is larger than zero, we predict the class +1. This prediction rule is common to all linear models for classification. Again, there are many different ways to find the coefficients (w) and the intercept (b). The two most common linear classification algorithms are logistic regression, implemented in linear_model.LogisticRegression, and linear support vector machines (linear SVMs), implemented in svm.LinearSVC (SVC stands for support vector classifier). Despite its name, LogisticRegression is a classification algorithm and not a regression algorithm, and it should not be confused with LinearRegression. For LogisticRegression and LinearSVC, the trade-off parameter that determines the strength of the regularization is called C, and higher values of C correspond to less regularization. In other words, when you use a high value for the parameter C, LogisticRegression and LinearSVC try to fit the training set as best as possible, while with low values of the parameter C, the models put more emphasis on finding a coefficient vector (w) that is close to zero. We consider the breast cancer dataset, where only two classes exist,
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
We load the data, split it into two sets, and train the model,
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
logreg = LogisticRegression(C=1).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
We can check how the two classes separate by plotting them as a function of two features (we can choose another two features to see how they behave),
plt.scatter(cancer.data[:,0],cancer.data[:,11],c=cancer.target)
plt.xlabel('feature 0')
plt.ylabel('feature 11')
Now we move to the case of linear models for multiclass classification. In this case we will use a wine dataset
wine=datasets.load_wine()
print(wine.DESCR)
One can visualize the classes (for only part of the attributes)
fig, ax = plt.subplots(8, 8, figsize=(15, 15))
for row in range(8):
    for col in range(8):
        ax[row, col].scatter(wine.data[:, row], wine.data[:, col], c=wine.target)
        ax[row, col].set_xlabel("{}".format(row))
        ax[row, col].set_ylabel("{}".format(col))
In this case we will use LinearSVC
from sklearn.svm import LinearSVC
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=42)
# Avoid shadowing the SVC class name with the fitted estimator
linear_svc = LinearSVC(C=10.01, max_iter=10000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(linear_svc.score(X_train, y_train)))
print("Test set score: {:.3f}".format(linear_svc.score(X_test, y_test)))
One can see that there are convergence problems.
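These convergence problems usually disappear if the features are rescaled first (scaling is discussed in detail in the dataset transformations section). A minimal sketch, chaining a scaler and the classifier in a pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rescaling the features before LinearSVC typically removes the convergence warnings.
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(C=10.01, max_iter=10000))
scaled_svc.fit(X_train, y_train)
print("Training set score: {:.3f}".format(scaled_svc.score(X_train, y_train)))
print("Test set score: {:.3f}".format(scaled_svc.score(X_test, y_test)))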

7. Supervised Machine Learning Algorithms: Decision Trees

Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision. To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable. A recursive process yields a binary tree of decisions, with each node containing a test. Alternatively, you can think of each test as splitting the part of the data that is currently being considered along one axis. This yields a view of the algorithm as building a hierarchical partition. As each test concerns only a single feature, the regions in the resulting partition always have axis-parallel boundaries.

The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value). A leaf of the tree that contains data points that all share the same target value is called pure. It is also possible to use trees for regression tasks, using exactly the same technique. To make a prediction, we traverse the tree based on the tests in each node and find the leaf the new data point falls into. The output for this data point is the mean target of the training points in this leaf.

Decision trees in scikit-learn are implemented in the DecisionTreeRegressor and DecisionTreeClassifier classes. scikit-learn only implements pre-pruning, not post-pruning. We can apply this method to the breast cancer dataset. As always, we import the dataset and split it into a training and a test part. Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure). We fix the random_state in the tree, which is used for tie-breaking internally:
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
If we don’t restrict the depth of a decision tree, the tree can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well to new data. One option is to stop building the tree after a certain depth has been reached. Here we set max_depth=2, meaning only two consecutive questions can be asked.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
It is easy to visualize what the classification tree means:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

!pip install graphviz
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Or we can inspect the feature importances
print("Feature importances:\n{}".format(tree.feature_importances_))
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
plot_feature_importances_cancer(tree)
Decision trees have two advantages over many of the algorithms we’ve discussed so far: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features that are on completely different scales, or a mix of binary and continuous features.
Exercise 8.3

8. Supervised Machine Learning Algorithms: Support Vector Machines

Kernelized support vector machines (often just referred to as SVMs) are an extension that allows for more complex models that are not defined simply by hyperplanes in the input space. While there are support vector machines for classification and regression, we will restrict ourselves to the classification case, as implemented in SVC. Similar concepts apply to support vector regression, as implemented in SVR. During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes. Typically only a subset of the training points matter for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors and give the support vector machine its name. To make a prediction for a new point, the distance to each of the support vectors is measured. A classification decision is made based on the distances to the support vectors and on the importance of the support vectors learned during training (stored in the dual_coef_ attribute of SVC). Let us apply the SVC to the breast cancer dataset
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
The model overfits quite substantially, with a perfect score on the training set and only 63% accuracy on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the parameters and to the scaling of the data. This can be partially solved using the option gamma='scale'
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
But, in general, to solve the problem, all the features should vary on a similar scale. To see the scale of the different features
plt.plot(X_train.min(axis=0), 'o', label="min")
plt.plot(X_train.max(axis=0), '^', label="max")
plt.legend(loc=4)
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
plt.yscale("log")
The solution is to preprocess the data by rescaling it:
# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))
X_test_scaled = (X_test - min_on_training) / range_on_training
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar but both farther from 100% accuracy. From here, we can try increasing either C or gamma to fit a more complex model. For example, increasing C:
svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))

9. Supervised Machine Learning Algorithms: Neural Networks, the multilayer perceptron (MLP)

While deep learning shows great promise in many machine learning applications, deep learning algorithms are often tailored very carefully to a specific use case. Here, we will only discuss some relatively simple methods, namely multilayer perceptrons for classification and regression, that can serve as a starting point for more involved deep learning methods. MLPs can be viewed as generalizations of linear models that perform multiple stages of processing to come to a decision. \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b \end{equation} \(y\) can be understood as a weighted sum of the input features \(x\), weighted by the learned coefficients \(w\). In an MLP this process of computing weighted sums is repeated multiple times, first computing hidden units that represent an intermediate processing step, which are again combined using weighted sums to yield the final result. This model has a lot more coefficients (also called weights) to learn: there is one between every input and every hidden unit (which make up the hidden layer), and one between every unit in the hidden layer and the output. Note that computing a series of weighted sums is mathematically the same as computing just one weighted sum, so to make this model truly more powerful than a linear model, we need one extra trick. After computing a weighted sum for each hidden unit, a nonlinear function is applied to the result, usually the rectifying nonlinearity (also known as rectified linear unit or relu) or the tangens hyperbolicus (tanh). For the case of four input features and one hidden layer with three nodes, \begin{eqnarray} h[0] &=& \tanh(w[0, 0] x[0] + w[1, 0] x[1] + w[2, 0] x[2] + w[3, 0] x[3])\\ h[1] &=& \tanh(w[0, 1] x[0] + w[1, 1] x[1] + w[2, 1] x[2] + w[3, 1] x[3])\\ h[2] &=& \tanh(w[0, 2] x[0] + w[1, 2] x[1] + w[2, 2] x[2] + w[3, 2] x[3])\\ y &=& v[0] h[0] + v[1] h[1] + v[2] h[2] \end{eqnarray} As a first example let us consider the make_moons dataset
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
mlp = MLPClassifier( solver= 'lbfgs', random_state=0).fit(X_train, y_train)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
By default, the MLP uses 100 hidden nodes, which is quite a lot for this small dataset. We can reduce the number (which reduces the complexity of the model) and still get a good result.
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10, 10], random_state=0).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
Now we can analyze how the complexity of a neural network is controlled: the number of hidden layers, the number of units in each hidden layer, and the regularization parameter (alpha). There are actually even more options.
for n_hidden_nodes_1 in [10, 100]:
    for n_hidden_nodes_2 in [10, 100]:
        for alpha in [0.0001, 0.01, 0.1, 1]:
            mlp = MLPClassifier(solver='lbfgs', random_state=0,
                                hidden_layer_sizes=[n_hidden_nodes_1, n_hidden_nodes_2], alpha=alpha)
            mlp.fit(X_train, y_train)
            print("n_hidden=[{}, {}], alpha={:.4f}".format(n_hidden_nodes_1, n_hidden_nodes_2, alpha))
            print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
            print("Accuracy on test set: {:.3f}\n".format(mlp.score(X_test, y_test)))
While the MLPClassifier and MLPRegressor provide easy-to-use interfaces for the most common neural network architectures, they only capture a small subset of what is possible with neural networks. If you are interested in working with more flexible or larger models, we encourage you to look beyond scikit-learn into the fantastic deep learning libraries that are out there. For Python users, the most well-established are Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/), and PyTorch (https://pytorch.org/).

10. Dataset transformations

We saw in previous sections that some algorithms, like neural networks and SVMs, are very sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.

10.1. Rescaling

Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm. As an example, say we want to apply the kernel SVM (SVC) to the cancer dataset, and use MinMaxScaler for preprocessing the data. We will use the cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=1)
print(X_train.shape)
print(X_test.shape)
We will start with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
We then fit the scaler using the fit method, applied to the training data.
scaler.fit(X_train)
#transform data
X_train_scaled = scaler.transform(X_train)
# print dataset properties before and after scaling
print("transformed shape: {}".format(X_train_scaled.shape))
print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0)))
print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0)))
print("per-feature minimum after scaling:\n {}".format(
X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n {}".format(
X_train_scaled.max(axis=0)))
The transformed data has the same shape as the original data: the features are simply shifted and scaled. You can see that all of the features are now between 0 and 1, as desired. To apply the SVM to the scaled data, we also need to transform the test set. This is again done by calling the transform method, this time on X_test:
# transform test data
X_test_scaled = scaler.transform(X_test)
# print test data properties after scaling
print("per-feature minimum after scaling:\n{}".format(X_test_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_test_scaled.max(axis=0)))
Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and maximum are not 0 and 1. Some of the features are even outside the 0–1 range! The reason is that we fitted the scaler on the training data and then applied it to a different set of data. It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set. Here, we show what happens if the scaling is not done properly
from sklearn.datasets import make_blobs
# make synthetic data
X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
# split it into training and test sets
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)
# plot the training and test sets
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1],c='blue', label="Training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c='red', label="Test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("Original Data")
# scale the data using MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# visualize the properly scaled data
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="Training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',c='red', label="Test set", s=60)
axes[1].set_title("Scaled Data")
# rescale the test set separately
# so test set min is 0 and test set max is 1
# DO NOT DO THIS! For illustration purposes only.
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)
# visualize wrongly scaled data
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="training set", s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1],marker='^', c='red', label="test set", s=60)
axes[2].set_title("Improperly Scaled Data")
for ax in axes:
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
The third panel shows what would happen if we scaled the training set and test set separately. In this case, the minimum and maximum feature values for both the training and the test set are 0 and 1. But now the dataset looks different: the test points moved incongruously relative to the training set, as they were scaled differently, and we changed the arrangement of the data in an arbitrary way. Clearly this is not what we want to do. Note that, in general, the fit and the transform can be done more efficiently in a single step using the fit_transform method:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# calling fit and transform in sequence (using method chaining)
X_scaled = scaler.fit(X).transform(X)
# same result, but more efficient computation
X_scaled_d = scaler.fit_transform(X)
Other important scaler methods are StandardScaler and RobustScaler. The StandardScaler in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features. The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). The Normalizer does a very different kind of rescaling. It scales each data point such that the feature vector has a Euclidean length of 1. In other words, it projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1. A short sketch comparing these scalers is given at the end of this subsection. The effect of working with scaled data can be really important. First we show the performance just using raw data:
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=0)
svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))
and now the scaled ones
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(
    svm.score(X_test_scaled, y_test)))
As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn.
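As announced above, a minimal sketch comparing the alternative scalers on the same training data (the printed summary statistics are our arbitrary choice):
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer

# Compare the effect of the different scalers on the first feature.
for scaler in [MinMaxScaler(), StandardScaler(), RobustScaler(), Normalizer()]:
    X_scaled = scaler.fit_transform(X_train)
    print("{}: min {:.2f}, max {:.2f}, mean {:.2f}".format(
        type(scaler).__name__, X_scaled[:, 0].min(), X_scaled[:, 0].max(), X_scaled[:, 0].mean()))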

10.2. Principal Component Analysis (PCA) for dimensionality reduction

Principal component analysis is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data. We will work once more with the cancer dataset. First we scale the data,
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
and then we obtain the first two principal components,
from sklearn.decomposition import PCA
# keep the first two principal components of the data
pca = PCA(n_components=2)
# fit PCA model to breast cancer data
pca.fit(X_scaled)
# transform data onto the first two principal components
X_pca = pca.transform(X_scaled)
print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))
We can see what the new X data look like
# plot first vs. second principal component, colored by class
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target)
plt.legend(cancer.target_names, loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
It is important to note that PCA is an unsupervised method, and does not use any class information when finding the rotation. It simply looks at the correlations in the data. You can see that the two classes separate quite well in this two-dimensional space. This leads us to believe that even a linear classifier (that would learn a line in this space) could do a reasonably good job at distinguishing the two classes. A downside of PCA is that the two axes in the plot are often not very easy to interpret. The principal components correspond to directions in the original data, so they are combinations of the original features. However, these combinations are usually very complex, as we’ll see shortly.
print("PCA components:\n{}".format(pca.components_))

plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(cancer.feature_names)),
           cancer.feature_names, rotation=60, ha='left')
plt.xlabel("Feature")
plt.ylabel("Principal components")
Once the PCA has been performed, the fitting procedure can be conducted with the PCA components:
from sklearn.svm import SVC
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, cancer.target,
                                                            random_state=10)
#
svm = SVC(C=100)
svm.fit(X_train_pca, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test_pca, y_test)))

10.3. Pipelines

Most machine learning applications require not only the application of a single algorithm, but the chaining together of many different processing steps and machine learning models. In this section, we will cover how to use the Pipeline class to simplify the process of building chains of transformations and models. In particular, we will see how to use Pipeline to run all the processing steps at once. Let’s look at how we can use the Pipeline class to express the workflow for training an SVM after scaling the data with MinMaxScaler (a grid search over the pipeline parameters is sketched at the end of the section). First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of your choosing) and an instance of an estimator:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=0)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
Here, we created two steps: the first, called “scaler”, is an instance of MinMaxScaler, and the second, called “svm”, is an instance of SVC. One can access the different steps with pipe[0] or pipe[1]. Now we can fit the pipeline, like any other scikit-learn estimator:
pipe.fit(X_train, y_train)
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))
Calling the score method on the pipeline first transforms the test data using the scaler, and then calls the score method on the SVM using the scaled test data. The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline , that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:
print("Pipeline steps:\n{}".format(pipe_short.steps))
print("Pipeline steps:\n{}".format(pipe_long.steps))
One can access the attributes at any step,
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.fit(cancer.data)
print("Pipeline steps:\n{}".format(pipe.steps))
print(pipe.named_steps["pca"].components_)
pipe.transform(cancer.data)
It is possible to change the value of any control parameter of the different steps
pipe.set_params(pca__n_components=3)
pipe.fit(cancer.data)
pipe.transform(cancer.data).shape
Note the double underscore (__) separating the step name from the parameter name.
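As announced at the beginning of the section, a minimal sketch of a grid search over pipeline parameters (the grid values are arbitrary choices); the step__parameter names follow exactly the double-underscore convention just described:
from sklearn.model_selection import GridSearchCV

# Grid search over the scaler+SVM pipeline defined above; note the "svm__" prefix.
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe_long, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))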
Exercise 8.4

11. Exercises

Author: Curro Perez Bernal

Created: 2024-12-09 Mon 21:21
