Python Lesson 8: Machine Learning

1. Lesson outline

  1. Getting started
  2. A first simple example
  3. A second and more complex example
  4. Supervised Machine Learning Algorithms: k-Nearest Neighbors, Linear Models, Decision Trees, Support Vector Machines, the multilayer perceptron
  5. Dataset transformations
  6. Exercises

2. Getting started

2.1. Why machine learning

Machine learning is about extracting knowledge from data. It is a research field at the intersection of statistics, artificial intelligence, and computer science, and is also known as predictive analytics or statistical learning. The application of machine learning methods has in recent years become ubiquitous in everyday life: from automatic recommendations of movies and TV series to watch, to what food to order or which products to buy, to personalized online radio or music and automatic recognition of your friends in your photos, many modern websites and devices have machine learning algorithms at their core. When you look at a complex website like HBO, Facebook, Amazon, or Netflix, it is very likely that every part of the site contains multiple machine learning models.

The most successful machine learning algorithms are those that automate decision-making processes by generalizing from known examples. In this setting, known as supervised machine learning (SML), the user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output given an input. In particular, once trained, a supervised algorithm is able to create an output for an input it has never seen before, without any human guidance. Machine learning algorithms that learn from input/output pairs are called supervised learning algorithms because a teacher provides supervision to the algorithms in the form of the desired output for each example that they learn from. Examples: identifying the zip code from handwritten digits on an envelope, determining whether a tumor is benign based on medical imaging, or detecting fraudulent activity in bank transactions.

In unsupervised learning, an utterly different approach, only input data are known, and no output data are supplied to the algorithm, which then looks for patterns in the data in a completely independent way. Despite the many successful applications of these methods, they are usually harder to understand and evaluate. Some examples are identifying topics in a set of blog posts, segmenting customers into groups with similar preferences, or detecting abnormal access patterns to a website.

Concerning data, it is helpful to think of your data as a table. Each data point that you want to reason about (each email, each customer, each transaction) is a table row, and each property that describes that data point (say, the age of a customer or the amount or location of a transaction) is a column. You might describe users by their age, their gender, when they created an account, and how often they have bought from your online shop. You might describe the image of a tumor by the grayscale values of each pixel, or maybe by using the size, shape, and color of the tumor. Each entity or row is known as a sample (or data point) in machine learning, while the columns (the properties that describe such entities) are called features.
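As a minimal illustration of this table view, here is a toy dataset invented for the example (the feature names and values are arbitrary): samples are rows and features are columns of a 2D array.
import numpy as np

# Toy dataset invented for illustration: 4 samples (rows), 3 features (columns).
# Hypothetical features: age (years), account age (days), number of purchases.
X = np.array([[25,  365, 10],
              [47, 1200,  3],
              [33,   90, 25],
              [52, 2000,  1]])
# One class label per sample, e.g. 1 = frequent buyer, 0 = occasional buyer.
y = np.array([1, 0, 1, 0])
print(X.shape)  # (4, 3): 4 samples, 3 features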

2.2. The libraries

The scikit-learn library (homepage https://scikit-learn.org/stable/) is an open source machine learning library that supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. You can see the available supervised methods at https://scikit-learn.org/stable/supervised_learning.html and the unsupervised ones at https://scikit-learn.org/stable/unsupervised_learning.html. To install scikit-learn simply type:
conda install scikit-learn

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-1.1.0               |     pyhd8ed1ab_0         210 KB  conda-forge
    scikit-learn-1.1.2         |   py39he5e8d7e_0         8.2 MB  conda-forge
    threadpoolctl-3.1.0        |     pyh8a188c0_0          18 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         8.5 MB

The following NEW packages will be INSTALLED:

  joblib             conda-forge/noarch::joblib-1.1.0-pyhd8ed1ab_0
  scikit-learn       conda-forge/linux-64::scikit-learn-1.1.2-py39he5e8d7e_0
  threadpoolctl      conda-forge/noarch::threadpoolctl-3.1.0-pyh8a188c0_0


Proceed ([y]/n)? 


Downloading and Extracting Packages
scikit-learn-1.1.2   | 8.2 MB    | ##################################### | 100% 
joblib-1.1.0         | 210 KB    | ##################################### | 100% 
threadpoolctl-3.1.0  | 18 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done

An alternative is to use pip install scikit-learn. Moreover, we will also use the numpy, pandas, scipy, and matplotlib libraries used in other course lessons.

3. A first simple example

First we import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn import metrics 
from sklearn.model_selection import train_test_split
In this first case we define our own dataset. As we mentioned, in general, a dataset is made of two parts:
  • The characteristics of the objects, or features.
  • The class to which the objects belong.
We consider 500 objects, each defined in terms of two attributes, and 500 class labels given as integers; i.e., [0,1] belongs to class 0, [2,3] belongs to class 1, [4,5] belongs to class 2, and so on.
X = np.arange(1000).reshape((500, 2))
y = range(500)
Now we define our estimator, the so-called support vector classifier,
clf = svm.SVC(gamma=0.001, C=100.)
and separate our dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, shuffle=True)
We can now train the model
clf.fit(X_train,y_train)
Once the model is trained, we can assess the results
predicted = clf.predict(X_test)
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)

and
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()

4. A second and more complex example

4.1. Reading the data

First we read one of the standard datasets included in the scikit-learn package, corresponding to a copy of the test set of the UCI ML hand-written digits dataset https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
from sklearn import datasets
digits = datasets.load_digits()
See https://scikit-learn.org/stable/datasets.html for the complete list of datasets or go to https://archive-beta.ics.uci.edu/. You can check what the dataset keys are or show the content of one of the keys
print("Keys of digit dataset: \n{}".format(digits.keys()))
print("Description of digit dataset: \n{}".format(digits['DESCR']))
Or you can access the data directly
print(digits.data)
print(digits.data.shape)
print(digits.target)
print(digits.target.shape)
The data attribute provides the instances and target provides the class of each instance. Note that each data element is an 8x8 image flattened into a 64-element vector; in this particular dataset, the original 2D 8x8 arrays are also available through the images attribute
print(digits.images[0])
digits.images.shape
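As a quick sanity check, each row of data is simply the corresponding image unraveled:
# Each row of digits.data is the corresponding 8x8 image flattened to 64 values.
print(digits.data[0].reshape(8, 8))                         # same as digits.images[0]
print((digits.data[0] == digits.images[0].ravel()).all())   # True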

4.2. Learning and predicting

The task is to predict, given an image, which digit it represents. We are given samples of each of the ten possible classes (the digits zero through nine) on which we fit an estimator to be able to predict the classes to which unseen samples belong. In scikit-learn, an estimator for classification is a Python object that implements the fit(X, y) and predict(T) methods. An example of an estimator is the class sklearn.svm.SVC, which implements the support vector classification. The estimator’s constructor takes as arguments the model’s parameters.
# from sklearn import svm
clf = svm.SVC()
#clf = svm.SVC(gamma=0.001, C=100.)
pred_data = -30
clf.fit(digits.data[:pred_data], digits.target[:pred_data])
print(clf.predict(digits.data[pred_data:]))
print(digits.target[pred_data:])
The clf (for classifier) estimator instance is first fitted to the data; that is, it must learn from the data. This is done by passing our training set to the fit method. For the training set, we’ll use all the images from our dataset except for the last pred_data images, which are reserved to check the fit quality. We select the training set with the [:pred_data] Python slice syntax, which produces a new array that contains all but the last items from digits.data. You can now predict new values from the corresponding matrices. In this case, you’ll predict using the last images, which were excluded from the training; the prediction returns the digit class that the model assigns to each of these images. In this example we are using Support Vector Machines, which are a set of supervised learning methods used for classification, regression, and outlier detection. You can also plot the images and check the prediction quality visually
fig, ax = plt.subplots(6, 5, figsize=(10, 10))
i = 0
for row in range(6):
    for col in range(5):
        ax[row, col].imshow(digits.images[pred_data + i], cmap=plt.cm.gray_r, interpolation="nearest")
        i += 1
However it is much more convenient to split our original dataset into two sets, the training and the test sets, selecting elements in a random way. In this case we use half the data for training and the other half for testing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
We now train our model and check the results
# Train
clf.fit(X_train, y_train)
# Compute predictions
predicted = clf.predict(X_test)
Now we can analyze the performance of our learning procedure:
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
#
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()
Note that the performance of the prediction can be measured in different ways:
  • precision = (true positives)/(true positives + false positives)
  • recall = (true positives)/(true positives + false negatives)
  • F1-score = 2 * (precision * recall) / (precision + recall)
among other options. On the other hand, the confusion matrix provides a detailed view of the agreement between the predictions and the test labels. Note that for other choices of training and test sets, the obtained results might vary.
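To make these definitions concrete, here is a minimal sketch that computes precision, recall, and F1 by hand for the positive class of a made-up binary problem (the labels are invented for the example) and compares them with the values returned by scikit-learn:
import numpy as np
from sklearn import metrics

# Made-up labels for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Precision, recall, and F1 by hand for the positive class (1).
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same values computed by scikit-learn.
print(metrics.precision_score(y_true, y_pred),
      metrics.recall_score(y_true, y_pred),
      metrics.f1_score(y_true, y_pred))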

5. Supervised Machine Learning Algorithms

5.1. k-Nearest Neighbors algorithm

The k-Nearest Neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm. Building the model requires only storing the training dataset. To make a prediction for a new data point, the algorithm looks for its nearest neighbors, i.e. the closest data points in the training dataset. In this and the coming sections we will use the famous Iris dataset, first used by Sir R.A. Fisher in his 1936 paper (Annals of Eugenics, 7, Part II, 179-188). The dataset contains 3 classes with 50 instances each, where each class refers to a type of iris plant: Setosa, Versicolour, and Virginica.
#from sklearn import metrics 
#from sklearn.model_selection import train_test_split
#from sklearn import datasets
iris_dataset = datasets.load_iris()
X, y = iris_dataset.data, iris_dataset.target
print(iris_dataset.DESCR)
This dataset has four attributes: sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Its structure can be easily visualized graphically, using different colors for the three classes
fig, ax = plt.subplots(4, 4, figsize=(15, 15))
for row in range(4):
    for col in range(4):
        ax[row, col].scatter(iris_dataset.data[:, row], iris_dataset.data[:, col], c=iris_dataset.target)
        ax[row, col].set_xlabel("Attrib {}".format(row))
        ax[row, col].set_ylabel("Attrib {}".format(col))
To build the model
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, 
    random_state=0, shuffle=True )
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
By default the train_test_split function uses 25% of the data for testing and 75% for training. Note that the option random_state=number makes the function split the data into the same two sets every time, which makes the calculations reproducible across multiple function calls; its default value is None. The most relevant parameter of the classifier is n_neighbors, whose default value is 5. The obtained results and the performance of the calculation are:
print("Test set predictions: {}".format(clf.predict(X_test)))
print("Test set accuracy: {:.9f}".format(clf.score(X_test, y_test)))
It is of interest to see how the number of neighbors affects the performance of the calculation, as required in the following exercise.
Exercise 8.1
We now proceed with a different dataset, the Wisconsin Breast Cancer dataset (load_breast_cancer dataset), which records clinical measurements of breast tumors. Each tumor is labeled as benign (for harmless tumors) or malignant (for cancerous tumors), and the task is to learn to predict whether a tumor is malignant based on the tissue measurements.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
In this case we split the data including the stratify argument. This argument makes a split in such a way that the proportion of classes in the produced sets is the same as the proportion in the array passed to stratify. In principle, there are two important knobs to control the KNeighbors classifier: the number of nearest neighbors considered and how you measure the distance between data points. In practice, using a small and odd (to avoid ties between classes) number of neighbors, like three or five, often works well, but you should certainly adjust this parameter to optimize the obtained results. It is also possible to define the distance metric at will; the default setup uses the Euclidean distance, which works fine in many settings. We proceed by training the model for different numbers of nearest neighbors and checking the accuracy obtained as a function of the number of neighbors considered
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization accuracy
    test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend() 
Overall the results are quite good, with accuracies always larger than 88%, but you can see how the test accuracy increases with the number of neighbors until a maximum is reached. One of the strengths of the k-NN algorithm is that the model is very easy to understand, and it often achieves a reasonable performance without a lot of adjustments. This algorithm offers a good baseline method before considering more advanced techniques. Building the nearest neighbors model is usually very fast, but when your training set is very large (either in number of features or in number of samples) the calculation of predicted values becomes computationally expensive. This approach often does not perform well on datasets with many features (hundreds or more), and it fails badly on datasets where most features are zero most of the time (the so-called sparse datasets). Now we can also analyze the regressor variant. In this case we will generate an artificial set of data:
data_reg = datasets.make_regression(n_samples=1000)
X = data_reg[0]
y = data_reg[1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size= 0.2, random_state=0)
Load the library, train and predict
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X_train, y_train)
predicted = reg.predict(X_test)
print("Train set R^2: {:.9f}".format(reg.score(X_train, y_train)))
print("Test set R^2: {:.9f}".format(reg.score(X_test, y_test)))
Note that the R^2 value for the test data should not be much lower than the one for the training set; a large gap would indicate overfitting.
Exercise 8.2

6. Supervised Machine Learning Algorithms: Linear Models

Linear models are a class of models that are widely used in practice and have been studied extensively in the last few decades, with roots going back over a hundred years. Linear models make a prediction using a linear function of the input features, which we will explain shortly.

6.1. Linear regression (aka ordinary least squares)

Linear regression, or ordinary least squares (OLS), is the simplest and most classic linear method for regression. Linear regression finds the parameters \(w\) and \(b\) that minimize the mean squared error between predictions and the true regression targets, \(y\), on the training set. \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b \end{equation} We define our working dataset
from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LinearRegression
data_reg = datasets.make_regression(n_features=1, n_samples=100, noise=10, bias=2)
X = data_reg[0]
y = data_reg[1]
We separate into training and test sets, train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
We show the coefficients
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
We present graphically the results
plt.plot(X_test,y_test,'.r', label='test')
plt.plot(X_train,y_train,'.b', label='train')
plt.plot(X,lr.predict(X),'-k', label='fit')
plt.legend()
This example can be extended easily to the multidimensional case, as in the sketch below. As one can see, there is nothing to tune in the fitting process, for example to fight overfitting. To this end there exist other alternatives, e.g., Ridge and Lasso.
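As a minimal sketch of the multidimensional case (the synthetic data and parameter values are our own arbitrary choices):
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Multidimensional linear regression on synthetic data.
X, y = datasets.make_regression(n_features=5, n_samples=200, noise=10, bias=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
lr = LinearRegression().fit(X_train, y_train)
print("Coefficients:", lr.coef_)    # one weight w[i] per feature
print("Intercept:", lr.intercept_)  # the bias term b
print("Test set R^2: {:.3f}".format(lr.score(X_test, y_test)))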

6.2. Ridge regression

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares. In ridge regression, though, the coefficients (\(w\)) are chosen not only so that they predict well on the training data, but also to fulfill an additional constraint: we want the magnitude of the coefficients to be as small as possible; in other words, all entries of \(w\) should be close to zero. This constraint is an example of what is called regularization. In the case of Lasso, as many coefficients as possible are set exactly to zero. With plain linear regression we risk overfitting the data. Ridge is a more restricted model, so we are less likely to overfit. A less complex model means worse performance on the training set, but better generalization. As we are mainly interested in generalization performance, we should often prefer the Ridge model over the LinearRegression model. How much importance the model places on simplicity versus training set performance can be specified by the user, using the alpha parameter. The optimum setting of alpha depends on the particular dataset we are using. Increasing alpha forces coefficients to move more toward zero, which decreases training set performance but might help generalization.
from sklearn.linear_model import Ridge
data_reg = datasets.make_regression(n_features=100, n_samples=1000, bias=3.5, noise=10)
X = data_reg[0]
y = data_reg[1] * np.random.rand(len(data_reg[1]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size=0.7)
First, we see the linear regression case
lr = LinearRegression().fit(X_train, y_train)
print("Training set score: {:.9f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.9f}".format(lr.score(X_test, y_test)))
Now we explore the effect of alpha
ridge = Ridge().fit(X_train, y_train)
print("Training set score: {:.9f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
We can increase alpha to make the regularization stronger
ridge = Ridge(alpha=1000).fit(X_train, y_train)
print("Training set score: {:.9f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.9f}".format(ridge.score(X_test, y_test)))
Finally, we see how alpha affects the value of the coefficients.
coeffs= []
#alphas = np.linspace(0,10,50)
alphas = np.logspace(-6, 6, 200)
for i in alphas:
    ridge = Ridge(alpha=i).fit(X_train, y_train)
    coeffs.append(ridge.coef_)
plt.plot(alphas,coeffs)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("weights")
plt.title("Ridge coefficients as a function of the regularization")

6.3. Linear models for classification

Linear models are not only used for regression; they are also extensively used for classification. In this case the prediction is made using the formula \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b > 0 \end{equation} The formula looks very similar to the one for linear regression, but instead of just returning the weighted sum of the features, we threshold the predicted value at zero. If the function is smaller than zero, we predict the class -1; if it is larger than zero, we predict the class +1. This prediction rule is common to all linear models for classification. Again, there are many different ways to find the coefficients (w) and the intercept (b). The two most common linear classification algorithms are logistic regression, implemented in linear_model.LogisticRegression, and linear support vector machines (linear SVMs), implemented in svm.LinearSVC (SVC stands for support vector classifier). Despite its name, LogisticRegression is a classification algorithm and not a regression algorithm, and it should not be confused with LinearRegression. For LogisticRegression and LinearSVC, the trade-off parameter that determines the strength of the regularization is called C, and higher values of C correspond to less regularization. In other words, when you use a high value for the parameter C, LogisticRegression and LinearSVC try to fit the training set as best as possible, while with low values of the parameter C, the models put more emphasis on finding a coefficient vector (w) that is close to zero. We consider the breast cancer dataset, where only two classes exist,
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
We load the data, split it into two sets, and train the model,
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
logreg = LogisticRegression(C=1).fit(X_train, y_train)
print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))
We can check how the two classes separate by plotting them as a function of two features (we can choose another two features to see how they behave),
plt.scatter(cancer.data[:,0],cancer.data[:,11],c=cancer.target)
plt.xlabel('feature 0')
plt.ylabel('feature 11')
Now we move to the case of linear models for multiclass classification. In this case we will use a wine dataset
wine=datasets.load_wine()
print(wine.DESCR)
One can visualize the classes (for only part of the attributes)
fig, ax = plt.subplots(8, 8, figsize=(15, 15))
for row in range(8):
    for col in range(8):
        ax[row, col].scatter(wine.data[:, row], wine.data[:, col], c=wine.target)
        ax[row, col].set_xlabel("{}".format(row))
        ax[row, col].set_ylabel("{}".format(col))
In this case we will use LinearSVC
from sklearn.svm import LinearSVC
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, stratify=wine.target, random_state=42)
# Avoid shadowing the SVC class name with the fitted estimator
linear_svc = LinearSVC(C=10.01, max_iter=10000).fit(X_train, y_train)
print("Training set score: {:.3f}".format(linear_svc.score(X_train, y_train)))
print("Test set score: {:.3f}".format(linear_svc.score(X_test, y_test)))
One can see that there are convergence problems.
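These convergence problems usually disappear if the features are rescaled first (scaling is discussed in detail in the dataset transformations section). A minimal sketch, chaining a scaler and the classifier in a pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rescaling the features before LinearSVC typically removes the convergence warnings.
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(C=10.01, max_iter=10000))
scaled_svc.fit(X_train, y_train)
print("Training set score: {:.3f}".format(scaled_svc.score(X_train, y_train)))
print("Test set score: {:.3f}".format(scaled_svc.score(X_test, y_test)))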

7. Supervised Machine Learning Algorithms: Decision Trees

Decision trees are widely used models for classification and regression tasks. Essentially, they learn a hierarchy of if/else questions, leading to a decision. To build a tree, the algorithm searches over all possible tests and finds the one that is most informative about the target variable. A recursive process yields a binary tree of decisions, with each node containing a test. Alternatively, you can think of each test as splitting the part of the data that is currently being considered along one axis. This yields a view of the algorithm as building a hierarchical partition. As each test concerns only a single feature, the regions in the resulting partition always have axis-parallel boundaries.

The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value). A leaf of the tree that contains data points that all share the same target value is called pure. It is also possible to use trees for regression tasks, using exactly the same technique. To make a prediction, we traverse the tree based on the tests in each node and find the leaf the new data point falls into. The output for this data point is the mean target of the training points in this leaf.

Decision trees in scikit-learn are implemented in the DecisionTreeRegressor and DecisionTreeClassifier classes. scikit-learn only implements pre-pruning, not post-pruning. We can apply this method to the breast cancer dataset. As always, we import the dataset and split it into a training and a test part. Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure). We fix the random_state in the tree, which is used for tie-breaking internally:
from sklearn.tree import DecisionTreeClassifier
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
If we don’t restrict the depth of a decision tree, the tree can become arbitrarily deep and complex. Unpruned trees are therefore prone to overfitting and not generalizing well to new data. One option is to stop building the tree after a certain depth has been reached. Here we set max_depth=2, meaning only two consecutive questions can be asked.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))
It is easy to visualize what the classification tree means:
from sklearn.tree import export_graphviz
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"],
                feature_names=cancer.feature_names, impurity=False, filled=True)

!pip install graphviz
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Or we can inspect the feature importances
print("Feature importances:\n{}".format(tree.feature_importances_))
def plot_feature_importances_cancer(model):
    n_features = cancer.data.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
plot_feature_importances_cancer(tree)
Decision trees have two advantages over many of the algorithms we’ve discussed so far: the resulting model can easily be visualized and understood by nonexperts (at least for smaller trees), and the algorithms are completely invariant to scaling of the data. As each feature is processed separately, and the possible splits of the data don’t depend on scaling, no preprocessing like normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features that are on completely different scales, or a mix of binary and continuous features.
Exercise 8.3

8. Supervised Machine Learning Algorithms: Support Vector Machines

Kernelized support vector machines (often just referred to as SVMs) are an extension that allows for more complex models that are not defined simply by hyperplanes in the input space. While there are support vector machines for classification and regression, we will restrict ourselves to the classification case, as implemented in SVC. Similar concepts apply to support vector regression, as implemented in SVR. During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes. Typically only a subset of the training points matter for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors and give the support vector machine its name. To make a prediction for a new point, the distance to each of the support vectors is measured. A classification decision is made based on the distances to the support vectors and on the importance of the support vectors learned during training (stored in the dual_coef_ attribute of SVC). Let us apply the SVC to the breast cancer dataset
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
The model overfits quite substantially, with a perfect score on the training set and only 63% accuracy on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the parameters and to the scaling of the data. This can be partially solved using the option gamma='scale'
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
But, in general, to solve the problem, all the features should vary on a similar scale. To see the scale of the different features
plt.plot(X_train.min(axis=0), 'o', label="min")
plt.plot(X_train.max(axis=0), '^', label="max")
plt.legend(loc=4)
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
plt.yscale("log")
The solution is to preprocess the data by rescaling it:
# compute the minimum value per feature on the training set
min_on_training = X_train.min(axis=0)
# compute the range of each feature (max - min) on the training set
range_on_training = (X_train - min_on_training).max(axis=0)
# subtract the min, and divide by range
# afterward, min=0 and max=1 for each feature
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))
X_test_scaled = (X_test - min_on_training) / range_on_training
svc = SVC()
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))
Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar but both farther from 100% accuracy. From here, we can try increasing either C or gamma to fit a more complex model. For example, increasing C:
svc = SVC(C=1000)
svc.fit(X_train_scaled, y_train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test_scaled, y_test)))

9. Supervised Machine Learning Algorithms: Neural Networks, the multilayer perceptron (MLP)

While deep learning shows great promise in many machine learning applications, deep learning algorithms are often tailored very carefully to a specific use case. Here, we will only discuss some relatively simple methods, namely multilayer perceptrons for classification and regression, that can serve as a starting point for more involved deep learning methods. MLPs can be viewed as generalizations of linear models that perform multiple stages of processing to come to a decision. \begin{equation} y = w[0] x[0] + w[1] x[1] + \ldots + w[p] x[p] + b \end{equation} \(y\) can be understood as a weighted sum of the input features \(x\), weighted by the learned coefficients \(w\). In an MLP this process of computing weighted sums is repeated multiple times, first computing hidden units that represent an intermediate processing step, which are again combined using weighted sums to yield the final result. This model has a lot more coefficients (also called weights) to learn: there is one between every input and every hidden unit (which make up the hidden layer), and one between every unit in the hidden layer and the output. Note that computing a series of weighted sums is mathematically the same as computing just one weighted sum, so to make this model truly more powerful than a linear model, we need one extra trick. After computing a weighted sum for each hidden unit, a nonlinear function is applied to the result, usually the rectifying nonlinearity (also known as rectified linear unit or relu) or the tangens hyperbolicus (tanh). For the case of four input features and one hidden layer with three nodes, \begin{eqnarray} h[0] &=& \tanh(w[0, 0] x[0] + w[1, 0] x[1] + w[2, 0] x[2] + w[3, 0] x[3])\\ h[1] &=& \tanh(w[0, 1] x[0] + w[1, 1] x[1] + w[2, 1] x[2] + w[3, 1] x[3])\\ h[2] &=& \tanh(w[0, 2] x[0] + w[1, 2] x[1] + w[2, 2] x[2] + w[3, 2] x[3])\\ y &=& v[0] h[0] + v[1] h[1] + v[2] h[2] \end{eqnarray} As a first example let us consider the make_moons dataset
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
mlp = MLPClassifier( solver= 'lbfgs', random_state=0).fit(X_train, y_train)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
By default, the MLP uses 100 hidden nodes, which is quite a lot for this small dataset. We can reduce the number (which reduces the complexity of the model) and still get a good result.
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10, 10], random_state=0).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
Now we can analyze how the complexity of a neural network is controlled: the number of hidden layers, the number of units in each hidden layer, and the regularization parameter (alpha). There are actually even more options.
for n_hidden_nodes_1 in [10, 100]:
    for n_hidden_nodes_2 in [10, 100]:
        for alpha in [0.0001, 0.01, 0.1, 1]:
            mlp = MLPClassifier(solver='lbfgs', random_state=0,
                                hidden_layer_sizes=[n_hidden_nodes_1, n_hidden_nodes_2], alpha=alpha)
            mlp.fit(X_train, y_train)
            print("n_hidden=[{}, {}], alpha={:.4f}".format(n_hidden_nodes_1, n_hidden_nodes_2, alpha))
            print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
            print("Accuracy on test set: {:.3f}\n".format(mlp.score(X_test, y_test)))
While the MLPClassifier and MLPRegressor provide easy-to-use interfaces for the most common neural network architectures, they only capture a small subset of what is possible with neural networks. If you are interested in working with more flexible or larger models, we encourage you to look beyond scikit-learn into the fantastic deep learning libraries that are out there. For Python users, the most well-established are Keras (https://keras.io/), TensorFlow (https://www.tensorflow.org/), and PyTorch (https://pytorch.org/).

10. Dataset transformations

We saw in previous sections that some algorithms, like neural networks and SVMs, are very sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.

10.1. Rescaling

Preprocessing methods like the scalers are usually applied before applying a supervised machine learning algorithm. As an example, say we want to apply the kernel SVM (SVC) to the cancer dataset, and use MinMaxScaler for preprocessing the data. We will use the cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=1)
print(X_train.shape)
print(X_test.shape)
We will start with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
We then fit the scaler using the fit method, applied to the training data.
scaler.fit(X_train)
#transform data
X_train_scaled = scaler.transform(X_train)
# print dataset properties before and after scaling
print("transformed shape: {}".format(X_train_scaled.shape))
print("per-feature minimum before scaling:\n {}".format(X_train.min(axis=0)))
print("per-feature maximum before scaling:\n {}".format(X_train.max(axis=0)))
print("per-feature minimum after scaling:\n {}".format(
X_train_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n {}".format(
X_train_scaled.max(axis=0)))
The transformed data has the same shape as the original data: the features are simply shifted and scaled. You can see that all of the features are now between 0 and 1, as desired. To apply the SVM to the scaled data, we also need to transform the test set. This is again done by calling the transform method, this time on X_test:
# transform test data
X_test_scaled = scaler.transform(X_test)
# print test data properties after scaling
print("per-feature minimum after scaling:\n{}".format(X_test_scaled.min(axis=0)))
print("per-feature maximum after scaling:\n{}".format(X_test_scaled.max(axis=0)))
Maybe somewhat surprisingly, you can see that for the test set, after scaling, the minimum and maximum are not 0 and 1. Some of the features are even outside the 0–1 range! The reason is that we fitted the scaler on the training data and then applied it to a different set of data. It is important to apply exactly the same transformation to the training set and the test set for the supervised model to work on the test set. Here, we show what happens if the scaling is not done properly
from sklearn.datasets import make_blobs
# make synthetic data
X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)
# split it into training and test sets
X_train, X_test = train_test_split(X, random_state=5, test_size=.1)
# plot the training and test sets
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].scatter(X_train[:, 0], X_train[:, 1],c='blue', label="Training set", s=60)
axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', c='red', label="Test set", s=60)
axes[0].legend(loc='upper left')
axes[0].set_title("Original Data")
# scale the data using MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# visualize the properly scaled data
axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="Training set", s=60)
axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',c='red', label="Test set", s=60)
axes[1].set_title("Scaled Data")
# rescale the test set separately
# so test set min is 0 and test set max is 1
# DO NOT DO THIS! For illustration purposes only.
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)
# visualize wrongly scaled data
axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],c='blue', label="training set", s=60)
axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1],marker='^', c='red', label="test set", s=60)
axes[2].set_title("Improperly Scaled Data")
for ax in axes:
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
The third panel shows what would happen if we scaled the training set and test set separately. In this case, the minimum and maximum feature values for both the training and the test set are 0 and 1. But now the dataset looks different: the test points moved incongruously relative to the training set, as they were scaled differently, and we changed the arrangement of the data in an arbitrary way. Clearly this is not what we want to do. Note that, in general, the fit and the transform can be done more efficiently in a single step using the fit_transform method:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# calling fit and transform in sequence (using method chaining)
X_scaled = scaler.fit(X).transform(X)
# same result, but more efficient computation
X_scaled_d = scaler.fit_transform(X)
Other important scaler methods are StandardScaler and RobustScaler. The StandardScaler in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features. The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles, instead of mean and variance. This makes the RobustScaler ignore data points that are very different from the rest (like measurement errors). The Normalizer does a very different kind of rescaling. It scales each data point such that the feature vector has a Euclidean length of 1. In other words, it projects a data point on the circle (or sphere, in the case of higher dimensions) with a radius of 1. A short sketch comparing these scalers is given at the end of this subsection. The effect of working with scaled data can be really important. First we show the performance just using raw data:
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=0)
svm = SVC(C=100)
svm.fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test, y_test)))
and now the scaled ones
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# learning an SVM on the scaled training data
svm.fit(X_train_scaled, y_train)
# scoring on the scaled test set
print("Scaled test set accuracy: {:.2f}".format(
    svm.score(X_test_scaled, y_test)))
As we saw before, the effect of scaling the data is quite significant. Even though scaling the data doesn’t involve any complicated math, it is good practice to use the scaling mechanisms provided by scikit-learn.
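As announced above, a minimal sketch comparing the alternative scalers on the same training data (the printed summary statistics are our arbitrary choice):
from sklearn.preprocessing import StandardScaler, RobustScaler, Normalizer

# Compare the effect of the different scalers on the first feature.
for scaler in [MinMaxScaler(), StandardScaler(), RobustScaler(), Normalizer()]:
    X_scaled = scaler.fit_transform(X_train)
    print("{}: min {:.2f}, max {:.2f}, mean {:.2f}".format(
        type(scaler).__name__, X_scaled[:, 0].min(), X_scaled[:, 0].max(), X_scaled[:, 0].mean()))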

10.2. Principal Component Analysis (PCA) for dimensionality reduction

Principal component analysis is a method that rotates the dataset in a way such that the rotated features are statistically uncorrelated. This rotation is often followed by selecting only a subset of the new features, according to how important they are for explaining the data. We will work once more with the cancer dataset. First we scale the data,
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
and then we obtain the first two principal components,
from sklearn.decomposition import PCA
# keep the first two principal components of the data
pca = PCA(n_components=2)
# fit PCA model to breast cancer data
pca.fit(X_scaled)
# transform data onto the first two principal components
X_pca = pca.transform(X_scaled)
print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))
We can see what the new X data look like
# plot first vs. second principal component, colored by class
plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cancer.target)
plt.legend(cancer.target_names, loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
It is important to note that PCA is an unsupervised method, and does not use any class information when finding the rotation. It simply looks at the correlations in the data. You can see that the two classes separate quite well in this two-dimensional space. This leads us to believe that even a linear classifier (that would learn a line in this space) could do a reasonably good job at distinguishing the two classes. A downside of PCA is that the two axes in the plot are often not very easy to interpret. The principal components correspond to directions in the original data, so they are combinations of the original features. However, these combinations are usually very complex, as we’ll see shortly.
print("PCA components:\n{}".format(pca.components_))

plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(cancer.feature_names)),
           cancer.feature_names, rotation=60, ha='left')
plt.xlabel("Feature")
plt.ylabel("Principal components")
Once the PCA has been performed, the fitting procedure can be conducted with the PCA components:
from sklearn.svm import SVC
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, cancer.target,
                                                            random_state=10)
#
svm = SVC(C=100)
svm.fit(X_train_pca, y_train)
print("Test set accuracy: {:.2f}".format(svm.score(X_test_pca, y_test)))

10.3. Pipelines

Most machine learning applications require not only the application of a single algorithm, but the chaining together of many different processing steps and machine learning models. In this section, we will cover how to use the Pipeline class to simplify the process of building chains of transformations and models. In particular, we will see how to use Pipeline to run all the processing steps at once. Let’s look at how we can use the Pipeline class to express the workflow for training an SVM after scaling the data with MinMaxScaler (a grid search over the pipeline parameters is sketched at the end of the section). First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of your choosing) and an instance of an estimator:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
                                                    random_state=0)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
Here, we created two steps: the first, called “scaler”, is an instance of MinMaxScaler, and the second, called “svm”, is an instance of SVC. One can access the different steps with pipe[0] or pipe[1]. Now we can fit the pipeline, like any other scikit-learn estimator:
pipe.fit(X_train, y_train)
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))
Calling the score method on the pipeline first transforms the test data using the scaler, and then calls the score method on the SVM using the scaled test data. The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline , that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:
print("Pipeline steps:\n{}".format(pipe_short.steps))
print("Pipeline steps:\n{}".format(pipe_long.steps))
One can access the attributes at any step,
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
pipe.fit(cancer.data)
print("Pipeline steps:\n{}".format(pipe.steps))
print(pipe.named_steps["pca"].components_)
pipe.transform(cancer.data)
It is possible to change the value of any control parameter of the different steps
pipe.set_params(pca__n_components=3)
pipe.fit(cancer.data)
pipe.transform(cancer.data).shape
Note the double underscore (__) separating the step name from the parameter name.
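As announced at the beginning of the section, a minimal sketch of a grid search over pipeline parameters (the grid values are arbitrary choices); the step__parameter names follow exactly the double-underscore convention just described:
from sklearn.model_selection import GridSearchCV

# Grid search over the scaler+SVM pipeline defined above; note the "svm__" prefix.
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(pipe_long, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))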
Exercise 8.4

11. Exercises

Author: Curro Perez Bernal

Created: 2024-12-09 Mon 21:21
