Python Lesson 8: Machine Learning

1. Lesson outline

  1. Getting started
  2. Working with data
  3. Traditional Supervised Learning Models
  4. Model Evaluation
  5. Neural Networks

2. Getting started

Machine learning, a research field at the intersection of statistics, artificial intelligence, and computer science that is also known as predictive analytics or statistical learning, is about extracting as much knowledge as possible from datasets. The application of machine learning methods has in recent years become ubiquitous in everyday life.

2.1. Basic definitions

The data are composed of samples characterized by some labels or classes and depending on some features. The most successful machine learning algorithms are those that automate decision-making processes by generalizing after being trained with known examples. This is known as supervised machine learning (SML), where the algorithm finds a way to produce the desired output given an input. In particular, once trained, supervised algorithms are able to produce an output for an input that was never seen before and without human guidance. We will concentrate on different aspects of the two main problems dealt with in SML: classification and regression.

In unsupervised machine learning, only the dataset features are known, and no target is supplied to the algorithm, which then looks for patterns in the data in a completely independent way. Despite the many successful applications of unsupervised learning, these algorithms are usually harder to understand and evaluate than SML ones. Some examples are identifying topics in a set of blog posts, segmenting customers into groups with similar preferences, or detecting abnormal access patterns to a website.

We consider the available data as a table. Each data point is a row of the data table and is called a sample. The different properties that characterize the sample are called the data features and, finally, there are one or more target quantities called the data labels or classes. Each of the properties that describes a sample occupies a column of the table, and they may be numerical (e.g. the age of a customer, the amount of a transaction, the temperature of a star, or the value of a pixel in a figure) or categorical (e.g. gender distinctions, the location of a purchase, or the color of a star).

Classes can also be numerical or categorical. In the first case, we are dealing with a regression problem, while the second case is associated with classification problems. In both cases, in SML, the aim is to build and train a model able to estimate the label or class value given a set of feature values.

There are also problems of the logistic regression type, where a probability is assigned to each sample using a regression model, and from the probability value a category is assigned to the sample.
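A minimal sketch of this idea with scikit-learn's LogisticRegression (the one-feature dataset below is invented for illustration): the regression part assigns each sample a probability, and the classification part thresholds it at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 1-feature dataset: the label is 1 when the feature exceeds ~5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = (X[:, 0] > 5).astype(int)

clf = LogisticRegression().fit(X, y)

# Regression step: a probability for each sample ...
probs = clf.predict_proba([[1.0], [9.0]])[:, 1]
# ... and classification step: thresholding that probability at 0.5
labels = clf.predict([[1.0], [9.0]])
print(probs, labels)
```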

The feature matrix is a table with dimensions equal to the number of samples times number of features.

In SML, data are often split into two subsets, the training and the test data. The first one is used to optimize the model while the second is used to check the performance of the trained model.

The model, usually selected from physical considerations or taking into account the nature of the problem, is an educated guess of how the labels depend on the features. Models usually have free parameters (aka weights) that are varied and optimized to fit the training data.

The estimations obtained from a model are called model predictions and the ground truth are the actual values for data labels.

The difference between model predictions and ground truth values is assessed with a loss function that quantifies the distance between the two data sets. The model training is intended to minimize the loss function.

Another important concept is the decision boundary which, in classification problems, is the hypersurface separating samples belonging to different classes.

Therefore, in SML there are four main ingredients:

  • Dataset
  • Model
  • Loss function
  • Optimization method
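The four ingredients can be sketched end to end with scikit-learn on invented synthetic data; here the model is a linear one, and fit() plays the role of loss function plus optimization method (ordinary least squares minimizes the mean-squared error).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Dataset: synthetic samples with one feature and a noisy linear label
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.05, size=100)

# 2. Model: a linear model with free parameters (weights)
model = LinearRegression()

# 3 & 4. Loss function and optimization method: fit() minimizes the
# mean-squared error over the training data (ordinary least squares)
model.fit(X, y)
print(model.coef_, model.intercept_, mean_squared_error(y, model.predict(X)))
```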

2.2. Software required

The scikit-learn library (homepage https://scikit-learn.org/stable/) is an open-source machine learning library that supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. You can check the available supervised methods at https://scikit-learn.org/stable/supervised_learning.html and the unsupervised ones at https://scikit-learn.org/stable/unsupervised_learning.html

To install scikit-learn simply type:

conda install scikit-learn

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    joblib-1.1.0               |     pyhd8ed1ab_0         210 KB  conda-forge
    scikit-learn-1.1.2         |   py39he5e8d7e_0         8.2 MB  conda-forge
    threadpoolctl-3.1.0        |     pyh8a188c0_0          18 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         8.5 MB

The following NEW packages will be INSTALLED:

  joblib             conda-forge/noarch::joblib-1.1.0-pyhd8ed1ab_0
  scikit-learn       conda-forge/linux-64::scikit-learn-1.1.2-py39he5e8d7e_0
  threadpoolctl      conda-forge/noarch::threadpoolctl-3.1.0-pyh8a188c0_0


Proceed ([y]/n)? 


Downloading and Extracting Packages
scikit-learn-1.1.2   | 8.2 MB    | ##################################### | 100% 
joblib-1.1.0         | 210 KB    | ##################################### | 100% 
threadpoolctl-3.1.0  | 18 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done

An alternative way is to use pip: pip install scikit-learn

Moreover, we will also use the libraries numpy, pandas, scipy, and matplotlib, already used in other course lessons.

3. Working with data

Quality data are the key ingredient for machine learning calculations. Regardless of the complexity of the model, if the data are insufficient or of poor quality, results will inevitably be poor. In general, data collection, quality control, and preprocessing are the most time-consuming aspects of machine learning.

We can distinguish among different data types:

  1. Numerical
    • Integer
    • Floats (continuous)
  2. Categorical
    • Nominal
    • Ordinal

The transformation of categorical into numerical (integer) data types is called data encoding. The most common ways of encoding data are

  • Label encoding: Replace category by an integer.
  • Ordinal encoding: Direct translation for categorical data of the ordinal type.
  • One-hot encoding
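A quick sketch of the three encodings with pandas (the "color" and "size" columns are invented nominal and ordinal features, respectively):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"]})

# Label encoding: replace each category by an arbitrary integer code
df["color_label"] = df["color"].astype("category").cat.codes

# Ordinal encoding: integers that respect the natural order of the categories
order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(order)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(df)
print(one_hot)
```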

Once all features and classes have been encoded, the next step is to perform data transformations in order to simplify the calculations and get a better grasp of the data meaning.

A common transformation is data standardization, which brings the data average to zero and its standard deviation to one as follows

\[ x_{st} = \frac{x-\mu}{\sigma}~, \] where \(\mu\) is the original data mean value and \(\sigma\) is the standard deviation.

Rescaling features to bring them to a common interval of variation is an important step that greatly helps in the successful training of models, and it can be done automatically by scikit-learn. Another common way of standardizing data is bringing them to the \([0,1]\) interval using the min-max transformation

\[ x_{st} = \frac{x-x_{min}}{x_{max}-x_{min}}\in[0,1]~. \]
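Both transformations are available in scikit-learn as the StandardScaler and MinMaxScaler classes; a short sketch on invented data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy single-feature data

# Standardization: zero mean and unit standard deviation
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())   # ~0 and ~1

# Min-max scaling: map the data onto the [0, 1] interval
X_mm = MinMaxScaler().fit_transform(X)
print(X_mm.min(), X_mm.max())      # 0.0 and 1.0
```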

Another very relevant task is to reduce the number of data features, selecting only the really relevant ones. This can be done in two ways:

  • Feature Selection

    Feature selection implies leaving aside features that do not convey relevant information for the data modeling. If, for example, you are working with butterfly data, trying to model their survival success in a national park from different features, it seems safe to assume that their scientific name is an irrelevant feature. Constant-value features are easily detected and can be safely discarded. Sometimes features with a variance below a given threshold value are also discarded.

    In feature selection, the correlation among the different features may help in selecting features that are candidates for being discarded. One has to be careful, because sometimes a feature may seem uncorrelated with the target but may contribute significant information when combined with other features.

    The values of the optimized weights also give a hint of the relevance of a feature, something that is particularly true for linear standardized models.

    Often, models are trained with and without the candidate features, and comparing the results allows for a final decision on whether to remove each feature or not.

  • Feature Transformation

    In this case the features are transformed in order to look for combinations of features that are independent and relevant. The most relevant method in this case is Principal Component Analysis (PCA), where an \(n_s\times n_f\) dataset is transformed in such a way that the original features \(\{X_i\}_{i=1}^{n_f}\) become a resulting feature set \(\{X'_i\}_{i=1}^{n_f}\). The new features are such that \(X'_1\) carries the maximum variance and \(X'_{n_f}\) the minimum possible variance. This is achieved by diagonalizing the \(n_f\times n_f\) covariance matrix

    \[ \Sigma_{jk} = \frac{\sum_{i=1}^{n_s}(X_{ij}-\overline{X}_j)(X_{ik}-\overline{X}_k)}{n_s} \] where \(\overline{X}_j\) is the average value of the j-th feature and the sum runs over the \(n_s\) samples.

    After diagonalization, that is, after solving the eigenvalue equation \(\Sigma \mathbf{v} = \lambda \mathbf{v}\), the covariance matrix becomes diagonal in the new basis, \(\Sigma' = U\Sigma U^\dagger\). Then a number \(m\) of features with the lowest eigenvalues can be removed, and the problem is transformed into a new problem with \(n_f-m\) features. The new feature matrix is

    \[X_{PCA} = X \times V\] where \(V\) is a matrix whose columns are the first \(n_f-m\) eigenvectors. The \(X_{PCA}\) matrix has dimensions \(n_s\times(n_f-m)\) and its covariance matrix, \(\Sigma_{PCA}\), is diagonal with elements \(\lambda_1\) to \(\lambda_{n_f-m}\).

    We will show an example of PCA using a Kaggle dataset with the ground-state energies (labels) of 16,242 molecules (the number of samples, \(n_s\)). The dataset can be downloaded from this link. As described on the Kaggle website, the dataset contains 1,277 columns. The first 1,275 columns are entries in the Coulomb matrix that act as features. The 1,276th column is the PubChem ID from which the molecular structures were obtained; it is an irrelevant feature that can be discarded. The 1,277th column is the atomization energy, calculated by simulations using the Quantum Espresso package.

gs_mol_data = pd.read_csv("files/roboBohr.csv", index_col=0)
gs_mol_data.shape

We can now create the feature matrix X_matrix from the first 1,275 columns of the gs_mol_data dataframe

X_matrix = gs_mol_data[[str(val) for val in range(0, 1275)]]

We now import the PCA class from sklearn and apply the transformation to the X_matrix

from sklearn.decomposition import PCA
pca = PCA()
X_matrix_PCA = pca.fit_transform(X_matrix) 

The input data are centered but not scaled for each feature before applying the SVD. The singular values, whose squares divided by \(n_s-1\) give the eigenvalues of the covariance matrix, are

print(pca.singular_values_)
fig, ax = plt.subplots()
ax.plot(pca.singular_values_)
ax.set_yscale("log")

The explained variance ratios can be obtained as follows; notice that their sum is one, as we are including all eigenvalues

print(pca.explained_variance_ratio_)
print(np.sum(pca.explained_variance_ratio_))

And the sum quickly converges very close to unity.

fig, ax = plt.subplots()
ax.plot(np.cumsum(pca.explained_variance_ratio_)[:100])

Therefore one can reduce the complexity of the problem by removing features with a low explained variance ratio using the argument n_components. To retain only 150 features, make \(m = 1,275-150 = 1,125\)

pca = PCA(n_components=150)
X_matrix_PCA = pca.fit_transform(X_matrix)
print(pca.explained_variance_ratio_)
print(np.sum(pca.explained_variance_ratio_))
fig, ax = plt.subplots()
ax.plot(np.cumsum(pca.explained_variance_ratio_))

The weights of the first 200 original features in each of the resulting principal components are

plt.imshow(pca.components_[:,:200])
plt.colorbar()

The resulting feature matrix, X_matrix_PCA, has dimensions \(n_s\times (n_f-m) = 16,242\times 150\). One can check that this is the matrix obtained with a change of basis of the centered original matrix to the basis composed of the first 150 eigenvectors of the covariance matrix

# Centering the matrix
centered_X_matrix = X_matrix - np.mean(X_matrix, axis = 0)
#
# Change of basis
calc_X_matrix_PCA = np.matmul(centered_X_matrix, pca.components_.T)
#
print(calc_X_matrix_PCA - X_matrix_PCA)

4. Traditional Supervised Learning Models

We will present three of the most common traditional SML models: Linear regression, K-nearest neighbours (KNN), and Decision Tree (DT) models. In all three cases we will use as an example the power emitted at different frequencies by a black body at a temperature of 1,200 K, data that can be found in this link.

read_bb_data = pd.read_csv("files/bb_data.csv", header=None, names=["Frequency", "Rad. power density"])
# Define data
X = (read_bb_data["Frequency"].values).reshape(-1,1)
Y = (read_bb_data["Rad. power density"].values).reshape(-1,1)
# Plot data
plt.plot(X, Y)

This is a simple example, with a single feature (frequency in THz) and whose label is the power emitted.

We load the libraries for the different models

# Loading libraries

# Linear model
from sklearn.linear_model import LinearRegression

# Polynomial model
from sklearn.preprocessing import PolynomialFeatures

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsRegressor

# Decision Tree
from sklearn.tree import DecisionTreeRegressor

In all cases we follow three steps:

  1. Create an object of the chosen model type
  2. Use the fit method to train the model
  3. Use the predict method to predict labels for a given sample set.

Once we know the basics, we will introduce training and testing sets.

4.1. Linear model

In the simplest linear regression models the data trend is fit to a line, that is, the line that is closest to the data is computed. Therefore, for linear models

\[ f_\omega(\mathbf{X}) = \mathbf{\omega}\cdot\mathbf{X} \]

where \(\mathbf{X}=(1, x_1, x_2, \ldots x_{n_f})\) are the problem features (with a constant first component) and \(\mathbf{\omega} = (\omega_0, \omega_1, \ldots \omega_{n_f})\) are the free parameters or weights of the model.

In classification problems, a linear fit is done to obtain the decision boundary that separates the different data types. The best decision boundaries are such that samples from different classes are as segregated as possible, that is, the distance from the samples to the decision boundary is maximized. The second condition is what is used in the definition of the so-called Support Vector Machines (SVM). This name stems from the fact that the samples that are closest to the decision boundary are called support vectors.
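A minimal classification sketch of this idea with scikit-learn's SVC class (the two-blob 2D dataset is invented for illustration; the support_vectors_ attribute holds the samples closest to the boundary):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, scale=0.5, size=(50, 2)),
               rng.normal(loc=+2, scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# A linear SVM maximizes the distance (margin) to the decision boundary
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[-2, -2], [2, 2]]))  # one point near each blob
print(clf.support_vectors_.shape)       # the samples defining the margin
```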

We proceed to fit a linear model to our blackbody data.

# Linear model
reg = LinearRegression().fit(X, Y)
xpred_linear = np.linspace(1,260,10)
ypred_linear = reg.predict(xpred_linear.reshape(-1,1))
print(reg.score(X, Y))
print('Slope: ', reg.coef_)
print('Intercept: ', reg.intercept_)

Linear regression can be easily extended to polynomial models. For example, a quadratic model with two features, \(X\) and \(Y\), will be defined as follows

\[ f_\omega(\mathbf{X}) = \omega_0 + \omega_1 X + \omega_2 Y + \omega_3 X^2 + \omega_4 Y^2 + \omega_5 XY \]

where it is clear that the number of free parameters scales roughly as the number of features raised to the polynomial degree. In order to carry out a fit using a polynomial model, the data need to be recast taking into account the polynomial order and the number of features. We perform a fit using an order-three polynomial

# Polynomial model
pol_degree = 3
poly = PolynomialFeatures(degree=pol_degree, include_bias=True) # degree=2 for quadratic
X_poly = poly.fit_transform(X)
reg_poly = LinearRegression().fit(X_poly, Y)
xpred_poly = np.linspace(1, 260, 25).reshape(-1,1)
X_poly = poly.transform(xpred_poly) 
ypred_poly = reg_poly.predict(X_poly)
print('Coef: ', reg_poly.coef_)
print('Intercept: ', reg_poly.intercept_)

The results of the linear and the polynomial fit are

fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(xpred_linear, ypred_linear, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(xpred_poly, ypred_poly, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()

4.2. KNN model

This is an easy-to-understand model where a sample is classified taking into consideration its distance to the K nearest neighbours. The classification can be decided using a majority vote or weighting the vote by the distance to the corresponding samples. There are different possible distance definitions:

  1. Euclidean distance \(d_{\mathbf{XY}} = \left[\sum_i(X_i-Y_i)^2\right]^{1/2}\)
  2. Manhattan distance \(d_{\mathbf{XY}} = \sum_i|X_i-Y_i|\)
  3. Minkowski distance \(d^{(p)}_{\mathbf{XY}} = \left[\sum_i|X_i-Y_i|^p\right]^{1/p}\)
  4. Chebyshev distance \(d_{\mathbf{XY}} = \max_i|X_i-Y_i|\)

Apart from these four, there are two other commonly used distances. The first one, the Hamming distance, is a distance between binary strings, defined as the number of bits in which the two strings differ. The cosine metric considers the angular distance between points.
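The four distances listed above can be written directly in NumPy (scipy.spatial.distance provides the same metrics); a sketch on two invented points:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((X - Y) ** 2))        # sqrt(1 + 4 + 9)
manhattan = np.sum(np.abs(X - Y))                # 1 + 2 + 3
minkowski3 = np.sum(np.abs(X - Y) ** 3) ** (1 / 3)
chebyshev = np.max(np.abs(X - Y))                # largest coordinate gap
print(euclidean, manhattan, minkowski3, chebyshev)
```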

Apart from classification problems, the KNN algorithm can be used for regression problems. In this case, the model predicts a continuous outcome by averaging the label values of the K nearest neighbours of a new data point in the feature space. We can use this with our blackbody data, using, for example, three nearest neighbours

# K-Nearest Neighbors
knext = 3
knn = KNeighborsRegressor(n_neighbors=knext)
knn.fit(X, Y)
xpred_knn = np.linspace(1,260,100)
ypred_knn = knn.predict(xpred_knn.reshape(-1,1))

4.3. Decision tree algorithms

Among the traditional SML models, decision trees are the most transparent ones, and they allow for an easy explanation of the model results. This is the approach closest to a traditional algorithm, and it is based on a series of conditions on the sample features. The conditions are defined according to the information gain (or uncertainty reduction) achieved in the classification problem with their introduction. We obtain a DT fit with depth dtree to the blackbody data

# Decision Tree
dtree = 4
dt = DecisionTreeRegressor(max_depth=dtree)
dt.fit(X, Y)
xpred_dt = np.linspace(1,260,100)
ypred_dt = dt.predict(xpred_dt.reshape(-1,1))

The results obtained with the different models are

fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(xpred_linear, ypred_linear, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(xpred_poly, ypred_poly, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.plot(xpred_knn, ypred_knn, label = f"{knext}-KNN", color="xkcd:crimson",lw=0.75)
ax.plot(xpred_dt, ypred_dt, label = f"{dtree}-DT", color="xkcd:teal",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()

The residual plot for the different models can be produced once the predicted values for the frequency values used in the fit are computed.

# Residual plot
# Linear case
Ypred_linear = reg.predict(X).reshape(-1,1)
X_poly = poly.transform(X) 
Ypred_poly = reg_poly.predict(X_poly).reshape(-1,1)
Ypred_knn = knn.predict(X).reshape(-1,1)
Ypred_dt = dt.predict(X).reshape(-1,1)
#
# Residual plot
fig, ax = plt.subplots()
ax.plot(X, Ypred_linear-Y, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(X, Ypred_poly-Y, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.plot(X, Ypred_knn-Y, label = f"{knext}-KNN", color="xkcd:crimson",lw=0.75)
ax.plot(X, Ypred_dt-Y, label = f"{dtree}-DT", color="xkcd:teal",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.set_ylabel("Residuals")
ax.legend()

4.4. Loss functions

The agreement between model predictions and data samples is quantified with a loss function. Many loss functions are distances, fulfilling the symmetry property and the triangle inequality, but this is not always the case. The most common loss functions are:

  1. Mean-Squared Error (MSE) \[L_{MSE} = \frac{\sum_i^{n_s}(Y_i-Y_i^{(pred)})^2}{n_s}\]
  2. Mean Absolute Error (MAE) \[L_{MAE} = \frac{\sum_i^{n_s}|Y_i-Y_i^{(pred)}|}{n_s}\]
  3. l-norm Loss Function \[L_{l} = \left[\frac{\sum_i^{n_s}|Y_i-Y_i^{(pred)}|^l}{n_s}\right]^{1/l}\] This case reduces to \(L_{MAE}\) for \(l=1\) and to \(L_{RMSE}\) for \(l = 2\). In the \(l\to\infty\) case the loss function is \(L_\infty = \max_i|Y_i-Y^{(pred)}_i|\), known as the max error. It defines a bound for all errors and is useful in worst-case scenarios.
  4. Cross Entropy \[ L_{CE}(Y, Y^{(pred)}) = -\sum_i^{n_s} \left[Y_i \log{(Y^{(pred)}_i)} + (1-Y_i)\log{(1-Y^{(pred)}_i)}\right] \] A popular choice for logistic regression models and classification models that deal with probabilities.
  5. Hinge function \[ L_{Hinge}(Y, f(X))= \max{(0,1-Yf(X))} \]
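The cross-entropy loss, for instance, can be written directly in NumPy; a sketch for binary labels and predicted probabilities (the clipping constant eps is a common numerical safeguard, not part of the definition):

```python
import numpy as np

def cross_entropy(y, y_pred, eps=1e-12):
    # Clip probabilities away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

y = np.array([1, 0, 1])
good = np.array([0.9, 0.1, 0.8])   # confident, mostly correct predictions
bad = np.array([0.4, 0.6, 0.3])    # predictions on the wrong side
print(cross_entropy(y, good), cross_entropy(y, bad))
```

The better the predicted probabilities match the true labels, the smaller the loss.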

We can define the l-norm loss function and calculate it for the blackbody radiation fits shown above

### Loss function definition
def loss_function_l(y, ycalc, l=2):
    # l-norm loss: [mean(|y - ycalc|**l)]**(1/l)
    return np.mean(np.abs(y - ycalc)**l)**(1/l)
#####
# RMSE loss
l_value = 2
# Linear case
Ypred_linear = reg.predict(X).reshape(-1,1)
print("linear: ", loss_function_l(Y, Ypred_linear, l=l_value))
X_poly = poly.transform(X) 
Ypred_poly = reg_poly.predict(X_poly).reshape(-1,1)
print("poly: ", loss_function_l(Y, Ypred_poly, l=l_value))
Ypred_knn = knn.predict(X).reshape(-1,1)
print("knn: ", loss_function_l(Y, Ypred_knn, l=l_value))
Ypred_dt = dt.predict(X).reshape(-1,1)
print("dt: ", loss_function_l(Y, Ypred_dt, l=l_value))

5. Model Evaluation

In order to carry out an unbiased model evaluation, available data are usually split into two subsets, the training set and the test set: the first one is used to train the model and search for optimal model parameter values, while the second one is used to test the predictive power of the obtained model. Both datasets should ideally come from the same original dataset, whose elements are shuffled and randomly picked.

It is important to make a random selection of data. To illustrate this point, let’s make train and test sets with half the data of our blackbody radiation data.

# Number of data
total_num_data = X.shape[0]
print(total_num_data)
# Case 0  :: Divide train and test by half of data
train_dataset_0 = X[:total_num_data//2,:], Y[:total_num_data//2,:] 
test_dataset_0 = X[total_num_data//2:,:], Y[total_num_data//2:,:]
#

One can compare the loss function in this case with a second model where the data are randomly split into two sets. This can be done by hand

# Case 1 :: Take random data
rng = np.random.default_rng()
random_selection = np.sort(rng.choice(total_num_data, size = total_num_data//2, replace=False))
mask = np.full_like(X, True, dtype=bool)
mask[random_selection,:] = False
train_dataset_1 = X[random_selection,:], Y[random_selection,:]
test_dataset_1 = X[mask].reshape((-1,1)), Y[mask].reshape((-1,1))

Or, better, using the provided function for data subsampling train_test_split

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, shuffle=True
)

The results for the two different splittings can be computed for the different models and compared

# Linear model 
reg_0 = LinearRegression().fit(train_dataset_0[0], train_dataset_0[1])
y_0_train_linear = reg_0.predict(train_dataset_0[0]).reshape(-1,1)
y_0_test_linear = reg_0.predict(test_dataset_0[0]).reshape(-1,1)
loss_train_0_linear = loss_function_l(train_dataset_0[1], y_0_train_linear,l=2)
loss_test_0_linear = loss_function_l(test_dataset_0[1], y_0_test_linear,l=2)
print(loss_train_0_linear, loss_test_0_linear)
# Linear model 1
reg_1 = LinearRegression().fit(train_dataset_1[0], train_dataset_1[1])
y_1_train_linear = reg_1.predict(train_dataset_1[0]).reshape(-1,1)
y_1_test_linear = reg_1.predict(test_dataset_1[0]).reshape(-1,1)
loss_train_1_linear = loss_function_l(train_dataset_1[1], y_1_train_linear,l=2)
loss_test_1_linear = loss_function_l(test_dataset_1[1], y_1_test_linear,l=2)
print(loss_train_1_linear, loss_test_1_linear)
# Polynomial model 0
reg_poly_0 = LinearRegression().fit(poly.transform(train_dataset_0[0]), train_dataset_0[1])
#
y_0_train_poly = reg_poly_0.predict(poly.transform(train_dataset_0[0])).reshape(-1,1)
y_0_test_poly = reg_poly_0.predict(poly.transform(test_dataset_0[0])).reshape(-1,1)
#
loss_train_0_poly = loss_function_l(train_dataset_0[1], y_0_train_poly,l=2)
loss_test_0_poly = loss_function_l(test_dataset_0[1], y_0_test_poly,l=2)
print(loss_train_0_poly, loss_test_0_poly)
# Poly 1
reg_poly_1 = LinearRegression().fit(poly.transform(train_dataset_1[0]), train_dataset_1[1])
#
y_1_train_poly = reg_poly_1.predict(poly.transform(train_dataset_1[0])).reshape(-1,1)
y_1_test_poly = reg_poly_1.predict(poly.transform(test_dataset_1[0])).reshape(-1,1)
#
loss_train_1_poly = loss_function_l(train_dataset_1[1], y_1_train_poly,l=2)
loss_test_1_poly = loss_function_l(test_dataset_1[1], y_1_test_poly,l=2)
print(loss_train_1_poly, loss_test_1_poly)
#
fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(train_dataset_0[0], y_0_train_linear, label="train linear 0", color="xkcd:electric blue",lw=0.75)
ax.plot(test_dataset_0[0], y_0_test_linear, label="test linear 0", color="xkcd:navy blue",lw=0.75)
ax.plot(train_dataset_1[0], y_1_train_linear, label="train linear 1", color="xkcd:forest green",lw=0.75)
ax.plot(test_dataset_1[0], y_1_test_linear, label="test linear 1", color="xkcd:light green",lw=0.75)
ax.plot(train_dataset_0[0], y_0_train_poly, label="train poly 0", color="xkcd:orange",lw=0.75)
ax.plot(test_dataset_0[0], y_0_test_poly, label="test poly 0", color="xkcd:teal",lw=0.75)
ax.plot(train_dataset_1[0], y_1_train_poly, label="train poly 1", color="xkcd:pink",lw=0.75)
ax.plot(test_dataset_1[0], y_1_test_poly, label="test poly 1", color="xkcd:crimson",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()

5.1. Validation curve

The data splitting helps in detecting and avoiding overfitting and underfitting. In order to explain this, the concepts of bias and variance are introduced. The bias is the loss associated with the training set. Typically the bias is smaller than the loss for the test dataset. The difference between the test loss and the bias is the variance. The bias is related to model complexity: we can always reduce the bias by increasing model complexity, adding more free parameters (tuning the model hyperparameters). But then we are on the dangerous path of overfitting. As John von Neumann said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” The bias reduction comes at a high cost, the loss of predictive power, which shows up as a large variance increase; a low bias together with a large variance is a sure red flag for overfitting. The opposite phenomenon, underfitting, occurs when the model complexity is not high enough to reproduce the data, which translates into a high bias and a low variance.

In order to assess the right model complexity, one needs to balance bias and variance by playing with model complexity. The right tool to check this is the validation curve, where test and training losses are plotted as a function of model complexity (hyperparameters). Since the loss for test data can be computed as the sum of bias plus variance, a plot of the bias and the variance versus the hyperparameters offers similar information. We can check this by making a polynomial fit to some data and increasing the order of the polynomial.

We first create some data and add to them random normal noise

# Abscissa values
npoints = 32
#
x_values = np.linspace(-5, 5, npoints)
#
# Parameter values (a_0 + a_1*x + a_2*x**2 + a_3*x**3 + a_4*x**4)*exp(-0.25*x)
a_par = [50.0, -5.5, -2.5, 0.5, 0.1]
a_par.reverse()
#
# Calculate values
y_values = a_par.pop(0)*x_values
a_0 = a_par.pop(-1)
for a_val in a_par:
    y_values = (y_values + a_val)*x_values
y_values += a_0
#
y_values *= np.exp(-0.25*x_values)
#
# Build Dataframe and add noise
rng = np.random.default_rng(1234)
pd_data = pd.DataFrame({"X": x_values, 
                          "Y": y_values + rng.normal(loc = 0, scale = 5.12, size = npoints)})
#
pd_data.plot(x="X", y="Y", ls='', marker=".")
#
# Splitting data 
from sklearn.model_selection import train_test_split
#
X_train, X_test, Y_train, Y_test = train_test_split(
    pd_data["X"].values.reshape(-1, 1), pd_data["Y"].values.reshape(-1, 1), test_size=0.6, shuffle=True
)

We perform polynomial fits for different polynomial orders and check the resulting bias and variance

### Dependence with Polynomial degree
in_losses = np.zeros(11)
out_losses = np.zeros(11)
#

# Get axis limits from data
x_min, x_max = np.min(x_values), np.max(x_values)
y_min, y_max = np.min(pd_data["Y"]), np.max(pd_data["Y"])

# Main loop
for pol_degree in range(1,12):
    # Define polynomial case
    poly = PolynomialFeatures(degree=pol_degree, include_bias=True) 
    #
    X_poly = poly.fit_transform(X_train)
    reg_poly = LinearRegression().fit(X_poly, Y_train)
    #
    Y_train_poly = reg_poly.predict(X_poly).reshape(-1,1)
    X_poly = poly.transform(X_test)
    Y_test_poly = reg_poly.predict(X_poly).reshape(-1,1)
    #
    #fig, ax = plt.subplots()
    #ax.plot(X_train, Y_train, ls='', marker='.',label="Train data")
    #ax.plot(X_test, Y_test,  ls='', marker='.',label="Test data")
    #ax.plot(X_train, Y_train_poly,  ls='', marker='+', label=f"Train order {pol_degree}")
    #ax.plot(X_test, Y_test_poly,  ls='', marker='*', label=f"Test order {pol_degree}")
    #ax.set_xlim((x_min, x_max))
    #ax.set_ylim((y_min, y_max))
    #ax.legend()
    in_losses[pol_degree-1] = loss_function_l(Y_train, Y_train_poly, l=2)
    out_losses[pol_degree-1] = loss_function_l(Y_test, Y_test_poly, l=2)
fig, ax = plt.subplots()
ax.plot(np.arange(1,12), in_losses, label="Bias")
ax.plot(np.arange(1,12),out_losses-in_losses, label="Variance")
ax.legend()
ax.set_yscale("log")
ax.set_xlabel("Polynomial degree")
ax.set_ylabel("Loss")

5.2. Regularization of parameters

Models can quickly become too complex. A way to control this is through parameter regularization, penalizing large parameter values. This is achieved by including the parameter values in the loss function as

\[ L^{(1)}_R(Y, Y^{(pred)}) = L(Y, Y^{(pred)}) + \lambda \sum_i|\omega_i|~,~~ L^{(2)}_R(Y, Y^{(pred)}) = L(Y, Y^{(pred)}) + \lambda \sum_i\omega_i^2~. \] For \(\lambda = 0\) the original model results are recovered and for increasing values of \(\lambda\) large free parameter values are increasingly penalized. Regularization in most scikit-learn models is implemented as a hyperparameter.

For example, in scikit-learn the \(L^{(2)}\) regularization in the linear regression case is performed using the Ridge class, with a hyperparameter \(\alpha\) that plays the role of \(\lambda\). We use this in the following example, with the same data as in the previous subsection.

### Dependence with regularization
#
from sklearn.linear_model import Ridge
#
#
# Define polynomial case
pol_degree = 4
poly = PolynomialFeatures(degree=pol_degree, include_bias=True) 
X_train_poly = poly.fit_transform(X_train)
#
# Main loop
for alpha_val in [0.01, 0.1, 1, 10]:
    #
    reg_poly = Ridge(alpha=alpha_val).fit(X_train_poly, Y_train)
    #
    Y_train_poly = reg_poly.predict(X_train_poly).reshape(-1,1)
    X_test_poly = poly.transform(X_test)
    Y_test_poly = reg_poly.predict(X_test_poly).reshape(-1,1)
    #
    fig, ax = plt.subplots(2)
    ax[0].plot(X_train, Y_train, ls='', marker='.',label="Train data")
    ax[1].plot(X_test, Y_test,  ls='', marker='.',label="Test data")
    ax[0].plot(X_train, Y_train_poly, ls='', marker='+', label=f"Train $\\alpha= {alpha_val}$")
    ax[1].plot(X_test, Y_test_poly,  ls='', marker='*', label=f"Test $\\alpha= {alpha_val}$")
    ax[0].legend()
    ax[1].legend()

The \(L^{(1)}\) regularization in the linear regression case is performed using the Lasso class, which also has an alpha hyperparameter with the same function as in the Ridge case. For numerical reasons, using alpha = 0 with the Ridge or Lasso objects is not advised; the LinearRegression object should be used instead.
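To illustrate the practical difference between the two penalties, here is a small sketch on synthetic data (the dataset is invented for the example); the L1 penalty of Lasso tends to drive irrelevant coefficients to, or very near, zero, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {"ols": LinearRegression(), "ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=0.1)}
for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 2))
# The Lasso coefficients for the three irrelevant features end up at
# (or very near) zero, while Ridge merely shrinks all coefficients.
```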

5.3. Model Tuning

Apart from optimizing the free parameters during training, a way to fine-tune a given model is to repeat the training with different hyper-parameter values. This can be done in different ways through parameter discretization. A possible approach is the grid search method: a regular grid is defined for each hyper-parameter, and the model is trained and evaluated at each point of the grid. This scales exponentially with the number of hyper-parameters and is computationally very expensive. Another possibility is to perform a random search, sampling hyper-parameter values from given distributions.

In order to show how to carry out such calculations, we train an SVR model to fit the blackbody data using a grid search approach.

# Model tuning
svr_model = SVR()
#
from sklearn.model_selection import GridSearchCV
#
# Fit model tuning hyper-parameter values using a grid search
# 
hyperparameter_grid = {"C":np.arange(1,8,2), "kernel":["rbf", "poly"]}
#
opt_clf_grid = GridSearchCV(estimator=svr_model, param_grid=hyperparameter_grid)
opt_clf_grid.fit(training_set[0], training_set[1])
best_model_grid = opt_clf_grid.best_estimator_
grid_train_results = best_model_grid.predict(training_set[0])
grid_test_results = best_model_grid.predict(test_set[0])
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].plot(training_set[0], training_set[1], color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[0].plot(test_set[0], test_set[1], color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[0].plot(training_set[0], grid_train_results, label="Training results", color="xkcd:electric blue",lw=0.75)
ax[0].plot(test_set[0], grid_test_results, label="Test results", color="xkcd:navy blue",lw=0.75, ls="--")
ax[0].set_xlabel("Frequency (THz)")
ax[0].set_ylabel("Fit results")
ax[0].legend()
ax[1].plot(training_set[0].ravel(), training_set[1].ravel()-grid_train_results, color="xkcd:crimson", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[1].plot(test_set[0].ravel(), test_set[1].ravel()-grid_test_results, color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[1].set_xlabel("Frequency (THz)")
ax[1].set_ylabel("Residuals")
ax[1].legend()

We can repeat the calculation for the same SVR model using a random search in hyperparameter space.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import gamma
a=2
x = np.linspace(gamma(a, scale=5).ppf(0.01), gamma(a,scale=5).ppf(0.99), 100)
plt.plot(x, gamma(a, scale=5).pdf(x),'r-', lw=2.5, alpha=0.6, label='gamma pdf')
#
# Fit model tuning hyper-parameter values using a random search
rng = np.random.default_rng()
# 
hyperparameter_dist = {"C":gamma(a=2,scale=5), "gamma":["scale"], "kernel":["rbf"]}
#
opt_clf_random = RandomizedSearchCV(estimator=svr_model, param_distributions=hyperparameter_dist)
opt_clf_random.fit(training_set[0], training_set[1])
#
best_model_random = opt_clf_random.best_estimator_
#
random_train_results = best_model_random.predict(training_set[0])
random_test_results = best_model_random.predict(test_set[0])
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].plot(training_set[0], training_set[1], color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[0].plot(test_set[0], test_set[1], color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[0].plot(training_set[0], random_train_results, label="Training results", color="xkcd:electric blue",lw=0.75)
ax[0].plot(test_set[0], random_test_results, label="Test results", color="xkcd:navy blue",lw=0.75, ls="--")
ax[0].set_xlabel("Frequency (THz)")
ax[0].set_ylabel("Fit results")
ax[0].legend()
ax[1].plot(training_set[0].ravel(), training_set[1].ravel()-random_train_results, color="xkcd:crimson", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[1].plot(test_set[0].ravel(), test_set[1].ravel()-random_test_results, color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[1].set_xlabel("Frequency (THz)")
ax[1].set_ylabel("Residuals")
ax[1].legend()

In any case, the test dataset should not be used to optimize the hyperparameter values; this is what is known as data leakage. A possible solution, if one has a large enough dataset, is to split it into three subsets: one for training, a second one for validation purposes, and a third one for testing.
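The three-way split can be sketched with two successive calls to train_test_split (the data and proportions below are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve off a 20% test set, then split the remainder into
# training (60% of the total) and validation (20% of the total) sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_tr), len(X_val), len(X_test))  # 60 20 20
```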

5.4. Learning curve

Another factor that sometimes helps to improve the results is increasing the number of samples involved in the calculations. The validation curves depend not only on the model complexity but also on the number of samples \(n_s\). Often, the required model complexity tends to increase with the number of samples.

In most cases, more data reduces the variance, not the bias. If this is the case, the advantage provided by the inclusion of more data is relevant only until the variance becomes small. Once this point is reached, the results are no longer sensitive to an enlarged dataset. This is an important fact, as increasing the number of samples is sometimes expensive, cumbersome, or plainly impossible. We calculate, for the black body radiation data, the in- and out-losses for different numbers of samples.

### Dependence with number of data
in_losses = np.zeros(20)
out_losses = np.zeros(20)
# Define polynomial case
pol_degree = 2
poly = PolynomialFeatures(degree=pol_degree, include_bias=True) 
#
# Main loop
for index in range(1,21):
    #print(index)
    # Taking index*5% of the samples
    random_selection = np.sort(rng.choice(total_num_data, size = total_num_data*index//20, replace=False))
    new_set = X[random_selection,:], Y[random_selection,:]
    #
    # Splitting into train and test sets with 50% of the data
    random_selection = np.sort(rng.choice(total_num_data*index//20, size = total_num_data*index//40, replace=False))
    mask = np.full_like(new_set[0], True, dtype=bool)
    mask[random_selection,:] = False
    train_dataset = new_set[0][random_selection,:], new_set[1][random_selection,:]
    test_dataset = new_set[0][mask].reshape((-1,1)), new_set[1][mask].reshape((-1,1))
    #
    X_poly = poly.fit_transform(train_dataset[0])
    reg_poly = LinearRegression().fit(X_poly, train_dataset[1])
    #
    y_train_poly = reg_poly.predict(X_poly).reshape(-1,1)
    X_poly = poly.transform(test_dataset[0])
    y_test_poly = reg_poly.predict(X_poly).reshape(-1,1)
    #
    in_losses[index-1] = loss_function_l(train_dataset[1], y_train_poly, l=2)
    out_losses[index-1] = loss_function_l(test_dataset[1], y_test_poly, l=2)

To assess whether the number of samples is large enough, the right tool is the learning curve, where in- and out-losses are depicted as a function of the size of the employed dataset.

fig, ax = plt.subplots()
ax.plot(np.arange(5,105,5), in_losses, label="In Loss")
ax.plot(np.arange(5,105,5),out_losses, label="Out Loss")
ax.legend()
ax.set_xlabel("Percentage of samples")
ax.set_ylabel("Loss")
ax.set_title(f"Order {pol_degree} polynomial fit")

5.5. Model Metrics

From what we have already explained, the lower the loss values for a model, the better the model. To interpret them, one needs to know the problem and the quantities involved. However, in some cases loss values do not provide enough information to correctly assess model performance. A possible option is to use the \(R^2\) metric, also called the coefficient of determination, defined as

\[ R^2(Y, Y^{(pred)}) = 1 - \frac{\sum_i (Y_i -Y^{(pred)}_i)^2 }{\sum_i (Y_i -Y_{avg})^2 }~, \]

where \(Y_{avg}\) is the mean value of the \(Y\) class. This coefficient, which in simple linear regression is the square of the correlation coefficient, provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. In most cases, \(0\le R^2\le 1\), though in some cases it can be negative.
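As a quick check of the definition, \(R^2\) can be computed by hand and compared with scikit-learn's r2_score (the data are invented for the example):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7])

ss_res = np.sum((y - y_pred) ** 2)      # sum of squared residuals
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y, y_pred))  # the two values agree (about 0.991)
```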

For classification models, the model accuracy, i.e. the percentage of samples correctly classified, is the most common metric. But other metrics can be more relevant, and the choice of the right metric depends on the model employed and the problem at hand. For example, the different options in the classification problem may not be equivalent, and errors in some of them could be more costly than in others; this is also the case with very imbalanced classes.

Complete information about the model performance can be obtained from the confusion matrix. This is a square matrix with as many rows and columns as there are classes in the problem. The rows of the matrix are associated with real class values and the columns with predicted class values. Therefore, a matrix element \(CF_{34} = 5\) tells us that in five cases the label 3 was taken to be label 4. A diagonal matrix element is the number of cases of the corresponding label correctly identified. A perfect model would output a diagonal matrix.
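A toy illustration with sklearn.metrics.confusion_matrix (the labels are invented for the example); remember that rows correspond to true classes and columns to predicted ones:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# Row 0: two 0s correctly identified and one 0 mistaken for a 1;
# row 2: one 2 mistaken for a 0. A perfect model gives a diagonal matrix.
```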

Some of the more popular metrics for classification problems can be easily obtained from the data in the confusion matrix:

  • Recall: Sensitivity with respect to a class of interest.

    \[R = \frac{True Positive}{True Positive + False Negative}\]

  • Precision: Precision of a classifier with respect to its prediction for a given class.

    \[P = \frac{True Positive}{True Positive + False Positive}\]

  • f1-score: Harmonic mean of recall and precision.

    \[f1 = \frac{2 P R}{R + P}\]

All of these values are relative, and they are often provided together with the support value, that is, the number of occurrences of each label in the dataset. It can be easily calculated as the sum of each row of the confusion matrix. A low support value means that the previous metrics can be misleading.
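These quantities can be derived by hand from the confusion matrix and checked against the corresponding scikit-learn metrics (a small sketch with invented binary labels):

```python
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tp, fn, fp = cm[1, 1], cm[1, 0], cm[0, 1]

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)  # number of occurrences of each true label

print(recall, precision, f1, support)
```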

As an application to a classification problem, we are going to work through an image analysis problem using a dataset of hand-written digits from the UC Irvine Machine Learning Repository included in the scikit-learn package. Follow this link for the complete list of datasets. We first read the data

from sklearn import datasets
digits = datasets.load_digits()
#
#
print("Keys of digit dataset: \n{}".format(digits.keys()))
print("Description of digit dataset: \n{}".format(digits['DESCR']))
#
print(digits.data)
print(digits.data.shape)
print(digits.target)
print(digits.target.shape)

The data attribute provides the features and target provides the class of the different instances. Note that each sample is a flattened 2D 8x8 image (64 features). In this particular dataset, the original 8x8 arrays are available through the images attribute

print(digits.images[0])
digits.images.shape

The classification task is to predict, given an image, which digit it represents. We are given samples of each of the ten possible classes (the digits zero through nine), on which we fit a Support Vector Classifier estimator (class sklearn.svm.SVC) to be able to predict the classes to which unseen samples belong. We start by splitting our original dataset into training and test sets, using half the data for training and the other half for testing.

#from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False)

We now train our model and calculate the predicted values to compare with the test dataset.

from sklearn import svm
clf = svm.SVC()
# Train
clf.fit(X_train, y_train)
# Compute predictions
predicted = clf.predict(X_test)

Now we can analyze the performance of our learning procedure. The classification report includes, apart from the previous metrics, the accuracy (ratio of correctly predicted labels over the total number of labels), the macro average (the unweighted mean per label), the weighted average (the support-weighted mean per label), and the sample average (only for multilabel classification).

from sklearn import metrics
#
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
#
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

We proceed with a second example using SVM with a different sklearn dataset, the Wisconsin Breast Cancer dataset (load_breast_cancer dataset), which records clinical measurements of breast tumors. Each tumor is labeled as benign (for harmless tumors) or malignant (for cancerous tumors), and the task is to train a model able to predict whether a tumor is malignant based on the available tissue measurements. We start by loading the dataset and splitting the data.

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=66)

In this case we include two new arguments. The random_state argument is None by default. It can be given an integer value that acts as a seed for the shuffling applied to the data, to obtain a reproducible output across multiple function calls. The stratify argument, which also has None as its default value, is very useful when the relative frequencies of the different classes are very different. In such cases, by sheer luck, it may happen that the training or test sets contain no samples from one of the classes in the dataset. To avoid this, the argument is given an array with the labels, and the splitting is done in such a way that relative label frequencies are approximately preserved in each fold.
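The effect of stratify can be checked on an imbalanced toy label vector (invented for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0 and only 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
# Class frequencies are preserved in both folds
print(np.bincount(y_tr), np.bincount(y_te))  # [45  5] [45  5]
```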

from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))

The model substantially overfits, as the score is perfect on the training set but only 63% accurate on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the hyperparameters and to the scaling of the data. In principle this can be partially solved using the option gamma='scale'

svc = SVC(gamma='scale')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))

But we try a more general solution. In principle, for numerical reasons, all features should vary on a similar scale. We check the scale of the different features in the present dataset

fig, ax = plt.subplots()
ax.plot(X_train.min(axis=0), 'o', label="min")
ax.plot(X_train.max(axis=0), '^', label="max")
ax.legend(loc=4)
ax.set_xlabel("Feature index")
ax.set_ylabel("Feature magnitude")
ax.set_yscale("log")

The solution is to preprocess the data by rescaling them. We can do this by hand, but sklearn provides an efficient way to standardize the data (zero mean and unit variance per feature) using the StandardScaler class

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # StandardScaler (zero mean, unit variance) object
# Scaling features
X = cancer.data
scaler.fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, cancer.target, stratify=cancer.target, random_state=66)
#
# Train model
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))

Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar and training is not so close to 100% accuracy. From here, we can try increasing the C or gamma hyperparameters to fit a more complex model.

svc = SVC(C=1000)
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(
    svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))

6. Neural Networks: the multilayer perceptron (MLP)

Deep learning is the state-of-the-art tool in many machine learning applications, but its algorithms are very carefully tailored to specific uses and their training is exceptionally expensive in computational terms. We will discuss a relatively simple method, dubbed the Multilayer Perceptron (MLP), that can be used either for classification or regression. It can serve as a starting point to get later involved with far more complex deep learning methods.

As with all Neural Networks, this model has a large number of free coefficients (or weights) that are optimized during training: there is one between every input and every hidden unit (which make up a hidden layer), and one between every unit in each hidden layer and the output. We start by importing the MLPClassifier class

from sklearn.neural_network import MLPClassifier
# Auxiliary modules
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
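Once a model is fitted, the weights mentioned above can be inspected through the coefs_ and intercepts_ attributes. A minimal sketch on synthetic data (invented for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # 50 samples, 4 input features
y = (X[:, 0] > 0).astype(int)  # binary target

mlp = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                    max_iter=500, random_state=0).fit(X, y)
# One weight matrix per layer transition: inputs -> hidden and hidden -> output
print([w.shape for w in mlp.coefs_])       # [(4, 10), (10, 1)]
print([b.shape for b in mlp.intercepts_])  # [(10,), (1,)]
```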

As a first application, we use a simple toy-model dataset generator provided by scikit-learn called make_moons, very useful for visualizing clustering and classification algorithms. We generate a dataset with 200 points, two features, and one binary class, and we plot the data.

from sklearn.datasets import make_moons
#
# Generate data
X, y = make_moons(n_samples=200, noise=0.25, random_state=4)
marker_hash = {0: "*", 1: "o"}
#
# Plot data
fig, ax = plt.subplots()
for value in range(2):
    marker = marker_hash[value]
    mask = y == value
    ax.scatter(X[mask,0], X[mask,1], marker=marker)

We now proceed to split the data into training and test datasets, fit the training data using the MultiLayer Perceptron with a single hidden layer of seventy nodes, and calculate the predicted values for the test dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=45)
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(70,), random_state=0,max_iter=300).fit(X_train, y_train)
Y_predicted = mlp.predict(X_test)

We can now plot the test values and the predicted test values, using the same markers with different size and transparency, to check which cases are misclassified.

marker_hash = {0: "s", 1: "o"}
color_hash = {0:"xkcd:navy", 1: "xkcd:crimson"}
fig, ax = plt.subplots()
for value in range(2):
    marker = marker_hash[value]
    colorv = color_hash[value]
    mask_test = y_test == value
    ax.scatter(X_test[mask_test,0], X_test[mask_test,1], marker=marker, color=colorv, s=20)
    mask_predicted = Y_predicted == value
    ax.scatter(X_test[mask_predicted,0], X_test[mask_predicted,1], marker=marker, alpha = 0.35, s=120, color=colorv)

We can assess the quality of the obtained classification using the classification report and the confusion matrix as in previous examples.

# from sklearn.metrics import classification_report
print(
    f"Classification report for classifier {mlp}:\n"
    f"{classification_report(y_test, Y_predicted)}\n"
)
#from sklearn.metrics import ConfusionMatrixDisplay
#
disp = ConfusionMatrixDisplay.from_predictions(y_test, Y_predicted)
disp.figure_.suptitle("Confusion Matrix")

A more sophisticated way of plotting the data uses DecisionBoundaryDisplay, which plots the test dataset together with the boundary between the two possible values in the binary classification. It can be done as follows

from sklearn.inspection import DecisionBoundaryDisplay
feature_1, feature_2 = np.meshgrid(
    np.linspace(X[:, 0].min(), X[:, 0].max()),
    np.linspace(X[:, 1].min(), X[:, 1].max())
)
grid = np.vstack([feature_1.ravel(), feature_2.ravel()]).T
grid_y_pred = np.reshape(mlp.predict(grid), feature_1.shape)
display = DecisionBoundaryDisplay(
    xx0=feature_1, xx1=feature_2, response=grid_y_pred
)
display.plot()
display.ax_.scatter(
    X_test[:,0], X_test[:,1], c=y_test, edgecolor="black"
)
plt.show()

We now consider a second example using the Stellar Classification Dataset – SDSS17 from Kaggle: Link SDSS17.

The data consist of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and a single class column that identifies it as either a star, a galaxy, or a quasar. We start by reading the data into a dataframe

stellar_data = pd.read_csv("files/star_classification.csv")
stellar_data.keys()

We need to select a feature matrix, X, removing those features that we consider irrelevant for the classification, and a label vector, Y.

X = stellar_data[["alpha", "delta", 'u', 'g', 'r', 'i', 'z', 'redshift']]
Y = stellar_data['class']

We now split the data into the training and test sets, and we can make a first attempt at classification using the MLP class

X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y, random_state=42)
#
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10, 10], random_state=0).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
y_predicted = mlp.predict(X_test)
print(
    f"Classification report for classifier {mlp}:\n"
    f"{classification_report(y_test, y_predicted)}\n"
)
#from sklearn.metrics import ConfusionMatrixDisplay
#
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_predicted)
disp.figure_.suptitle("Confusion Matrix")

In almost all cases, it is convenient to scale the features.

#from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # StandardScaler (zero mean, unit variance) object
# Scaling features
scaler.fit(X)
scaled_X = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(scaled_X, Y, stratify=Y, random_state=42)
#
# Train the model
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10,], random_state=0, max_iter=400, tol=0.005).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
y_predicted = mlp.predict(X_test)
print(
    f"Classification report for classifier {mlp}:\n"
    f"{classification_report(y_test, y_predicted)}\n"
)

While the MLPClassifier and MLPRegressor provide easy-to-use interfaces for the most common neural network architectures, they only capture a small subset of what is possible with neural networks. If you are interested in working with more flexible or larger models, we encourage you to look beyond scikit-learn into the superb deep learning libraries that are out there. For Python users, the most well-established are Keras, TensorFlow, and PyTorch.

Author: Curro Perez-Bernal

Created: 2025-11-20 jue 21:36
