Python Lesson 8: Machine Learning
Table of Contents
1. Lesson outline
- Getting started
- Working with data
- Traditional Supervised Learning Models
- Model Evaluation
- Neural Networks
2. Getting started
Machine learning is a research field at the intersection of statistics, artificial intelligence, and computer science, also known as predictive analytics or statistical learning; it is about extracting as much knowledge as possible from datasets. In recent years, the application of machine learning methods has become ubiquitous in everyday life.
2.1. Basic definitions
The data consist of samples, each characterized by a set of features and, possibly, by one or more labels or classes. The most successful machine learning algorithms are those that automate decision-making processes by generalizing after being trained with known examples. This is known as supervised machine learning (SML): the algorithms find a way to produce the desired output given an input. In particular, once trained, supervised algorithms are able to produce an output for an input that was never seen before, without human guidance. We concentrate on different aspects of the two main problems dealt with in SML: classification and regression.
In unsupervised machine learning, only the dataset features are known, and no target is supplied to the algorithm, which then looks for patterns in the data in a completely independent way. Despite the many successful applications of unsupervised learning, these algorithms are usually harder to understand and evaluate than SML ones. Some examples are identifying topics in a set of blog posts, segmenting customers into groups with similar preferences, or detecting abnormal access patterns to a website.
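As a quick illustration of unsupervised learning, the following sketch groups synthetic two-feature "customer" data with k-means clustering; the data and all parameter values here are made up for illustration.

```python
# Illustrative unsupervised example: k-means clustering on two
# synthetic "customer" groups (all values made up).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_a = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(50, 2))
group_b = rng.normal(loc=[4.0, 4.0], scale=0.2, size=(50, 2))
X = np.vstack([group_a, group_b])

# No labels are supplied: the algorithm looks for the groups on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```

No ground truth is used at any point; the cluster labels are found from the feature values alone.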
We consider the available data as a table. Each data point is a row of the data table and is called a sample. The different properties that characterize a sample are called the data features and, finally, there are one or more target quantities called the data labels or classes. Each of the properties that describes a sample occupies a column of the table, and they may be numerical (e.g. the age of a customer, the amount of a transaction, the temperature of a star, or the value of a pixel in a figure) or categorical (e.g. gender distinctions, the location of a purchase, or the color of a star).
Classes can also be numerical or categorical. In the first case we are dealing with a regression problem, while the second case is associated with classification problems. In both cases, the aim in SML is to build and train a model able to estimate the label or class value given a set of feature values.
There are also problems of the logistic regression type, where a regression model assigns a probability to each sample, and a category is then assigned to the sample from that probability value.
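A minimal sketch of this idea, using scikit-learn's LogisticRegression on made-up one-dimensional data (all values here are illustrative):

```python
# Toy logistic-regression example (illustrative data): the model
# assigns a probability to each sample, and a class follows from it.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.6]]))  # probability assigned to each class
print(clf.predict([[2.6]]))       # category obtained from the probability
```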
The feature matrix is a table with dimensions equal to the number of samples times the number of features.
In SML, data are often split into two subsets, the training and the test data. The first one is used to optimize the model while the second is used to check the performance of the trained model.
The model, usually selected from physical considerations or taking into account the nature of the problem, is an educated guess of how the labels depend on the features. Models usually have free parameters (aka weights) that are varied and optimized to fit the training data.
The estimates obtained from a model are called model predictions, and the ground truth are the actual values of the data labels.
The difference between model predictions and ground truth values is assessed with a loss function that quantifies the distance between the two data sets. The model training is intended to minimize the loss function.
Another important concept is the decision boundary which, in classification problems, is the hypersurface separating samples belonging to different classes.
Therefore, in SML there are four main ingredients:
- Dataset
- Model
- Loss function
- Optimization method
2.2. Software required
The scikit-learn library (homepage
https://scikit-learn.org/stable/)
is an open source machine learning library that supports supervised
and unsupervised learning. It also provides tools for model
fitting, data preprocessing, model selection, model evaluation, and
many other utilities. You can check the
available supervised methods in
https://scikit-learn.org/stable/supervised_learning.html or the
unsupervised ones in
https://scikit-learn.org/stable/unsupervised_learning.html
To install scikit-learn simply type:
conda install scikit-learn
The following packages will be downloaded:
package | build
---------------------------|-----------------
joblib-1.1.0 | pyhd8ed1ab_0 210 KB conda-forge
scikit-learn-1.1.2 | py39he5e8d7e_0 8.2 MB conda-forge
threadpoolctl-3.1.0 | pyh8a188c0_0 18 KB conda-forge
------------------------------------------------------------
Total: 8.5 MB
The following NEW packages will be INSTALLED:
joblib conda-forge/noarch::joblib-1.1.0-pyhd8ed1ab_0
scikit-learn conda-forge/linux-64::scikit-learn-1.1.2-py39he5e8d7e_0
threadpoolctl conda-forge/noarch::threadpoolctl-3.1.0-pyh8a188c0_0
Proceed ([y]/n)?
Downloading and Extracting Packages
scikit-learn-1.1.2 | 8.2 MB | ##################################### | 100%
joblib-1.1.0 | 210 KB | ##################################### | 100%
threadpoolctl-3.1.0 | 18 KB | ##################################### | 100%
Preparing transaction: done
Verifying transaction: done
An alternative way is to use pip install scikit-learn
Moreover, we will also use the libraries numpy, pandas, scipy, and matplotlib, already used in other course lessons.
3. Working with data
Quality data are the key ingredient of machine learning calculations. No matter how complex the models, if the data are insufficient or of poor quality, results will inevitably be poor. In general, data collection, quality control, and preprocessing are the most time-consuming aspects of machine learning.
We can distinguish among different data types:
- Numerical
- Integer
- Floats (continuous)
- Categorical
- Nominal
- Ordinal
The transformation of categorical data types into numerical (integer) ones is called data encoding. The most common ways of encoding data are
- Label encoding: Replace category by an integer.
- Ordinal encoding: Direct translation for categorical data of the ordinal type.
- One-hot encoding
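The three encodings can be sketched with the scikit-learn preprocessing classes; the toy color and size data below are assumptions for illustration.

```python
# Illustrative encodings of toy categorical data.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Label encoding: each category is replaced by an integer
labels = LabelEncoder().fit_transform(colors["color"])

# Ordinal encoding: an explicit order is supplied for ordinal data
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(sizes)

# One-hot encoding: one binary column per category
onehot = OneHotEncoder().fit_transform(colors).toarray()

print(labels)          # blue=0, green=1, red=2
print(ordinal.ravel())
print(onehot)
```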
Once all features and classes have been encoded, the next step is to perform data transformations in order to simplify the calculations and get a better grasp on the data meaning.
A common transformation is data standardization, which brings the data average to zero and the standard deviation to one as follows
\[ x_{st} = \frac{x-\mu}{\sigma}~, \] where \(\mu\) is the original data mean value and \(\sigma\) is the standard deviation.
The rescaling of features to bring them to a common interval of variation is an important step that greatly helps in the successful training of models, and it can be done automatically by scikit-learn. Another common way of standardizing data is bringing them to the \([0,1]\) interval using the min-max transformation
\[ x_{st} = \frac{x-x_{min}}{x_{max}-x_{min}}\in[0,1]~. \]
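Both rescalings are available as scikit-learn transformers; a short sketch on made-up single-feature data:

```python
# Standardization and min-max rescaling with scikit-learn transformers
# (toy single-feature data).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_std = StandardScaler().fit_transform(X)      # zero mean, unit std
X_minmax = MinMaxScaler().fit_transform(X)     # mapped to [0, 1]

print(X_std.mean(), X_std.std())
print(X_minmax.min(), X_minmax.max())
```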
Another very relevant task is to reduce the number of data features selecting the really relevant ones. This can be done in two ways:
Feature Selection
Feature selection implies leaving aside features that do not convey relevant information for the data modeling. If, for example, you are working with butterfly data, trying to model their survival success in a national park from different features, it seems safe to assume that their scientific name is an irrelevant feature. Constant-value features are easily detected and can be safely discarded. Features with a variance below a given threshold value are also sometimes discarded.
In feature selection, the correlation among the different features may help in selecting features that are candidates for being discarded. One has to be careful, because sometimes a feature may seem uncorrelated but may carry significant information when combined with other features.
The values of the optimized weights also give a hint on the relevance of a feature, something that is particularly true for linear standardized models.
Often models are trained with and without the candidate features, and the difference between the results obtained allows for a final decision on whether to remove a feature or not.
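For the simple case of constant (zero-variance) features, scikit-learn provides the VarianceThreshold selector; a sketch on made-up data where the second feature is constant:

```python
# Removing zero-variance features with VarianceThreshold (made-up data;
# the second feature is constant and carries no information).
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 7.0, 0.1],
              [2.0, 7.0, 0.4],
              [3.0, 7.0, 0.3],
              [4.0, 7.0, 0.2]])

selector = VarianceThreshold(threshold=0.0)  # drop features with zero variance
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 2): the constant column has been removed
```

Raising the threshold argument discards features whose variance falls below it.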
Feature Transformation
In this case the features are transformed in order to look for combinations of features that are independent and relevant. The most relevant method in this case is Principal Component Analysis (PCA), where an \(n_s\times n_f\) dataset is transformed in such a way that the original features \(\{X_i\}_{i=1}^{n_f}\) become a new feature set \(\{X'_i\}_{i=1}^{n_f}\). The new features are such that \(X'_1\) carries the maximum variance and \(X'_{n_f}\) carries the minimum possible variance. This is achieved by diagonalizing the \(n_f\times n_f\) covariance matrix
\[ \Sigma_{jk} = \frac{\sum_{i=1}^{n_s}(X_{ij}-\overline{X}_j)(X_{ik}-\overline{X}_k)}{n_s} \] where \(\overline{X}_j\) is the average value of the j-th feature.
After diagonalization, that is, solving the eigenvalue equation \(\Sigma \mathbf{v} = \lambda \mathbf{v}\) to obtain the diagonal matrix \(\Sigma' = U\Sigma U^\dagger\), a number \(m\) of features with the lowest eigenvalues can be removed, and the problem is transformed into a new problem with \(n_f-m\) features. The new feature matrix is
\[X_{PCA} = X \times V\] where \(V\) is a matrix whose columns are the first \(n_f-m\) eigenvectors. The \(X_{PCA}\) matrix has dimensions \(n_s\times(n_f-m)\) and its covariance matrix \(\Sigma_{PCA}\) is diagonal, with elements \(\lambda_1\) to \(\lambda_{n_f-m}\).
We will show an example of PCA using a Kaggle dataset with the ground state energies (labels) of 16,242 molecules (the number of samples, \(n_s\)). The dataset can be downloaded from this link. As described on the Kaggle website, the dataset contains 1,277 columns. The first 1,275 columns are entries of the Coulomb matrix and act as features. The 1,276th column is the PubChem Id from which the molecular structures are obtained; it is an irrelevant feature that can be discarded. The 1,277th column is the atomization energy, calculated by simulations using the Quantum Espresso package.
gs_mol_data = pd.read_csv("files/roboBohr.csv", index_col=0)
gs_mol_data.shape
We can now create the feature matrix X_matrix from the first 1,275 columns of the
gs_mol_data dataframe
X_matrix = gs_mol_data[[str(val) for val in range(0, 1275)]]
We now import the PCA class from sklearn and apply the transformation to the X_matrix
from sklearn.decomposition import PCA
pca = PCA()
X_matrix_PCA = pca.fit_transform(X_matrix)
The input data are centered, but not scaled, for each feature before applying the SVD. The singular values of the centered data matrix, whose squares are proportional to the eigenvalues of the covariance matrix, are
print(pca.singular_values_)
fig, ax = plt.subplots()
ax.plot(pca.singular_values_)
ax.set_yscale("log")
The explained variance ratio of the eigenvalues can be obtained as follows. Notice that the sum of explained variance ratios is one, as we are including all eigenvalues
print(pca.explained_variance_ratio_)
print(np.sum(pca.explained_variance_ratio_))
And the sum quickly converges very close to unity.
fig, ax = plt.subplots()
ax.plot(np.cumsum(pca.explained_variance_ratio_)[:100])
Therefore one can reduce the complexity of the problem, removing features with a low explained variance ratio, using the argument n_components. To retain only 150 features we make \(m = 1,275-150 = 1,125\)
pca = PCA(n_components=150)
X_matrix_PCA = pca.fit_transform(X_matrix)
print(pca.explained_variance_ratio_)
print(np.sum(pca.explained_variance_ratio_))
fig, ax = plt.subplots()
ax.plot(np.cumsum(pca.explained_variance_ratio_))
The first 200 components of the resulting rotated features are
plt.imshow(pca.components_[:,:200])
plt.colorbar()
The resulting feature matrix, X_matrix_PCA has dimensions \(n_s\times
(n_f-m) = 16,242\times 150\). One can check that this is the matrix
obtained with a change of basis of the centered original matrix to the
basis composed by the first 150 eigenvectors of the covariance matrix
# Centering the matrix
centered_X_matrix = X_matrix - np.mean(X_matrix, axis=0)
#
# Change of basis
calc_X_matrix_PCA = np.matmul(centered_X_matrix, pca.components_.T)
#
print(calc_X_matrix_PCA - X_matrix_PCA)
4. Traditional Supervised Learning Models
We will present three of the most common traditional SML models: linear regression, K-nearest neighbours (KNN), and decision tree (DT) models. In all three cases we will use as an example data for the power emitted at different frequencies by a black body at a temperature of 1,200 K, which can be found in this link.
read_bb_data = pd.read_csv("files/bb_data.csv", header=None, names=["Frequency", "Rad. power density"])
# Define data
X = (read_bb_data["Frequency"].values).reshape(-1,1)
Y = (read_bb_data["Rad. power density"].values).reshape(-1,1)
# Plot data
plt.plot(X, Y)
This is a simple example, with a single feature (frequency in THz) and whose label is the power emitted.
We load the libraries for the different models
# Loading libraries
# Linear model
from sklearn.linear_model import LinearRegression
# Polynomial model
from sklearn.preprocessing import PolynomialFeatures
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsRegressor
# Decision Tree
from sklearn.tree import DecisionTreeRegressor
In all cases we follow three steps:
- Create an object of the chosen model type
- Use the fit method to train the model
- Use the predict method to predict labels for a given sample set.
Once we know the basics, we will introduce training and testing sets.
4.1. Linear model
In the simplest linear regression models the data trend is fit to a line, that is, the line that is closest to the data is computed. Therefore, for linear models
\[ f_\omega(\mathbf{X}) = \mathbf{\omega}\cdot\mathbf{X} \]
where \(\mathbf{X}=(1, x_1, x_2, \ldots x_{n_f})\) are the problem features (with a constant first entry) and \(\mathbf{\omega} = (\omega_0, \omega_1, \ldots \omega_{n_f})\) are the free parameters or weights of the model.
In classification problems a linear fit is done to obtain the decision boundary that separates the different data classes. The best decision boundaries are such that samples from different classes are as segregated as possible, i.e. the distance from the samples to the decision boundary is maximized. The second condition is what is used in the definition of the so-called Support Vector Machines (SVM). This name stems from the fact that the samples that are closest to the decision boundary are called support vectors.
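A minimal sketch of a linear support-vector classifier on made-up two-dimensional data; the fitted object exposes the support vectors, i.e. the samples closest to the decision boundary:

```python
# Sketch of a linear support-vector classifier on toy 2-feature data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)                   # samples closest to the boundary
print(clf.predict([[0.4, 0.2], [3.2, 3.1]]))  # predictions for new samples
```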
We proceed to fit a linear model to our blackbody data.
# Linear model
reg = LinearRegression().fit(X, Y)
xpred_linear = np.linspace(1,260,10)
ypred_linear = reg.predict(xpred_linear.reshape(-1,1))
print(reg.score(X, Y))
print('Slope: ', reg.coef_)
print('Intercept: ', reg.intercept_)
Linear regression can be easily extended to polynomial models. For example, a quadratic model with two features, \(X\) and \(Y\), will be defined as follows
\[ f_\omega(\mathbf{X}) = \omega_0 + \omega_1 X + \omega_2 Y + \omega_3 X^2 + \omega_4 Y^2 + \omega_5 XY \]
where it is clear how the number of free parameters scales with the number of features raised to the polynomial degree. In order to carry out a fit using a polynomial model, the data need to be recast taking into account the polynomial order and the number of features. We perform a fit using an order-three polynomial
# Polynomial model
pol_degree = 3
poly = PolynomialFeatures(degree=pol_degree, include_bias=True)  # degree=2 for quadratic
X_poly = poly.fit_transform(X)
reg_poly = LinearRegression().fit(X_poly, Y)
xpred_poly = np.linspace(1, 260, 25).reshape(-1,1)
X_poly = poly.transform(xpred_poly)
ypred_poly = reg_poly.predict(X_poly)
print('Coef: ', reg_poly.coef_)
print('Intercept: ', reg_poly.intercept_)
The results of the linear and the polynomial fits are
fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(xpred_linear, ypred_linear, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(xpred_poly, ypred_poly, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()
4.2. KNN model
This is an easy-to-understand model in which a sample is classified taking into consideration its distance to the K nearest neighbours. The classification can be decided using a majority vote or weighting the votes by the distance to the corresponding samples. There are different possible distance definitions:
- Euclidean distance \(d_{\mathbf{XY}} = \left[\sum_i(X_i-Y_i)^2\right]^{1/2}\)
- Manhattan distance \(d_{\mathbf{XY}} = \sum_i|X_i-Y_i|\)
- Minkowski distance \(d^{(p)}_{\mathbf{XY}} = \left[\sum_i|X_i-Y_i|^p\right]^{1/p}\)
- Chebyshev distance \(d_{\mathbf{XY}} = \max_i|X_i-Y_i|\)
Apart from these four, two other distances are commonly used. The first one, the Hamming distance, is a distance between binary numbers, defined as the number of differing bits between the two. The cosine metric considers the angular distance between points.
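These distances can be checked with scipy.spatial.distance; the sample points below are chosen for illustration (note that SciPy's hamming returns the fraction, not the number, of differing entries):

```python
# The distances above, computed with scipy.spatial.distance
# on two illustrative sample points.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))        # sqrt(9 + 4 + 0)
print(distance.cityblock(x, y))        # Manhattan: 3 + 2 + 0
print(distance.minkowski(x, y, p=3))   # (27 + 8)**(1/3)
print(distance.chebyshev(x, y))        # max(3, 2, 0)
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # fraction of differing bits
print(distance.cosine(x, y))           # 1 - cos(angle between x and y)
```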
Apart from classification problems, the KNN algorithm can be used for regression problems. In this case, the model predicts a continuous outcome by averaging the label values of the K nearest neighbors of a new data point in the feature space. We can use this with our blackbody data, using, for example, three nearest neighbors
# K-Nearest Neighbors
knext = 3
knn = KNeighborsRegressor(n_neighbors=knext)
knn.fit(X, Y)
xpred_knn = np.linspace(1,260,100)
ypred_knn = knn.predict(xpred_knn.reshape(-1,1))
4.3. Decision tree algorithms
Among the traditional SML models, decision trees are the most transparent ones, and they allow for an easy explanation of the model results. This is the approach closest to a traditional algorithm: it is based on a series of conditions on the sample features. The conditions are defined according to the information gain (or uncertainty reduction) achieved in the classification problem with their introduction. We obtain a DT fit with depth dtree to the blackbody data
# Decision Tree
dtree = 4
dt = DecisionTreeRegressor(max_depth=dtree)
dt.fit(X, Y)
xpred_dt = np.linspace(1,260,100)
ypred_dt = dt.predict(xpred_dt.reshape(-1,1))
The results obtained with the different models are
fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(xpred_linear, ypred_linear, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(xpred_poly, ypred_poly, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.plot(xpred_knn, ypred_knn, label = f"{knext}-KNN", color="xkcd:crimson",lw=0.75)
ax.plot(xpred_dt, ypred_dt, label = f"{dtree}-DT", color="xkcd:teal",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()
The residual plot for the different models can be produced once the predicted values for the frequency values used in the fit are computed.
# Residual plot
# Linear case
Ypred_linear = reg.predict(X).reshape(-1,1)
X_poly = poly.transform(X)
Ypred_poly = reg_poly.predict(X_poly).reshape(-1,1)
Ypred_knn = knn.predict(X).reshape(-1,1)
Ypred_dt = dt.predict(X).reshape(-1,1)
#
# Residual plot
fig, ax = plt.subplots()
ax.plot(X, Ypred_linear-Y, label="linear", color="xkcd:electric blue",lw=0.75)
ax.plot(X, Ypred_poly-Y, label = f"{pol_degree}-polyn", color="xkcd:orchid",lw=0.75)
ax.plot(X, Ypred_knn-Y, label = f"{knext}-KNN", color="xkcd:crimson",lw=0.75)
ax.plot(X, Ypred_dt-Y, label = f"{dtree}-DT", color="xkcd:teal",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.set_ylabel("Residuals")
ax.legend()
4.4. Loss functions
The agreement between model predictions and data samples is quantified with a loss function. Loss functions are often distances, fulfilling the symmetry property and the triangle inequality, but this is not always the case. The most common loss functions are:
- Mean-Squared Error (MSE) \[L_{MSE} = \frac{\sum_i^{n_s}(Y_i-Y_i^{(pred)})^2}{n_s}\]
- Mean Absolute Error (MAE) \[L_{MAE} = \frac{\sum_i^{n_s}|Y_i-Y_i^{(pred)}|}{n_s}\]
- l-norm Loss Function \[L_{l} = \left[\frac{\sum_i^{n_s}|Y_i-Y_i^{(pred)}|^l}{n_s}\right]^{1/l}\] The last case reduces to \(L_{MAE}\) for \(l=1\) and to \(L_{RMSE}\), the square root of \(L_{MSE}\), for \(l = 2\). In the \(l\to\infty\) limit the loss function is \(L_\infty = \max_i|Y_i-Y^{(pred)}_i|\), known as the max error. This case defines a bound for all errors and is useful in worst-case scenarios.
- Cross Entropy \[ L_{CE}(Y, Y^{(pred)}) = -\sum_i^{n_s} \left[Y_i \log{(Y^{(pred)}_i)} + (1-Y_i)\log{(1-Y^{(pred)}_i)}\right] \] A popular choice for logistic regression models and classification models that deal with probabilities.
- Hinge function \[ L_{Hinge}(Y, f(X))= \max{(0,1-Yf(X))} \]
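Most of these losses are also available ready-made in sklearn.metrics; a short sketch with made-up predictions:

```python
# Common losses computed with sklearn.metrics on made-up values.
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             max_error, log_loss)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

print(mean_squared_error(y_true, y_pred))   # MSE = (0.25 + 0 + 1)/3
print(mean_absolute_error(y_true, y_pred))  # MAE = (0.5 + 0 + 1)/3
print(max_error(y_true, y_pred))            # L_infinity = 1.0

# Cross entropy for binary labels and predicted probabilities
print(log_loss([1, 0, 1], [0.9, 0.2, 0.7]))
```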
We can define the l-norm loss function and calculate it for the blackbody radiation fits shown above
### Loss function definition
def loss_function_l(y, ycalc, l=2):
    return (np.mean(np.abs(y - ycalc)**l))**(1/l)
#####
# RMSE loss
l_value = 2
# Linear case
Ypred_linear = reg.predict(X).reshape(-1,1)
print("linear: ", loss_function_l(Y, Ypred_linear, l=l_value))
X_poly = poly.transform(X)
Ypred_poly = reg_poly.predict(X_poly).reshape(-1,1)
print("poly: ", loss_function_l(Y, Ypred_poly, l=l_value))
Ypred_knn = knn.predict(X).reshape(-1,1)
print("knn: ", loss_function_l(Y, Ypred_knn, l=l_value))
Ypred_dt = dt.predict(X).reshape(-1,1)
print("dt: ", loss_function_l(Y, Ypred_dt, l=l_value))
5. Model Evaluation
In order to carry out an unbiased model evaluation, the available data are usually split into two subsets, the training set and the test set: the first is used to train the model and search for the optimal model parameter values, while the second is used to test the predictive power of the trained model. Both subsets should ideally come from the same original dataset, whose elements are shuffled and randomly assigned.
It is important to make a random selection of data. To illustrate this point, let’s make train and test sets with half the data of our blackbody radiation data.
# Number of data
total_num_data = X.shape[0]
print(total_num_data)
# Case 0 :: Divide train and test by half of data
train_dataset_0 = X[:total_num_data//2,:], Y[:total_num_data//2,:]
test_dataset_0 = X[total_num_data//2:,:], Y[total_num_data//2:,:]
#
One can compare the loss function in this case with a second case where the data are randomly split into two sets. This can be done by hand
# Case 1 :: Take random data
rng = np.random.default_rng()
random_selection = np.sort(rng.choice(total_num_data, size=total_num_data//2, replace=False))
mask = np.full_like(X, True, dtype=bool)
mask[random_selection,:] = False
train_dataset_1 = X[random_selection,:], Y[random_selection,:]
test_dataset_1 = X[mask].reshape((-1,1)), Y[mask].reshape((-1,1))
Or, better, using train_test_split, the function provided for data splitting
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, shuffle=True
)
The results for the two different splittings can be computed for the different models and compared
# Linear model
reg_0 = LinearRegression().fit(train_dataset_0[0], train_dataset_0[1])
y_0_train_linear = reg_0.predict(train_dataset_0[0]).reshape(-1,1)
y_0_test_linear = reg_0.predict(test_dataset_0[0]).reshape(-1,1)
loss_train_0_linear = loss_function_l(train_dataset_0[1], y_0_train_linear,l=2)
loss_test_0_linear = loss_function_l(test_dataset_0[1], y_0_test_linear,l=2)
print(loss_train_0_linear, loss_test_0_linear)
# Linear model 1
reg_1 = LinearRegression().fit(train_dataset_1[0], train_dataset_1[1])
y_1_train_linear = reg_1.predict(train_dataset_1[0]).reshape(-1,1)
y_1_test_linear = reg_1.predict(test_dataset_1[0]).reshape(-1,1)
loss_train_1_linear = loss_function_l(train_dataset_1[1], y_1_train_linear,l=2)
loss_test_1_linear = loss_function_l(test_dataset_1[1], y_1_test_linear,l=2)
print(loss_train_1_linear, loss_test_1_linear)
# Polynomial model 0
reg_poly_0 = LinearRegression().fit(poly.transform(train_dataset_0[0]), train_dataset_0[1])
#
y_0_train_poly = reg_poly_0.predict(poly.transform(train_dataset_0[0])).reshape(-1,1)
y_0_test_poly = reg_poly_0.predict(poly.transform(test_dataset_0[0])).reshape(-1,1)
#
loss_train_0_poly = loss_function_l(train_dataset_0[1], y_0_train_poly,l=2)
loss_test_0_poly = loss_function_l(test_dataset_0[1], y_0_test_poly,l=2)
print(loss_train_0_poly, loss_test_0_poly)
# Poly 1
reg_poly_1 = LinearRegression().fit(poly.transform(train_dataset_1[0]), train_dataset_1[1])
#
y_1_train_poly = reg_poly_1.predict(poly.transform(train_dataset_1[0])).reshape(-1,1)
y_1_test_poly = reg_poly_1.predict(poly.transform(test_dataset_1[0])).reshape(-1,1)
#
loss_train_1_poly = loss_function_l(train_dataset_1[1], y_1_train_poly,l=2)
loss_test_1_poly = loss_function_l(test_dataset_1[1], y_1_test_poly,l=2)
print(loss_train_1_poly, loss_test_1_poly)
#
fig, ax = plt.subplots()
ax.plot(X, Y, color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="BB data")
ax.plot(train_dataset_0[0], y_0_train_linear, label="train linear 0", color="xkcd:electric blue",lw=0.75)
ax.plot(test_dataset_0[0], y_0_test_linear, label="test linear 0", color="xkcd:navy blue",lw=0.75)
ax.plot(train_dataset_1[0], y_1_train_linear, label="train linear 1", color="xkcd:forest green",lw=0.75)
ax.plot(test_dataset_1[0], y_1_test_linear, label="test linear 1", color="xkcd:light green",lw=0.75)
ax.plot(train_dataset_0[0], y_0_train_poly, label="train poly 0", color="xkcd:orange",lw=0.75)
ax.plot(test_dataset_0[0], y_0_test_poly, label="test poly 0", color="xkcd:teal",lw=0.75)
ax.plot(train_dataset_1[0], y_1_train_poly, label="train poly 1", color="xkcd:pink",lw=0.75)
ax.plot(test_dataset_1[0], y_1_test_poly, label="test poly 1", color="xkcd:crimson",lw=0.75)
ax.set_xlabel("Frequency (THz)")
ax.legend()
5.1. Validation curve
The data splitting helps in detecting and avoiding overfitting and underfitting. In order to explain this, the concepts of bias and variance are introduced. The bias is the loss associated with the training set; typically it is smaller than the loss for the test dataset. The difference between the test loss and the bias is the variance. The bias is related to model complexity: we can always reduce the bias by increasing model complexity, adding more free parameters (tuning the model hyperparameters). But then we are falling into the dangerous pathway of overfitting. As John von Neumann said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” The bias reduction comes at a high cost, the loss of predictive power, signaled by a large variance increase; a low bias together with a large variance is a sure red flag for overfitting. The opposite phenomenon, underfitting, occurs when the model complexity is not high enough to reproduce the data, which translates into a high bias and a low variance.
In order to assess the right model complexity one needs to balance bias and variance by playing with the model complexity. The right tool to check this is the validation curve, where test and training losses are plotted as a function of model complexity (hyperparameters). The loss for test data can be computed as the sum of bias plus variance, so a plot of bias and variance versus the hyperparameters offers similar information. We can check this by making a polynomial fit to some data while increasing the order of the polynomial.
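scikit-learn can also compute train and test scores as a function of a hyperparameter directly with validation_curve; the following sketch uses a KNN regressor on synthetic data (all names and values here are illustrative):

```python
# Sketch of a validation curve with sklearn.model_selection.validation_curve,
# varying the number of neighbors of a KNN regressor on synthetic data.
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

k_range = [1, 2, 5, 10, 20, 50]
train_scores, test_scores = validation_curve(
    KNeighborsRegressor(), X, y,
    param_name="n_neighbors", param_range=k_range, cv=5
)
# One row per hyperparameter value, one column per cross-validation fold
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```

Note that validation_curve reports scores (higher is better) rather than losses, so the overfitting region shows a high train score and a low test score.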
We first create some data and add random normal noise to them
# Abscissa values
npoints = 32
#
x_values = np.linspace(-5, 5, npoints)
#
# Parameter values (a_0 + a_1*x + a_2*x**2 + a_3*x**3 + a_4*x**4)*exp(-0.25*x)
a_par = [50.0, -5.5, -2.5, 0.5, 0.1]
a_par.reverse()
#
# Calculate values
y_values = a_par.pop(0)*x_values
a_0 = a_par.pop(-1)
for a_val in a_par:
y_values = (y_values + a_val)*x_values
y_values += a_0
#
y_values *= np.exp(-0.25*x_values)
#
# Build Dataframe and add noise
rng = np.random.default_rng(1234)
pd_data = pd.DataFrame({"X": x_values,
"Y": y_values + rng.normal(loc = 0, scale = 5.12, size = npoints)})
#
pd_data.plot(x="X", y="Y", ls='', marker=".")
#
# Splitting data
from sklearn.model_selection import train_test_split
#
X_train, X_test, Y_train, Y_test = train_test_split(
pd_data["X"].values.reshape(-1, 1), pd_data["Y"].values.reshape(-1, 1), test_size=0.6, shuffle=True
)
We perform polynomial fits for different polynomial orders and check the resulting bias and variance
### Dependence with Polynomial degree
in_losses = np.zeros(11)
out_losses = np.zeros(11)
#
# Get axis limits from data
x_min, x_max = np.min(x_values), np.max(x_values)
y_min, y_max = np.min(pd_data["Y"]), np.max(pd_data["Y"])
# Main loop
for pol_degree in range(1,12):
# Define polynomial case
poly = PolynomialFeatures(degree=pol_degree, include_bias=True)
#
X_poly = poly.fit_transform(X_train)
reg_poly = LinearRegression().fit(X_poly, Y_train)
#
Y_train_poly = reg_poly.predict(X_poly).reshape(-1,1)
X_poly = poly.transform(X_test)
Y_test_poly = reg_poly.predict(X_poly).reshape(-1,1)
#
#fig, ax = plt.subplots()
#ax.plot(X_train, Y_train, ls='', marker='.',label="Train data")
#ax.plot(X_test, Y_test, ls='', marker='.',label="Test data")
#ax.plot(X_train, Y_train_poly, ls='', marker='+', label=f"Train order {pol_degree}")
#ax.plot(X_test, Y_test_poly, ls='', marker='*', label=f"Test order {pol_degree}")
#ax.set_xlim((x_min, x_max))
#ax.set_ylim((y_min, y_max))
#ax.legend()
in_losses[pol_degree-1] = loss_function_l(Y_train, Y_train_poly, l=2)
out_losses[pol_degree-1] = loss_function_l(Y_test, Y_test_poly, l=2)
fig, ax = plt.subplots()
ax.plot(np.arange(1,12), in_losses, label="Bias")
ax.plot(np.arange(1,12),out_losses-in_losses, label="Variance")
ax.legend()
ax.set_yscale("log")
ax.set_xlabel("Polynomial degree")
ax.set_ylabel("Loss")
5.2. Regularization of parameters
Models can quickly become too complex. A way to control this is through parameter regularization, which penalizes large parameter values. This is achieved by including the parameter values in the loss function as
\[ L^{(1)}_R(Y, Y^{(pred)}) = L(Y, Y^{(pred)}) + \lambda
\sum_i|\omega_i|~,~~ L^{(2)}_R(Y, Y^{(pred)}) = L(Y, Y^{(pred)}) +
\lambda \sum_i\omega_i^2~. \] For \(\lambda = 0\) the original model
results are recovered and for increasing values of \(\lambda\) large
free parameter values are increasingly penalized. Regularization in
most scikit-learn models is implemented as a hyperparameter.
For example, in scikit-learn the \(L^{(2)}\) regularization in the linear regression case is performed using the Ridge class, with a hyperparameter \(\alpha\) that plays the role of \(\lambda\). We use this in the following example with the same data as in the previous subsection.
### Dependence with regularization
#
from sklearn.linear_model import Ridge
#
#
# Define polynomial case
pol_degree = 4
poly = PolynomialFeatures(degree=pol_degree, include_bias=True)
X_train_poly = poly.fit_transform(X_train)
#
# Main loop
for alpha_val in [0.01, 0.1, 1, 10]:
#
reg_poly = Ridge(alpha=alpha_val).fit(X_train_poly, Y_train)
#
Y_train_poly = reg_poly.predict(X_train_poly).reshape(-1,1)
X_test_poly = poly.transform(X_test)
Y_test_poly = reg_poly.predict(X_test_poly).reshape(-1,1)
#
fig, ax = plt.subplots(2)
ax[0].plot(X_train, Y_train, ls='', marker='.',label="Train data")
ax[1].plot(X_test, Y_test, ls='', marker='.',label="Test data")
ax[0].plot(X_train, Y_train_poly, ls='', marker='+', label=f"Train $\\alpha= {alpha_val}$")
ax[1].plot(X_test, Y_test_poly, ls='', marker='*', label=f"Test $\\alpha= {alpha_val}$")
ax[0].legend()
ax[1].legend()
The \(L^{(1)}\) regularization in the linear regression case is
performed using the Lasso class, which also has an alpha
hyperparameter playing the same role as in the Ridge case. For
numerical reasons, using alpha = 0 with the Ridge or Lasso objects
is not advised; instead, the LinearRegression object should be used.
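A distinctive effect of the \(L^{(1)}\) penalty is that it tends to drive some coefficients exactly to zero, producing sparse models, while the \(L^{(2)}\) penalty only shrinks them. A minimal sketch on made-up synthetic data (only the first feature actually carries signal):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# 100 samples, 5 features; only the first feature matters
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 drives the irrelevant coefficients exactly to zero;
# L2 merely shrinks all coefficients towards zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

Inspecting lasso.coef_ shows that the four irrelevant coefficients are exactly zero, while ridge.coef_ keeps small nonzero values for all of them.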
5.3. Model Tuning
Apart from choosing the number of free parameters or hyper-parameters, a given model can be fine-tuned by repeating the model training with different hyper-parameter values. This can be done in different ways by discretizing the hyper-parameter space. One approach is the grid search method: define a regular grid over the free parameters and evaluate the trained model at each point of the grid. Its cost scales exponentially with the number of parameters, so it is computationally very expensive. Another possibility is to perform a random search, sampling the hyper-parameter values at random.
In order to show how to carry out such calculations, we train an SVR model to fit the blackbody data using a grid search approach.
# Model tuning
svr_model = SVR()
#
from sklearn.model_selection import GridSearchCV
#
# Fit model tuning hyper-parameter values using a grid search
#
hyperparameter_grid = {"C":np.arange(1,8,2), "kernel":["rbf", "poly"]}
#
opt_clf_grid = GridSearchCV(estimator=svr_model, param_grid=hyperparameter_grid)
opt_clf_grid.fit(training_set[0], training_set[1])
best_model_grid = opt_clf_grid.best_estimator_
grid_train_results = best_model_grid.predict(training_set[0])
grid_test_results = best_model_grid.predict(test_set[0])
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].plot(training_set[0], training_set[1], color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[0].plot(test_set[0], test_set[1], color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[0].plot(training_set[0], grid_train_results, label="Training results", color="xkcd:electric blue",lw=0.75)
ax[0].plot(test_set[0], grid_test_results, label="Test results", color="xkcd:navy blue",lw=0.75, ls="--")
ax[0].set_xlabel("Frequency (THz)")
ax[0].set_ylabel("Fit results")
ax[0].legend()
ax[1].plot(training_set[0].ravel(), training_set[1].ravel()-grid_train_results, color="xkcd:crimson", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[1].plot(test_set[0].ravel(), test_set[1].ravel()-grid_test_results, color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[1].set_xlabel("Frequency (THz)")
ax[1].set_ylabel("Residuals")
ax[1].legend()
We can repeat the calculation for the same SVR model using a random search in hyperparameter space. We first plot the gamma distribution from which the C values will be sampled.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import gamma
a=2
x = np.linspace(gamma(a, scale=5).ppf(0.01), gamma(a,scale=5).ppf(0.99), 100)
plt.plot(x, gamma(a, scale=5).pdf(x),'r-', lw=2.5, alpha=0.6, label='gamma pdf')
#
# Fit model tuning hyper-parameter values using a random search
rng = np.random.default_rng()
#
hyperparameter_dist = {"C":gamma(a=2,scale=5), "gamma":["scale"], "kernel":["rbf"]}
#
opt_clf_random = RandomizedSearchCV(estimator=svr_model, param_distributions=hyperparameter_dist)
opt_clf_random.fit(training_set[0], training_set[1])
#
best_model_random = opt_clf_random.best_estimator_
#
random_train_results = best_model_random.predict(training_set[0])
random_test_results = best_model_random.predict(test_set[0])
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,4))
ax[0].plot(training_set[0], training_set[1], color="xkcd:olive", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[0].plot(test_set[0], test_set[1], color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[0].plot(training_set[0], random_train_results, label="Training results", color="xkcd:electric blue",lw=0.75)
ax[0].plot(test_set[0], random_test_results, label="Test results", color="xkcd:navy blue",lw=0.75, ls="--")
ax[0].set_xlabel("Frequency (THz)")
ax[0].set_ylabel("Fit results")
ax[0].legend()
ax[1].plot(training_set[0].ravel(), training_set[1].ravel()-random_train_results, color="xkcd:crimson", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Training data")
ax[1].plot(test_set[0].ravel(), test_set[1].ravel()-random_test_results, color="xkcd:forest green", ls='', marker='.', ms = 1.25, alpha = 0.75, label="Test data")
ax[1].set_xlabel("Frequency (THz)")
ax[1].set_ylabel("Residuals")
ax[1].legend()
In any case, the test dataset should never be used to optimize the hyperparameter values; doing so is known as data leakage. A possible solution, if one has a large enough dataset, is to split it into three subsets: one for training, a second one for validation purposes, and a third one for testing.
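A minimal sketch of such a three-way split, calling train_test_split twice (the 60%/20%/20% proportions are an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split off a held-out test set (20% of the data)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets (60%/20% of the total)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used to compare hyperparameter settings; the test set is touched only once, at the very end.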
5.4. Learning curve
Another factor that sometimes helps to improve the results is increasing the number of samples involved in the calculations. The validation curves depend not only on the model complexity but also on the number of samples \(n_s\). Often, the required model complexity increases with the number of samples.
In most cases, more data reduces the variance but not the bias. If this is the case, including more data helps until the variance becomes small; beyond that point the results are insensitive to an enlarged dataset. This is important to know, as increasing the number of samples is sometimes expensive, cumbersome, or plainly impossible. We calculate, for the blackbody radiation data, the in- and out-losses for different numbers of samples.
### Dependence with number of data
in_losses = np.zeros(20)
out_losses = np.zeros(20)
# Define polynomial case
pol_degree = 2
poly = PolynomialFeatures(degree=pol_degree, include_bias=True)
#
# Main loop
for index in range(1,21):
    # Taking index*5% of the samples
    random_selection = np.sort(rng.choice(total_num_data, size = total_num_data*index//20, replace=False))
    new_set = X[random_selection,:], Y[random_selection,:]
    #
    # Splitting into train and test sets with 50% of the data
    random_selection = np.sort(rng.choice(total_num_data*index//20, size = total_num_data*index//40, replace=False))
    mask = np.full_like(new_set[0], True, dtype=bool)
    mask[random_selection,:] = False
    train_dataset = new_set[0][random_selection,:], new_set[1][random_selection,:]
    test_dataset = new_set[0][mask].reshape((-1,1)), new_set[1][mask].reshape((-1,1))
    #
    X_poly = poly.fit_transform(train_dataset[0])
    reg_poly = LinearRegression().fit(X_poly, train_dataset[1])
    #
    y_train_poly = reg_poly.predict(X_poly).reshape(-1,1)
    X_poly = poly.transform(test_dataset[0])
    y_test_poly = reg_poly.predict(X_poly).reshape(-1,1)
    #
    in_losses[index-1] = loss_function_l(train_dataset[1], y_train_poly, l=2)
    out_losses[index-1] = loss_function_l(test_dataset[1], y_test_poly, l=2)
To assess whether the number of samples is large enough, the right tool is the learning
curve, where in- and out-losses are depicted as a function of the size
of the employed dataset.
fig, ax = plt.subplots()
ax.plot(np.arange(5,105,5), in_losses, label="In Loss")
ax.plot(np.arange(5,105,5),out_losses, label="Out Loss")
ax.legend()
ax.set_xlabel("Percentage of samples")
ax.set_ylabel("Loss")
ax.set_title(f"Order {pol_degree} polynomial fit")
5.5. Model Metrics
From what we have already explained, the lower the loss values for a model, the better the model. To interpret the loss values, one needs to know the problem and the quantities involved. However, in some cases, loss values do not provide enough information to correctly assess model performance. A possible option is to use the \(R^2\) metric, also called the coefficient of determination, defined as
\[ R^2(Y, Y^{(pred)}) = 1 - \frac{\sum_i (Y_i -Y^{(pred)}_i)^2 }{\sum_i (Y_i -Y_{avg})^2 }~, \]
where \(Y_{avg}\) is the mean value of the \(Y\) class. This coefficient, which in simple linear regression is the square of the correlation coefficient, provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. In most cases \(0\le R^2\le 1\), though it can be negative for models that perform worse than simply predicting the mean.
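This definition can be checked by hand on a small made-up example and compared against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

Y = np.array([1.0, 2.0, 3.0, 4.0])
Y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - sum_i (Y_i - Y_pred_i)^2 / sum_i (Y_i - Y_avg)^2
r2_manual = 1 - np.sum((Y - Y_pred)**2) / np.sum((Y - Y.mean())**2)

print(r2_manual, r2_score(Y, Y_pred))  # both give 0.98
```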
For classification models, the model accuracy, i.e. the percentage of samples correctly classified, is the most common metric. But other metrics can be more relevant, and the choice of the right metric depends on the model employed and the problem at hand. For example, the different classes may not be equivalent, so that errors in some of them are more costly than in others; another example is the case of very imbalanced classes.
Complete information about model performance can be obtained from the confusion matrix. This is a square matrix with as many rows and columns as there are classes in the problem. The rows of the matrix are associated with the real class values and the columns with the predicted class values. Therefore, a matrix element \(CF_{34} = 5\) tells us that in five cases the label 3 was predicted to be label 4. A diagonal matrix element gives the number of cases of that label correctly identified. A perfect model would produce a diagonal matrix.
Some of the more popular metrics for classification problems can be easily obtained from the data in the confusion matrix:
Recall: Sensitivity with respect to a class of interest.
\[R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
Precision: Precision of a classifier with respect to its prediction for a given class.
\[P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\]
f1-score: Harmonic mean of recall and precision.
\[f1 = \frac{2 P R}{R + P}\]
All of these values are relative, and they are often provided together with the support, that is, the number of occurrences of each label in the dataset. It can be easily calculated as the sum of each row of the confusion matrix. A low support value means that the previous metrics can be misleading.
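As an illustration with made-up binary labels, these metrics can be computed directly from the confusion-matrix entries and compared with the corresponding scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)  # rows: true labels, columns: predicted labels
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)                             # 3 / (3 + 1) = 0.75
precision = tp / (tp + fp)                          # 3 / (3 + 2) = 0.6
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = cm.sum(axis=1)                            # occurrences of each true label

print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
print(support)  # [6 4]
```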
As an application to a classification problem we will work through
an image-analysis problem, using a dataset of hand-written digits
from the UCI Machine Learning Repository that is included in the
scikit-learn package. Follow this link for the complete list of
datasets. We first read the data
from sklearn import datasets
digits = datasets.load_digits()
#
#
print("Keys of digit dataset: \n{}".format(digits.keys()))
print("Description of digit dataset: \n{}".format(digits['DESCR']))
#
print(digits.data)
print(digits.data.shape)
print(digits.target)
print(digits.target.shape)
The data attribute provides the features and target provides
the class of the different instances. Note that each data element is
a flattened 8x8 image (64 features). In this particular dataset, the
original 2D arrays can be accessed through the images attribute
print(digits.images[0])
print(digits.images.shape)
The classification task is to predict, given an image, which digit it
represents. We are given samples of each of the ten possible classes
(the digits zero through nine) on which we fit a Support Vector
Classifier (class sklearn.svm.SVC) to predict the classes
to which unseen samples belong. We start by splitting our original
dataset into training and test sets, using half the data for training
and the other half for testing (here without shuffling, so the split
is not random).
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.5, shuffle=False)
We now train our model and calculate the predicted values to compare with the test dataset.
from sklearn import svm
#
clf = svm.SVC()
# Train
clf.fit(X_train, y_train)
# Compute predictions
predicted = clf.predict(X_test)
We can now analyze the performance of our learning procedure. The classification report includes, apart from the previous metrics, the accuracy (ratio of correctly predicted labels to the total number of labels), the macro average (the unweighted mean per label), the weighted average (the support-weighted mean per label), and the sample average (only for multilabel classification).
from sklearn import metrics
#
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
#
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")
We proceed with a second example using SVM with a different sklearn
dataset, the Wisconsin Breast Cancer dataset (load_breast_cancer),
which records clinical measurements of breast tumors. Each tumor is
labeled as benign (for harmless tumors) or malignant (for cancerous
tumors), and the task is to train a model able to predict whether a
tumor is malignant based on the available tissue measurements. We
start by loading the dataset and splitting the data.
from sklearn.datasets import load_breast_cancer
#
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)
In this case we include two new arguments. The random_state argument
is None by default. It can be given an integer value that acts as a
seed for the shuffling applied to the data, so as to obtain a
reproducible output across multiple function calls. The stratify
argument, which also defaults to None, is very useful when the
relative frequencies of the different classes are very different. In
that case, by sheer luck, the training or test set may end up with no
samples from one of the classes in the dataset. To avoid this, the
argument is given an array with the labels, and the splitting is done
in such a way that the relative label frequencies are approximately
preserved in each fold.
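A quick check, on a made-up and strongly imbalanced label vector, that stratification approximately preserves the class frequencies in both folds:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # strongly imbalanced: 90% vs 10%

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Both folds keep approximately the original 90/10 class ratio
print(np.bincount(y_train), np.bincount(y_test))
```

Without stratify, an unlucky shuffle could leave the minority class almost absent from one of the folds.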
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, random_state=0)
svc = SVC(gamma='auto')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
The model substantially overfits: the score is perfect on the training set, but the accuracy is only 63% on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the hyperparameters and to the scaling of the data. In principle this can be partially solved using the option gamma='scale'
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))
But we will try a more general solution. In principle, for numerical reasons, all features should vary on a similar scale. We check the scales of the different features in the present dataset
fig, ax = plt.subplots()
ax.plot(X_train.min(axis=0), 'o', label="min")
ax.plot(X_train.max(axis=0), '^', label="max")
ax.legend(loc=4)
ax.set_xlabel("Feature index")
ax.set_ylabel("Feature magnitude")
ax.set_yscale("log")
The solution is to preprocess the data by rescaling it. We could do this by hand, but sklearn provides an efficient way to standardize the data (zero mean and unit variance for each feature) using the StandardScaler class
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # Standardization (zero mean, unit variance) object
# Scaling features
X = cancer.data
scaler.fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, cancer.target, stratify=cancer.target, random_state=66)
#
# Train model
svc = SVC()
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))
Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar and the training accuracy is not so close to 100%. From here, we can try increasing the C or gamma hyperparameters to fit a more complex model.
svc = SVC(C=1000)
svc.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(
svc.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(svc.score(X_test, y_test)))
6. Neural Networks: the multilayer perceptron (MLP)
Deep learning is the state-of-the-art tool in many machine learning applications, but its algorithms are carefully tailored to specific uses and their training is exceptionally expensive in computational terms. We will discuss a relatively simple method, the Multilayer Perceptron (MLP), which can be used either for classification or for regression. It can serve as a starting point before getting involved with far more complex deep learning methods.
As with all Neural Networks, this model has a large number of free
coefficients (or weights) that should be optimized in the training:
there is one between every input and every hidden unit (which make up
a hidden layer), and one between every unit in each hidden layer and
the output. We start importing the MLPClassifier class
from sklearn.neural_network import MLPClassifier
# Auxiliary modules
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
As a first application, we use a simple toy-model dataset generator provided by scikit-learn called make_moons, very useful for visualizing clustering and classification algorithms. We generate a dataset with 200 points, two features, and one binary class, and we plot the data.
from sklearn.datasets import make_moons
#
# Generate data
X, y = make_moons(n_samples=200, noise=0.25, random_state=4)
marker_hash = {0: "*", 1: "o"}
#
# Plot data
fig, ax = plt.subplots()
for value in range(2):
    marker = marker_hash[value]
    mask = y == value
    ax.scatter(X[mask,0], X[mask,1], marker=marker)
We now split the data into training and test datasets, fit the training data using a Multilayer Perceptron with a single hidden layer of seventy nodes, and calculate the predicted values for the test dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=45)
mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(70,), random_state=0, max_iter=300).fit(X_train, y_train)
Y_predicted = mlp.predict(X_test)
We can now plot the test values and the predicted test values, using the same markers with different sizes and transparencies to see which cases are misclassified.
marker_hash = {0: "s", 1: "o"}
color_hash = {0:"xkcd:navy", 1: "xkcd:crimson"}
fig, ax = plt.subplots()
for value in range(2):
    marker = marker_hash[value]
    colorv = color_hash[value]
    mask_test = y_test == value
    ax.scatter(X_test[mask_test,0], X_test[mask_test,1], marker=marker, color=colorv, s=20)
    mask_predicted = Y_predicted == value
    ax.scatter(X_test[mask_predicted,0], X_test[mask_predicted,1], marker=marker, alpha = 0.35, s=120, color=colorv)
We can assess the quality of the obtained classification using the classification report and the confusion matrix as in previous examples.
# from sklearn.metrics import classification_report
print(
f"Classification report for classifier {mlp}:\n"
f"{classification_report(y_test, Y_predicted)}\n"
)
#from sklearn.metrics import ConfusionMatrixDisplay
#
disp = ConfusionMatrixDisplay.from_predictions(y_test, Y_predicted)
disp.figure_.suptitle("Confusion Matrix")
A more sophisticated way of plotting the data uses DecisionBoundaryDisplay, which shows the test dataset together with the boundary between the two possible values of the binary classification. It can be done as follows
from sklearn.inspection import DecisionBoundaryDisplay
feature_1, feature_2 = np.meshgrid(
np.linspace(X[:, 0].min(), X[:, 0].max()),
np.linspace(X[:, 1].min(), X[:, 1].max())
)
grid = np.vstack([feature_1.ravel(), feature_2.ravel()]).T
grid_y_pred = np.reshape(mlp.predict(grid), feature_1.shape)
display = DecisionBoundaryDisplay(
xx0=feature_1, xx1=feature_2, response=grid_y_pred
)
display.plot()
display.ax_.scatter(
X_test[:,0], X_test[:,1], c=y_test, edgecolor="black"
)
plt.show()
We now consider a second example using the Stellar Classification Dataset – SDSS17 from Kaggle:
Link SDSS17.
The data consist of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and a single class column, which identifies it as either a star, a galaxy, or a quasar. We start by reading the data into a dataframe
stellar_data = pd.read_csv("files/star_classification.csv")
stellar_data.keys()
We need to select a feature matrix, X, removing those features that we consider irrelevant for the classification, and a label vector, Y.
X = stellar_data[["alpha", "delta", 'u', 'g', 'r', 'i', 'z', 'redshift']]
Y = stellar_data['class']
We now split the data into training and test sets and make a first classification attempt using the MLPClassifier class
X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y, random_state=42)
#
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10, 10], random_state=0).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
y_predicted = mlp.predict(X_test)
print(
f"Classification report for classifier {mlp}:\n"
f"{classification_report(y_test, y_predicted)}\n"
)
#from sklearn.metrics import ConfusionMatrixDisplay
#
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_predicted)
disp.figure_.suptitle("Confusion Matrix")
In almost all cases, it is convenient to scale the features.
#from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # Standardization (zero mean, unit variance) object
# Scaling features
scaler.fit(X)
scaled_X = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(scaled_X, Y, stratify=Y, random_state=42)
#
# Train the model
mlp = MLPClassifier( solver= 'lbfgs', hidden_layer_sizes=[10,], random_state=0, max_iter=400, tol=0.005).fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test, y_test)))
y_predicted = mlp.predict(X_test)
print(
f"Classification report for classifier {mlp}:\n"
f"{classification_report(y_test, y_predicted)}\n"
)
While the MLPClassifier and MLPRegressor provide easy-to-use interfaces for the
most common neural network architectures, they only capture a small subset of what
is possible with neural networks. If you are interested in working with more flexible
or larger models, we encourage you to look beyond scikit-learn into the superb
deep learning libraries that are out there. For Python users, the most well-established
are Keras, TensorFlow, and PyTorch.
Created: 2025-11-20 jue 21:36