Building and optimizing pipelines in scikit-learn (Tutorial)


A well-known development practice for data scientists is to define machine learning pipelines (aka workflows) that execute a sequence of typical tasks: data normalization, imputation of missing values, outlier detection and removal, dimensionality reduction, classification. Scikit-learn provides a pipeline module to automate this process. In this tutorial we introduce this module, with a particular focus on:

  • Creating the pipeline;
  • Automatic optimization of the parameters of each pipeline component;
  • Automatic selection of the pipeline's building blocks.

This tutorial extends an example taken from the official documentation for the library. In order to start, install scikit-learn v0.19.1 (the most recent version at the time of writing):

pip install scikit-learn==0.19.1

Almost everything should work with older versions of the library, except for some methods that have been moved between different modules.

This tutorial is an abridged version of the Italian one: if you are interested, check out the original version.

Pipeline Setup

Let's start by loading a dataset available within scikit-learn and splitting it into training and testing sets:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'])
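
As a quick sanity check, we can print the shapes of the two splits (the exact numbers below assume the default 25% test size of train_test_split):

print(X_train.shape, X_test.shape)   # e.g. (379, 13) and (127, 13)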

The Boston dataset is a small set composed of 506 samples and 13 features used for regression problems. Let us import all the modules required throughout this tutorial:

from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

The pipeline we are going to set up is composed of the following tasks:

  1. Data Normalization: in this tutorial we have selected three different normalization methods, including the QuantileTransformer (check out the documentation).
  2. Dimensionality Reduction: we selected Principal Component Analysis (PCA) and a univariate feature selection algorithm as possible candidates.
  3. Regression: we apply a simple regularized linear method, although the method is easily extendable to other learning algorithms.

We begin by manually implementing a pipeline without any dedicated scikit-learn module, to highlight how much repetitive work is involved. We instantiate and initialize one object for every step of the pipeline:

scaler = StandardScaler()
pca = PCA()
ridge = Ridge()

Now, we chain the different components in a pipeline-like approach, by manually passing the training dataset to every step:

X_train_scaled = scaler.fit_transform(X_train)   # keep X_train untouched for the pipeline version below
X_train_reduced = pca.fit_transform(X_train_scaled)
ridge.fit(X_train_reduced, y_train)

Quite repetitive, isn't it? Let's see how the same can be accomplished with a scikit-learn pipeline object:

from sklearn.pipeline import Pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce_dim', PCA()),
        ('regressor', Ridge())
        ])

The pipeline is just an ordered list of elements, each with a name and a corresponding object instance. The pipeline module leverages the common interface that every scikit-learn estimator must implement: methods such as fit, transform and predict.
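
As an illustration of how little this interface demands, here is a minimal, hypothetical pass-through transformer; since it exposes fit and transform, it could be used as a drop-in step of the pipeline:

from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical do-nothing step: fit learns nothing, transform returns the data unchanged
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X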

Given the pipeline just created, it is possible to train and test it with just a couple of commands:

pipe = pipe.fit(X_train, y_train)
print('Testing score: ', pipe.score(X_test, y_test))

It is also possible to index the pipeline to access a specific element and retrieve a single value, for example the explained variance in the PCA step:

print(pipe.steps[1][1].explained_variance_)

[ 6.17666461 1.40357729 1.22791087 0.89037592 0.84781455 0.65543078 0.4911068 0.40790576 0.27463223 0.21616899 0.20742042 0.16826568 0.06711765]
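
Equivalently, each step can be retrieved by name through the named_steps attribute, which avoids relying on positional indices:

print(pipe.named_steps['reduce_dim'].explained_variance_)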

The following picture illustrates both the training and testing data flow within the pipeline structure (copyright by Sebastian Raschka):

[Figure: pipeline data flow during training and testing]

On every object within the pipeline, fit_transform is invoked during training, while transform (or, for the final estimator, predict) is called during testing. So far, using pipelines is just a matter of code cleanliness and conciseness.
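
To make this data flow concrete, here is a small sketch (reusing the scaler, pca and ridge objects fitted manually above) showing that predicting with the pipeline is equivalent to chaining the test-time transforms by hand:

X_test_scaled = scaler.transform(X_test)        # transform, not fit_transform, at test time
X_test_reduced = pca.transform(X_test_scaled)
y_pred_manual = ridge.predict(X_test_reduced)

y_pred_pipe = pipe.predict(X_test)              # the pipeline performs the same chain internally

Now let's jump into hyper-parameter tuning.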

Pipeline Tuning (Base Version)

Hyper-parameters are parameters that are not learned directly from the training data; instead, they are set beforehand and tuned, for example through a grid search against a validation set, to maximize model performance.
Let's start with a trivial example, where we aim to optimize the number of components selected by the PCA and the regularization factor of the linear regression model. If you are not familiar with the GridSearchCV module in sklearn, this is the right moment to read the official tutorial about this module.

Concerning PCA, we want to evaluate how performance varies with the number of components, from 1 to 10:

import numpy as np
n_features_to_test = np.arange(1, 11)

As for the regularization factor, we consider an exponential range of values (as suggested in the aforementioned tutorial):

alpha_to_test = 2.0**np.arange(-6, +6)

Note that the two parameters are correlated and should be optimized in combination. That is, a variation in the number of PCA components might call for a different regularization factor, and vice versa. Therefore, it is important to evaluate all their possible combinations, and this is where the pipeline module really supports us. First of all, we define a dictionary with all the parameters we would like to combine in the evaluation:

params = {'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test}

It is worth remarking on the convention adopted to name the parameters: the name of the pipeline step, followed by a double underscore (__), and finally the name of the parameter within the step.
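
If you are unsure about the exact keys the pipeline accepts, every valid parameter name can be listed with get_params:

print(sorted(pipe.get_params().keys()))

The optimization is invoked as follows: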

from sklearn.model_selection import GridSearchCV
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))
In[*]: gridsearch.best_params_
Out[*]: {'reduce_dim__n_components': 8, 'regressor__alpha': 8.0}
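
Since GridSearchCV refits the best configuration on the whole training set by default (refit=True), the tuned pipeline is also directly available for further use:

best_pipe = gridsearch.best_estimator_
print(best_pipe.score(X_test, y_test))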

In the next section we show how to automatically select the best performing algorithms to adopt in the pipeline.

Pipeline Tuning (Advanced Version)

So far we selected a range of values for every parameter to be optimized. We can follow the same approach, this time to decide which algorithm we should use, for example, to perform data normalization:

scalers_to_test = [StandardScaler(), RobustScaler(), QuantileTransformer()]

The intuition behind this is to treat the choice of algorithm as just another hyper-parameter, one with three categorical alternatives, one per candidate. Thanks to the pipeline module, we can add this new hyper-parameter to the same grid search:

params = {'scaler': scalers_to_test,
          'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test}

The second and third entries follow the aforementioned naming convention, identifying a specific parameter within a step, while the first entry this time addresses a whole step: its value is the list of candidate objects for the 'scaler' step. In theory, we could apply the same approach to the dimensionality reduction step, for example to choose between PCA and SelectKBest. The only problem is that PCA exposes a parameter named n_components, while SelectKBest requires a parameter named k.

Luckily, GridSearchCV also accepts a list of parameter dictionaries, which solves this issue as well:

params = [
        {'scaler': scalers_to_test,
         'reduce_dim': [PCA()],
         'reduce_dim__n_components': n_features_to_test,
         'regressor__alpha': alpha_to_test},

        {'scaler': scalers_to_test,
         'reduce_dim': [SelectKBest(f_regression)],
         'reduce_dim__k': n_features_to_test,
         'regressor__alpha': alpha_to_test}
        ]

We can then launch our grid search again:

gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))

In our example, we ended up selecting a robust scaler, a 9-component PCA, and a ridge regression with alpha = 8.0:

In[*]: gridsearch.best_params_
Out[*]: 
{'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=9, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'reduce_dim__n_components': 9,
 'regressor__alpha': 8.0,
 'scaler': RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
        with_scaling=True)}

Needless to say, such a small dataset is hardly a realistic use case, but the same approach can easily be applied to more complex problems. When the overall number of hyper-parameter combinations is very high, we might need to replace the exhaustive grid search with a different optimization method, e.g. a randomized search (sketched below).
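
As a sketch of what that could look like, the following swaps GridSearchCV for RandomizedSearchCV, sampling a fixed number of configurations from the single-dictionary parameter grid of the base version (n_iter=20 is an arbitrary choice for illustration):

from sklearn.model_selection import RandomizedSearchCV

random_params = {'scaler': scalers_to_test,
                 'reduce_dim__n_components': n_features_to_test,
                 'regressor__alpha': alpha_to_test}

randomsearch = RandomizedSearchCV(pipe, random_params, n_iter=20,
                                  verbose=1).fit(X_train, y_train)
print('Final score is: ', randomsearch.score(X_test, y_test))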


If you liked this post and you would like to keep in touch with our activities, you can become a member of the Italian Association for Machine Learning, or follow us on Facebook or LinkedIn.
