When I started building models in Sklearn, I would break each pre-processing step into its own cell or chunk of code. This was a good way to start because the steps were easy to break down into readable chunks. However, while this was easier to read, it lacked repeatability. Each time you presented your model with new data, you had to run through all the steps to transform that data before running the model on it, which presented many problems. For example, the dimensionality of your data can change if One Hot Encoding creates new columns.
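To make the dimensionality problem concrete, here is a minimal sketch. It assumes the encoding is re-run separately on each dataset, and it uses pandas get_dummies as a stand-in for that manual step. The column name and values are hypothetical toy data, not from a real project.

```python
import pandas as pd

# Hypothetical toy data: training has three colors, the new batch has two,
# and one of them ("purple") never appeared during training.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["red", "purple"]})

# Encoding each frame independently produces different sets of columns.
train_encoded = pd.get_dummies(train)
new_encoded = pd.get_dummies(new)

print(list(train_encoded.columns))
print(list(new_encoded.columns))
```

A model trained on the first set of columns cannot score the second without extra work to realign them.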
The Answer? Pipelines
Anatomy of a Pipeline
First, as usual, the imports needed to run this script.
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
Next is a function for making a column transformer. I prefer to call this via a function to make the code more reusable.
The column transformer allows you to combine any number of pre-processing steps into a single transformer. In the example below, we have a MinMaxScaler for numeric columns and a OneHotEncoder for categorical values. You could include any transformer from Sklearn in these steps.
A good example of this from the Sklearn documentation: Column Transformer with Mixed Types
In addition to demonstrating both numeric and categorical columns, this shows the column selector, which allows you to select columns based on different criteria. The numeric columns are selected via selector(dtype_exclude="object"). The categorical columns are selected by name, using a simple Python list. You can combine any of these selection styles with the different transformers you supply. Additionally, you name your transformers, such as num and cat below, so you can identify them later in your fitted model.
def make_coltrans():
    column_trans = ColumnTransformer(
        transformers=[
            ('num', MinMaxScaler(), selector(dtype_exclude="object")),
            ('cat', OneHotEncoder(dtype='int', handle_unknown='ignore'),
             ['CAT_FIELD_ONE', 'CAT_FIELD_TWO']),
        ],
        remainder='drop')
    return column_trans
Creating the pipeline is now a very simple step after creating the column transformer. All you need to do is order the pipeline sequence based on the logical ordering of the steps you would normally take. Here we have two steps, the column transformer and the classifier. As with the column transformer, we name the steps, such as prep and clf below.
def create_pipe(clf):
    '''Create a pipeline for a given classifier.

    The classifier needs to be an instance of the classifier
    with all parameters needed specified.'''
    # Each pipeline uses the same column transformer.
    column_trans = make_coltrans()
    pipeline = Pipeline([('prep', column_trans), ('clf', clf)])
    return pipeline
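The step names are also how you reach into a pipeline afterwards. The sketch below is a simplified stand-in, not the pipeline above: a plain MinMaxScaler takes the place of the full column transformer so the example needs no data. It retrieves a step by name and sets one of its parameters using the step name, a double underscore, and the parameter name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Simplified stand-in: a scaler replaces the full column transformer.
pipeline = Pipeline([
    ("prep", MinMaxScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Retrieve a step by the name it was given.
clf_step = pipeline.named_steps["clf"]

# Set a parameter on a named step: <step name>__<parameter name>.
pipeline.set_params(clf__n_estimators=200)
print(pipeline.named_steps["clf"].n_estimators)
```

The same naming convention is what a parameter grid would use if you later tune the pipeline as a whole.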
Creating and Fitting the Model
Finally, we can create an instance of the classifier and pass it to our function above that creates the pipeline.
# Create the classifier instance and build the pipeline.
clf = RandomForestClassifier(random_state=42, class_weight='balanced')
pipeline = create_pipe(clf)

# Fit the model to the training data.
pipeline.fit(X_train, y_train)
The above demonstrates the simplicity of setting up a pipeline. The first time you walk through this, it can seem a little confusing compared with doing each step independently. The benefit is that each time you want to apply this to a new model, or, even better, run new data against your fitted model, all of the transformations to the data will happen automatically.
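To see the automatic transformations at work, the sketch below fits a pipeline of the same shape on a tiny made-up dataset and then scores new rows that include a category never seen in training. The toy_frame helper, the NUM_FIELD column, and all values are hypothetical, and the categorical columns are forced to the object dtype so the dtype-based selector skips them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


def toy_frame(nums, cat_one, cat_two):
    """Build a small hypothetical frame with object-dtype categoricals."""
    return pd.DataFrame({
        "NUM_FIELD": nums,
        "CAT_FIELD_ONE": pd.Series(cat_one, dtype="object"),
        "CAT_FIELD_TWO": pd.Series(cat_two, dtype="object"),
    })


X_train = toy_frame([1.0, 2.0, 8.0, 9.0], ["a", "a", "b", "b"], ["x", "y", "x", "y"])
y_train = [0, 0, 1, 1]

# The new data contains "c", which the encoder never saw during fitting.
X_new = toy_frame([1.5, 8.5], ["a", "c"], ["x", "y"])

column_trans = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), selector(dtype_exclude="object")),
        ("cat", OneHotEncoder(dtype="int", handle_unknown="ignore"),
         ["CAT_FIELD_ONE", "CAT_FIELD_TWO"]),
    ],
    remainder="drop",
)
pipeline = Pipeline([
    ("prep", column_trans),
    ("clf", RandomForestClassifier(random_state=42, class_weight="balanced")),
])
pipeline.fit(X_train, y_train)

# One call scales, encodes, and predicts; no manual pre-processing of X_new.
predictions = pipeline.predict(X_new)
print(predictions)

# The transformed new data has the same number of columns as in training.
print(pipeline.named_steps["prep"].transform(X_new).shape)
```

Because the encoder was created with handle_unknown='ignore', the unseen category is encoded as all zeros for that field rather than raising an error or adding a column.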