Consider the task of chaining PCA and a regression model, where PCA performs dimensionality reduction and the regression model makes the prediction.
The example below is taken from the scikit-learn documentation:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)
param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
How can I perform dimensionality reduction only on a subset of my feature set using FunctionTransformer (for example, restrict PCA to the last ten columns of X_digits)?
A Kruger :
You can first create a function (called last_ten_columns below) that returns the last 10 columns of its input X. Then create a FunctionTransformer that wraps this function, and use it as the first step of the pipeline.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import FunctionTransformer

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()

def last_ten_columns(X):
    return X[:, -10:]

func_trans = FunctionTransformer(last_ten_columns)

pipe = Pipeline(steps=[('func_trans', func_trans), ('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

n_components = [5, 10]
Cs = np.logspace(-4, 4, 3)

param_grid = dict(pca__n_components=n_components, logistic__C=Cs)
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
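
Note that the pipeline above keeps only the last ten columns, so the remaining features never reach the classifier. If the goal is to reduce those ten columns while leaving the others untouched, one possible alternative (not part of the original example) is ColumnTransformer with remainder="passthrough". The sketch below is written under that assumption; the step names "reduce" and "pca" and the max_iter=1000 setting are illustrative choices, not requirements.

```python
import numpy as np
from sklearn import datasets, decomposition, linear_model
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = datasets.load_digits(return_X_y=True)
n_features = X_digits.shape[1]

# Integer positions of the last ten columns.
last_ten = list(range(n_features - 10, n_features))

# Apply PCA to the last ten columns only; pass the other columns through unchanged.
reduce_last_ten = ColumnTransformer(
    transformers=[("pca", decomposition.PCA(), last_ten)],
    remainder="passthrough",
)

pipe = Pipeline(steps=[
    ("reduce", reduce_last_ten),
    # max_iter is raised here only to reduce convergence warnings (an assumption).
    ("logistic", linear_model.LogisticRegression(max_iter=1000)),
])

# Nested parameter names follow the <step>__<transformer>__<param> pattern.
param_grid = {
    "reduce__pca__n_components": [5, 10],
    "logistic__C": np.logspace(-4, 4, 3),
}
estimator = GridSearchCV(pipe, param_grid)
estimator.fit(X_digits, y_digits)
print(estimator.best_params_)
```

With this layout the classifier sees the PCA components of the last ten columns followed by the untouched remaining columns, so the total feature count is n_components plus the number of passed-through columns.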
2018-11-30T17:15:29