# Motivation

Linear models are extremely fast to train and score, but don’t handle filter logic natively when segmenting populations. To segment N populations you might train N linear models, which gets operationally messy when N is large.

Tree-based models are often more accurate and natively handle filter logic in a single model, but are slower to train and score (I’m thinking of the state of the art here: random forests and boosted decision trees). Loosely speaking, they do this by branching on the binary segment features early, if those features turn out to be important, and effectively training independent trees downstream.

It is definitely possible to fake filter logic in a single linear model, but you have to engineer the feature space correctly. This is a proof of concept illustrating how to do it and how not to do it.

*(Hint: adding a simple binary feature indicating segment membership is not enough.)*

# What I’m about to do, and results

I’m going to mock data from two segments/clusters/categories. The predictors are continuous, and the response variable is continuous. I’ll train each segment individually with a linear model (ordinary least squares, or OLS) and a tree-based model (gradient boosted tree regression, or GBR). This will set the baselines. To be crystal clear, the goal is accurate out-of-sample predictions of the response variable, as measured by root-mean-square error (RMSE).

I’ll then try three strategies to trick the algorithms into giving me accurate predictions from just one model. The main idea is to manipulate the feature space. The following table illustrates the strategies and results.

Strategies and results:

| Strategy | Feature space | GBR RMSE | OLS RMSE |
| --- | --- | --- | --- |
| Two independent models | {x_seg_1} and {x_seg_2} | 0.212 | 0.193 |
| reshape1 | {x, is_segment_1} | 0.206 | 0.927 |
| reshape2 | {x, is_segment_1, is_segment_2} | 0.206 | 0.927 |
| reshape3 | {x_seg_1, is_segment_1, x_seg_2, is_segment_2} | 0.203 | 0.193 |

Here x is the continuous predictor variable, and “is_segment_1” and “is_segment_2” are binary features in {0, 1} indicating segment membership. The strategies are called “reshape” because the transformations can be accomplished with R’s reshape2 package. They can also be thought of as pivots, or as dummy coding and, in the last case, dummy coding with feature interactions.
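To make the table concrete, here is a minimal NumPy sketch of the three feature spaces, using a few made-up predictor values (two points per segment):

```python
import numpy as np

# Toy data: two points per segment (values are made up for illustration).
x_seg_1 = np.array([0.5, -0.3])
x_seg_2 = np.array([1.2, 0.8])

n1, n2 = len(x_seg_1), len(x_seg_2)
x = np.concatenate([x_seg_1, x_seg_2])
is_segment_1 = np.concatenate([np.ones(n1), np.zeros(n2)])
is_segment_2 = 1.0 - is_segment_1

# reshape1: {x, is_segment_1}
r1 = np.column_stack([x, is_segment_1])

# reshape2: {x, is_segment_1, is_segment_2}
r2 = np.column_stack([x, is_segment_1, is_segment_2])

# reshape3: {x_seg_1, is_segment_1, x_seg_2, is_segment_2}
# Each segment's predictor is an interaction: x times the segment indicator.
r3 = np.column_stack([x * is_segment_1, is_segment_1,
                      x * is_segment_2, is_segment_2])
```

Note that in reshape3 each row carries nonzero values only in its own segment’s columns.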

It’s very important to control the hidden y-intercept parameter in the OLS regression (`fit_intercept` in Scikit-learn). It’s worth playing with it to see how it changes the results, but ultimately you will find that it must be left free to fit for the two independent models, it should be fixed at zero for the final winning strategy, and it does not matter for the losing OLS strategies.
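One way to see why the intercept should be fixed at zero for the winning strategy: when both segment indicators are present, they sum to a constant column of ones, so an explicit intercept column is linearly redundant. A quick check with made-up design-matrix rows of the form [x, is_segment_1, is_segment_2]:

```python
import numpy as np

# Hypothetical rows: [x, is_segment_1, is_segment_2].
X = np.array([[ 0.5, 1.0, 0.0],
              [-0.3, 1.0, 0.0],
              [ 1.2, 0.0, 1.0],
              [ 0.8, 0.0, 1.0]])

# The two segment indicators already sum to a column of ones, so adding an
# explicit intercept column does not increase the column rank.
X_with_intercept = np.column_stack([np.ones(len(X)), X])
print(np.linalg.matrix_rank(X))                 # 3: full column rank
print(np.linalg.matrix_rank(X_with_intercept))  # still 3: the intercept is redundant
```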

# Python session proving the concept

## Preliminaries

I’ll use Scikit-learn regressors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
```

The following functions will be reused.

```python
def print_it(y_pred, y_actual, params):
    '''
    Print RMSE and any model parameters.
    '''
    rmse = mean_squared_error(y_actual, y_pred)
    print("RMSE: %f, params: %s" % (rmse, ', '.join(str(p) for p in params)))

def plot_it(xdat, ydat, xname, yname, filename=None):
    '''
    Scatter plot of x vs y, with a 1:1 line drawn through.
    '''
    fig, ax = plt.subplots()
    ax.scatter(xdat, ydat)
    ax.set_ylabel(yname)
    ax.set_xlabel(xname)
    lims = [
        np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
        np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
    ]
    ax.plot(lims, lims, 'k-', alpha=0.75, zorder=0)
    ax.set_aspect('equal')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    if filename is not None:
        plt.savefig(filename)  # save before show() so the figure isn't blank
    plt.show()

def model_it(x_train, y_train, x_eval, model):
    '''
    Train and score.
    '''
    model.fit(x_train, y_train)
    return model.predict(x_eval)

def pipeline(x_train, y_train, x_eval, y_eval, model, filename=None):
    '''
    Train, score, print and plot.
    '''
    y_pred = model_it(x_train, y_train, x_eval, model)
    params = []
    if hasattr(model, 'intercept_'):
        params.append(model.intercept_)
    if hasattr(model, 'coef_'):
        params.append(model.coef_)
    if hasattr(model, 'feature_importances_'):
        params.append(model.feature_importances_)
    print_it(y_pred, y_eval, params)
    plot_it(y_eval, y_pred, 'y actual', 'y predicted', filename=filename)
    return y_pred
```

## Generate mock data

Generate random data points from two different clusters, or segments: 500 training points per segment, plus a matching evaluation set.

```python
npts = 500
mean1 = [0, -5]
cov1 = [[1, 0.9], [0.9, 1]]
mean2 = [5, 10]
cov2 = [[1, -0.9], [-0.9, 1]]
x1_train, y1_train = np.random.multivariate_normal(mean1, cov1, npts).T
x1_eval,  y1_eval  = np.random.multivariate_normal(mean1, cov1, npts).T
x2_train, y2_train = np.random.multivariate_normal(mean2, cov2, npts).T
x2_eval,  y2_eval  = np.random.multivariate_normal(mean2, cov2, npts).T
```

This is what it looks like.

```python
plot_it(np.concatenate([x1_train, x2_train]), np.concatenate([y1_train, y2_train]), 'x', 'y')
```

A predictor variable x and a response variable y, drawn from two different multivariate normal distributions.

Transform the data to adhere to Scikit-learn’s input convention.

```python
x1_train = x1_train[np.newaxis].T
x1_eval  = x1_eval[np.newaxis].T
x2_train = x2_train[np.newaxis].T
x2_eval  = x2_eval[np.newaxis].T
```
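As a side note, the `[np.newaxis].T` transform just turns a 1-D array of shape `(n,)` into the 2-D `(n, 1)` column matrix Scikit-learn expects. A couple of common equivalents, sketched with made-up values:

```python
import numpy as np

x = np.array([0.5, -0.3, 1.2])  # shape (3,)

# Scikit-learn regressors expect a 2-D feature matrix of shape (n_samples, n_features).
a = x[np.newaxis].T    # the transform used above
b = x.reshape(-1, 1)   # a common equivalent
c = x[:, np.newaxis]   # another equivalent

assert a.shape == b.shape == c.shape == (3, 1)
```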

Here’s what the head and tail of the segment 1 predictors look like.

```python
x1_train[:10,:]
```

```
array([[ 0.55334004],
       [ 0.58077088],
       [ 0.45761136],
       [ 1.2671348 ],
       [-2.1250927 ],
       [-0.92971722],
       [ 0.26437481],
       [-0.21786847],
       [ 0.56024118],
       [ 0.01138431]])
```

```python
x1_train[-10:,:]
```

```
array([[ 0.321091  ],
       [-1.46003373],
       [ 0.54108937],
       [-1.20992139],
       [-0.84189187],
       [-0.64191648],
       [ 0.88482971],
       [ 0.39905366],
       [ 0.0056274 ],
       [-0.27271367]])
```

## Train a model for each segment

Train a linear model on the first segment.

```python
y1_pred = pipeline(x1_train, y1_train, x1_eval, y1_eval, LinearRegression())
```

```
RMSE: 0.207597, params: -5.02259822453, [ 0.87758793]
```

Linear model result from segment 1.

Train a linear model on the second segment.

```python
y2_pred = pipeline(x2_train, y2_train, x2_eval, y2_eval, LinearRegression())
```

```
RMSE: 0.178626, params: 9.98068042904, [-0.92247714]
```

Linear model result from segment 2.

Evaluate the union of the results.

```python
print_it(np.concatenate([y1_pred, y2_pred]), np.concatenate([y1_eval, y2_eval]), [])
plot_it(np.concatenate([y1_eval, y2_eval]), np.concatenate([y1_pred, y2_pred]), 'y actual', 'y predicted')
```

```
RMSE: 0.193112, params:
```

Union of linear model results from segments 1 and 2.

Now train two gradient boosting regressors. GBR on segment 1:

```python
y1_pred = pipeline(x1_train, y1_train, x1_eval, y1_eval, GradientBoostingRegressor())
```

```
RMSE: 0.223226, params: [ 1.]
```

GBR on segment 2:

```python
y2_pred = pipeline(x2_train, y2_train, x2_eval, y2_eval, GradientBoostingRegressor())
```

```
RMSE: 0.201560, params: [ 1.]
```

And evaluation on the union of the results.

```python
print_it(np.concatenate([y1_pred, y2_pred]), np.concatenate([y1_eval, y2_eval]), [])
plot_it(np.concatenate([y1_eval, y2_eval]), np.concatenate([y1_pred, y2_pred]), 'y actual', 'y predicted')
```

```
RMSE: 0.212393, params:
```

Union of gradient boosting regression results from segments 1 and 2.

## reshape1: add a feature that flags segment 1 membership

```python
def reshape1(x1, x2, y1, y2, n):
    '''
    Add a binary feature that flags segment 1 as 1 and segment 2 as 0.
    Append the segments together.
    '''
    a = np.zeros((n * 2, 2))
    a[0:n, 0:1] = x1
    a[n:2 * n, 0:1] = x2
    a[0:n, 1] = 1.0
    b = np.concatenate([y1, y2])
    return a, b

a_train, b_train = reshape1(x1_train, x2_train, y1_train, y2_train, npts)
a_eval,  b_eval  = reshape1(x1_eval,  x2_eval,  y1_eval,  y2_eval,  npts)
```

Here’s what it looks like.

```python
a_train[:10,:]
```

```
array([[ 0.55334004,  1.        ],
       [ 0.58077088,  1.        ],
       [ 0.45761136,  1.        ],
       [ 1.2671348 ,  1.        ],
       [-2.1250927 ,  1.        ],
       [-0.92971722,  1.        ],
       [ 0.26437481,  1.        ],
       [-0.21786847,  1.        ],
       [ 0.56024118,  1.        ],
       [ 0.01138431,  1.        ]])
```

```python
a_train[-10:,:]
```

```
array([[ 1.29078669,  0.        ],
       [ 0.54627936,  0.        ],
       [-0.11236272,  0.        ],
       [ 0.22546577,  0.        ],
       [-0.72594666,  0.        ],
       [-1.16852349,  0.        ],
       [-0.14354644,  0.        ],
       [ 0.29371461,  0.        ],
       [ 0.59094059,  0.        ],
       [-0.7981938 ,  0.        ]])
```

The gradient boosting regressor handles this fine.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, GradientBoostingRegressor())
```

```
RMSE: 0.205698, params: [ 0.49427717  0.50572283]
```

Result of a single gradient boosting regression model under the reshape1 feature engineering strategy.

Linear regression does not. It doesn’t matter if you turn hidden y-intercept fitting on or off.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, LinearRegression())
```

```
RMSE: 0.926650, params: 9.95478714523, [ -0.04020456 -14.97904361]
```

Result of a single linear regression model under the reshape1 feature engineering strategy.

## reshape2: add two features that positively indicate segment membership

```python
def reshape2(x1, x2, y1, y2, n):
    '''
    For each segment, add a binary feature positively indicating segment membership.
    '''
    a = np.zeros((n * 2, 3))
    a[0:n, 0:1] = x1
    a[0:n, 1] = 1.0
    a[n:2 * n, 0:1] = x2
    a[n:2 * n, 2] = 1.0
    b = np.concatenate([y1, y2])
    return a, b

a_train, b_train = reshape2(x1_train, x2_train, y1_train, y2_train, npts)
a_eval,  b_eval  = reshape2(x1_eval,  x2_eval,  y1_eval,  y2_eval,  npts)
```

Here’s what it looks like.

```python
a_train[:10,:]
```

```
array([[ 0.55334004,  1.        ,  0.        ],
       [ 0.58077088,  1.        ,  0.        ],
       [ 0.45761136,  1.        ,  0.        ],
       [ 1.2671348 ,  1.        ,  0.        ],
       [-2.1250927 ,  1.        ,  0.        ],
       [-0.92971722,  1.        ,  0.        ],
       [ 0.26437481,  1.        ,  0.        ],
       [-0.21786847,  1.        ,  0.        ],
       [ 0.56024118,  1.        ,  0.        ],
       [ 0.01138431,  1.        ,  0.        ]])
```

```python
a_train[-10:,:]
```

```
array([[ 1.29078669,  0.        ,  1.        ],
       [ 0.54627936,  0.        ,  1.        ],
       [-0.11236272,  0.        ,  1.        ],
       [ 0.22546577,  0.        ,  1.        ],
       [-0.72594666,  0.        ,  1.        ],
       [-1.16852349,  0.        ,  1.        ],
       [-0.14354644,  0.        ,  1.        ],
       [ 0.29371461,  0.        ,  1.        ],
       [ 0.59094059,  0.        ,  1.        ],
       [-0.7981938 ,  0.        ,  1.        ]])
```

The gradient boosting regressor handles this fine.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, GradientBoostingRegressor())
```

```
RMSE: 0.205690, params: [ 0.49456452  0.2443662   0.26106927]
```

Result of a single gradient boosting regression model under the reshape2 feature engineering strategy.

Linear regression does not. It doesn’t matter if you turn hidden y-intercept fitting on or off.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, LinearRegression())
```

```
RMSE: 0.926650, params: 0.0, [-0.04020456 -5.02425646  9.95478715]
```

Result of a single linear regression model under the reshape2 feature engineering strategy.

## reshape3: add two features that flag segments, and place predictors into their own dimensions

```python
def reshape3(x1, x2, y1, y2, n):
    '''
    For each segment, add a binary feature flagging segment membership, and
    break out the continuous predictors into their own dimensions.
    '''
    a = np.zeros((n * 2, 4))
    a[0:n, 0:1] = x1
    a[0:n, 1] = 1.0
    a[n:2 * n, 2:3] = x2
    a[n:2 * n, 3] = 1.0
    b = np.concatenate([y1, y2])
    return a, b

a_train, b_train = reshape3(x1_train, x2_train, y1_train, y2_train, npts)
a_eval,  b_eval  = reshape3(x1_eval,  x2_eval,  y1_eval,  y2_eval,  npts)
```

```python
a_train[:10,:]
```

```
array([[ 0.55334004,  1.        ,  0.        ,  0.        ],
       [ 0.58077088,  1.        ,  0.        ,  0.        ],
       [ 0.45761136,  1.        ,  0.        ,  0.        ],
       [ 1.2671348 ,  1.        ,  0.        ,  0.        ],
       [-2.1250927 ,  1.        ,  0.        ,  0.        ],
       [-0.92971722,  1.        ,  0.        ,  0.        ],
       [ 0.26437481,  1.        ,  0.        ,  0.        ],
       [-0.21786847,  1.        ,  0.        ,  0.        ],
       [ 0.56024118,  1.        ,  0.        ,  0.        ],
       [ 0.01138431,  1.        ,  0.        ,  0.        ]])
```

```python
a_train[-10:,:]
```

```
array([[ 0.        ,  0.        ,  1.29078669,  1.        ],
       [ 0.        ,  0.        ,  0.54627936,  1.        ],
       [ 0.        ,  0.        , -0.11236272,  1.        ],
       [ 0.        ,  0.        ,  0.22546577,  1.        ],
       [ 0.        ,  0.        , -0.72594666,  1.        ],
       [ 0.        ,  0.        , -1.16852349,  1.        ],
       [ 0.        ,  0.        , -0.14354644,  1.        ],
       [ 0.        ,  0.        ,  0.29371461,  1.        ],
       [ 0.        ,  0.        ,  0.59094059,  1.        ],
       [ 0.        ,  0.        , -0.7981938 ,  1.        ]])
```

The gradient boosting regressor still handles it well.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, GradientBoostingRegressor())
```

```
RMSE: 0.202675, params: [ 0.23502952  0.22039532  0.27983743  0.26473773]
```

Result of a single gradient boosting regression model under the reshape3 feature engineering strategy.

And now the linear regressor handles it, too. It makes the most sense to turn hidden y-intercept fitting off because the two binary features are effectively achieving the same thing.

```python
b_pred = pipeline(a_train, b_train, a_eval, b_eval, LinearRegression(fit_intercept=False))
```

```
RMSE: 0.193112, params: 0.0, [ 0.87758793 -5.02259822 -0.92247714  9.98068043]
```

Result of a single linear regression model under the reshape3 feature engineering strategy.

# Why things came out this way

Decision trees are naturally suited to segmentation, which is one reason they are so powerful. Linear regression is not because the model is a simple linear combination of predictors — there is no branching.

What linear models do with a binary feature is effectively fit a constant offset wherever that feature is 1, which amounts to a y-intercept for that segment. If you keep the continuous predictor in a single dimension, as in the reshape1 and reshape2 strategies, you allow the model to fit only one slope parameter. That works only if the two segments' predictors each follow joint distributions with the response variable that happen to have the same slope, and that slope lies along the line connecting the segment means. This is a very strong assumption and will typically lead to a poor fit.

When you feed linear models multiple binary features, you’re letting the regressor fit multiple y-intercepts, one wherever each feature is 1. To fit multiple slopes, you have to break out the continuous predictor into independent dimensions, which is what I did in reshape3. This is in essence a feature interaction of the dummy codes and the continuous predictor. There are at least a couple of ways to accomplish this, and I’ve only illustrated one.
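For instance, one alternative to the zero-filled block layout is to build the same columns explicitly as interactions of the continuous predictor with the segment indicators. A minimal sketch with made-up values, showing that the two routes produce an identical feature space:

```python
import numpy as np

# Made-up toy inputs; in the session above these are x1_train, x2_train, etc.
x1 = np.array([[0.5], [-0.3]])
x2 = np.array([[1.2], [0.8]])

# Route 1 (as in reshape3): zero-fill each segment's predictor into its own column.
a_blocked = np.zeros((4, 4))
a_blocked[:2, 0:1] = x1
a_blocked[:2, 1] = 1.0
a_blocked[2:, 2:3] = x2
a_blocked[2:, 3] = 1.0

# Route 2: build the same columns as explicit interactions x * indicator.
x = np.concatenate([x1, x2]).ravel()
is1 = np.array([1.0, 1.0, 0.0, 0.0])
is2 = 1.0 - is1
a_interact = np.column_stack([x * is1, is1, x * is2, is2])

assert np.allclose(a_blocked, a_interact)  # identical feature spaces
```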

I believe there is a widespread misconception among data science practitioners that you can throw any binary feature at machine learning models, and I think it comes from a specific analytic exercise of testing the statistical significance of categorical variables via t-tests under linear models, possibly conflated with the excellent performance of tree-based models under any encoding. Maybe a statistician can weigh in. It’s pretty clear when you think about it that dummy coding alone in linear modeling does not give the most accurate possible predictions in a typical predictive analysis application, though.

# Ideas to take away

- Linear and gradient boosting regressors perform fine when trained on the two segments individually.
- Traditional dummy coding is sufficient for tree-based methods when the goal is maximally accurate out-of-sample predictions. Multiple coding strategies work equally well.
- Traditional dummy coding is not sufficient in linear regression when the goal is maximally accurate out-of-sample predictions using one model. The solution is dummy coding followed by interactions of the segment membership features and the continuous predictor.
- Beware the hidden y-intercept parameter.

Here are some more things to think about.

- The winning strategy (reshape3) will likely not work as desired with naive regularization.
- Dimensionality can blow up due to the interactions, which would generally necessitate L2 regularization even more.
- Interactions with binary features are particularly amenable to sparse vector representations. There are many zeros.
- The winning strategy should work well with streaming and mini-batch training, so long as mini-batches are restricted to individual segments.
- You can fake the same result by training N linear models and appending their fitted coefficients and intercepts into a single weight vector, rather than appending the input data. You’d still have to apply the reshape3 preprocessing step to the vectors you’re scoring, which in a sparse representation amounts to renaming feature dimensions, but you would be able to regularize each segment individually without busting the whole idea.
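That last idea can be sketched as follows. This is a hypothetical toy version, not the session code above: train one OLS model per segment, append their coefficients and intercepts into a single reshape3-style weight vector, and score reshape3-encoded rows against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Toy segments (illustrative stand-ins for the mock data above).
x1 = rng.randn(50, 1); y1 = 0.9 * x1.ravel() - 5 + 0.1 * rng.randn(50)
x2 = rng.randn(50, 1); y2 = -0.9 * x2.ravel() + 10 + 0.1 * rng.randn(50)

# Train one model per segment.
m1 = LinearRegression().fit(x1, y1)
m2 = LinearRegression().fit(x2, y2)

# Append coefficients and intercepts into a single reshape3-style weight vector;
# the columns are [x_seg_1, is_segment_1, x_seg_2, is_segment_2].
w = np.array([m1.coef_[0], m1.intercept_, m2.coef_[0], m2.intercept_])

# Score a segment-1 point and a segment-2 point in reshape3 encoding.
row1 = np.array([0.7, 1.0, 0.0, 0.0])
row2 = np.array([0.0, 0.0, 0.7, 1.0])
assert np.isclose(row1 @ w, m1.predict([[0.7]])[0])
assert np.isclose(row2 @ w, m2.predict([[0.7]])[0])
```

Because the dot product only touches a segment’s own columns, the combined vector reproduces each per-segment model exactly.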