Auto prepend 4 spaces for Markdown code blocks

It becomes very cumbersome very quickly if you write Markdown and want to copy large code blocks into the source. Here’s a shortcut to prepend 4 spaces in just a couple steps for Mac users.

Step 1: Create an appify script.

Following these instructions, create appify with the following contents and make it executable.

#!/usr/bin/env bash

APPNAME=${2:-$(basename "$1" ".sh")}
DIR="$APPNAME.app/Contents/MacOS"

if [ -a "$APPNAME.app" ]; then
  echo "$PWD/$APPNAME.app already exists :("
  exit 1
fi

mkdir -p "$DIR"
cp "$1" "$DIR/$APPNAME"
chmod +x "$DIR/$APPNAME"

echo "$PWD/$APPNAME.app"

Step 2: Create a pb2md script.

Create pb2md.sh with the following contents and make it executable.

#!/usr/bin/env bash

pbpaste | awk '{print "    "$0}' | pbcopy

Step 3: Appify it.

appify pb2md.sh
mv pb2md.app /Applications

That’s it.

Now you can copy large blocks of code into your clipboard (pasteboard), run pb2md on the command line or as an app, then paste into your Markdown source.

Guaranteeing k samples in streaming sampling without replacement


When doing streaming sampling without replacement of a finite data set of known size N in Pig, you’d do

data = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE data p;

for some number p between 0 and 1. If you need at least k samples use

p = \frac{1}{N}[k + \frac{1}{2} z^2 + \frac{1}{2}\sqrt{z^2(4k + z^2)}].

The table below helps in choosing z. It reads like this: setting z to the specified value guarantees you’ll get at least k about CL of the time for any k larger than 20 or so; for smaller k you’ll do even better.

Choosing z.
z CL
0 \geq 50\%
1 \geq 84\%
2 \geq 98\%
3 \geq 99.9\%
4 \geq 99.997\%
5 \geq 99.99997\%
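
The rule and the table are easy to bundle into a helper. Here's a minimal Python sketch (function names are mine):

```python
import math

def lam(k, z):
    # rate such that k sits z standard deviations below the Poisson mean
    return k + 0.5 * z**2 + 0.5 * math.sqrt(z**2 * (4 * k + z**2))

def emit_probability(k, n, z=3):
    # per-row emit probability; with z=3 you get at least k about 99.9% of the time
    return lam(k, z) / n
```

For k = 100 and z = 1, lam(100, 1) is about 110.5, so for N = 1000 you'd emit with p of about 0.11.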

Problem statement

You have a data set of finite size N and you want k random samples without replacement. A computationally efficient procedure is to stream through the data and emit entries with some probability p by generating a uniform random number at each entry and emitting the entry if that number is \leq p. Here’s what I just said in pseudocode:

given table of size N
given desired number of samples k
let p <= k/N
for each row in table
  if rand_unif() ≤ p
    emit row
  end if
end for
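
In Python the same pseudocode is a few lines (a sketch; the generator stands in for the row stream):

```python
import random

def sample_stream(rows, p, seed=None):
    # one streaming pass: emit each row independently with probability p
    rng = random.Random(seed)
    for row in rows:
        if rng.random() <= p:
            yield row

# expect roughly p*N rows out: here, about 100
sample = list(sample_stream(range(10000), 0.01, seed=42))
```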

Pig provides this functionality with

data = LOAD 'data' AS (f1:int,f2:int,f3:int);
X = SAMPLE data 0.001;

Hive provides a TABLESAMPLE API,

SELECT * FROM data TABLESAMPLE(0.1 PERCENT) s;

but it behaves differently as it gives you a 0.1% size block or more of the table. To replicate the pseudocode behavior maybe you’d do

SELECT * FROM data WHERE RAND() < 0.001;

These queries are probabilistic insofar as the size of the output is size k = 0.001 N only on average.

I got to thinking about what happens when you place stronger requirements on the final sample size. I’ll touch on three requirements.

Requirement 1. Produce approximately k samples.

Requirement 2. Produce at least k samples.

Requirement 3. Produce exactly k samples.

My assumptions going into this are

  1. The data set is static, large, distributed and finite.
  2. Its size is known up front.
  3. It may be ordered or not.
  4. There is no key (say, a user hash, or rank) available for me to use, well distributed or otherwise.

Requirement 1: approximately k

In this case you could use p = k/N. On average you get k.

Requirement 2: at least k

At least k can be guaranteed in one or sometimes (rarely) more passes. I’ll develop a model for the probability of generating at least k in one pass. To strictly guarantee at least k you would have to check the final sample size and, if it undershoots, make another pass through the data, but you can tune the probability such that the chance of this happening is very low.

Random sampling of a stream as a Poisson process

The probabilistic emission of rows in the pseudocode above can be modeled as a Poisson process. The number of events, or points emitted, over the entire data set follows a Poisson distribution. The Poisson distribution, Poisson(\lambda), is fully described by a single parameter \lambda, which is the mean rate of events over some fixed amount of time, or in our case, the mean number of samples over one pass. That is, \lambda = pN in one pass over the full data set.

The expectation of the number of samples X is \mathrm{E}(X) = \lambda, and the variance is \mathrm{Var}(X) = \lambda.

Large \lambda

When \lambda is large, the central limit theorem kicks in and the Poisson distribution converges to a normal distribution with mean \lambda and variance \lambda. \lambda can be as small as 20 for N(\lambda,\lambda) to be a good approximation. This is convenient because all of the usual Gaussian statistics can be applied.

The number of times you get at least k samples is described by a one-sided z-statistic and can be read off of a standard z-score table. z is the number of standard deviations from the mean. The probability of getting at least k samples is CL = 84\% at z = 1. Here are four such sets of useful values, with illustrations.

z table: area under N(\lambda,\lambda) above k = \lambda - z\sqrt{\lambda}.
z CL Illustration
0 50.0%  cl50
1 84.1%  cl84
2 97.7%  cl98
3 99.9%  cl99_9

To guarantee at least k samples in CL of your queries you’d choose p = \lambda(k,z)/N, where \lambda(k,z) solves \lambda - z\sqrt{\lambda} = k. In other words, you’d choose a rate p = \lambda/N such that k is z standard deviations below \lambda. The (useful) solution is \lambda(k,z) = k + \frac{1}{2} z^2 + \frac{1}{2}\sqrt{z^2(4k + z^2)}.
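
The closed form follows from the substitution u = \sqrt{\lambda}: the condition \lambda - z\sqrt{\lambda} = k becomes u^2 - zu - k = 0, whose positive root is u = \frac{1}{2}[z + \sqrt{z^2 + 4k}], and squaring it recovers \lambda(k,z) = k + \frac{1}{2} z^2 + \frac{1}{2}\sqrt{z^2(4k + z^2)}.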

Monte Carlo gut check

To prove to myself the math is right, I’ll run 8 parallel Monte Carlo simulations of 10,000 iterations each in bash.

function monte_carlo () {
  awk "$@" -v seed=$RANDOM 'BEGIN {
    srand(seed)
    lam = k + 0.5*z**2 + 0.5*sqrt(z**2*(4.0*k + z**2))
    p = lam/n_data
    for (j=1;j<=n_experiments;j++) {
        emit = 0
        for (i=1;i<=n_data;i++) {
            if (rand()<=p) { emit++ }
        }
        if (emit>=k) { n++ }
    }
    print seed,k,lam,p,n_data,n
  }' ;
}
export -f monte_carlo

parallel -N0 monte_carlo -v n_experiments=1e4 -v n_data=1e3 -v k=100 -v z=0 ::: 1 2 3 4 5 6 7 8

In one of the runs at CL = 84\% I got at least k = 100 samples 8,664 out of 10,000 times. In this case, \lambda = 110.5. Here’s a table of typical results at different z value settings.

Monte Carlo runs of 10,000 iterations each, setting k = 100.
z CL num \geq k \lambda
0 \geq 50\% 5,183 100.0
1 \geq 84\% 8,664 110.5
2 \geq 98\% 9,846 122.1
3 \geq 99.9\% 9,998 134.8
4 \geq 99.997\% 10,000 148.8

It’s clear from the simulations that the true confidence is slightly higher than advertised, but this is expected. There are two sources of bias: finite k in the central limit theorem approximation, and discreteness of the random variable at k. Given how I implemented the Monte Carlo, both push the true confidence higher, so the stated lower limit on confidence holds.

The “guarantee” of at least k in a single pass is a probabilistic one, and it implies that at, say, a CL = 99.9\% specification I would have to go over the data set a second time roughly 1 out of every (1-CL)^{-1} = 1,000 times that I undertook this whole exercise. At this specification the need to rerun is rare, but it will eventually happen. When it does, I’d have to go through the full data set again with a smaller p to get a fair random sample, specifically, I would reapply the same rule for a new \lambda(k\to k-k_1,z), where k_1 is the actual sample size that the first iteration yielded me. It is even rarer that I’d have to do a third pass at CL = 99.9\%. This case happens one in a million times.

It’s fairly obvious to me just thinking about it that it’s better to set CL high and try to do just one pass than it is to set CL low and to do multiple passes. For example, if CL = 50\% (z = 0) then nearly half the time I’d be rerunning twice or more times to build up a fair sample. Passes over big data are expensive as it is, so it’s better to eat k_1 - k too many samples in one pass than to have to do additional passes on the data.
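
The multi-pass fallback is mechanical enough to sketch in Python (assuming the \lambda(k,z) rule above; names are mine):

```python
import math
import random

def lam(k, z):
    # rate such that k sits z standard deviations below the Poisson mean
    return k + 0.5 * z**2 + 0.5 * math.sqrt(z**2 * (4 * k + z**2))

def at_least_k(rows, k, z=3, seed=0):
    # pass over the rows emitting each with probability p = lam(shortfall, z)/N;
    # on the rare undershoot, re-pass over the unsampled rows with the rule
    # re-applied to the shortfall k - k_1
    rng = random.Random(seed)
    remaining = list(rows)
    sample = []
    while len(sample) < k and remaining:
        shortfall = k - len(sample)
        p = min(1.0, lam(shortfall, z) / len(remaining))
        kept = []
        for row in remaining:
            if rng.random() <= p:
                sample.append(row)
            else:
                kept.append(row)
        remaining = kept  # without replacement across passes
    return sample
```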

Requirement 3: exactly k

This case is slightly tricky. You’ll first run a mildly modified version of the procedure above that gives you at least k. You’ll then have to go through the random samples themselves and pick out exactly k at random.

I’ll punt on this one since in this post I was primarily after the rule on emit probability anyway.

Other approaches

There are other ways to do the sampling if you have a rank or a well distributed key available in the table. If you don’t you could produce a key in a pass and then sample in a 2nd pass… but this looks a lot like running rand_unif() for each row, like we’ve already done.

Parallel cardinality on your laptop

This post is about showing your laptop who’s boss.  It contains:

  1. Parallel grep.
  2. Parallel cardinality.
  3. A cool set union command line snippet.
  4. A cool set intersection command line snippet.
  5. >3x speedups on everyday Unix commands useful to data scientists.
  6. A map-reduce command line paradigm that’s one better than the usual cat | map | sort | reduce > data.out.

Parallel grep

Say I want to know how many times feature 15577606 appears in the kddb training set (19,264,097 lines) from my last post. A standard grep takes me 1m 39s to return the 199 lines that contain that feature.

grep 15577606 kddb | wc -l

The GNU parallel utility gives me a nice 3.1x speedup at 32s using multiple threads.

cat kddb | parallel --block 10M --pipe grep 15577606 | wc -l

Conceptually this is map-reduce. parallel partitions out --block size blocks of lines to multiple threads running what you provide as the --pipe argument. This is the map stage. The last shell pipe collects all map outputs and reduces them in a final aggregation operation with wc. There’s no strict guarantee of the order data comes out of the parallel stage.
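
The same partition/map/reduce shape can be mimicked in Python with a process pool. This is a sketch of the paradigm, not of GNU parallel itself, and the names are mine:

```python
from multiprocessing import Pool

def count_matches(args):
    # map stage: count matching lines in one block
    block, needle = args
    return sum(1 for line in block if needle in line)

def parallel_grep_count(lines, needle, workers=4, block_size=1000):
    # partition the lines into blocks, map over them in parallel,
    # then reduce by summing the per-block counts (the wc -l stage)
    blocks = [(lines[i:i + block_size], needle)
              for i in range(0, len(lines), block_size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_matches, blocks))
```

The per-block counts are computed in worker processes and summed in the parent, just like the final shell pipe.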

You know it’s working when you burn a hole through your table/jeans, and/or you can’t hear your own thoughts over the noise of the CPU fan. Despite that, this is one million times better than the nohup command & nonsense we are all guilty of perpetrating.

Parallel feature cardinality

Now I’ll use this to do something awesome: count the occurrence of each of the O(10^7) features in the training file. In the map phase I’ll run “feature_count.awk,” which contains the following awk snippet.

#!/usr/bin/awk -f
# fields after the label look like feature:value
{ for (i=2;i<=NF;i++) { split($i,a,":"); n[a[1]]++ } }
END { for (i in n) { print i,n[i] } }

The reduce stage is a straightforward one liner that does almost the same thing. All together:

# training set feature cardinality
cat kddb | \
 parallel --block 10M --pipe ./feature_count.awk | \
 awk '{n[$1]+=$2} END {for (i in n) {print i,n[i]}}' > kddb_feature_count.txt

In 5m34s I computed the exact cardinality of 28,875,157 unique features. In serial, this would have taken me more than 15m. I’ll use those extra 10 minutes to learn more about the data.

What I learned about the data

20,663,711 features (72%) appear just once. Off the top of my head, I wouldn’t expect machine learning algorithms to learn robust rules for these features, so I might cut them out entirely, which would likely provide real gains in training and scoring time.

Running the same cardinality pipe on the test set returns only 2,990,384 unique features there.

# test set feature cardinality
cat kddb.t | \
 parallel --block 10M --pipe ./feature_count.awk | \
 awk '{n[$1]+=$2} END {for (i in n) {print i,n[i]}}' > kddb.t_feature_count.txt

The superset of all features in the training and test sets is 29,890,095.

# set union for first column of two files
awk '
!($1 in n) {m++;n[$1]} 
END {print m}
' kddb_feature_count.txt kddb.t_feature_count.txt

But only 1,975,446 test set features, about 7% of the training set’s features (and 66% of the test set’s), overlap with the training set features,

# set intersection for first column of two files
awk '
NR == FNR {n[$1]} 
NR > FNR && ($1 in n) {m++} 
END {print m}
' kddb_feature_count.txt kddb.t_feature_count.txt

and slightly fewer (1,888,445) have more than 1 appearance in the training set.

# set intersection plus filter for first column of two files
awk '
NR == FNR && $2 > 1 {n[$1]} 
NR > FNR && ($1 in n) {m++} 
END {print m}
' kddb_feature_count.txt kddb.t_feature_count.txt
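
The same three counts fall out of Python sets; a sketch operating on (feature, count) pairs as read from the two count files:

```python
def feature_set_stats(train_counts, test_counts):
    # train_counts, test_counts: iterables of (feature, count) pairs,
    # mirroring the two-column count files
    train = dict(train_counts)
    test = dict(test_counts)
    n_union = len(set(train) | set(test))
    overlap = set(train) & set(test)
    n_overlap_gt1 = sum(1 for f in overlap if train[f] > 1)
    return n_union, len(overlap), n_overlap_gt1
```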

I might want to pare down the feature set in the training data prior to any analysis, keeping in mind that this is only really valid for a static data set. In a real production setting, the safest assumption is that entity vectors arriving in our system will eventually contain features that span the full vector space, if we wait long enough.


Fast and lean ad hoc binary classifier evaluation


In this post I’m going to see if I can beat perf, which is a C++ utility I’ve used that computes precision, recall, AUC, etc.

A scenario where this would come up: you’re a data scientist who’s done binary classification on a fairly large data set in an ad hoc proof of concept analysis, and you want to compute the usual evaluation metrics as fast as possible. Should you use perf, or roll your own?

I’ll be exploring streaming computation of some useful metrics:

  1. Single-pass precision, recall, accuracy, lift and F1-score.
  2. Single-pass (plus a little) AUC.
  3. Single-pass RMSE.

Data, procedure and result

I grabbed the biggest data set in Chih-Jen Lin’s perpetually handy lib-SVM repository. It’s the KDD Cup 2010 Bridge to Algebra data set with the winning team’s best feature set. Big is good because I want an inefficient algorithm to hurt. Some stats about the data:

  • 19,264,097 training entities
  • 748,401 test entities
  • 29,890,095 dimensions

I count 30-ish nonzero features per entity on average, for a part in a million sparsity.

The procedure (laid out in detail in the Procedure section below) is to download the data, unzip it, train the training set, score the scoring set and then compute evaluation metrics. In this post I’m focused on the algorithm performance in evaluation metrics computation in that final stage.

Time-wise, it looks like I can beat perf in computing precision, recall, accuracy, lift, F1-score and RMSE. I can also beat it in AUC computation (a wholly different beast) as long as I can accept just 3 or 4 decimals of accuracy, which is usually more than enough. If I want another digit of AUC accuracy, all else being equal, I can eat just one more algebraic calculation per entity by using the trapezoidal rule rather than a Riemann sum.

I can always beat perf in evaluating 0.5 million entities or more. Without having dug into the perf source code, I do not know offhand why the developer imposed that upper limit.

And since I’m totally beating up on perf, I’ll also reveal that I can do all this with two orders of magnitude fewer lines of code. All my implementations are in gawk.

The full, end-to-end pipeline using the best algos takes 421s. Training is 37% of the total time (it takes more time to unzip than to train) and everything downstream from scoring to post-processing to evaluation takes < 2% of the total time. Here are the timings of each step of the analysis.

Timing at each step in the pipeline.
Step Time
Download training set 1m 21.510s
Download evaluation set 0m 5.281s
Unzip training set 2m 44.246s
Unzip evaluation set 0m 6.451s
Train training set 2m 35.673s
Score evaluation set 0m 4.764s
Paste and logistic-transform evaluation data 0m 1.975s
Subsample to 500,000 entities 0m 0.212s
Fixed threshold eval (perf, 500,000 entities) 0m 0.297s
Fixed threshold eval (awk, 500,000 entities) 0m 0.252s
Fixed threshold eval (perf, all entities) n/a (aborts)
Fixed threshold eval (awk, all entities) 0m 0.357s
AUC eval (perf, 500,000 entities) 0m 0.296s
AUC eval (awk, 500,000 entities, 3 decimals) 0m 0.278s
AUC eval (perf, all entities) n/a (aborts)
AUC eval (awk, all entities, 3 decimals) 0m 0.424s
RMSE eval (perf, 500,000 entities) 0m 0.263s
RMSE eval (awk, 500,000 entities) 0m 0.227s
RMSE eval (perf, all entities) n/a (aborts)
RMSE eval (awk, all entities) 0m 0.347s



bunzip2 kddb.bz2
bunzip2 kddb.t.bz2


Convert and stream to vw format and train.

awk '{$1=$1*2-1" |Features"; print $0}' kddb | \
 vw --loss_function logistic --final_regressor model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
final_regressor = model.vw
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.693147   0.693147            1         1.0   1.0000   0.0000       26
0.950634   1.208120            2         2.0  -1.0000   0.8532       23
0.764694   0.578754            4         4.0  -1.0000  -0.5043       25
0.826334   0.887974            8         8.0   1.0000  -0.1111       23
0.736578   0.646821           16        16.0   1.0000   0.7315       41
0.668554   0.600530           32        32.0   1.0000   0.9360       29
0.560276   0.451999           64        64.0   1.0000   2.2459       23
0.470294   0.380312          128       128.0   1.0000   2.1648       25
0.384321   0.298348          256       256.0  -1.0000   1.2838       32
0.338601   0.292881          512       512.0   1.0000   3.7020       35
0.353033   0.367465         1024      1024.0   1.0000  -0.2938       20
0.404202   0.455370         2048      2048.0  -1.0000  -0.0990       59
0.407307   0.410412         4096      4096.0  -1.0000   1.5638       35
0.412617   0.417928         8192      8192.0   1.0000   3.0536       21
0.375746   0.338875        16384     16384.0   1.0000   4.2012       35
0.390718   0.405689        32768     32768.0   1.0000   2.8355       32
0.360178   0.329639        65536     65536.0   1.0000   0.3467       34
0.354580   0.348982       131072    131072.0   1.0000   2.2873       38
0.343257   0.331934       262144    262144.0   1.0000   1.7774       47
0.339802   0.336347       524288    524288.0   1.0000   3.6705       36
0.345376   0.350950      1048576   1048576.0   1.0000   0.2121       26
0.323524   0.301671      2097152   2097152.0   1.0000   2.2170       35
0.315078   0.306633      4194304   4194304.0   1.0000  -0.1576       29
0.307970   0.300861      8388608   8388608.0   1.0000   3.3606       23
0.309194   0.310418     16777216  16777216.0  -1.0000  -0.8738       33

finished run
number of examples per pass = 19264097
passes used = 1
weighted example sum = 1.92641e+07
weighted label sum = 1.38952e+07
average loss = 0.31017
best constant = 0.721302
total feature number = 585609985


Convert to vw format and score.

time awk '{$1=$1*2-1" |Features"; print $0}' kddb.t | \
 vw --loss_function logistic --initial_regressor model.vw \
 -p kddb.t_scores.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
predictions = kddb.t_scores.txt
using no cache
Reading datafile = 
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.096423   0.096423            1         1.0   1.0000   2.2904       20
0.062938   0.029453            2         2.0   1.0000   3.5102       21
0.043892   0.024845            4         4.0   1.0000   3.6864       23
0.027345   0.010798            8         8.0   1.0000   4.3615       23
0.469994   0.912643           16        16.0  -1.0000  -0.5099       59
0.382650   0.295306           32        32.0   1.0000   4.6787       35
0.263003   0.143355           64        64.0   1.0000   5.2928       22
0.367205   0.471407          128       128.0   1.0000   3.9544       35
0.397651   0.428097          256       256.0   1.0000   5.3137       33
0.347004   0.296358          512       512.0  -1.0000   3.1846       35
0.324560   0.302116         1024      1024.0   1.0000   2.5141       20
0.329715   0.334869         2048      2048.0   1.0000   3.9218       25
0.341305   0.352895         4096      4096.0   1.0000   9.2050       23
0.318626   0.295948         8192      8192.0   1.0000   0.6735       35
0.302794   0.286961        16384     16384.0   1.0000   0.4491       35
0.306975   0.311156        32768     32768.0   1.0000   2.1338       25
0.310069   0.313162        65536     65536.0   1.0000   7.8721       23
0.297661   0.285254       131072    131072.0   1.0000   3.9228       25
0.291197   0.284732       262144    262144.0  -1.0000   3.4579       35
0.299229   0.307261       524288    524288.0   1.0000   4.9444       29

finished run
number of examples per pass = 748401
passes used = 1
weighted example sum = 748401
weighted label sum = 580347
average loss = 0.30118
best constant = 0.775449
total feature number = 22713476

Paste evaluation data

I’ll paste together the scores and the true classes for the downstream evaluation exercises. This is mainly for API compliance… strictly speaking you can skirt this step in a real production implementation as long as you’re continually associating entities with their true classes and their scores.

I’m going to apply the logistic transformation on the raw scores here so that scores are probabilities in [0,1].

paste -d' ' <(cut -d' ' -f 1 kddb.t) <(awk '{print 1/(1+exp(-$1))}' kddb.t_scores.txt) > kddb.t_eval.txt

Evaluate fixed-threshold metrics with perf

I’ll compute precision, recall, accuracy, lift and F1-score with perf.

perf -PRE -REC -ACC -LFT -PRF < kddb.t_eval.txt
Aborting.  Exceeded 500000 items.

Fail. I need to subsample down to < 500,000.

awk '
BEGIN { srand(42) }
rand() < 500000/748401 {
 print; if (++n >= 500000) { exit 0 }
}' kddb.t_eval.txt > kddb.t_eval.subsample.txt

Now evaluate with perf on the subsample.

perf -PRE -REC -ACC -LFT -PRF < kddb.t_eval.subsample.txt
ACC    0.88585   pred_thresh  0.500000
PRE    0.90768   pred_thresh  0.500000
REC    0.97002   pred_thresh  0.500000
PRF    0.93782   pred_thresh  0.500000
LFT    1.02285   pred_thresh  0.500000

Try to beat it with awk

When I first wrote this I did not think it would beat perf. At minimum I just thought I could evaluate on the full scoring set.

Here I run on the subsampled data.

awk -v OFS=$'\t' '
{
 if ($2 >= 0.5) {
  if ($1 > 0) { tp++ } else { fp++ }
 } else {
  if ($1 > 0) { fn++ } else { tn++ }
 }
}
END {
 n = tp + fp + tn + fn
 accuracy = (tp + tn)/n
 precision = tp/(tp + fp)
 recall = tp/(tp + fn)
 f1score = 2*precision*recall/(precision + recall)
 lift = precision/((tp + fn)/n)
 print "ACC",accuracy
 print "PRE",precision
 print "REC",recall
 print "PRF",f1score
 print "LFT",lift
}' kddb.t_eval.subsample.txt
ACC	0.885848
PRE	0.907684
REC	0.97002
PRF	0.937817
LFT	1.02285

Running on the full evaluation set works fine.

ACC	0.88651
PRE	0.908369
REC	0.970005
PRF	0.938176
LFT	1.02326
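
The same single-pass confusion-matrix arithmetic in Python, for reference (a sketch; lift is taken as precision over the base positive rate):

```python
def fixed_threshold_metrics(pairs, threshold=0.5):
    # one pass over (label, score) pairs; labels > 0 are positive
    tp = fp = tn = fn = 0
    for label, score in pairs:
        if score >= threshold:
            if label > 0: tp += 1
            else: fp += 1
        else:
            if label > 0: fn += 1
            else: tn += 1
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / n,
        "PRE": precision,
        "REC": recall,
        "PRF": 2 * precision * recall / (precision + recall),
        "LFT": precision / ((tp + fn) / n),  # lift over the positive prior
    }
```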

Evaluate AUC, a threshold-sweep metric, with perf

AUC, the area under the receiver operating characteristic curve, requires sweeping the threshold across the range of scores. A sweep is also needed in exercises where you want to look at any of the other metrics as a function of reach, for example, when maximizing some notion of return on investment.

Perf calls AUC “ROC,” and here it is on the subsampled data.

perf -ROC < kddb.t_eval.subsample.txt
ROC    0.79828

Again, it doesn’t want to compute AUC on the full evaluation set.

Try to beat it in awk

Naively, the sweep screams “sort,” which is annoying. Sorting the subsampled data takes more than 1s, so I can’t do that and expect to beat perf.

time sort -t' ' -g -r -k 2,2 kddb.t_eval.subsample.txt > /dev/null

Without looking into the perf source code, I’m just going to try to beat it another way.

I’ll bin the scores into 10^{\text{decimals}} decimal places, and count positive and negative instances per bin. This is effectively an empirical probability distribution, or a histogram. I’ll assume the scores are in [0,1], which is generically a strong assumption, but I know my data conforms to this in this case.

Here are the two passes:

  1. First I’ll pass over the full data set of size N, counting positive and negative instances by bin. For the benchmark set, N=500,000.
  2. The second pass will be over the 10^{\text{decimals}} bins, where I’ll compute
    1. the cumulative true-positives and false-positives, which is proportional to the respective cumulative distribution functions, and
    2. a streaming AUC as a discrete Riemann sum (imagine the ROC curve parameterized by score, not by the false-positive rate).

Here it is:

awk -v OFS=$'\t' -v decimals=3 '
BEGIN { max=10^decimals; min=1 }
{
 score_bin = int($2*(max - 1)) + 1
 if ($1 > 0) { pos[score_bin]++ } else { neg[score_bin]++ }
}
END {
 for (i = max; i >= min; i--) { ctp += pos[i]; auc += ctp*neg[i]; cfp += neg[i] }
 print "ROC",auc/(ctp*cfp)
}' kddb.t_eval.subsample.txt
ROC	0.799652

I’ll benchmark using the average of 10 runs after a few warm starts (this is how I do all the evaluation metric benchmarks).

for a in {1..3}; do perf -ROC < kddb.t_eval.subsample.txt > /dev/null; done
time for a in {1..10}; do perf -ROC < kddb.t_eval.subsample.txt; done
perf baseline is 270ms with AUC 0.79828
decimals AUC error time speed-up
1 0.877566 0.079286 247ms +23ms
2 0.810426 0.012146 252ms +18ms
3 0.799652 0.001372 248ms +22ms
4 0.798422 0.000142 280ms -10ms
5 0.798296 0.000016 376ms -106ms

So I can get 3 or 4 digits of accuracy in essentially the same time as perf. I cannot get close to perf at 5 digits of accuracy, which suggests to me I didn’t come up with the most efficient AUC algo, or perf is also doing an approximate AUC computation, or …?

In any case, here’s AUC on the full data set.

ROC 0.801368

I could deal with far more entities, but I would eventually run into integer overflows in the cardinality computations, in which case I would move toward parallel computation and/or efficient probabilistic data structures.

I tested the trapezoidal rule, which gave me another order of magnitude in agreement with perf with only a small increase of compute time. There’s one additional sum per entity.
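
Here is the binning idea in Python, with the trapezoidal correction as a flag. A sketch under the same scores-in-[0,1] assumption, not perf's algorithm:

```python
def binned_auc(pairs, decimals=3, trapezoid=False):
    # pass 1: histogram (label, score) pairs by score rounded to `decimals`;
    # assumes scores lie in [0, 1]
    nbins = 10 ** decimals
    pos = [0] * (nbins + 1)
    neg = [0] * (nbins + 1)
    for label, score in pairs:
        b = int(score * nbins)
        if label > 0:
            pos[b] += 1
        else:
            neg[b] += 1
    # pass 2: sweep bins from high score to low; every false positive entering
    # pairs against all true positives already seen above it
    auc = ctp = cfp = 0.0
    for b in range(nbins, -1, -1):
        if trapezoid:
            auc += neg[b] * (ctp + 0.5 * pos[b])  # split in-bin ties 50/50
        else:
            auc += neg[b] * (ctp + pos[b])        # Riemann sum: ties count as wins
        ctp += pos[b]
        cfp += neg[b]
    return auc / (ctp * cfp)
```

With trapezoid=False, in-bin ties count as wins, which is consistent with the low-decimals approximations overshooting the true AUC.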

Evaluate RMSE with perf

This one is not a threshold metric at all. I’m pursuing it because RMSE was the KDD Cup 2010 eval metric, but normally I would not use RMSE for a classification problem.

Here’s perf.

perf -RMS < kddb.t_eval.subsample.txt
RMS    0.29650

And rolling my own.

awk -v OFS=$'\t' '
{ d = $1 - $2; s1 += d; s2 += d*d }
END {
    print "RMS",sqrt((NR*s2 - s1 * s1)/(NR * (NR - 1)))
}' kddb.t_eval.subsample.txt
RMS	0.296497

And RMSE on the full data set.

RMS	0.295575
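
For reference, a one-pass RMSE in Python; here I accumulate the plain mean of squared residuals rather than the residual standard deviation the awk formula uses (a sketch):

```python
import math

def streaming_rmse(pairs):
    # one pass over (target, prediction) pairs, keeping only running sums
    n = 0
    s2 = 0.0
    for target, pred in pairs:
        d = target - pred
        s2 += d * d
        n += 1
    return math.sqrt(s2 / n)
```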

What did we learn?

Ok, a disclaimer: I admit it’s total BS to claim not to look under the hood in an exercise like this when there’s no mechanism to enforce that, but whatever, this was fun and I really didn’t, so let’s go with that. Peeking at the header of the perf source code now, I see a MAX_BINS global variable set to 1000, which is about the bin count where my timing equals that of perf, so maybe it is in fact doing an AUC approximation like mine.

But we learned how to stream binary evaluation metrics efficiently, and even one continuous metric. In this particular case I saved very little real time in terms of human productivity; however, for a much larger data set I now know how I might approach these computations in a parallel compute environment, where the time saved might be palpable. I can say from experience that this really does come up, and any time saved waiting for a cluster to finish a job I submitted is hugely valuable.

That said, this is not quite how you would want to implement these evaluation metrics in a true Big Data setting, but more on that in a future post.

Faking Filters in Linear Modeling: Proof of Concept


Linear models are extremely fast to train and score, but don’t handle filter logic natively when segmenting populations. To segment N populations you might train N linear models, which gets operationally messy when N is large.

Tree-based models are often more accurate and natively handle filter logic in a single model, but are slower to train and score (I’m thinking state of the art here: random forests and boosted decision trees).  Loosely speaking, they do this by branching on the binary segment features early, if they turn out to be important, and effectively training independent trees downstream.

It is definitely possible to fake filter logic in a single linear model, but you have to engineer the feature space correctly.  This is a proof of concept illustrating how to do it and how not to do it.

(Hint: adding a simple binary feature indicating segment membership is not enough.)

What I’m about to do, and results

I’m going to mock data from two segments/clusters/categories.  The predictors are continuous, and the response variable is continuous.  I’ll train each segment individually with a linear model (ordinary least squares, or OLS) and a tree-based model (gradient boosted tree regression, or GBR).  This will set the baselines.  To be crystal clear, the goal is accurate out-of-sample predictions of the response variable, as measured by root-mean-square error (RMSE).

I’ll then try three strategies to trick the algorithms into giving me accurate predictions from just one model.  The main idea is to manipulate the feature space.  The following table illustrates the strategies and results.

Strategies and results
Strategy Feature space GBR RMSE OLS RMSE
Two independent models {x_seg_1} and {x_seg_2} 0.212 0.193
reshape1 {x, is_segment_1} 0.206 0.927
reshape2 {x, is_segment_1, is_segment_2} 0.206 0.927
reshape3 {x_seg_1, is_segment_1, x_seg_2, is_segment_2} 0.203 0.193

The x is the continuous predictor variable, and the “is_segment_1” and “is_segment_2” variables are binary in {0,1} indicating segment membership.  The strategies are called “reshape” because you can accomplish the transformations using R’s reshape2 package.  They can also be thought of as pivots, or as dummy coding and, in the last case, feature interactions.

It’s very important to control the hidden y-intercept parameter in the OLS regression.  It’s worth playing around with that to see how it changes the results, but ultimately you will find that it needs to be left free to fit for the two independent models, it should be fixed at zero for the final winning strategy, and for the losing OLS strategies it does not matter.
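
The intercept claim is easy to verify on mock data: with the reshape3 feature space and the hidden intercept fixed at zero, a single OLS fit decouples exactly into the two per-segment fits. A sketch using numpy's least squares in place of scikit-learn (the segment parameters here are invented for the demo):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200

# two segments with different linear laws
x1 = rng.randn(n)
y1 = -5.0 + 0.9 * x1 + 0.1 * rng.randn(n)
x2 = rng.randn(n)
y2 = 10.0 - 0.9 * x2 + 0.1 * rng.randn(n)

# reshape3 design matrix: {x_seg_1, is_segment_1, x_seg_2, is_segment_2}
X = np.zeros((2 * n, 4))
X[:n, 0] = x1
X[:n, 1] = 1.0
X[n:, 2] = x2
X[n:, 3] = 1.0
y = np.concatenate([y1, y2])

# single OLS with no hidden intercept: it lives in the is_segment columns
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# baseline: independent per-segment fits, each with its own intercept
b1 = np.linalg.lstsq(np.c_[x1, np.ones(n)], y1, rcond=None)[0]
b2 = np.linalg.lstsq(np.c_[x2, np.ones(n)], y2, rcond=None)[0]
```

Because the design matrix is block diagonal, the normal equations decouple and beta reproduces (b1, b2) to machine precision.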

Python session proving the concept


I’ll use Scikit-learn regressors.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

The following functions will be reused.

def print_it(y_pred,y_actual,params):
    """Print RMSE and any model parameters."""
    rmse = np.sqrt(mean_squared_error(y_actual,y_pred))
    print "RMSE: %f, params: %s" % (rmse,', '.join(str(p) for p in params),)

def plot_it(xdat,ydat,xname,yname,filename=None):
    """Scatter plot of x vs y, with 1:1 line drawn through."""
    fig, ax = plt.subplots()
    ax.scatter(xdat,ydat)
    lims = [
      np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
      np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
    ]
    ax.plot(lims, lims, 'k-', alpha=0.75, zorder=0)
    ax.set_xlabel(xname)
    ax.set_ylabel(yname)
    if filename is not None:
        fig.savefig(filename)

def model_it(x_train,y_train,x_eval,y_eval,model):
    """Train and score."""
    model.fit(x_train,y_train)
    y_pred = model.predict(x_eval)
    return y_pred

def pipeline(x_train,y_train,x_eval,y_eval,model,filename=None):
    """Train, score, print and plot."""
    y_pred = model_it(x_train,y_train,x_eval,y_eval,model)
    params = []
    if hasattr(model,'intercept_'):
      params.append(model.intercept_)
    if hasattr(model,'coef_'):
      params.append(model.coef_)
    if hasattr(model,'feature_importances_'):
      params.append(model.feature_importances_)
    print_it(y_pred,y_eval,params)
    plot_it(y_eval,y_pred,'y actual','y predicted',filename=filename)
    return y_pred

Generate mock data

Generate 1000 random data points from two different clusters, or segments.


npts = 1000
x1_train, y1_train = np.random.multivariate_normal(mean1,cov1,npts).T
x2_train, y2_train = np.random.multivariate_normal(mean2,cov2,npts).T
x1_eval, y1_eval = np.random.multivariate_normal(mean1,cov1,npts).T
x2_eval, y2_eval = np.random.multivariate_normal(mean2,cov2,npts).T

This is what it looks like.


A predictor variable x and a response variable y, drawn from two different multivariate normal distributions.


Transform the data to adhere to Scikit-learn’s input convention.

x1_train = x1_train[np.newaxis].T
x2_train = x2_train[np.newaxis].T
x1_eval = x1_eval[np.newaxis].T
x2_eval = x2_eval[np.newaxis].T

Here’s what the head and tail of the segment 1 predictors look like.

array([[ 0.55334004],
       [ 0.58077088],
       [ 0.45761136],
       [ 1.2671348 ],
       [-2.1250927 ],
       [ 0.26437481],
       [ 0.56024118],
       [ 0.01138431]])
array([[ 0.321091  ],
       [ 0.54108937],
       [ 0.88482971],
       [ 0.39905366],
       [ 0.0056274 ]])

Train a model for each segment

Train a linear model on the first segment.

RMSE: 0.207597, params: -5.02259822453, [ 0.87758793]

Linear model result from segment 1.


Train a linear model on the second segment.

RMSE: 0.178626, params: 9.98068042904, [-0.92247714]

Linear model result from segment 2.


Evaluate the union of the results.

plot_it(np.concatenate([y1_eval,y2_eval]),np.concatenate([y1_pred,y2_pred]),'y actual','y predicted')
RMSE: 0.193112, params:

Union of linear model results from segments 1 and 2.
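The per-segment linear fits above can also be sketched without Scikit-learn using numpy's least squares. The segment parameters and noise level below are my own stand-ins (the post's exact means and covariances aren't shown), chosen so each segment is roughly a noisy line:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical segments: y = 0.9x - 5 and y = -0.9x + 10, plus small noise.
x1 = rng.normal(0.0, 1.0, n); y1 = 0.9 * x1 - 5.0 + rng.normal(0.0, 0.2, n)
x2 = rng.normal(0.0, 1.0, n); y2 = -0.9 * x2 + 10.0 + rng.normal(0.0, 0.2, n)

# One ordinary-least-squares line per segment.
m1, b1 = np.polyfit(x1, y1, 1)
m2, b2 = np.polyfit(x2, y2, 1)
print("segment 1: slope %.2f, intercept %.2f" % (m1, b1))
print("segment 2: slope %.2f, intercept %.2f" % (m2, b2))
```

Each fit recovers its own segment's slope and intercept, just like the two separate linear models above.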

Now train two gradient boosting regressors.  GBR on segment 1:

RMSE: 0.223226, params: [ 1.]

GBR on segment 2:

RMSE: 0.201560, params: [ 1.]

And evaluation on the union of the results.

plot_it(np.concatenate([y1_eval,y2_eval]),np.concatenate([y1_pred,y2_pred]),'y actual','y predicted')
RMSE: 0.212393, params:

Union of gradient boosting regression results from segments 1 and 2.


reshape1: add a feature that flags segment 1 membership

def reshape1(x1,x2,y1,y2,n):
    """Add a binary feature that flags segment 1 as 1, or segment 2 as 0. Append the entities together."""
    a = np.zeros((n*2,2))
    a[:n,0], a[:n,1] = x1[:,0], 1
    a[n:,0] = x2[:,0]
    return a, np.concatenate([y1,y2])

Here’s what it looks like.

array([[ 0.55334004,  1.        ],
       [ 0.58077088,  1.        ],
       [ 0.45761136,  1.        ],
       [ 1.2671348 ,  1.        ],
       [-2.1250927 ,  1.        ],
       [-0.92971722,  1.        ],
       [ 0.26437481,  1.        ],
       [-0.21786847,  1.        ],
       [ 0.56024118,  1.        ],
       [ 0.01138431,  1.        ]])
array([[ 1.29078669,  0.        ],
       [ 0.54627936,  0.        ],
       [-0.11236272,  0.        ],
       [ 0.22546577,  0.        ],
       [-0.72594666,  0.        ],
       [-1.16852349,  0.        ],
       [-0.14354644,  0.        ],
       [ 0.29371461,  0.        ],
       [ 0.59094059,  0.        ],
       [-0.7981938 ,  0.        ]])

The gradient boosting regressor handles this fine.

RMSE: 0.205698, params: [ 0.49427717  0.50572283]

Result of a single gradient boosting regression model under the reshape1 feature engineering strategy.


Linear regression does not. It doesn’t matter if you turn hidden y-intercept fitting on or off.

RMSE: 0.926650, params: 9.95478714523, [ -0.04020456 -14.97904361]

Result of a single linear regression model under the reshape1 feature engineering strategy.


reshape2: add two features that positively indicate segment membership

def reshape2(x1,x2,y1,y2,n):
    """For each segment, add a binary feature positively indicating segment membership."""
    a = np.zeros((n*2,3))
    a[:n,0], a[:n,1] = x1[:,0], 1
    a[n:,0], a[n:,2] = x2[:,0], 1
    return a, np.concatenate([y1,y2])

Here’s what it looks like.

array([[ 0.55334004,  1.        ,  0.        ],
       [ 0.58077088,  1.        ,  0.        ],
       [ 0.45761136,  1.        ,  0.        ],
       [ 1.2671348 ,  1.        ,  0.        ],
       [-2.1250927 ,  1.        ,  0.        ],
       [-0.92971722,  1.        ,  0.        ],
       [ 0.26437481,  1.        ,  0.        ],
       [-0.21786847,  1.        ,  0.        ],
       [ 0.56024118,  1.        ,  0.        ],
       [ 0.01138431,  1.        ,  0.        ]])
array([[ 1.29078669,  0.        ,  1.        ],
       [ 0.54627936,  0.        ,  1.        ],
       [-0.11236272,  0.        ,  1.        ],
       [ 0.22546577,  0.        ,  1.        ],
       [-0.72594666,  0.        ,  1.        ],
       [-1.16852349,  0.        ,  1.        ],
       [-0.14354644,  0.        ,  1.        ],
       [ 0.29371461,  0.        ,  1.        ],
       [ 0.59094059,  0.        ,  1.        ],
       [-0.7981938 ,  0.        ,  1.        ]])

The gradient boosting regressor handles this fine.

RMSE: 0.205690, params: [ 0.49456452  0.2443662   0.26106927]

Result of a single gradient boosting regression model under the reshape2 feature engineering strategy.


Linear regression does not. It doesn’t matter if you turn hidden y-intercept fitting on or off.

RMSE: 0.926650, params: 0.0, [-0.04020456 -5.02425646  9.95478715]

Result of a single linear regression model under the reshape2 feature engineering strategy.


reshape3: add two features that flag segments, and place predictors into their own dimensions

def reshape3(x1,x2,y1,y2,n):
    """For each segment, add a binary feature flagging segment membership, and break out the continuous predictors into their own dimensions."""
    a = np.zeros((n*2,4))
    a[:n,0], a[:n,1] = x1[:,0], 1
    a[n:,2], a[n:,3] = x2[:,0], 1
    return a, np.concatenate([y1,y2])

Here’s what it looks like.
array([[ 0.55334004,  1.        ,  0.        ,  0.        ],
       [ 0.58077088,  1.        ,  0.        ,  0.        ],
       [ 0.45761136,  1.        ,  0.        ,  0.        ],
       [ 1.2671348 ,  1.        ,  0.        ,  0.        ],
       [-2.1250927 ,  1.        ,  0.        ,  0.        ],
       [-0.92971722,  1.        ,  0.        ,  0.        ],
       [ 0.26437481,  1.        ,  0.        ,  0.        ],
       [-0.21786847,  1.        ,  0.        ,  0.        ],
       [ 0.56024118,  1.        ,  0.        ,  0.        ],
       [ 0.01138431,  1.        ,  0.        ,  0.        ]])
array([[ 0.        ,  0.        ,  1.29078669,  1.        ],
       [ 0.        ,  0.        ,  0.54627936,  1.        ],
       [ 0.        ,  0.        , -0.11236272,  1.        ],
       [ 0.        ,  0.        ,  0.22546577,  1.        ],
       [ 0.        ,  0.        , -0.72594666,  1.        ],
       [ 0.        ,  0.        , -1.16852349,  1.        ],
       [ 0.        ,  0.        , -0.14354644,  1.        ],
       [ 0.        ,  0.        ,  0.29371461,  1.        ],
       [ 0.        ,  0.        ,  0.59094059,  1.        ],
       [ 0.        ,  0.        , -0.7981938 ,  1.        ]])

The gradient boosting regressor still handles it well.

RMSE: 0.202675, params: [ 0.23502952  0.22039532  0.27983743  0.26473773]

Result of a single gradient boosting regression model under the reshape3 feature engineering strategy.


And now the linear regressor handles it, too. It makes the most sense to turn hidden y-intercept fitting off because the two binary features are effectively achieving the same thing.

RMSE: 0.193112, params: 0.0, [ 0.87758793 -5.02259822 -0.92247714  9.98068043]

Result of a single linear regression model under the reshape3 feature engineering strategy.


Why things came out this way

Decision trees are naturally suited to segmentation, which is one reason they are so powerful.  Linear regression is not, because the model is a simple linear combination of predictors; there is no branching.

What linear models do with a binary feature is effectively fit a constant offset wherever that feature is 1, which amounts to a y-intercept for that segment.  If you keep the continuous predictor in one dimension, as in the reshape1 and reshape2 strategies, you allow the model only one slope parameter.  That is fine if the predictors of the two segments each follow joint distributions with the response that happen to share the same slope, and that slope lies along the line connecting the segment means, but this is a very strong assumption and will typically lead to a poor fit.

When you feed a linear model multiple binary features, you let it fit a separate y-intercept wherever each of them is 1.  To fit multiple slopes, you have to break the continuous predictor out into independent dimensions, which is what the reshape3 strategy does.  This is in essence a feature interaction between the dummy codes and the continuous predictor.  There are at least a couple of ways to accomplish this, and I’ve only illustrated one.
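The difference is easy to demonstrate with plain numpy least squares: a dummy-only design (reshape2-style) forces one shared slope, while interacting each dummy with the predictor (reshape3-style) grants each segment its own slope. Segment parameters here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Two stand-in segments with opposite slopes.
x1 = rng.normal(size=n); y1 = 0.9 * x1 - 5.0 + rng.normal(0.0, 0.2, n)
x2 = rng.normal(size=n); y2 = -0.9 * x2 + 10.0 + rng.normal(0.0, 0.2, n)

x = np.concatenate([x1, x2]); y = np.concatenate([y1, y2])
d1 = np.concatenate([np.ones(n), np.zeros(n)])  # segment 1 dummy
d2 = 1.0 - d1                                   # segment 2 dummy

# Dummy coding alone: shared slope, per-segment intercepts.
A_dummy = np.column_stack([x, d1, d2])
# Dummy-by-predictor interactions: per-segment slopes and intercepts.
A_inter = np.column_stack([x * d1, d1, x * d2, d2])

results = {}
for name, A in [("dummy only", A_dummy), ("interaction", A_inter)]:
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    results[name] = np.sqrt(np.mean((A @ coef - y) ** 2))
    print("%s RMSE: %.3f" % (name, results[name]))
```

The interaction design drives the RMSE down to roughly the noise level, while the dummy-only design cannot.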

I believe there is a widespread misconception among data science practitioners that you can throw any binary feature at any machine learning model.  I suspect it comes from a specific analytic exercise, testing the statistical significance of categorical variables via t-tests under linear models, possibly conflated with the excellent performance of tree-based models under any encoding.  Maybe a statistician can weigh in.  Either way, when you think about it, it’s clear that dummy coding alone does not give a linear model the most accurate possible predictions in a typical predictive analysis application.

Ideas to take away

  1. Linear and gradient boosting regressors perform fine when trained on the two segments individually.
  2. Traditional dummy coding is sufficient for tree-based methods when the goal is maximally accurate out-of-sample predictions.  Multiple coding strategies work equally well.
  3. Traditional dummy coding is not sufficient in linear regression when the goal is maximally accurate out-of-sample predictions using one model.  The solution is dummy coding followed by interactions of the segment membership features and the continuous predictor.
  4. Beware the hidden y-intercept parameter.

Here are some more things to think about.

  • The winning strategy (reshape3) will likely not work as desired with naive regularization.
  • Dimensionality can blow up due to the interactions, which would generally make L2 regularization even more necessary.
  • Interactions with binary features are particularly amenable to sparse vector representations.  There are many zeros.
  • The winning strategy should work well with streaming and mini-batch training, so long as mini-batches are restricted to individual segments.
  • You can fake the same result by training N linear models and appending their fitted coefficients and intercepts, rather than training a single model on the combined input data. You’d still have to do the reshape3 preprocessing step on the vectors you’re scoring, which amounts to feature-dimension renaming in sparse representation, but you would be able to regularize each segment individually, without busting the whole idea.
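The sparsity point in the list above can be made concrete: in a feature-name/value representation, each interacted row carries only its own segment's entries, so the zeros are never materialized. A toy sketch (the feature names are mine):

```python
# reshape3-style rows in a sparse feature-dict representation:
# only nonzero entries are stored, so adding segments adds names, not zeros.
row_seg1 = {"x*seg1": 0.553, "seg1": 1.0}
row_seg2 = {"x*seg2": 1.291, "seg2": 1.0}

def densify(row, names):
    # Expand a sparse row to a dense vector over a fixed feature ordering.
    return [row.get(name, 0.0) for name in names]

names = ["x*seg1", "seg1", "x*seg2", "seg2"]
print(densify(row_seg1, names))  # prints [0.553, 1.0, 0.0, 0.0]
print(densify(row_seg2, names))  # prints [0.0, 0.0, 1.291, 1.0]
```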

Monte Hall Monte Carlo

A friend recently pointed out an article on Marilyn vos Savant of Ask Marilyn fame, The Time Everyone “Corrected” the World’s Smartest Woman.

I love the Monty Hall problem because it is the quintessential example of how most people can be so wrong, and yet the data just does not care.  In this particular instance, it also underscores how men whose purported trade is rational, quantitative thought and behavior can be completely irrational pricks, but more on that later.

Simply stated, you’re a contestant on a gameshow and you’re presented three doors.  Two have a goat behind them, and one has a car.  You do not know which has what, but the host does.  You pick door 1 at random, but then the host, instead of showing you what’s behind door 1, opens door 2 to reveal a goat.  The host allows you to change your answer to door 3 if you please.  Should you?

Poll your friends and family.  I bet you will find they would not feel they have any particular reason to switch, thinking there’s a 50-50 chance the car is behind the door they first picked and the door they have the option to switch to. Let’s Monte Carlo that.

awk -v seed=42 \
 -v switch_doors=0 \
 -v describe_each_game=1 \
 -v num_games=10 '
 BEGIN {
  srand(seed)
  for (i = 1; i <= num_games; i++) {
   door_with_car = int(rand()*3+1)
   door_you_pick_first = 1
   if (door_you_pick_first == door_with_car) {
    door_youre_shown_that_has_goat = int(rand()*2+2)
   } else {
    if (door_with_car == 2) {
     door_youre_shown_that_has_goat = 3
    } else {
     door_youre_shown_that_has_goat = 2
    }
   }
   door_you_can_switch_to = !(door_youre_shown_that_has_goat-2)+2

   if (switch_doors) {
    door_you_go_with_in_the_end = door_you_can_switch_to
   } else {
    door_you_go_with_in_the_end = door_you_pick_first
   }

   if (door_you_go_with_in_the_end == door_with_car) {
    games_won++
   }

   if (describe_each_game) {
    desc = "You first pick " door_you_pick_first
    desc = desc ", you are shown " door_youre_shown_that_has_goat
    desc = desc ", in the end you go with " door_you_go_with_in_the_end
    if (door_you_go_with_in_the_end == door_with_car) {
     desc = desc ". You win the car."
    } else {
     desc = desc ". You do not win the car."
    }
    print desc
   }
  }
  print "You won " games_won " games out of " num_games ", or " games_won/num_games " of the time."
 }'

Here’s the output of my session.

You first pick 1, you are shown 2, in the end you go with 1. You win the car.
You first pick 1, you are shown 2, in the end you go with 1. You win the car.
You first pick 1, you are shown 2, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 3, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 2, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 2, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 3, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 2, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 2, in the end you go with 1. You do not win the car.
You first pick 1, you are shown 2, in the end you go with 1. You win the car.
You won 3 games out of 10, or 0.3 of the time.

The basic result is unchanged when I increase the number of iterations and change the seed. You win about p = 1/3 of the time if you do not switch doors.

When I Monte Carlo the game where you do switch to the non-goat door, you win about 1 - p of the time, which you can realize either just by thinking about it (you get all the opposite outcomes of each iteration above), or by setting switch_doors=1 and redoing the Monte Carlo.
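If awk isn’t your taste, the same experiment is a few lines of Python. This is a sketch of the same game logic, not the post’s code:

```python
import random

def monty(switch, trials=100000, seed=42):
    # Simulate Monty Hall: always pick door 1, the host opens a goat
    # door among doors 2 and 3, then stay or switch.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randint(1, 3)
        shown = next(d for d in (2, 3) if d != car)        # host reveals a goat
        final = ({2, 3} - {shown}).pop() if switch else 1  # stay or switch
        wins += (final == car)
    return wins / trials

print("stay:   %.3f" % monty(switch=False))  # about 1/3
print("switch: %.3f" % monty(switch=True))   # about 2/3
```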

You can work out the answer analytically using Bayes’ Theorem, or by playing the game in your living room with your friends, which is really just a kind of real life Monte Carlo.  Analytic solution aside, I find it extremely difficult to argue against the result of a correctly implemented Monte Carlo.

This is the real lesson I take away from the Monty Hall problem.  I do not have to know Bayes’ Theorem to get at the right answer; I can just do it, and with computers the best way to “do it” is to set up Monte Carlo simulations.  It does not matter if I am a man or a woman, it does not matter whether I am a misogynist prick or went to a fancy school or I’m the loudest talker or the highest paid person in the room.

If you don’t trust the data, you get the goat.

is the unofficial tagline of my blog.

Now here’s a streamlined bash implementation at a million iterations each, with and without the switching strategy. I time it at about 1 s total runtime.

for switch_doors in {0,1}; do
 echo -n $switch_doors ""
 awk -v seed=42 -v num_games=1000000 'BEGIN {srand(seed); for (i=1;i<=num_games;i++) {print int(rand()*3+1)}}' | \
 awk -v switch_doors=$switch_doors '
 $1 == 1 && !switch_doors {win++}
 #$1 == 1 && switch_doors {} # do not win
 #$1 != 1 && !switch_doors {} # do not win
 $1 != 1 && switch_doors {shown=!($1-2)+2;switch_to=!(shown-2)+2;switch_to==$1?win++:0}
 END {print win,NR,win/NR}'
done


0 333878 1000000 0.333878
1 666122 1000000 0.666122