Analytic models discover patterns in past data to predict the future, and model training is often the biggest and most time-consuming component of an analytics project. So what does a lazy data scientist do?

**Instance based learning** to the rescue! Also called lazy learning and case/memory/example based learning, instance based learning doesn’t attempt to build a model *a priori* at all; it only queries the training data on demand, during scoring, for each specific case. Table 1 summarizes the differences between conventional machine learning methods and instance based learning methods.

Usual/Conventional Machine Learning | Instance Based Learning
---|---
Prepare the data for model training | Prepare the data for model training. No difference here
Train a model on the training data to estimate model parameters, i.e. discover patterns | Do not train a model. Pattern discovery is postponed until a scoring query is received
Store the model in a suitable form | There is no model to store
Generalize the rules in the form of a model, even before a scoring instance is seen | No generalization before scoring. Generalize for each scoring instance individually, as and when seen
Predict for an unseen scoring instance using the model | Predict for an unseen scoring instance using the training data directly
Can throw away the input/training data after model training | Input/training data must be kept, since each query uses part or all of the training observations
Requires a known model form | May not have an explicit model form
Storing a model generally requires less storage | Storing the training data generally requires more storage
Scoring a new instance is generally fast | Scoring a new instance may be slow

One of the most common examples of instance based learning is the **k-Nearest Neighbours (k-NN)** algorithm. k-NN works on the assumption that the predicted values of similar observations must be similar. For example, all analytics professionals employed in Bangalore with 5-10 years of experience will be earning roughly the same. If you think about it, this is essentially what any other model is trying to do. And this works great if there are lots of similar observations and similarity can be appropriately defined. In practice, though, we only have access to a limited number of features, and what defines “similarity” isn’t obvious. Before we go into the details of k-NN, let’s dive into an example.

You can follow the code in R if you like; otherwise, jump directly to the plots and discussion below.

```r
# generate training data
set.seed(100)
tds <- data.frame(X1=runif(50)*5+7, X2=runif(50)+2)
tds$Y <- 1.3 * tds$X1 - 0.4 * tds$X2 + runif(10)/3

# build linear regression model
m1 <- lm(Y ~ X1 + X2, data=tds)
summary(m1)$adj.r.squared  # Adjusted R-Square is 0.9966
sum(m1$residuals^2)        # Sum-of-Residual Squared Errors is 0.4411

# plot 1
plot(m1$model$Y, m1$fitted.values, main='Actual-vs-Fitted', xlab='Actual Y', ylab='Fitted Y')
abline(0, 1)

# build simple k-NN model
k <- 3

# compute pair-wise Euclidean distance between all observations
distmat <- as.matrix(dist(tds[, -3]))

# generate prediction for all observations
for (i in 1:nrow(tds)) {
  # we use k+1 here to ignore the 0 distance of an observation to itself
  idx <- distmat[i, order(distmat[i, ])][1:(k + 1)]
  idx <- names(idx[!(as.numeric(names(idx)) == i)])
  tds[i, paste0("Y_", k, "_NN")] <- mean(tds[idx, "Y"])
}

# Sum-of-Residual Squared Errors
sum((tds$Y - tds[, paste0("Y_", k, "_NN")])^2)  # 2.5744 for k=3, 2.0074 for k=5

# plots 2 and 3
plot(tds$Y, tds[, paste0("Y_", k, "_NN")],
     main=paste0('Actual-vs-Fitted by ', k, '-NN'), xlab='Actual Y', ylab='Fitted Y')
abline(0, 1)
```

In this example, we used the training data to score observations in the training data itself. We did so to compute the prediction error and compare the linear regression model against the k-NN model. In practice, however, this step is not necessary for instance based learning.

The first thing to note is that linear regression gave a very good fit, with an almost 100% R² value. This is no surprise, since the data was made up! However, the 3-Nearest Neighbours algorithm also gave a pretty good fit, though not as good as linear regression. This is visible both from the Actual-vs-Fitted plots and from the sum of squared residual errors, which is 0.44 for linear regression and 2.57 for the 3-NN model. You can also observe that increasing k improves model performance here: the sum of squared residual errors reduces from 2.57 (k=3) to 2.00 (k=5). The last point to note is that k-NN does relatively worse at the edges. This is no surprise, since at the edge all neighbours are one-sided! Which k to select is often decided by cross-validation and trial-and-error, but also by scoring run-time requirements, since a higher k increases the prediction time for an unseen instance.
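As a quick sketch of the trial-and-error approach, the leave-one-out error can be computed for several values of k by reusing the distance matrix from the code above (`tds` and `distmat` are the objects created earlier; the set of candidate k values here is just an illustrative choice):

```r
# leave-one-out SSE for several candidate values of k
for (k_try in c(1, 3, 5, 7, 9)) {
  sse <- 0
  for (i in 1:nrow(tds)) {
    # position 1 in the ordering is the observation itself (distance 0), so skip it
    nbrs <- order(distmat[i, ])[2:(k_try + 1)]
    sse <- sse + (tds$Y[i] - mean(tds$Y[nbrs]))^2
  }
  cat("k =", k_try, " SSE =", round(sse, 4), "\n")
}
```

The k with the lowest leave-one-out SSE is a reasonable default, subject to the scoring-time considerations mentioned above.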

One extension of k-NN that takes advantage of both conventional machine learning and instance based learning is locally weighted k-NN regression. Taking the average of the k neighbours is the simplest form of prediction, but we could also build a localized regression model on the k neighbours and use that to predict for the new observation. k will generally be higher in these cases, and so will the scoring time, since we are training a model for each query. In return, this can increase prediction accuracy.

We do just that below, fitting a plain linear regression on the 5 nearest neighbours, and you can see a better fit, even for the edge cases. The sum of squared residual errors also reduces to 0.8467.

```r
# local regression on the 5 nearest neighbours
k <- 5
for (i in 1:nrow(tds)) {
  idx <- distmat[i, order(distmat[i, ])][1:(k + 1)]
  idx <- names(idx[!(as.numeric(names(idx)) == i)])
  m <- lm(Y ~ X1 + X2, data=tds[idx, ])
  tds[i, paste0("Y_", k, "_NN_reg")] <- predict(m, tds[i, ])
}

# Sum-of-Residual Squared Errors reduces to 0.8467
sum((tds$Y - tds[, paste0("Y_", k, "_NN_reg")])^2)
```
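Strictly speaking, “locally weighted” regression also weights the neighbours by distance, so that nearer neighbours influence the local fit more. A sketch of that variant is below, using inverse-distance weights via the `weights` argument of `lm` (the inverse-distance weighting scheme and the small constant to avoid division by zero are illustrative assumptions; kernel weights are another common choice):

```r
# distance-weighted local regression: nearer neighbours get larger weights
for (i in 1:nrow(tds)) {
  # positions 2..k+1 in the ordering are the k nearest neighbours,
  # skipping the observation itself at position 1
  nbrs <- order(distmat[i, ])[2:(k + 1)]
  w <- 1 / (distmat[i, nbrs] + 1e-6)  # inverse-distance weights
  m <- lm(Y ~ X1 + X2, data=tds[nbrs, ], weights=w)
  tds[i, paste0("Y_", k, "_NN_wreg")] <- predict(m, tds[i, ])
}

# Sum-of-Residual Squared Errors for the weighted variant
sum((tds$Y - tds[, paste0("Y_", k, "_NN_wreg")])^2)
```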

Other algorithms in this category include collaborative filtering and case based reasoning. k-NN also has many nuances which we skipped. Perhaps next time!

Till then, try being a lazy data scientist.
