
Analytic models discover patterns in past data to predict the future, and model training is often the biggest and most time-consuming component of an analytics project. So what does a lazy data scientist do?

Instance based learning to the rescue! Also called lazy learning or case/memory/example based learning, instance based learning does not attempt to build a model a-priori at all; it only queries the training data on demand, at scoring time, for each specific case. Table 1 summarises the differences between conventional machine learning methods and instance based learning methods.

Table 1: Conventional machine learning vs. instance based learning

Usual/Conventional Machine Learning | Instance Based Learning
--- | ---
Prepare the data for model training | Prepare the data for model training. No difference here
Train a model on the training data to estimate model parameters, i.e. discover patterns | Do not train a model. Pattern discovery is postponed until a scoring query is received
Store the model in a suitable form | There is no model to store
Generalize the rules in the form of a model, even before a scoring instance is seen | No generalization before scoring. Generalize for each scoring instance individually, as and when it is seen
Predict for an unseen scoring instance using the model | Predict for an unseen scoring instance using the training data directly
Can throw away the input/training data after model training | Input/training data must be kept, since each query uses part or all of the training observations
Requires a known model form | May not have an explicit model form
Storing a model generally requires less storage | Storing the training data generally requires more storage
Scoring a new instance is generally fast | Scoring a new instance may be slow
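To make the contrast concrete, here is a minimal R sketch of the two workflows side by side. The toy data frame train, the new observation newobs and the choice of 3 neighbours are all made up for illustration:

# toy data purely for illustration (names and values are made up)
set.seed(1)
train <- data.frame(x1=rnorm(20), x2=rnorm(20))
train$y <- 2 * train$x1 - train$x2 + rnorm(20)/10
newobs <- data.frame(x1=0.5, x2=-0.2)

# conventional: train once, keep only the fitted model, score from the model
fit <- lm(y ~ x1 + x2, data=train)   # pattern discovery happens up-front
predict(fit, newobs)                 # training data is no longer needed here

# instance based: keep the raw data, discover the pattern only at scoring time
d <- sqrt((train$x1 - newobs$x1)^2 + (train$x2 - newobs$x2)^2)
mean(train$y[order(d)][1:3])         # average y of the 3 nearest stored observations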

k-Nearest Neighbour Algorithm

One of the most common examples of instance based learning is the k-Nearest Neighbour (k-NN) algorithm. k-NN works on the assumption that the predicted values of similar observations must be similar. For example, all analytics professionals employed in Bangalore with 5-10 years of experience will be earning roughly the same. If you think about it, this is essentially what any other model is trying to do. And this works great if there are a lot of similar observations and similarity can be appropriately defined. In practice, though, we only have access to a limited number of features, and what defines “similarity” isn’t obvious. Before we go into the details of k-NN, let’s dive into an example.

You can follow the code in R if you like; otherwise, jump directly to the plots and discussion below.

# generate training data
set.seed(100)
tds <- data.frame(X1=runif(50)*5+7, X2=runif(50)+2)
tds$Y <- 1.3 * tds$X1 - 0.4 * tds$X2 + runif(10)/3

# build linear regression model
m1 <- lm(Y ~ X1 + X2, data=tds)
summary(m1)$adj.r.squared # Adjusted R-Square is 0.9966
sum(m1$residuals^2) # Sum-of-Residual Squared Errors is 0.4411

# plot 1
plot(m1$model$Y, m1$fitted.values, main='Actual-vs-Fitted', xlab='Actual Y', ylab='Fitted Y')
abline(0,1)

# build simple k-NN model
k <- 3

# compute pair-wise Euclidean distance between all observations
distmat <-  as.matrix(dist(tds[, -3]))

# generate prediction for all observations
for (i in 1:nrow(tds)) {
  # take the k+1 nearest, because the nearest "neighbour" at distance 0 is
  # the observation itself, which is dropped on the next line
  idx <- distmat[i, order(distmat[i,])][1:(k+1)]
  idx <- names(idx[!(as.numeric(names(idx)) == i)])
  tds[i, paste0("Y_", k, "_NN")] <- mean(tds[idx, "Y"])
}

# Sum-of-Residual Squared Errors
sum((tds$Y - tds[, paste0("Y_", k, "_NN")])^2) # 2.5744 for k=3, 2.0074 for k=5
# plot 2 and 3
plot(tds$Y, tds[, paste0("Y_", k, "_NN")],
     main=paste0('Actual-vs-Fitted by ', k, '-NN'), xlab='Actual Y', ylab='Fitted Y')
abline(0,1)

Plot 1: Actual vs Fitted by Linear Regression

Plot 2: Actual vs Fitted by 3-NN

Plot 3: Actual vs Fitted by 5-NN

In this example, we used the training data itself to score observations from the training data. We did so to compute the prediction error and compare the linear regression model against the k-NN model. In practice, however, this is not necessary for instance based learning: each new, unseen instance is simply scored directly against the stored training data.
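To see what scoring a genuinely unseen instance looks like, here is a small sketch reusing tds and k from the code above; the new point newx is made up for illustration:

# score one unseen observation with plain k-NN (newx is a made-up point)
newx <- data.frame(X1=9.5, X2=2.4)
d <- sqrt((tds$X1 - newx$X1)^2 + (tds$X2 - newx$X2)^2)
mean(tds$Y[order(d)][1:k])   # prediction = average Y of the k nearest training rows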

The first thing to note is that linear regression gave a very good fit, with an adjusted R-square of almost 100%. This is no surprise, since the data was made up! However, the 3-Nearest Neighbour algorithm also gave a pretty good fit, though not as good as linear regression. This is visible both from the Actual-vs-Fitted plots and from the sum of squared residual errors, which is 0.44 for linear regression and 2.57 for the 3-NN model. You can also observe that increasing k improves model performance here: the sum of squared residual errors reduces from 2.57 to 2.00 when k goes from 3 to 5. The last point to note is that k-NN does relatively worse at the edges. This is no surprise, since at the edge all neighbours are one-sided! Which k to select is often decided by cross-validation and trial-and-error, but also by scoring run-time requirements, since a higher k increases the prediction time for an unseen instance.
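One way to make that cross-validation concrete is a small leave-one-out sweep over a few candidate values of k, reusing tds and distmat from above; the candidate grid below is an arbitrary choice:

# leave-one-out sum of squared errors for a few candidate k values
for (kk in c(1, 3, 5, 7, 9)) {
  sse <- 0
  for (i in 1:nrow(tds)) {
    nn <- order(distmat[i, ])     # neighbours sorted by distance
    nn <- nn[nn != i][1:kk]       # drop the observation itself, keep the kk nearest
    sse <- sse + (tds$Y[i] - mean(tds$Y[nn]))^2
  }
  cat("k =", kk, " SSE =", round(sse, 4), "\n")
}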

Other Instance based learning algorithms

One extension of k-NN that takes advantage of conventional machine learning as well as instance based learning is locally weighted k-NN regression. Taking the average of the k neighbours is the simplest form of prediction, but we could also build a localized regression model on the k neighbours and use that to predict for the new observation. k will generally be higher in these cases, and so will the scoring time, since we are training a small model for every query. This generally improves prediction accuracy.

We did just that below, and you can see a better fit, even for the edge cases. The sum of squared residual errors also reduces to 0.8467.

# locally weighted 5-NN regression (set k to 5, matching the 5-NN results quoted above)
k <- 5
for (i in 1:nrow(tds)) {
  # take k+1 nearest, then drop the observation itself (distance 0)
  idx <- distmat[i, order(distmat[i,])][1:(k+1)]
  idx <- names(idx[!(as.numeric(names(idx)) == i)])
  # fit a small linear model on the neighbours and predict for this observation
  m <- lm(Y ~ X1 + X2, data=tds[idx, ])
  tds[i, paste0("Y_", k, "_NN_reg")] <- predict(m, tds[i, ])
}
# Sum-of-Residual Squared Errors
sum((tds$Y - tds[, paste0("Y_", k, "_NN_reg")])^2) # 0.8467

Actual vs Fitted by 5-NN Regression
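The code above fits an ordinary (unweighted) local regression on the neighbours. A closer match to the "weighted" part of locally weighted regression is to give nearer neighbours a larger say in the fit. Below is a sketch of that variant, reusing distmat, tds and k from above; the inverse-distance weighting scheme is just one possible choice:

# locally weighted variant: nearer neighbours get larger weight in the local fit
for (i in 1:nrow(tds)) {
  nn <- order(distmat[i, ])
  nn <- nn[nn != i][1:k]             # k nearest neighbours, excluding the observation itself
  w <- 1 / (distmat[i, nn] + 1e-6)   # inverse-distance weights (one of many possible schemes)
  m <- lm(Y ~ X1 + X2, data=tds[nn, ], weights=w)
  tds[i, paste0("Y_", k, "_NN_wreg")] <- predict(m, tds[i, ])
}
sum((tds$Y - tds[, paste0("Y_", k, "_NN_wreg")])^2)   # compare with the unweighted version above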
Other algorithms in this category include collaborative filtering and case based reasoning. k-NN also has many nuances which we skipped. Perhaps next time!

Till then, try being a lazy data scientist.
