support@edupristine.com1800 200 5835Request a Call
Call Me

How to be a lazy data scientist

November 23, 2015
, , , , ,

Analytic models discover patterns in past data to predict future, and model training is often the biggest and most time-consuming component of analytic project. So what does a lazy data scientist do?

data scientist

Instance based learning to the rescue! Also called, lazy learning and case/memory/example based learning, instance based learning doesn’t attempt to build the models a-priori at all, but only queries the training data on demand during scoring for each specific case. Table 1 gives differences between usual machine learning methods and statistical learning methods.

Table 1
Usual/Conventional Machine Learning Instance Based Learning
Prepare the data for model training Prepare the data for model training. No difference here
Train model from training data to estimate model parameters i.e. discover patterns Do not train model. Pattern discovery postponed until scoring query received
Store the model in suitable form There is no model to store
Generalize the rules in form of model, even before scoring instance is seen No generalization before scoring. Only generalize for each scoring instance individually as and when seen
Predict for unseen scoring instance using model Predict for unseen scoring instance using training data directly
Can throw away input/training data after model training Input/training data must be kept since each query uses part or full set of training observations
Requires a known model form May not have explicit model form
Storing models generally requires less storage Storing training data generally requires more storage
Scoring for new instance is generally fast Storing for new instance may be slow

k-Nearest Neighbour Algorithm

One of the most common examples of Instance based learning is . k-NN algorithm works on assumption that predicted value of similar observations must be similar. For example, all analytic professionals employed in Bangalore with 5-10 years of experience will be earning roughly same. If you think about it, this is essentially what any other model is trying to do. And this will work great is there are lot of similar observations and similarity can be appropriately defined. In practice though, we only have access to limited number of features and what defines “similarity” isn’t obvious. Before we go into details of k-NN, let’s dive into an example.

You can follow code in R if you like; otherwise directly jump to plots and discussions below.

# generate training data
tds <- data.frame(X1=runif(50)*5+7, X2=runif(50)+2)
tds$Y <- 1.3 * tds$X1 - 0.4 * tds$X2 + runif(10)/3

# build linear regression model
m1 <- lm(Y ~ X1 + X2, data=tds)
summary(m1)$adj.r.squared # Adjusted R-Square is 0.9966
sum(m1$residuals^2) # Sum-of-Residual Squared Errors is 0.4411

# plot 1
plot(m1$model$Y, m1$fitted.values, main='Actual-vs-Fitted', xlab='Actual Y', ylab='Fitted Y')
# build simple k-NN model
k <- 3

# compute pair-wise Euclidian distance between all observations
distmat <-  as.matrix(dist(tds[, -3]))

# generate prediction for all observations
for (i in 1:nrow(tds)) {
# we use k+1 here to ignore 0 distance of observation to itself
idx <- distmat[i, order(distmat[i,])][1:(k+1)]
idx <- names(idx[!(as.numeric(names(idx)) == i)])
tds[i, paste0("Y_", k, "_NN")] <- mean(tds[idx, "Y"])

# Sum-of-Residual Squared Errors
sum((tds$Y - tds[, paste0("Y_", k, "_NN")])^2) # 2.5744 for k=3, 2.0074 for k=5
# plot 2 and 3
plot(tds$Y, tds[, paste0("Y_", k, "_NN")],
main=paste0('Actual-vs-Fitted by ', k, '-NN'), xlab='Actual Y', ylab='Fitted Y')
Actual vs Fitted by Linear Regression
Actual vs Fitted by 3-NN
Actual vs Fitted by 5-NN

In this example, we used training data to score observations in training data. We did so to compute prediction error and compare linear regression model against k-NN model. However, in practice, this is not necessary for Instance based learning.

First thing to note that linear regression gave very good fit with almost 100% R2 value. This is no surprise since data was made-up! However, 3-Nearest Neighbour algorithm also gave pretty good fit, though not as much as linear regression. This is both visible from Actual-vs-Fitted plots and also from sum of squared residual errors, which is 0.44 for linear regression and 2.57 for 3-NN model. You can also observe that increasing k increases model performance and sum of squared residual errors reduces from 2.57 to 2.00. Last point to note is that k-NN does relatively worse at the edges. This is no surprise since at the edge all neighbours are one-sided! What k to select is often based on cross-validation and trial-and-errors, but also on scoring run-time requirements as higher k increases prediction time for unseen instance.

Other Instance based learning algorithms

One extension of k-NN which takes advantage of usual machine learning as well instance based learning is locally weighted k-NN regression. Taking average of k neighbours is simplest form of prediction, but we could also build a localized regression model on k neighbours and use that to predict for new observation. k will generally be higher for these cases and so will the scoring time since we are training a model each time. This will increase prediction accuracy.

We did just that below, and you can see better fit, even for edge cases. Sum of squared residual errors also reduced to 0.8467.

# locally weighted 5-NN regression
for (i in 1:nrow(tds)) {
idx <- distmat[i, order(distmat[i,])][1:(k+1)]
idx <- names(idx[!(as.numeric(names(idx)) == i)])
m <- lm(Y ~ X1 + X2, tds[idx, ])
tds[i, paste0("Y_", k, "_NN_reg")] <- predict(m, tds[i, ])
Actual vs Fitted by 5-NN Regression

Other algorithms in this category include collaborative filtering and case based reasoning.k-NN also has many nuances which we skipped. Perhaps next time!

Till then, try being lazy data scientist.

Other Articles by the same author:

Curse Dimensionality

Semi-Supervised Clustering

Other Related Links that you may like:

Overview of Text Mining

Role of Business Analyst


About the Author

Ashish Gupa has over 8 years' experience in analytics, machine learning, and business consulting, spanning banking and retail domains, and Asia and Americas geographies

Popular Courses

Global Association of Risk Professionals, Inc. (GARP®) does not endorse, promote, review or warrant the accuracy of the products or services offered by EduPristine for FRM® related information, nor does it endorse any pass rates claimed by the provider. Further, GARP® is not responsible for any fees or costs paid by the user to EduPristine nor is GARP® responsible for any fees or costs of any person or entity providing any services to EduPristine Study Program. FRM®, GARP® and Global Association of Risk Professionals®, are trademarks owned by the Global Association of Risk Professionals, Inc

CFA® Institute does not endorse, promote, or warrant the accuracy or quality of the products or services offered by EduPristine. CFA® Institute, CFA® Program, CFA® Institute Investment Foundations and Chartered Financial Analyst® are trademarks owned by CFA® Institute.

Utmost care has been taken to ensure that there is no copyright violation or infringement in any of our content. Still, in case you feel that there is any copyright violation of any kind please send a mail to and we will rectify it.

Post ID = 84805