You’ve heard enough about Big Data. It’s a fad that is rising and fading simultaneously, depending on whom you talk to! While the need for Big Data tools and skills continues to rise, the buzz around Big Data casts it as either a paradigm shift or over-hype. So it may come as a surprise to you that not all data is Big Data. In fields such as medicine, sociology, psychology, and geology, small samples are not rare occurrences but the norm. Most experiments involving primary research with real people will have small data due to the sheer cost of conducting in-person interviews. Sometimes the population from which the sample is drawn may itself be small to begin with: say, the number of countries in the world, or the number of exoplanets discovered in the Universe.
In times such as these, the nuances of statistics come in handy. While an expert data scientist is generally familiar with the algorithms behind modeling techniques, he or she is often unfamiliar with, or gives little importance to, the assumptions behind how a technique works. Some people believe data science to be statistics sans checking of assumptions – and that is true for the most part: with a repertoire of tools and algorithms, performance validation on test data, and an ensemble approach to counter the high-variance problem, most models do a good job. As they say, the proof of the pudding is in the eating; as long as a model makes accurate predictions on unseen data, it does its job. Except when it doesn’t.
Problems of small data
So what are the problems if you have small data? Surely you can use the same methods and models? Well, small data exacerbates certain issues, like…
- Outliers – Outlier handling is important for many models, but outliers can be lived with if their proportion is small. That is not the case with small data, since even a few outliers form a large proportion and can significantly alter the model.
- Train and Test Data – A good design choice in model training is to split the data into the part on which the model is trained (“train data”) and an unseen part on which generalized performance is reported (“test data” or “holdout data”). If held-out data is also used for tuning model parameters (sometimes called “cross-validation data”), you may need to split all observations into three sets. With small data, one doesn’t have the luxury of keeping out many samples, and even when one does, the number of observations in the test data may be too few to give a meaningful performance estimate, and/or the number of observations in the cross-validation data may be too few to guide the parameter search well.
- Overfitting – If your training dataset itself is small, overfitting is more likely to occur. And using cross-validation data to keep overfitting in check carries the risk mentioned above: too few observations to guide the choice.
- Measurement Errors – Each metric, whether a predictor or the target, is measured in the real world and has an associated measurement error. With few observations, such errors do not average out, so their effects become important and affect the model adversely.
- Missing Values – Missing values in data have an effect in the same direction as measurement errors, but perhaps larger in magnitude. A limited number of observations means that imputing missing values can be difficult. Further, if the target has missing values, the whole observation may have to be dropped, which is costly when observations are already scarce.
- Sampling Bias – Problems with small data can be worse if the data is biased and not sampled randomly from the population. This is often a problem in sociology research, if not controlled for in the design, where test subjects are often people in the same circle or environment as the researcher: say, undergraduate students of the college at which the researcher works.
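To put a rough number on the train/test point above, here is a minimal sketch of how uncertain an accuracy figure is when the test set is small. It uses the Wilson score interval for a proportion; the function name `wilson_interval`, the 80% accuracy figure, and the test-set sizes are all made up for illustration and are not from any particular study.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion such as test accuracy."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same observed accuracy (80%) on test sets of different, hypothetical sizes.
for n in (20, 200, 2000):
    correct = round(0.8 * n)
    lo, hi = wilson_interval(correct, n)
    print(f"test size={n:5d}  interval=({lo:.2f}, {hi:.2f})  width={hi - lo:.2f}")
```

The interval narrows as the test set grows, which is one way to see why an accuracy reported on a couple of dozen held-out observations should be read with caution.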
How to handle them
- Data review – Since abnormal data values hurt predictive capacity more for small data, spend time reviewing, cleaning, and managing your data. This means detecting outliers, imputing missing values or deciding how to use them, and understanding the impact of measurement errors.
- Simpler models – The fewer the degrees of freedom relative to the number of training observations, the more robust the parameter estimates. Prefer simpler models when possible and limit the number of parameters to be estimated. This means going for Logistic Regression rather than a Neural Network, or k-Nearest Neighbours rather than Regression Splines. Use simplifying assumptions, such as those that favour Linear Discriminant Analysis over Quadratic Discriminant Analysis.
- Domain expertise – Use prior experience and domain expertise to decide on the model form. Small data doesn’t offer the luxury of testing different model forms, and hence expert opinion counts for more. Use domain knowledge to design features effectively and to do feature selection. We cannot afford to throw all possible features into the mix and let the model figure out the right set.
- Consortium approach – Build and grow your data over time and across sectors. Even small data adds up over time. Using loosely related data to increase the number of observations, and then accounting mathematically for the differences, can still produce better-performing models. For example, use Panel Regression instead of separate Linear Regressions for different groups within the data.
- Ensemble approach – Build multiple simple models rather than one best model, and combine them using bagging or stacking. Ensemble models tend to reduce overfitting without making any individual model more complex.
- No cross-validation data – This extends the idea of simpler models. Don’t over-use cross-validation data for hyper-parameter optimization. If the number of observations is really small, do not set aside cross-validation data during model training at all.
- Regularization – Regularization is a way to produce more robust parameter estimates and is very useful in the small-data space. While regularization does add one more parameter (the penalty strength) to the modeling process, the increase is often worthwhile. Lasso or L1 regularization produces fewer non-zero parameters and so indirectly does feature selection. Ridge or L2 regularization produces smaller (in absolute value), more conservative coefficient estimates.
- Confidence intervals – Try to predict with a margin of error rather than point estimates. Models on small data will have wide confidence intervals, but it is better to be aware of the range when making decisions based on predictions than not to know it.
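As a rough illustration of the last two points, the sketch below fits a one-predictor line with a Ridge (L2) penalty on the slope and then uses a percentile bootstrap to attach a range to a prediction. Everything here is an assumption made for the example: the helper names `fit_ridge_line` and `bootstrap_interval`, the penalty value, and the twelve toy observations are invented, and in practice one would more likely reach for an established library than hand-rolled code.

```python
import random
import statistics

def fit_ridge_line(xs, ys, penalty=0.0):
    """Fit y = a + b*x with an L2 (ridge) penalty on the slope b.

    With centred data the penalised slope is Sxy / (Sxx + penalty);
    penalty=0 gives ordinary least squares.
    """
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Guard against a degenerate resample in which every x is identical.
    slope = sxy / (sxx + penalty) if (sxx + penalty) > 0 else 0.0
    intercept = y_bar - slope * x_bar
    return intercept, slope

def bootstrap_interval(xs, ys, x_new, penalty=0.0, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the fitted value at x_new."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        # Resample observations (x, y pairs) with replacement and refit.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_ridge_line([xs[i] for i in idx], [ys[i] for i in idx], penalty)
        preds.append(a + b * x_new)
    preds.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return preds[lo_idx], preds[hi_idx]

# Hypothetical toy data: 12 observations, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2, 21.9, 24.1]

a, b = fit_ridge_line(xs, ys, penalty=1.0)
low, high = bootstrap_interval(xs, ys, x_new=13, penalty=1.0)
print(f"point prediction at x=13: {a + b * 13:.1f}")
print(f"approx. 95% bootstrap interval: ({low:.1f}, {high:.1f})")
```

Note that this interval reflects uncertainty in the fitted line only, not the noise in a single new observation, so a range meant for decision-making would generally need to be wider.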
Good experiment design is important if one expects the sample size to be small. Data scientists should get involved from the data-gathering step itself to ensure that the data is not biased, has no missing values, and is representative of the population. This is surely better than trying to make do with small and unclean data later on.
So how small is small data? There is probably no clear boundary, but you will know it when you see it. I would hazard a guess that fewer than a couple of thousand observations is small in most cases, although the number depends on the dimensions of the data and the problem being attempted.