By using Histograms for estimating Probability Density Function , we realized that we need to optimize on number of bins (or bin-width, as both are inversely related), and starting position of bins. Despite that, histograms suffer from discontinuity in estimation and require too many observations as number of dimensions grow.
Solution to these two problems lays in continuous density estimation methods. Recall that an observation contributes to density observation within its bin in histogram model, and not at all outside, even tiny bit outside. This can be made continuous by letting each observation contribute to density value in inverse proportion of distance between the observation and the point where density is estimated. A generalized form can be modeled as
where N is total number of observations, h is a parameter equivalent to bin-width,and K is kernel function representing impact of an observation x_i on density value at x. This can represent histogram if K is defined as
Various Kernel functions exist. However, to be meaningful, they all satisfy following requirements
First equation requires that function be symmetric around zero. Second equation requires that area under the curve of function sums to unity. Third equation requires that function is non-negative at all places. Further, Kernel functions are typically uni-modal functions (have single peak) and have values which drop when moving away from zero. Kernel functions are essentially symmetric probability distribution functions.
Following are some of the common Kernels.
More such examples are at Wikipedia page on Kernels.
Density Estimation using Kernels
Density Estimation using Kernels requires two parameter inputs: First, the shape of the Kernel function, from among many options; Second: bandwidth parameter,h.
Lower bandwidth means granular density representation, which is generally better, unless we overfit. Higher bandwidth means smoother density representation. Choice of bandwidth matters. Figure 1 and Figure 2 show different estimates by varying bandwidth.
Figure 2 – Impact of Bandwidth on Kernel Density Estimation (Source)
Shape of Kernel function isn’t so much important though! Figure 3a shows estimates from Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine overlaid on top of each other, for same bandwidth. You can notice that they are practically on top of each other. Same story holds even on sparse data (Figure 3b). Though impact of Kernel function is more visible here, it isn’t significant enough to warrant much consideration to method of selecting right Kernel function.
Figure 3a – Effect of Kernel Function on Density Estimation
Figure 3b – Effect of Kernel Function on Density Estimation
How to select right bandwidth?
There are different methods for selecting right bandwidth value.
Business judgement and visual inspection are handy if data in 1- or 2-dimensional only, and analyst can provide sufficient attention to each problem. Over years, some rules of thumb have also emerged. Scott’s rule of thumb provides optimal bandwidth
is sample standard deviation of training observations and N is total number of training data points. This assumes underlying true distribution to be multi-variate Normal and use of Gaussian Kernel. Since choice of Kernel doesn’t matter much, this can be used in other cases too. Silverman’s rule of thumb relaxes assumption of multi-variate Normal distribution and computes optimal bandwidth
and IQR is inter-quartile-range i.e. difference between 75%ile value and 25%ile value.
Rules of thumb go only so far, and if precise estimation is important, cross-validation method come handy, but are more complex. Even more complex variant involves varying bandwidth as function of X Both of these methods are beyond scope of this blog post but can be read in leisure in this document.
Kernel Density Eestimation (KDE) estimates density as sum of contribution of each training observation. However, that density function (model) must be exported and stored in suitable format. Outcome of KDE is collection of (x,y) pairs where y=f ̂(x) is density estimate at x. While more granular estimates are better, too many points make run-time scoring time-consuming.
Since KDE provides continuous and non-abrupt estimates, estimated values aren’t zero for bin where no observation falls, nor are they zero on left and right of data. Instead they asymptotically go to zero on both side of data range.
So play around with various parameters of KDE in R/Python (code below) and explore the methods.
Parting words: we have so far talked about density estimation in single dimensional data, but similar methods apply for multi-dimensional data too.
KDE in R
k <- density(data, bw=1, adjust=1, weights=NULL, kernel="gaussian", n=1024, cut=5, na.rm=T)
plot(k, xlim=c(0, max(data)))
k$bw # bandwidth
data.frame(x=K$x, y=k$y) # result
# other packages sm.density, KernSmooth.bkde
KDE in Python
from scipy.stats import gaussian_kde
k = gaussian_kde(d)
k.factor # bandwidth
k.evaluate(xgrid) # result
# area under curve using trapezoid rule
sum(k.evaluate(xgrid))/xgrid.size * (xgrid.max() - xgrid.min())
# alternate package
from sklearn.neighbors import KernelDensity
k = KernelDensity(kernel='gaussian', bandwidth=2)