
When using histograms for estimating a probability density function, we realized that we need to optimize the number of bins (or the bin-width, since the two are inversely related) and the starting position of the bins. Even then, histograms suffer from discontinuity in the estimate and require far too many observations as the number of dimensions grows.

The solution to both problems lies in continuous density estimation methods. Recall that in the histogram model an observation contributes to the density estimate only within its own bin, and not at all outside it, even a tiny bit outside. This can be made continuous by letting each observation contribute to the density value in inverse proportion to the distance between the observation and the point where the density is estimated. A generalized form can be modeled as

f̂(x) = (1 / (N·h)) · Σ_{i=1..N} K((x − x_i) / h)

where N is the total number of observations, h is a parameter equivalent to the bin-width, and K is the kernel function representing the impact of an observation x_i on the density value at x. This can represent a histogram if K is defined as

K(u) = 1/2 for |u| ≤ 1, and 0 otherwise
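As a concrete illustration (my own sketch, not code from the original post), this generalized form can be written directly in Python; the function names and toy data below are made up for the example, with the histogram-like boxcar kernel and a Gaussian kernel plugged in as K.

import numpy as np

def boxcar_kernel(u):
    # histogram-like kernel: constant weight inside |u| <= 1, zero outside
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def gaussian_kernel(u):
    # smooth kernel: weight decays with distance from the estimation point
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, observations, h, kernel):
    # f_hat(x) = (1 / (N*h)) * sum_i K((x - x_i) / h)
    u = (x[:, None] - observations[None, :]) / h
    return kernel(u).sum(axis=1) / (len(observations) * h)

obs = np.random.default_rng(0).normal(size=100)   # toy data
grid = np.linspace(-4, 4, 200)
f_box = kde(grid, obs, h=0.5, kernel=boxcar_kernel)     # discontinuous, histogram-like
f_gauss = kde(grid, obs, h=0.5, kernel=gaussian_kernel) # smooth estimate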
Various kernel functions exist. However, to be meaningful, they all satisfy the following requirements:

K(−u) = K(u) for all u
∫ K(u) du = 1
K(u) ≥ 0 for all u

The first equation requires that the function be symmetric around zero. The second requires that the area under the curve of the function sums to unity. The third requires that the function is non-negative everywhere. Further, kernel functions are typically uni-modal (have a single peak) and their values drop when moving away from zero. Kernel functions are essentially symmetric probability density functions.
Following are some of the common kernels:

Gaussian: K(u) = (1/√(2π)) · exp(−u²/2)
Epanechnikov: K(u) = (3/4)(1 − u²) for |u| ≤ 1, else 0
Uniform (rectangular): K(u) = 1/2 for |u| ≤ 1, else 0
Triangular: K(u) = 1 − |u| for |u| ≤ 1, else 0

More such examples are on the Wikipedia page on kernels.
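As a quick illustrative check (my own snippet, not from the original post), these common kernels can be written as plain functions and the three requirements verified numerically on a grid:

import numpy as np

# a few common kernels as plain functions of u
kernels = {
    "gaussian":     lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi),
    "epanechnikov": lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0),
    "uniform":      lambda u: np.where(np.abs(u) <= 1, 0.5, 0.0),
    "triangular":   lambda u: np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0),
}

u = np.linspace(-5, 5, 2001)
du = u[1] - u[0]
for name, K in kernels.items():
    symmetric = np.allclose(K(u), K(-u))    # K(-u) = K(u)
    area = K(u).sum() * du                  # Riemann-sum approximation of the integral, ~1
    non_negative = bool(np.all(K(u) >= 0))  # K(u) >= 0 everywhere
    print(name, symmetric, round(area, 3), non_negative)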

Density Estimation using Kernels

Density estimation using kernels requires two parameter inputs: first, the shape of the kernel function, from among many options; second, the bandwidth parameter, h.

A lower bandwidth gives a more granular density representation, which is generally better unless we overfit; a higher bandwidth gives a smoother density representation. The choice of bandwidth matters. Figure 1 and Figure 2 show how the estimate changes as the bandwidth is varied.

Figure 1 – Kernel Density Estimation using varying Bandwidth

Figure 2 – Impact of Bandwidth on Kernel Density Estimation (Source)
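If you want to reproduce this kind of comparison yourself (a sketch of mine, not the code behind the figures), scipy's gaussian_kde lets you pass the bandwidth factor directly; the toy data below is made up for the example.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
d = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 1.0, 300)])  # toy bimodal data
xgrid = np.linspace(d.min() - 1, d.max() + 1, 500)

# same data, three bandwidth factors: small = wiggly estimate, large = over-smoothed
for bw in (0.05, 0.2, 1.0):
    k = gaussian_kde(d, bw_method=bw)   # scalar bw_method is used as the bandwidth factor
    density = k.evaluate(xgrid)
    print(f"bw_method={bw}: peak density {density.max():.3f}")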

The shape of the kernel function, though, is not nearly as important! Figure 3a shows estimates from the Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine kernels overlaid on top of each other, for the same bandwidth. You can see that they lie practically on top of each other. The same story holds even on sparse data (Figure 3b). Though the impact of the kernel function is more visible there, it is not significant enough to warrant much effort in selecting the right kernel function.

Figure 3a – Effect of Kernel Function on Density Estimation

Figure 3b – Effect of Kernel Function on Density Estimation
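To see this for yourself (an illustrative sketch of mine, not the code behind Figure 3), scikit-learn's KernelDensity supports several kernel shapes that can be swapped while holding the bandwidth fixed:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
d = rng.normal(size=50)[:, None]          # sparse 1-D sample, as a column vector
xgrid = np.linspace(-4, 4, 200)[:, None]

# same bandwidth, different kernel shapes: the estimates stay very close
for kernel in ("gaussian", "epanechnikov", "tophat", "linear", "cosine"):
    kde = KernelDensity(kernel=kernel, bandwidth=0.5).fit(d)
    density = np.exp(kde.score_samples(xgrid))   # score_samples returns log-density
    print(f"{kernel:>12}: mean density {density.mean():.3f}")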

How to select the right bandwidth?

There are different methods for selecting the right bandwidth value.

Business judgement and visual inspection are handy if the data is only 1- or 2-dimensional and the analyst can give sufficient attention to each problem. Over the years, some rules of thumb have also emerged. Scott's rule of thumb gives the optimal bandwidth as

h = 1.06 · σ̂ · N^(−1/5)

where σ̂ is the sample standard deviation of the training observations and N is the total number of training data points. This assumes the underlying true distribution to be multi-variate Normal and the use of a Gaussian kernel. Since the choice of kernel does not matter much, it can be used in other cases too. Silverman's rule of thumb relaxes the assumption of a multi-variate Normal distribution and computes the optimal bandwidth as

h = 0.9 · min(σ̂, IQR/1.34) · N^(−1/5)

where σ̂ and N are as above, and IQR is the inter-quartile range, i.e. the difference between the 75th-percentile and 25th-percentile values.
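As a small illustration (my own snippet; the constants follow the formulas above), both rules of thumb are easy to compute directly from the data:

import numpy as np

def scott_bandwidth(x):
    # h = 1.06 * sigma_hat * N^(-1/5)
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-1 / 5)

def silverman_bandwidth(x):
    # h = 0.9 * min(sigma_hat, IQR / 1.34) * N^(-1/5)
    sigma = np.std(x, ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sigma, iqr / 1.34) * len(x) ** (-1 / 5)

d = np.random.default_rng(1).normal(size=500)    # toy data
print(scott_bandwidth(d), silverman_bandwidth(d))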

Rules of thumb only go so far, and if precise estimation is important, cross-validation methods come in handy, though they are more complex. An even more complex variant involves varying the bandwidth as a function of x. Both of these methods are beyond the scope of this blog post but can be read about at leisure in this document.
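As a rough sketch of what cross-validated bandwidth selection can look like (my own example, using scikit-learn rather than the methods in the linked document), a grid search over the bandwidth that maximizes held-out log-likelihood works like this:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

d = np.random.default_rng(2).normal(size=300)[:, None]   # toy data, column vector

# try a range of bandwidths; the score is the held-out log-likelihood of each fold
grid = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": np.linspace(0.05, 1.0, 20)},
    cv=5,
)
grid.fit(d)
print("best bandwidth:", grid.best_params_["bandwidth"])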

Conclusion

Kernel Density Estimation (KDE) estimates density as the sum of contributions of each training observation. However, that density function (model) must be exported and stored in a suitable format. The outcome of KDE is a collection of (x, y) pairs where y = f̂(x) is the density estimate at x. While more granular estimates are better, too many points make run-time scoring time-consuming.
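One way to handle this export-and-score step (my own suggestion, not prescribed in the post) is to evaluate the density on a fixed grid, store the (x, y) pairs, and interpolate between them at scoring time; the file name and grid size below are arbitrary choices.

import numpy as np
from scipy.stats import gaussian_kde

d = np.random.default_rng(3).normal(size=1000)   # toy training data
k = gaussian_kde(d)

# export: a fixed grid of (x, y) pairs is compact and cheap to store/reload
xgrid = np.linspace(d.min(), d.max(), 512)
table = np.column_stack([xgrid, k.evaluate(xgrid)])
np.savetxt("kde_model.csv", table, delimiter=",", header="x,y")

# score: linear interpolation between stored points instead of re-running KDE
x_new = np.array([-1.0, 0.0, 2.5])
density_new = np.interp(x_new, table[:, 0], table[:, 1])
print(density_new)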

Since KDE provides continuous, non-abrupt estimates, the estimated values are not zero for a bin in which no observation falls, nor do they drop to zero immediately to the left and right of the data. Instead, they go to zero asymptotically on both sides of the data range.

So play around with various parameters of KDE in R/Python (code below) and explore the methods.

Parting words: we have so far talked about density estimation for single-dimensional data, but similar methods apply to multi-dimensional data too.
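For instance (a minimal sketch of mine, just to show the multi-dimensional case), scipy's gaussian_kde accepts a (#dimensions x #points) array directly:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(4)
xy = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=500).T  # shape (2, 500)

k2d = gaussian_kde(xy)                 # 2-D kernel density estimate
print(k2d.evaluate([[0.0], [0.0]]))    # density at the point (0, 0)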

KDE in R

data <- rnorm(500)   # example data (replace with your own)
k <- density(data, bw=1, adjust=1, weights=NULL, kernel="gaussian", n=1024, cut=5, na.rm=TRUE)
plot(k, xlim=c(0, max(data)))
k$bw                        # bandwidth used
data.frame(x=k$x, y=k$y)    # result: (x, y) pairs of the density estimate
# other packages: sm::sm.density, KernSmooth::bkde

KDE in Python

import numpy as np
from scipy.stats import gaussian_kde

d = np.random.normal(size=500)             # example data (replace with your own)
k = gaussian_kde(d)
k.factor                                   # bandwidth factor (Scott's rule by default)
xgrid = np.linspace(d.min(), d.max(), 512)
k.evaluate(xgrid)                          # result: density estimate on the grid
# approximate area under the curve via a Riemann sum (should be close to 1)
sum(k.evaluate(xgrid)) / xgrid.size * (xgrid.max() - xgrid.min())

# alternate package
from sklearn.neighbors import KernelDensity
k = KernelDensity(kernel='gaussian', bandwidth=2)
k.fit(d[:, None])                          # sklearn expects a 2-D (n_samples, n_features) array
np.exp(k.score_samples(xgrid[:, None]))    # score_samples returns log-density
