Share with your network!

Probability Density Functions (PDFs) describe probability of observing some continuous random variable in some region of space. For one dimensional random variable X, recall that PDF f(x) follows the properties that

Probability that variable takes values between

Probability that variable takes values exactly equal to

Estimating such PDF from sample of observations is common problem in Machine Learning. This comes handy in many outlier detection algorithms where we seek to estimate “true” distribution based on sample observations and then classify some of existing or new observations as outlier or not. For instance, an auto-insurer interested in catching fraud might examine claim amount request for each type of body-work, say, bumper replacement, and mark for potential fraud any amount which is too high. By way of another example, a child psychologist may examine time taken to complete a given task across different children and mark those children who take too long or too short a time for potential investigation.

In this blog post, we discuss how can we learn the PDF from sample of observations, so that we can calculate probability for each observation and decide if it is common or rare occurrence.

Density Estimation using Histogram

First we generate some random data for demonstration.

data <- c(rnorm(200, 10, 20), rnorm(200, 60, 30), runif(200, 120, 180)) # 600 points

Next, we visualize them for our understanding, using histogram, as in Figure 1.

# Plot 1
hist(data, breaks=50, freq=F, main="Univariate Distribution", xlab="Data Value")
# Plot 2
hist(data, breaks=20, freq=F, main="", xlab="20 Data Bins", col='red', border='red')

hist(data, breaks=100, freq=F, main="Univariate Distribution", xlab=NULL, xaxt='n', yaxt='n')

Figure 1 – Data Visualization using 50-Bin Histogram

Data Visualization using 50-Bin Histogram

While histograms are charts for data visualization, you can also see that they are our first estimate of density. More specifically, we can estimate density by dividing data into bins and assuming that density is constant within that bin range and has value equal to number of observations falling into that bin as proportion of total number of observations

Hence, estimated PDF is

And you realize that you have made assumption about bin-width which will impact density estimate. Hence bin-width is a parameter to density estimation model using histogram. However, overlooked fact is that we also are working with one more parameter – which is the starting position of first bin. You can see how that may affect density estimations for all bins. To see impact of bin-width, Figure 2 overlays density estimates with 20-bin and 100-bin histograms. Look at encircled region, where fewer/coarser bins give flat density estimate, while many/finer bins give varying density estimate. For yellow point, density estimates will range from 0.004 to 0.008 from two different models.

Figure 2 – Difference between 20 and 100 Bin Histograms

Difference between 20 and 100  Bin Histograms

Thus, selecting parameters right is crucial to get the density estimation right. We will get to that, but note that there are also other problems with histograms. Density estimates using histograms are quite jerky and discontinuous. Density is flat for a bin and then suddenly changes drastically for a point infinitesimally outside the bin. This makes consequence of wrong estimate even worse for practical problems.

Lastly, we have been working with single dimensional variable for ease of illustration, but in practice most problems are multi-dimensions. Since number of bins grow exponentially with number of dimensions, number of observations required to estimate the density also grow. In fact, it is plausible that despite having millions of observations, many bins remain empty or contain single digit observations. With just 50 bins each in just 3 dimensions, we have 503=125000 cells which needs to be populated. That comes about average of 8 observations per cell, assuming uniform distribution, a million observation training data.

How to select right parameters?

For bin-width n number of observations N for bin J proportion of observations is

and density estimate is

Statistical theory proves that while f(x) is expected value of density in the bin, variance of density is

While we can get better density estimate by reducing bin-width n , we increase variance of estimation, as we can intuitively feel about too fine bin-width. We can use leave one-out cross-validation technique to estimate optimal set of parameters. We can estimate density using all observations but one, and then compute density of that left out observation and measure error in estimation. Solving this mathematically for histograms gives a closed form solution for loss function for given bin-width.

where m is number of bins. Technical details of above are in this lecture [pdf] . We can plot this loss function for various numbers of bins (Figure 3)

getLoss <- function(n.break) {
N <- 600
res <- hist(data, breaks=n.break, freq=F)
bin <- as.numeric(res$breaks)
h <- bin[2]-bin[1]
p <- res$density
p <- p * h
return ( 2/(h*(N-1)) - ( (N+1)/(h*(N-1))*sum(p*p) ) )
loss.func <- data.frame(n=1:600)
loss.func$J <- sapply(loss.func$n, function(x) getLoss(x))
# Plot 3
plot(loss.func$n, loss.func$J, col='red', type='l')
opt.break <- max(loss.func[loss.func$J == min(loss.func$J), 'n'])
# Plot 4
hist(data, breaks=opt.break, freq=F, main="Univariate Distribution", xlab="15 Data Bins")

and get optimal number as 15. Actually anything from 8-15 is fine.

Figure 3 – Loss function against Number of Bins

Loss function against Number of Bins

Consequently, below Figure 4 is density estimation which balances density values as well granularity (with optimal bias-variance tradeoff).

Figure 4 – Optimal Histogram

Optimal Histogram

If you feel little uneasy at this point then I am with you. Even though, number of bins is mathematically optimal, it feels too coarse a estimation. There is no intuitive feeling why we have done the best job. And not to forget other concerns about starting position, discontinuous estimation, and curse of dimensionality . Dispair not, there is a better way. In next post we will talk about Density Estimation using Kernels.

Other Articles by the same author:

Curse Dimensionality

Semi-Supervised Clustering

Other Related Links that you may like:

Overview of Text Mining

Role of Business Analyst