Practical advice on non-parametric density estimation.

Always start from the histogram: any non-parametric density estimation method is essentially a fancier version of a histogram.

Compare the problem of choosing an optimal bin size in a histogram with the choice of the bandwidth h in a kernel estimator.

The number of bins is too small. Important features of this distribution, such as the mode, are not revealed.
Optimal number of bins (optimal according to Sturges' rule, but the rule is beside the point).
The number of bins is too large. The distribution is overfitted.
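
The three cases above can be reproduced with a minimal sketch. The bimodal sample and the "too few" and "too many" bin counts are illustrative assumptions, not the data behind the figures; Sturges' rule is taken in its usual form, k = ceil(log2 n) + 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal sample (illustrative only, not the data from the figures).
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.8, 200)])

n = x.size
sturges_bins = int(np.ceil(np.log2(n))) + 1  # Sturges' rule: k = ceil(log2 n) + 1

hists = {}
for label, k in [("too few", 2), ("Sturges", sturges_bins), ("too many", 200)]:
    # density=True rescales the counts so that the bars integrate to one.
    density, edges = np.histogram(x, bins=k, density=True)
    hists[label] = (density, edges)
    empty = int(np.sum(density == 0))
    width = edges[1] - edges[0]
    print(f"{label:>8}: {k:3d} bins, bin width {width:.3f}, {empty} empty bins")
```

With too few bins the two clusters are merged into wide bars, and with too many the bars start tracking individual observations, which typically shows up as empty bins.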

The point of the exercise is to reveal all features of the data, and that is what is important to keep in mind.

The bandwidth h is too large. Local features of this distribution are not revealed.
The bandwidth h is selected by a rule of thumb called the normal reference bandwidth.
The bandwidth h is too small. The distribution is overfitted.
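
The same comparison for the kernel estimator can be sketched as follows. This assumes a Gaussian kernel and the common form of the normal reference rule, h = 1.06 * sigma * n^(-1/5); the sample, the multipliers 10 and 0.05, and the helper names are hypothetical choices for illustration, and the figures above may have used a different variant of the rule.

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Gaussian kernel density estimate evaluated at each point of x_grid."""
    u = (x_grid[:, None] - data[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (data.size * h)

def normal_reference_bandwidth(data):
    """Rule of thumb h = 1.06 * sigma * n^(-1/5) (assumed variant)."""
    return 1.06 * data.std(ddof=1) * data.size ** (-0.2)

def count_modes(f):
    """Number of strict local maxima of f on the evaluation grid."""
    return int(np.sum((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])))

rng = np.random.default_rng(0)
# Hypothetical bimodal sample (illustrative only).
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.8, 200)])
grid = np.linspace(data.min() - 3.0, data.max() + 3.0, 2000)

h_ref = normal_reference_bandwidth(data)
estimates = {
    "too large": gaussian_kde(grid, data, 10.0 * h_ref),
    "normal reference": gaussian_kde(grid, data, h_ref),
    "too small": gaussian_kde(grid, data, 0.05 * h_ref),
}

for label, f in estimates.items():
    print(f"{label:>16}: {count_modes(f)} local maxima")
```

Counting local maxima is only a crude stand-in for looking at the plot, but it makes the point: an oversmoothed estimate hides the second mode, while an undersmoothed one invents many spurious ones.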

 

 

While a histogram takes an average within a bin, kernel estimation naturally extends this idea and takes a fancier version of the average around a given point. How much information around a point is used is governed by the bandwidth. Conceptually, the bandwidth and the bin width play identical roles.
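
One way to see the connection is to use a uniform (box) kernel, which is an assumption made here for illustration. With a bandwidth equal to half the bin width, the kernel estimate evaluated at a bin centre coincides with the histogram height of that bin. The only difference is that the kernel window can be centred at any point, not only at predefined bin centres. The sample and the function name below are hypothetical.

```python
import numpy as np

def box_kernel_estimate(x0, data, h):
    """Share of the sample within h of x0, divided by the window length 2h."""
    return np.mean(np.abs(data - x0) <= h) / (2.0 * h)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 1000)  # hypothetical sample

# Histogram heights: counts divided by (sample size * bin width).
edges = np.linspace(-4.0, 4.0, 17)  # 16 bins of width 0.5
counts, _ = np.histogram(data, bins=edges)
hist_density = counts / (data.size * np.diff(edges))

# Box-kernel estimate at the bin centres, with h equal to half the bin width.
centers = 0.5 * (edges[:-1] + edges[1:])
h = 0.5 * (edges[1] - edges[0])
kernel_at_centers = np.array([box_kernel_estimate(c, data, h) for c in centers])

print(np.allclose(hist_density, kernel_at_centers))

# The kernel estimator is not tied to the bin grid: the window can sit anywhere.
print(box_kernel_estimate(0.1, data, h))
```

Replacing the box with a smooth kernel, such as the Gaussian, gives the "fancier average": nearby observations get more weight than distant ones instead of an all-or-nothing cutoff.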

 

And now take a look at a perfect application of the idea in

Nissanov, Zoya, and Maria Grazia Pittau. “Measuring changes in the Russian middle class between 1992 and 2008: a nonparametric distributional analysis.” Empirical Economics 50.2 (2016): 503-530.

Comparison between income distributions in the period 1992–2008. Authors’ calculation on weighted household income data from RLMS. Kernel density estimates are obtained using adaptive bandwidth

Going back to the advice: keep in mind that you are doing this to reveal features of the data, and the result has to be strictly more informative than a histogram; otherwise the computational costs are not justified.
