On choosing between distribution models fitted to data

This is an important topic, and researchers often find themselves puzzled about which goodness-of-fit metric to use. Here's a short handout:

Distance-based or "geometric" statistics.

  1. Kolmogorov-Smirnov and Anderson-Darling statistics were meant to test the goodness of fit of distributions with fully specified parameters, not parameters estimated from the very data being tested; applied naively to fitted distributions, they yield p-values that are too lenient. Corrections are available for only a few distributions (see the sketch after this list).
  2. The Chi-square statistic depends on how the data are grouped into bins, and there is no definitive rule for choosing the appropriate number of bins. It is also intended for large datasets.
  3. None of the above can be used to compare distributions with differing numbers of parameters. Thus a pdf with 4 parameters will almost always fit the data better than a pdf with 2 parameters, but that apparent improvement may be spurious, due to over-fitting.
  4. Truncated, censored, or binned data cannot be handled by any of the above.
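
The fitted-parameter problem in item 1 can be seen directly. Below is a minimal Python sketch (assuming numpy and scipy are available; the data are simulated purely for illustration): the naive Kolmogorov-Smirnov p-value is biased upward when the parameters were estimated from the same sample, while a parametric bootstrap recovers the correct null distribution of the statistic. This is essentially what the Lilliefors correction does for the normal distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # simulated sample

# Naive KS test: parameters estimated from the SAME data.
# The reported p-value is biased upward (the test is too lenient).
mu, sigma = np.mean(data), np.std(data, ddof=1)
d_obs, p_naive = stats.kstest(data, "norm", args=(mu, sigma))

# Parametric bootstrap: simulate from the fitted model and re-fit on
# each simulated sample to recover the null distribution of D.
n_boot = 1000
d_boot = np.empty(n_boot)
for i in range(n_boot):
    sim = rng.normal(mu, sigma, size=data.size)
    m, s = np.mean(sim), np.std(sim, ddof=1)
    d_boot[i] = stats.kstest(sim, "norm", args=(m, s)).statistic
p_boot = (d_boot >= d_obs).mean()

print(f"D = {d_obs:.4f}, naive p = {p_naive:.3f}, bootstrap p = {p_boot:.3f}")
```

Here the naive p-value overstates the quality of the fit; the bootstrap p-value is the one to trust whenever the parameters come from the data.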

Information-theory-based statistics.

These methods rank the proposed models relative to one another, so it is important that the candidate set contains good models to begin with: the criteria pick out the best model in the set, not whether any of them fits well. A sketch computing the main criteria follows this list.
  1. AIC (Akaike Information Criterion): penalises the number of parameters and is based on the concept of entropy; it connects Boltzmann's entropy, Fisher's likelihood theory, and the Kullback-Leibler discrepancy. There's also an AICc version corrected for small sample sizes.
  2. SIC/BIC (Schwarz or Bayesian Information Criterion): stricter than AIC in penalizing the number of parameters, since its penalty grows with the sample size.
  3. TIC (Takeuchi Information Criterion): useful when the candidate models aren't close approximations to the "real" underlying function.
  4. HQIC (Hannan-Quinn Information Criterion): a middle-ground criterion between AIC and SIC.
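
As a concrete companion to the list above, here is a minimal Python sketch (again assuming numpy and scipy; the data, the candidate pair, and the function name are illustrative choices, not from the original post). It uses the standard formulas AIC = 2k - 2 ln(L), AICc = AIC + 2k(k+1)/(n-k-1), BIC = k ln(n) - 2 ln(L), and HQIC = 2k ln(ln(n)) - 2 ln(L), where L is the maximized likelihood, k the number of fitted parameters, and n the sample size.

```python
import numpy as np
from scipy import stats

def criteria(loglik, k, n):
    """Standard information criteria from the maximized log-likelihood."""
    aic = 2 * k - 2 * loglik
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction
    bic = k * np.log(n) - 2 * loglik            # Schwarz / Bayesian IC
    hqic = 2 * k * np.log(np.log(n)) - 2 * loglik
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "HQIC": hqic}

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=150)  # simulated sample

# Candidate 1: normal distribution (k = 2 fitted parameters).
mu, sigma = stats.norm.fit(data)
ll_norm = stats.norm.logpdf(data, mu, sigma).sum()

# Candidate 2: gamma distribution (k = 3: shape, loc, scale).
a, loc, scale = stats.gamma.fit(data)
ll_gamma = stats.gamma.logpdf(data, a, loc, scale).sum()

for name, ll, k in [("normal", ll_norm, 2), ("gamma", ll_gamma, 3)]:
    vals = criteria(ll, k, len(data))
    print(name, {key: round(v, 1) for key, v in vals.items()})
```

Lower values indicate the preferred model within the candidate set; only the differences between models are meaningful, not the absolute numbers.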
References:

Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. Springer, New York.
