Bayesian Information Criterion

The Bayesian Information Criterion (BIC) is a criterion for selecting a model from a finite set of candidates: the model with the lowest BIC is preferred. It is based on the likelihood function and is closely related to the Akaike Information Criterion (AIC). The BIC was introduced by Gideon E. Schwarz in 1978 and is therefore sometimes called the Schwarz Information Criterion (SIC).

Definition

The BIC is formally defined as:

\[ \text{BIC} = -2 \ln(L) + k \ln(n) \]

where:

  • \( L \) is the maximized value of the likelihood function of the model,
  • \( k \) is the number of free parameters estimated by the model,
  • \( n \) is the number of data points (the sample size).

The BIC can be used for model comparison, where the model with the lowest BIC is considered the best model among the candidates.
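
A minimal sketch in Python (the function name and the numeric values below are illustrative, not taken from any particular library) shows how simple the criterion is to evaluate once the maximized log-likelihood is known:

    import math

    def bic(log_likelihood: float, k: int, n: int) -> float:
        """Bayesian Information Criterion from a maximized log-likelihood.

        log_likelihood: ln(L) evaluated at the maximum-likelihood estimate
        k: number of free parameters in the model
        n: number of data points
        """
        return -2.0 * log_likelihood + k * math.log(n)

    # Example: two candidate models fitted to n = 100 observations. The second
    # fits slightly better but pays for two extra parameters.
    print(bic(log_likelihood=-420.0, k=3, n=100))  # ~853.8
    print(bic(log_likelihood=-418.5, k=5, n=100))  # ~860.0 -> first model wins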

Derivation

The BIC is derived from a Bayesian perspective: using asymptotic approximations and the Laplace method for integrals, it arises as a large-sample approximation to \( -2 \) times the log of the marginal likelihood of a model. Because a Bayes factor is a ratio of marginal likelihoods, the difference in BIC between two models approximates \( -2 \ln \) of their Bayes factor, which is what makes the BIC usable for Bayesian model comparison.
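
In outline (a sketch of the standard argument, not a full proof), applying the Laplace method to the marginal likelihood \( p(x \mid M) = \int L(\theta)\, \pi(\theta)\, d\theta \) and keeping only the terms that grow with \( n \) gives:

\[ \ln p(x \mid M) = \ln L(\hat{\theta}) - \frac{k}{2} \ln(n) + O(1), \]

where \( \hat{\theta} \) is the maximum-likelihood estimate. Multiplying by \( -2 \) yields the BIC. The prior \( \pi(\theta) \) enters only through the \( O(1) \) term, which is why the BIC does not depend on the choice of prior.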

Properties

      1. Consistency

One of the key properties of the BIC is consistency: as the sample size \( n \) increases, the probability that the BIC selects the true model (assuming it is among the candidate models) approaches 1. The AIC lacks this property; it retains a non-vanishing probability of choosing an over-parameterized model even as \( n \to \infty \).
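
A small simulation makes this concrete (a sketch; the polynomial-regression setup, sample sizes, and replication count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def poly_bic(x, y, degree):
        """BIC of a degree-`degree` polynomial fitted by least squares,
        using the Gaussian maximum-likelihood variance estimate."""
        n = len(y)
        coefs = np.polyfit(x, y, degree)
        rss = np.sum((y - np.polyval(coefs, x)) ** 2)
        sigma2 = rss / n  # MLE of the noise variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        k = degree + 2    # polynomial coefficients plus the noise variance
        return -2 * loglik + k * np.log(n)

    # True model is linear; track how often BIC recovers degree 1.
    for n in (20, 200, 2000):
        hits = 0
        for _ in range(200):
            x = rng.uniform(-1, 1, n)
            y = 2 * x + rng.normal(0, 1, n)
            bics = [poly_bic(x, y, d) for d in (1, 2, 3, 4)]
            hits += int(np.argmin(bics) == 0)  # index 0 is degree 1
        print(n, hits / 200)  # selection rate climbs toward 1 as n grows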

      2. Penalty Term

The BIC includes a penalty term for the number of parameters in the model. This penalty term, \( k \ln(n) \), increases with the number of parameters and the sample size. This discourages overfitting by penalizing models with more parameters.

      3. Likelihood Function

The BIC is based on the likelihood function, which measures how well the model explains the observed data. The likelihood function is a fundamental concept in statistical inference and is used in various model selection criteria.

Comparison with Other Criteria

      1. Akaike Information Criterion (AIC)

The AIC is another popular model selection criterion. It is defined as:

\[ \text{AIC} = -2 \ln(L) + 2k \]

The main difference between the AIC and the BIC is the penalty term. The AIC uses a constant penalty of \( 2k \), while the BIC uses \( k \ln(n) \). The BIC therefore imposes the heavier penalty whenever \( \ln(n) > 2 \), i.e. for any \( n \geq 8 \), and the gap widens as the sample size grows.
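
The practical consequence is easy to tabulate (the sample sizes below are illustrative):

    import math

    # Per extra parameter, AIC charges a flat 2 while BIC charges ln(n),
    # so the two criteria diverge as the sample grows.
    for n in (8, 100, 10_000, 1_000_000):
        print(f"n={n:>9,}: AIC penalty = 2.00, BIC penalty = {math.log(n):.2f}")
    # BIC is the harsher criterion for any n >= 8, since ln(8) > 2.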

      2. Cross-Validation

Cross-validation is a model selection method that involves partitioning the data into subsets, training the model on some subsets, and validating it on the remaining subsets. Unlike the BIC, cross-validation does not rely on asymptotic approximations and can be more robust in small sample sizes.
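
A minimal scikit-learn sketch of the contrast (assuming scikit-learn is available; the linear model and synthetic data are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (50, 1))
    y = 2 * X[:, 0] + rng.normal(0, 1, 50)

    # 5-fold cross-validation scores the model on held-out folds directly,
    # rather than penalizing parameter counts analytically as the BIC does.
    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    print(scores.mean())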

Applications

The BIC is widely used in various fields, including:

      1. Econometrics

In econometrics, the BIC is used for selecting among different econometric models. It helps in identifying the model that best explains the economic phenomena under study while avoiding overfitting.

      2. Machine Learning

In machine learning, the BIC is used for model selection in algorithms such as Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs). It helps in determining the optimal number of components or states in these models.
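
For example, scikit-learn's GaussianMixture exposes a bic method, so scanning candidate component counts is straightforward (the synthetic two-cluster data below is an illustrative assumption):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Synthetic data: two well-separated Gaussian clusters in 2-D.
    X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])

    # Fit GMMs with 1..5 components; the lowest BIC picks the model.
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, 6)}
    print(bics, "->", min(bics, key=bics.get))  # expected to favor 2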

      3. Bioinformatics

In bioinformatics, the BIC is used for selecting models in the analysis of genetic data. It helps in identifying the most likely genetic models that explain the observed data.

Limitations

      1. Large Sample Sizes

While the BIC is consistent, it can be overly conservative with large samples. The penalty term \( k \ln(n) \) grows without bound: at \( n = 10^6 \), for instance, each additional parameter must improve \( 2 \ln(L) \) by roughly 13.8 to be worthwhile, which can lead to the selection of overly simple models.

      2. Assumptions

The BIC relies on certain assumptions: the sample size must be large relative to the number of parameters, and Schwarz's original derivation applies to independent, identically distributed observations from models in the exponential family. The approximation can be poor when these assumptions are violated, and the consistency guarantee holds only if the true model is among the candidates.
