Categorical distribution

Introduction

The categorical distribution is a discrete probability distribution that describes the possible outcomes of a categorical variable. It generalizes the Bernoulli distribution to a random variable that can take on more than two possible outcomes. It is a fundamental concept in probability theory and statistics, particularly in machine learning, natural language processing, and other fields that involve discrete data.

Definition

A categorical distribution is defined over a finite set of \( k \) possible outcomes, denoted as \( \{1, 2, \ldots, k\} \). The probability of each outcome \( i \) is given by \( p_i \), where \( p_i \geq 0 \) for all \( i \) and \( \sum_{i=1}^k p_i = 1 \). Formally, the probability mass function (PMF) of a categorical distribution is:

\[ P(X = i) = p_i \quad \text{for} \quad i = 1, 2, \ldots, k \]

Here, \( X \) is a random variable representing the outcome of the categorical distribution.
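
As a minimal sketch, the PMF and sampling can be expressed in a few lines of NumPy. The probability vector below is an arbitrary illustration, and outcomes are indexed from 0 rather than 1, as is conventional in code:

```python
import numpy as np

# Illustrative probability vector over k = 4 outcomes; any non-negative
# vector summing to 1 defines a valid categorical distribution.
p = np.array([0.1, 0.2, 0.3, 0.4])
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

# P(X = i) is simply a table lookup (outcomes indexed 0..k-1 here).
def pmf(i, p):
    return p[i]

# np.random.Generator.choice performs categorical sampling directly.
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=10, p=p)
print(pmf(2, p))   # 0.3
print(samples)
```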

Properties

Mean and Variance

Because the outcomes of a categorical variable are labels rather than numbers, the mean and variance are only meaningful once the outcomes are given a numerical coding, such as \( i = 1, 2, \ldots, k \). Under this coding, the expected value of the random variable \( X \) is:

\[ \mathbb{E}[X] = \sum_{i=1}^k i \cdot p_i \]

The variance of \( X \) is given by:

\[ \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]

where \( \mathbb{E}[X^2] = \sum_{i=1}^k i^2 \cdot p_i \) is the second moment of the distribution.
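
A short NumPy sketch of these moments under the integer coding \( i = 1, \ldots, k \) (the probabilities are illustrative):

```python
import numpy as np

# Mean and variance only make sense once outcomes are coded as numbers.
p = np.array([0.2, 0.5, 0.3])        # illustrative probabilities
i = np.arange(1, len(p) + 1)         # outcomes coded as 1, 2, 3

mean = np.sum(i * p)                  # E[X]
second_moment = np.sum(i**2 * p)      # E[X^2]
var = second_moment - mean**2         # Var(X) = E[X^2] - (E[X])^2

print(mean)   # 2.1
print(var)    # 0.49
```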

Entropy

The entropy of a categorical distribution, which measures the uncertainty associated with the distribution, is given by:

\[ H(X) = -\sum_{i=1}^k p_i \log(p_i) \]

This is a fundamental concept in information theory and is used to quantify the amount of information contained in the distribution.
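
A minimal sketch of this computation, with the usual convention that terms with \( p_i = 0 \) contribute zero; the base argument selects nats (natural log) or bits (base 2):

```python
import numpy as np

def entropy(p, base=np.e):
    # H(X) = -sum_i p_i log(p_i), skipping zero-probability outcomes
    # (0 log 0 := 0 by convention).
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

# The uniform distribution maximizes entropy: log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25], base=2))  # 2.0
```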

Applications

Categorical distributions are widely used in various fields, including:

Machine Learning

In machine learning, categorical distributions are often used to model the probabilities of different classes in classification problems. For example, in a Naive Bayes classifier with discrete features, categorical distributions model the prior over classes and the per-class feature likelihoods, and the resulting posterior over classes is itself a categorical distribution.

Natural Language Processing

In natural language processing (NLP), categorical distributions are used to model the probabilities of different words or tokens in a text corpus. This is particularly useful in tasks such as text classification, language modeling, and topic modeling.

Bayesian Inference

In Bayesian inference, the categorical distribution is used as the likelihood function for discrete data. It is often combined with a Dirichlet distribution as the conjugate prior, resulting in a Dirichlet-categorical model.

Relationship to Other Distributions

The categorical distribution is closely related to several other probability distributions:

Bernoulli Distribution

The Bernoulli distribution is a special case of the categorical distribution with \( k = 2 \). It describes the probability of success and failure in a single trial.

Multinomial Distribution

The multinomial distribution is a generalization of the categorical distribution to multiple trials. If \( X_1, X_2, \ldots, X_n \) are independent and identically distributed (i.i.d.) random variables following a categorical distribution, then the vector of outcome counts (equivalently, the sum of their one-hot encodings) follows a multinomial distribution.
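
A quick empirical sketch of this relationship, assuming an illustrative probability vector: the counts obtained by accumulating categorical draws can be compared with a direct multinomial draw:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])
n = 1000

draws = rng.choice(len(p), size=n, p=p)          # n i.i.d. categorical samples
counts = np.bincount(draws, minlength=len(p))    # sum of their one-hot encodings

direct = rng.multinomial(n, p)                   # one Multinomial(n, p) draw
print(counts, direct)  # both are count vectors summing to n
```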

Dirichlet Distribution

The Dirichlet distribution is a continuous multivariate distribution that serves as a conjugate prior for the categorical distribution in Bayesian inference. It is used to model the distribution of probabilities for the different outcomes of a categorical variable.

Parameter Estimation

Parameter estimation for the categorical distribution involves estimating the probabilities \( p_i \) for each outcome \( i \). This can be done using the method of maximum likelihood estimation (MLE) or Bayesian estimation.

Maximum Likelihood Estimation

In MLE, the probabilities \( p_i \) are estimated by maximizing the likelihood function. Given a sample of \( n \) observations \( X_1, X_2, \ldots, X_n \), the likelihood function is:

\[ L(p_1, p_2, \ldots, p_k) = \prod_{j=1}^n p_{X_j} \]

Maximizing this likelihood subject to the constraint \( \sum_{i=1}^k p_i = 1 \) (for example, with a Lagrange multiplier) yields the MLE of \( p_i \):

\[ \hat{p}_i = \frac{n_i}{n} \]

where \( n_i \) is the number of times outcome \( i \) occurs in the sample.
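
A minimal sketch of this estimator on an illustrative sample, with outcomes coded 0 to \( k - 1 \):

```python
import numpy as np

# Illustrative sample of n = 10 observations over k = 3 outcomes.
observations = np.array([0, 1, 1, 2, 1, 0, 2, 1, 1, 2])
k = 3

counts = np.bincount(observations, minlength=k)  # n_i for each outcome
p_hat = counts / len(observations)               # hat{p}_i = n_i / n
print(p_hat)  # [0.2 0.5 0.3]
```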

Bayesian Estimation

In Bayesian estimation, the probabilities \( p_i \) are treated as random variables with a prior distribution. The Dirichlet distribution is commonly used as the prior because it is conjugate to the categorical likelihood: if the prior is \( \mathrm{Dir}(\alpha_1, \ldots, \alpha_k) \) and outcome \( i \) is observed \( n_i \) times, the posterior is \( \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_k + n_k) \).
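
A minimal sketch of this conjugate update, using an illustrative flat prior and the counts from the MLE example above; the posterior mean of \( p_i \) is \( (\alpha_i + n_i) / (\sum_j \alpha_j + n) \):

```python
import numpy as np

# Dirichlet(alpha) prior + categorical counts -> Dirichlet posterior.
alpha = np.array([1.0, 1.0, 1.0])       # uniform ("flat") Dirichlet prior
counts = np.array([2, 5, 3])            # n_i from an observed sample, n = 10

alpha_post = alpha + counts             # posterior is Dirichlet(alpha + counts)
p_mean = alpha_post / alpha_post.sum()  # posterior mean of each p_i
print(p_mean)  # [0.2308 0.4615 0.3077] (approximately)
```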

Example

Consider a simple example of a categorical distribution with three possible outcomes: \( \{A, B, C\} \). Let the probabilities of these outcomes be \( p_A = 0.2 \), \( p_B = 0.5 \), and \( p_C = 0.3 \). The PMF of this distribution is:

\[ P(X = A) = 0.2, \quad P(X = B) = 0.5, \quad P(X = C) = 0.3 \]

The entropy of this distribution is:

\[ H(X) = - (0.2 \log(0.2) + 0.5 \log(0.5) + 0.3 \log(0.3)) \]

With the natural logarithm, this evaluates to approximately 1.03 nats (about 1.49 bits in base 2).
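
Evaluating the expression above numerically:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
H = -np.sum(p * np.log(p))    # natural log -> nats
print(H)                      # ~1.0297 nats
print(H / np.log(2))          # ~1.4855 bits
```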

Conclusion

The categorical distribution is a fundamental concept in probability and statistics, with applications in various fields such as machine learning, natural language processing, and Bayesian inference. Understanding its properties, relationships to other distributions, and methods for parameter estimation is crucial for effectively modeling and analyzing discrete data.
