Open Study Answer #137


Proposed answer to the following questions:


Both variance and entropy are measures of uncertainty. Variance assumes that values vary as points in a space with distances between them. In this document, the variance of a random vector refers to its expected squared distance from its mean, which equals the sum of the variances of its components.

Random one-hot vectors are a convenient spatial representation for categorical random variables. A one-hot vector has all components equal to \(0\) except one component that equals \(1\). This representation has been used in genetics [1]. For genetic loci with only two alleles, the one-hot vector has two components, one of which is redundant. Only “half” of such a one-hot vector (a single component) is typically used in genetics (e.g. [2] p. 40, [3], [4]). The variance of the “half one-hot vector” is exactly half the variance of its full one-hot vector.
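
As a concrete illustration, here is a minimal numerical sketch in Python/NumPy (the allele frequency of 0.3, the seed, and the sample size are arbitrary example values, not taken from the references) comparing the variance of a full biallelic one-hot vector with that of its “half”:

```python
# Sketch: represent a biallelic locus as a one-hot vector and compare the
# variance of the full one-hot vector with its "half" (first component only).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # example allele frequency
alleles = rng.random(100_000) < p          # Bernoulli(p) samples
full = np.stack([alleles, ~alleles], axis=1).astype(float)  # one-hot: (1,0) or (0,1)
half = full[:, 0]                          # "half one-hot vector"

var_full = full.var(axis=0).sum()          # sum of component variances
var_half = half.var()
print(var_full, var_half)                  # var_full ≈ 2*p*(1-p), var_half ≈ p*(1-p)
```

The two printed values should be close to \(2p(1-p)\) and \(p(1-p)\), respectively.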

Main Result

Given \(N\) independent random one-hot vectors \(X_1\), \(X_2\), …, \(X_N\), denote their Cartesian product by \[\begin{eqnarray*} X_* = X_1 \times X_2 \times \dots \times X_N \end{eqnarray*}\]

The variance of \(X_*\) can be adjusted to form a lower bound to the collision entropy, \(\operatorname{H}_2(X_*)\), and Shannon entropy, \(\operatorname{H}(X_*)\): \[ -N \log_2{\left( 1 - \frac{ \operatorname{Var}({ X_*}) }{N} \right)} \; \le \; \operatorname{H}_2(X_*) \; \le \; \operatorname{H}({ X_*}) \]

If every \(X_i\) takes only two equally likely values, then the lower bounds reach equality: \[ -N \log_2{\left( 1 - \frac{ \operatorname{Var}({ X_*}) }{N} \right)} = \operatorname{H}_2({ X_*}) = \operatorname{H}({ X_*}) = N \]
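
The following Python/NumPy sketch numerically checks both the lower bound and the equality case. It uses the per-variable formulas derived in the proof below, together with the additivity of these entropies over independent variables; the number of variables, the random distributions, and the seed are arbitrary illustrative choices.

```python
# Numerical check of the main result (sketch, not part of the original text).
import numpy as np

rng = np.random.default_rng(1)

def entropies_and_variance(dists):
    """dists: list of probability vectors, one per one-hot variable."""
    var = sum(1 - np.sum(p**2) for p in dists)           # Var(X_*)
    h2  = sum(-np.log2(np.sum(p**2)) for p in dists)     # collision entropy H_2(X_*)
    h   = sum(-np.sum(p * np.log2(p)) for p in dists)    # Shannon entropy H(X_*)
    return var, h2, h

# N = 4 categorical variables with random category counts and probabilities
dists = [rng.dirichlet(np.ones(rng.integers(2, 6))) for _ in range(4)]
var, h2, h = entropies_and_variance(dists)
N = len(dists)
bound = -N * np.log2(1 - var / N)
assert bound <= h2 + 1e-9 <= h + 1e-9                    # lower bounds hold

# Equality case: every X_i takes two equally likely values
dists = [np.array([0.5, 0.5]) for _ in range(4)]
var, h2, h = entropies_and_variance(dists)
print(-4 * np.log2(1 - var / 4), h2, h)                  # all equal 4
```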

Proof

Let \(M_i\) be the length of \(X_i\) (the number of categorical values represented by \(X_i\)). Let \(p_{i,j}\) denote the probability of \(X_i\) taking the \(j\)-th categorical value.

For every \(1 \le i \le N\), \[\begin{eqnarray*} \sum_{j=1}^{M_i} p_{i,j} = 1 \end{eqnarray*}\]

The expectation and variance of the \(i\)-th one-hot vector \(X_i\) is \[\begin{eqnarray*} \operatorname{E}\!\left({ X_i}\right) & = & \left(\; {p}_{i,1} \;,\; {p}_{i,2} \;,\; \dots \;,\; {p}_{i,M_i} \;\right) \\ \operatorname{Var}({ X_i}) & = & \sum_{j=1}^{M_i} p_{i,j} \left[ (1 - p_{i,j})^2 + \sum_{k \not= j} (0 - p_{i,k})^2 \right] \\ & = & \sum_{j=1}^{M_i} p_{i,j} \left[ 1 - 2 p_{i,j} + \sum_{k=1}^{M_i} p_{i,k}^2 \right] \\ & = & 1 - 2 \sum_{j=1}^{M_i} p_{i,j}^2 + \sum_{k=1}^{M_i} p_{i,k}^2 \\ & = & 1 - \sum_{j=1}^{M_i} p_{i,j}^2 \end{eqnarray*}\]
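
A quick Monte Carlo sketch (Python/NumPy; the probability vector \((0.5, 0.3, 0.2)\) and the sample size are arbitrary example values) confirms that the sum of component variances of a one-hot vector matches the closed form \(1 - \sum_j p_{i,j}^2\):

```python
# Sketch checking Var(X_i) = 1 - sum_j p_{i,j}^2 against a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.5, 0.3, 0.2])                     # p_{i,j} for one variable X_i
samples = rng.choice(len(p), size=200_000, p=p)   # categorical draws
onehot = np.eye(len(p))[samples]                  # one-hot encoding

var_mc = onehot.var(axis=0).sum()                 # sum of component variances
var_formula = 1 - np.sum(p**2)
print(var_mc, var_formula)                        # both ≈ 0.62
```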

Thus the variance of \(X_i\) equals the probability of two independent samples from \(X_i\) being distinct. This probability of distinction has been called logical entropy [5].

The complement \[ 1 - \operatorname{Var}({ X_i}) = \sum_{j=1}^{M_i} p_{i,j}^2 \] is the chance of repetition, i.e. the expected probability of the sampled value. Taking the negative log gives the Rényi entropy of order 2, also called collision entropy: \[ -\log_2{( 1 - \operatorname{Var}({ X_i}))} = -\log_2{\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right)} = \operatorname{H}_2(X_i) \] Since the negative log is a convex function, the negative log of the expected probability (the collision entropy) is a lower bound on the expected negative log of probability (the Shannon entropy) by Jensen’s inequality: \[ \operatorname{H}_2(X_i) = -\log_2{\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right)} \le \sum_{j=1}^{M_i} p_{i,j} (-\log_2{p_{i,j}}) = \operatorname{H}({ X_i}) \]
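
A short numerical check of this Jensen step (the example distribution is an arbitrary choice, not from the text):

```python
# Sketch: the collision entropy never exceeds the Shannon entropy.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
h2 = -np.log2(np.sum(p**2))          # -log of expected probability
h  = -np.sum(p * np.log2(p))         # expected -log of probability
assert h2 <= h
print(h2, h)                         # ≈ 1.396 ≤ ≈ 1.485
```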

The total variance can be adjusted to equal the average probability of one-hot vector repetition (per one-hot vector): \[ 1 - \frac{ \operatorname{Var}({ X_*}) }{N} = 1 - \frac{1}{N} \sum_{i=1}^N \operatorname{Var}({ X_i}) = \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 \]

Applying the negative log and Jensen’s inequality again establishes another lower bound: \[ -\log_2{\left( \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 \right)} \le \frac{1}{N} \sum_{i=1}^N \left( -\log_2{\sum_{j=1}^{M_i} p_{i,j}^2} \right) = \frac{1}{N} \sum_{i=1}^N \operatorname{H}_2(X_i) \]
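
The same convexity argument can be checked numerically; in the sketch below the two example distributions are arbitrary:

```python
# Sketch: -log of the average collision probability is at most the
# average collision entropy.
import numpy as np

dists = [np.array([0.5, 0.3, 0.2]), np.array([0.9, 0.1])]
coll = np.array([np.sum(p**2) for p in dists])   # per-variable collision probabilities
lhs = -np.log2(coll.mean())                      # -log of the average
rhs = np.mean(-np.log2(coll))                    # average of the -logs = mean H_2(X_i)
assert lhs <= rhs
print(lhs, rhs)
```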

Collision and Shannon entropy are additive for independent variables. Putting everything together, we obtain: \[ -N \log_2{\left( 1 - \frac{ \operatorname{Var}({ X_*}) }{N} \right)} \; \le \; \operatorname{H}_2(X_*) \; \le \; \operatorname{H}({ X_*}) \]

References

1. Menozzi P, Piazza A, Cavalli-Sforza L (1978) Synthetic maps of human gene frequencies in Europeans. Science 201:786–792. https://doi.org/10.1126/science.356262

2. Weir BS (1996) Genetic data analysis II: Methods for discrete population genetic data. Sinauer Associates, Sunderland, Mass

3. Weir BS, Hill WG (2002) Estimating F-Statistics. Annual Review of Genetics 36:721–750. https://doi.org/10.1146/annurev.genet.36.050802.093940

4. Patterson N, Price AL, Reich D (2006) Population Structure and Eigenanalysis. PLoS Genetics 2:e190. https://doi.org/10.1371/journal.pgen.0020190

5. Ellerman D (2017) Logical information theory: New logical foundations for information theory. Logic Journal of the IGPL 25:806–835. https://doi.org/10.1093/jigpal/jzx022