Introduction

Both variance and entropy are measures of uncertainty. Variance assumes that values are points in a space separated by distances. In this document, the variance of a random vector refers to its expected squared distance from its mean, which equals the sum of the variances of its components.
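As a quick numerical illustration of this convention (a minimal sketch with arbitrary, assumed component scales), the sum of the component variances of a random vector equals its expected squared distance from its mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed example: 3-component random vector with arbitrary scales per component.
X = rng.normal(size=(100_000, 3)) * [1.0, 2.0, 0.5]

mean = X.mean(axis=0)
sum_of_component_variances = X.var(axis=0).sum()
expected_squared_distance = ((X - mean) ** 2).sum(axis=1).mean()

print(sum_of_component_variances, expected_squared_distance)  # the two agree
```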

Random one-hot vectors are a convenient spatial representation for categorical random variables. A one-hot vector has all components equal to 0 except one component that equals 1. This representation has been used in genetics [1]. For genetic loci with only two alleles, the one-hot vector has two components, one of which is redundant. “Half” of such one-hot vectors are typically used in genetics (e.g. [2] p.40, [3], [4]). The variance of the “half one-hot vector” is exactly half the variance of its full one-hot vector.
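A small sketch of the two-allele case, assuming an illustrative allele frequency of p = 0.3 (not a value from the text): the “half” vector keeps only one component of the full one-hot vector and carries exactly half its variance.

```python
import numpy as np

p = 0.3                      # assumed frequency of the first allele
probs = np.array([p, 1 - p])

# Full one-hot vectors: rows of the identity matrix, drawn with the given probabilities.
onehots = np.eye(2)
var_full = sum(prob * ((onehot - probs) ** 2).sum() for prob, onehot in zip(probs, onehots))

# "Half" one-hot vector: keep only the first (non-redundant) component.
var_half = sum(prob * (onehot[0] - p) ** 2 for prob, onehot in zip(probs, onehots))

print(var_full, var_half, var_full / 2)  # var_half == var_full / 2 == p * (1 - p)
```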

Main Result

Given $N$ independent random one-hot vectors $X_1, X_2, \dots, X_N$, define $X_* = X_1 \times X_2 \times \dots \times X_N$ as their Cartesian product.

The variance of $X_*$ can be adjusted to form a lower bound on both the collision entropy $\operatorname{H}_2(X_*)$ and the Shannon entropy $\operatorname{H}(X_*)$:
$$ -\log_2\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right) \;\le\; \frac{\operatorname{H}_2(X_*)}{N} \;\le\; \frac{\operatorname{H}(X_*)}{N} $$

If every $X_i$ takes only two equally likely values, then the lower bounds reach equality:
$$ -\log_2\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right) = \frac{\operatorname{H}_2(X_*)}{N} = \frac{\operatorname{H}(X_*)}{N} = 1 $$
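A numerical check of the main result, using assumed category probabilities chosen only for illustration; the helper functions below are hypothetical, not part of any cited work.

```python
import numpy as np

def collision_entropy(p):
    return -np.log2(np.sum(np.asarray(p) ** 2))

def shannon_entropy(p):
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

def onehot_variance(p):
    return 1.0 - np.sum(np.asarray(p) ** 2)

# Assumed example: three independent categorical variables with different supports.
dists = [[0.5, 0.3, 0.2], [0.7, 0.3], [0.25, 0.25, 0.25, 0.25]]
N = len(dists)

var_total = sum(onehot_variance(p) for p in dists)
H2_total = sum(collision_entropy(p) for p in dists)   # additive for independent X_i
H_total = sum(shannon_entropy(p) for p in dists)

lower_bound = -np.log2(1 - var_total / N)
print(lower_bound <= H2_total / N <= H_total / N)      # True

# Equality case: every X_i uniform on two values.
fair = [[0.5, 0.5]] * N
var_total = sum(onehot_variance(p) for p in fair)
print(-np.log2(1 - var_total / N),
      sum(collision_entropy(p) for p in fair) / N,
      sum(shannon_entropy(p) for p in fair) / N)       # all equal 1.0
```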

Proof

Let $M_i$ be the length of $X_i$ (the number of categorical values represented by $X_i$). Let $p_{i,j}$ denote the probability of $X_i$ taking the $j$-th categorical value.

For every $1 \le i \le N$,
$$ \sum_{j=1}^{M_i} p_{i,j} = 1 $$

The expectation and variance of the $i$-th one-hot vector $X_i$ are
$$
\begin{aligned}
\operatorname{E}\!\left(X_i\right) &= \left(\; p_{i,1} \;,\; p_{i,2} \;,\; \dots \;,\; p_{i,M_i} \;\right) \\
\operatorname{Var}(X_i) &= \sum_{j=1}^{M_i} p_{i,j} \left[ (1 - p_{i,j})^2 + \sum_{k \ne j} (0 - p_{i,k})^2 \right] \\
&= \sum_{j=1}^{M_i} p_{i,j} \left[ 1 - 2 p_{i,j} + \sum_{k=1}^{M_i} p_{i,k}^2 \right] \\
&= 1 - 2 \sum_{j=1}^{M_i} p_{i,j}^2 + \sum_{k=1}^{M_i} p_{i,k}^2 \\
&= 1 - \sum_{j=1}^{M_i} p_{i,j}^2
\end{aligned}
$$

Thus the variance of $X_i$ equals the probability of two independent samples from $X_i$ being distinct. This probability of distinction has been called logical entropy [5].
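A Monte Carlo sketch, with assumed category probabilities, of the identity just derived: the one-hot variance $1 - \sum_j p_{i,j}^2$ matches the simulated probability that two independent draws of $X_i$ differ.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])          # assumed category probabilities for X_i

var_onehot = 1 - np.sum(p ** 2)        # variance of the one-hot vector

# Probability that two independent samples are distinct (logical entropy), by simulation.
a = rng.choice(len(p), size=200_000, p=p)
b = rng.choice(len(p), size=200_000, p=p)
print(var_onehot, np.mean(a != b))     # the two agree up to sampling error
```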

The complement
$$ 1 - \operatorname{Var}(X_i) = \sum_{j=1}^{M_i} p_{i,j}^2 $$
is the chance of repetition, which is the expected probability of the observed value. Taking the negative log gives the Rényi entropy of order 2, also called collision entropy:
$$ -\log_2\left( 1 - \operatorname{Var}(X_i) \right) = -\log_2\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right) = \operatorname{H}_2(X_i) $$
Since the negative log is a convex function, the negative log of the expected probability (collision entropy) is a lower bound to the expected negative log of probability (Shannon entropy) by Jensen’s inequality:
$$ \operatorname{H}_2(X_i) = -\log_2\left( \sum_{j=1}^{M_i} p_{i,j}^2 \right) \le \sum_{j=1}^{M_i} p_{i,j} \left( -\log_2 p_{i,j} \right) = \operatorname{H}(X_i) $$
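The same bound checked numerically for one assumed distribution: the collision entropy never exceeds the Shannon entropy.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])        # assumed category probabilities
H2 = -np.log2(np.sum(p ** 2))        # collision entropy: -log of expected probability
H = -np.sum(p * np.log2(p))          # Shannon entropy: expected -log of probability
print(H2, H, H2 <= H)                # H2 is never larger than H
```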

Because the variance of $X_*$ is the sum of the variances of its components, $\operatorname{Var}(X_*) = \sum_{i=1}^N \operatorname{Var}(X_i)$. The total variance can therefore be adjusted to equal the average probability of one-hot vector repetition (per one-hot vector):
$$ 1 - \frac{\operatorname{Var}(X_*)}{N} = 1 - \frac{1}{N} \sum_{i=1}^N \operatorname{Var}(X_i) = \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 $$

The negative log with Jensen’s inequality then establishes yet another lower bound:
$$ -\log_2\left( \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{M_i} p_{i,j}^2 \right) \le \frac{1}{N} \sum_{i=1}^N \left( -\log_2 \sum_{j=1}^{M_i} p_{i,j}^2 \right) = \frac{1}{N} \sum_{i=1}^N \operatorname{H}_2(X_i) $$
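A sketch of this averaged Jensen step, again with assumed distributions: the negative log of the mean collision probability is at most the mean collision entropy.

```python
import numpy as np

# Assumed distributions for N = 3 independent categorical variables.
dists = [np.array([0.5, 0.3, 0.2]), np.array([0.9, 0.1]), np.array([0.4, 0.4, 0.2])]
collision_probs = np.array([np.sum(p ** 2) for p in dists])   # sum_j p_{i,j}^2 for each i

lhs = -np.log2(collision_probs.mean())                        # -log2(1 - Var(X_*)/N)
rhs = np.mean(-np.log2(collision_probs))                      # average H2(X_i)
print(lhs, rhs, lhs <= rhs)                                   # True by Jensen's inequality
```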

Collision and Shannon entropy are additive for independent variables, so $\sum_{i=1}^N \operatorname{H}_2(X_i) = \operatorname{H}_2(X_*)$ and $\sum_{i=1}^N \operatorname{H}(X_i) = \operatorname{H}(X_*)$. Putting everything together, we get
$$ -\log_2\left( 1 - \frac{\operatorname{Var}(X_*)}{N} \right) \;\le\; \frac{\operatorname{H}_2(X_*)}{N} \;\le\; \frac{\operatorname{H}(X_*)}{N} $$
When every $X_i$ is uniform on two values, $\sum_j p_{i,j}^2 = \tfrac{1}{2}$ for all $i$, both applications of Jensen’s inequality hold with equality, and every term above equals 1.
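As a sanity check on the additivity step, a sketch with two assumed independent variables: the collision and Shannon entropies of their joint distribution equal the sums of the marginal entropies.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])      # assumed distribution of X_1
q = np.array([0.7, 0.3])           # assumed distribution of X_2
joint = np.outer(p, q).ravel()     # joint distribution under independence

def H2(d):
    return -np.log2(np.sum(np.asarray(d) ** 2))

def H(d):
    d = np.asarray(d)
    return -np.sum(d * np.log2(d))

print(H2(joint), H2(p) + H2(q))    # equal: collision entropy is additive
print(H(joint), H(p) + H(q))       # equal: Shannon entropy is additive
```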

References

1.
Menozzi P, Piazza A, Cavalli-Sforza L. Synthetic maps of human gene frequencies in Europeans. Science. 1978;201: 786–792. doi:10.1126/science.356262
2.
Weir BS. Genetic data analysis II: Methods for discrete population genetic data. Sunderland, Mass: Sinauer Associates; 1996.
3.
Weir BS, Hill WG. Estimating F-Statistics. Annual Review of Genetics. 2002;36: 721–750. doi:10.1146/annurev.genet.36.050802.093940
4.
Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLoS Genetics. 2006;2: e190. doi:10.1371/journal.pgen.0020190
5.
Ellerman D. Logical information theory: New logical foundations for information theory. Logic Journal of the IGPL. 2017;25: 806–835. doi:10.1093/jigpal/jzx022