Open Study Answer #135

WORK IN PROGRESS

Copredictor Definition

Given any discrete random variables \(X\) and \(Y\), define a copredictor of \(Y\) with \(X\) as any random variable \(W\) such that \[\begin{align*} & \text{$X$ and $W$ are independent and} \\ & Y = f(X,W) \text{ for some function $f$} \end{align*}\]
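
As a concrete illustration (a hypothetical example, not from the source), let \(X\) and \(W\) be independent fair bits and \(Y = X \oplus W\); then \(W\) is a copredictor of \(Y\) with \(X\). A minimal sketch checking both conditions by enumeration:

```python
# Hypothetical example: X and W independent fair bits, Y = X XOR W,
# so W is a copredictor of Y with X (Y = f(X, W) with f(x, w) = x ^ w).
from itertools import product

joint = {}  # P(X=x, W=w, Y=y)
for x, w in product([0, 1], repeat=2):
    joint[(x, w, x ^ w)] = 0.25  # P(X=x) * P(W=w) by independence

# Verify X and W are independent under the joint distribution.
px = {x: sum(p for (x_, _, _), p in joint.items() if x_ == x) for x in (0, 1)}
pw = {w: sum(p for (_, w_, _), p in joint.items() if w_ == w) for w in (0, 1)}
for x, w in product([0, 1], repeat=2):
    pxw = sum(p for (x_, w_, _), p in joint.items() if (x_, w_) == (x, w))
    assert abs(pxw - px[x] * pw[w]) < 1e-12
```

Note that here \(I(X;Y) = 0\), yet \(X\) and \(W\) together determine \(Y\) exactly.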

Copredictor Existence

Given any discrete random variables \(X\) and \(Y\), there exists a copredictor \(W\). The range of \(W\) is \((\operatorname{rng}{ X}) \mapsto (\operatorname{rng}{ Y})\), that is, all functions from the finite range of \(X\) to the range of \(Y\). Consider any \(x\), \(y\), \(w\) from \(\operatorname{rng}{ X}\), \(\operatorname{rng}{ Y}\), and \((\operatorname{rng}{ X}) \mapsto (\operatorname{rng}{ Y})\) respectively. Denote \[\begin{eqnarray*} S & = & \{\omega:W(\omega)=w\} \cap \{\omega:X(\omega)=x\} \cap \{\omega:Y(\omega)=y\} \end{eqnarray*}\] Define \(W\) such that \[\begin{eqnarray*} \operatorname{P}({ S}) & = & \; \operatorname{P}({ X=x}) \! \prod_{x' \in \operatorname{rng}{ X}} \operatorname{P}({ Y=w(x')|X=x'}) \end{eqnarray*}\] when \(w(x) = y\), and \(\operatorname{P}({ S}) = 0\) otherwise.
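
The construction can be sketched directly in code; the ranges and distributions below are assumed inputs for illustration, not values from the source:

```python
from itertools import product

# Assumed inputs: rng(X), rng(Y), P(X=x), and P(Y=y | X=x).
rng_x = [0, 1]
rng_y = ['a', 'b']
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {(0, 'a'): 0.9, (0, 'b'): 0.1,
               (1, 'a'): 0.2, (1, 'b'): 0.8}

# rng(W) = all functions w : rng(X) -> rng(Y), encoded as tuples so that
# w[i] = w(rng_x[i]).
funcs = list(product(rng_y, repeat=len(rng_x)))

def p_w(w):
    # P(W=w) = product over x' of P(Y=w(x') | X=x')
    prob = 1.0
    for i, x in enumerate(rng_x):
        prob *= p_y_given_x[(x, w[i])]
    return prob

# P(S) = P(W=w, X=x, Y=y) = P(X=x) * P(W=w) when w(x) = y, else 0.
joint = {(w, x, y): p_x[x] * p_w(w) if w[rng_x.index(x)] == y else 0.0
         for w in funcs for x in rng_x for y in rng_y}

# Sanity checks: the (X, Y) marginal is preserved and W is independent of X.
for x, y in product(rng_x, rng_y):
    pxy = sum(p for (w_, x_, y_), p in joint.items() if (x_, y_) == (x, y))
    assert abs(pxy - p_x[x] * p_y_given_x[(x, y)]) < 1e-12
for w, x in product(funcs, rng_x):
    pwx = sum(p for (w_, x_, y_), p in joint.items() if (w_, x_) == (w, x))
    assert abs(pwx - p_w(w) * p_x[x]) < 1e-12
```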

Proof TO DO

A Functional Representation Lemma can be found on page 626 of [1]; it is a more powerful result than the mere existence of a copredictor. A further strengthening, the Strong Functional Representation Lemma, can be found in [2].

Optimal Copredictor

Given any copredictor \(W\) of \(Y\) with \(X\), it follows that \[\begin{align*} \operatorname{H}({ Y}) & = I(X,W;Y) + H(Y|X,W) \\ & = I(X,W;Y) + 0 \\ & = I(X;Y) + I(W;Y) - I(X;W;Y) \\ & = I(X;Y) + I(W;Y) + I(X;W|Y) \end{align*}\] since \(Y = f(X,W)\) implies \(H(Y|X,W) = 0\), and the independence of \(X\) and \(W\) gives \(I(X;W;Y) + I(X;W|Y) = I(X;W) = 0\).

We seek a copredictor that maximizes how much predictive information is provided ‘exclusively’ by the independent variables \(X\) and \(W\) rather than by their interaction \(I(X;W|Y)\).

To this end we want the minimum \(I(X;W|Y)\) across all possible copredictors \(W\) of \(Y\) with \(X\). This minimum is defined as the excess functional information in [2]. In this document, any copredictor achieving this minimum is considered an optimal copredictor.

For any copredictor \(W\) of \(Y\) with \(X\), the conditional entropy of \(Y\) given \(W\) decomposes into the mutual information and the interaction term:

\[\begin{eqnarray*} H(Y|W) & = & I(X;Y|W) + H(Y|X,W) \\ & = & I(X;Y) - I(X;W;Y) + 0 \\ & = & I(X;Y) + I(X;W|Y) \end{eqnarray*}\]

Thus minimizing \(I(X;W|Y)\) is equivalent to minimizing \(H(Y|W)\): an optimal copredictor is exactly a copredictor achieving the minimum \(H(Y|W)\), and that minimum is the sum of the mutual information \(I(X;Y)\) and the excess functional information.
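
Both decompositions can be checked numerically on any copredictor triple; here is a minimal sketch using the hypothetical XOR example from the definition section:

```python
import math
from itertools import product

# Joint P(X=x, W=w, Y=y) for the XOR example: W is a copredictor of Y with X.
joint = {(x, w, x ^ w): 0.25 for x, w in product([0, 1], repeat=2)}

def H(idx):
    # Entropy in bits of the marginal over the given coordinate indices.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

X, W, Y = 0, 1, 2
I_X_Y = H([X]) + H([Y]) - H([X, Y])                            # I(X;Y)
I_W_Y = H([W]) + H([Y]) - H([W, Y])                            # I(W;Y)
I_XW_given_Y = H([X, Y]) + H([W, Y]) - H([X, W, Y]) - H([Y])   # I(X;W|Y)

# H(Y) = I(X;Y) + I(W;Y) + I(X;W|Y)
assert abs(H([Y]) - (I_X_Y + I_W_Y + I_XW_given_Y)) < 1e-12
# H(Y|W) = I(X;Y) + I(X;W|Y)
assert abs((H([W, Y]) - H([W])) - (I_X_Y + I_XW_given_Y)) < 1e-12
```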

Linear Subspace of Copredictors

Consider any finite discrete random variables \(X\) and \(Y\). Let \(F\) denote \((\operatorname{rng}{ X}) \mapsto (\operatorname{rng}{ Y})\), the set of all functions from the range of \(X\) to the range of \(Y\).

The following linear subspace of copredictors captures all of the possible values of \(H(Y|W)\) and \(I(X;W|Y)\). Consider all copredictors \(W\) of \(Y\) with \(X\) where the range of \(W\) is \(F\), and for each \(w \in F\), \(x \in \operatorname{rng}{ X}\), \(y \in \operatorname{rng}{ Y}\) with \[\begin{eqnarray*} S & = & \{\omega:W(\omega)=w\} \cap \{\omega:X(\omega)=x\} \cap \{\omega:Y(\omega)=y\} \end{eqnarray*}\] the following holds \[\begin{eqnarray*} \operatorname{P}({ S}) & = & \begin{cases} \operatorname{P}({ W=w}) \operatorname{P}({ X=x}) & \text{ if } w(x) = y \\ 0 & \text{ otherwise } \end{cases} \end{eqnarray*}\]

Each such copredictor \(W\) defines a point in the space \(F \mapsto \mathbb{R}\), where \(q_w := P(W=w)\) is the coordinate along the dimension \(w \in F\).

This is the linear subspace of possible \(q\) defined by the linear constraints: \[\begin{eqnarray*} P(Y=y|X=x) & = & \sum_{\substack{w \in F \\ w(x)=y}} q_w \end{eqnarray*}\] for all \(y \in \operatorname{rng}{ Y}\) and \(x \in \operatorname{rng}{ X}\), and \[\begin{eqnarray*} q_w & \ge & 0 \end{eqnarray*}\] for all \(w \in F\).
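
For concreteness, a sketch of these constraints in matrix form, using an assumed binary example (the channel \(P(Y|X)\) below is hypothetical):

```python
import numpy as np
from itertools import product

# Assumed example: binary X and Y with a hypothetical channel P(Y|X).
rng_x, rng_y = [0, 1], [0, 1]
p_y_given_x = np.array([[0.9, 0.1],   # P(Y=. | X=0)
                        [0.2, 0.8]])  # P(Y=. | X=1)

# F: all functions w : rng(X) -> rng(Y); w is a tuple with w[x] = w(x).
F = list(product(rng_y, repeat=len(rng_x)))  # |rng Y|^|rng X| = 4 functions

# Equality constraints A q = b: for each pair (x, y), the q_w for
# {w : w(x) = y} must sum to P(Y=y | X=x).
pairs = list(product(rng_x, rng_y))
A = np.array([[1.0 if w[x] == y else 0.0 for w in F] for (x, y) in pairs])
b = np.array([p_y_given_x[x, y] for (x, y) in pairs])
# Together with q >= 0, {q : A q = b, q >= 0} is the subspace of
# copredictor distributions described above.
```

Each row of \(A\) corresponds to a pair \((x, y)\) and each column to a function \(w \in F\).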

We show that \(H(Y|W)\) is a linear function of the coordinates \(q_w\) across \(w \in F\). Because of this, minimizing \(H(Y|W)\) amounts to minimizing a linear objective function over the subspace.

For any \(w \in F\) and \(y \in \operatorname{rng}{ Y}\), \[\begin{eqnarray*} \operatorname{P}({ Y=y|W=w}) & = & \frac{\operatorname{P}({ Y=y,W=w})}{\operatorname{P}({ W=w})} \\ & = & \sum_{x \in \operatorname{rng}{ X}} \frac{\operatorname{P}({ Y=y,X=x,W=w})}{\operatorname{P}({ W=w})} \\ & = & \sum_{\substack{x \in \operatorname{rng}{ X} \\ w(x)=y }} \frac{\operatorname{P}({ X=x}) \operatorname{P}({ W=w})}{\operatorname{P}({ W=w})} \\ & = & \sum_{\substack{x \in \operatorname{rng}{ X} \\ w(x)=y }} \operatorname{P}({ X=x}) \\ & = & \operatorname{P}({ X \in w^{-1}(y)}) \\ \end{eqnarray*}\] For any \(w \in F\) let \[\begin{eqnarray*} m(w) & = & \sum_{y \in \operatorname{rng}{ Y}} \operatorname{P}({ X \in w^{-1}(y)}) \log_2\frac{1}{\operatorname{P}({ X \in w^{-1}(y)})} \end{eqnarray*}\] Note that \(m(w)\) does not depend on any probabilities \(\operatorname{P}({ W=w})\) of a specific copredictor \(W\). The function \(m\) only depends on functions \(w \in F\). Combining the previous results shows that \(H(Y|W)\) is a linear objective function. \[\begin{eqnarray*} H(Y|W) & = & \sum_{w \in F} P(W=w) \sum_{y \in \operatorname{rng}{ Y}} P(Y=y|W=w) \log_2\frac{1}{P(Y=y|W=w)} \\ & = & \sum_{w \in F} q_w \, m(w) \end{eqnarray*}\]
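
Continuing the assumed binary example, the coefficients \(m(w)\) depend only on the distribution of \(X\) and the function \(w\); a sketch (with a hypothetical \(P(X)\)):

```python
import numpy as np
from itertools import product

# Continuing the binary example; P(X) is an additional assumed input.
rng_x, rng_y = [0, 1], [0, 1]
p_x = np.array([0.5, 0.5])
F = list(product(rng_y, repeat=len(rng_x)))

def m(w):
    # m(w) = sum_y P(X in w^{-1}(y)) * log2(1 / P(X in w^{-1}(y)))
    total = 0.0
    for y in rng_y:
        mass = sum(p_x[x] for x in rng_x if w[x] == y)  # P(X in w^{-1}(y))
        if mass > 0:
            total -= mass * np.log2(mass)
    return total

m_vec = np.array([m(w) for w in F])
# For any feasible q in the subspace, H(Y|W) = m_vec @ q:
# a linear objective in the coordinates q_w.
```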

Any copredictor \(Z\) can be simplified such that \(H(Y|Z) = H(Y|W)\) for some \(W\) in the above linear subspace. Every value \(z\) in the range of \(Z\) corresponds to some function \(w \in F = (\operatorname{rng}{ X} \mapsto \operatorname{rng}{ Y})\) since \(H(Y|X,Z) = 0\). Because \(X\) and \(Z\) are independent, \(\operatorname{P}({ Y=y|Z=z}) = \operatorname{P}({ X \in w^{-1}(y)})\) depends only on that corresponding \(w\). Let \(W\) be a random variable with range \(F\) such that \(P(W=w)\) equals the sum of probabilities \(\operatorname{P}({ Z=z})\) for which \(z\) corresponds to \(w\); merging values of \(z\) with the same conditional distribution leaves the conditional entropy unchanged.

Linear Optimization

Minimization of the previously identified linear objective function within the linear subspace of copredictors is a standard form linear programming problem [3]. This primal problem has a corresponding dual problem. Although the primal problem is an optimization over \(|\operatorname{rng}{ Y}|^{|\operatorname{rng}{ X}|}\) variables, the dual problem is an optimization over only \(|\operatorname{rng}{ Y}| \cdot |\operatorname{rng}{ X}|\) variables.

In matrix notation, the dual linear program is to maximize \(b^T r\) subject to the constraint \(A^T r \le m\). The resulting maximum \(b^T r\) equals the minimum possible \(H(Y|W)\).

The vector \(b\) consists of the values \(P(Y=y|X=x)\) across values \(x \in \operatorname{rng}{ X}\) and \(y \in \operatorname{rng}{ Y}\). The vector \(m\) consists of the values \(m(w)\) across \(w \in F\). The matrix expression \(A^T r \le m\) represents the linear constraints \[\begin{eqnarray*} \sum_{x \in \operatorname{rng}{ X}} r_{(x,w(x))} \le m(w) \end{eqnarray*}\] across all \(w \in F\). The maximized \(b^T r\) represents \[\begin{eqnarray*} \sum_{\substack{x \in \operatorname{rng}{ X} \\ y \in \operatorname{rng}{ Y}}} \operatorname{P}({ Y=y|X=x}) \, r_{(x,y)} \end{eqnarray*}\] and will equal the minimum possible \(H(Y|W)\) across all copredictors \(W\).
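
As an illustration, a sketch solving the primal problem for the assumed binary example with `scipy.optimize.linprog` (a tooling choice made here, not prescribed by the source); by strong LP duality its optimal value equals the dual's maximized \(b^T r\):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

# Assumed binary example: P(X) and channel P(Y|X).
rng_x, rng_y = [0, 1], [0, 1]
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])

F = list(product(rng_y, repeat=len(rng_x)))  # |rng Y|^|rng X| primal variables

def m(w):
    # Objective coefficient m(w), as defined above.
    total = 0.0
    for y in rng_y:
        mass = sum(p_x[x] for x in rng_x if w[x] == y)
        if mass > 0:
            total -= mass * np.log2(mass)
    return total

pairs = list(product(rng_x, rng_y))
c = np.array([m(w) for w in F])
A = np.array([[1.0 if w[x] == y else 0.0 for w in F] for (x, y) in pairs])
b = np.array([p_y_given_x[x, y] for (x, y) in pairs])

# Primal: minimize m^T q subject to A q = b, q >= 0.
res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * len(F))
print("min H(Y|W) =", res.fun)  # equals the dual's maximized b^T r
```

For this assumed channel the minimum is 0.7 bits, compared with \(I(X;Y) \approx 0.40\) bits, giving an excess functional information of roughly 0.30 bits.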

References

1. El Gamal A, Kim Y-H (2011) Network information theory. Cambridge University Press, Cambridge; New York

2. Li CT, El Gamal A (2017) Strong functional representation lemma and applications to coding theorems. In: 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, Aachen, Germany, pp 589–593

3. Cornuejols G, Tütüncü R (2007) Optimization methods in finance. Cambridge University Press, Cambridge, UK; New York