Open Study Answer #136

Proposed answer to the following questions:


The name \(F_{ST}\) was used as early as [1] for a certain measure between populations. In the decades since, the name has taken on slightly different meanings. The recent publication [2] surveys a long list of \(F_{ST}\) meanings used in many papers over those decades. Most of these papers cover additional concepts that are not required to define \(F_{ST}\) or to gain intuition for it, and the mathematical details behind a formal definition are spread across many papers.

This document gives a simple formal definition of \(F_{ST}\), equivalent to the definitions in [2] and [3]. This simple definition also makes clear that \(F_{ST}\) is precisely a ratio of variances.

The Definition

Random variables \(A_S\), \(A_T\) and \(D\) model uncertainty for \(F_{ST}\):

“Top” population can mean ancestral population (as in [2]), or it can mean “total” population (the original meaning in [1]).

Given assumptions

the definition follows

\[\begin{eqnarray*} F_{ST} & := & \frac{ \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) }{ \operatorname{Var}({ A_T}) } \end{eqnarray*}\]

Convenient Expectations

Conveniently, expectations of allele variables are allele frequencies. The following variable definition will be convenient \[\begin{eqnarray*} p & := & \operatorname{E}\!\left({ A_T}\right) \end{eqnarray*}\]

Due to the assumptions of \(F_{ST}\) the following are conveniently true \[\begin{align*} p &= \operatorname{E}\!\left({ A_S}\right) & p &= \operatorname{E}\!\left({ A_T^2}\right) \\ \operatorname{E}\!\left({ A_S}\right) &= \operatorname{E}\!\left({ A_S^2}\right) & \operatorname{E}\!\left({ A_S|D}\right) &= \operatorname{E}\!\left({ A_S^2|D}\right) \\ \end{align*}\] and \[ \operatorname{Var}({ A_T}) = \operatorname{E}\!\left({ A_T^2}\right) - \operatorname{E}\!\left({ A_T}\right)^2 = p - p^2 = p(1-p) \]
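As a concrete illustration of the definition and of \(\operatorname{Var}({ A_T}) = p(1-p)\), the following is a minimal sketch under an assumed toy model that is not part of the original text: \(D\) takes finitely many values with known probabilities (`weights`), and each value leads to a sub-population allele frequency \(\operatorname{E}\!\left({ A_S|D}\right)\) (`freqs`). The function name and the numbers are hypothetical. Exact fractions are used so no rounding is involved.

```python
from fractions import Fraction

def fst_from_subpop_freqs(weights, freqs):
    """F_ST = Var(E(A_S|D)) / Var(A_T) for a discrete toy model (an assumption).

    weights[k]: hypothetical probability that D takes its k-th value.
    freqs[k]:   sub-population allele frequency E(A_S | D = k).
    """
    # p = E(E(A_S|D)) = E(A_S) = E(A_T)
    p = sum(w * q for w, q in zip(weights, freqs))
    # Variance of the sub-population frequency around p
    var_between = sum(w * (q - p) ** 2 for w, q in zip(weights, freqs))
    # Var(A_T) = p (1 - p) because A_T only takes the values 0 and 1
    var_total = p * (1 - p)
    return var_between / var_total

# Two equally likely descents giving frequencies 3/10 and 7/10 (hypothetical values)
weights = [Fraction(1, 2), Fraction(1, 2)]
freqs = [Fraction(3, 10), Fraction(7, 10)]
print(fst_from_subpop_freqs(weights, freqs))  # 4/25
```

In this toy model \(p = 1/2\), the variance of the sub-population frequency is \(1/25\), and \(\operatorname{Var}({ A_T}) = 1/4\), so the ratio is \(4/25\).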

\(F_{ST}\) as variance explained or uncertainty reduced

In light of the following theorem, \(F_{ST}\) can be interpreted as the fraction of allele variance explained by random descent/drift/divergence. Alternatively, it can be interpreted as the fraction of allele uncertainty removed by knowing the descent/drift/divergence.

Theorem 1

\[\begin{eqnarray*} \operatorname{Var}({ A_T}) & = & \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) + \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) \end{eqnarray*}\]

Proof

\[\begin{eqnarray*} \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) - \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)}\right)^2 \\ & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) - p^2 \\ \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S^2|D}\right) - \operatorname{E}\!\left({ A_S|D}\right)^2 }\right) \\ & = & p - \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) \\ \end{eqnarray*}\]

Adding the two results, the \(\operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right)\) terms cancel, leaving \(p - p^2 = \operatorname{Var}({ A_T})\).
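Theorem 1 can be checked exactly on a small example. The sketch below assumes a hypothetical toy model, not taken from the original text, in which \(D\) selects one of three sub-populations with the listed probabilities and allele frequencies. It confirms that the between and within components add up to \(\operatorname{Var}({ A_T})\), and that \(F_{ST}\) equals one minus the unexplained share of the variance.

```python
from fractions import Fraction

# Hypothetical toy model: D picks one of three sub-populations.
weights = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]   # P(D = k)
freqs = [Fraction(1, 10), Fraction(2, 5), Fraction(4, 5)]     # E(A_S | D = k)

p = sum(w * q for w, q in zip(weights, freqs))                # p = E(A_S) = E(A_T)
var_total = p * (1 - p)                                       # Var(A_T)
var_between = sum(w * (q - p) ** 2 for w, q in zip(weights, freqs))  # Var(E(A_S|D))
var_within = sum(w * q * (1 - q) for w, q in zip(weights, freqs))    # E(Var(A_S|D))

# Theorem 1: total variance = explained part + unexplained part
assert var_between + var_within == var_total

# F_ST as the explained fraction, or one minus the unexplained fraction
fst = var_between / var_total
assert fst == 1 - var_within / var_total
```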

Unbiased Estimators

Consider observing two independent random descents/drifts/divergences \(D_1\) and \(D_2\), each satisfying the assumptions on \(D\) in the definition of \(F_{ST}\). Furthermore, for each \(j \in \{1, 2 \}\), consider observing \(n_j\) independent random gametes within the resulting sub-population. This defines \(n_1 + n_2\) independent observed alleles \(A_{S,j,i}\), where \(j\) indexes the sub-population resulting from \(D_j\) and \(i\) indexes the sampled gametes within that sub-population. For convenience define the following: \[\begin{align*} \hat{p}_1 & := \frac{1}{n_1} \sum_{i=1}^{n_1} A_{S,1,i} & \hat{p}_2 & := \frac{1}{n_2} \sum_{i=1}^{n_2} A_{S,2,i} \end{align*}\]

The “Hudson” estimator of \(F_{ST}\) is defined in [2] as \[ \frac{ (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} }{ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) } \]
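The formula translates directly into code. The following is a minimal sketch for a single site; the function name `hudson_fst` and the example counts are hypothetical. It requires \(n_1, n_2 \ge 2\), and the denominator is zero when both sample frequencies are 0 or both are 1, so such inputs are not handled.

```python
from fractions import Fraction

def hudson_fst(p1_hat, n1, p2_hat, n2):
    """Single-site "Hudson" estimator from sample frequencies and sample sizes.

    Requires n1 >= 2 and n2 >= 2; raises ZeroDivisionError when the
    denominator vanishes (both frequencies 0, or both 1).
    """
    numerator = ((p1_hat - p2_hat) ** 2
                 - p1_hat * (1 - p1_hat) / (n1 - 1)
                 - p2_hat * (1 - p2_hat) / (n2 - 1))
    denominator = p1_hat * (1 - p2_hat) + p2_hat * (1 - p1_hat)
    return numerator / denominator

# Hypothetical data: 3 of 10 sampled gametes carry the allele in the first
# sub-population, 14 of 20 in the second.
estimate = hudson_fst(Fraction(3, 10), 10, Fraction(14, 20), 20)
print(float(estimate))

# Equal sample frequencies give a negative estimate, because the
# sampling-variance corrections are still subtracted.
print(hudson_fst(Fraction(1, 2), 3, Fraction(1, 2), 3))  # -1/2
```

Passing `Fraction` values keeps the arithmetic exact; ordinary floats work as well.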

We first show that the denominator is an unbiased estimator of \(2 \operatorname{Var}({ A_T})\). With \(\hat{p}_1\) and \(\hat{p}_2\) independent, it follows:

\[\begin{eqnarray*} \operatorname{E}\!\left({ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) }\right) & = & \operatorname{E}\!\left({ \hat{p}_1}\right) \operatorname{E}\!\left({ 1-\hat{p}_2}\right) + \operatorname{E}\!\left({ \hat{p}_2}\right) \operatorname{E}\!\left({ 1-\hat{p}_1}\right) \\ & = & 2 p (1-p) \\ & = & 2 \operatorname{Var}({ A_T}) \end{eqnarray*}\]
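The unbiasedness of the denominator can be confirmed exactly by enumerating every possible pair of sample frequencies. The sketch below assumes a hypothetical toy model, not taken from the original text: each descent yields sub-population frequency \(3/10\) or \(7/10\) with probability \(1/2\), and given the descent the allele count is binomial. All names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Hypothetical toy model for E(A_S|D): two equally likely frequencies.
FREQS = [Fraction(3, 10), Fraction(7, 10)]
WEIGHTS = [Fraction(1, 2), Fraction(1, 2)]

def p_hat_distribution(n):
    """Exact distribution of p_hat = k/n: a mixture of Binomial(n, q) over q."""
    return {
        Fraction(k, n): sum(w * comb(n, k) * q**k * (1 - q)**(n - k)
                            for w, q in zip(WEIGHTS, FREQS))
        for k in range(n + 1)
    }

def expected_denominator(n1, n2):
    """E[p1_hat (1 - p2_hat) + p2_hat (1 - p1_hat)] for independent samples."""
    dist1, dist2 = p_hat_distribution(n1), p_hat_distribution(n2)
    return sum(pr1 * pr2 * (p1 * (1 - p2) + p2 * (1 - p1))
               for (p1, pr1), (p2, pr2) in product(dist1.items(), dist2.items()))

p = sum(w * q for w, q in zip(WEIGHTS, FREQS))   # p = 1/2 in this toy model
# The expectation equals 2 p (1 - p) = 2 Var(A_T) regardless of sample sizes.
assert expected_denominator(3, 5) == 2 * p * (1 - p)
```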

We now show the “Hudson” numerator is an unbiased estimator of \(2 \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)})\). For each \(j \in \{1,2\}\) define: \[\begin{eqnarray*} \hat{v}_j & := & \frac{1}{n_j-1} \sum_{i=1}^{n_j} (A_{S,j,i} - \hat{p}_j)^2 \\ \end{eqnarray*}\] Since \(\hat{v}_j\) is the classic unbiased estimator of variance [4] within the sub-population resulting from \(D_j\), it follows that: \[\begin{eqnarray*} \operatorname{E}\!\left({ \hat{v}_j }\right) & = & \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) \\ \end{eqnarray*}\] Since \(\operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) = \operatorname{Var}({ A_T}) - \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right)\) by Theorem 1, and the “Hudson” denominator is an unbiased estimator of \(2 \operatorname{Var}({ A_T})\), it follows that an unbiased estimator of \(2 \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)})\) is \[ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2 \]

We now show this is equivalent to the “Hudson” numerator. Since, for each \(j \in \{1,2\}\), \[\begin{eqnarray*} \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) & = & \hat{p}_1 + \hat{p}_2 - 2 \hat{p}_1 \hat{p}_2 \\ \hat{v}_j & = & \hat{p}_j - \hat{p}_j^2 + \frac{\hat{p}_j (1 - \hat{p}_j)}{n_j-1} \\ (\hat{p}_1 - \hat{p}_2)^2 & = & \hat{p}_1^2 + \hat{p}_2^2 - 2 \hat{p}_1 \hat{p}_2 \\ \end{eqnarray*}\] it follows that \[\begin{eqnarray*} \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2 & = & (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} \\ \end{eqnarray*}\] which is the numerator in the “Hudson” estimator in [2].
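Both parts of this argument can be checked with exact arithmetic. The sketch below first verifies the algebraic identity on hypothetical raw 0/1 samples, then verifies unbiasedness of the numerator by full enumeration under an assumed toy model (sub-population frequency \(3/10\) or \(7/10\), each with probability \(1/2\)). The samples, the model, and all names are illustrative assumptions, not taken from [2].

```python
from fractions import Fraction
from math import comb

def sample_stats(alleles):
    """Return (v_hat, p_hat) for a list of 0/1 alleles, using the n-1 divisor."""
    n = len(alleles)
    p_hat = Fraction(sum(alleles), n)
    v_hat = sum((a - p_hat) ** 2 for a in alleles) / (n - 1)
    return v_hat, p_hat

def hudson_numerator(p1, n1, p2, n2):
    return (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)

# Part 1: the algebraic identity on hypothetical raw samples.
s1, s2 = [1, 0, 0, 1, 1], [0, 0, 1, 0]
v1, p1 = sample_stats(s1)
v2, p2 = sample_stats(s2)
assert v1 == p1 - p1**2 + p1 * (1 - p1) / (len(s1) - 1)
assert (p1 * (1 - p2) + p2 * (1 - p1) - v1 - v2
        == hudson_numerator(p1, len(s1), p2, len(s2)))

# Part 2: exact expectation of the numerator under the assumed toy model.
freqs = [Fraction(3, 10), Fraction(7, 10)]     # possible values of E(A_S|D)
weights = [Fraction(1, 2), Fraction(1, 2)]     # their probabilities

def p_hat_dist(n):
    """Pairs (p_hat, probability) for p_hat = k/n, mixing Binomial(n, q) over q."""
    return [(Fraction(k, n),
             sum(w * comb(n, k) * q**k * (1 - q)**(n - k)
                 for w, q in zip(weights, freqs)))
            for k in range(n + 1)]

n1, n2 = 3, 5
expected = sum(pr_a * pr_b * hudson_numerator(a, n1, b, n2)
               for a, pr_a in p_hat_dist(n1)
               for b, pr_b in p_hat_dist(n2))
p = sum(w * q for w, q in zip(weights, freqs))
var_between = sum(w * (q - p) ** 2 for w, q in zip(weights, freqs))
assert expected == 2 * var_between             # 2 Var(E(A_S|D))
```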

References

1. Wright S (1949) The genetical structure of populations. Annals of Eugenics 15:323–354. https://doi.org/10.1111/j.1469-1809.1949.tb02451.x

2. Bhatia G, Patterson N, Sankararaman S, Price AL (2013) Estimating and interpreting FST: The impact of rare variants. Genome Research 23:1514–1521. https://doi.org/10.1101/gr.154831.113

3. Patterson NJ Notes on Fst

4. DeGroot MH, Schervish MJ (2002) Probability and statistics, 3rd ed. Addison-Wesley, Boston