# Castedo Ellerman

Proposed answer to the following questions:

The name $$F_{ST}$$ was used as early as [1] for a certain measure between populations. In the many decades since, the name has taken on slightly different meanings. The recent publication [2] surveys a long list of $$F_{ST}$$ meanings used in papers over the decades. Most of those papers cover additional concepts that are not required to define $$F_{ST}$$ or to gain intuition about it, and the mathematical details behind a formal definition are spread across many papers.

This document gives a simple formal definition of $$F_{ST}$$, equivalent to the definitions in [2] and [3]. The definition also makes clear that $$F_{ST}$$ is precisely a ratio of variances.

## The Definition

Random variables $$A_S$$, $$A_T$$ and $$D$$ model uncertainty for $$F_{ST}$$:

• $$A_S$$ and $$A_T$$ for the allele found in a random gamete from the “Sub-population” and “Top” population, respectively
• $$D$$ for the random descent, drift, or divergence of the “sub-population” from the “top” population

“Top” population can mean ancestral population (as in [2]), or it can mean “total” population (the original meaning in [1]).

Given assumptions

• $$A_S$$ and $$A_T$$ take values of $$0$$ or $$1$$
• $$A_T$$ and $$D$$ are independent
• $$\operatorname{E}\!\left({ A_S}\right) = \operatorname{E}\!\left({ A_T}\right)$$

the definition follows

$\begin{eqnarray*} F_{ST} & := & \frac{ \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) }{ \operatorname{Var}({ A_T}) } \end{eqnarray*}$

## Convenient Expectations

Conveniently, expectations of allele variables are allele frequencies. The following variable definition will be convenient $\begin{eqnarray*} p & := & \operatorname{E}\!\left({ A_T}\right) \end{eqnarray*}$

Due to the assumptions of $$F_{ST}$$ the following are conveniently true \begin{align*} p &= \operatorname{E}\!\left({ A_S}\right) & p &= \operatorname{E}\!\left({ A_T^2}\right) \\ \operatorname{E}\!\left({ A_S}\right) &= \operatorname{E}\!\left({ A_S^2}\right) & \operatorname{E}\!\left({ A_S|D}\right) &= \operatorname{E}\!\left({ A_S^2|D}\right) \\ \end{align*} and $\operatorname{Var}({ A_T}) = \operatorname{E}\!\left({ A_T^2}\right) - \operatorname{E}\!\left({ A_T}\right)^2 = p - p^2 = p(1-p)$
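To make the definition concrete, here is a minimal Python sketch that evaluates $$F_{ST}$$ when $$D$$ has finitely many outcomes. The function name `fst_from_model`, its arguments, and the two-outcome example (sub-population allele frequencies $$1/5$$ and $$3/5$$, equally likely) are illustrative assumptions, not taken from [2] or [3]. It uses $$\operatorname{Var}({ A_T}) = p(1-p)$$ for the denominator.

```python
from fractions import Fraction


def fst_from_model(freqs, probs):
    """F_ST for a hypothetical discrete model of D.

    freqs[k] is E(A_S | D = d_k), the sub-population allele frequency
    under the k-th outcome of D; probs[k] is P(D = d_k).
    """
    # p = E(A_S) = E(A_T), by the law of total expectation
    p = sum(w * q for q, w in zip(freqs, probs))
    # Numerator of the definition: Var(E(A_S | D))
    var_between = sum(w * (q - p) ** 2 for q, w in zip(freqs, probs))
    # Denominator of the definition: Var(A_T) = p (1 - p)
    var_total = p * (1 - p)
    return var_between / var_total


# Assumed example: two equally likely outcomes of D
example = fst_from_model(
    [Fraction(1, 5), Fraction(3, 5)],
    [Fraction(1, 2), Fraction(1, 2)],
)
print(example)  # 1/6
```

Exact fractions are used so the result can be compared with a hand calculation: here $$p = 2/5$$, the numerator is $$1/25$$ and the denominator is $$6/25$$.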

## $$F_{ST}$$ as variance explained or uncertainty reduced

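In light of the following theorem, $$F_{ST}$$ can be interpreted as the fraction of allele variance explained by random descent/drift/divergence. Alternatively, it can be interpreted as the fraction of allele uncertainty removed by knowing the descent/drift/divergence: dividing the theorem by $$\operatorname{Var}({ A_T})$$ gives $$F_{ST} = 1 - \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) / \operatorname{Var}({ A_T})$$.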

Theorem 1

$\begin{eqnarray*} \operatorname{Var}({ A_T}) & = & \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) + \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) \end{eqnarray*}$

Proof

$\begin{eqnarray*} \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) - \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)}\right)^2 \\ & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) - p^2 \\ \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) & = & \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S^2|D}\right) - \operatorname{E}\!\left({ A_S|D}\right)^2 }\right) \\ & = & p - \operatorname{E}\!\left({ \operatorname{E}\!\left({ A_S|D}\right)^2}\right) \\ \end{eqnarray*}$

Adding the two results gives $$p - p^2 = p(1-p) = \operatorname{Var}({ A_T})$$, which completes the proof.
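As a small worked check of Theorem 1, assume a hypothetical model (chosen for illustration, not taken from the references) in which $$D$$ has two equally likely outcomes, with $$\operatorname{E}\!\left({ A_S|D}\right)$$ equal to $$1/5$$ under one and $$3/5$$ under the other, so that $$p = 2/5$$. Then

$\begin{eqnarray*} \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) & = & \tfrac{1}{2} \left(\tfrac{1}{5}\right)^2 + \tfrac{1}{2} \left(\tfrac{3}{5}\right)^2 - \left(\tfrac{2}{5}\right)^2 = \tfrac{1}{25} \\ \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right) & = & \tfrac{1}{2} \cdot \tfrac{1}{5} \cdot \tfrac{4}{5} + \tfrac{1}{2} \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} = \tfrac{1}{5} \\ \operatorname{Var}({ A_T}) & = & \tfrac{2}{5} \cdot \tfrac{3}{5} = \tfrac{6}{25} = \tfrac{1}{25} + \tfrac{1}{5} \end{eqnarray*}$

In this assumed example $$F_{ST} = \tfrac{1}{25} / \tfrac{6}{25} = \tfrac{1}{6}$$, so one sixth of the allele variance is explained by $$D$$.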

## Unbiased Estimators

Consider observing two independent random descents/drifts/divergences $$D_1$$ and $$D_2$$, each satisfying the assumptions for $$D$$ in the definition of $$F_{ST}$$. Furthermore, for each $$j \in \{1, 2 \}$$, consider observing $$n_j$$ independent random gametes within the resulting sub-population. Define the $$n_1 + n_2$$ independently sampled observed alleles $$A_{S,j,i}$$, with $$i$$ indexing the gametes sampled within the sub-population resulting from $$D_j$$. For convenience define the following: \begin{align*} \hat{p}_1 & := \frac{1}{n_1} \sum_{i=1}^{n_1} A_{S,1,i} & \hat{p}_2 & := \frac{1}{n_2} \sum_{i=1}^{n_2} A_{S,2,i} \end{align*}

The “Hudson” estimator of $$F_{ST}$$ is defined in [2] as $\frac{ (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} }{ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) }$
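The estimator formula can be transcribed directly into code. The following is a minimal Python sketch; the function name `hudson_fst` and the choice to pass allele counts `k1`, `k2` (the number of sampled gametes carrying allele $$1$$) rather than the frequencies themselves are assumptions made for illustration. It requires $$n_1, n_2 \geq 2$$ because of the $$n_j - 1$$ terms.

```python
from fractions import Fraction


def hudson_fst(k1, n1, k2, n2):
    """Hudson estimator from allele counts (hypothetical helper).

    k_j is the number of sampled gametes with allele 1 out of n_j
    sampled in sub-population j, so p_hat_j = k_j / n_j.
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two gametes")
    p1 = Fraction(k1, n1)
    p2 = Fraction(k2, n2)
    # Numerator: squared frequency difference minus sampling corrections
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    # Denominator: p1 (1 - p2) + p2 (1 - p1)
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("estimator undefined: both samples are fixed for the same allele")
    return numerator / denominator


print(hudson_fst(2, 10, 6, 10))  # 13/63
```

With $$\hat{p}_1 = 1/5$$, $$\hat{p}_2 = 3/5$$ and $$n_1 = n_2 = 10$$, the numerator is $$26/225$$ and the denominator is $$14/25$$, giving $$13/63 \approx 0.206$$.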

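We first show that the denominator is an unbiased estimator of $$2 \operatorname{Var}({ A_T})$$. With $$\hat{p}_1$$ and $$\hat{p}_2$$ independent and $$\operatorname{E}\!\left({ \hat{p}_1}\right) = \operatorname{E}\!\left({ \hat{p}_2}\right) = p$$, it follows: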

$\begin{eqnarray*} \operatorname{E}\!\left({ \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) }\right) & = & \operatorname{E}\!\left({ \hat{p}_1}\right) \operatorname{E}\!\left({ 1-\hat{p}_2}\right) + \operatorname{E}\!\left({ \hat{p}_2}\right) \operatorname{E}\!\left({ 1-\hat{p}_1}\right) \\ & = & 2 p (1-p) \\ & = & 2 \operatorname{Var}({ A_T}) \end{eqnarray*}$

We now show the “Hudson” numerator is an unbiased estimator of $$2 \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)})$$. For each $$j \in \{1,2\}$$ define:

$\begin{eqnarray*} \hat{v}_j & := & \frac{1}{n_j-1} \sum_{i=1}^{n_j} (A_{S,j,i} - \hat{p}_j)^2 \\ \end{eqnarray*}$

It follows as a classic unbiased estimator of variance [4] that:

$\begin{eqnarray*} \operatorname{E}\!\left({ \hat{v}_j }\right) & = & \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D_j})}\right) \\ \end{eqnarray*}$

Since $$\operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) = \operatorname{Var}({ A_T}) - \operatorname{E}\!\left({ \operatorname{Var}({ A_S|D})}\right)$$ by Theorem 1, it follows that an unbiased estimator of $$2 \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)})$$ is

$\hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2$

We now show this is equivalent to the “Hudson” numerator. Note that, because $$A_{S,j,i}^2 = A_{S,j,i}$$,

$\begin{eqnarray*} \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) & = & \hat{p}_1 + \hat{p}_2 - 2 \hat{p}_1 \hat{p}_2 \\ \hat{v}_j & = & \hat{p}_j - \hat{p}_j^2 + \frac{\hat{p}_j (1 - \hat{p}_j)}{n_j-1} \\ (\hat{p}_1 - \hat{p}_2)^2 & = & \hat{p}_1^2 + \hat{p}_2^2 - 2 \hat{p}_1 \hat{p}_2 \\ \end{eqnarray*}$

It follows that

$\begin{eqnarray*} \hat{p}_1 (1-\hat{p}_2) + \hat{p}_2 (1-\hat{p}_1) - \hat{v}_1 - \hat{v}_2 & = & (\hat{p}_1 - \hat{p}_2)^2 - \frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1-1} - \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2-1} \\ \end{eqnarray*}$

which is the numerator in the “Hudson” estimator in [2].
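The two unbiasedness claims can be checked exactly on a small example by enumerating every possible sample. The sketch below assumes a hypothetical model in which each $$D_j$$ independently yields a sub-population allele frequency of $$1/5$$ or $$3/5$$ with equal probability, so that alleles sampled within a sub-population are binomial given its frequency. The function names and sample sizes are illustrative choices. Under this model $$2 \operatorname{Var}({ \operatorname{E}\!\left({ A_S|D}\right)}) = 2/25$$ and $$2 \operatorname{Var}({ A_T}) = 12/25$$.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k, n, q):
    """P(k of n sampled gametes carry allele 1) at sub-population frequency q."""
    return comb(n, k) * q**k * (1 - q) ** (n - k)


def expected_hudson_parts(freqs, probs, n1, n2):
    """Exact E(numerator) and E(denominator) of the Hudson estimator.

    freqs/probs describe the assumed discrete distribution of the
    sub-population frequency E(A_S | D); D_1 and D_2 are independent.
    """
    e_num = Fraction(0)
    e_den = Fraction(0)
    for q1, w1 in zip(freqs, probs):
        for q2, w2 in zip(freqs, probs):
            for k1 in range(n1 + 1):
                for k2 in range(n2 + 1):
                    weight = w1 * w2 * binom_pmf(k1, n1, q1) * binom_pmf(k2, n2, q2)
                    p1 = Fraction(k1, n1)
                    p2 = Fraction(k2, n2)
                    e_num += weight * (
                        (p1 - p2) ** 2
                        - p1 * (1 - p1) / (n1 - 1)
                        - p2 * (1 - p2) / (n2 - 1)
                    )
                    e_den += weight * (p1 * (1 - p2) + p2 * (1 - p1))
    return e_num, e_den


freqs = [Fraction(1, 5), Fraction(3, 5)]
probs = [Fraction(1, 2), Fraction(1, 2)]
e_num, e_den = expected_hudson_parts(freqs, probs, n1=3, n2=4)
print(e_num, e_den)  # 2/25 12/25
```

The check confirms unbiasedness of the numerator and the denominator separately. It says nothing about the expectation of their ratio, which in general differs from the ratio of the expectations.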

# References

1. Wright S (1949) The genetical structure of populations. Annals of Eugenics 15:323–354. https://doi.org/10.1111/j.1469-1809.1949.tb02451.x

2. Bhatia G, Patterson N, Sankararaman S, Price AL (2013) Estimating and interpreting FST: The impact of rare variants. Genome Research 23:1514–1521. https://doi.org/10.1101/gr.154831.113

3. Patterson NJ Notes on Fst

4. DeGroot MH, Schervish MJ (2002) Probability and statistics, 3rd ed. Addison-Wesley, Boston