variant | sequences | date | |
---|---|---|---|
0 | wildtype | 1 | 2022-01-01 |
1 | wildtype | 12 | 2022-01-02 |
2 | wildtype | 17 | 2022-01-03 |
3 | variant | 1 | 2022-01-03 |
4 | wildtype | 11 | 2022-01-04 |
5 | variant | 1 | 2022-01-04 |
6 | wildtype | 15 | 2022-01-05 |
March 11, 2024
Let’s talk about SARS-CoV-2 evolution!
We are interested in the selection in SARS-CoV-2 for two main reasons.
Awareness: Quantifying variant advantages to understand the viral population right now or in short-term
Preparation: Forecasting variant advantage to predict viral diversity in the future
Instead of focusing on sequence-based methods which may be computationally infeasible with large numbers of sequences, we use coarser labels for estimating fitness of groups.
These labels can be at granular as you choose such as:
Remark: For those familiar with the selective coefficient \(s\), the relative fitness \(\lambda\) is given by \((1 + s) = \exp(\lambda)\)
Generally, we want to take sequences and classify them into variant groups such as Nextstrain clade or Pango lineage and then count their occurance per day or week. This will give us data of the form:
variant | sequences | date | |
---|---|---|---|
0 | wildtype | 1 | 2022-01-01 |
1 | wildtype | 12 | 2022-01-02 |
2 | wildtype | 17 | 2022-01-03 |
3 | variant | 1 | 2022-01-03 |
4 | wildtype | 11 | 2022-01-04 |
5 | variant | 1 | 2022-01-04 |
6 | wildtype | 15 | 2022-01-05 |
Data of this form is automatically generated at the level of US state and country in forecasts-ncov.
If we’re interested in estimating the relative fitness of variants from sequence counts, we need to model the change in frequencies over time.
We want to turn these into frequency and estimate the relative fitness we discussed before using the following equation:
\[ p_{v}(t) = \frac{p_{v}(0)\exp(\lambda_v t)}{\sum_{u} p_{u}(0)\exp(\lambda_u t)}. \]
This means that we’re estimating the following parameters:
\(p_{v}(t)\) is the frequency of variant \(v\) at time \(t\).
\(p_{v}(0)\) is the initial frequency of variant \(v\).
\(\lambda_{v}\) is the relative fitness of variant \(v\).
We’ll often work with the growth advantage \(\Delta_v = \exp(\lambda_{v} \tau)\) where \(\tau\) is the generation time.
evofr
We’ve developed a package called evofr that implements these models.
import evofr as ef
# Loading data as VariantFrequencies
variant_counts = pd.read_csv("./data/variant_counts.tsv", sep="\t")
data = ef.VariantFrequencies(
raw_seq = variant_counts,
pivot = "wildtype"
)
# Define model: tau is generation time
model = ef.MultinomialLogisticRegression(tau=1)
# Do inference using NUTS MCMC
inference_method = ef.InferNUTS(
num_samples=500,
num_warmup=100
)
posterior = inference_method.fit(model, data)
evofr
This idea is pretty simple, we’re just saying that we think the relative fitnesses should be similar across geographies: \[ \lambda_{v, g} \sim \text{Normal}(\bar{\lambda}_v, \sigma_{v}) \]
\(\lambda_{v,g}\) is the relative fitness of variant \(v\) in geography \(g\).
\(\bar{\lambda}\) is the mean relative fitness of variant \(v\) across geographies.
\(\sigma_v\) is the standard deviation in the relative fitness of \(v\).
This is called partial pooling and it allows us to share information about the relative fitness across locations.
Low-data locations can receive information about variants that they haven’t seen yet or are below detectable levels.
import evofr as ef
# Loading data as HierFrequencies
# Your data .csv should now have a location column
variant_counts = pd.read_csv("./data/variant_counts.tsv", sep="\t")
data = ef.HierFrequencies(
raw_seq = variant_counts,
pivot = "wildtype",
group="location"
)
model = ef.HierMLR(tau=1, pool_scale=1e-4)
# Do inference using NUTS MCMC
inference_method = ef.InferNUTS(
num_samples=500,
num_warmup=100
)
posterior = inference_method.fit(model, data)
Fitness models provide accurate short-term forecasts of SARS-CoV-2 variant frequency
evofr
.