iBet uBet web content aggregator. Adding the entire web to your favor.

Abstract

Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/).

coronavirus, web tool, phylogeny

Introduction

The early evolutionary history of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes COVID-19, remains unclear despite an unprecedented scope of global genome sequencing of SARS-CoV-2 and a multitude of phylogenetic analyses (Forster et al. 2020; Lemey et al. 2020; Rambaut, Holmes, et al. 2020; Tang et al. 2020; Worobey et al. 2020; Chiara et al. 2021; da Silva Filipe et al. 2021; Komissarov et al. 2021; Lemieux et al. 2021; Pekar et al. 2021). Sophisticated investigations have shown that traditional molecular phylogenetic analyses do not produce reliable evolutionary inferences about the early history of SARS-CoV-2 due to low sequence divergence, a limited number of phylogenetically informative sites, and the ubiquity of sequencing errors (De Maio et al. 2020; Mavian et al. 2020; Turakhia et al. 2020). In particular, the root of the SARS-CoV-2 phylogeny remains elusive (Morel et al. 2021; Pipes et al. 2021) because the closely related nonhuman coronavirus (outgroups) are more than 1,100 base differences from human SARS-CoV-2 genomes, as compared with fewer than 30 differences between human SARS-CoV-2 genomes’ sequenced early on (December 2019 and January 2020) (Andersen et al. 2020; Castells et al. 2020; Gómez-Carballa et al. 2020; Lai et al. 2020; Mavian et al. 2020; Morel et al. 2021; Pipes et al. 2021; Wenzel 2020). Without a reliable root of the SARS-CoV-2 phylogeny, the most recent ancestor sequence cannot be accurately reconstructed, and it is also not possible to assess the genetic diversity of SARS-CoV-2 that existed at the time of its first outbreak. Consequently, we cannot determine if any of the coronaviruses isolated to date carry the genome of the progenitor of all human SARS-CoV-2 infections. Knowing the progenitor genome will also help determine how close the earliest patients sampled in China are to “patient zero,” that is, the first human transmission case.

The orientation and order of early mutations giving rise to common coronavirus variants will also be compromised if the earliest coronavirus isolates are incorrectly used to root the SARS-CoV-2 phylogenies (Dearlove et al. 2020; Fauver et al. 2020; StefanelLi et al. 2020; Tang et al. 2020). Some investigations of COVID-19 patients and their coronaviruses’ genomes already reported the presence of multiple variants (Lu et al. 2020), as genomes of viral samples from December 2019 in China had as many as five differences. These observations require an explicit test to see if one of the early sampled coronavirus genomes was the progenitor of all the strains infecting humans.

Traditionally, ancestral sequences are estimated using a rooted phylogeny (Yang et al. 1995; Nei and Kumar 2002). This ancestral sequence can then be compared with the sequenced genomes to locate the one that is most similar to the inferred progenitor and/or placed closest to the root in the phylogeny. However, as noted above, attempts using ad hoc and traditional methods are fraught with difficulties and have not produced consistent and robust results (Morel et al. 2021; Pipes et al. 2021). Some methods also incorporate sampling times in phylogenetic inference, but they will automatically favor placing the earliest sampled genomes at or near the root of the tree. This fact introduces circularity in testing the hypothesis that the earliest sampled genomes were ancestral because sampling time is used in the inference procedure (Pipes et al. 2021; Pekar et al. 2021).

Results and Discussion

A Mutational Order Approach for SARS-CoV-2

We applied a mutational order approach (MOA) to directly reconstruct the ancestral sequence and the mutational history of CoV-2 genomes (Jahn et al. 2016; Ross and Markowetz 2016; Miura et al. 2018). MOA does not infer phylogeny as an intermediate step. It is often used to build the evolutionary history of tumor cells that evolve clonally and without recombination. This approach is suitable for analyzing SARS-CoV-2 genomes because of their quasi-species evolutionary behavior (clonal) and the lack of evidence of significant recombination within human outbreaks (Richard et al. 2020), both of which preserve the collinearity of variants in genomes. Although reports of recombination among circulating SARS-CoV-2 genomes have begun to appear (Jackson et al. 2020), the fraction of circulating recombinant strains is likely very small and geographically limited and will not affect analyses conducted on sequences sampled in 2020. This feature permits effective use of shared co-occurrence of variants in genomes and the frequencies of individual variants for inferring the mutational history, notwithstanding the presence of sequencing errors and other artifacts (Kim and Simon 2014; Jahn et al. 2016) (see Materials and Methods).

We advanced MOA to make it applicable for analyzing SARS-CoV-2 genomes. Thus was needed because the normal cell sequence in tumors provides the ancestral (noncancerous) genome sequence to orient the mutational changes, but such a direct ancestor is not available for coronaviruses in which the closest outgroup sequences are over 30 times more different than any two SARS-CoV-2 strains. Also, SARS-CoV-2 genome evolution may not satisfy the perfect phylogeny assumption because some genomic positions have likely experienced multiple and recurrent mutations (van Dorp et al. 2020; Martin et al. 2021). So, we used shared co-occurrence of variants among genomes to reduce the impact of the violation of the perfect phylogeny assumption and select mutation orientations and histories in the maximum likelihood approach. We also devised a bootstrap procedure to place confidence limits on the inferred mutation order in which bootstrap replicate data sets are generated by sampling genomes with replacement.

Here we present results from analyses of two snapshots of the fast-growing collection of SARS-CoV-2 genomes to make inferences and assess the robustness of the inferred mutational histories to the growing genome collection that is expanding at an unprecedented rate. We first present results from the 29KG data set and then evaluate the concordance of the mutational history inferred by using an expanded 68KG data set, which establishes that the conclusions are robust to the sampling of genomes. The first snapshot was retrieved fromGISAID (Shu and Mccauley 2017) on July 7, 2020 and consisted of 60,332 genomes. Of these, 29,681 were selected because they were longer than the 28,000 bases threshold we imposed (29KG data set) and did not include an excessive number of unresolved bases in any genomic regions. This second snapshot was acquired on October 12, 2020 from GISAID and contained 133,741 genomes, of which 68,057 genomes met the inclusion criteria (68KG data set).

We then applied mutational fingerprints inferred using the 68KG data set to an expanded data set of 172,480 genomes (sampled on December 30, 2020; 172KG) to track global spatiotemporal dynamics SARS-CoV-2. We have also set up a live dashboard showing regularly updated results because the processes of data analysis, manuscript preparation, and peer review of scientific articles are much slower than the pace of expansion of SARS-CoV-2 genome collection. Also, we provide a simple “in-the-browser” tool to classify any SARS-CoV-2 genome based on key mutations derived by the MOA analysis (http://sars2evo.datamonkey.org/).

Mutational History and Progenitor of SARS-COV-2

We used MOA to reconstruct the history of mutations that gave rise to 49 common single nucleotide variants (SNVs) in the 29KG data set. These variants occur with >1% variant frequency (vf > 1%)—a threshold chosen to avoid including variants that may be due to sequencing errors (fig. 1a). To simplify notation, we used the inferred mutation history to denote key groups of mutations by assigning Greek symbols (μ, ν, α, β, γ, δ, and ε) to them. Individual mutations were assigned numbers and letters based on the reconstructed order and their parent-offspring relationships (table 1). We estimated the timing of mutation for each mutation based on the timestamp of the viral samples’ genome sequences in which it first appeared (table 1, see Materials and Methods). The inferred mutation order was in excellent agreement with the temporal pattern of the first appearance of variants in the 29KG data set. The timestamp of 47 out of 49 mutations was greater than or equal to the timestamp of the corresponding preceding mutation in mutational history. The exceptions were seen for two low-frequency offshoot mutations (β_3b and β_3c; see Materials and Methods). This concordance provides independent validation of the reconstructed mutation graph because neither sampling dates nor locations were used in MOA analysis.

Fig. 1

Counts of SNVs and genomes in the 29KG data set. (a) Cumulative count of SNVs presented in the 29KG genome data set at different frequencies. (b) The number of genomes in the 29KG collection that were isolated weekly during the pandemic. (c) The number of base differences from proCoV2 (see fig. 2) for genomes sampled in December 2019 and January 2020. The 18 genomes sampled in December 2019 in China (red) have three common SNVs different from proCoV2. In contrast, six genomes sampled in January 2020 in China (Asia, red) and the United States (North America, blue) show no base differences. Multiple genomes (2 and 15) were sampled on two different days. (d) Temporal and spatial distribution of strains identical to proCoV2 at the protein sequence level, that is, they have only μ mutations. The color scheme used to mark sampling locations is shown in panel b.

Open in new tab Download slide

Table 1.

Open in new tab

SARS-CoV-2 variants in 29KG dataset.

Mutant (major)	Mutant (minor)	Gene	GenomicPosition	Nucleotide change	Amino acid change	Time(days)	VariantFrequency	Genomesmapped	Firstlocation
μ₁		ORF1ab	2416	U>C		0	98.1%	0	China, Asia
μ₂		ORF1ab	19524	U>C		0	98.6%	0	China, Asia
μ₃		S	23929	U>C		0	98.4%	18	China, Asia
α₁		ORF1ab	18060	U>C		0	95.1%	849	China, Asia
	α_1a	N	28657	C>U		63	1.3%	2	France, Europe
	α_1b	ORF1ab	9477	U>A	F>Y	63	1.2%	3	France, Europe
	α_1c	N	28863	C>U	S>L	63	1.2%	5	France, Europe
	α_1d	ORF3a	25979	G>U	G>V	63	1.2%	344	France, Europe
α₂		ORF1ab	8782	U>C		0	91.0%	47	China, Asia
α₃		ORF8	28144	C>U	S>L	0	90.8%	1115	China, Asia
	α_3a	ORF1ab	1606	U>C		43	1.7%	501	United Kingdom, Europe
	α_3b	ORF1ab	11083	G>U	L>F	24	9.2%	376	China, Asia
	α_3c	N	28311	C>U	P>L	64	1.9%	3	South Korea, Asia
	α_3d	ORF1ab	13730	C>U	A>V	71	1.8%	3	Taiwan/Malaysia, Asia
	α_3e	ORF1ab	6312	C>A	T>K	71	1.7%	483	Taiwan/Malaysia, Asia
	α_3f	ORF3a	26144	G>U	G>V	28	5.1%	121	China, Asia
	α_3g	ORF1ab	14805	C>U		54	6.0%	334	United Kingdom, Europe
	α_3h	ORF1ab	17247	U>C		64	2.0%	580	Switzerland, Europe
	α_3i	ORF1ab	2558	C>U	P>S	54	1.7%	26	United Kingdom, Europe
	α_3j	ORF1ab	2480	A>G	I>V	54	1.6%	462	United Kingdom, Europe
β₁		ORF1ab	3037	C>U		31	77.0%	11	China, Asia
β₂		S	23403	A>G	D>G	31	77.1%	36	China, Asia
β₃		ORF1ab	14408	C>U	P>L	41	76.9%	3032	Saudi Arabia, Middle East
	β_3a	ORF1ab	20268	A>G		64	5.7%	1213	Italy, Europe
	β_3b	N	28854	C>U	S>L	29	3.1%	527	China, Asia
	β_3c	ORF1ab	15324	C>U		29	2.3%	678	China, Asia
	β_3d	ORF3a	25429	G>U	V>L	77	1.7%	485	United Kingdom, Europe
	β_3e	N	28836	C>U	S>L	74	1.6%	3	Switzerland, Europe
	β_3f	ORF1ab	13862	C>U	T>I	74	1.6%	50	Switzerland, Europe
	β_3g	ORF1ab	10798	C>A	D>E	86	1.4%	414	United Kingdom, Europe
γ₁		ORF3a	25563	G>U	Q>H	41	29.8%	884	Saudi Arabia, Middle East
	γ_1a	ORF1ab	18877	C>U		41	4.0%	757	Saudi Arabia, Middle East
	γ_1b	M	26735	C>U		41	1.5%	439	Saudi Arabia, Middle East
δ₁		ORF1ab	1059	C>U	T>I	54	23.0%	5157	Singapore, Asia
	δ_1a	S	24368	G>U	D>Y	75	1.3%	389	Sweden, Europe
	δ_1b	ORF8	27964	C>U	S>L	76	2.7%	790	USA, North America
	δ_1c	ORF1ab	11916	C>U	S>L	72	1.6%	166	USA, North America
	δ_1d	ORF1ab	18998	C>U	A>V	72	1.0%	305	USA, North America
ε₁		N	28881	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₂		N	28882	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₃		N	28883	G>C	G>R	54	25.7%	5365	United Kingdom, Europe
	ε_3a	ORF1ab	313	C>U		66	2.1%	608	USA, North America
	ε_3b	ORF1ab	19839	U>C		64	1.5%	452	Switzerland, Europe
	ε_3c	M	27046	C>U	T>M	69	1.6%	453	Worldwide
	ε_3d	ORF1ab	10097	G>A	G>S	69	2.5%	5	Denmark, Europe
	ε_3e	S	23731	C>U		69	2.5%	403	Denmark, Europe
	ε_3f	N	28580	G>U	D>Y	69	1.2%	353	Chile, South America
ν₁		ORF1ab	17858	A>G	Y>C	59	4.7%	32	USA, North America
ν₂		ORF1ab	17747	C>U	P>L	59	4.7%	1374	USA, North America

Mutant (major)	Mutant (minor)	Gene	GenomicPosition	Nucleotide change	Amino acid change	Time(days)	VariantFrequency	Genomesmapped	Firstlocation
μ₁		ORF1ab	2416	U>C		0	98.1%	0	China, Asia
μ₂		ORF1ab	19524	U>C		0	98.6%	0	China, Asia
μ₃		S	23929	U>C		0	98.4%	18	China, Asia
α₁		ORF1ab	18060	U>C		0	95.1%	849	China, Asia
	α_1a	N	28657	C>U		63	1.3%	2	France, Europe
	α_1b	ORF1ab	9477	U>A	F>Y	63	1.2%	3	France, Europe
	α_1c	N	28863	C>U	S>L	63	1.2%	5	France, Europe
	α_1d	ORF3a	25979	G>U	G>V	63	1.2%	344	France, Europe
α₂		ORF1ab	8782	U>C		0	91.0%	47	China, Asia
α₃		ORF8	28144	C>U	S>L	0	90.8%	1115	China, Asia
	α_3a	ORF1ab	1606	U>C		43	1.7%	501	United Kingdom, Europe
	α_3b	ORF1ab	11083	G>U	L>F	24	9.2%	376	China, Asia
	α_3c	N	28311	C>U	P>L	64	1.9%	3	South Korea, Asia
	α_3d	ORF1ab	13730	C>U	A>V	71	1.8%	3	Taiwan/Malaysia, Asia
	α_3e	ORF1ab	6312	C>A	T>K	71	1.7%	483	Taiwan/Malaysia, Asia
	α_3f	ORF3a	26144	G>U	G>V	28	5.1%	121	China, Asia
	α_3g	ORF1ab	14805	C>U		54	6.0%	334	United Kingdom, Europe
	α_3h	ORF1ab	17247	U>C		64	2.0%	580	Switzerland, Europe
	α_3i	ORF1ab	2558	C>U	P>S	54	1.7%	26	United Kingdom, Europe
	α_3j	ORF1ab	2480	A>G	I>V	54	1.6%	462	United Kingdom, Europe
β₁		ORF1ab	3037	C>U		31	77.0%	11	China, Asia
β₂		S	23403	A>G	D>G	31	77.1%	36	China, Asia
β₃		ORF1ab	14408	C>U	P>L	41	76.9%	3032	Saudi Arabia, Middle East
	β_3a	ORF1ab	20268	A>G		64	5.7%	1213	Italy, Europe
	β_3b	N	28854	C>U	S>L	29	3.1%	527	China, Asia
	β_3c	ORF1ab	15324	C>U		29	2.3%	678	China, Asia
	β_3d	ORF3a	25429	G>U	V>L	77	1.7%	485	United Kingdom, Europe
	β_3e	N	28836	C>U	S>L	74	1.6%	3	Switzerland, Europe
	β_3f	ORF1ab	13862	C>U	T>I	74	1.6%	50	Switzerland, Europe
	β_3g	ORF1ab	10798	C>A	D>E	86	1.4%	414	United Kingdom, Europe
γ₁		ORF3a	25563	G>U	Q>H	41	29.8%	884	Saudi Arabia, Middle East
	γ_1a	ORF1ab	18877	C>U		41	4.0%	757	Saudi Arabia, Middle East
	γ_1b	M	26735	C>U		41	1.5%	439	Saudi Arabia, Middle East
δ₁		ORF1ab	1059	C>U	T>I	54	23.0%	5157	Singapore, Asia
	δ_1a	S	24368	G>U	D>Y	75	1.3%	389	Sweden, Europe
	δ_1b	ORF8	27964	C>U	S>L	76	2.7%	790	USA, North America
	δ_1c	ORF1ab	11916	C>U	S>L	72	1.6%	166	USA, North America
	δ_1d	ORF1ab	18998	C>U	A>V	72	1.0%	305	USA, North America
ε₁		N	28881	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₂		N	28882	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₃		N	28883	G>C	G>R	54	25.7%	5365	United Kingdom, Europe
	ε_3a	ORF1ab	313	C>U		66	2.1%	608	USA, North America
	ε_3b	ORF1ab	19839	U>C		64	1.5%	452	Switzerland, Europe
	ε_3c	M	27046	C>U	T>M	69	1.6%	453	Worldwide
	ε_3d	ORF1ab	10097	G>A	G>S	69	2.5%	5	Denmark, Europe
	ε_3e	S	23731	C>U		69	2.5%	403	Denmark, Europe
	ε_3f	N	28580	G>U	D>Y	69	1.2%	353	Chile, South America
ν₁		ORF1ab	17858	A>G	Y>C	59	4.7%	32	USA, North America
ν₂		ORF1ab	17747	C>U	P>L	59	4.7%	1374	USA, North America

Amino acid change is shown only for non-synonymous change.

Table 1.

Open in new tab

SARS-CoV-2 variants in 29KG dataset.

Mutant (major)	Mutant (minor)	Gene	GenomicPosition	Nucleotide change	Amino acid change	Time(days)	VariantFrequency	Genomesmapped	Firstlocation
μ₁		ORF1ab	2416	U>C		0	98.1%	0	China, Asia
μ₂		ORF1ab	19524	U>C		0	98.6%	0	China, Asia
μ₃		S	23929	U>C		0	98.4%	18	China, Asia
α₁		ORF1ab	18060	U>C		0	95.1%	849	China, Asia
	α_1a	N	28657	C>U		63	1.3%	2	France, Europe
	α_1b	ORF1ab	9477	U>A	F>Y	63	1.2%	3	France, Europe
	α_1c	N	28863	C>U	S>L	63	1.2%	5	France, Europe
	α_1d	ORF3a	25979	G>U	G>V	63	1.2%	344	France, Europe
α₂		ORF1ab	8782	U>C		0	91.0%	47	China, Asia
α₃		ORF8	28144	C>U	S>L	0	90.8%	1115	China, Asia
	α_3a	ORF1ab	1606	U>C		43	1.7%	501	United Kingdom, Europe
	α_3b	ORF1ab	11083	G>U	L>F	24	9.2%	376	China, Asia
	α_3c	N	28311	C>U	P>L	64	1.9%	3	South Korea, Asia
	α_3d	ORF1ab	13730	C>U	A>V	71	1.8%	3	Taiwan/Malaysia, Asia
	α_3e	ORF1ab	6312	C>A	T>K	71	1.7%	483	Taiwan/Malaysia, Asia
	α_3f	ORF3a	26144	G>U	G>V	28	5.1%	121	China, Asia
	α_3g	ORF1ab	14805	C>U		54	6.0%	334	United Kingdom, Europe
	α_3h	ORF1ab	17247	U>C		64	2.0%	580	Switzerland, Europe
	α_3i	ORF1ab	2558	C>U	P>S	54	1.7%	26	United Kingdom, Europe
	α_3j	ORF1ab	2480	A>G	I>V	54	1.6%	462	United Kingdom, Europe
β₁		ORF1ab	3037	C>U		31	77.0%	11	China, Asia
β₂		S	23403	A>G	D>G	31	77.1%	36	China, Asia
β₃		ORF1ab	14408	C>U	P>L	41	76.9%	3032	Saudi Arabia, Middle East
	β_3a	ORF1ab	20268	A>G		64	5.7%	1213	Italy, Europe
	β_3b	N	28854	C>U	S>L	29	3.1%	527	China, Asia
	β_3c	ORF1ab	15324	C>U		29	2.3%	678	China, Asia
	β_3d	ORF3a	25429	G>U	V>L	77	1.7%	485	United Kingdom, Europe
	β_3e	N	28836	C>U	S>L	74	1.6%	3	Switzerland, Europe
	β_3f	ORF1ab	13862	C>U	T>I	74	1.6%	50	Switzerland, Europe
	β_3g	ORF1ab	10798	C>A	D>E	86	1.4%	414	United Kingdom, Europe
γ₁		ORF3a	25563	G>U	Q>H	41	29.8%	884	Saudi Arabia, Middle East
	γ_1a	ORF1ab	18877	C>U		41	4.0%	757	Saudi Arabia, Middle East
	γ_1b	M	26735	C>U		41	1.5%	439	Saudi Arabia, Middle East
δ₁		ORF1ab	1059	C>U	T>I	54	23.0%	5157	Singapore, Asia
	δ_1a	S	24368	G>U	D>Y	75	1.3%	389	Sweden, Europe
	δ_1b	ORF8	27964	C>U	S>L	76	2.7%	790	USA, North America
	δ_1c	ORF1ab	11916	C>U	S>L	72	1.6%	166	USA, North America
	δ_1d	ORF1ab	18998	C>U	A>V	72	1.0%	305	USA, North America
ε₁		N	28881	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₂		N	28882	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₃		N	28883	G>C	G>R	54	25.7%	5365	United Kingdom, Europe
	ε_3a	ORF1ab	313	C>U		66	2.1%	608	USA, North America
	ε_3b	ORF1ab	19839	U>C		64	1.5%	452	Switzerland, Europe
	ε_3c	M	27046	C>U	T>M	69	1.6%	453	Worldwide
	ε_3d	ORF1ab	10097	G>A	G>S	69	2.5%	5	Denmark, Europe
	ε_3e	S	23731	C>U		69	2.5%	403	Denmark, Europe
	ε_3f	N	28580	G>U	D>Y	69	1.2%	353	Chile, South America
ν₁		ORF1ab	17858	A>G	Y>C	59	4.7%	32	USA, North America
ν₂		ORF1ab	17747	C>U	P>L	59	4.7%	1374	USA, North America

Mutant (major)	Mutant (minor)	Gene	GenomicPosition	Nucleotide change	Amino acid change	Time(days)	VariantFrequency	Genomesmapped	Firstlocation
μ₁		ORF1ab	2416	U>C		0	98.1%	0	China, Asia
μ₂		ORF1ab	19524	U>C		0	98.6%	0	China, Asia
μ₃		S	23929	U>C		0	98.4%	18	China, Asia
α₁		ORF1ab	18060	U>C		0	95.1%	849	China, Asia
	α_1a	N	28657	C>U		63	1.3%	2	France, Europe
	α_1b	ORF1ab	9477	U>A	F>Y	63	1.2%	3	France, Europe
	α_1c	N	28863	C>U	S>L	63	1.2%	5	France, Europe
	α_1d	ORF3a	25979	G>U	G>V	63	1.2%	344	France, Europe
α₂		ORF1ab	8782	U>C		0	91.0%	47	China, Asia
α₃		ORF8	28144	C>U	S>L	0	90.8%	1115	China, Asia
	α_3a	ORF1ab	1606	U>C		43	1.7%	501	United Kingdom, Europe
	α_3b	ORF1ab	11083	G>U	L>F	24	9.2%	376	China, Asia
	α_3c	N	28311	C>U	P>L	64	1.9%	3	South Korea, Asia
	α_3d	ORF1ab	13730	C>U	A>V	71	1.8%	3	Taiwan/Malaysia, Asia
	α_3e	ORF1ab	6312	C>A	T>K	71	1.7%	483	Taiwan/Malaysia, Asia
	α_3f	ORF3a	26144	G>U	G>V	28	5.1%	121	China, Asia
	α_3g	ORF1ab	14805	C>U		54	6.0%	334	United Kingdom, Europe
	α_3h	ORF1ab	17247	U>C		64	2.0%	580	Switzerland, Europe
	α_3i	ORF1ab	2558	C>U	P>S	54	1.7%	26	United Kingdom, Europe
	α_3j	ORF1ab	2480	A>G	I>V	54	1.6%	462	United Kingdom, Europe
β₁		ORF1ab	3037	C>U		31	77.0%	11	China, Asia
β₂		S	23403	A>G	D>G	31	77.1%	36	China, Asia
β₃		ORF1ab	14408	C>U	P>L	41	76.9%	3032	Saudi Arabia, Middle East
	β_3a	ORF1ab	20268	A>G		64	5.7%	1213	Italy, Europe
	β_3b	N	28854	C>U	S>L	29	3.1%	527	China, Asia
	β_3c	ORF1ab	15324	C>U		29	2.3%	678	China, Asia
	β_3d	ORF3a	25429	G>U	V>L	77	1.7%	485	United Kingdom, Europe
	β_3e	N	28836	C>U	S>L	74	1.6%	3	Switzerland, Europe
	β_3f	ORF1ab	13862	C>U	T>I	74	1.6%	50	Switzerland, Europe
	β_3g	ORF1ab	10798	C>A	D>E	86	1.4%	414	United Kingdom, Europe
γ₁		ORF3a	25563	G>U	Q>H	41	29.8%	884	Saudi Arabia, Middle East
	γ_1a	ORF1ab	18877	C>U		41	4.0%	757	Saudi Arabia, Middle East
	γ_1b	M	26735	C>U		41	1.5%	439	Saudi Arabia, Middle East
δ₁		ORF1ab	1059	C>U	T>I	54	23.0%	5157	Singapore, Asia
	δ_1a	S	24368	G>U	D>Y	75	1.3%	389	Sweden, Europe
	δ_1b	ORF8	27964	C>U	S>L	76	2.7%	790	USA, North America
	δ_1c	ORF1ab	11916	C>U	S>L	72	1.6%	166	USA, North America
	δ_1d	ORF1ab	18998	C>U	A>V	72	1.0%	305	USA, North America
ε₁		N	28881	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₂		N	28882	G>A	R>K	54	25.7%	2	United Kingdom, Europe
ε₃		N	28883	G>C	G>R	54	25.7%	5365	United Kingdom, Europe
	ε_3a	ORF1ab	313	C>U		66	2.1%	608	USA, North America
	ε_3b	ORF1ab	19839	U>C		64	1.5%	452	Switzerland, Europe
	ε_3c	M	27046	C>U	T>M	69	1.6%	453	Worldwide
	ε_3d	ORF1ab	10097	G>A	G>S	69	2.5%	5	Denmark, Europe
	ε_3e	S	23731	C>U		69	2.5%	403	Denmark, Europe
	ε_3f	N	28580	G>U	D>Y	69	1.2%	353	Chile, South America
ν₁		ORF1ab	17858	A>G	Y>C	59	4.7%	32	USA, North America
ν₂		ORF1ab	17747	C>U	P>L	59	4.7%	1374	USA, North America

Amino acid change is shown only for non-synonymous change.

New variants occurred in the genomic background of the variants preceding them in the reconstructed mutation history with a very high propensity (co-occurrence index, COI > 84%; fig. 2), except for one low-frequency offshoot mentioned above (β_3b; COI = 54%). Overall, these results suggest a strong signal to infer a sequential mutational history, even though a small minority of sequences at a position may have experienced homoplasy or recurrent mutations. Indeed, a bootstrap analysis involving genome resampling to assess the robustness of the mutation history produced high bootstrap confidence levels (BCLs) for key groups of mutations as well as many offshoots (fig. 2; BCL > 95%).

Fig. 2.

Mutational history graph of SARS-CoV-2 from the 29KG data set. Thick arrows mark the pathway of widespread variants (frequency, vf ≥ 3%), and thin arrows show paths leading to other common mutations (3% > vf > 1%). The pie-chart sizes are proportional to variant frequencies in the 29KG data set, with pie-charts shown for variants with vf > 3% and pie color based on the world’s region where that mutation was first observed. A circle is used for all other variants, with the filled color corresponding to the earliest sampling region. The COI (black font) and the BCL (blue font) of each mutation and its predecessor mutation are shown next to the arrow connecting them. Underlined BCL values mark variant pairs for which BCLs were estimated for groups of variants (see Materials and Methods) because of the episodic nature of variant accumulation within groups resulting in lower BCLs (<80%, dashed arrows). Base changes (n.) are shown for synonymous mutations, and amino acid changes (p.) are shown for nonsynonymous mutations along with the gene/protein names (“ORF” is omitted from gene name abbreviations given in table 1). More details on each mutation are presented in table 1.

Open in new tab Download slide

Episodic Evolution and Selection

The order of some mutations in the mutational history is not established with high BCLs, for example, the relative order of ε₁, ε₂, and ε₃ mutations. This is because the three ε variants almost always occur together (7,624 genomes), and the intermediate combinations of ε variants were found in only 42 genomes. Similarly, the count of genomes harboring all three β variants (22,739 genomes) far exceeded those with two or fewer β variants (201 genomes). There is a strong temporal tendency of variants to be sampled together (e.g., ε₁–ε₃ and α_1a–α_1d), suggesting an episodic spread of variants (Wald–Wolfowitz run tests P ≪ 0.01; see Materials and Methods) that does not allow for determining the precise order of some mutations’ appearance. Episodic variant spread may be caused by founder effects, positive selection, or both (e.g., MacLean et al. 2021). It may also be an artifact of highly uneven regional and temporal genome sequencing that will produce a biased representative sample of the actual worldwide population (fig. 1b).

In this mutation history, the ratio of nonsynonymous to synonymous changes (N/S) is 1.9, which is almost ten times larger than their ratio of 0.18 for the inferred proCoV2 and observed Bat CoV proteins. The McDonald–Kreitman test (McDonald and Kreitman 1991) rejected the similarity of molecular evolutionary patterns observed within the SARS-CoV-2 population (29KG data set) and between human proCoV2 and the bat coronavirus. However, the selective interpretation of such a difference is complicated by the fact that polymorphisms in SARS-CoV-2 genomes are affected by molecular mechanisms (e.g., RNA editing) (Giorgio et al. 2020; Rice et al. 2021), not just selection, and slightly deleterious alleles can become common when there is a population expansion (Casals and Bertranpetit 2012). Furthermore, selection may have played a significant role during the divergence of human CoV-2 and bat CoV sequences (MacLean et al. 2021; Martin et al. 2021; Tegally et al. 2021). Nevertheless, N/S patterns derived from common variants show that molecular evolutionary patterns observed within SARS-CoV-2 genomes infecting humans differ from those spanning the divergence between the bat RaTG13 and SARS-CoV-2 genomes, even though positive selection in the early SARS-CoV-2 pandemic history may have been limited (Chiara et al. 2021; MacLean et al. 2021).

The Progenitor Genome and the Index Case

The root of the mutation tree is the most recent common ancestor (MRCA) of all the genomes analyzed, which gave rise to two early coronavirus lineages (ν and α; fig. 2). The MRCA genome was the progenitor of all SARS-CoV-2 infections globally, henceforth proCoV2, and was likely carried by the “first detectable” case of human transmission in the COVID-19 pandemic (index case). A comparison of proCoV2 with Wuhan-1 genomes revealed three differences in 49 positions analyzed, which was also true for other reference genomes (fig. 1c). This suggests that the Wuhan-1 (EPI_ISL 402123) and the other earliest sampled genomes are derived coronavirus strains that arose from proCoV2 after the divergence of ν and α lineages (fig. 2). According to the mutational history, the Wuhan-1 strain evolved by three successive α mutations (two synonymous and one nonsynonymous) in proCoV2 (α₁, α₂, and α₃). This progression is statistically supported (BCL = 100%), which is made possible by the presence of 896 intermediate genomes containing one or two α variants in the 29KG data set. Importantly, three closely related nonhuman coronavirus genomes (bats and pangolin) all have the same base at these positions as does the proCoV2 genome, suggesting that the ancestral genome did not contain α variants. Furthermore, genomes with ν variants of proCoV2 do not contain the other 47 variants, all of which occurred on the genomic background containing α₁–α₃. These facts support the inference that coronaviruses lacking α variants were the ancestors of Wuhan-1 and other genomes sampled in December 2019 in China (fig. 1c). Therefore, we conclude that Wuhan-1 was not the direct ancestor of all the early coronavirus infections globally.

A comparison of the proCoV2 genetic fingerprint (49 positions) in the 29KG collection revealed three matches in China (Fujian, Guangdong, and Hangzhou) and three in the United States (Washington) in January 2020 (fig. 1c). One more match was found in New York in March 2020. The ν mutant of proCoV2 was first sampled 59 days after the Wuhan-1 strain. This means that the progenitor coronavirus spread and mutated in the human population for months after the first reported COVID-19 cases. Furthermore, comparisons of the protein sequences encoded by the proCoV2 genome revealed 131 other genomic matches, which contained only synonymous differences from proCoV2. A majority (89 genomes) of these matches were from coronaviruses sampled in China and other Asian countries (fig. 1d). The first sequence was sampled 12 days after Wuhan-1. Multiple matches were found in all sampled continents and detected as late as April 2020 in Europe. These spatiotemporal patterns suggest that proCoV2 already possessed the repertoire of protein sequences needed to infect, spread, and persist in the global human population (see also MacLean et al. 2021).

Coronavirus Diversity before the First Coronavirus Outbreak

The progenitor of all genomes sequenced from human coronavirus infections (MRCA, proCoV2) is three bases different from the Wuhan-1 genome, which extends the ancestry before late November/early December 2020 date that has been suggested by Pekar et al. (2021). Their inference was based on analyzing SARS-CoV-2 genomes from the first 4 months of coronavirus infections in China with a strict molecular clock in which they placed coronavirus genomes from December 2019 at the root of their phylogeny (Pekar et al. 2021). Their most likely root position is the same as the Wuhan-1 position in our mutation history (figs. 1 and 3), which is not surprising because their data set did not include more than 1,000 genomes that comprise the early diverging ν lineage (fig. 3). The genomes containing the ν lineage, sampled in North America, descended from an earlier ancestor that also gave rise to the genome (α lineage) at the root of Pekar et al. (2021) phylogeny. Therefore, our analysis has revealed an earlier MRCA than that detectable by considering a smaller subset of sequences from China.

Fig. 3.

A waterfall display of genome phylogeny recapitulating the mutation history in figure 2. The numbers of genomes mapped to each node are depicted by open circles (very few genomes), open triangles (few genomes), small gray triangles (many genomes), and large black triangles (very many genomes). The actual number of genomes is given in the parenthesis. The tip label is the name of the mutation on the connecting branch. Green and red branches are synonymous and nonsynonymous mutations, respectively. Thick branches mark mutations that occur with a frequency >3% in the 29KG data set. The yellow background highlights the diversity of coronavirus lineages that evolved from the genomes leading to Wuhan-1 coronavirus.

Open in new tab Download slide

The mutational history from proCoV2 to Wuhan-1 genome points to the presence of measurable coronavirus diversity before the earliest recognized coronavirus outbreak in December 2019 (fig. 1). The presence of such diversity has been acknowledged and analyzed, for example, Pekar et al. (2021), but variants present in this diverse population were not identified. Our analyses clearly show that the ancestors of the Wuhan-1 genome gave rise to a diversity of Wuhan-1’s sibling coronavirus lineages (α_3a–α_3j; figs. 1 and 3). These sibling coronavirus lineages were detected in China in January 2020 (α_3b and α_3f) and February 2020 in Asia (α_3c–α_3e) and Europe (α_3a, α_3g–α_3j) (table 1). Thousands of genomes in the 29KG data set belong to siblings and ancestors of Wuhan-1 (table 1 and fig. 3, yellow box). However, the paucity of genomes sampled in 2019 makes it impossible to establish the date and location of origin precisely, but some must have originated before the first detection of the outbreak. Notably, the evolution of α₃ was preceded by the evolution of α₂ and α₁ lineages, with α₁ spawning multiple offshoots first detected in Europe in February 2020 (α_1a–α_1d). The ν lineage, detected in the United States in February 2020, is an even earlier descendant of proCoV2 and is a sibling of the α lineage (table 1 and fig. 3). These lineages may not have been detected earlier because of the lack of sequencing in 2019, and it is likely that some originated early and spread around the world, whereas others evolved from proCoV2 or its early descendant in different parts of the world. Again, thousands of these coronavirus genomes were found throughout the world (fig. 3, yellow box). None of these genomes contained the widely studied spike protein mutant (D614G), a β mutation that occurred in the genomes carrying all three α variants and was first seen in late January 2020. Therefore, the proCoV2 (MRCA) and a large diversity of its early descendants were all able to spread in the global human population.

Estimated Timing of MRCA and the Index Case

Because proCoV2 is three bases different from the Wuhan-1 genome, we estimate that the divergence of the earliest variants of proCoV2 occurred 5.8–8.1 weeks earlier, based on the range of estimated mutation rates of coronavirus genomes (see Materials and Methods). This timeline puts the presence of proCoV2 in late October 2019, which is consistent with the report of a fragment of spike protein identical to Wuhan-1 in early December in Italy, among other evidence (Giovanetti et al. 2020; Li, Wang, et al. 2020; van Dorp et al. 2020; Amendola et al. 2021). The sequenced segment of the spike protein is short (409 bases). It does not span positions in which 49 major early variants were observed, which means that the Italian spike protein fragment can only confirm the existence of proCoV2 before the first coronavirus detection in China.

Our timings of MRCA (tMRCA) is 1 month older than the date for the MRCA of genomes presented by Pekar et al. (2021) because their analysis is restricted to the ancestry of the coronaviruses sampled from China only, which resulted in the exclusion of the ν lineage from their analysis. The potential for not sampling such lineages is well appreciated in Pekar et al. (2021). This exclusion and the use of sampling times in strict clock phylogenetic analyses would naturally lead to analyses leaning closer to the earliest sampling times of SARS-CoV-2 (December 2019). Anyway, if we assume that proCoV2 was the index case, then the date of zoonosis (tIndex) would be late October to mid-November 2019. This range overlaps with Pekar et al.’s (2021) index date falling between mid-October and mid-November 2019. However, it likely that the actual tIndex is much earlier than tMRCA because proCoV2 likely increased in frequency over time before reaching a human host, and it is possible that one of its immediate ancestor first infected a human. Based on an approximately 1-month difference between MRCA and Index dates in Pekar et al. (2021), it is tempting to speculate that tIndex could have been as early as September 2019 for our SARS-CoV-2 phylogeny. This speculation requires more extensive analysis and experimental confirmation in the future.

Analysis of the 68KG Database Snapshot

Next, we analyzed a later snapshot of SARS-CoV-2 genome collection acquired 3 months after the 29KG data set. This data set expanded the collection of coronavirus genomes from viral isolates collected after July 7, 2020 (16,739 genomes) and added 20,004 genome sequences from viral isolates dated before July 7, 2020. In the expanded MOA analysis, we retained 49 variants found with frequency >1% in the 29KG data set and added variants found with a frequency >1% in the 68KG data set (84 total variants; see supplementary table S1, Supplementary Material online). MOA analysis of the 68KG data set produced the proCoV2 genome identical to that inferred using the 29KG data set (see Materials and Methods). We found one additional genome (EPI_ISL_493171) with a proCoV2 fingerprint sampled in Hubei, China, 4 weeks after the Wuhan-1 strain was reported.

The inferred mutation history from the 68KG data set was well-supported with high COIs and BCLs and concordant with the mutation history produced using the 29KG data set (fig. 4). Therefore, inferences reported above for the 29KG data set were robust to the expanded sampling of genomes. In the expanded mutation history, two new groups of variants were identified (ζ and η). They originated in mid-March 2020 and were found in a relatively high frequency in the 68KG data set (∼4.4% and 8.0%, respectively, supplementary table S1, Supplementary Material online). Variants in ζ and η groups also showed episodic accumulation of mutations, for example, the count of genomes containing three ζ mutations (ζ₁–ζ₃; 2,955 genomes) was much larger than those with a subset of these variants (148 genomes). The episodic nature of mutational spread for 84 variants in the 68KG is statistically significant (P < 10⁻⁸), that is, clusters of mutations together have become common variants (see Materials and Methods).

Fig. 4.

The backbone of SARS-CoV-2 mutational history. The mutational history inferred was from (a) 29KG and (b) 68KG data sets. Major variants and their mutational pathways are shown in black, and minor variants and their mutational pathways are shown in gray. Circle color marks the region where variants were sampled first. The 68KG data set contains 12 additional variants and more than two times the genomes than the 29KG data set.

Open in new tab Download slide

Coronavirus Fingerprints and Spatiotemporal Tracking

The mutation history progression directly transforms into a collection of genetic fingerprints. Each fingerprint represents a genome type containing all the variants on the path from that tip node up to the progenitor proCoV2. These fingerprints can classify genomes and track spatiotemporal patterns of dominant lineages (see Materials and Methods). We use a shorthand to refer to each fingerprint in which only the major variant type is used. For example, α fingerprint refers to genomes that one or more of the α variants and no other major variants, and αβ fingerprint refers to genomes that contain at least one α, at least one β variant, and no other major variants. This nomenclature is intuitive and provides a way to glean evolutionary information from the coronavirus lineage’s name. In the 68KG data set (October 12, 2020 GISAID snapshot), global frequencies of major proCoV2 fingerprints were as follows: αβε (32.1%), αβγδ (17.7%), αβ (16.7%), αβεη (9.9%), αβ (9.8%), αβγ (6.8%), αβζ (4.5%), and ν (2.4%).

Figure 5 shows the evolving spatiotemporal of all major fingerprints in Asia, Europe, and North America inferred for an expanded data set of 172,480 genomes (December 30, 2020 snapshot). Spatiotemporal patterns in cities, countries, and other regions are available online at http://sars2evo.datamonkey.org/. We observe the spread and replacement of prevailing strains in Europe (αβε with αβζ) and Asia (α with αβε), the preponderance of the same strain for most of the pandemic in North America (αβγδ), and the continued presence of multiple high-frequency strains in Asia and North America. Spatiotemporal patterns of strain spread converged for Europe and Asia by July–August 2020 to αβε genetic fingerprints. These patterns diverged from North America, where αβ along with its mutant (αβγδ) were common. After that, Europe saw ζ variants of αβ grow (αβζ), replacing αβε genomes and its new η offshoot (αβεη) (e.g., Hodcroft et al. 2020). The ζ mutations were first detected 3 weeks after the sampling of the first ε variants. Remarkably, αβγδ has remained the dominant lineage in North America since April 2020, in contrast to the turnover seen in Europe and Asia.

Fig. 5.

Spatiotemporal dynamics of 172,480 SARS-CoV-2 genomes (December 2019–2020). Spatiotemporal patterns of genomes mapped to lineages containing different combinations of major variants in (a) Asia, (b) Europe, and (c) North America. The number of genomes mapped to major variant lineages contains all of its offshoots, for example, α lineage contains all the genomes with α₁–α₃, α_1a–α_1d, and α_3a–α_3j variants only. The stacked graph area is the proportion of genomes mapped to the corresponding lineage. The solid black line shows the count of total genome samples. Spatiotemporal patterns in cities, countries, and other regions are available online at http://sars2evo.datamonkey.org/ (last accessed on March 28, 2021).

Open in new tab Download slide

More recently, novel fast-spreading variants have been reported (e.g., Rambaut, Loman, et al. 2020). In particular, an S protein variant (N501Y) from South Africa and the United Kingdom has rapidly increased (Rambaut, Loman, et al. 2020). Coronaviruses with N501Y variant in South Africa carry the αβγδ genetic fingerprint, whereas those in the United Kingdom carry the αβε genetic fingerprint. This means that the N501Y mutation arose independently in two coronavirus lineages that show convergent patterns of increased spread. At present, αβζ dominates the United Kingdom, and the number of genomes publicly available from South Africa is relatively small to make reliable inferences (see http://sars2evo.datamonkey.org for future updates). Overall, our mutational fingerprinting and nomenclature provide a simple way to glean the ancestry of new variants compared with phylogenetic designations, such as B.1.351 and B.1.1.7 (Rambaut, Loman, et al. 2020).

Conclusions

Through innovative analyses of two large collections of SARS-CoV-2 genomes, we have consistently reconstructed the same progenitor coronavirus genome and identified its presence worldwide for many months after the pandemic began. The progenitor genome is a better reference for rooting phylogenies, orienting mutations, and estimating sequence divergences. The reconstructed mutational history of SARS-CoV-2 revealed major mutational fingerprints to identify and track the novel coronavirus’s spatiotemporal evolution, revealing convergences and divergences of dominant strains among geographical regions from an analysis of more than 172 thousand genomes.

Furthermore, the approach taken here to reconstruct the progenitor genome and discover key mutational events will generally be applicable for analyzing other pathogens during the early stages of outbreaks. The approach is scalable for even bigger data sets because it does not require more phylogenetically informative variants with an increasing number of samples. In fact, it benefits from bigger data sets as they afford more accurate estimates of individual and co-occurrence frequencies of variants and enable more reliable detection of lower frequency variants. Its continued application to SARS-CoV-2 genomes and other pathogen outbreaks will produce their ancestral genomes and their spatiotemporal dynamics, improving our understanding of the past, current, and future evolution of pathogens and associated diseases.

Materials and Methods

Genome Data Acquisition and Processing

A flowchart describing the protocol for data assembly and processing is shown in supplementary figure S1, Supplementary Material online. In the first step, we download 60,332 SARS-CoV-2 genomes from the GISAID database (Shu and Mccauley 2017), along with information on sample collection dates and locations (until July 7, 2020). Of all the genomes downloaded, we only retained those with >28,000 bases and were marked as originating from human hosts and passing controls detailed below. Similarly, the second data set, the 68KG data set, was assembled from 133,741 genomes and downloaded on October 12, 2020. Again, we retained only those with >28,000 bases and marked as originating from human hosts.

Each genome was subjected to codon aware alignment with the NCBI reference genome (accession number NC_045512) and then subdivided into ten regions based on the following coding sequence (CDS) features: ORF1a (including nsp10), ORF1b (starting with nsp12), S, ORF3a, E, M, ORF6, ORF7a, ORF8, N, and ORF10. Gene ORF7b was removed because it was too short for alignment and comparisons. For each region, we scanned and discarded sequences containing too many ambiguous nucleotides to remove data with too many sequencing errors. Thresholds were 0.5% for the S gene, 0.1% for ORF1a and ORF1b genes, and 1% for all other genes. We mapped individual sequences to the NCBI reference genome (NC_045512) using a codon-aware extension to the Smith–Waterman algorithm implemented in HyPhy (Pond et al. 2005; Gianella et al. 2011) (https://github.com/veg/hyphy-analyses/tree/master/codon-msa), translated mapped sequence to amino acids, and performed multiple protein sequence alignment with the auto settings function of MAFFT (version 7.453; Katoh and Standley 2013). Codon sequences were next mapped onto the amino acid alignment. The multiple sequence alignment of SARS-CoV-2 genomes was aligned with the sequence of three closest outgroups, including the coronavirus genomes of the Rhinolophus affinis bat (RaTG13), R. malayanus bat (RmYN02), and Manis javanica pangolin (MT121216.1) (Liu et al. 2020; Zhou et al. 2020). The alignment was visually inspected and adjusted in Geneious Prime 2020.2.2 (https://www.geneious.com). The final alignment contained all genomic regions except ORF7b and noncoding regions (5′ and 3′ UTRs, and intergenic spacers). After these filtering and alignment steps, the multiple sequence alignment contained 29,115 sites and 29,681 SARS-CoV-2 genomes for the July 7, 2020 snapshot, which we refer to as the 29KG data set. For the October 12 snapshot, there were 68,057 sequences, which we refer to as the 68KG data set. We also conducted a spatiotemporal analysis on an expanded data set containing 172,480 genomes (172KG) acquired on December 30, 2020.

Reference Genomes and Collection Dates

We used the dates of viral collections provided by the GISAID database (Shu and Mccauley 2017) in all our analyses if they were resolved to the day (i.e., we discarded data that only contained partial dates, for example, April 2020). All genomes were used in the mutation ordering analyses, but genomes with incomplete sampling dates were excluded from the spatiotemporal analyses and derived interpretations. We noted that the earliest sample included in GISAID (ID: EPI_ISL_402123) was collected on December 24, 2019, although the NCBI website lists its collection date as December 23, 2019 (GenBank ID: MT019529). Therefore, we used the GISAID collection date for the sake of consistency. Regarding the NCBI reference genome (GenBank ID: NC_045512; GISAID ID: EPI_ISL_402125) (Wu et al. 2020), this sample was collected on December 26, 2019 (Chiara et al. 2021). We also used the GISAID reference genome in our analysis (ID: EPI_ISL_402124) collected on December 30, 2019 (Okada et al. 2020).

Mutation Order Analyses

First, we analyzed the 29KG data set. We used a maximum likelihood method, SCITE (Jahn et al. 2016), and variant co-occurrence information for reconstructing the order of mutations corresponding to 49 common variants (frequency > 1%) observed in this data set (see flowchart in supplementary fig. S2, Supplementary Material online). MOA has demonstrated high accuracy for analyzing tumor cell genomes that reproduce clonally, have frequent sequencing errors, and exhibit limited sequence divergence (Jahn et al. 2016; Miura et al. 2018). In MOA, higher frequency variants are expected to have arisen earlier than low-frequency variants in clonally reproducing populations (Kim and Simon 2014; Jahn et al. 2016). We used the highest frequency variants to anchor the analysis and the shared co-occurrence of variants among genomes to order mutations while allowing probabilistically for sequencing errors and pooled sequencing of genomes (Jahn et al. 2016). MOA is different from traditional phylogenetic approaches where positions are treated independently, that is, the shared co-occurrence of variants is not directly utilized in the inference procedure. Notably, both traditional phylogenetic and mutation order analyses are expected to produce concordant patterns when sequencing errors and other artifacts are minimized. However, sequencing errors and limited mutational input during the coronavirus history adversely impact traditional methods, as does the fact that the closest coronaviruses useable as outgroups have more than a thousand base differences from SARS-CoV-2 genomes that only differ in a handful of bases from each other (Mavian et al. 2020; Morel et al. 2021; Pipes et al. 2021).

MOA requires a binary matrix of presence/absence (1/0) of mutants, which is straightforward in analyzing cell sequences from tumors because they arise from normal cells that supply definitive ancestral states. To designate mutation orientations for applying MOA in SARS-CoV-2 analysis, we devised a simple approach in which we began by comparing nucleotides at the 49 genomic positions among three closely related genomes (bat RaTG13, bat RmYN02, and pangolin MT121216.1) (Boni et al. 2020). We chose the consensus base to be the initial reference base, such that SARS-CoV-2 genome bases were coded to be “0” whenever they were the same as the consensus base at their respective positions. All other bases were assigned a “1.” There were 39 positions in which all 3 outgroup genomes were identical to each other and 9 in which 2 of the outgroups showed the same base. In the remaining position (28657), all three outgroups differed, so we selected the base found in the gene with the highest sequence similar to the human SARS-CoV-2 NCBI reference genome (NC_045512) because SARS-CoV-2’s ancestor likely experienced genomic recombination before its zoonotic transfer into humans (Huang et al. 2020; Li, Giorg, et al. 2020; MacLean et al. 2021). At one position, both major and minor bases in humans were different from the consensus base in the outgroups, so we assigned the mutant status to the minority base (U; vf = 29.8%). All missing and ambiguous bases were coded to be ignored (missing data) in all the analyses. These initially assigned mutation orientations were tested in a subsequent investigation of variants’ COI. COI for a given variant (y) is the number of genomes that contain y and its directly preceding mutation (x) in the mutation history, divided by the number of genomes that contain y. When COI was lower than 70%, we reversed each position’s mutation orientation individually and selected the mutation orientation that produced mutation histories with the highest COI (see below).

In the SCITE analysis of 49 variants and 29,861 genomes, we started with default parameter settings of false-negative rate (FNR = 0.21545) and false-positive rate (FPR = 0.0000604) of mutation detection. We carried out five independent runs to ensure stability and convergence to obtain 29KG collection-specific estimates of FNR and FPR by comparing the observed and predicted sequences based on this mutation graph. The estimated FNR (0.00488) and FPR (0.00800) were very different from the SCITE default parameters, where the estimated FNR was much lower. This difference in error rates is unsurprising because we used only common variants (vf > 1%), and the 29KG data set was not obtained from single-cell sequencing in which dropout during single-cell tumor sequencing elevates FNR, that is, mutant alleles are not sequenced.

As noted above, the initial mutation orientations were simply the starting designations for our analysis, which are subsequently investigated by evaluating the COI of each variant in the reconstructed mutation history. In this process, we reverse ancestor/mutant coding for variants that received low COI to examine if a mutation history with higher COI can be generated. Two positions (3037 and 28854) received low COI (<70%). At position 3037, the reversed encoding (C → U) received significantly higher COI (100%) than the starting encoding (U → C; 60%), so the position was recoded. At position 28854, the ordering and direction of mutation remained ambiguous despite extensive analyses, but it did not impact the predicted MRCA sequence. Therefore, we only recoded the column for position 3037 and generated a new 49 × 29861 (SNVs × genomes) matrix.

At one position (28657), all three outgroup sequences had different bases, so we initially selected the base found in the gene with the highest sequence similar to the human SARS-CoV-2 NCBI reference genome. We next tested if reversed encoding produced a better mutation graph. The reversed encoding produced a mutation graph with a much higher log-likelihood, that is, −32,355.58 and −30,289.92, for the initial and reversed encoding, respectively; P ≪ 0.01 using the AkaikeInformation Criterion (AIC) protocol in Pupko et al. (2002). Therefore, we recoded position 28657 and generated a new 49 × 29,861 (SNVs × genomes) matrix.

It was subjected to SCITE analysis and produced a mutation graph for 49 variants in the 29KG data set. This graph predicts an FNR of 0.00418 and FPR of 0.00295 per base. Using these new FNR and FPR, we again performed SCITE analysis and produced the final mutation history graph. Starting from the top of a mutation graph, a distinct Greek symbol was assigned to a group of mutations that were occurred sequentially, and variants with similar frequency were assigned the same Greek symbol (μ, ν, α, β, γ, δ, and ε). The high-frequency variants with the same Greek symbol were distinguished by numbers to represent the sequential relationship, for example, α₁ and α₂. When an offshoot of a high-frequency mutation had low variant frequency, we assigned it the same Greek symbol and number to represent the parent-offspring relationship and further distinguished descendants by adding a small letter, for example, α_1a and α_1b.

Timing of the Progenitor Genome

The MRCA corresponds to the progenitor that gave rise to ν and α lineages in the mutation graph. MRCA is the progenitor of all human SARS-CoV-2 infections (proCoV2), which descended from the parental lineage after its divergence from its closest relatives, including bats and pangolins. We estimate that proCoV2 existed 5.8–8.1 weeks before December 24, 2019 sampling date of Wuhan-1, by using an SARS-CoV-2 HPD (Highest Posterior Density) mutation rate range of 6.64 × 10⁻⁴–9.27 × 10⁻⁴ substitutions per site per year (Pekar et al. 2021).

We have made available the proCoV2 genome sequence in FastA format at http://igem.temple.edu/COVID-19, which is the same as the NCBI reference genome with base differences corresponding to α₁–α₃ mutations at positions 18060, 8782, and 28144, as discussed in the main text. In this mutation graph, COI for each variant is shown next to the arrow.

Bootstrap Analysis

We assessed the robustness of the mutation history inference to genome sampling by bootstrap analysis. We generated 100 bootstrap replicate data sets, each built by randomly selecting 29,861 genomes with replacement. Then, SCITE was used to infer the mutation graph for each replicate data set. Bootstrap confidence level, scored for each variant pair, was the number of replicates in which the given pair of variants were directly connected in the mutation history in the same way as shown in figure 2. BCLs were often lower for major variants within groups (e.g., ε₁–ε₃) because they occur with very similar frequencies. This feature adversely affected the BCL values of mutation orders between groups, for example, β and ε. In this case, we considered each group as a single entity and computed BCL to be the proportion of replicates in which pairs of groups were directly connected in the mutation history in the same way as shown in figure 2. Groups used were β₁–β₃, ε₁–ε₃, and α_1a–α_1d. All of these BCL values are shown with an underline.

Temporal Concordance

Because MOA analyses did not use spatial or temporal information for genomes or mutations, the inferred mutation history can be validated by evaluating the concordance of the inferred order of mutations with the timing of their first appearance (tf). Using the genomes for which virus sampling day, month, and year were available, we determined tf for every variant in the 29KG data set. For mutation i, we compared its tf(i) with tf(j) such that j is the nearest preceding mutation in the mutation graph. We found that tf(j) ≥ tf (i) for 47 of 49 mutations, except for β_3b and β_3c pairs. These two offshoot mutants of β₃ were sampled 35 days (β_3b) and 12 days (β_3c) earlier than their predecessors, which could be due to their low frequency or sequencing error. COI of one variant (β_3b) was low (54%), but the other variant (β_3c) had a very high COI (97%).

Mutational Fingerprints

Each node in the mutational history graph predicts an intermediate (ancestral) or a tip sequence containing all the mutations from that node to the mutation graph’s root. The mutational fingerprint is then produced directly from the mutation history graph drawn as a directional graph anchored on the root node. We compared our mutational fingerprints of the genomes in the 29KG data set with a phylogeny-based classification (Rambaut, Loman, et al. 2020) obtained using the Pangolin service (v2.0.3; https://pangolin.cog-uk.io/). We assigned each of the 29 K genomes to a fingerprint based on the highest sequence similarity at positions containing 49 common variants, allowing mismatches due to population-level variations and sequencing errors. A small fraction of genomes (1.8%) could not be assigned unambiguously to one fingerprint, so they were excluded and will be investigated in the future. The number of genomes assigned to each fingerprint is shown in table 1. We submitted genome sequences to the Pangolin website for classification one-by-one, and a clade designation was received. The results are summarized in supplementary figure S3, Supplementary Material online. In this table, all phylogenetic groups with fewer than 20 genomes were excluded.

Of the 80 phylogenetic groups shown, 74 are defined primarily by a single mutation-based fingerprint, as more than 90% of the genomes in those phylogenetic groups share the same fingerprint. This includes all small- and medium-sized phylogenetic groups (up to 488 genomes) and two large groups (A.1 with 1,377 genomes and B.1.2 with 749 genomes). One large group, B.1.1, predominately connects with ε₃ node (79%, 4,832 genomes), but some of its members belong to ε₃ offshoots because they contain respective diagnostic mutations. For group B.1.1.1, two other ε₃ offshoots are mixed up almost equally. Three other large differences between mutational fingerprint-based classification and phylogeny-based grouping are seen for A, B, B1.1, and B.2 groups. These differences are likely because the location of the root and the earliest branching order of coronavirus lineages are problematic in phylogeny-based classifications (Mavian et al. 2020; Morel et al. 2021; Pipes et al. 2021; Wenzel 2020). Overall, our mutational fingerprints are immediately informative about the mutational ancestry of genomes.

Analysis of 68KG Data Set

We repeated the above MOA procedure on the 68KG data set (68,057 genomes). This 68KG data contained 72 common variants (>1% frequency). For direct comparison purposes, we added 12 variants that were common variants on 29KG data, but their frequency had become less than 1% in the 68KG data. Therefore, we used 84 variants in total and constructed a 84 × 68,057 (SNVs × genomes) matrix for the SCITE analysis to determine the mutational order. We also conducted the bootstrap analysis and assigned mutational fingerprints using the procedure mentioned above. The number of genomes mapped to each fingerprint is listed in supplementary table S1, Supplementary Material online.

Sequence Classification for 172KG Data Set

We developed a sequence classification protocol that first calls variants in a viral genome using proCoV2 as the reference sequence using a browser-based sequence alignment (bioseq npm package) based on the codebase of minimap2 (Li 2018). Then, it assigns the sequence to a path in the mutation graph with the highest concordance (Jaccard index). It is implemented in a simple browser-based tool, which shows the example output for ENA accession number MT675945 (supplementary fig. S4, Supplementary Material online; http://sars2evo.datamonkey.org, last accessed on March 18, 2021). The classification is conducted on the client-side such that the researcher’s data never leave their personal computer.

Testing Episodic Spread of Variants

We performed nonparametric Wald–Wolfowitz run tests (Wald and Wolfowitz 1940; Mateus and Caeiro 2015) of the null hypothesis that the first sampling of variants is randomly distributed over time (i.e., evenly spaced). The null hypothesis was rejected for both 29KG and 64KG analyses at P ≪ 0.01, suggesting significant temporal clustering in both 29KG and 64KG data sets. Because many mutations were first sampled on December 24, 2019, we only included one mutation for that day to avoid biasing the test.

We also tested the null hypothesis of the same molecular evolutionary patterns within the SARS-CoV-2 population and between species (i.e., Human SARS-CoV-2 and Bat RaTG13) by using a McDonald-Kreitman test (McDonald and Kreitman 1991). The numbers of nonsynonymous and synonymous polymorphisms with a frequency >1% were 32 and 17, compared with the numbers of nonsynonymous and synonymous fixed differences (170 and 958, respectively) between proCoV2 and bat RaTG13 sequences. The McDonald–Kreitman test rejected the null overwhelmingly (P ≪ 0.01).

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank all the authors and organizations who have kindly deposited and shared genome data on GISAID (see http://igem.temple.edu/COVID-19 for a list of all the authors). We thank Ananias Escalante, Rob Kulathinal, Li Liu, Jose Barba-Montoya, Antonia Chroni, Ravi Patel, and Caryn Babaian for their critical comments. We appreciate the technical support provided by Jared Knoblauch and Glen Stecher. This research was supported by grants from the U.S. National Science Foundation to S.K. (GCR-1934848, DEB-2034228) and S.P. (DBI-2027196) and from the U.S. National Institutes of Health to S.K. (GM-139504-01) and S.P. (AI-134384).

Data Availability

Live evolutionary history and spatiotemporal distributions of common variants can be accessed via http://sars2evo.datamonkey.org/. All genome sequences and metadata are available publicly at GISAID (https://www.gisaid.org/), and the predicted proCoV2 sequence is available at http://igem.temple.edu/COVID-19. The other relevant information is provided in the supplementary materials, Supplementary Material online.

References

Amendola

Bianchi

Gori

Colzani

Canuti

Borghi

Raviglione

Zuccotti

Tanzi

2021

Evidence of SARS-CoV-2 RNA in an Oropharyngeal Swab Specimen, Milan, Italy, early December 2019

Emerg Infect Dis

(

648

–

650

Andersen

Rambaut

Lipkin

Holmes

Garry

RF.

2020

The proximal origin of SARS-CoV-2

Nat Med

(

450

–

452

Boni

Lemey

Jiang

Lam

TTY

Perry

Castoe

Rambaut

Robertson

DL.

2020

Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic

Nat Microbiol

(

1408

–

1417

Casals

Bertranpetit

2012

Human genetic variation, shared and private

Science

337

(

6090

–

Castells

Lopez-Tort

Colina

Cristina

2020

Evidence of increasing diversification of emerging SARS-CoV-2 strains

J Med Virol

(

2165

–

2172

Chiara

Horner

Gissi

Pesole

2021

Comparative genomics reveals early emergence and biased spatio-temporal distribution of SARS-CoV-2

Mol Biol Evol

38(6):2547–2565

Google Scholar

OpenURL Placeholder Text

WorldCat

da Silva Filipe

Shepherd

Williams

Hughes

Aranday-Cortes

Asamaphan

Ashraf

Balcazar

Brunker

Campbell

, et al.

2021

Genomic epidemiology reveals multiple introductions of SARS-CoV-2 from mainland Europe into Scotland

Nat Microbiol

(

112

–

122

Dearlove

Lewitus

Bai

Reeves

Joyce

Scott

Amare

Vasan

Michael

, et al.

2020

A SARS-CoV-2 vaccine candidate would likely match all currently circulating strains

. Proc Natl Acad Sci U S A. 117(38):

23652

–

23662

OpenURL Placeholder Text

WorldCat

Fauver

Petrone

Hodcroft

Shioda

Ehrlich

Watts

Vogels

CBF

Brito

Alpert

Muyombwe

, et al.

2020

Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States

Cell

181

(

990

–

996.e5

Forster

Renfrew

Forster

2020

Phylogenetic network analysis of SARS-CoV-2 genomes

Proc Natl Acad Sci U S A

117

(

9241

–

9243

Gianella

Delport

Pacold

Young

Choi

Little

Richman

Kosakovsky Pond

Smith

DM.

2011

Detection of minority resistance during early HIV-1 infection: natural variation and spurious detection rather than transmission and evolution of multiple viral variants

J Virol

(

8359

–

8367

Giorgio

Martignano

Torcia

Mattiuz

Conticello

SG.

2020

Evidence for host-dependent RNA editing in the transcriptome of SARS-CoV-2

Sci Adv

–

Google Scholar

Crossref

WorldCat

Giovanetti

Benvenuto

Angeletti

Ciccozzi

2020

The first two cases of 2019-nCoV in Italy: where they come from?

J Med Virol

(

518

–

521

Gómez-Carballa

Bello

Pardo-Seco

Martinón-Torres

Salas

2020

Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders

Genome Res

(

1434

–

1448

Hodcroft

Zuber

Nadeau

Comas

González Candelas

Stadler

Neher

RA.

2020

Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020

medRxiv

. doi:10.1101/2020.10.25.20219063.

Google Scholar

OpenURL Placeholder Text

WorldCat

Huang

J-M

Jan

Wei

Wan

Ouyang

2020

Evidence of the recombinant origin and ongoing mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

bioRxiv

. doi:10.1101/2020.03.16.993816.

Google Scholar

OpenURL Placeholder Text

WorldCat

Jackson

Rambaut

Pybus

Robertson

Connor

Loman

NJ.

2020

. Recombinant SARS-CoV-2 genomes involving lineage B.1.1.7 in the UK. Available from: https://virological.org/t/recombinant-sars-cov-2-genomes-involving-lineage-b-1-1-7-in-the-uk/658 (last access March 24, 2021).

Jahn

Kuipers

Beerenwinkel

2016

Tree inference for single-cell data

Genome Biol

Katoh

Standley

DM.

2013

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

Mol Biol Evol

(

772

–

780

Kim

Simon

2014

Using single cell sequencing data to model the evolutionary history of a tumor

BMC Bioinformatics

Komissarov

Safina

Garushyants

Fadeev

Sergeeva

Ivanova

Danilenko

Lioznov

Shneider

Shvyrev

, et al.

2021

Genomic epidemiology of the early stages of the SARS-CoV-2 outbreak in Russia

Nat Commun

(

–

Lai

Bergna

Acciarri

Galli

Zehender

2020

Early phylogenetic estimate of the effective reproduction number of SARS-CoV-2

J Med Virol

(

675

–

679

Lemey

Hong

Hill

Baele

Poletto

Colizza

O’Toole

McCrone

Andersen

Worobey

, et al.

2020

Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2

Nat Commun

(

–

Lemieux

Siddle

Shaw

Loreth

Schaffner

Gladden-Young

Adams

Fink

Tomkins-Tinch

Krasilnikova

, et al.

2021

Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact of superspreading events

Science

371

(

6529

eabe3261

2018

Minimap2: pairwise alignment for nucleotide sequences

Bioinformatics

(

3094

–

3100

Giorg

Marichannegowda

Foley

Xiao

Kong

Chen

Gnanakaran

Korber

Gao

2020

Emergence of SARS-CoV-2 through recombination and strong purifying selection

Sci Adv

–

Google Scholar

OpenURL Placeholder Text

WorldCat

Wang

Zhao

Zai

Zhao

Chaillon

2020

Transmission dynamics and evolutionary history of 2019-nCoV

J Med Virol

(

501

–

511

Liu

Jiang

Wan

Hua

Zhou

Wang

Hou

Chen

Zou

, et al.

2020

Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?

PLoS Pathog

(

e1008421

Zhao

Niu

Yang

Wang

Song

Huang

Zhu

, et al.

2020

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Lancet

395

(

10224

565

–

574

MacLean

Lytras

Weaver

Singer

Boni

Lemey

Kosakovsky Pond

Robertson

DL.

2021

Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen

PLoS Biol

(

e3001115

De Maio

Walke

Borges

Weilguny

Slodkowicz

Goldman

2020

. Issues with SARS-CoV-2 sequencing data. Available from: https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 (last access March 24, 2021).

Martin

Weaver

Tegally

San

Wilkinson

Giandhari

Pillay

Singh

Lessells

Oliveira

, et al.

2021

The emergence and ongoing convergent evolution of the N501Y lineages coincided with a major global shift in the SARS-CoV-2 selective landscape

medRxiv

. doi:10.1101/2021.02.23.21252268.

Google Scholar

OpenURL Placeholder Text

WorldCat

Mateus

Caeiro

2015

An R implementation of several randomness tests

. In:

Simos

Monovasilis

, editors.

AIP Conf Proc

1618

(

531

–

534

. doi: 10.1063/1.4897792.

Google Scholar

OpenURL Placeholder Text

WorldCat

Mavian

Pond

Marini

Magalis

Vandamme

Dellicour

Scarpino

Houldcroft

Villabona-Arenas

Paisie

, et al.

2020

Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable

Proc Natl Acad Sci U S A

117

(

12522

–

12523

McDonald

Kreitman

1991

Adaptive protein evolution at the Adh locus in Drosophila

Nature

351

(

6328

652

–

654

Miura

Huuki

Buturla

Gomez

Kumar

2018

Computational enhancement of single-cell sequences for inferring tumor evolution

Bioinformatics

(

i917

–

i926

Morel

Barbera

Czech

Bettisworth

Huebner

Lutteropp

Serdari

Kostaki

E-G

Mamais

Kozlov

, et al.

2021

Phylogenetic analysis of SARS-CoV-2 data is difficult

Mol Biol Evol

38(5):1777–1791

Google Scholar

OpenURL Placeholder Text

WorldCat

Nei

Kumar

2002

Molecular evolution and phylogenetics

. New York : Oxford University Press.

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Okada

Buathong

Phuygun

Thanadachakul

Parnmen

Wongboot

Waicharoen

Wacharapluesadee

Uttayamakul

Vachiraphan

, et al.

2020

Early transmission patterns of coronavirus disease 2019 (COVID-19) in travellers from Wuhan to Thailand, January 2020

Euro Surveill

2000097

Google Scholar

Crossref

WorldCat

Pekar

Worobey

Moshiri

Scheffler

Wertheim

JO.

2021

Timing the SARS-CoV-2 index case in Hubei province

Science

372

(

6540

412

–

417

Pipes

Wang

Huelsenbeck

Nielsen

2021

Assessing uncertainty in the rooting of the SARS-CoV-2 phylogeny

Mol Biol Evol

38(4):1537–1543

Google Scholar

OpenURL Placeholder Text

WorldCat

Pond

SLK

Frost

SDW

Muse

SV.

2005

HyPhy: hypothesis testing using phylogenies

Bioinformatics

(

676

–

679

Pupko

Huchon

Cao

Okada

Hasegawa

2002

Combining multiple data sets in a likelihood analysis: which models are the best?

Mol Biol Evol

(

2294

–

2307

Rambaut

Holmes

O’Toole

Hill

McCrone

Ruis

du Plessis

Pybus

OG.

2020

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Nat Microbiol

(

1403

–

1407

Rambaut

Loman

Pybus

Barclay

Barrett

Carabelli

Connor

Peacock

Robertson

Volz

2020

. Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Available from:https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563 (last access March 24, 2021).

Rice

Morales

Mordstein

Mühlhausen

Watson

Cano

Young

Kudla

Hurst

LD.

2021

Evidence for strong mutation bias towards, and selection against, U content in SARS-CoV-2: implications for vaccine design

Mol Biol Evol

(

–

Richard

Owen

van Dorp

Balloux

2020

. No detectable signal for ongoing genetic recombination in SARS-CoV-2. bioRxiv. doi:10.1101/2020.12.15.422866.

Ross

Markowetz

2016

OncoNEM: inferring tumor evolution from single-cell sequencing data

Genome Biol

(

–

Shu

Mccauley

2017

GISAID: global initiative on sharing all influenza data-from vision to reality

Euro Surveill

30494

Stefanelli

Faggioni

Lo Presti

Fiore

Marchi

Benedetti

Fabiani

Anselmo

Ciammaruconi

Fortunato

, et al.

2020

Whole genome and phylogenetic analysis of two SARSCoV-2 strains isolated in Italy in January and February 2020: additional clues on multiple introductions and further circulation in Europe

Euro Surveill

–

Google Scholar

Crossref

WorldCat

Tang

Song

Yao

Duan

Zhang

Wang

Qian

, et al.

2020

On the origin and continuing evolution of SARS-CoV-2

Natl Sci Rev

(

1012

–

1023

Google Scholar

Crossref

WorldCat

Tegally

Wilkinson

Giovanetti

Iranzadeh

Fonseca

Giandhari

Doolabh

Pillay

San

Msomi

, et al.

2021

Emergence of a SARS-CoV-2 variant of concern with mutations in spike glycoprotein

Nature

592

(

7854

438

–

443

Turakhia

De Maio

Thornlow

Gozashti

Lanfear

Walker

Hinrichs

Fernandes

Borges

Slodkowicz

, et al.

2020

Stability of SARS-CoV-2 phylogenies

PLoS Genet

(

e1009175

van Dorp

Acman

Richard

Shaw

Ford

Ormond

Owen

Pang

Tan

CCS

Boshier

FAT

, et al.

2020

Emergence of genomic diversity and recurrent mutations in SARS-CoV-2

Infect Genet Evol

104351

Wald

Wolfowitz

1940

On at test whether two samples are from the same population

Ann Math Statist

(

147

–

162

Google Scholar

Crossref

WorldCat

Wenzel

2020

Origins of SARS-CoV-1 and SARS-CoV-2 are often poorly explored in leading publications

Cladistics

(

374

–

379

Google Scholar

Crossref

WorldCat

Worobey

Pekar

Larsen

Nelson

Hill

Joy

Rambaut

Suchard

Wertheim

Lemey

2020

The emergence of SARS-CoV-2 in Europe and the US. Science 370(6516):564–570

OpenURL Placeholder Text

WorldCat

Zhao

Chen

Wang

Song

Tao

Tian

Pei

, et al.

2020

A new coronavirus associated with human respiratory disease in China

Nature

579

(

7798

265

–

269

Yang

Kumar

Nei

1995

A new method of inference of ancestral nucleotide and amino acid sequences

Genetics

141

(

1641

–

1650

Zhou

Yang

Wang

Zhang

Zhu

Huang

, et al.

2020

A pneumonia outbreak associated with a new coronavirus of probable bat origin

Nature

579

(

7798

270

–

273

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Associate Editor:

Download all slides

Month:	Total Views:
May 2021	1,798
June 2021	3,520
July 2021	1,223
August 2021	2,117
September 2021	942
October 2021	576
November 2021	667
December 2021	346
January 2022	363
February 2022	949
March 2022	805
April 2022	434
May 2022	345
June 2022	464
July 2022	293
August 2022	367
September 2022	319
October 2022	222
November 2022	270
December 2022	290
January 2023	163
February 2023	230
March 2023	2,077
April 2023	265
May 2023	707
June 2023	166
July 2023	142
August 2023	124
September 2023	137
October 2023	250
November 2023	168
December 2023	178
January 2024	177
February 2024	195
March 2024	325
April 2024	104
May 2024	121
June 2024	132
July 2024	93
August 2024	117
September 2024	452
October 2024	139
November 2024	136

Article Contents

An Evolutionary Portrait of the Progenitor SARS-CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic

Abstract

Introduction

Results and Discussion

A Mutational Order Approach for SARS-CoV-2

Mutational History and Progenitor of SARS-COV-2

Episodic Evolution and Selection

The Progenitor Genome and the Index Case

Coronavirus Diversity before the First Coronavirus Outbreak

Estimated Timing of MRCA and the Index Case

Analysis of the 68KG Database Snapshot

Coronavirus Fingerprints and Spatiotemporal Tracking

Conclusions

Materials and Methods

Genome Data Acquisition and Processing

Reference Genomes and Collection Dates

Mutation Order Analyses

Timing of the Progenitor Genome

Bootstrap Analysis

Temporal Concordance

Mutational Fingerprints

Analysis of 68KG Data Set

Sequence Classification for 172KG Data Set

Testing Episodic Spread of Variants

Supplementary Material

Acknowledgments

Data Availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

An Evolutionary Portrait of the Progenitor SARS-CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic

Abstract

Introduction

Results and Discussion

A Mutational Order Approach for SARS-CoV-2

Mutational History and Progenitor of SARS-COV-2

Episodic Evolution and Selection

The Progenitor Genome and the Index Case

Coronavirus Diversity before the First Coronavirus Outbreak

Estimated Timing of MRCA and the Index Case

Analysis of the 68KG Database Snapshot

Coronavirus Fingerprints and Spatiotemporal Tracking

Conclusions

Materials and Methods

Genome Data Acquisition and Processing

Reference Genomes and Collection Dates

Mutation Order Analyses

Timing of the Progenitor Genome

Bootstrap Analysis

Temporal Concordance

Mutational Fingerprints

Analysis of 68KG Data Set

Sequence Classification for 172KG Data Set

Testing Episodic Spread of Variants

Supplementary Material

Acknowledgments

Data Availability

References

Supplementary data

Citations

Views

Altmetric

Email alerts

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only