Protein evolution

De novo gene birth

Protein-coding genes are sometimes born de novo from non-coding sequences (either intergenic or in an alternative reading frame). A newborn protein must avoid causing harm, e.g. through aggregation, while providing some benefit to the organism. High levels of intrinsic structural disorder help newborn genes strike this balance between avoiding harm and providing benefit (Wilson et al. 2017, Willis & Masel 2018, James et al. 2021). Within random peptides, an amino acid composition that promotes intrinsic structural disorder is correlated with more benign fitness effects (Kosinski et al. 2022).

Our theories of evolvability, specifically pre-adapting selection (Masel 2006, Rajon & Masel 2011), help explain why the overwhelmingly high likelihood that a random peptide is harmful does not prevent de novo gene birth. If non-coding sequences are translated at low levels, selection has an opportunity to eliminate the most deleterious amino acid sequences (Wilson & Masel 2011). The pre-screened set of sequences provides the raw material from which de novo protein-coding genes can be co-opted. Consistent with this possibility, many “non-coding” sequences are found in association with ribosomes in S. cerevisiae, meaning they are likely translated at low levels. Our use of riboprofiling data as the first demonstration of this pervasive translation (Wilson & Masel 2011) was also the first study to use riboprofiling data for the purpose of gene annotation; in addition to abundant noisy translation, we found one case that appeared to be a previously unannotated de novo evolved protein.

A tractable case study for the de novo birth of coding sequences is the birth of only part of a gene. One way this can occur is when a stop codon is lost, and the 3'UTR, up to a backup stop codon, becomes part of the protein C-terminus (Giacomelli et al. 2007, Andreatta et al. 2015). Low levels of stop codon readthrough prescreen the genetic variation beyond stop codons, raising its intrinsic structural disorder (Kosinski & Masel 2020).
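
As a minimal sketch of the stop codon loss scenario described above, the following Python function walks a 3'UTR codon by codon, in frame, until it reaches a backup stop codon. The function name, the toy sequence, and the restriction to the standard stop codons TAA, TAG and TGA are illustrative assumptions; this is not taken from any published analysis pipeline.

```python
# Standard stop codons in DNA notation (assumes the standard genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_extension(three_prime_utr):
    """Return the in-frame codons that would be added to the C-terminus.

    `three_prime_utr` is the DNA sequence immediately following the original
    (lost) stop codon, read in the same frame as the coding sequence.
    Returns a list of codons up to, but not including, the first in-frame
    backup stop codon, or None if no in-frame backup stop codon is found.
    """
    utr = three_prime_utr.upper()
    codons = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(utr) - 2, 3):
        codon = utr[i:i + 3]
        if codon in STOP_CODONS:
            return codons
        codons.append(codon)
    return None


# Toy example (hypothetical sequence): two codons precede the backup stop TAA.
extension = readthrough_extension("GCTGGATAAGCC")
print(extension)
```

Only in-frame stop codons terminate the extension here, so a stop codon that appears out of frame is read through. A frameshift would require restarting the scan in the shifted frame.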

In yeast, elevated levels of stop codon readthrough can be caused by the [PSI+] prion, an epigenetically inherited aggregate of the Sup35 protein, which is a release factor required for translation to terminate at stop codons. When [PSI+] appears, elevated readthrough occurs at every gene in the genome, and a range of pre-existing cryptic genetic variation is phenotypically revealed. As an epigenetically inherited protein aggregate, [PSI+] can easily be lost after some generations. This returns the lineage to its normal [psi-] state and restores translation fidelity. If a subset of revealed phenotypic variation is adaptive, it may have lost its dependence on [PSI+] by this time. This process of genetic assimilation may, for example, involve one or more point mutations in stop codons. This leaves the yeast with a new adaptive trait and with no permanent load of other, deleterious variation. The yeast prion [PSI+] is a wonderful model system for studying evolutionary capacitance, because the relevant molecular biology is well understood. In Saccharomyces, a high proportion of 3′UTR incorporation events involve the inclusion of in-frame 3′UTR through precise mutation of the stop codon, rather than frameshifts. This is compatible with the genetic assimilation of in-frame readthrough products produced by [PSI+] (Giacomelli et al. 2007).

Long-term evolutionary trends

One way we look at long-term trends is to classify protein-coding sequences by age since birth (phylostratigraphy), and ask whether protein properties depend on age. Young animal proteins are enriched for amino acids that promote intrinsic structural disorder (Wilson et al. 2017, James et al. 2021), while young plant proteins are enriched for amino acids that are more available (James et al. 2021). The most ancient protein domains are enriched for the amino acids believed to be the first to be incorporated into the genetic code (James et al. 2021). Throughout all lineages, there is a trend in which older sequences have their hydrophobic amino acids more interspersed rather than clustered along the sequence (Foy et al. 2019, James et al. 2021). The trends in disorder and hydrophobic clustering seem to be driven by differential retention of certain sequences, rather than by biases during descent with modification (James et al. in preparation). However, descent with modification is important for explaining protein differences between species with high vs. low "effective population size" (Weibel et al. 2020, McShea et al. in preparation). Adaptive paths through descent with modification face "frustration" from the fact that the same hydrophobic amino acids that promote functional folding also promote harmful aggregation; this creates a special kind of adaptive landscape (Bertram & Masel 2020).
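
To make the distinction between clustered and interspersed hydrophobic residues concrete, here is a minimal Python sketch. It computes a simple block-based index of dispersion: the variance of hydrophobic counts across blocks, divided by their mean. The choice of hydrophobic residues (L, I, V, F, M, W), the block length of 6, and the function name are all assumptions made for illustration, and this simplified index is not necessarily the metric used in the cited papers.

```python
from statistics import mean, pvariance

# Assumed set of hydrophobic amino acids (one-letter codes); other
# definitions of hydrophobicity would give different numbers.
HYDROPHOBIC = set("LIVFMW")


def block_dispersion(seq, block=6):
    """Variance-to-mean ratio of hydrophobic counts in consecutive blocks.

    Higher values indicate hydrophobic residues clustered into a few blocks;
    lower values indicate residues spread evenly along the sequence.
    """
    seq = seq.upper()
    # Count hydrophobic residues in each complete, non-overlapping block.
    counts = [
        sum(aa in HYDROPHOBIC for aa in seq[i:i + block])
        for i in range(0, len(seq) - block + 1, block)
    ]
    if not counts:
        raise ValueError("sequence is shorter than one block")
    m = mean(counts)
    if m == 0:
        return 0.0  # no hydrophobic residues, so no clustering to measure
    return pvariance(counts) / m


# Two toy sequences with the same composition but different arrangement.
clustered = "LLLLLLSSSSSSLLLLLLSSSSSS"
interspersed = "LSLSLSLSLSLSLSLSLSLSLSLS"
print(block_dispersion(clustered), block_dispersion(interspersed))
```

Both toy sequences contain 12 leucines and 12 serines, so any difference in the index reflects arrangement rather than composition.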

Phylogenetic inference and ancestral sequence reconstruction for proteins rely on mathematical models of the relative rates of amino acid substitutions, usually under the assumption that amino acid frequencies stay constant over time (stationarity) and that fluxes obey detailed balance (time reversibility). We have contributed to the development of time non-reversible amino acid substitution models that not only fit the data better, but also allow phylogenetic trees to be rooted without using an outgroup (Dang et al. 2022). We are continuing to improve substitution models by filtering alignment errors out of training data. We are excited to use non-stationary models to estimate amino acid frequencies at ancestral nodes. We hope to characterize the proteome and amino acid frequencies of ancient life on Earth all the way back to the last universal common ancestor (LUCA), and indeed, even before. In doing so, we hope to learn more about the environmental conditions of ancient life, and the origins of the genetic code.
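
The two assumptions named above can be stated compactly for a rate matrix Q with frequencies π. Stationarity means that the frequencies are unchanged by the process (the sum over i of π_i q_ij is zero for every j). Detailed balance means that the flux between every pair of states is equal in both directions (π_i q_ij = π_j q_ji). The Python sketch below checks both conditions on toy 3-state matrices. The function names, the 3-state alphabet, and the numerical values are illustrative assumptions; real amino acid models have 20 states and empirically estimated rates.

```python
def is_stationary(pi, Q, tol=1e-12):
    """True if the frequencies pi are unchanged by rate matrix Q (pi Q = 0)."""
    n = len(pi)
    return all(
        abs(sum(pi[i] * Q[i][j] for i in range(n))) < tol for j in range(n)
    )


def satisfies_detailed_balance(pi, Q, tol=1e-12):
    """True if pi_i * q_ij == pi_j * q_ji for every pair of states."""
    n = len(pi)
    return all(
        abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < tol
        for i in range(n)
        for j in range(i + 1, n)
    )


def reversible_rate_matrix(pi, S):
    """Build q_ij = s_ij * pi_j from symmetric exchangeabilities S."""
    n = len(pi)
    Q = [[S[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])  # each row of a rate matrix sums to zero
    return Q


# Toy reversible model: symmetric exchangeabilities, unequal frequencies.
pi = [0.5, 0.3, 0.2]
S = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
Q_rev = reversible_rate_matrix(pi, S)

# Toy non-reversible model: substitutions only flow 0 -> 1 -> 2 -> 0.
uniform = [1 / 3, 1 / 3, 1 / 3]
Q_cyc = [[-1, 1, 0], [0, -1, 1], [1, 0, -1]]

print(is_stationary(pi, Q_rev), satisfies_detailed_balance(pi, Q_rev))
print(is_stationary(uniform, Q_cyc), satisfies_detailed_balance(uniform, Q_cyc))
```

The cyclic matrix is stationary but violates detailed balance, which illustrates that the two assumptions are separable. In a model like this the direction of time leaves a signal in the data, which is the property that makes rooting without an outgroup possible in principle.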

Publications: