Part 3: Sequence Optimization
3.2. Nussinov and Zuker
As previously discussed, mRNA molecules may have different 3D configurations - different shapes - based on their sequence. The ability to predict the shape of mRNA molecules - and therefore important properties - computationally is based on mRNA folding.
First, it’s worth noting that mRNA folding has a distant relative - seemingly the same at the surface, but vastly different in reality. For over 50 years, protein folding has been a big challenge in biology. While protein structures - like mRNA structures - can be determined experimentally through techniques such as X-ray crystallography and cryogenic electron microscopy, the methods are expensive and time-consuming, making them unable to keep pace with the rapid accumulation of new protein sequences. As such, finding a way to predict structures computationally rather than experimentally has been a topic of research with significant potential. Naturally, the first approaches to this problem were ab initio - starting from first principles, simulating the folding process, or trying to find the folding configuration that is most stable as reflected by its low free energy. Computing the free energy is then treated as a molecular dynamics problem, where force fields are used to describe the potential energy of atoms in proteins due to their spatial arrangement. The modeling techniques have been under development for decades, and while the field is advancing, simulations still yield significant errors relative to experimentally determined free energies. Moreover, as the systems get more complex or the proteins get bigger, the errors compound exponentially - ‘good’ results can only be obtained for small proteins with relatively short folding times. Unfortunately, the ab initio strategy also incurs a high computational cost, with one group simulating 1 millisecond of folding for a 36-amino-acid protein (really short) over 2 months on a machine with 256 processors. A more extreme example is the Anton platform, which employs a supercomputer with customized hardware in the form of application-specific integrated circuits (ASICs) to accelerate molecular dynamics simulations. Even assuming enough computing power will be reached to process longer sequences, the ‘rules’ governing current simulations of protein folding are not entirely clear. For example, one of the leading software programs using molecular dynamics, Rosetta, uses an energy function with over 140 terms, most of which are knowledge-based (parameters determined from experimental data).
Major advances in protein folding were made when AlphaFold was announced in 2018. Using artificial intelligence-based methods, AlphaFold beat other computational prediction methods available at the time. AlphaFold2, announced in 2021, marked another significant leap by being the first end-to-end model that could predict protein structures at the atomic level with an accuracy comparable to experimental methods.
Many consider that the field of mRNA folding began with a paper by Nussinov et al. (1978), which predicted mRNA structure with low but relevant accuracy. The algorithm does this by computing the most stable conformation, where stability is directly proportional to the number of base pairs. In other words, the more base pairs there are, the better the stability of the molecule is. Using dynamic programming, Nussinov’s method has a complexity of O(n^3) and can fold sequences hundreds of base pairs long with the computing resources available at the time. While the predictions were not excellent, they were good enough that other scientists began building on the method.
Ultimately, Zuker & Stiegler (1981) built on the work and arrived at much better predictions by recognizing that structures beyond base pairs contribute to a molecule's free energy. By simulating all the possible base pairs and determining the minimum free energy (MFE) structure, one can infer the most likely conformation of mRNA with some precision. More importantly, MFE is highly correlated with lab measurements of molecule stability. Zuker’s algorithm is still used in mRNA software today, but is limited to mRNA molecules only thousands of bases long, as it inherits its predecessor’s O(n^3) complexity.
Before diving deeper into folding, it’s important to recognize that both Nussinov et al. (1978) and Zuker & Stiegler (1981) made a big assumption to make their algorithms more efficient, namely that there will be no pseudoknots. In essence, this rule can be restated as: if you draw the mRNA with base-pairings, no two pairing lines will overlap. While this led to many advances in the field, in practice, a significant number of base pairs are pseudo-knots, which constitute a significant source of error in MFE predictions.