Part 3: Sequence Optimization

3.4. The Codon Adaptation Index

Minimum free energy is not the only variable that can be optimized. After examining which codons code for amino acids, scientists noticed that some have higher translational efficiency than others. In other words, certain codons used in mRNA molecules get translated more quickly than others, even if they are synonymous (i.e., code for the same amino acid). The Codon Adaptation Index (CAI) quantifies this ‘bias’ in translation for a given coding sequence.

To build this score, scientists used a reference set of highly expressed genes, i.e., genes that get transcribed and translated very often. For a given codon, its relative adaptiveness is its frequency in the reference set divided by the frequency of the most common synonymous codon for the same amino acid, so the most-used codon in each synonymous family scores 1. The CAI of an mRNA sequence is then just the geometric mean of the relative adaptiveness of all its codons.
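The two steps above (relative adaptiveness, then geometric mean) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the codon counts are made up, the table covers only three amino acids (Lys, Phe, Glu), and the function names are hypothetical. Codons missing from the table are simply skipped, and a real implementation would also need a rule for codons with zero counts, since the logarithm is undefined there.

```python
import math

# Toy synonymous-codon families (complete two-codon families only).
SYNONYMS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["UUU", "UUC"],
    "Glu": ["GAA", "GAG"],
}

# Made-up codon counts standing in for a reference set of highly expressed genes.
REFERENCE_COUNTS = {
    "AAA": 30, "AAG": 10,
    "UUU": 20, "UUC": 40,
    "GAA": 50, "GAG": 25,
}


def relative_adaptiveness(counts, synonyms):
    """Weight of each codon = its count / count of the most used synonym."""
    weights = {}
    for codons in synonyms.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            weights[c] = counts[c] / top
    return weights


def cai(sequence, weights):
    """Geometric mean of the weights of all scored codons in the sequence."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    # Averaging logs and exponentiating is the geometric mean.
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMS)

# Both sequences encode Lys-Phe-Glu; only the synonymous codon choice differs.
best = cai("AAAUUCGAA", WEIGHTS)   # every codon is the most used synonym
worst = cai("AAGUUUGAG", WEIGHTS)  # every codon is the less used synonym
print(best, worst)
```

With these toy counts, the first sequence uses only top-ranked codons and scores 1, while the second encodes the same peptide with weights 1/3, 1/2, and 1/2 and scores noticeably lower.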

An ‘optimal’ codon usage for the host results in a score close to 1 and means the synonymous codon choice leads to as much translation as possible. Values near 0 indicate poor adaptation and slower translation. It’s quite impressive that CAI has very high predictive power, given that the formula says nothing about the underlying mechanism driving the speed boost, beyond the implicit assumption that each codon contributes to translation independently of its neighbors.

Interestingly, we know that high CAI increases protein yield mainly by speeding up ribosomal elongation and reducing stalling/frameshifting. In other words, it makes the translation process as efficient as possible, which explains why it works. That said, CAI alone is not a good enough optimization metric, as it often produces unstructured, AU-rich mRNA molecules that degrade quickly.

The solution is to come up with an optimization goal that unifies MFE, which allows the mRNA to ‘stick around’ for longer (get degraded more slowly), with CAI, which allows mRNA to produce more proteins. This is exactly what Zhang et al. (2023) did by taking the logarithm of CAI, scaling it by the sequence length, and subtracting the weighted result from MFE, yielding a value to be minimized. In other words, solutions to the problem should have an MFE as low as possible (lower energy leads to slower degradation) and a CAI as high as possible.

Figure 5 - MFECAI formula. The parameter lambda controls the relative priority of MFE and CAI, as explained below.

The hyperparameter lambda controls the relative weight given to MFE vs. CAI. A value of 0 means we’re minimizing MFE alone, while larger values of lambda prioritize maximizing CAI. The logarithm and the sequence-length factor put the two terms on comparable scales: MFE accumulates over the whole sequence, whereas CAI is a per-codon average, so its logarithm has to be scaled by the number of codons to grow with sequence length in the same way.
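To make the role of lambda concrete, here is a small sketch that scores two candidate sequences under different values of lambda. It assumes the objective has the form MFE - lambda * L * ln(CAI), with L the number of codons, which matches the description above; the formula in Figure 5 remains the authoritative version. The function names and the candidate values (MFE in kcal/mol, CAI, length) are invented for illustration, and MFE is passed in as a number rather than computed by a folding tool.

```python
import math


def mfe_cai_objective(mfe, cai, n_codons, lam):
    """Assumed joint objective (lower is better): MFE - lam * L * ln(CAI)."""
    return mfe - lam * n_codons * math.log(cai)


def pick_best(candidates, n_codons, lam):
    """Return the name of the candidate with the lowest objective value."""
    return min(
        candidates,
        key=lambda name: mfe_cai_objective(*candidates[name], n_codons, lam),
    )


# Hypothetical candidates for the same 30-codon protein: (MFE, CAI).
CANDIDATES = {
    "A": (-40.0, 0.60),  # more stable structure, poorer codon usage
    "B": (-35.0, 0.95),  # less stable structure, near-optimal codon usage
}

for lam in (0.0, 1.0):
    print(lam, pick_best(CANDIDATES, n_codons=30, lam=lam))
```

With lambda set to 0 the CAI term vanishes and the more stable candidate A wins; with lambda set to 1 the penalty for A's low CAI outweighs its 5 kcal/mol advantage and B wins. Because ln(CAI) is at most 0, the CAI term can only add a penalty, which disappears when CAI equals 1.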