The Computer Scientist's Guide to Designing mRNA Vaccines

Part 2: Antigen Selection Pipeline

2.4. Manual Review

The pipeline should produce enough recommendations to give us several candidates worth testing, but not so many that manual review becomes impractical. My estimate is 10-50, so the current number of candidates (21) falls comfortably within that range.

In the literature, approaches to this stage of results processing vary considerably. A good starting point is to translate the protein IDs, which NCBI assigns as RefSeq IDs, into UniProtKB IDs. In other words, the proteins should be identified on UniProt, which offers a lot more information on entries - from known/inferred location to similar proteins. A known location is very important, as it's much more reliable than a location determined heuristically by a tool. With this information, we can start building a spreadsheet like the one below:

Vaccine Candidates Spreadsheet
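As a rough illustration of the ID translation step, the sketch below reads a tab-separated RefSeq-to-UniProtKB mapping (for example, one exported from UniProt's ID mapping tool) and turns it into spreadsheet rows. The column names "From" and "Entry", the function names, and the sample IDs are all assumptions made for this example; adjust them to match the file you actually have.

```python
import csv
import io
from collections import defaultdict


def load_id_mapping(tsv_text, from_col="From", to_col="Entry"):
    """Parse a tab-separated mapping export into {refseq_id: [uniprot_id, ...]}.

    The column names are assumptions; pass different ones if your export differs.
    """
    mapping = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        mapping[row[from_col]].append(row[to_col])
    return dict(mapping)


def annotate_candidates(refseq_ids, mapping):
    """Build one spreadsheet row per candidate.

    A candidate with zero or several UniProtKB matches is flagged so that
    it gets looked up by hand during the manual review.
    """
    rows = []
    for refseq_id in refseq_ids:
        accessions = mapping.get(refseq_id, [])
        rows.append(
            {
                "refseq_id": refseq_id,
                "uniprot_ids": ";".join(accessions),
                "needs_manual_lookup": len(accessions) != 1,
            }
        )
    return rows


# Placeholder IDs, not real database entries.
example_tsv = "From\tEntry\nWP_EXAMPLE_1\tUP_A\nWP_EXAMPLE_1\tUP_B\nWP_EXAMPLE_2\tUP_C\n"
example_rows = annotate_candidates(
    ["WP_EXAMPLE_1", "WP_EXAMPLE_2", "WP_EXAMPLE_3"],
    load_id_mapping(example_tsv),
)
```

Flagging ambiguous or missing mappings, rather than silently picking the first match, keeps the decision with the reviewer, which is the point of this stage.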

DeepTMHMM is another useful tool to have in our arsenal. Using deep learning, it predicts amino acid-level “localization”, telling us which parts of the sequence are likely to sit outside the cell, within the membrane, or inside the cell. BETA predictions are particularly relevant, as beta sheets are commonly transmembrane - meaning some parts of the antigen will be inside the cell, while other parts will be displayed on the outside.
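To make the per-residue prediction easier to read, a small helper can collapse the topology string into segments and report how much of the protein is predicted to be exposed. This is a minimal sketch that assumes the topology is available as a string with one label per residue, using the letters S, I, M, B, O and P; the function names and the sample string are made up for illustration.

```python
from itertools import groupby

# Assumed label alphabet of the per-residue topology string.
LABELS = {
    "S": "signal peptide",
    "I": "inside",
    "M": "alpha-helical membrane segment",
    "B": "beta-strand membrane segment",
    "O": "outside",
    "P": "periplasm",
}


def topology_segments(topology):
    """Collapse a per-residue label string into (label, start, end) runs.

    Positions are 1-based and inclusive, matching how residues are
    usually numbered in sequence viewers.
    """
    segments = []
    position = 1
    for label, run in groupby(topology):
        length = len(list(run))
        segments.append((label, position, position + length - 1))
        position += length
    return segments


def exposed_fraction(topology, exposed_label="O"):
    """Fraction of residues carrying the 'outside' label (0.0 for an empty string)."""
    if not topology:
        return 0.0
    return topology.count(exposed_label) / len(topology)


# Made-up topology string, not the output for a real protein.
example_topology = "SSSPPPBBBOOOOBBBPP"
example_segments = topology_segments(example_topology)
example_exposed = exposed_fraction(example_topology)
```

The segment list shows at a glance which stretches are predicted to face outward, and the exposed fraction can go into the spreadsheet as one more column.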

A candidate's ability to act as a good antigen is also vital. The most commonly used tools to quantify this are VaxiJen 2, which returns an antigenicity score from 0 to 1 for a given sequence, and VaxiJen 3, which provides a categorical answer (immunogen/non-immunogen) along with a probability. For all my pipelines, the latter probability was 66% or 100%, suggesting that the pipeline and VaxiJen 3 may share some of the same criteria for selecting antigens.

The least conventional step in the pipeline is the removal of accessory proteins, which the current pipeline performs in a manner not reported in the literature. As such, it makes sense for the manual review to ‘look back’ and see how well-conserved the proposed antigens are across all strains. This can be done using DIAMOND, this time running each genome against a database generated from the candidates, which is the more efficient direction. To paint a complete picture, we can check for matches with 99% identity, as well as 95%, 90%, 80%, 50%, and 10%.
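One way to summarize the conservation check is sketched below. It assumes each genome's DIAMOND run was saved in the default 12-column tabular format, where the second column is the subject (here, the candidate) and the third is the percent identity. The function names and the sample data are hypothetical, and the sketch looks at identity only; a real review may also want to take alignment coverage into account.

```python
def best_identity_per_candidate(tabular_text):
    """Return {candidate_id: best percent identity} for one genome's hits."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        candidate, identity = fields[1], float(fields[2])
        if identity > best.get(candidate, 0.0):
            best[candidate] = identity
    return best


def conservation_table(hits_by_genome, candidates, thresholds=(99, 95, 90, 80, 50, 10)):
    """For each candidate, compute the fraction of genomes with a hit at or
    above each identity threshold.

    hits_by_genome maps a genome name to the dict produced by
    best_identity_per_candidate for that genome.
    """
    genome_count = len(hits_by_genome)
    table = {}
    for candidate in candidates:
        row = {}
        for threshold in thresholds:
            matching = sum(
                1
                for best in hits_by_genome.values()
                if best.get(candidate, 0.0) >= threshold
            )
            row[threshold] = matching / genome_count if genome_count else 0.0
        table[candidate] = row
    return table


# Made-up hits for two genomes and two candidates.
genome_1 = (
    "contig1\tcandA\t99.5\t300\t1\t0\t1\t900\t1\t300\t1e-150\t600\n"
    "contig2\tcandA\t85.0\t280\t42\t0\t1\t840\t1\t280\t1e-90\t400\n"
    "contig3\tcandB\t92.0\t200\t16\t0\t1\t600\t1\t200\t1e-80\t350\n"
)
genome_2 = "contig1\tcandA\t96.0\t300\t12\t0\t1\t900\t1\t300\t1e-140\t580\n"
example_hits = {
    "genome_1": best_identity_per_candidate(genome_1),
    "genome_2": best_identity_per_candidate(genome_2),
}
example_table = conservation_table(example_hits, ["candA", "candB"])
```

A candidate whose fraction stays close to 1.0 down to the strictest thresholds is well conserved across strains, while a sharp drop-off suggests it is missing or divergent in part of the collection.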

While the output of this step may be a bit chaotic to read, it’s fairly easy to track progress via an additional script:

[Terminal recording: output of the progress-tracking script]

Armed with all the data, suggestions can be made. It’s normal for a few proteins to ‘slip past’ the pipeline’s checks and reach the manual review stage even though they're not good candidates. This can happen because tools are not 100% accurate, and no step in the pipeline checks for actual localization. These are easy and fast to exclude. Some candidates are easy to recommend for lab testing - they meet all criteria and score well on benchmarks (including conservation across strains). It’s really exciting when some of the proposed candidates have not been previously studied but show high potential, which is the case for the last 3 proteins produced by our pipeline!

There is, however, a third category: proteins with mixed signals. These don’t look like excellent candidates, but they may turn out to be. I put these in the ‘needs more research’ category - a more comprehensive review of existing literature around them (or their general protein class) may be needed before making a call. Depending on what that review finds, on the data gathered above, and on any limit on the number of final candidates to be recommended, these may or may not go on to lab testing.

To assess the quality of the pipeline, it makes sense to examine processes with similar goals (proposing A. baumannii vaccine candidates) in the literature and compare results. While it’s exciting that the pipeline identified new candidates, missing those proposed by the literature without a good justification is a concern. This is, however, not the case here, as 4 of the 8 candidates we would’ve recommended for lab analysis have been suggested by other works in the literature as well:

- FKBP-type peptidyl-prolyl cis-trans isomerase
- Multidrug efflux RND transporter outer membrane channel subunit AdeK & Multidrug efflux RND transporter periplasmic adaptor subunit AdeI
- type IV pilus biogenesis stability protein

Another study reported an outer membrane protein other than Omp38, OmpA.