Absci Invites | Fast, Accurate Antibody Structure Prediction from Deep Learning on Massive Set of Natural Antibodies
Jun 01, 2022
“Fast, Accurate Antibody Structure Prediction from Deep Learning on Massive Set of Natural Antibodies” presented by Absci Invites: Seminar Series
#AbsciInvites seminar series just hosted another great guest, Jeffrey Ruffolo. Jeff presented his research on IgFold, a novel, incredibly fast #machinelearning method for antibody structure prediction. #AI #unlimit
Disclaimer: Views and content presented by Jeffrey Ruffolo are his own and should not be attributed to Absci.
Presentation Transcript:
Gregory Hannum:
Hello, everyone, and thank you for joining us today. I’m Gregory Hannum, V-P of A-I research here at Absci. Today I’m really excited to welcome Jeffrey Ruffolo from Johns Hopkins University. His research is focused on scalable artificial intelligence models and their application to molecular biophysics problems. Today, Jeffrey is going to be discussing the use of deep learning models to predict antibody structures based on massive sets of natural antibodies. With that, I’ll hand the controls over to you, Jeff.
Jeffrey Ruffolo:
Great. Yeah, thank you guys so much for inviting me to talk to you today. I’ll be sharing our work on antibody structure prediction that we’re calling, “Fast, accurate antibody structure prediction from deep learning on a massive set of natural antibodies.” This is work by myself, Lee-Shin, and Pooja from the lab of Jeffrey J. Gray.
Jeffrey Ruffolo:
Today, I’ll talk a little bit about a couple of our models. The first is AntiBERTy, which is an antibody specific language model that we’ve trained for representation learning, and then the main model will be I-G Fold, which is a model that we’ve used to predict antibody structures from the representations learned by AntiBERTy. We can use that model to predict structures end-to-end in a fast way, as well as estimate the error of those predicted structures, which gives information about their reliability. We can also incorporate known structural information into those predictions, and finally, we’ve applied the model to a large set of natural antibody sequences to further expand the observed antibody structural space.
Jeffrey Ruffolo:
Before I get too much into the models, I’ll talk a little bit about antibody structures, which I’m sure most of the audience here is familiar with. Antibodies are large protein complexes, composed of two heavy and two light chains for a conventional antibody, and the role of antibodies is, of course, to bind and neutralize antigens. When we talk about antibody structure prediction, we focus on the F-V region here shown in blue, which really forms the contact with the antigen. Zooming in on that interaction, the F-V binding is really mediated by a set of six complementarity-determining region loops, the C-D-Rs, which I’ve shown here in red.
Jeffrey Ruffolo:
Five of these C-D-Rs are pretty easy to predict the structures of: they adopt a set of canonical folds, and you can usually figure out that fold just from the sequence. But the third C-D-R loop of the heavy chain, or C-D-R H three, which I’ll talk more about later, really has a lot more structural diversity, due in part to its sequence diversity as well as the length diversity of this loop. Because it plays a really central role in binding the antigen, structural modeling of this loop is pretty critical for engineering antibodies and understanding their functions.
Jeffrey Ruffolo:
Historically, researchers have predicted antibody structures by grafting together pieces of previously solved antibody structures. This is called grafting, and it’s pretty similar to template modeling or homology modeling for regular proteins. The basic workflow here is to first take your sequences and parse them into their different structural domains. For the heavy chain, you would want to find where the C-D-R H one, two, and three start, and likewise for the light chain. Then you can search those discrete chunks against a database like SAb Dab, which collects antibody structures, and finally piece together those templates to form a complete structure. These approaches work pretty well for the C-D-R loops that adopt canonical folds, but have historically struggled on the H three, because it’s hard to find a template for this longer loop. When you can find a template, the structural diversity still limits the accuracy.
Jeffrey Ruffolo:
A couple years ago, we started to think about how we could apply machine learning to this problem of antibody structure prediction. At the time, deep learning approaches were really treating proteins as images, where for each pixel, you have a measurement describing, for two residues, how far apart they are or how they’re oriented in space. We took that approach, where we think of an antibody structure as a set of distances and orientations between residues, then we train a model to go from the sequence to that image-like representation. Finally, once you have those descriptions of where things should be relative to each other in space, you can put that into something like Rosetta to come up with actual three dimensional coordinates. These approaches worked pretty well; they moved the state of the art for antibody H three loop structure prediction down about half an Angstrom to an Angstrom on average, but there’s still room for improvement. In the time since those methods were developed, of course, AlphaFold has really revolutionized protein structure prediction.
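As a rough illustration of that image-like representation, here is a minimal sketch (hypothetical, not the actual DeepAb code) of computing the kind of pairwise C-alpha distance map such models were trained to predict from sequence:

```python
import numpy as np

def ca_distance_map(ca_coords: np.ndarray) -> np.ndarray:
    """Pairwise C-alpha distance matrix (L x L) from an (L, 3) coordinate array.

    Each 'pixel' (i, j) holds the distance between residues i and j,
    the kind of inter-residue measurement the image-style models
    described above were trained to predict from sequence.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy example: three residues spaced roughly 3.8 Angstroms apart along x.
coords = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
print(ca_distance_map(coords))
```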
Jeffrey Ruffolo:
I won’t go too much into the details of AlphaFold, but the general idea is to take co-evolutionary information in the form of a multiple sequence alignment and then, through a neural network, map that into a three D structure. Our goal today is to talk about I-G Fold, and here the idea is to take the antibody sequence and predict the structure in an end-to-end fashion. We want to do this so that we can predict antibody structures quickly; previous approaches, as well as AlphaFold, tend to be slow and hard to apply to large antibody data sets, like you might have early in a discovery campaign. We also want to be able to incorporate template information if you have it, so if you know part of your structure, we’d like to produce a prediction that’s consistent with that data. Finally, we want to know whether our predictions are likely to be accurate, so we’ll have the model estimate its own error.
Jeffrey Ruffolo:
To do this, we want to call on as much data as we can find. Although there are only a few thousand antibody structures that have been solved experimentally, there’s a vast set of antibody sequences that have been collected through immune repertoire sequencing studies. In those studies, typically, you can go one of two directions. You can take your sample and identify antigen-specific antibodies, which might make promising therapeutics, or you can sequence the antibodies within the repertoire, and that’s how we get the data sets that we’re going to use. For the last couple decades, researchers and companies around the world have been collecting these data sets, which have been aggregated into the Observed Antibody Space by Charlotte Deane’s lab at Oxford. This data set contains about a billion unpaired antibodies and 100,000 paired antibodies, which we’ll use for training our representation learning model.
Jeffrey Ruffolo:
We want to use these sequences for structure prediction, but of course, we don’t have structures for them, so we need some other way to extract information from this data set. To do this, we turn to an approach that’s pretty common in protein language modeling today, called masked language modeling, or masked residue prediction in our case. The idea here is to take your sequence, hide some of the residues from your model, and task it with predicting what the hidden residue’s identity was. For example, in this case, if I hide this residue, I’d like the model to predict that E goes there. In learning this process, the model can pick up a lot of cues about structure. For example, if I hide a residue within a beta sheet, and the model has learned the propensity of residues within beta sheets, it has a better chance of guessing what goes in that position.
Jeffrey Ruffolo:
Similarly, if the model can learn something about the three D arrangement of a protein, then if I mask one residue, for example a cysteine in this conserved disulfide bond, the model should have a better chance of predicting what goes in that position. Once we’ve trained this model, we want to use it to extract representations for a given antibody sequence; to do this we just take our sequence of interest, encode it with the model, and take the final hidden representation. This is a summary of the sequence that we can use for downstream tasks.
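To make the masked residue prediction objective concrete, here is a minimal, hypothetical PyTorch sketch of the idea; the real AntiBERTy is a much larger BERT-style transformer, and the tiny model, vocabulary, and masking rate below are illustrative assumptions only:

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(VOCAB)  # extra token id used to hide residues

class TinyMaskedResidueModel(nn.Module):
    """Toy stand-in for a BERT-style encoder trained with masked residue prediction."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB) + 1, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, len(VOCAB))

    def forward(self, tokens):
        hidden = self.encoder(self.embed(tokens))  # final hidden states = per-residue representations
        return self.head(hidden), hidden

def mask_sequence(seq: str, frac: float = 0.15):
    tokens = torch.tensor([[VOCAB[aa] for aa in seq]])
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < frac
    mask[0, 0] = True   # guarantee at least one hidden position in this toy example
    tokens[mask] = MASK_ID
    return tokens, labels, mask

seq = "EVQLVESGGGLVQPGGSLRLSCAAS"  # start of a heavy-chain framework region
tokens, labels, mask = mask_sequence(seq)
model = TinyMaskedResidueModel()
logits, hidden = model(tokens)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])  # predict only the hidden residues
print(round(loss.item(), 3), hidden.shape)  # hidden: (1, L, dim) embeddings reusable downstream
```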
Jeffrey Ruffolo:
Before I get too much into structure prediction, we’ve done some analysis of these representations to get a sense of what the model has learned from this masked residue prediction task. Here, I’m showing embeddings for four flu vaccine recipients; these are people who got the flu vaccine and shortly after had their antibody repertoire sequenced. When we encode these with the model and reduce them down to a visualizable space with UMAP, we see an organization that reflects the rules of V-D-J recombination. You see dominant clusters corresponding to different V genes; if you zoom in and label with the J genes, again, you see these sub-clusters within the V genes that describe which J gene the antibody used. Finally, if we go all the way down to the D gene, because these are heavy chains, we can see these micro-clusters within the antibody space.
Jeffrey Ruffolo:
We also looked a little bit at whether these embeddings have learned something about structure from this pre-training task. Here, I’ve taken the sequences of all of the paired antibodies in SAb Dab, of which there are a few thousand, encoded them with the model, and extracted the part of the representation that corresponds to the C-D-R loops. For example, for C-D-R H one, if we take the representation from the sequence that corresponds to the H one loop, average that down to a fixed size, and then project down to two dimensions with t-SNE, we can begin to see some clustering according to the structural canonical folds that researchers have previously identified. You can see similar trends for the H two loop, as well. For the H three loop, we don’t have these clusters, so we instead just visualized by length. You can see that the space is organized by the length of the C-D-R H three loop.
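A hypothetical sketch of that loop-embedding analysis, assuming per-residue embeddings and C-D-R boundaries are already in hand (the arrays and boundaries below are placeholders): slice out the loop residues, mean-pool to a fixed size, and project with t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

def pool_loop_embedding(per_residue_emb: np.ndarray, loop_start: int, loop_end: int) -> np.ndarray:
    """Average the per-residue embeddings over one CDR loop to get a fixed-size vector."""
    return per_residue_emb[loop_start:loop_end].mean(axis=0)

# Placeholder data: per-residue embeddings for 200 antibodies and toy H1 boundaries.
# In practice both would come from the language model and an antibody numbering tool.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(120, 512)) for _ in range(200)]
h1_bounds = [(26, 35)] * 200

pooled = np.stack([pool_loop_embedding(e, s, t) for e, (s, t) in zip(embeddings, h1_bounds)])
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(pooled)
print(coords_2d.shape)  # (200, 2): one point per antibody, colorable by canonical cluster
```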
Jeffrey Ruffolo:
We see a similar trend for the light chain C-D-R loops as well, with clusters emerging for some of the loops, others not so much, but the model seems to have picked up something about the structure of these canonical folds through the pre-training task. Of course, our goal is to predict three dimensional structures, so it’s promising that we have this representation that looks like it can get us there, but we need to build a little bit more on top of it to get to our final goal. Here, our model I-G Fold takes these antibody sequences, encodes them with AntiBERTy, and then learns in an end-to-end fashion to predict the backbone structure. Once we have that backbone structure, we can put it into Rosetta to add the side chains and work out any discrepancies, such as clashes, non-ideal bond angles, and the like. We take an approach where we think of the antibody structure as a fully connected residue graph to start, which allows the model to learn the associations between residues as well as pass information around the entire sequence and structure.
Jeffrey Ruffolo:
Finally, we migrate to a three D coordinate representation, which ultimately yields our prediction. Zooming in a little bit on the model, we start with our antibody sequence, which can be a heavy and light chain or just a single chain. We encode that with our AntiBERTy model and pull out the embeddings I mentioned previously, as well as the attention matrices from throughout the transformer model. These encode information about which residues might need to be associated during the structure prediction task, because if the model has learned to attend to things that are structurally close, for example, it might have already picked up on some of the dependencies in sequence space, so we can use these attention matrices to bootstrap the learning process.
Jeffrey Ruffolo:
Then we proceed through a set of graph transformer and triangle multiplicative update layers to update the nodes and edges. Next, if we have a template structure, which I’ll mention how we get these during training in a moment, we incorporate that through invariant point attention, which was proposed for AlphaFold. Our implementation is a little bit different than that of AlphaFold, because we want to just give the model a structure that we know, so rather than have it update that structure, we hold the [inaudible 00:11:04] fixed and just allow the model to collect information using the structure, though it’s not yet predicting one. After the template is incorporated, we then start with a coordinate frame at the origin, similar to AlphaFold, and use a set of invariant point attention layers, again, to move those coordinates to the final predicted structure.
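Schematically, the data flow just described looks roughly like the skeleton below. This is not the published I-G Fold code; every block is a simple placeholder (plain linear layers standing in for the graph transformer, triangle update, and I-P-A stacks) so the inputs and shapes can be followed end to end.

```python
import torch
import torch.nn as nn

class IgFoldLikeSkeleton(nn.Module):
    """Very rough, hypothetical skeleton of the described data flow.

    node features <- language-model per-residue embeddings
    edge features <- language-model attention maps
    then: graph-transformer / triangle-update blocks -> template IPA -> structure IPA.
    Every block here is a trivial placeholder so the example stays runnable.
    """

    def __init__(self, lm_dim=512, n_lm_heads=8, node_dim=64, edge_dim=32):
        super().__init__()
        self.node_proj = nn.Linear(lm_dim, node_dim)
        self.edge_proj = nn.Linear(n_lm_heads, edge_dim)
        self.node_blocks = nn.Sequential(                # stands in for graph transformer layers
            nn.Linear(node_dim, node_dim), nn.ReLU(), nn.Linear(node_dim, node_dim))
        self.coord_head = nn.Linear(node_dim, 3)         # stands in for the IPA structure stack

    def forward(self, embeddings, attentions):
        # embeddings: (L, lm_dim); attentions: (L, L, n_lm_heads)
        nodes = self.node_proj(embeddings)
        edges = self.edge_proj(attentions)    # edge features seeded from attention maps
        nodes = self.node_blocks(nodes)       # node updates (placeholder for node/edge blocks)
        coords = self.coord_head(nodes)       # coordinates, conceptually grown from the origin
        return coords, edges

L = 230
model = IgFoldLikeSkeleton()
coords, edges = model(torch.randn(L, 512), torch.randn(L, L, 8))
print(coords.shape, edges.shape)  # (230, 3) backbone-like coordinates, (230, 230, 32) edge features
```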
Jeffrey Ruffolo:
As I mentioned previously, we also want to predict the error in our predictions. Once again, we turn to invariant point attention. It’s similar to how we used it for templates: we pass in the structure predicted by the model, holding the coordinates fixed, and have it predict a per-atom deviation from the native structure. Once we’ve done all this, we can put our predicted structure into Rosetta to work out those non-idealities that I mentioned previously, as well as visualize where the model is likely to be accurate or inaccurate. To train this model, we wanted to use more antibody structures than were available in SAb Dab, where there are only a few thousand, but of course we have AlphaFold, which is a highly accurate protein structure predictor. To start, we took the paired sequences from O-A-S, about 120,000 of them, and clustered those down to a more manageable population of 16,000.
Jeffrey Ruffolo:
Then, using AlphaFold, we produced this synthetic structure database that we can use as additional training data. We also looked to the unpaired antibody sequences, about a billion of them. After clustering those down to 40% sequence identity, we were left with about 23,000. In combination, this is about 10 times the number of non-redundant antibody structures than we would have if we had used crystal structures alone. The advantage of using unpaired sequences, as well, is that we can get better performance on antibodies where you don’t have both a heavy and a light chain. I’ll talk a little bit more about that later, but some models that we developed previously really struggled on those antibodies, because they were only trained in the paired context, so they don’t see the full conformational space accessible when you don’t have a light chain. To train the model, we sample evenly between SAb Dab, the O-A-S paired synthetic set, and the O-A-S unpaired synthetic set.
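The even sampling across the three sources can be pictured with a toy helper like the following; this is a sketch with placeholder datasets, not the actual data loader:

```python
import random

def sample_training_example(sabdab, oas_paired_synth, oas_unpaired_synth):
    """Draw one training structure, sampling evenly across the three sources:
    experimental SAbDab structures and the two AlphaFold-derived synthetic sets.
    The real pipeline batches and shuffles, of course."""
    source = random.choice([sabdab, oas_paired_synth, oas_unpaired_synth])
    return random.choice(source)

# Placeholder datasets: lists of structure identifiers.
sabdab = [f"sabdab_{i}" for i in range(5)]
paired = [f"oas_paired_{i}" for i in range(5)]
unpaired = [f"oas_unpaired_{i}" for i in range(5)]
print([sample_training_example(sabdab, paired, unpaired) for _ in range(6)])
```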
Gregory Hannum:
Hi, Jeffrey. Just a second, we have a question from the chat here.
Jeffrey Ruffolo:
Okay.
Gregory Hannum:
How much work did you have to do to decide on this model? Are there other variations you’ve tried that anecdotally worked less well?
Jeffrey Ruffolo:
Yeah, great question. The model was really developed while the community was figuring out how AlphaFold might work, before the code and the paper were released. Although the overall architecture was decided pretty early on, we swapped out a lot of these components. For example, for the invariant point attention, before that was out, we tried things like the E-G-N-N and the S-E three transformer, and we found that they were a lot less efficient for the same performance. A model that might take a week to train using E-G-N-N or S-E three, we could get down to a couple days with I-P-A, with significantly fewer parameters. The same goes for the structure prediction part: I-P-A is pretty nice for predicting updates to coordinates, whereas with E-G-N-N we predicted things directly and just aligned them. Yeah, we took most of what was published at the time and tried to swap these pieces in and out, but the core idea was pretty consistent.
Gregory Hannum:
Okay.
Jeffrey Ruffolo:
As I mentioned, we sample evenly from our three structure data sets to train our model. Every time, of course, we give the model the sequence, but half the time we also give it a template. We come up with these templates by taking the structure from our database and then corrupting it by removing anywhere between one and six spans of 20 amino acids. You can think of this as randomly dropping out a C-D-R loop, but we also allow the training process to drop out other regions as well, so it’s more robust to deletions in the structure. Then the model is tasked with predicting the structure, of course, as well as that error estimate per atom in the backbone.
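A small, hypothetical sketch of that template corruption scheme, hiding between one and six spans of roughly 20 residues from a template before it is handed to the model:

```python
import random

def corrupt_template(coords, span_len: int = 20, min_spans: int = 1, max_spans: int = 6):
    """Hide between one and six spans of ~20 residues from a template structure.

    `coords` is a list of per-residue coordinates; hidden positions are set to None
    so the model must predict them. An illustrative sketch, not the training code.
    """
    corrupted = list(coords)
    n_res = len(corrupted)
    for _ in range(random.randint(min_spans, max_spans)):
        start = random.randrange(0, max(1, n_res - span_len))
        for i in range(start, min(start + span_len, n_res)):
            corrupted[i] = None  # this span is dropped from the template
    return corrupted

template = [(float(i), 0.0, 0.0) for i in range(120)]  # toy backbone coordinates
masked = corrupt_template(template)
print(sum(c is None for c in masked), "of", len(masked), "residues hidden")
```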
Jeffrey Ruffolo:
Our loss is composed of a mean squared error on the coordinates after aligning the framework of the antibody, as well as a bond length loss, to try to get things to come together in a more realistic way, and then finally, an L one loss on the deviation of the backbone atoms and the carbon beta. Once we’ve trained the model, we also looked at how it assembles the structure using that second stack of I-P-A layers that I mentioned. Starting with the coordinates at the origin, we find that the model initially gets the three D relative arrangement of the residues pretty well, although it’s in a compact form, before finally, at step two, scaling out the residues to their actual positions at the scale of an antibody structure. Finally, the third step of I-P-A adjusts the bond lengths and bond angles to make it look more like a well-folded antibody.
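Putting those three terms together, a rough sketch of such a loss might look like the following; the crude centering stands in for a proper framework superposition, and the weights and details are illustrative assumptions rather than the published training setup:

```python
import torch

def igfold_like_loss(pred_coords, native_coords, pred_dev, framework_mask,
                     w_bond=1.0, w_dev=1.0):
    """Sketch of the described loss: coordinate MSE after framework alignment,
    a bond-length term on consecutive residues, and an L1 term on the predicted
    per-residue deviation. The 'alignment' here is a simple centroid shift on the
    framework residues, a placeholder for a full superposition."""
    shift = native_coords[framework_mask].mean(0) - pred_coords[framework_mask].mean(0)
    aligned = pred_coords + shift

    coord_mse = ((aligned - native_coords) ** 2).sum(-1).mean()

    # Bond-length term: consecutive C-alpha distances should stay near ~3.8 Angstroms.
    pred_bonds = (aligned[1:] - aligned[:-1]).norm(dim=-1)
    bond_loss = ((pred_bonds - 3.8) ** 2).mean()

    # Error head: L1 between predicted deviation and actual per-residue deviation.
    actual_dev = (aligned - native_coords).norm(dim=-1).detach()
    dev_loss = (pred_dev - actual_dev).abs().mean()

    return coord_mse + w_bond * bond_loss + w_dev * dev_loss

L = 100
fw_mask = torch.zeros(L, dtype=torch.bool)
fw_mask[:25] = True  # pretend the first 25 residues are framework
loss = igfold_like_loss(torch.randn(L, 3), torch.randn(L, 3), torch.randn(L).abs(), fw_mask)
print(loss.item())
```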
Gregory Hannum:
Thank you.
Jeffrey Ruffolo:
Is there another question?
Gregory Hannum:
Yeah, another question here. These losses are different from AlphaFold’s; how did you choose them?
Jeffrey Ruffolo:
Yeah, that’s a great question. I’ll start with the error estimation. AlphaFold uses the P-L-D-D-T, which is more similar to the metrics you might look at for general protein structure accuracy, but for antibodies, the modeling community has a pretty well defined set of metrics that they usually look at for evaluating models and are familiar with. We wanted to predict things that are going to be immediately useful to people who have been using antibody structure predictors before. Typically, we want to align the framework residues and measure the R-M-S-D of the heavy atoms within the C-D-R loop, so that’s what we trained the model to do as well. For the bond length loss, we found that without this, the model approximates the structures pretty well, but it tends to take the easy way out for the C-D-R H three loop, where it might come up with a prediction where things are spatially as close to the native as they can be, but look pretty unrealistic.
Jeffrey Ruffolo:
This forces the model to reconcile the overall position of the residues, but also put them in an arrangement that looks more like a protein backbone. The hardest one to decide on was the mean squared error loss here. We tried a few things; we started with R-M-S-D, which was what RoseTTAFold did. We didn’t try the loss that AlphaFold uses, because it ended up being a little bit too inefficient computationally for our resources. The mean squared error really seemed like a good in-between: it worked better than R-M-S-D, in my experience, while being more efficient.
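For reference, the framework-align-then-measure-C-D-R evaluation mentioned a moment ago can be sketched as below; this simplified version aligns and measures on C-alpha coordinates, whereas the benchmarks use heavy atoms:

```python
import numpy as np

def kabsch_align(mobile: np.ndarray, target: np.ndarray):
    """Return rotation R and translation t that superpose `mobile` onto `target` (Kabsch)."""
    mob_c, tgt_c = mobile.mean(0), target.mean(0)
    H = (mobile - mob_c).T @ (target - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, tgt_c - R @ mob_c

def cdr_rmsd(pred, native, framework_idx, cdr_idx):
    """Align on framework residues, then measure RMSD over the CDR residues,
    the standard antibody evaluation described in the talk (C-alpha-only sketch)."""
    R, t = kabsch_align(pred[framework_idx], native[framework_idx])
    pred_aligned = pred @ R.T + t
    diff = pred_aligned[cdr_idx] - native[cdr_idx]
    return float(np.sqrt((diff ** 2).sum(-1).mean()))

# Toy check: identical structures give an RMSD of zero.
coords = np.random.default_rng(0).normal(size=(120, 3))
print(cdr_rmsd(coords, coords, np.arange(0, 95), np.arange(95, 110)))
```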
Jeffrey Ruffolo:
Once we have this final structure from the model, we refine it, of course, to remove those non-idealities I mentioned. This step really only makes small changes to the structure; here I’m showing the result from end-to-end prediction from I-G Fold versus its refined counterpart. You can see, for most of the structure, there’s really no change. You might see some small changes in the H three loop from removing non-realistic torsion angles and fixing bond lengths, but not much in most cases. The cases where this is really necessary are longer H three loops; here, for example, you can see the blue one has some clashes, some atoms that are a little bit too close together. That’s what we really want Rosetta to fix for us, so we can have a more realistic structure prediction.
Jeffrey Ruffolo:
To evaluate our model, we took a set of methods, spanning grafting-based methods that have been used historically, as well as deep learning methods that can be applied to antibodies. We evaluate by, like I mentioned a moment ago, chopping up the structure into the C-D-R loops that we want to measure R-M-S-D for, as well as the framework residues. Grafting-based methods can typically achieve pretty good accuracy for both the heavy and light chain frameworks, as well as most of the C-D-R loops, but for C-D-R H three, they tend to have worse performance. A couple of antibody-specific deep learning methods have come out over the last year or so; one is DeepAb, which we developed previously, as well as ABlooper from Charlotte Deane’s lab. These methods performed better on the C-D-R H three loop, and also got some marginal improvements on the other C-D-R loops.
Jeffrey Ruffolo:
Finally, AlphaFold Multimer can also predict antibody structures. Here we see a decent improvement on the H three loop, nothing too significant, but it does perform well on these, despite being trained on all proteins. When we add I-G Fold in, we see its performance looks pretty similar to AlphaFold Multimer, which makes some sense, because it was trained on a lot of AlphaFold predictions, but the main distinguishing factor between these methods is the speed. For grafting-based methods, you’re really trading off accuracy for really high throughput prediction. You can typically predict these in a few seconds, definitely under a minute. With the antibody-specific methods, you get a little bit more accuracy, but the time goes up. For ABlooper, it’s about a minute; for DeepAb it’s up to 10 minutes per sequence. With AlphaFold Multimer, you’re getting good accuracy, but it can take anywhere between 30 minutes and two hours to get your structure prediction, depending on how many models you want to use and what resources you have access to.
Jeffrey Ruffolo:
Finally, for I-G Fold on a C-P-U machine, we can predict these structures in less than a minute, meaning that you can predict a lot more structures given a modest compute budget. We also compared the individual predictions between I-G Fold and AlphaFold. Despite having pretty similar performance, and I-G Fold being trained on predictions from AlphaFold, we see that, in a lot of cases, they predict pretty different conformations. Here I’m showing the H three R-M-S-D from native for AlphaFold predictions on the X axis versus I-G Fold on the Y axis. You can see a lot of points in this far off-diagonal space. These are cases where, for example, for this point, I-G Fold was about seven Angstroms from the crystal structure and AlphaFold was about one Angstrom. In practice, what this looks like is pretty distinct conformations. For example, in this case, AlphaFold in red is pretty close to the native, whereas I-G Fold predicts a pretty distinct conformation.
Jeffrey Ruffolo:
Then, of course, there are points on both sides of the diagonal, so in some cases, I-G Fold is really close to the native, whereas AlphaFold predicts a pretty distinct conformation. Distinguishing these cases is where the error prediction comes in handy. Here I’m showing the predicted R-M-S-D for the H three loop versus the actual R-M-S-D from the crystal structure, and we see a strong relationship between the predictions and the actual R-M-S-D. If I put this on structures, you can see that, in some cases, the model will have a pretty poor prediction. For example, in this case where you have this long H three loop with this beta sheet domain in the middle of the loop, the model predicts this wide open loop, but it associates it with a high predicted error. In other cases with long H three loops, the model does make a good prediction, and here you see a correspondingly low error.
Jeffrey Ruffolo:
Although long loops tend to be more error prone, the model isn’t just learning a regression on the loop length; it’s actually learning something about how well its data supports the prediction, so it’s more informative than just assuming a long loop is going to have higher error. When we look at this metric for other C-D-R loops, we see similar trends; the metric tends to be pretty informative for all of the loops. One exception, where we don’t see a significant correlation, is C-D-R L two; most of these loops are predicted sub-Angstrom to begin with, except for this one outlier, which the model thought would be accurate but ended up not being. A big shortcoming of some antibody-specific deep learning methods that came up previously was that they didn’t work very well for single-domain antibodies, so we wanted to see if we could use I-G Fold to accelerate nanobody structure prediction while maintaining good accuracy. Here, I’m going to show results for a similar set of methods to those we looked at for paired antibody structure prediction, starting with a grafting-based method, ABodyBuilder.
Jeffrey Ruffolo:
You can see that for nanobodies, again, we can typically achieve sub-Angstrom accuracy for the framework. We see a little bit of drop-off for the C-D-R one and two versus paired antibodies. This is due, in part, to there being fewer nanobody structures in the structural databases, so there are fewer templates to choose from. For the C-D-R three, we really see this wide range of prediction accuracies; because nanobodies can have long C-D-R three loops, you really exacerbate the issues that grafting-based methods had with paired heavy chain H three loops, and see this wide spread in performance.
Jeffrey Ruffolo:
Our previous method, DeepAb, can, in principle, predict nanobody structures, though it was trained only on paired antibodies. We see improvements at C-D-R one and two versus grafting-based methods, but for C-D-R three, we actually see significantly worse performance. This is because, as I mentioned, it’s only trained on paired sequences and structures, so it never sees the conformations accessible when you don’t have a light chain. It always predicts structures that look like they’re paired, even if you don’t give it a light chain. AlphaFold performs remarkably well on nanobodies, we find; for C-D-R one and two it’s pretty comparable to, but a bit better than, the previous approaches. Then, for C-D-R three, we see a significant improvement over grafting or DeepAb. For I-G Fold, we again see good performance on C-D-R one and two, but a degradation versus AlphaFold, despite training on AlphaFold structures.
Jeffrey Ruffolo:
When we look at what the structures actually look like in practice, we find that sometimes I-G Fold can get the correct C-D-R three conformation. In this case, we have what’s called a stretched-twist conformation, where the C-D-R three loop folds down against the beta sheet. I-G Fold lands the loop in this conformation, while AlphaFold doesn’t. But where we find I-G Fold really struggles is when you have secondary structure within the C-D-R loops. In this case, you have this C-D-R three loop that has a small alpha helix in the middle. AlphaFold, which has been trained to predict protein structures in many contexts, figures out that there should be a helix here, whereas I-G Fold, which really doesn’t see a lot of alpha helices to begin with, gets the overall location of the loop correct, but really produces an unstructured fold.
Jeffrey Ruffolo:
We can also use our error estimations for nanobodies, of course. Again, we see pretty strong correlations between the actual R-M-S-D of each C-D-R loop and the predicted R-M-S-D. Here, I’m just showing our nanobody benchmark; you can see that the model doesn’t suffer from the same problems that I mentioned for DeepAb, it predicts a variety of conformations instead of just predicting them as if they were paired structures, and the error estimations tend to be reliable. The next thing we looked at was whether we can incorporate known structural information into the prediction. This might be useful if you’re working on designing a new antibody from a parent. If you have a structure for your parent antibody, but you’re going to redesign the [inaudible 00:26:37], for example, it would be nice to incorporate what you know about the antibody to begin with and build a model of your designed antibody that is consistent with that data, but predicts the H three loop onto it.
Jeffrey Ruffolo:
To test this, we take our antibody benchmark, delete the H three loops, then pass those remaining coordinates to the model as templates and ask it to predict the whole structure. I’m showing three different strategies here: the first is just standard I-G Fold, where we’re not providing any structural information; next is I-G Fold given everything except the H three; finally, I-G Fold given the whole F-V. You can see, in most cases, if you provide the structure, you get to sub-Angstrom accuracy. For the one where we don’t provide the H three loop, it doesn’t look like there’s much change here, but I’ll zoom in a little bit more later. Of course, when we give the H three loop, the model can typically get sub-Angstrom predictions. The useful part is that, if I-G Fold was going to produce an error prone prediction for one of the C-D-R loops but you have the structure, then rather than relying on something like grafting to fix these, which is a lot of manual work, I-G Fold can, with its template capabilities, incorporate all of that information into its prediction for free.
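The benchmark setup just described, deleting the H three coordinates and passing the rest as a template, amounts to something like this hypothetical helper (names and the None convention are illustrative only):

```python
def build_h3_deleted_template(coords, h3_start: int, h3_end: int):
    """Keep every residue's coordinates except the C-D-R H3 span, which is set to None
    so the model has to predict it while staying consistent with the rest of the FV."""
    return [None if h3_start <= i < h3_end else xyz for i, xyz in enumerate(coords)]

fv_coords = [(float(i), 0.0, 0.0) for i in range(230)]  # toy FV backbone coordinates
template = build_h3_deleted_template(fv_coords, h3_start=95, h3_end=112)
print(sum(c is None for c in template), "H3 residues left for the model to predict")
```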
Jeffrey Ruffolo:
For the H three loop we find that, although the overall performance doesn’t shift much, there are a handful of cases where providing the rest of the framework and the C-D-R loops can yield a meaningful improvement in H three accuracy. For example, in this case, when we give everything except the H three, the prediction ends up being much closer to the actual crystal structure, although it still struggles with this ordered beta sheet domain. We can do the same thing for nanobodies; we see a similar trend when we give everything except the C-D-R three and when we give the whole F-V. There are fewer cases where this makes a big difference; one interesting case is this point down here. When we zoom in on this, we find that, by giving the model the framework, it actually better aligns the C-terminal portion of the C-D-R three loop, which allows it to improve from about two Angstroms to about one Angstrom, so not much, but providing that context can give you some improvements for nanobodies as well.
Jeffrey Ruffolo:
To give a little more context for why we tested providing the entire F-V: although this isn’t practically useful for structure prediction, because if you have the structure, you don’t need to predict it, we think it might be useful for putting together embeddings for your antibody. We know the BERT embeddings contain strong structural context, but here we really have an embedding that has that BERT context with the structure infused, so this might be more useful for downstream model training. After validating our model, we wanted to predict structures for a larger set of antibody sequences than are available in structural databases like SAb Dab, so we went back to O-A-S, collected the 120,000 paired sequences, and clustered those down to 95% sequence identity, just to remove things that are very, very similar. That yielded about 105,000 paired sequences, which we then predicted with I-G Fold. As you can see with this histogram down here, most of them are predicted to be quite accurate.
Jeffrey Ruffolo:
These structures… I’m sorry, there’s a question: “Have you tried using the embeddings for downstream model training?” Yeah, great question. We haven’t done too much with it yet, but I think any application that’s using BERT models could benefit; for example, some people are using BERT models for paratope prediction. Any case where you’re doing something that’s fundamentally reliant on the structure should be improved. I don’t see a scenario where having the structure would ever be a detriment. Similarly, humanization might be a useful application, because there, you’re trying to say something about how the body might react to an antibody, so having the structure in that context could be useful as well.
Jeffrey Ruffolo:
Once we’ve put together this synthetic structure database, if we compare to SAb Dab at similar sequence redundancy, it represents a 40-fold expansion in terms of structures that we have access to. We can perform this calculation with a pretty modest compute budget of about 2,500 C-P-U hours. Looking forward, we think I-G Fold will be useful for existing antibody design pipelines. If you’re doing rational design of binding interactions or docking, having a better starting structure should be useful. The error estimations can also tell you areas where you should be cautious starting your design process.
Jeffrey Ruffolo:
For more immunological studies, we think this model’s speed and accuracy will allow us to transition from thinking of antibodies just as sequences to thinking of them as structures that exist within the body. We’ve done some work on trying to use BERT to understand what’s going on within antibody repertoires, and we’re excited to see how we can incorporate structure prediction into our thought process as well. To summarize a bit, I-G Fold allows you to take an antibody sequence and predict a structure in less than a minute. It doesn’t require state of the art computational resources; you don’t need G-P-Us to get this time, you can do it on a C-P-U machine. The predictions from I-G Fold are state-of-the-art, and they match those of AlphaFold for paired antibody structures, though they do lag behind for nanobodies. With our predictions, we have these informative error estimations, so you can know, for example, if your C-D-R H three loop is likely to be reliable or not. All the code is available online on GitHub, as well as installable from PyPI.
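For anyone who wants to try it, usage through the released PyPI package looks roughly like the snippet below; the class and argument names are my best recollection of the igfold package’s README and should be treated as assumptions to verify against the repository:

```python
# Assumed usage of the released `igfold` package (pip install igfold); verify the exact
# class and argument names against the project README before relying on this.
from igfold import IgFoldRunner

sequences = {
    "H": "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS",  # placeholder fragment; use a full heavy chain
    "L": "DIQMTQSPSSLSASVGDRVTITCRASQSISSYLN",  # placeholder fragment; use a full light chain
}

runner = IgFoldRunner()
runner.fold(
    "my_antibody.pdb",   # output path for the predicted structure
    sequences=sequences,
    do_refine=True,      # refine the predicted backbone (requires a refinement backend)
    do_renum=True,       # renumber the output with an antibody numbering scheme
)
```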
Jeffrey Ruffolo:
I’ll wrap up by just acknowledging everybody in the lab who contributed to this work, including my advisors, Jeffrey Gray and Jeremias Sulam, as well as Pooja Mahajan, Lee-Shin Chu, and then finally Richard, who worked with me in the lab while I was doing most of the I-G Fold development over the summer this last year. I’m happy to take any other questions.
Gregory Hannum:
All right. Thanks a lot, Jeffrey, for that wonderful presentation. We have some time now for questions. If you have a question, please feel free to press the raise hand button at the bottom of the screen and I’ll come to you in real time. A popup will appear asking you to unmute, which you’ll need to click before we can hear you. If you prefer to enter your question, use the Q and A window, that works as well, and I can ask them out loud. If you think of any questions after today’s presentation, you can always reach out to Jeffrey or myself directly. All right. Maybe I’ll start with one myself there. I’d be curious about where you see this going, in terms of the next and important steps here. Particularly, my mind goes to representing the antigen on the other side of the equation for a lot of these binding problems and how one would model that, or work with affinity, specificity, docking, and a lot of the related challenges.
Jeffrey Ruffolo:
Yeah, that’s a great question. I think there are two main directions antibody structure prediction needs to move in. One is, of course, docking, predicting the complex with the antigen, as you mentioned. The challenge there is really the data that are available; the approach we took here to overcome the data shortage is to use AlphaFold for antibodies, producing a synthetic structure database, but as I’m sure people at Absci are aware, AlphaFold really can’t be used to do that for antibody-antigen complexes. Training a model to quickly produce those will require people to get a little bit more creative with how you produce the data. The other main area that we need to address is the conformational diversity of C-D-R loops. Although a lot of H threes don’t really move that much upon binding, there are some that move significantly when they come in contact with their antigen. Something that, rather than producing a one-to-one output for a sequence, can incorporate some variation into the process would be more useful. There you can think of modeling an ensemble of antibody structures that you can sample from, rather than just a fixed structure.
Gregory Hannum:
All right, thank you. Another question we have here about the C-P-U resources for running this. Is this something that’s running on a single node or using a lot of C-P-Us?
Jeffrey Ruffolo:
Great question. When I produced the large scale antibody structure database, each structure was given just two C-P-U cores. I don’t remember the specs exactly, but they’re not anything super fancy; I think the cluster was built in 2020. Each structure can be produced in about a minute. I see another question about the language model.
Gregory Hannum:
Yeah. How important is the language model? Have you tried ablating it, and would a deeper language model perform better? How about a general protein language model like E-S-M trained on UniProt?
Jeffrey Ruffolo:
Yeah. We’re in the process of ablating the language model for I-G Fold. I can start by giving a little bit of context from DeepAb, where we used a pretty simple L-S-T-M encoder to provide some sequence context to the structure predictor. There we found that, comparing with and without it, you got about a quarter Angstrom improvement when you added it to the model. Here, I’m currently testing just replacing the BERT model with a C-N-N, similar to what we used for DeepAb, to see if it can just learn on its own; because we have more structures now, maybe it doesn’t need that boost from the language model. We haven’t tried general protein language models; I think there, the issue for us is really compute resources. For example, if we add something like E-S-M one b, then we’re taking our model from the 25 million parameter BERT model that we have now up to a 650 million parameter model, if I remember correctly. It really extends the training process. Deeper models would be really interesting to see, but I don’t know that we have the resources to test that.
Gregory Hannum:
That gets to the next question here, which is, roughly how long does it take to train these models, and what hardware are you working with?
Jeffrey Ruffolo:
Yeah. For the BERT model, it takes about a week on four A100 G-P-Us. We train on about 550 million antibody sequences, and we can go over those about 10 times in that training period. Then, for I-G Fold, it takes about 70 hours to go over our 44,000 antibody structures. I think we do 2 million training steps.
Gregory Hannum:
All right. I have another question myself about… You use a lot of AlphaFold in here and I’d be interested to think… AlphaFold, if it’s not in itself a perfect prediction of the structure, do you think training on it is… Essentially, are you picking up a consistent signal despite that? Or is it something that, obviously, if you had an improved structure predictor there, it would make it easier downstream?
Jeffrey Ruffolo:
Yeah, I think it would definitely be better to have something more accurate than AlphaFold for these synthetic structure databases, but because what we have is AlphaFold, we wanted to take advantage of it. I will say that, when we remove SAb Dab from this training regime, we see significantly degraded performance, on the order of, I believe, about half to one Angstrom degradation for H three loops. Having the real structures really is valuable here, because AlphaFold isn’t perfect, like you said.
Gregory Hannum:
Mm-hmm.
Jeffrey Ruffolo:
I think where AlphaFold is useful is expanding this pretty narrow distribution of sequences and structures that’s in SAb Dab to more accurately reflect the antibodies the model is likely to encounter in practice. With DeepAb, you have a rigid model; it’s only trained on about 1,700 structures, so it’s pretty easy to surprise it with a new antibody, but when we can incorporate these more diverse 40,000 structures from AlphaFold, even if the model isn’t getting perfect details on where the H three loop is, it learns something about the space of antibodies it’s going to encounter when you apply it in practice.
Gregory Hannum:
All right. Another question we have here is a more general question. What interests you about antibody structure prediction and where are you heading in research?
Jeffrey Ruffolo:
Yeah, great question. The interesting thing for me, personally, in antibody structure prediction is how we can make this transition from thinking of antibodies as sequences to structures. There’s been a lot of interesting work looking at immune repertoires, where, if you have structures for the sequences in the repertoire, you can find antibodies that are going to behave similarly. There was a paper from Charlotte Deane’s lab where they took some known COVID antibodies, then went to repertoires, and found other antibodies that had similar paratopes, even though they had pretty divergent sequences in both the paratope and the rest of the framework. Structure opens doors to a lot of questions that would be hard to answer before. Then, of course, antibody design is also interesting, I think we still need some tooling in the complex prediction area to really have that move forward in a significant way, but hopefully this is a step towards that.
Gregory Hannum:
Along those lines, do you have any thoughts on how we might address some of the bigger challenges around the developability of these antibodies? Getting the structure is a great precursor to affinity, but of course, immunogenicity and a lot of other challenges in bringing them to the clinic are still in the way.
Jeffrey Ruffolo:
Yeah. I would say, for immunogenicity that you mentioned, the natural antibody databases, at least from prior work which Richard really drove forward, it seems like you can get a lot of value for predicting immunogenicity, just from learning what a natural antibody response looks like. In that project, we trained a generative model on the same data set we trained BERT on. We find that it produces sequences that really don’t look immunogenic, when we evaluate them with tools that compare them against T-cell peptide binding, as well as this humanness score that chops it up into fragments and looks at how likely it is to have occurred in a repertoire. I’m optimistic that we can get immunogenicity figured out just from the data we have now. For developability, it’s a little bit more of a challenge, at least working in the public domain, because there’s not a lot of high quality data out there. The data that is out there tends to be focused on one or two antibodies, so it’s hard to build a general model there.
Jeffrey Ruffolo:
One thing that might be possible is to take a similar approach and use synthetic data to bootstrap your process. If you can use something like the SAP score calculator, which has been out for a while, perhaps you can learn something about what could cause an antibody to aggregate and then, based on that weak signal, adapt it to the specific data sets that we have now to build a more general model that performs well. Hopefully, the embeddings from I-G Fold would be useful there. I’m not sure if you need structure prediction exactly for that problem, but having structurally infused embeddings would definitely be advantageous over something like BERT.
Gregory Hannum:
Thank you. Maybe one more question I have for myself is… you mentioned a few times the training limitations, in terms of, these are obviously very large models and difficult to scale. Did you notice any scaling laws, though, in your performance? Is there some thought that, if you could go a lot bigger, it would improve the results?
Jeffrey Ruffolo:
Yeah, good question. If I go back to the architecture, the main place I investigated scaling was in this initial stack of graph transformer and edge update layers.
Gregory Hannum:
Mm-hmm.
Jeffrey Ruffolo:
Then also in these two I-P-A stacks, where we’re incorporating templates and then predicting the structure. For this first stack, I scaled it up from two to about five or six layers, and beyond four, the performance was pretty much stagnant, so I just reduced it to four for efficiency. For the template structure incorporation, this one was pretty critical. Not only the number of layers, but also the number of attention heads that you give the I-P-A, was really important.
Jeffrey Ruffolo:
In fact, here we use eight attention heads for this template operation; if you reduce it to four, the model basically ignores the template, so you really need just enough capacity to incorporate things well. If you’re shy of that, then it doesn’t really learn much. I also tested two versus three layers of I-P-A here, and that didn’t make too much of a difference. For the structure realization, from the coordinates at the origin to the final prediction, I again scaled it from one up to four. Three and four didn’t make too much of a difference, so I kept it at three, but it seems like you need just enough capacity for the model to learn things about antibodies.
Jeffrey Ruffolo:
These properties aren’t that hard to learn, relative to general proteins, so I don’t expect as much improvement from scaling. The real place to drive improvements is in the data and also the task you’re trying to learn. If we were trying to learn a conformationally aware model, where you might have a distribution of structures represented by the model that you can sample from, maybe there having a little bit more capacity would be useful, but you have to solve some data issues first.
Gregory Hannum:
All right. Thank you. Okay, here we have another question as well. Regarding the improvements in data, do you mean non-predicted structures?
Jeffrey Ruffolo:
It could be either. For this work, for AlphaFold, we just used the top-ranked model. If you wanted to build something that’s conformationally aware, you could get more value just by showing the model examples where something might adopt different conformations, even if those don’t accurately reflect the ensemble that you would have experimentally. It might be useful just to dissociate this one-to-one mapping into something that’s a landscape. In terms of non-predicted structures, of course, more of those will be useful. We haven’t done a comprehensive study, at least recently, of how many different conformations exist for similar antibodies in a database like SAb Dab; I expect there’s some variability, particularly surrounding therapeutic antibodies that have tons of structures, where you might see for one sequence, or a set of very similar sequences, that it can adopt a variety of C-D-R loop conformations. Having more data there would definitely be useful, but it might be at a place where you could achieve that now, just from a combination of AlphaFold and those cases where you do have multiple structures for the same sequence.
Gregory Hannum:
With regards to expanding the data challenge, could you foresee utility in combining this with some of the energy model approaches? Either deep learning representations of energy functions, or something that essentially augments the data information there.
Jeffrey Ruffolo:
That’s a good question. I’m not too familiar with energy-based modeling, so I don’t think I have a good answer for you, but I’ve started to see some interesting structure prediction models going down that approach.
Gregory Hannum:
All right. Looks like that’s all the questions we have for today. I want to thank our speaker, Jeffrey Ruffolo, again for presenting to us today. Thanks to everyone who joined and participated as well. Have a great rest of your day, and keep an eye out for future editions of the Absci Invites Seminar Series.