lch
Published 2026-04-20

Antibody Language Models: Taking the biology seriously makes models better

  1. Antonio Matas-Gil (corresponding author)
  2. Andreas Tiffeau-Mayer
  1. Division of Infection and Immunity, University College London, United Kingdom
  2. Institute for the Physics of Living Systems, University College London, United Kingdom

Have you ever wondered how the immune system is capable of repelling most of the pathogens it finds, even if their molecular signatures are completely different to anything it has seen before? Proteins called antibodies that recognize the antigens released by pathogens are key players in this process, as are the B cells that produce them. Antibodies roam the body and can bind to pathogens with specific molecular signatures to neutralize them, or flag them for destruction by other immune cells.

Two important features of the antibody immune system are somatic hypermutation and clonal selection (Figure 1, left). Somatic hypermutation means that the region of an antibody that binds to an antigen experiences very high levels of genetic mutation, which ensures high levels of antibody diversity, while clonal selection favours antibodies with higher affinities for antigens. The phenomenon of B cells producing antibodies with higher and higher affinities over the course of an immune response is known as affinity maturation.

Figure 1
Protein language models and the immune system.

Left: Schematic of the process of affinity maturation for antibodies. A process called VDJ recombination generates the initial antibody sequence (SYGSSYWF in this case). Then, during a process called somatic hypermutation, mutations lead to changes in the binding site (here, the second S becomes a T in the second generation; the Y becomes an F in the third generation; and the final F becomes a Y in the fourth generation). At the same time clonal selection favours antibodies with higher affinities for antigens. This process can be interpreted as an evolutionary tree: the further down we go, the more time has passed and the better the antibody should be. Middle: When using masked language modelling (MLM) to train a protein language model, one of the amino acids in the input sequence is masked (indicated by the question mark), and the model is tasked with finding which amino acid is most likely to fit in the masked location. The prediction of the model is then compared with the actual amino acid, and the weights used in the model are updated accordingly. This approach works well in general, but makes no use of what we know about the biology of antibodies. Right: When using phylogenetic pair modelling to train an antibody language model, the input is a parent–child pair from the antibody evolutionary tree (left), and there are separate language and mutation models. In this approach, the aim is for the language model to learn about clonal selection, as somatic hypermutation can be reasonably well modelled.

Antibodies can also be engineered to act as biological therapeutics. Based on work in the mid-1970s by Georges Köhler and César Milstein (Köhler and Milstein, 1975) – work that led to them sharing a Nobel Prize in 1984 – monoclonal antibodies were first approved by the US Food and Drug Administration in 1986 (Kung et al., 1979; Ortho Multicenter Transplant Study Group, 1985), and their use has grown ever since. Monoclonal antibodies are now the largest class of biopharmaceuticals, helping to treat patients with a range of conditions including cancer, autoimmune disease and various infectious diseases. However, making antibodies with the specificity needed to target a particular disease is challenging, so there is a pressing need for new approaches. Computational protein design has emerged as a tantalizing alternative in recent years, although the complexity of the sequence-structure-function landscape for antibodies means that this approach is also challenging.

Researchers use ‘protein language models’ to design antibodies. These models are similar to the large language models used by platforms such as ChatGPT, but they work with sequences of amino acids instead of natural language. Protein language models are trained by masking an amino acid in an existing protein, and asking the model to guess the identity of the amino acid that was masked. This training strategy, which is called masked language modelling (MLM), allows the model to learn general features of protein biology (Figure 1, middle). This approach has seen success in areas such as protein structure prediction, but it has not been as successful when applied to other problems, such as protein function prediction (Li et al., 2024). Nevertheless, MLM training remains the dominant training paradigm.
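The masking step can be illustrated with a few lines of Python. This is only a toy sketch of the general idea: the function names are hypothetical, the question mark stands in for the mask token as in Figure 1, and `uniform_model` is a placeholder for a trained network rather than any real protein language model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # placeholder mask symbol, as drawn in Figure 1 (middle)


def mask_residue(sequence, position):
    """Hide one residue; return the masked sequence and the hidden residue."""
    target = sequence[position]
    masked = sequence[:position] + MASK + sequence[position + 1:]
    return masked, target


def uniform_model(masked_sequence):
    """Stand-in for a trained network: equal probability for every amino acid."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def mlm_loss(sequence, position, model=uniform_model):
    """Cross-entropy between the model's prediction and the true residue."""
    masked, target = mask_residue(sequence, position)
    probs = model(masked)
    return -math.log(probs[target])


# Mask the fourth residue of the example sequence from Figure 1.
masked, target = mask_residue("SYGSSYWF", 3)
loss = mlm_loss("SYGSSYWF", 3)
```

During real training, the loss would be used to update the weights of the network, so that the true residue becomes more probable the next time a similar context is seen.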

Now, in eLife, Frederick Matsen of the Fred Hutchinson Cancer Center and colleagues report a new training strategy that is specifically designed for antibodies (Matsen et al., 2026). This strategy takes advantage of the fact that the mutations initiated by an enzyme called AID (short for activation-induced cytidine deaminase) during somatic hypermutation exhibit sequence preferences that arise from the mutational mechanism itself rather than from clonal selection (Figure 1, right). In short, the AID enzyme deaminates a cytosine base in DNA, turning it into uracil, which is then processed by DNA repair pathways such as base excision and mismatch repair. Together these processes result in characteristic, non-random mutation patterns, which can be inferred from data.
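To give a flavour of what a non-random mutation pattern looks like, the toy sketch below flags cytosines in a WRC context (W = A or T, R = A or G), a classical AID hotspot motif, and gives them a higher relative mutability. The weights and function names are made up for illustration; models used in practice are inferred from data and are considerably richer than this.

```python
import re

# WRC (W = A/T, R = A/G): the cytosine at the end of the motif is the AID target.
# The lookahead lets overlapping motifs be found.
WRC = re.compile(r"(?=[AT][AG]C)")


def hotspot_cytosines(dna):
    """Return 0-based positions of cytosines that sit in a WRC context."""
    return [m.start() + 2 for m in WRC.finditer(dna)]


def relative_mutability(dna, hotspot_weight=5.0, baseline=1.0):
    """Toy per-site mutation weights, normalised to sum to 1.

    The weights are illustrative assumptions, not fitted values.
    """
    hot = set(hotspot_cytosines(dna))
    weights = [hotspot_weight if i in hot else baseline for i in range(len(dna))]
    total = sum(weights)
    return [w / total for w in weights]


dna = "GGAGCTTACGG"
positions = hotspot_cytosines(dna)  # cytosines in AGC and TAC contexts
rates = relative_mutability(dna)
```

The point of the sketch is only that the chance of a mutation depends on the local DNA context, independently of whether the resulting antibody binds its antigen better or worse.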

Matsen et al. first make the limitation of current approaches explicit by showing that MLM-trained protein language models essentially learn the biased sequences introduced by somatic hypermutation, even though these sequences do not directly reflect functional selection pressures. Having identified this issue, the researchers then propose an ingenious solution rooted deeply in the biology of antibodies and the affinity maturation process.

The new approach, which we propose calling phylogenetic pair modelling, allows Matsen et al. to train a deep amino acid selection model (DASM). Importantly, phylogenetic pair modelling separates the contributions of mutation and selection, extending previous work by the same group (Matsen et al., 2025) and by others (Elhanati et al., 2015). Training involves working with pairs of sequences – a parent sequence, and the child sequence after somatic hypermutation – and adjusting the parameters of the DASM to optimise the predictions of the model. When adjusting parameters, existing models of somatic hypermutation are used to account for the underlying probabilities of mutations, thus focusing the DASM on learning the selection pressures.
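One simple way to picture this separation is to multiply a fixed, mutation-only probability by a learned selection factor at each site. The Python sketch below follows that picture; it is an illustration under our own assumptions (hypothetical function names, made-up numbers), and the exact parameterisation used by Matsen et al. may differ.

```python
import math


def child_distribution(parent_aa, neutral_probs, selection_factors):
    """Combine a fixed mutation model with selection factors at one site.

    neutral_probs[aa]: probability that mutation alone turns parent_aa into aa,
        as supplied by a somatic hypermutation model (held fixed).
    selection_factors[aa]: non-negative factor a selection model would learn.
    """
    probs = {aa: neutral_probs[aa] * selection_factors[aa]
             for aa in neutral_probs if aa != parent_aa}
    changed = sum(probs.values())
    if changed > 1.0:
        raise ValueError("selection factors too large for a valid distribution")
    probs[parent_aa] = 1.0 - changed  # remaining mass: the site stays unchanged
    return probs


def pair_log_likelihood(parent, child, neutral, selection):
    """Sum the per-site log-probabilities of the child given the parent."""
    total = 0.0
    for i, (p, c) in enumerate(zip(parent, child)):
        total += math.log(child_distribution(p, neutral[i], selection[i])[c])
    return total


# Toy parent-child pair with made-up numbers: S mutates to T at the second site.
neutral = [{"A": 0.03}, {"T": 0.02, "A": 0.01}]
selection = [{"A": 1.0}, {"T": 2.0, "A": 0.5}]
ll = pair_log_likelihood("GS", "GT", neutral, selection)
```

In training, only the selection factors would be adjusted to increase this log-likelihood over many parent–child pairs, while the mutation probabilities stay fixed. That is what steers the model towards learning selection rather than mutation biases.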

The researchers find that using phylogenetic pair modelling to train their model allows it to substantially outperform existing models of larger size. The beauty of this new approach to training is that its improvement in performance is not driven by additional training compute, which is costly, but by a conceptually simple, biologically informed training procedure. The end result is a model that is both smaller and faster than existing models.

Looking forward, approaches such as those described here will be crucial for understanding the fundamental biology of the affinity maturation process, as well as for the development of antibody-based therapies. More broadly, the work of Matsen et al. showcases the importance of training strategies, as has also been demonstrated for another component of the immune system – T cell receptors: last year it was shown that a training method called contrastive learning can help protein language models in this field to perform better by counteracting recombination biases (Nagano et al., 2025). Collectively, these strategies demonstrate how true interdisciplinarity – a combination of biological domain knowledge and machine learning know-how in this case – can tackle complex scientific challenges.

References

    Elhanati Y, Sethna Z, Marcou Q, Callan CG, Mora T, Walczak AM (2015) Inferring processes underlying B-cell repertoire diversity. Philosophical Transactions of the Royal Society B 370:20140243.
    Köhler GJF, Milstein C (1975) Continuous cultures of fused cells secreting antibody of predefined specificity. Nature 256:495–497.
    Kung PC, Goldstein G, Reinherz EL, Schlossman SF (1979) Monoclonal antibodies defining distinctive human T cell surface antigens. Science 206:347–349.
    Li FZ, Amini AP, Yue Y, Yang KK, Lu AX (2024) Feature reuse and scaling: Understanding transfer learning with protein language models. Proceedings of Machine Learning Research 235:27351–27375.
    Matsen FA, Sung K, Johnson MM, Dumm W, Rich D, Starr TN, Song YS, Bradley P, Fukuyama J, Haddox HK (2025) A sitewise model of natural selection on individual antibodies via a transformer-encoder. Molecular Biology and Evolution 42:msaf186.
    Matsen FA, Dumm W, Sung K, Johnson MM, Rich DH, Starr TN, Song YS, Fukuyama J, Haddox HK (2026) Separating selection from mutation in antibody language models. eLife 15:RP109644.
    Nagano Y, Pyo AGT, Milighetti M, Henderson J, Shawe-Taylor J, Chain B, Tiffeau-Mayer A (2025) Contrastive learning of T cell receptor representations. Cell Systems 16:101165.
    Ortho Multicenter Transplant Study Group (1985) A randomized clinical trial of OKT3 monoclonal antibody for acute rejection of cadaveric renal transplants. New England Journal of Medicine 313:337–342.

Article and author information

Author details

  1. Antonio Matas-Gil

    Antonio Matas-Gil is in the Division of Infection and Immunity and the Institute for the Physics of Living Systems, University College London, London, United Kingdom

    For correspondence
    a.gil@ucl.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0009-0001-8455-8903
  2. Andreas Tiffeau-Mayer

    Andreas Tiffeau-Mayer is in the Division of Infection and Immunity and the Institute for the Physics of Living Systems, University College London, London, United Kingdom

    For correspondence
    andreas.mayer@ucl.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-6643-7622

Publication history

  1. Version of Record published :

Copyright

© 2026, Matas-Gil and Tiffeau-Mayer

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

Antonio Matas-Gil, Andreas Tiffeau-Mayer (2026) Antibody Language Models: Taking the biology seriously makes models better. eLife 15:e111070.