Kavraki Lab | STAG-LLM: Predicting TCR-pHLA binding with protein language models and computationally generated 3D structures

J. K. Slone, M. Zhang, P. Jiang, A. Montoya, E. Bontekoe, B. N. Rausseo, A. Reuben, and L. E. Kavraki, “STAG-LLM: Predicting TCR-pHLA binding with protein language models and computationally generated 3D structures,” Computational and Structural Biotechnology Journal, vol. 27, pp. 3885–3896, Sep. 2025.

Abstract

Background: Strong binding between T cell receptors (TCRs) and peptide–HLA (pHLA) complexes is important for triggering the adaptive immune response. Binding specificity prediction, identifying which TCRs will bind strongly to which pHLAs, can serve as a first step in designing personalized immunotherapy treatments. Existing machine learning (ML) methods to predict binding specificity rely primarily on the amino acid sequences of TCRs and pHLAs to make predictions. However, incorporating the 3D structure and geometry of the TCR-pHLA complex as an additional data modality alongside protein sequence offers a promising approach to improving ML methods for predicting TCR-pHLA binding specificity. Modern computational modeling tools present unprecedented opportunities to incorporate structure data into ML pipelines. We utilize such computational tools to incorporate 3D data into this work.

Results: We present STAG-LLM, a multimodal ML model for predicting TCR-pHLA binding specificity that leverages sequence data and computationally generated 3D protein structures. We show that by combining a protein language model with a geometric deep learning architecture, our method outperforms existing methods even when trained on 3x smaller datasets. To further validate our model, we conduct in vitro alanine scanning experiments for four peptides and demonstrate a correlation between the attention weights learned by our model and the in vitro results. We also seek to address three key challenges that arise from using computationally generated 3D structures in ML pipelines: increased inference costs arising from the need to generate 3D structures, limited training data, and robustness to noise in the generated structures.

Conclusions: STAG-LLM shows tremendous potential for structure-based TCR-pHLA binding prediction methods, offering a foundation for further advancements in using modeled 3D structures to solve problems in immunology and proteomics. We anticipate that the usefulness of STAG-LLM and similar tools will increase in coming years as both protein structure prediction models and large language models continue to advance.

Publisher: http://dx.doi.org/10.1016/j.csbj.2025.09.004

PDF preprint: http://kavrakilab.org/publications/slone-stagllm2025.pdf