蛋白质预测和生成大模型:从序列、结构到功能

孙传策,李香逸,黄巍然,王艳菁,马步勇*
上海交通大学药学院,细胞工程及抗体药物教育部工程研究中心,上海 200240

摘 要:

蛋白质大模型(特别是以AlphaFold2、RoseTTAFold、ESMFold 为代表的结构预测模型)是人工智能与生命科学交叉融合的典范。它们通过在海量生物数据上训练深度神经网络,尤其是Transformer 的变体,成功破解了从序列预测结构的核心难题,并展现出在功能预测和蛋白质设计方面的巨大潜力。本文从蛋白质语言模型的核心架构、蛋白质结构预测大模型、蛋白质设计与生成大模型三个方面出发,讨论了蛋白质预测和生成大模型的研究和应用进展。在大语言模型、扩展模型和流匹配模型的不断推动下,蛋白质大模型无疑已成为理解和设计生命分子、推动生命科学和生物技术发展的强大引擎。它们代表了“AI for Science”的一个高峰,并将持续引领该领域的创新浪潮。

通讯作者:马步勇,Email:mabuyong@sjtu.edu.cn

Predictive and generative foundation models for proteins: unlocking sequence, structure, and functional mastery
SUN Chuan-Ce, LI Xiang-Yi, HUANG Wei-Ran, WANG Yan-Jing, MA Bu-Yong*
Engineering Research Center of Cell & Therapeutic Antibody (MOE), School of Pharmacy, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract:

Large protein models, particularly structure prediction models such as AlphaFold2, RoseTTAFold, and ESMFold, exemplify the convergence of artificial intelligence and the life sciences. By training deep neural networks, especially Transformer variants, on massive biological datasets, these models have solved the core problem of predicting structure from sequence and have shown great potential in function prediction and protein design. This article reviews advances in the research and application of predictive and generative large protein models from three perspectives: the core architectures of protein language models, large models for protein structure prediction, and large models for protein design and generation. Driven by continued progress in large language models, scaling architectures, and flow matching models, large protein models have become powerful engines for understanding and designing biomolecules and for advancing the life sciences and biotechnology. They represent a high point of "AI for Science" and will continue to drive innovation in the field.

Corresponding author: MA Bu-Yong, Email: mabuyong@sjtu.edu.cn
