《生命科学》 (Chinese Bulletin of Life Sciences) 2025, 37(12): 1534-1548
Large models for protein prediction and generation: from sequence and structure to function
Abstract:
Large protein models, particularly structure prediction models such as AlphaFold2, RoseTTAFold, and ESMFold, exemplify the integration of artificial intelligence and the life sciences. By training deep neural networks, especially Transformer variants, on massive biological datasets, these models have successfully addressed the core challenge of predicting structure from sequence and have shown great potential in function prediction and protein design. This article reviews research and application advances in large models for protein prediction and generation from three perspectives: the core architectures of protein language models, large models for protein structure prediction, and large models for protein design and generation. Driven by continued progress in large language models, scaling architectures, and flow matching models, large protein models have become powerful engines for understanding and designing biomolecules and for advancing the life sciences and biotechnology. They represent a high point of "AI for Science" and will continue to drive innovation in this field.
Corresponding author: MA Bu-Yong, Email: mabuyong@sjtu.edu.cn