
From sequence to function: the application and development of protein large language models
ZHANG Liang1, LI Ming-Chen1, ZHAO Wei-Shu2, XIAO Xiang2, HONG Liang1,2,*
1Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China; 2School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract:

In recent years, inspired by pretrained models in natural language processing, Protein Language Models (PLMs) have emerged as pivotal artificial intelligence tools for bridging protein sequence and function. These models treat protein sequences as a "biological language", learning their implicit structural and functional information by capturing complex contextual dependencies among amino acids through self-supervised learning on massive unlabeled sequence data. This paper provides a comprehensive review of the core technologies, primary applications, and open challenges of PLMs. In terms of model architecture, it introduces four mainstream paradigms: masked language modeling, which learns contextual representations; autoregressive models, suited to sequence generation; inverse folding models, which generate sequences conditioned on three-dimensional structures; and discrete diffusion models, which offer advantages in generation quality and flexibility. At the application level, PLMs serve two major directions: first, inferring function from sequence, including functional annotation and mutation effect prediction; and second, designing sequences from function, encompassing function-guided enzyme mining and de novo protein design. Despite their rapid development, PLMs still face significant challenges. Research indicates that, unlike Large Language Models (LLMs) for natural text, PLMs do not show a well-defined scaling relationship between performance and model size: convincing evidence of "emergent abilities" is lacking, and in some cases performance has been observed to decrease as model size increases. Furthermore, the scarcity of high-quality, experimentally validated protein data has become a core bottleneck constraining the advancement of these models. Future developments will focus on more effective integration of multimodal information, such as structure, and on the expansion of high-quality data resources, with the aim of achieving breakthroughs in AI-assisted protein engineering.
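To make the masked-language-modeling paradigm and its use in mutation effect prediction concrete, the sketch below scores a single point mutation with the masked-marginal heuristic commonly applied to ESM-style models: the mutated position is masked, and the model's log-probability of the mutant residue is compared against that of the wild type. This is an illustrative example, not the specific method of the works surveyed here; the public checkpoint name, the toy wild-type sequence, and the chosen mutation are all assumptions made for demonstration.

```python
# Minimal sketch: zero-shot mutation scoring with a masked protein language
# model, using a publicly available ESM-2 checkpoint via HuggingFace
# transformers. Checkpoint, sequence, and mutation are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t33_650M_UR50D"  # assumed public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy wild-type sequence
pos = 10                                       # 0-based position to mutate
wt, mut = sequence[pos], "W"                   # wild-type and mutant residues

# Mask the mutated position; ESM tokenization prepends a CLS token,
# so residue i of the sequence sits at token index i + 1.
tokens = tokenizer(sequence, return_tensors="pt")
masked = tokens["input_ids"].clone()
masked[0, pos + 1] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked,
                   attention_mask=tokens["attention_mask"]).logits

# Masked-marginal score: log p(mutant) - log p(wild type) at the masked site.
log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
wt_id = tokenizer.convert_tokens_to_ids(wt)
mut_id = tokenizer.convert_tokens_to_ids(mut)
score = (log_probs[mut_id] - log_probs[wt_id]).item()

print(f"{wt}{pos + 1}{mut} masked-marginal score: {score:.3f}")
# A positive score means the model prefers the mutant residue in this context.
```

Because the score is derived purely from the self-supervised objective, no labeled fitness data are needed; this is what makes MLM-based PLMs attractive for zero-shot mutation effect prediction.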

Corresponding author: HONG Liang, Email: hongl3liang@sjtu.edu.cn
