Advances and trends in the bidirectional transformation between biological data and knowledge

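LI Rong1,2, GE Jia-Ying3, ZHANG Xue-Bo1, ZHANG Yong-Juan1, CHEN Da-Ming1,4,*, TAO Cheng5,*
1 Shanghai Information Center for Life Sciences, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China
2 School of Cultural Heritage and Information Management, Shanghai University, Shanghai 200444, China
3 Shanghai Center of Biomedicine Development, Shanghai 201203, China
4 University of Chinese Academy of Sciences, Beijing 100049, China
5 Wuhan Documentation and Information Center, Chinese Academy of Sciences, Wuhan 430071, China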

Abstract:

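The research paradigm of the life sciences is undergoing a profound shift from one-way data mining to a closed-loop, synergistic "data-model-knowledge-data" cycle. The pervasive adoption of artificial intelligence is driving biological data to evolve from a static resource into programmable, designable intelligent objects, while the "Learn-Design-Build-Test" cycle constitutes the core engine of this transition. On the data-to-knowledge path, biological data that meet AI-ready standards undergo cross-modal fusion and deep representation through machine learning models; computable, deducible biological models are distilled from massive heterogeneous information and then translated into interpretable, inferable "knowledge entities". On the knowledge-to-data path, computational models such as digital twins and virtual cells encode mechanistic knowledge into executable system architectures, actively generating predictive data through simulation and guiding experimental design. Within this framework, data, models, and knowledge form an upward-spiraling cycle: data drive model learning, models distill and deepen knowledge, and knowledge in turn feeds back to generate new data, which train better models. This integrative paradigm, founded on AI empowerment and centered on a systematic closed loop, is becoming an important route for the life sciences toward an era of intelligence, predictability, and designability.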

Corresponding authors: CHEN Da-Ming, Email: chendaming@sinh.ac.cn; TAO Cheng, Email: taoch@mail.whlib.ac.cn

Advances and trends in the bidirectional transformation between biological data and knowledge
LI Rong1,2, GE Jia-Ying3, ZHANG Xue-Bo1, ZHANG Yong-Juan1, CHEN Da-Ming1,4,*, TAO Cheng5,*
1 Shanghai Information Center for Life Sciences, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai 200031, China
2 School of Cultural Heritage and Information Management, Shanghai University, Shanghai 200444, China
3 Shanghai Center of Biomedicine Development, Shanghai 201203, China
4 University of Chinese Academy of Sciences, Beijing 100049, China
5 Wuhan Documentation and Information Center, Chinese Academy of Sciences, Wuhan 430071, China

Abstract:

The life sciences research paradigm is undergoing a profound transformation from unidirectional data analysis towards a synergistic, closed-loop system of "data-model-knowledge-data". This evolution is driven by the pervasive integration of artificial intelligence technologies, which are redefining biological data from static repositories into programmable, designable intelligent entities. This paper systematically examines the bidirectional transformation between biological data and knowledge, highlighting the critical roles of AI-ready data, intelligent models, and the "Learn-Design-Build-Test" (LDBT) cycle. In the Data-to-Knowledge (D2K) trajectory, the first step is making data "AI-ready": compliant with the FAIR (findable, accessible, interoperable, reusable) principles, stored in standardized formats, and semantically aligned with biological knowledge. High-quality, structured data from major databases such as PDB, NCBI, and GEO fuel sophisticated models, which learn patterns to generate statistical or correlative knowledge. The crucial next step, Model-to-Knowledge (M2K), translates model outputs into verifiable scientific knowledge, such as mechanistic hypotheses. Enhanced model interpretability and integration into the LDBT cycle are essential for this transformation, which moves beyond mere correlations to testable biological insights. Conversely, the Knowledge-to-Data (K2D) trajectory begins with Knowledge-to-Model (K2M), in which established mechanistic, associative, or hypothetical knowledge is encoded into computational model architectures. This is exemplified by digital twins and virtual cell models, which embed biological priors as structural constraints. Subsequently, in Model-to-Data (M2D), these knowledge-informed models, including generative AI such as diffusion models, cross-omics translators, and single-cell foundation models, actively synthesize biologically plausible predictive or synthetic data, which addresses data scarcity and guides experimental design.
The LDBT paradigm forms the core operational engine that unifies these bidirectional paths, creating a spiraling iterative relationship: data drive model learning, models distill knowledge, and knowledge feeds back to generate new data for training superior models. However, challenges remain, including ensuring the reliability and reusability of AI-extracted knowledge, bridging the "conversion gap" between computational designs and successful experimental validation, and establishing standardized interfaces between the D2K and K2D stages. Looking forward, the bidirectional loop is posited as a fundamental methodological framework for tackling biological complexity and integrating multimodal data. Its systematic engineering, through continuous optimization of the LDBT cycle within research infrastructure, paves the way for the life sciences to advance into an era of predictive and designable intelligence. Future efforts must focus on building a robust AI-ready data foundation, developing next-generation algorithms that deeply integrate data and prior knowledge, and perfecting the integration of dry-lab and wet-lab work for automated scientific discovery.

Corresponding authors: CHEN Da-Ming, Email: chendaming@sinh.ac.cn; TAO Cheng, Email: taoch@mail.whlib.ac.cn
