《生命科学》 (Chinese Bulletin of Life Sciences) 2026, 38(1): 1-17
Nutrition Large Models: Technical Architecture, Application Progress, and Future Challenges
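Abstract (translated from the Chinese):
Nutrition informatics is moving from traditional rule-based and conventional machine learning paradigms toward a new stage centered on large language models (LLMs) and multimodal large language models (MLLMs). This review systematically surveys research on nutrition large models published between 2019 and 2025 and summarizes key architectural and training techniques, including vision-language alignment, domain knowledge injection, retrieval-augmented generation (RAG), and interpretable reasoning. Building on this, it details the current state of model applications in typical scenarios such as personalized dietary recommendation, nutritional status assessment, disease-related nutrition management, and automated dietary logging. It also traces the evolution of core datasets and evaluation benchmarks such as Nutrition5k and NutriBench. Finally, in response to challenges in model trustworthiness, data privacy, cross-cultural generalization, and clinical evidence support, it proposes that future research should integrate clinical evidence more deeply, build high-quality multimodal data resources, and advance human-AI collaborative precision nutrition services in order to increase clinical translational value.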
Corresponding authors: TONG Tian-Lang, Email: 183947@shsmu.edu.cn; WANG Hui, Email: huiwang@shsmu.edu.cn
Abstract:
Nutrition informatics has undergone a significant paradigm shift in recent years. Approaches historically grounded in rule-based decision support and classical task-specific machine learning pipelines are increasingly being superseded by an ecosystem centered on large language models (LLMs) and multimodal vision-language foundation models. This review synthesizes research published between 2019 and 2025, with three objectives: clarifying the architectural patterns that enable nutrition-oriented perception and reasoning, summarizing advances and identifying gaps across major application scenarios, and outlining strategic directions for reliable translation into clinical and public health practice. Based on a systematic analysis of 92 representative studies, we organize the current landscape into three interrelated research trajectories: (1) Vision and multimodal modeling for dietary perception, focusing on food recognition, ingredient parsing, portion estimation, and nutrient prediction from meal images and videos. Recent methods increasingly adopt Transformer-based encoders and explicit vision-language alignment, leveraging depth cues and scale calibration to improve robustness under complex real-world conditions. (2) LLM-based nutrition agents for interactive guidance, supporting dietary counseling, meal planning, and health coaching. To mitigate hallucinations and numerical inconsistency, current research emphasizes domain adaptation, tool-augmented computation, and retrieval-augmented generation (RAG) to ground model responses in reliable nutrition databases and clinical guidelines. (3) Personalization-oriented hybrid systems, which combine foundation models with structured components (such as knowledge graphs and causal inference frameworks) while integrating individual-level multi-omics signals, biomarkers, and lifestyle data. These systems aim to generate and optimize meal plans under strict constraints of safety, clinical feasibility, and patient adherence. Across these trajectories, interpretability has shifted from an optional feature to a core system requirement, driven by the needs of clinical accountability and risk auditing. Concurrently, evaluation protocols are expanding from image-centric datasets (e.g., Nutrition5k) to comprehensive benchmarking suites designed for multimodal reasoning. Despite rapid progress, limitations persist in model factuality, privacy preservation, and external validity across diverse cuisines and socioeconomic settings. We advocate for evidence-grounded pipelines, standardized multimodal datasets with clinical endpoints, and unified evaluation frameworks spanning accuracy, safety, and bias. Human-in-the-loop deployment remains essential to quantify benefit-risk profiles and facilitate the regulatory adoption of AI-driven nutrition services.
Corresponding authors: TONG Tian-Lang, Email: 183947@shsmu.edu.cn; WANG Hui, Email: huiwang@shsmu.edu.cn