宇航计测技术 (Journal of Astronautic Metrology and Measurement), 2025, Vol. 45, Issue 2: 1-30. doi: 10.12060/j.issn.1000-7202.2025.02.01

A Review of Large Language Model Evaluation Methods
(大语言模型评估方法综述)

SONG Jialei1,2 (宋佳磊), ZUO Xingquan1,2 (左兴权), ZHANG Xiujian3,4 (张修建), HUANG Hai1,2 (黄海)

  1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China;
  2. Key Laboratory of Trustworthy Distributed Computing and Services, Ministry of Education, Beijing 100876, China;
  3. Beijing Aerospace Institute for Metrology and Measurement Technology, Beijing 100076, China;
  4. Key Laboratory of Artificial Intelligence Measurement and Standards for State Market Regulation, Beijing 100076, China
  • Online: 2025-04-15 Published: 2025-04-29
  • About the author: SONG Jialei (born 2002), male, master's student; main research interests: evaluation of artificial intelligence models and adversarial attacks.

Abstract: With the rapid development of large language models (LLMs), their broad application prospects have attracted significant attention from both academia and industry. Before an LLM is deployed in practice, its performance and potential risks need to be comprehensively evaluated. In recent years, researchers have studied LLM evaluation methods from multiple perspectives. This paper systematically reviews the evaluation metrics, methods, and benchmarks for LLMs in terms of performance, robustness, and alignment, and analyzes the advantages and disadvantages of the various metrics and methods. Finally, it discusses future research directions and challenges in LLM evaluation.

Key words: Large language models, Evaluation methods, Evaluation metrics, Evaluation benchmarks

CLC number: