A Review of Large Language Model Evaluation Methods

doi:10.12060/j.issn.1000-7202.2025.02.01

Journal of Astronautic Metrology and Measurement, 2025, Vol. 45, Issue (2): 1-30. doi: 10.12060/j.issn.1000-7202.2025.02.01

SONG Jialei^1,2, ZUO Xingquan^1,2, ZHANG Xiujian^3,4, HUANG Hai^1,2

1. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Key Laboratory of Trustworthy Distributed Computing and Services, Ministry of Education, Beijing 100876, China;
3. Beijing Aerospace Institute for Metrology and Measurement Technology, Beijing 100076, China;
4. Key Laboratory of Artificial Intelligence Measurement and Standards for State Market Regulation, Beijing 100076, China

Online: 2025-04-15  Published: 2025-04-29
About the author: SONG Jialei (born 2002), male, master's student. Main research interests: artificial intelligence model evaluation and adversarial attacks.

Abstract

Abstract: With the rapid development of large language models, their broad application prospects have attracted significant attention from both academia and industry. Before a large language model is deployed in practice, its performance and potential risks need to be comprehensively evaluated. In recent years, researchers have studied the evaluation of large language models from multiple perspectives. This paper systematically reviews the evaluation metrics, methods, and benchmarks for large language models in terms of performance, robustness, and alignment, and analyzes the advantages and disadvantages of the various metrics and methods. Finally, it discusses future research directions and challenges in large language model evaluation.

Key words: Large language models, Evaluation methods, Evaluation metrics, Evaluation benchmarks

CLC number: TP181, V19

SONG Jialei, ZUO Xingquan, ZHANG Xiujian, HUANG Hai. A Review of Large Language Model Evaluation Methods[J]. Journal of Astronautic Metrology and Measurement, 2025, 45(2): 1-30.

