Characteristic AI Agents via Large Language Models

Xi Wang 1,2, Hongliang Dai 1,2, Shen Gao 3, Piji Li 1,2*
1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China; 2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China; 3 School of Computer Science and Technology, Shandong University, China

Abstract

The advancement of Large Language Models (LLMs) has led to significant improvements in the performance of chatbot systems, and many researchers have worked on endowing chatbots with distinctive characteristics. While commercial products for building role-driven chatbots on top of LLMs already exist, academic research in this area remains relatively scarce. Our research investigates how well LLMs can construct characteristic AI agents that simulate real-life individuals across different settings. Prior investigations have primarily focused on role-playing characters with simple profiles. To address this gap, we create a benchmark for the characteristic AI agents task, including a dataset, techniques, and evaluation metrics. For this benchmark we build a dataset called "Character100", comprising the most-visited people on Wikipedia for language models to role-play. Using this dataset, we conduct a comprehensive assessment of LLMs across various settings, and we devise a set of automatic metrics for quantitative performance evaluation. The experimental results highlight directions for further improving the capabilities of LLMs in constructing characteristic AI agents.

Experimental Results

Model Name BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L SemanticSim Hit@1 Hit@3 Hit@5
Llama 2-7B-Base(Zero-Shot) 0.080 0.043 0.028 0.019 0.114 0.435 0.365 0.447 0.485
Llama 2-7B-Chat(Zero-Shot) 0.157 0.111 0.086 0.069 0.209 0.510 0.368 0.473 0.519
ChatGLM2-6B(Zero-Shot) 0.331 0.271 0.232 0.202 0.361 0.636 0.338 0.429 0.473
Vicuna-7B-v1.5(Zero-Shot) 0.263 0.208 0.173 0.146 0.287 0.547 0.322 0.406 0.444
Baichuan2-7B-Base(Zero-Shot) 0.024 0.006 0.002 0.001 0.037 0.336 0.255 0.341 0.382
Baichuan2-7B-Chat(Zero-Shot) 0.089 0.053 0.036 0.027 0.125 0.483 0.413 0.504 0.546
ChatGPT(Zero-Shot) 0.105 0.086 0.072 0.061 0.312 0.723 0.593 0.671 0.704
Llama 2-7B-Base(Few-Shot) 0.105 0.067 0.049 0.038 0.153 0.488 0.308 0.392 0.427
Llama 2-7B-Chat(Few-Shot) 0.258 0.208 0.176 0.152 0.373 0.666 0.411 0.517 0.566
ChatGLM2-6B(Few-Shot) 0.323 0.272 0.238 0.211 0.376 0.598 0.472 0.562 0.597
Vicuna-7B-v1.5(Few-Shot) 0.321 0.265 0.227 0.198 0.409 0.705 0.406 0.513 0.557
Baichuan2-7B-Base(Few-Shot) 0.025 0.007 0.003 0.001 0.040 0.359 0.173 0.240 0.273
Baichuan2-7B-Chat(Few-Shot) 0.101 0.062 0.043 0.032 0.152 0.534 0.326 0.411 0.450
ChatGPT(Few-Shot) 0.199 0.169 0.147 0.129 0.502 0.794 0.534 0.620 0.661

Table 1: The results of LLMs without fine-tuning.
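The tables report BLEU-n, ROUGE-L, semantic similarity, and Hit@k. As a rough illustration only (not the paper's exact implementation; function names and tokenization are assumptions), a minimal sentence-level BLEU with clipped n-gram precision and a brevity penalty, plus a Hit@k check over ranked candidate responses, can be sketched as:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)

def hit_at_k(ranked_candidates, gold, k):
    """1 if the gold response appears among the top-k ranked candidates."""
    return int(gold in ranked_candidates[:k])
```

In practice BLEU is usually computed with a standard library implementation (with smoothing for short sentences), and Hit@k presumes candidates ranked by some similarity score, e.g. embedding cosine similarity.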

Model Name BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L SemanticSim Hit@1 Hit@3 Hit@5
Llama 2-7B-Base(Zero-Shot, LoRA) 0.215 0.175 0.148 0.126 0.313 0.662 0.403 0.507 0.552
Llama 2-7B-Base(Zero-Shot, QLoRA) 0.210 0.172 0.145 0.124 0.307 0.661 0.410 0.508 0.552
Llama 2-7B-Chat(Zero-Shot, LoRA) 0.128 0.086 0.064 0.050 0.177 0.496 0.297 0.383 0.424
Llama 2-7B-Chat(Zero-Shot, QLoRA) 0.378 0.331 0.295 0.266 0.509 0.762 0.364 0.466 0.509
ChatGLM2-6B(Zero-Shot, LoRA) 0.052 0.021 0.010 0.004 0.083 0.435 0.161 0.237 0.275
ChatGLM2-6B(Zero-Shot, QLoRA) 0.056 0.023 0.010 0.005 0.086 0.445 0.156 0.232 0.271
Vicuna-7B-v1.5(Zero-Shot, LoRA) 0.344 0.291 0.252 0.220 0.459 0.754 0.367 0.466 0.514
Vicuna-7B-v1.5(Zero-Shot, QLoRA) 0.352 0.298 0.257 0.225 0.462 0.754 0.347 0.448 0.495
Baichuan2-7B-Base(Zero-Shot, LoRA) 0.030 0.009 0.003 0.001 0.049 0.453 0.224 0.302 0.344
Baichuan2-7B-Base(Zero-Shot, QLoRA) 0.051 0.025 0.015 0.009 0.082 0.509 0.305 0.396 0.439
Baichuan2-7B-Chat(Zero-Shot, LoRA) 0.028 0.006 0.001 0.000 0.039 0.382 0.307 0.405 0.449
Baichuan2-7B-Chat(Zero-Shot, QLoRA) 0.078 0.044 0.029 0.021 0.116 0.486 0.401 0.501 0.548
Llama 2-7B-Base(Few-Shot, LoRA) 0.213 0.173 0.145 0.124 0.310 0.614 0.354 0.449 0.493
Llama 2-7B-Base(Few-Shot, QLoRA) 0.210 0.169 0.141 0.120 0.284 0.578 0.326 0.406 0.443
Llama 2-7B-Chat(Few-Shot, LoRA) 0.199 0.149 0.118 0.097 0.287 0.602 0.272 0.359 0.404
Llama 2-7B-Chat(Few-Shot, QLoRA) 0.530 0.474 0.430 0.393 0.590 0.797 0.366 0.467 0.513
ChatGLM2-6B(Few-Shot, LoRA) 0.040 0.015 0.006 0.003 0.066 0.391 0.157 0.233 0.272
ChatGLM2-6B(Few-Shot, QLoRA) 0.042 0.016 0.007 0.003 0.069 0.399 0.146 0.222 0.261
Vicuna-7B-v1.5(Few-Shot, LoRA) 0.416 0.357 0.312 0.276 0.508 0.770 0.379 0.479 0.524
Vicuna-7B-v1.5(Few-Shot, QLoRA) 0.407 0.346 0.301 0.264 0.500 0.770 0.373 0.473 0.524
Baichuan2-7B-Base(Few-Shot, LoRA) 0.027 0.008 0.002 0.000 0.043 0.419 0.167 0.240 0.279
Baichuan2-7B-Base(Few-Shot, QLoRA) 0.046 0.023 0.014 0.009 0.073 0.476 0.255 0.335 0.378
Baichuan2-7B-Chat(Few-Shot, LoRA) 0.032 0.009 0.002 0.000 0.049 0.420 0.238 0.315 0.359
Baichuan2-7B-Chat(Few-Shot, QLoRA) 0.095 0.058 0.040 0.030 0.147 0.527 0.328 0.416 0.456

Table 2: The results of LLMs after fine-tuning.
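The fine-tuned rows use LoRA and QLoRA, which train small low-rank adapters instead of the full weight matrices. A minimal sketch of why this is cheap, assuming illustrative dimensions (a 4096-dim projection and rank 8 are common settings; the paper's exact hyperparameters may differ):

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when fine-tuning a full d_in x d_out weight W."""
    return d_in * d_out

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on the same weight:
    W stays frozen; the learned update is B @ A with
    A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * d_in + d_out * rank

d = 4096                              # hidden size of a 7B-class model
full = full_param_count(d, d)         # 16,777,216 parameters
lora = lora_param_count(d, d, 8)      # 65,536 parameters
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # 256x fewer
```

QLoRA applies the same adapters on top of a 4-bit-quantized frozen base model, further reducing memory at fine-tuning time.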

BibTeX

@misc{wang2024characteristic,
  title={Characteristic AI Agents via Large Language Models}, 
  author={Xi Wang and Hongliang Dai and Shen Gao and Piji Li},
  year={2024},
  eprint={2403.12368},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}