Characteristic AI Agents via Large Language Models

Xi Wang 1,2, Hongliang Dai 1,2, Shen Gao 3, Piji Li 1,2*
1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China; 2 MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China; 3 School of Computer Science and Technology, Shandong University, China

Abstract

The advancement of Large Language Models (LLMs) has led to significant improvements in the performance of chatbot systems, and many researchers have worked on endowing chatbots with distinctive characteristics. While commercial products for building role-driven chatbots on top of LLMs already exist, academic research in this area remains relatively scarce. Our research investigates how well LLMs can construct characteristic AI agents that simulate real-life individuals across different settings. Prior investigations have primarily focused on role-playing characters with simple profiles. To address this gap, we create a benchmark for the characteristic AI agents task, including a dataset, techniques, and evaluation metrics. For this benchmark we build a dataset called "Character100", comprising the most-visited people on Wikipedia for language models to role-play. Using this dataset, we conduct a comprehensive assessment of LLMs across various settings, and we devise a set of automatic metrics for quantitative performance evaluation. The experimental results highlight directions for further improving the capabilities of LLMs in constructing characteristic AI agents.

Experimental Results

Model Name BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L SemanticSim Hit@1 Hit@3 Hit@5
Llama 2-7B-Base(Zero-Shot) 0.080 0.043 0.028 0.019 0.114 0.435 0.365 0.447 0.485
Llama 2-7B-Chat(Zero-Shot) 0.157 0.111 0.086 0.069 0.209 0.510 0.368 0.473 0.519
ChatGLM2-6B(Zero-Shot) 0.331 0.271 0.232 0.202 0.361 0.636 0.338 0.429 0.473
Vicuna-7B-v1.5(Zero-Shot) 0.263 0.208 0.173 0.146 0.287 0.547 0.322 0.406 0.444
Baichuan2-7B-Base(Zero-Shot) 0.024 0.006 0.002 0.001 0.037 0.336 0.255 0.341 0.382
Baichuan2-7B-Chat(Zero-Shot) 0.089 0.053 0.036 0.027 0.125 0.483 0.413 0.504 0.546
ChatGPT(Zero-Shot) 0.105 0.086 0.072 0.061 0.312 0.723 0.593 0.671 0.704
Llama 2-7B-Base(Few-Shot) 0.105 0.067 0.049 0.038 0.153 0.488 0.308 0.392 0.427
Llama 2-7B-Chat(Few-Shot) 0.258 0.208 0.176 0.152 0.373 0.666 0.411 0.517 0.566
ChatGLM2-6B(Few-Shot) 0.323 0.272 0.238 0.211 0.376 0.598 0.472 0.562 0.597
Vicuna-7B-v1.5(Few-Shot) 0.321 0.265 0.227 0.198 0.409 0.705 0.406 0.513 0.557
Baichuan2-7B-Base(Few-Shot) 0.025 0.007 0.003 0.001 0.040 0.359 0.173 0.240 0.273
Baichuan2-7B-Chat(Few-Shot) 0.101 0.062 0.043 0.032 0.152 0.534 0.326 0.411 0.450
ChatGPT(Few-Shot) 0.199 0.169 0.147 0.129 0.502 0.794 0.534 0.620 0.661

Table 1: The results of LLMs without fine-tuning.
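The tables report BLEU-n, ROUGE-L, semantic similarity, and Hit@k. As a rough illustration only (not the paper's exact implementation; function names and tokenization are assumptions), a minimal sentence-level BLEU with clipped n-gram precision and a brevity penalty, plus a Hit@k check over ranked candidate responses, can be sketched as:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_avg)

def hit_at_k(ranked_candidates, gold, k):
    """1 if the gold response appears among the top-k ranked candidates."""
    return int(gold in ranked_candidates[:k])
```

In practice BLEU is usually computed with a standard library implementation (with smoothing for short sentences), and Hit@k presumes candidates ranked by some similarity score, e.g. embedding cosine similarity.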

Model Name BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-L SemanticSim Hit@1 Hit@3 Hit@5
Llama 2-7B-Base(Zero-Shot, LoRA) 0.215 0.175 0.148 0.126 0.313 0.662 0.403 0.507 0.552
Llama 2-7B-Base(Zero-Shot, QLoRA) 0.210 0.172 0.145 0.124 0.307 0.661 0.410 0.508 0.552
Llama 2-7B-Chat(Zero-Shot, LoRA) 0.128 0.086 0.064 0.050 0.177 0.496 0.297 0.383 0.424
Llama 2-7B-Chat(Zero-Shot, QLoRA) 0.378 0.331 0.295 0.266 0.509 0.762 0.364 0.466 0.509
ChatGLM2-6B(Zero-Shot, LoRA) 0.052 0.021 0.010 0.004 0.083 0.435 0.161 0.237 0.275
ChatGLM2-6B(Zero-Shot, QLoRA) 0.056 0.023 0.010 0.005 0.086 0.445 0.156 0.232 0.271
Vicuna-7B-v1.5(Zero-Shot, LoRA) 0.344 0.291 0.252 0.220 0.459 0.754 0.367 0.466 0.514
Vicuna-7B-v1.5(Zero-Shot, QLoRA) 0.352 0.298 0.257 0.225 0.462 0.754 0.347 0.448 0.495
Baichuan2-7B-Base(Zero-Shot, LoRA) 0.030 0.009 0.003 0.001 0.049 0.453 0.224 0.302 0.344
Baichuan2-7B-Base(Zero-Shot, QLoRA) 0.051 0.025 0.015 0.009 0.082 0.509 0.305 0.396 0.439
Baichuan2-7B-Chat(Zero-Shot, LoRA) 0.028 0.006 0.001 0.000 0.039 0.382 0.307 0.405 0.449
Baichuan2-7B-Chat(Zero-Shot, QLoRA) 0.078 0.044 0.029 0.021 0.116 0.486 0.401 0.501 0.548
Llama 2-7B-Base(Few-Shot, LoRA) 0.213 0.173 0.145 0.124 0.310 0.614 0.354 0.449 0.493
Llama 2-7B-Base(Few-Shot, QLoRA) 0.210 0.169 0.141 0.120 0.284 0.578 0.326 0.406 0.443
Llama 2-7B-Chat(Few-Shot, LoRA) 0.199 0.149 0.118 0.097 0.287 0.602 0.272 0.359 0.404
Llama 2-7B-Chat(Few-Shot, QLoRA) 0.530 0.474 0.430 0.393 0.590 0.797 0.366 0.467 0.513
ChatGLM2-6B(Few-Shot, LoRA) 0.040 0.015 0.006 0.003 0.066 0.391 0.157 0.233 0.272
ChatGLM2-6B(Few-Shot, QLoRA) 0.042 0.016 0.007 0.003 0.069 0.399 0.146 0.222 0.261
Vicuna-7B-v1.5(Few-Shot, LoRA) 0.416 0.357 0.312 0.276 0.508 0.770 0.379 0.479 0.524
Vicuna-7B-v1.5(Few-Shot, QLoRA) 0.407 0.346 0.301 0.264 0.500 0.770 0.373 0.473 0.524
Baichuan2-7B-Base(Few-Shot, LoRA) 0.027 0.008 0.002 0.000 0.043 0.419 0.167 0.240 0.279
Baichuan2-7B-Base(Few-Shot, QLoRA) 0.046 0.023 0.014 0.009 0.073 0.476 0.255 0.335 0.378
Baichuan2-7B-Chat(Few-Shot, LoRA) 0.032 0.009 0.002 0.000 0.049 0.420 0.238 0.315 0.359
Baichuan2-7B-Chat(Few-Shot, QLoRA) 0.095 0.058 0.040 0.030 0.147 0.527 0.328 0.416 0.456

Table 2: The results of LLMs after fine-tuning.
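The fine-tuned rows use LoRA and QLoRA, which train small low-rank adapters instead of the full weight matrices. A minimal sketch of why this is cheap, assuming illustrative dimensions (a 4096-dim projection and rank 8 are common settings; the paper's exact hyperparameters may differ):

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when fine-tuning a full d_in x d_out weight W."""
    return d_in * d_out

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a LoRA adapter on the same weight:
    W stays frozen; the learned update is B @ A with
    A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * d_in + d_out * rank

d = 4096                              # hidden size of a 7B-class model
full = full_param_count(d, d)         # 16,777,216 parameters
lora = lora_param_count(d, d, 8)      # 65,536 parameters
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # 256x fewer
```

QLoRA applies the same adapters on top of a 4-bit-quantized frozen base model, further reducing memory at fine-tuning time.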

BibTeX

@misc{wang2024characteristic,
  title={Characteristic AI Agents via Large Language Models}, 
  author={Xi Wang and Hongliang Dai and Shen Gao and Piji Li},
  year={2024},
  eprint={2403.12368},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}