Study finds ChatGPT’s latest bot behaves like humans, only better
The most recent version of ChatGPT passes a rigorous Turing test, diverging from average human behavior chiefly to be more cooperative.
As artificial intelligence has begun to generate text and images over the last few years, it has sparked a new round of questions about how handing over human decisions and activities to AI will affect society. Will the AI systems we’ve launched prove to be friendly helpmates or the heartless despots seen in dystopian films and fiction?
A team anchored by Matthew Jackson, the William D. Eberle Professor of Economics in the Stanford School of Humanities and Sciences, characterized the personality and behavior of ChatGPT’s popular AI-driven bots using the tools of psychology and behavioral economics in a paper published Feb. 22 in the Proceedings of the National Academy of Sciences. This study revealed that the most recent version of the chatbot, version 4, was not distinguishable from its human counterparts. In the instances when the bot chose less common human behaviors, it was more cooperative and altruistic.
“Increasingly, bots are going to be put into roles where they’re making decisions, and what kinds of characteristics they have will become more important,” said Jackson, who is also a senior fellow at the Stanford Institute for Economic Policy Research.
In the study, the research team presented ChatGPT versions 3 and 4 with a widely used personality test and also asked the chatbots to describe their moves in a suite of behavioral games that can predict real-world economic and ethical behaviors. The games included established exercises in which players decide whether to inform on a partner in crime or decide how to divide money with varying incentives in place. The bots’ responses were compared to those of more than 100,000 people from 50 countries.
The research marks one of the first times an artificial intelligence system has passed a rigorous Turing test. A Turing test, which takes its name from British computing pioneer Alan Turing, can consist of any task assigned to a machine to assess whether it performs like a human. If the machine seems human, it is said to pass the test.
Chatbot personality quirks
The researchers evaluated the bots’ personality traits using a common personality test, called the OCEAN Big-5, that scores respondents on five basic traits that shape behavior. In the study, ChatGPT’s version 4 tested within normal ranges for the five traits but was only as agreeable as the bottom third of human respondents. The bot passed the Turing test, but it would not have won itself many friends.
Version 4 stood head and shoulders, or chip and motherboards, above version 3. The earlier version, with which many internet users may have interacted for free, was only as agreeable as the bottom fifth of the human respondents. Version 3 was also less open to new ideas and experiences than all but a sliver of the most curmudgeonly humans.
To objectively assess the bots’ behaviors in the games, the researchers determined how common a move, such as sharing money equally, was for the human players and the bots, respectively. Then they compared a randomly chosen human move with one drawn from the 30 sessions they played with each bot and determined which was more likely human-made. In most games, version 4’s moves were more likely to be human than not. Version 3 didn’t pass this Turing test.
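The comparison described above can be illustrated with a short Python sketch. This is not the researchers' code. The function names, the toy data, and the choice to count ties as half a win are all assumptions made for illustration. The sketch estimates how often a randomly drawn bot move is more common among humans than a randomly drawn human move. A rate above 0.5 would mean the bot's moves look more human than not.

```python
from collections import Counter
from itertools import product


def move_frequencies(moves):
    """Return the share of plays that each distinct move accounts for."""
    counts = Counter(moves)
    total = len(moves)
    return {move: count / total for move, count in counts.items()}


def turing_win_rate(human_moves, bot_moves):
    """Probability that a random bot move is judged more human-like
    than a random human move.

    "More human-like" here means "chosen more often by humans".
    Ties count as half a win (an assumption of this sketch).
    """
    human_freq = move_frequencies(human_moves)
    bot_freq = move_frequencies(bot_moves)

    win_rate = 0.0
    for (h_move, p_h), (b_move, p_b) in product(human_freq.items(), bot_freq.items()):
        pair_weight = p_h * p_b  # chance of drawing this exact pair of moves
        human_score = human_freq[h_move]
        bot_score = human_freq.get(b_move, 0.0)  # moves humans never make score 0
        if bot_score > human_score:
            win_rate += pair_weight
        elif bot_score == human_score:
            win_rate += 0.5 * pair_weight
    return win_rate


# Hypothetical data: percent of the money a player offers to a partner.
human_moves = [50] * 6 + [30] * 3 + [10]  # made-up human sample
bot_moves = [50] * 30                     # made-up bot that always splits evenly

print(turing_win_rate(human_moves, bot_moves))  # about 0.7 for this toy data
```

In this toy example the bot always picks the move humans choose most often, so it scores above 0.5. A human sample compared against itself scores exactly 0.5 by symmetry, which is why 0.5 serves as the natural threshold.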
The ChatGPT version 3 analyzed in the study was the free online ChatGPT bot at the time the research was conducted. Online users are now interacting with version 3.5 for free. Version 4 is accessible only by paid subscription.
The chatbots’ choices in the games frequently optimized for the greatest benefit to both the bot and its human counterpart, the research found. Their strategies were consistent with altruism, fairness, empathy, and reciprocity, leading the researchers to suggest that the chatbots could perform well as customer service agents and conflict mediators.
But how can a less-than-agreeable bot de-escalate conflict? A partial answer lies in the difference between personality traits and behaviors.
“You might go into a government agency and ask for help, and the person might really politely say, ‘sorry, I can't do that,’” Jackson said. This official would be demonstrating an agreeable personality trait without cooperative behavior. The ChatGPT bot would more likely do the reverse. “The bot is always doing things that are socially beneficial, acting in a way that’s cooperative—but it might not do it with as much of a smile.”
When the researchers simulated for the bots what it’s like for a flesh-and-blood human to play these games with a third-party observer present—asking the bots to explain each of their moves—the bots, like the humans, became more generous.
Human-AI interactions
Much of the concern about AI relates to the public’s inability to see how bots make the decisions they do. Without knowing what a bot is optimized to achieve, it can be hard to accept its counsel.
Jackson’s research demonstrates that even when researchers can’t inspect AI’s inputs and algorithms, they can identify its possible biases by methodically examining outputs.
“By bringing classic economic games into a Turing test, we for the first time could profile AI behavior through their actions, not just their words,” said the paper’s lead author, Qiaozhu Mei, a computer scientist at the University of Michigan.
Jackson and Mei offered a behavioral portrait of the ChatGPT bots as a kind of proof of concept. But, by AI’s very nature, its behaviors will continue to evolve. ChatGPT’s current versions are less agreeable and more conscientious than people, but the next generations could reverse those tendencies or develop completely new ones.
“It’s not clear from this simple suite of experiments how stable the behaviors we documented are going to be or how the bots would act in other situations,” Jackson said.
As a behavioral economist who has made major contributions to our understanding of how human social structures and interactions shape economic decision-making, Jackson is sensitive to the way that human behavior will also evolve in relation to AI.
“Increasingly, it’s not just humans interacting with humans but humans interacting with machines,” Jackson said.
The nudges these interactions give behavior in one direction or another may seem like a small phenomenon to measure, but they can drive large economic and social effects.
It’s nice to know that our new chatbot colleagues are fair and seemingly empathetic, for example, but Jackson and his co-authors note in the paper that their tendency to replicate middle-of-the-road human behaviors could lead to “loss of diversity in personalities and strategies—especially when being put into new settings and making important new decisions.”
“It’s important for us to understand how interactions with AI are going to change our behaviors and how that will change our welfare and our society,” Jackson said. “The more we understand early on—the more we can understand where to expect great things from AI and where to expect bad things—the better we can do to steer things in a better direction.”
Acknowledgements
Jackson is also a member of the Wu Tsai Neurosciences Institute. The other authors of this paper were Yutong Xie from the School of Information at the University of Michigan and Walter Yuan from MobLab, which provided the human data for the games. Most of the human personality test-takers and game participants were high school and university students.
Media contact: Holly Alyssa MacCormick, Stanford School of Humanities and Sciences: hollymac [at] stanford [dot] edu