Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Artificial intelligence can excel at certain tasks, such as coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.
A team of researchers created a new benchmark to test three leading large language models (LLMs) on historical questions: OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini. The benchmark, Hist-LLM, checks the correctness of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.
The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it achieved only about 46% accuracy, not much higher than random guessing.
“A key takeaway from this research is that LLMs, while effective, lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical research, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.
The researchers shared with TechCrunch sample history questions that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present in ancient Egypt during a specific time period. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.
Why are LLMs bad at answering technical history questions when they can be so good at answering very complex questions about things like coding? Del Rio-Chanona told TechCrunch that this is likely because LLMs tend to extrapolate from historical data that is very prominent, and find it difficult to retrieve more obscure historical knowledge.
For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM incorrectly answered that it did. This is likely because there is a lot of public information about other ancient empires, such as Persia, that had standing armies.
“If you’ve been told A and B 100 times and C only once, and then you’re asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.
The researchers also found other trends, including that the OpenAI and Llama models performed worse for some regions, such as sub-Saharan Africa, suggesting potential biases in their training data.
Peter Turchin, who led the study and is a CSH faculty member, said the results show that LLMs still cannot replace humans when it comes to certain fields.
But the researchers are still hopeful that LLMs can help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.
“Overall, while our results highlight areas where LLMs need improvement, they also highlight the potential of these models to aid historical research,” the paper states.