When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.
It’s become something of a meme, as well as a benchmark: can the new video generator realistically show Smith downing a bowl of noodles? Smith himself parodied the trend in an Instagram post in February.
Google Veo 2 did just that.
Now we finally eat spaghetti. pic.twitter.com/AZO81w8JC0
— Jerrod Lew (@jerrod_lew) December 17, 2024
Will Smith and pasta is one of several strange “unofficial” benchmarks that took the AI community by storm in 2024. A 16-year-old developer built an app that puts AI in control of Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect 4 against each other.
It’s not as if there aren’t more academic tests of AI performance. So why did the strange ones blow up?
First, many industry-standard AI benchmarks don’t tell the average person much. Companies often tout their AI’s ability to answer questions on Math Olympiad exams or find reasonable solutions to PhD-level problems. However, most people – including yours truly – use chatbots for things like responding to emails and basic research.
Crowdsourced industry benchmarks aren’t necessarily better or more informative, either.
Take Chatbot Arena, for example, a public benchmark that many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well an AI performs specific tasks, such as building a web app or generating an image. But the evaluators tend not to be representative (most come from AI and tech industry circles) and cast their votes based on personal, hard-to-articulate preferences.
Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of an average human.
“It’s a real shame that there aren’t 30 different benchmarks from different organizations in medicine, law, counseling quality, etc., because people use the systems regardless,” Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti certainly aren’t empirical, or even especially generalizable. Just because an AI passes the Will Smith test doesn’t mean it will, say, render a convincing burger.
One expert I spoke to about AI benchmarks suggested that the AI community focus on AI’s downstream impacts rather than its abilities in narrow domains. That makes sense. But I have a feeling the weird benchmarks aren’t going away anytime soon. They’re not just fun (who doesn’t love watching an AI build Minecraft castles?), they’re also easy to understand. And as my colleague Max Zeff recently wrote, the industry continues to struggle with distilling a complex technology like AI into digestible marketing.
The only question on my mind is which weird new benchmarks will go viral in 2025.