
How OpenAI’s bot crushed this seven-person company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk received an alert that his company’s e-commerce site was down. It appeared to be some kind of distributed denial of service attack.

He soon discovered that the culprit was a bot from OpenAI that was relentlessly trying to scrape his entire, massive site.

“We have over 65,000 products, each product has a page,” Tomchuk told TechCrunch. “There are at least three images per page.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape the data, and we’re still analyzing last week’s logs, maybe it’s more,” he said of the IP addresses the bot used to consume the site.

“Their crawlers crushed our site,” he said, “it was basically a DDoS attack.”

Triplegangers’ website is his business. The seven-employee company has spent more than a decade assembling what it calls the largest database of “human digital doubles” on the web — 3D image files scanned from actual human models.

It sells 3D object files as well as photos—from hands to hair, skin, and full bodies—to 3D artists, video game makers, and anyone else who needs to digitally recreate authentic human features.

Tomchuk’s business is based in Ukraine but also licensed in the US out of Tampa, Florida, and it has a terms of service page on its site that prohibits bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags specifically telling OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, that have their own tags, according to its information page on its crawlers.)

Robots.txt, otherwise known as the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its information page that it honors such files when they are configured with its own set of do-not-crawl tags, though it also warns that it can take its bots up to 24 hours to recognize an updated robots.txt file.
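Based on the crawler names OpenAI documents, a robots.txt that turns away all three of its bots might look something like this (a minimal sketch using standard Robots Exclusion Protocol syntax; the user-agent strings are the ones OpenAI publishes, and a site could narrow the `Disallow` rules rather than block everything):

```
# Block OpenAI's training-data crawler
User-agent: GPTBot
Disallow: /

# Block the agent ChatGPT uses for on-demand browsing
User-agent: ChatGPT-User
Disallow: /

# Block OpenAI's search crawler
User-agent: OAI-SearchBot
Disallow: /
```

The file has to live at the site root (e.g. `example.com/robots.txt`) for crawlers to find it, and, as the article notes, compliance is voluntary on the crawler’s side.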

As Tomchuk’s experience shows, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their heart’s content. It’s not an opt-in system.

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk also expects a jacked-up AWS bill thanks to all of the CPU and download activity from the bot.

Robots.txt isn’t foolproof, either. AI companies comply with it voluntarily. Another AI startup, Perplexity, was pretty famously called out last summer by a Wired investigation, when some evidence implied Perplexity wasn’t honoring it.

Triplegangers product page
Each of these is a product, and each product has a product page that includes more photos. Used with permission. Image credits: Triplegangers

No way to know exactly what was taken

By Wednesday, a few days after OpenAI’s bot returned, Triplegangers had a properly configured robots.txt file in place, as well as a Cloudflare account set up to block GPTBot and several other bots it had discovered, like Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he has blocked crawlers from other AI model companies. As of Thursday morning, the site hadn’t crashed.

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took, or to get that material removed. He found no way to contact OpenAI and ask, and OpenAI did not respond to TechCrunch’s request for comment. OpenAI has also so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is especially difficult for Triplegangers. “We’re in a business where rights are a serious issue because we’re scanning actual people,” he said. With European laws like GDPR, “they can’t just take someone’s picture on the internet and use it.”

Triplegangers’ site was also a particularly delicious find for AI crawlers. Multibillion-dollar-valued startups like Scale AI have been built on people painstakingly tagging images to train AI. Triplegangers’ site has photos tagged in detail: ethnicity, age, tattoos vs. scars, all body types, and so on.

The irony is that the OpenAI bot’s greed is what alerted Triplegangers to how exposed it was. Had the bot scraped more gently, Tomchuk would never have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying, ‘You can opt out if you update your robots.txt with our tags,’ but that puts the onus on the business owner to figure out how to block them,” says Tomchuk.

OpenAI crawl log
Triplegangers’ server logs showed how relentlessly the OpenAI bot was accessing the site, from hundreds of IP addresses. Used with permission.

He wants other small online businesses to know that the only way to discover whether an AI bot is taking a website’s copyrighted material is to actively look. He’s certainly not alone in being terrorized by these bots. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.
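That “actively look” step usually means reading your own access logs. As a rough illustration (the log format and the list of bot names here are assumptions for the example, not details from Triplegangers’ setup), a short Python script can tally hits per known AI crawler in a standard combined-format access log, where the user agent is the last quoted field on each line:

```python
import re
from collections import Counter

# User-agent substrings of some widely reported AI crawlers (illustrative list)
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "Bytespider", "Barkrowler"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler in Apache/nginx 'combined' format logs."""
    hits = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        user_agent = quoted[-1]  # combined format: user agent is the final quoted field
        for bot in AI_BOTS:
            if bot in user_agent:
                hits[bot] += 1
    return hits

# Example: two GPTBot requests and one from an ordinary browser
sample = [
    '1.2.3.4 - - [08/Jan/2025:10:00:00 +0000] "GET /p/1 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '1.2.3.5 - - [08/Jan/2025:10:00:01 +0000] "GET /p/2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '9.8.7.6 - - [08/Jan/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_ai_bot_hits(sample))  # Counter({'GPTBot': 2})
```

Matching on user-agent substrings only catches bots that identify themselves honestly; crawlers that spoof a browser user agent would need IP-range or behavioral analysis instead.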

The problem grew throughout 2024. New research from digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024 — that is, traffic that doesn’t come from a real user.

Still, “most sites remain clueless that they were scraped by these bots,” warns Tomchuk. “Now we have to monitor log activity daily to spot these bots.”

When you think about it, the whole model operates a bit like a mafia shakedown: the AI bots will take what they want unless you have protection.

“They should be asking permission, not just scraping data,” says Tomchuk.


