Is it possible to train an AI solely on data generated by another AI? It might sound like a harebrained idea, but it’s one that’s been around for quite some time, and it’s gaining traction as new, real-world data gets harder and harder to come by.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.
But why does AI need data in the first place, and what kind of data does it need? And can that data really be replaced by synthetic data?
AI systems are statistical machines. By training on many examples, they learn the patterns in those examples and use them to make predictions, such as how, in an email, “to whom” usually precedes “it may concern.”
Annotations, usually text labeling the meaning or parts of the data these systems ingest, are central to these examples. They serve as guideposts, “teaching” a model to distinguish objects, places, and ideas.
Consider a photo classification model that is shown many pictures of kitchens labeled with the word “kitchen.” As it trains, the model begins to associate “kitchen” with general characteristics of kitchens (for example, that they contain refrigerators and countertops). After training, given a photo of a kitchen that was not among the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which underscores the importance of good annotation.)
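To make that concrete, here is a minimal sketch (my own illustration, not any particular vendor’s pipeline) of how labeled photos “teach” a classifier. It assumes PyTorch and torchvision are installed and that a hypothetical ./photos folder exists whose subfolder names (kitchen, bedroom, and so on) act as the annotations:

```python
# A minimal sketch: folder names act as the labels ("annotations") the model learns from.
# Assumes PyTorch/torchvision and a ./photos directory with one subfolder per class;
# the paths and class names are placeholders for illustration.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = datasets.ImageFolder("./photos", transform=transform)  # label = subfolder name
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet18(num_classes=len(dataset.classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for images, labels in loader:            # labels come straight from the annotations
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()                      # the model links pixel patterns to "kitchen", etc.
        optimizer.step()
```

Mislabel the folders and the same loop will happily learn the wrong associations, which is exactly the “cow” problem described above.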
The appetite for AI, and the need to provide labeled data for its development, have ballooned the market for annotation services. Dimension Market Research estimates that it is worth $838.2 million today and will be worth $10.34 billion in the next 10 years. While there are no precise estimates of how many people engage in labeling work, a 2022 paper pegs the number in the “millions.”
Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay quite well, especially if the labeling requires specialized knowledge (such as math expertise). Others can be back-breaking: annotators in developing countries are paid only a few dollars per hour on average, without any benefits or guarantees of future gigs.
So there are humanistic reasons to seek out alternatives to human-made labels (Uber, for instance, is expanding its fleet of gig workers to work on AI annotation and data labeling). But there are also practical ones.
Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and subsequently in any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.
Data in general is expensive, for that matter. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions licensing data to Google, OpenAI and others.
Finally, data acquisition is also becoming more difficult.
Most models are trained on massive collections of public data, which owners are increasingly choosing to gate over fears that it will be plagiarized or that they won’t receive credit or attribution for it. More than 35% of the world’s top 1,000 websites now block OpenAI’s web scraper. And around 25% of data from “high-quality” sources has been restricted from the major datasets used to train models, one study found.
Should the current access-blocking trend continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making their way into open datasets, has forced a reckoning for AI vendors.
At first glance, synthetic data seems to be the solution to all these problems. Need annotations? Generate them. More example data? No problem. The sky’s the limit.
And to some extent this is true.
“If ‘data is the new oil,’ synthetic data presents itself as a biofuel that can be generated without the negative externalities of the real thing,” Os Keyes, a PhD candidate at the University of Washington who studies the ethical implications of emerging technologies, told TechCrunch. “You can take a small starting data set and simulate and extrapolate new inputs from it.”
The AI industry has taken the concept and is running with it.
This month, Writer, an enterprise-focused generative AI company, debuted Palmyra X 004, a model trained almost entirely on synthetic data. Writer claims it cost just $700,000 to develop, compared to estimates of $4.6 million for a comparably sized OpenAI model.
Microsoft’s Phi open models were trained in part on synthetic data, as were Google’s Gemma models. Nvidia this summer unveiled a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.
Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.
Luca Soldaini, a senior researcher at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format that isn’t easily obtained through scraping (or even content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, such as descriptions of the lighting.
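The general pattern is straightforward to sketch. The snippet below is not Meta’s Movie Gen pipeline; it is a rough illustration of model-drafted captions that humans later refine, assuming the OpenAI Python SDK as a stand-in for the captioning model and a placeholder frame URL:

```python
# Not Meta's actual pipeline; a rough sketch of the pattern described above:
# a model drafts a caption for a frame of training footage, and a human
# annotator refines it afterward. Assumes the OpenAI Python SDK and an
# OPENAI_API_KEY; the model name and frame URL are placeholders.
from openai import OpenAI

client = OpenAI()

def draft_caption(frame_url: str) -> str:
    """Ask a vision-capable model for a draft caption of a single frame."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; Meta used Llama 3 for its own data
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this shot in one sentence, including the lighting."},
                {"type": "image_url", "image_url": {"url": frame_url}},
            ],
        }],
    )
    return response.choices[0].message.content

draft = draft_caption("https://example.com/frame_0001.jpg")  # placeholder URL
print(draft)  # a human would review and enrich this before it enters the training set
```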
Along similar lines, OpenAI says it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.
“Synthetic data models can be used to rapidly extend human intuition of the data needed to achieve a particular model behavior,” Soldaini said.
Synthetic data is no panacea, however. It suffers from the same “garbage in, garbage out” problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be poorly represented in the synthetic data.
“The problem is you can only do so much,” Keyes said. “Let’s say there are only 30 black people in the data set. Extrapolation can help, but if those 30 people are all middle-class or all light-skinned, that’s what the ‘representative’ data will all look like.”
To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias (poor representation of the real world) causes a model’s diversity to worsen after a few generations of training, according to the researchers, although they also found that mixing in a bit of real-world data helps mitigate this.
Keyes sees additional risks with complex models such as OpenAI’s o1, which, Keyes thinks, could produce harder-to-spot hallucinations in their synthetic data. Those hallucinations, in turn, can reduce the accuracy of models trained on the data, especially if the sources of the hallucinations aren’t easy to identify.
“Complex models hallucinate; data produced by complex models contain hallucinations,” Keyes said. “And with a model like o1, the developers themselves can’t necessarily explain why artifacts appear.”
Compounding hallucinations can lead to gibberish-spewing models. A study published in the journal Nature shows how models trained on error-ridden data generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found, becoming more generic and often producing answers irrelevant to the questions they’re asked.
A follow-up study shows that other types of models, such as image generators, are not immune to this kind of collapse.
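The feedback loop itself is simple enough to simulate. The toy example below is not the setup of either study; it treats each “generation” as a Gaussian fit (using NumPy) to samples drawn from the previous generation’s model, with the fitted spread serving as a crude stand-in for diversity:

```python
# A toy simulation of recursive training on self-generated data, not the
# Nature study's setup. Each "generation" is just a Gaussian fitted to samples
# from the previous generation; the fitted std is a crude proxy for diversity.
import numpy as np

rng = np.random.default_rng(0)
mean, std = 0.0, 1.0      # generation 0: the "real" data distribution
n_samples = 20            # small samples make the effect visible sooner

for generation in range(1, 31):
    data = rng.normal(mean, std, n_samples)  # sample only from the previous model
    mean, std = data.mean(), data.std()      # "train" the next model on those samples
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mean:+.3f}  std={std:.3f}")
```

Left running on small, purely self-generated samples, the fitted distribution tends to drift and narrow over the generations, a miniature version of models becoming more generic with each round of training on their own output.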
Soldaini agrees that “raw” synthetic data isn’t to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data, just as you would with any other dataset.
Failing to do so could eventually lead to model collapse, where a model becomes less “creative” and more biased in its outputs, seriously compromising its functionality. Though this process can be identified and arrested before it gets serious, it is a risk.
“Researchers must examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not self-improving machines; their output must be carefully inspected and improved before being used for training.”
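What that review-and-filter step might look like in practice is sketched below. This is my own illustration rather than Soldaini’s or any lab’s actual pipeline: synthetic text samples are deduplicated, screened with a placeholder quality heuristic, and capped so that real examples still anchor the training mix:

```python
# A minimal curation sketch, not any vendor's real pipeline. The quality_score
# heuristic, thresholds, and mixing ratio are placeholders for the kind of
# review, filtering, and real-data pairing described above.
from typing import List
import random

def quality_score(sample: str) -> float:
    """Crude stand-in for a real review step (a classifier or human rating)."""
    words = sample.split()
    if not words:
        return 0.0
    unique_ratio = len(set(words)) / len(words)      # penalize repetitive text
    length_ok = 1.0 if 5 <= len(words) <= 200 else 0.5
    return unique_ratio * length_ok

def curate(synthetic: List[str], real: List[str],
           min_score: float = 0.6, synthetic_fraction: float = 0.5) -> List[str]:
    seen, kept = set(), []
    for sample in synthetic:
        if sample in seen:                           # drop exact duplicates
            continue
        seen.add(sample)
        if quality_score(sample) >= min_score:       # drop low-quality generations
            kept.append(sample)
    # Cap synthetic data so real examples still anchor the training mix.
    max_synthetic = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    random.shuffle(kept)
    return real + kept[:max_synthetic]

training_set = curate(["a generated example about kitchens"] * 3,
                      ["a real caption of a kitchen photo"])
print(len(training_set), "examples after curation")
```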
OpenAI CEO Sam Altman once claimed that AI will someday produce synthetic data good enough to effectively train itself. But, assuming that’s even feasible, the technology doesn’t exist yet. No major AI lab has released a model trained solely on synthetic data.
At least for the foreseeable future, it looks like we’ll need people in the loop somewhere to make sure the model’s training doesn’t go awry.
Update: This story was originally published on October 23 and was updated with additional information on December 24.