Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
In a copyright lawsuit filed against Meta, the plaintiffs’ attorney claims that Meta CEO Mark Zuckerberg gave the green light to the team behind the company’s Llama AI models to use a collection of pirated e-books and articles for training.
The case, Kadrey v. Meta, is one of many against tech giants developing AI that accuse the companies of training models on copyrighted works without permission. For the most part, defendants like Meta have argued that they are protected by fair use, the U.S. legal doctrine that allows copyrighted works to be used to make something new as long as the result is sufficiently transformative. Many creators reject this argument.
In newly unredacted documents filed Wednesday in the U.S. District Court for the Northern District of California, the plaintiffs in Kadrey v. Meta, who include bestselling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta’s testimony from late last year about the company’s use of a dataset called LibGen for Llama-related training.
Self-described as a “links aggregator,” LibGen provides access to copyrighted works from publishers including Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times, ordered shut down, and fined tens of millions of dollars for copyright infringement.
According to Meta’s testimony, as relayed by counsel for the plaintiffs, Zuckerberg cleared Meta to use LibGen to train at least one of its Llama models despite concerns within Meta’s AI board and among others at the company. The filing notes that Meta employees referred to LibGen as a “dataset we know to be pirated” and said that its use “could damage [Meta’s] negotiating position with regulators.”
The filing also cites a memo stating that Meta’s AI team was “approved to use LibGen” after the matter was “promoted to MZ,” that is, escalated to decision makers. (MZ is fairly obvious shorthand for “Mark Zuckerberg” here.)
The details line up with a report by The New York Times last April, which suggested that Meta cut corners to gather data for its AI. At one point, Meta hired contractors in Africa to compile book summaries and considered buying the publisher Simon & Schuster, according to the Times. But company executives determined that negotiating licenses would take too long and reasoned that fair use was a strong defense.
Wednesday’s filing contains new accusations, including that Meta may have tried to hide its alleged infringement by stripping attribution data from the LibGen material.
According to the plaintiffs’ lawyer, Nikolay Bashlykov, a Meta engineer on the Llama research team, wrote a script to remove copyright information, including the words “copyright” and “thanks,” from e-books in LibGen. Separately, Meta allegedly removed copyright marks from scientific journal articles and “source metadata” from the training data it used for Llama.
The filing argues that this discovery shows Meta stripped copyright information not only for training purposes but also to hide its copyright infringement, because removing that information from copyrighted works prevents Llama from reproducing copyright notices that could alert Llama users and the public to Meta’s infringement.
According to the latest filing, Meta also revealed during depositions that it torrented LibGen, which gave some Meta research engineers pause. Torrenting, a method of distributing files across the internet, requires torrenters to simultaneously “seed,” or upload, the files they are trying to obtain.
By torrenting LibGen, thereby facilitating the distribution of its content, Meta effectively engaged in another form of copyright infringement, the plaintiffs’ attorney contends. The lawyer claims that Meta also tried to hide its activity by minimizing the number of files it uploaded.
According to the filing, Ahmad Al-Dahle, Meta’s head of generative AI, “cleared the way” for LibGen to be downloaded via torrent, brushing aside Bashlykov’s remark that it “might not be legally good.”
“If Meta bought the plaintiffs’ works at a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would be infringing copyright,” the plaintiffs’ attorney wrote in the filing. “Meta’s decision to bypass legal means of obtaining books and become a knowing participant in an illegal torrenting network … serves as evidence of copyright infringement.”
The case against Meta is still pending. For now, it covers only Meta’s earliest Llama models, not its latest releases. And the court could rule in Meta’s favor if it accepts the company’s fair use argument.
But the allegations don’t bode well for Meta, as presiding Judge Thomas Hixson noted in an order Wednesday denying Meta’s request to redact a large portion of the filing.
“Clearly, Meta’s sealing request is not intended to protect against the disclosure of sensitive business information that competitors could use to their advantage,” Hixson said. “Rather, it is intended to prevent negative publicity.”
We’ve reached out to Meta for comment and will update this piece if we hear back.