Three authors, Abdi Nazemian, Brian Keene, and Stewart O’Nan, are part of a new copyright infringement lawsuit against Nvidia, the latest such suit to challenge generative AI providers’ reliance on the “fair use” doctrine to acquire copyrighted material to train their large language models.
The suit, filed late last week, resembles other suits against generative AI creators in alleging that the defendant used copyrighted material (here, works of fiction by the named authors) as training data for an LLM. The LLM at issue is Nvidia’s NeMo Megatron series, which, according to the complaint, uses several data sets known to contain the authors’ copyrighted material, which was used without permission.
Specifically, the “Books3” dataset seems to be at the heart of the matter. This comprises 108GB of data and is a copy of the Bibliotik private tracker — one of several “shadow library” sites that have a long-standing place in the LLM development world, since they “host and distribute vast quantities of unlicensed copyrighted material,” according to the complaint. The authors ask for monetary damages and “destruction … of all copies [Nvidia] made or used in violation of the exclusive rights of the Plaintiffs.”
The authors are represented by the Joseph Saveri Law Firm, which is already representing other groups of creative professionals in their suits against major AI providers. Comedian and writer Sarah Silverman is part of one such suit, filed in July 2023, against OpenAI and Meta, while another class action names authors Mona Awad and Paul Tremblay as lead plaintiffs. Like the other suits, the case was filed in federal district court in the Northern District of California. (Copyright cases, which are governed exclusively by federal law, are always heard by federal courts.)
All of these suits hinge on the concept of “fair use,” a set of exceptions to US copyright law that allow, in some cases, the reproduction or other use of copyrighted works without permission. The legal test for whether a particular activity qualifies as fair use, according to the Stanford Copyright and Fair Use Center, asks judges to weigh four factors: the purpose and character of the use, the nature of the copyrighted work, the amount and “substantiality” of the portion of the work used, and the use’s effect on the copyright holder’s market for the work.
Defendant AI creators like Nvidia are likely to argue that their use of the copyrighted works is transformative and very different from the original creators’ use, and that using the books for AI training is unlikely to have much of an impact on the market for those books among prospective readers. Plaintiffs, on the other hand, are likely to point to the ingestion of multiple works in full and the commercial nature of Nvidia’s use of the books as arguments against fair use.
Nvidia did not immediately respond to a request for comment.