Copyright, AI Training, and LLMs: The Path Forward

The following is an excerpt from the article “The Heart of the Matter: Copyright, AI Training, and LLMs,” authored by Daniel Gervais (Milton R. Underwood Chair in Law, Vanderbilt University), Noam Shemtov (Professor in Intellectual Property and Technology Law/Deputy Head of CCLS, Queen Mary University of London), Haralambos Marmanis (Executive Vice President and CTO, CCC), and Catherine Zaller Rowland (Vice President and General Counsel, CCC).

The full article can be read in the Journal of the Copyright Society.

The timeline of copyright legislation shows a clear pattern: the introduction of major new technologies has often been followed by the creation of new exclusive rights (e.g., broadcasting rights) and/or compensation mechanisms (e.g., cable retransmission fees), as well as limitations and exceptions (e.g., for parody or criticism) that reflect a balance between the interests of copyright holders and those of the broader public.¹

Courts have also tried to reflect a balance when interpreting the statute. We see examples in opposite directions in opinions of the Supreme Court in Sony and Grokster.² In the former case, the Court moved from an apparent position of significant skepticism at oral argument to an affirmation of fair use for the sale of home video recording devices (VCRs) as a dual-use technology capable of both infringing and substantial non-infringing uses.³ In the latter case, although peer-to-peer file sharing was also a dual-use technology, its “promotion” as an infringement tool led the Court to find Grokster secondarily liable under a doctrine of inducement borrowed at least in part from patent law.⁴

Today, we stand on the brink of another crucial era in which technologies such as AI can absorb and process the creative works of humans to autonomously produce competitive content.⁵ The stakes for creators, industries, and society as a whole are immense, as the fundamental nature of creativity and the way we value human artistry may be about to change. It would be strange if, for perhaps the first time, such a significant change were not met with adequate adaptation of the international legal framework.⁶

As with other advancements, copyright concepts apply to AI and help set the stage for how we can use both technology and the law to promote innovation. It is clear that AI is built on a foundation of an immense body of works of authorship, many of which are protected by copyright.⁷ When copyrighted works are used, AI systems typically make copies of the works to train and power AI outputs.⁸ In LLMs, tokenization, sub-word tokenization, and the creation and storage of contextual word embeddings that delineate the relationships between words in extensive sequences can implicate two distinct exclusive copyright rights: the right of reproduction (essentially, copying), and the right of adaptation.⁹
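For readers less familiar with the technical vocabulary, the following minimal Python sketch illustrates what sub-word tokenization and an embedding lookup involve. It is an illustration only, not the method of any actual LLM: the vocabulary, the greedy longest-match rule, and the function names (`tokenize`, `encode`, `decode`, `make_embeddings`) are all hypothetical, and the random vectors stand in for learned values. It also shows only a static embedding table; the contextual embeddings mentioned above are computed by later model layers that are not shown here.

```python
import random

# Hypothetical toy vocabulary of sub-word pieces. Real LLM vocabularies are
# learned from data and typically contain tens of thousands of entries.
VOCAB = ["copy", "right", "ed", " ", "work", "s", "un", "<unk>"]
TOKEN_TO_ID = {piece: i for i, piece in enumerate(VOCAB)}


def tokenize(text):
    """Split text into sub-word pieces by greedy longest match (toy rule)."""
    candidates = sorted((p for p in VOCAB if p != "<unk>"), key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        match = next((p for p in candidates if text.startswith(p, i)), None)
        if match is None:
            # Characters outside the toy vocabulary become an unknown token.
            pieces.append("<unk>")
            i += 1
        else:
            pieces.append(match)
            i += len(match)
    return pieces


def encode(text):
    """Map text to a sequence of integer token IDs."""
    return [TOKEN_TO_ID[p] for p in tokenize(text)]


def decode(ids):
    """Map a sequence of token IDs back to text."""
    return "".join(VOCAB[i] for i in ids)


def make_embeddings(dim=4, seed=0):
    """Build one vector per vocabulary entry (random placeholders, not learned)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in VOCAB]


text = "copyrighted works"
pieces = tokenize(text)
ids = encode(text)
embeddings = make_embeddings()
vectors = [embeddings[i] for i in ids]  # one vector per token in the sequence

print(pieces)
print(ids)
print(decode(ids))
```

In this toy case the sequence of token IDs can be decoded back into the input string, while any character outside the vocabulary would be replaced by `<unk>` and lost. How intermediate representations of this kind should be characterized under the reproduction and adaptation rights is the legal question the article goes on to address.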

The reproduction right essentially gives copyright owners the exclusive right to make copies of their work or to authorize others to do so.¹⁰ Article 12 of the Berne Convention provides that “Authors of literary or artistic works shall enjoy the exclusive right of authorizing adaptations, arrangements, and other alterations of their works.”¹¹ Adaptation is broadly recognized as the process of transforming a work into a different form of expression, moving beyond simple reproduction, such as the adaptation of a novel into a film.¹² In the United States, the copyright owner also enjoys a specific right to create derivative works (works based on the original work). Other nations tend to use the Berne Convention language and refer to adaptation and translation.¹³

If someone other than the copyright owner reproduces, adapts, or makes derivative works of a copyrighted work without permission, the copyright owner can make a claim of infringement, which would be subject to copyright law’s set of exceptions and limitations.¹⁴ In the case of LLMs, it is first necessary to determine whether there is unauthorized copying, adapting, or preparation of derivative works during the creation of training datasets, as well as during the actual training process.¹⁵ Specifically, does the comprehensive training process involve the unauthorized replication of works, and does a transformer model, once trained, retain copies or unauthorized adaptations of protected works?

¹ Daniel J. Gervais, (Re)structuring Copyright: A Comprehensive Path to International Copyright Reform 207-215 (rev. ed., 2019).
² Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417 (1984); Metro-Goldwyn-Mayer Studios Inc. v. Grokster, Ltd., 545 U.S. 913 (2005).
³ See Jessica Litman, The Story of Sony v. Universal Studios: Mary Poppins Meets the Boston Strangler, in Intellectual Property Stories 358, 366-368 (J. C. Ginsburg & R. C. Dreyfuss eds., 2006).
⁴ See Grokster, 545 U.S. at 948 (“Even if the absolute number of noninfringing files copied using the Grokster and StreamCast software is large, it does not follow that the products are therefore put to substantial noninfringing uses and are thus immune from liability.”).
⁵ See Daniel J. Gervais, The Machine As Author, 105 Iowa L. Rev. 2053, 2057 (2020) (“[M]achines are increasingly good at emulating humans and laying siege to what has been a strictly human outpost: intellectual creativity. At this juncture, we cannot know with certainty how high machines will reach on the creativity ladder when compared to, or measured against, their human counterparts, but we do know this: They are far enough already to force us to ask a genuinely hard and complex question, one that intellectual property scholars and courts will need to answer.”).
⁶ One of us has proposed to open a discussion at WIPO on the revision of the Berne Convention, which, like the work on a possible protocol to the Convention in the 1980s and early 1990s, could lead to a new instrument (in that case, the WCT). See generally Gervais, supra note 56, at Appendix [read full article here]. On the possible protocol, see Ginsburg & Ricketson, supra note 49, ¶¶ 4.15-4.18.
⁷ For example, see House of Lords Communications and Digital Select Committee, OpenAI-Written Evidence (2024), https://committees.parliament.uk/writtenevidence/126981/pdf/.
⁸ See infra Part II [read full article here].
⁹ 17 U.S.C. §§ 106(1)-106(2).
¹⁰ 17 U.S.C. § 106(1); Restatement of Copyright § 56 (Am. L. Inst., Tentative Draft No. 3, 2022).
¹¹ Berne Convention, supra note 35, art. 12 [read full article here].
¹² See Ginsburg & Ricketson, supra note 49, ¶¶ 11.30-11.41 (discussing the rights of reproduction and adaptation in the Convention) [read full article here].
¹³ See, e.g., German Act on Copyright and Related Rights (Urheberrechtsgesetz – UrhG), § 3 – Adaptations, Copyright Act of 9 September 1965 (Federal Law Gazette I, p. 1273), as last amended by Article 25 of the Act of 23 June 2021 (Federal Law Gazette I, p. 1858); Copyright, Designs and Patents Act 1988, c. 48 § 21 (UK).
¹⁴ See 17 U.S.C. § 501(a) (“Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 . . . is an infringer of the copyright or right of the author, as the case may be.”). The reference to “sections 106 through 122” includes sections 107 to 122 of the Act, which provide a set of exceptions and limitations on the rights granted to copyright owners and authors.
¹⁵ This is part of the plaintiff’s necessary prima facie case. See Sony, 464 U.S. at 432-33 (defining an infringer as “anyone who trespasses into his exclusive domain by using or authorizing the use of the copyrighted work in one of the ways set forth in the statute.”).