Copyright, AI Training, and LLMs: Historical Perspective and Infringement Analysis


The following is an excerpt from the article “The Heart of the Matter: Copyright, AI Training, and LLMs,” authored by Daniel Gervais (Milton R. Underwood Chair in Law, Vanderbilt University), Noam Shemtov (Professor in Intellectual Property and Technology Law/Deputy Head of CCLS, Queen Mary University of London), Haralambos Marmanis (Executive Vice President and CTO, CCC), and Catherine Zaller Rowland (Vice President and General Counsel, CCC).

The full article can be read in the Journal of the Copyright Society.

There is considerable uncertainty about the future. This is potentially costly both to AI companies, which may end up paying large statutory and compensatory damages in multiple jurisdictions, and to authors and other rights holders whose livelihoods are directly affected by LLMs. The fair use debate in the United States is likely to continue for several years until one or more Supreme Court opinions shed additional light on the issue. While LLMs are a significant and new technology and may be capable of multiple non-infringing uses, not every use of them with copyrighted material is transformative. AI companies know this.

For example, an LLM that ingests scientific articles and then simply regurgitates them on demand would not perform a transformative function. Nor, in our view, would it be transformative to ingest, say, all the music written by Taylor Swift in order to produce more (but free) music “like her” in a way that infringes her rights in the ingested musical works, especially in light of Warhol. Nevertheless, some uses of LLMs and their training may be found to be fair use.

Even in the EU, where more specific legislation has been adopted, debate continues about the scope of the text and data mining exceptions, their interplay, their interface with the AI Act, and the operation of the Article 4 opt-out. In other jurisdictions, limitations on statutory exceptions, such as in Japan and Singapore, may also require additional clarification.

The rapid change and uncertainty in the realm of AI and copyright raise the inevitable question of how to legally enable users to access high-quality and compliant materials for use in AI systems, given the variability and attendant uncertainty about the scope of rights and exceptions and limitations.1 Where does the law come down on the creation of LLMs, in both the input and the output of existing copyrighted materials? The answer to this conundrum may simply lie in the time-tested solution that has proven successful during earlier periods of technological advancement: licensing.2 Licensing enables copyright owners and users to come together in a mutually beneficial manner, helping the market function more efficiently and responsibly.

There is no single global copyright law, and countries vary significantly in their approach to copyright and AI-related issues like text and data mining and transparency.3 Nor is there a single court that will hand down all decisions on copyright and AI, either within a country or globally, because the text of exceptions and limitations in national law varies greatly. However, global licenses can harmonize how copyright owners and users agree to use copyrighted works, significantly benefiting innovation and progress by setting the stage for consistent and responsible copyright uses that could lead to untold scientific and cultural advancements.

Licenses could put an end to much of the uncertainty and to both pending and potential future litigation, putting acceptable boundaries on what can and cannot be done with copyrighted material when training LLMs.

Various licensing models could play a crucial role in this progress. Direct licensing (an agreement between a copyright owner and a user) is important because it allows the parties to define terms such as payment and timing flexibly and to address specific, bespoke use cases.4 Voluntary collective licensing is also likely to play a critical role in solving the licensing puzzle, enabling users to obtain a single license that can cover thousands (or more) of copyrighted works without having to negotiate with each copyright owner individually.5 This approach is highly beneficial for both copyright owners and users, as it provides an efficient mechanism to grant and obtain permission for using copyrighted works.6

Voluntary collective licensing is uniquely equipped to handle some of the more complex issues that arise when large numbers of works are involved and many potential users are searching for an efficient mechanism to provide and obtain permission for using copyrighted works.7

One example of how this might be helpful in the AI context is a company engaged in heavy research and development activities that may want to make additional internal uses of a large number of textual works that it has acquired lawfully.8 The company may not have the bandwidth to engage in additional negotiations, while the publishers of the various scholarly journals would similarly be interested in licensing but would prefer to rely on a more streamlined approach. Importantly, voluntary collective licenses complement direct licenses, providing a framework where copyright owners and users can rely on collective licenses for many typical use cases and direct licenses for unique or individualized situations.

In the case of AI, we believe that both direct and collective licenses can be valuable in reducing uncertainty and establishing a viable ecosystem going forward.9 Some uses, such as certain training activities or general categories of outputs that need access to diffuse copyrighted materials, may be good candidates for collective licensing.10 Conversely, specific high-value or individual uses based on more defined sets of copyrighted materials could be better suited for direct licensing.11 Regardless of the approach, licensing provides both parties with compliant access to high-quality works, leading to innovative uses.

This is not unlike how licensing has worked during prior times of technological advancement.12 In the 1970s, when photocopying was the disruptive technology, both direct and collective licensing helped make the market for using copyrighted materials work.13 Similar stories exist about the early days of the Internet and the early days of text and data mining.14 Each technological advancement raised new questions, but licensing has always provided a long-standing answer to enable responsible and beneficial use of copyrighted works.

  1. See supra Part IV [read full article here].
  2. See supra note 56, at 231-56 [read full article here].
  3. See supra Part IV [read full article here].
  4. See Daniel Gervais, Collective Management of Copyright and Related Rights Ch. 1 (3rd ed. 2015).
  5. This is precisely what collective and centralized licensing does, namely allow users to use large repertoires of protected works. See id.
  6. See Daniel Gervais, The Economics of Copyright Collectives, in 1 Research Handbook on the Economics of Intellectual Property Law 489-507 (P. Menell & B. Depoorter eds. 2019).
  7. See Gervais, supra note 173 [read full article here].
  8. This is basically the fact pattern in Am. Geophysical Union v. Texaco, Inc., 60 F.3d 913, 926 (2d Cir. 1994). The court found that this activity was not a fair use.
  9. One example may be Getty Images Launches Industry-First Model Release Supporting Data Privacy In Artificial Intelligence And Machine Learning, Getty Images (Mar. 21, 2022), https://newsroom.gettyimages.com/en/getty-images/getty-images-launches-industry-first-model-release-supporting-data-privacy-in-artificial-intelligence-and-machine-learning.
  10. See, e.g., Dave Shumaker, A Work in Progress: CCC and Artificial Intelligence, Information Today (Apr. 16, 2024), https://newsbreaks.infotoday.com/NewsBreaks/A-Work-in-Progress-CCC-and-Artificial-Intelligence-163574.asp.
  11. See supra note 173 [read full article here].
  12. See infra note 184 [read full article here].
  13. This led to the formation of Copyright Clearance Center, Inc. in the United States and Reprographic Rights Organizations (RROs) around the world. For a list, see ifrro.org.
  14. For example, see Statement of Maria A. Pallante Register of Copy. U.S. Copy. Off. before the Subcomm. on Courts., Intellectual Prop. and Internet Comm. on the Judiciary, U.S. Copy. Off. (Mar. 20, 2013), https://www.copyright.gov/regstat/2013/regstat03202013.html; The Exception for Text and Data Mining (TDM) in the Proposed Directive on Copyright in the Digital Single Market- Legal Aspects (Feb. 2018), https://www.europarl.europa.eu/RegData/etudes/IDAN/2018/604941/IPOL_IDA(2018)604941_EN.pdf.

Author: CCC

A pioneer in voluntary collective licensing, CCC advances copyright, accelerates knowledge, and powers innovation. With expertise in copyright, data quality, data analytics, and FAIR data implementations, CCC and its subsidiary RightsDirect collaborate with stakeholders on innovative solutions to harness the power of data and AI.