The Heart of the Matter: Copyright, AI Training, and LLMs - Part 1: Understanding How LLMs Actually Copy

A Four-Part Series for Corporate Legal Counsel

Introduction

A conversation I hear regularly goes something like this: the AI vendor assures the legal team that their system doesn’t “store” copyrighted works, it just learns patterns. The legal team files that away and moves on.

Meanwhile, plaintiffs’ attorneys in dozens of active lawsuits are arguing the opposite: that these systems are built on mass reproduction of protected works. The two statements cannot both be right. Your job is to know which one is.

This article extracts the technical foundation from “The Heart of the Matter: Copyright, AI Training, and LLMs” and presents it as a standalone primer for counsel who need to assess AI copyright risk without becoming computer scientists.

What follows covers the fundamentals: tokenization, embeddings, and the stages of model training. The claim that “there are no copies, only parameters” misrepresents how these systems work and does not survive contact with the technology. This primer explains why.

Understanding the technology is not about taking sides. It is about having enough grounding to assess risk and asking the right questions of your vendors without relying on someone else’s characterization (vendor or plaintiff) of what is happening inside the machine.

The following is an excerpt from Part I of the article “The Heart of the Matter: Copyright, AI Training, and LLMs,” authored by Daniel Gervais (Milton R. Underwood Chair in Law, Vanderbilt University), Noam Shemtov (Professor in Intellectual Property and Technology Law/Deputy Head of CCLS, Queen Mary University of London), Haralambos Marmanis (Executive Vice President and CTO, CCC), and Catherine Zaller Rowland (Vice President and General Counsel, CCC). It is reproduced here with the authorization of the paper’s authors. The complete paper presents the technical and legal analysis, jurisdictional comparisons, and licensing solutions.

Part I can be read in full beginning on page 484 of the Journal of the Copyright Society.

UNDERSTANDING LARGE LANGUAGE MODELS

In this Part, we review the way in which LLMs are trained and how they use human language, which many of the models learn by copying and processing copyrighted material such as books, newspapers, and journal articles.

The Role of Language

To understand the relationship between AI and copyright, we must first grasp the fundamentals of how AI systems that use large language models (LLMs) currently operate. At the core of human communication, whether verbal or written, lies the arrangement of words in sequences governed by the rules of syntax specific to a particular language, such as English, an indispensable part of the legal system.⁶ While words themselves can be single or multi-character symbols, as seen in Chinese and Japanese, from a computer science perspective, human languages are classified as “natural languages.”⁷

The study of processing and understanding natural languages is known as “Natural Language Processing” (NLP)⁸ and “Natural Language Understanding,”⁹ respectively, with the former term being more commonly used.¹⁰ When it comes to textual content, books, articles, and other text-based artifacts are essentially compilations of word sequences.¹¹ However, computers, which operate using numbers, cannot directly comprehend the words of our language.¹² Therefore, an essential part of the software architecture for all systems that process text, including AI systems, is the representation of text using numerical values that enable the system to perform the required tasks.
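To make the idea of numerical representation concrete, the following is a minimal sketch in Python. It is not taken from the paper and is not how any particular LLM works: it assumes a toy scheme in which each distinct word receives its own integer, whereas production systems typically split text into subword units. The function names (build_vocabulary, encode, decode) and the sample sentence are hypothetical and chosen only for illustration.

```python
def build_vocabulary(text: str) -> dict[str, int]:
    """Assign each distinct word an integer ID, in order of first appearance."""
    vocabulary: dict[str, int] = {}
    for word in text.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Turn a sentence into the sequence of integer IDs for its words."""
    return [vocabulary[word] for word in text.lower().split()]


def decode(ids: list[int], vocabulary: dict[str, int]) -> str:
    """Map a sequence of integer IDs back to the words they stand for."""
    id_to_word = {number: word for word, number in vocabulary.items()}
    return " ".join(id_to_word[number] for number in ids)


# Hypothetical sample sentence (an assumption for illustration only).
sentence = "the court held that the work was protected"

vocabulary = build_vocabulary(sentence)
token_ids = encode(sentence, vocabulary)
restored = decode(token_ids, vocabulary)

print(token_ids)  # [0, 1, 2, 3, 0, 4, 5, 6]
print(restored)   # the court held that the work was protected
```

In this toy scheme the word “the” appears twice and maps to the same number both times, and the sequence of numbers can be decoded back into the original wording. The numbers are simply another way of writing the same sentence in a form a computer can process.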

Conclusion

Copies exist in multiple forms throughout the AI lifecycle. That is not a plaintiffs’ attorney talking point; it is a description of how the technology actually works.

Copies are made during training and fine-tuning. In retrieval-augmented generation (RAG) systems, content is stored permanently for retrieval. Direct prompting creates a copy the moment an employee uploads copyrighted material into an AI system. Copyright law recognizes all of these as reproductions.

The question was never whether copying occurs. It does. The question is whether it is authorized or excused. That analysis is the subject of Parts II and III to follow.