This article was originally published in The Scholarly Kitchen.

I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

  • Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
  • Does the attribution need to be included in the data set at every stage?
  • Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated. On November 4th, a class action lawsuit — Doe 1 v. GitHub Inc., N.D. Cal., No. 3:22-cv-06823, 11/3/22 — was filed in the US District Court in the Northern District in California, alleging against Microsoft and GitHub (a Microsoft subsidiary), inter alia: violation of the DMCA; breach of contract; tortious interference in a contractual relationship; unjust enrichment; unfair competition; violation of California Consumer Privacy Act; and negligence. Also sued were a confusing mishmash of for profit and non-profit related entities all using a variation of the name OpenAI (OpenAI, Inc., OpenAI, LLC, OpenAI Startup Fund GP I, L.L.C.; you get the picture). OpenAI received one billion dollars in funding from Microsoft although they seem “officially unrelated.”

As of this writing, this case is at the earliest stage and has a long way to go before there is any sort of result. But the issues it raises are significant, especially for authors who have published content under so-called “open licenses” with attribution requirements.

Let’s start with the lawsuit’s basics. GitHub is a hosting platform commonly used to share open-source code. The impulse to create and use open source code is  reasonable and has some social utility. Many of the processing tasks contemporary software engineers are called upon to create are repetitive, and relatively well-known and understood in the literature. Open source programming is one means of addressing the burden this represents. Basically, there’s no need to keep reinventing the wheel.

Over time, licenses were developed to standardize and better manage reuse rights in code developed and deployed this way. If the code came without restrictions, that would be the easiest in terms of reuse. But people like credit for their work, even in the open world. Some open code carries relatively light requirements, for example: “Don’t use my code commercially (don’t sell it or use it in something you sell)” and, very basically, “Acknowledge my contribution (keep my name on my work).” These types of requirements are familiar in our industry given widespread use of Creative Commons licenses.

Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement. As GitHub states, “…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.” The resulting product allegedly omitted any credit to the original creators.

Open licenses have tended to be looked upon by users as a free-for-all, without adequate attention to the very real concerns of the creators. In this case, the sheer scale of the alleged violation in terms of works used may well form the basis of the defense. “Your honor, we needed so many works that it was simply not practical to ask permission of the creators.” I don’t find this argument convincing given the ability today to license many content types at scale for TDM, including images, music and yes, journal articles (See “Full disclosure” above), but it is an argument often offered by infringers.

Open licenses can create a very practical challenge for users who go beyond the terms. Mining is a legitimate use of content under a CC BY license, but if you need permission from authors to, e.g., not include attribution information, time and effort may be needed. With journals, some publishers require authors to sign copyright agreements even for content that is then published under an open license. This practice creates a single point of contact for uses that may not fit within the CC lines. Of course with the expansion of rights retention strategies, the problem of contacting all authors only becomes worse.

As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement. A material breach of a copyright license can give rise to an infringement claim, so this is an interesting move. While the plaintiffs’ attorney indicated that an infringement claim might be added later, I suspect that this was done to avoid a messy fair use dispute. The complaint includes a statement by GitHub asserting an expansive, almost global fair use assertion which is at odds with explicit relevant law in many countries and frankly at odds even with US law. Nonetheless, fair use as a defense is expensive and complicated to litigate, so perhaps they chose to focus on something that is beyond factual dispute, and still provides the same damages.


Author: Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for CCC. He is a member of, among other things, the Bar of the State of New York, the Author’s Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee. He serves on the Executive Committee of the of the United States Intellectual Property Alliance (USIPA) Board. He was the founding corporate Secretary of CrossRef, and formerly chaired its legal working group. He is a Chef in the Scholarly Kitchen and has written and lectured extensively on the subjects of copyright, licensing, open access, artificial intelligence, metadata, text/data mining, new media, artists’ rights, and art law. Kaufman is Editor-in-Chief of "Art Law Handbook: From Antiquities to the Internet" and author of two books on publishing contract law. He is a graduate of Brandeis University and Columbia Law School.
Don't Miss a Post

Subscribe to the award-winning
Velocity of Content blog