
AI Rights Reservation: Human Readable is Machine Readable


An Interview with Haralambos “Babis” Marmanis

This piece originally appeared in The Scholarly Kitchen.

Ever since the passage of the EU’s Digital Single Market copyright directive in 2019, I have been especially intrigued by the rights reservation provisions relating to the commercial text and data mining (TDM) exception in Article 4. One thing that has puzzled me is the language specifying: “The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.”

TDM at its essence concerns the use of content by sophisticated machines to solve complex problems through computational power. TDM finds meaning and correlation in text. As I understood it, to the machine, everything is readable.

Maybe I was missing something in my admittedly surface-level understanding of technology.

Thankfully, I have a great go-to person on technology questions. Haralambos (“Babis”) Marmanis is Copyright Clearance Center’s polymath Executive Vice President and Chief Technology Officer. He is the author of the book Algorithms of the Intelligent Web, which introduced machine learning to a wide audience of practitioners working on everyday software applications. He holds a PhD in applied mathematics from Brown University, co-authored the first book on Spend Analysis, and has published in numerous journals, conferences, and technical periodicals.

Critically, Babis was willing to answer my questions.

It always struck me as odd that the EU specified “machine-readable means” in the context of a TDM exception that allows, for want of a better phrase, “machine reading.” What do you think of this?

Rights reservation language, whether in plain English, included in terms, or coded into, e.g., metadata, is “machine readable.” It is a choice by an AI developer to not read “human readable” rights reservation language. It is prudent best practice to reserve rights by whatever reasonable technological means are available and in human readable language, but this does not negate the fact that human readable is machine readable. Frankly, if we can build systems that can pass the Turing test, a simple statement in plain language on the website should suffice for the crawler to detect whether the content should be processed or not.

The EU was certainly capable of specifying language or required technology to use for rights reservation. Before getting into the “why,” what technologies would have been available in 2019 if the EU wanted to specify a technology, and what technologies are available today?

There are many rights expression languages that have existed since the early 2000s, if not earlier! METSRights, Open Digital Rights Language (ODRL), and MPEG-21 are examples. In particular, MPEG REL (“REL” meaning “Rights Expression Language”), as defined by ISO/IEC 21000-5, provides flexible, interoperable mechanisms to support transparent and augmented use of digital resources throughout the value chain in a way that protects the digital resource and honors the rights, conditions, and fees specified for it. Although there are differences between these RELs, it is not hard to create extensions or crosswalks.

So why do you think they did not require any specific technology?

Well, the truth is that the various RELs go into different depths in the data they specify and they take different approaches in terms of how their instructions should be processed. Perhaps the EU did not want to commit to a single standard that is best in a particular set of circumstances but not the best for other cases. This is not a material problem, in my view, but that would be a plausible explanation for their choice.

Historically, publishers and authors reserved rights with a copyright notice or the statement “all rights reserved.” STM publishers typically attach Creative Commons licenses to content that is openly available. Are those “machine readable?”

As stated above, everything is machine readable. At this stage, we could express rules in natural language and, provided certain conventions and consistent terminology are used, it would not be hard for a machine to know what to do. In other words, there is no reason why a crawler couldn’t stop processing and discard a page that included the statement “All rights reserved.” If the machine is smart enough to book your appointments and make decisions for you, isn’t it smart enough to know that the publisher doesn’t want you to crawl a page? I am not advocating that as a solution; I am just making the point that any unencrypted and accessible digital asset is “machine readable.”

How, from a technology perspective, could a machine “read” this human readable text?

Well, when a crawler visits a page that corresponds to one of its target URLs, it extracts the relevant information from that page, such as the title, the headers, certain keywords, image descriptions, and so on. The crawler could immediately stop processing the page if it identifies the statement “All rights reserved,” which is how a human would know that the content is not available for uses other than the one that they have been permitted to have. Structured data, such as REL-based expressions, could make that far more specific in terms of what is allowed and under what specific terms. But from the perspective of whether you are allowed to process the page or not, even the simplest convention would work. Incidentally, robots.txt is supposed to indicate exactly what the owner of a page wants you to do with it, but it operates merely as a suggestion: it is entirely up to the crawler whether it will comply with its instructions.
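To make the point concrete, the check described in this answer can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a description of how any particular crawler works: the phrase list, the function names, and the "ExampleBot" user agent in the usage note are all assumptions made for the example, and a real system would need agreed conventions for which statements count as a reservation.

```python
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical phrase list (an assumption for this sketch, not a standard).
RESERVATION_PATTERNS = [
    re.compile(r"all\s+rights\s+reserved", re.IGNORECASE),
]


class _TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def page_text(html):
    """Return the page's text with whitespace collapsed to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def has_rights_reservation(html):
    """True if the page text contains a plain-language rights reservation."""
    text = page_text(html)
    return any(pattern.search(text) for pattern in RESERVATION_PATTERNS)


def robots_allows(robots_txt, user_agent, url):
    """True if the given robots.txt content permits user_agent to fetch url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


def may_process(html, robots_txt, user_agent, url):
    """Process only if robots.txt allows it and no reservation is found."""
    return robots_allows(robots_txt, user_agent, url) and not has_rights_reservation(html)
```

For example, calling may_process on a page whose footer reads "All rights reserved." returns False even when robots.txt permits the hypothetical "ExampleBot" to fetch the URL. Whether a crawler actually runs a check like this is, as noted above, a choice made by its developer.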

Wouldn’t that mean any rights reservation is “machine readable?”

Yes, any rights reservation today is machine readable. Of course, if one is interested in expressing the rights in great detail and in capturing complicated use cases then a rights expression language would have to be used.


Author: Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for CCC. He is a member of, among other things, the Bar of the State of New York, the Authors Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee. He serves on the Executive Committee of the United States Intellectual Property Alliance (USIPA) Board. He was the founding corporate Secretary of CrossRef, and formerly chaired its legal working group. He is a Chef in the Scholarly Kitchen and has written and lectured extensively on the subjects of copyright, licensing, open access, artificial intelligence, metadata, text/data mining, new media, artists’ rights, and art law. Kaufman is Editor-in-Chief of "Art Law Handbook: From Antiquities to the Internet" and author of two books on publishing contract law. He is a graduate of Brandeis University and Columbia Law School.