AI Rights Reservation: Human Readable is Machine Readable

An Interview with Haralambos “Babis” Marmanis

This piece originally appeared in The Scholarly Kitchen.

Ever since the passage of the EU’s Digital Single Market copyright directive in 2019, I have been especially intrigued by the rights reservation provisions relating to the commercial text and data mining (TDM) exception in Article 4. One thing that has puzzled me is the language specifying: “The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.”

TDM at its essence concerns the use of content by sophisticated machines to solve complex problems through computational power. TDM finds meaning and correlation in text. As I understood it, to the machine, everything is readable.

Maybe I was missing something in my admittedly surface-level understanding of technology.

Thankfully, I have a great go-to person on technology questions. Haralambos (“Babis”) Marmanis is Copyright Clearance Center’s polymath Executive Vice President and Chief Technology Officer. He is the author of the book Algorithms of the Intelligent Web, which introduced machine learning to a wide audience of practitioners working on everyday software applications. He holds a PhD in applied mathematics from Brown University, co-authored the first book on Spend Analysis, and has published in numerous journals, conferences, and technical periodicals.

Critically, Babis was willing to answer my questions.

It always struck me as odd that the EU specified “machine-readable means” in the context of a TDM exception that allows, for want of a better phrase, “machine reading.” What do you think of this?

Rights reservation language, whether in plain English, included in terms, or coded into, e.g., metadata, is “machine readable.” It is a choice by an AI developer to not read “human readable” rights reservation language. It is prudent best practice to reserve rights by whatever reasonable technological means are available and in human readable language, but this does not negate the fact that human readable is machine readable. Frankly, if we can build systems that can pass the Turing test, a simple statement in plain language on the website should suffice for the crawler to detect whether the content should be processed or not.

The EU was certainly capable of specifying language or required technology to use for rights reservation. Before getting into the “why,” what technologies would have been available in 2019 if the EU wanted to specify a technology, and what technologies are available today?

There are many rights expression languages that have existed since the early 2000s, if not earlier! METSRights, Open Digital Rights Language (ODRL), and MPEG-21 are examples. In particular, MPEG REL (“REL” meaning “Rights Expression Language”), as defined by ISO/IEC 21000-5, provides flexible, interoperable mechanisms to support transparent and augmented use of digital resources throughout the value chain in a way that protects the digital resource and honors the rights, conditions, and fees specified for it. Although there are differences between these RELs, it is not hard to create extensions or crosswalks.

So why do you think they did not require any specific technology?

Well, the truth is that the various RELs go into different depths in the data they specify and they take different approaches in terms of how their instructions should be processed. Perhaps the EU did not want to commit to a single standard that is best in a particular set of circumstances but not the best for other cases. This is not a material problem, in my view, but that would be a plausible explanation for their choice.

Historically, publishers and authors reserved rights with a copyright notice or the statement “all rights reserved.” STM publishers typically attach Creative Commons licenses to content that is openly available. Are those “machine readable?”

As stated above, everything is machine readable. At this stage, we could express rules in natural language and, provided that certain conventions and consistent terminology are used, it would not be hard for a machine to know what to do. In other words, there is no reason why a crawler couldn’t stop processing, and discard, a page that included the statement “All rights reserved.” If the machine is smart enough to book your appointments and make decisions for you, isn’t it smart enough to know that the publisher doesn’t want you to crawl a page? I am not advocating that as a solution; I am just making the point that any unencrypted and accessible digital asset is “machine readable.”

How, from a technology perspective, could a machine “read” this human readable text?

Well, when a crawler visits a page that corresponds to one of its target URLs, it extracts the relevant information from that page, such as the title, the headers, certain keywords, image descriptions, and so on. The crawler could immediately stop processing the page if it identifies the statement “All rights reserved,” which is how a human would know that the content is not available for uses other than the ones they have been permitted. Structured data, such as REL-based expressions, could make that far more specific in terms of what is allowed and under what conditions. But from the perspective of whether you are allowed to process the page or not, even the simplest convention would work. Incidentally, robots.txt is supposed to indicate exactly what the owner of a page wants you to do with it, but it operates merely as a suggestion: it is entirely up to the crawler whether it will comply with its instructions.
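To make the point concrete, here is a minimal sketch in Python of the kind of check described above. It is illustrative only and is not offered as a recommended solution or as how any particular crawler works. The function names, the “ExampleBot” user agent, the example URLs, and the choice of “All rights reserved” as the only trigger phrase are all assumptions made for the example. It uses only the Python standard library: the crawler consults robots.txt rules voluntarily, extracts the visible text of a page, and declines to process the page if the reservation phrase appears.

```python
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumed convention for this sketch: this phrase signals a rights reservation.
RESERVATION_PATTERN = re.compile(r"all\s+rights\s+reserved", re.IGNORECASE)


class VisibleTextExtractor(HTMLParser):
    """Collects page text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def has_rights_reservation(html: str) -> bool:
    """Return True if the page text contains the reservation phrase."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return RESERVATION_PATTERN.search(text) is not None


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Apply robots.txt rules; honoring the answer is the crawler's choice."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


def may_process(html: str, robots_txt: str, user_agent: str, url: str) -> bool:
    """Process a page only if robots.txt permits it and no reservation is found."""
    if not robots_allows(robots_txt, user_agent, url):
        return False
    return not has_rights_reservation(html)


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    page = "<html><body><p>© Example Press. All rights reserved.</p></body></html>"
    print(may_process(page, robots, "ExampleBot", "https://example.com/articles/1"))
```

A plain phrase match like this can only answer the yes-or-no question of whether to process the page. Expressing what is allowed, and under what conditions, is where a structured rights expression would be needed.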

Wouldn’t that mean any rights reservation is “machine readable?”

Yes, any rights reservation today is machine readable. Of course, if one is interested in expressing the rights in great detail and in capturing complicated use cases then a rights expression language would have to be used.