There is a place in the recent Star documentary “Dopesick” where the investigators, seeking to validate the OxyContin manufacturers’ claim that the drug is non-addictive, search the NEJM for the article said to be the source of that claim. They cannot find it in the year cited.
What is more, they cannot find it in the previous or subsequent years. This being the first decade of the century, they resort to keyword searching. Eventually they find the source which is said to validate this claim. It turns out to be a five-line letter in the correspondence columns, but the screen shows the thousands of references raised by the keyword searching and a long and harrowing task for the researchers. As I watched it, I reflected how far the world had moved on from those haphazard searching days, along with the realization that we have not progressed quite as far as we sometimes think, as our world of information connectivity keeps having to track back before it goes forward.
The great age of keyword searching was succeeded in the following decade by the great age of AI and machine learning hype. We welcomed the world of unstructured data in which we could find everything because of the power and majesty of keyword searching and because of our effectiveness in teaching machines to become researchers. And we will never know how much we spent or how many blind alleys we went up as we tried, prematurely, to apply systems that only worked when we built in enough information to enable us to be categorical and certain about the results that we achieved. And in science, and in health science in particular, categorical certainty is what we must achieve at a bare minimum.
We will never know the cost in time and money because it is not in the interests of anybody to tell us. But we can see, as we move forward again, that there is a wide recognition now that fully informative metadata has to be in place, and that identifiers have to point to a knowledge container in order to point machine intelligence in the right direction. All the work which we did in earlier decades on taxonomies, ontologies and metadata was not wasted, but simply became the foundation of the intelligent and machine-interoperable data society which we seek now to build. We are developing a clear understanding that the route to the knowledge we seek lies in the quality of metadata, and the ability to rely upon primary data which is fully revealed in the metadata. Many of us now accept that primary data (by which I mean research articles and evidential data sets) will exist in future for most researchers as background: most of their research will take place in the metadata, and they will dip into the primary data only on rare occasions. There is a clear parallel here to other sectors, like the law, where the concepts and commentary become more important than the words of enactment.
And so, as the age of reading research articles comes to an end, the business of understanding what they mean has to take further strides forward. This means that existing players in the data marketplace have to reposition, and this week’s announcements from CCC exemplify that repositioning in a very clear manner. CCC is an extremely valuable component part of scholarly communications: its history in the protection of copyright and the development of licensing has been vital to helping data move to the right places at the right times. But CCC management know that this is no longer quite enough: their role as independent producers of the knowledge artefacts that will make the emerging data marketplaces succeed is now coming to the fore. In two announcements this week they demonstrate that transition. One is the acquisition of Ringgold, an independent player in the PID marketplace. Amongst all metadata types, PIDs and DOIs, alongside ORCID, have enabled the scope of the communications database marketplace to emerge.
PID stands for persistent identifier. In the case of Ringgold, it means making sense of organizational chaos, thus disambiguating some of the most confusing aspects of the sector. Just as ORCID sought to disambiguate the complex world of authorship, so Ringgold seeks to help machines reading machine-readable data understand the nature and type of the organizations and activities that are being described. And then extend this one stage further. It is one thing to know these fixed points of metadata knowledge, quite another to use and manipulate them effectively to create the patterns of connectivity upon which a data-driven society depends.
While announcing the acquisition of Ringgold, CCC also announced a huge step forward in the developing world of knowledge graphs. Putting labels on things is one thing: turning them into maps of interconnected knowledge points is another. CCC have been brewing this activity for some time, as their COVID research development work demonstrates. It is impressive to see this being done, and being done by a not-for-profit with a long history of service to the sector and of neutrality towards the competition between its major players. We are increasingly aware that some things have to be done for the benefit of the entire sector if the expectations of researchers, and of the sector as a whole, are to be met.
And of course, those competitive major players are also moving forward. The announcement in the last two weeks by Elsevier that they have made a final bid for the acquisition of Interfolio is a clear indicator. This implies to me a thesis that article production, processing and distribution is no longer in the development marketplace. We are now critically aware that the business of producing, editing and presenting research articles may be fully automated within the next decade. With even more development work going into the automation of peer review checking, human intervention in the cycle of research publication may become supervisory, and the bulk of production work may shift to the computer on the researcher’s desk.
There will still be questions of brand, of course. There will still be questions of assessment and valuation of research contributions to be made. And it is this latter area which becomes a major point of interest as Interfolio takes Elsevier into the administration of universities and the administration of funding in a vital and interesting way. This is the place where new standards of evaluation will be developed. When the world does let go of the hand of the impact factor which has guided everybody for so long, then Elsevier want to be in the centre of the revaluation. In other words, if Elsevier as market leader formerly competed most with fellow journal publishers, its key competitor in future development may well be companies like Clarivate.
The critical path of evaluation will be the claims made by researchers for their work, and the way those claims are validated by reproducibility in subsequent work. In this connection we should also note the work of FAR and its collaborators in developing ideas around “nano publishing”, using metadata to clearly outline the claims made by researchers, both in the conclusions of their articles and in other places as well. If this had been in place, the OxyContin investigators would have found it easy to find that letter back at the beginning of the century.
This piece was re-published with permission by David Worlock.