Research Results, Replicability, & Retractions

As with my earlier post in this series, on the overall role of peer review in the scholarly and scientific publishing process, we are reposting our pieces in this series for Peer Review Week 2022. Here is a little bit of updated information on the topics we covered:

* Psychology, in particular, seems to have a replicability problem; even outside of psych, everyone knows that bad papers are still getting published (and they may even have a discoverability advantage).

* Retraction Watch is still engaged in its useful work, and IMHO, everyone in scholarly and scientific publishing should subscribe to it. For example, this Retraction Watch collection highlights COVID-19 papers that had to be pulled.

Earlier in this series, I talked about the “QA” aspects of formal peer review; and then we took a look at preprints, the Versions of Record, and postprints. In this post, I’m focusing on the Why and the How of reproducibility, of replication studies. At the end of this post I’ll wrap up with a brief look at the graveyard of dead papers, the dread retraction.

A long time ago (1934 or so; from 1959 in English), the Viennese philosopher of science Karl Popper articulated what has since become known as the “criterion of falsifiability”: the notion that, for a proposition to be considered scientific, it must be capable, at least in principle, of being refuted by an experiment. It may not be a perfect criterion, but applying it does serve to separate science from pseudo-science, and that is, in the words of Martha Stewart, “a good thing.”

In his book, The Logic of Scientific Discovery, Popper wrote:

“We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence,’ but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.”

Admittedly, that is all a bit jargon-y; but what I see Popper getting at here is the noble practice, in scientific research and publishing, of demanding that experiments be re-run to see if their results hold up. Ideally, the initial results will check out; but if they don’t, maybe the methods were invalid or the data wasn’t of sufficient quality in the first place. In other words, if something is amiss it is much better to find out as soon as possible. In biomedicine, this testing rigor is enforced through procedures of testing in vitro (essentially, in a dish), then in vivo (i.e., on live subjects, starting with animal testing), then on human subjects (control groups and all that) and, with luck, final approvals from the FDA (or other authorities) for specific uses. In essence, what we see in this instance is the scientific process at work, striving to provide treatments that are both safe and effective for doctors to prescribe and for people to use.

At the most abstract level, scientific studies – especially experiments which are designed to ferret out new knowledge – produce outcomes that indicate… something. Maybe the something is “the null hypothesis” – effectively, that no meaningful relation between this and that shows up in the results; or some new and interesting result may be uncovered, as when penicillin was shown to have a strong, general antibiotic effect.

But such discoveries are not complete, and are not considered reliable, until they have been replicated by others. In basic terms, replication studies are those in which the original experiments are re-run, using the same conditions, data and procedures, to see if the results come out the same. Of course there is more to it than that.

Although I am sure there are many others, the greatest failure-to-replicate example that I know of is the Fleischmann-Pons “cold fusion” debacle of 1989. (I recall following the controversy in Usenet’s sci.physics newsgroups in near-real-time. It was fascinating, even to an outsider like me.) Bear in mind that “cold fusion” (sometimes referred to as “desktop fusion”) would certainly have changed the world of energy and fuel production, if the effect were real.

The summary paragraph provided in the Wikipedia entry on “Cold Fusion” is precise and to the point:

“In 1989, two electrochemists, Martin Fleischmann and Stanley Pons, reported that their apparatus had produced anomalous heat (“excess heat”) of a magnitude they asserted would defy explanation except in terms of nuclear processes. They further reported measuring small amounts of nuclear reaction byproducts, including neutrons and tritium. The small tabletop experiment involved electrolysis of heavy water on the surface of a palladium (Pd) electrode. The reported results received wide media attention and raised hopes of a cheap and abundant source of energy.”

Boiled down to essentials, the initial claims of the Fleischmann-Pons team concerning detectable energy production, and particularly their inference as to its source, failed to replicate and failed to hold up under closer scrutiny. The more specialists refined the experimental procedure, the smaller the net energy production effect appeared, where it appeared at all. Although unfortunate for the reputations of those two scientists, the replication step did its job; relatively quickly it showed that there were significant problems in the experiment and that the claimed results were not to be relied on.

Note: At this distance, it appears that the Fleischmann-Pons and follow-on experiments found something interesting, but not energy at the levels they thought, and not anything that should be referred to as “cold fusion.”

Finally, even where the underlying research and the resulting article have cleared all the usual hurdles, and have been published in a reputable (not fly-by-night) journal, sometimes, though rarely, major problems with the overall work are identified after the fact. The procedure for addressing such problems is known as “making a retraction” and no one likes to do it. The author or research team can understandably feel defeated or even rejected, the editor and journal likely feel they have suffered a loss of reputation, and readers may rightly feel let down by both. Retractions can occur fast, or they can be slow and take years to complete. For those who need to follow such things, Retraction Watch is a good aggregator of retraction events.

According to the Oxford Dictionaries, “quality assurance” may be functionally defined as “the maintenance of a desired level of quality in a service or product, especially by means of attention to every stage of the process of delivery or production.” Throughout this short series, we’ve focused on the many steps, and all stages, used in scientific and scholarly publishing to ensure and improve on the quality and reliability of published articles. As is often said about peer review, and I think it is equally true of the other steps, these are the inglorious (or, maybe, simply non-glorious) but necessary procedures to enforce if a top-quality product and reputation are to be earned and maintained.