Tuesday, July 8, 2014

In Defense of Replication Studies

There’s been a recent flurry of activity on the Internet about a paper written by Harvard social psychologist Jason Mitchell, the full text of which can be read here: http://wjh.harvard.edu/~jmitchel/writing/failed_science.htm.  The crux of the issue seems to be that Dr. Mitchell apparently sees little value in replication studies or in the publication of negative results, a notable and alarming reversal of the current trend among reputable scientists, who decry the lack of those very types of publications in most major journals for reasons I will discuss briefly (though by no means completely) in this response.

Dr. Mitchell received his B.A. and M.S. from Yale and his Ph.D. from Harvard, and is now a professor of psychology at Harvard where he is the principal investigator at the University’s Social Cognitive and Affective Neuroscience Lab (http://www.wjh.harvard.edu/~scanlab/people.html).  I say this to point out that Dr. Mitchell’s credentials appear impeccable, at least on paper.  He’s a professor at one of the world’s most prestigious universities (though the merit of such prestige in education is often called into question, that is a discussion for another day), and appears to have a consistent flow of publications in the scientific literature, much of which, though I am completely unfamiliar with his work beyond this single paper in question, appears to be of significant interest.  Having established those credentials, the duty now falls upon my shoulders to convince you that despite an apparently productive career in social science, Dr. Mitchell appears never to have received even the most rudimentary education on the basics of the scientific method, either through oversight on the parts of his instructors or, more likely, inattention on Dr. Mitchell’s part during those key lectures.

It is strongly recommended that you either read Dr. Mitchell’s paper, “On the emptiness of failed replications” in its entirety before returning to this document or that you read it alongside this discussion so that his argument can be made to you in his own words.  I would not wish to be accused of misrepresenting his argument.  Nevertheless, I will proceed through the article point-by-point, providing significant commentary along the way and quoting the source material, though sparingly, so as to provide direct refutations.

Dr. Mitchell’s article begins with a bullet-pointed listing of six postulates, each one of which is dead wrong.  I will attempt my exploration of the faults in Dr. Mitchell’s paper by examining each of these points in turn.  The bulk of the paper is simply Dr. Mitchell’s supporting arguments and evidence (such as they are) for these six points.  As such, the bulk of the paper, though not often directly quoted here, will be addressed under the headings of the six claims.

1) “Recent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.”

Several years ago, I had a chance encounter on the Internet with a gentleman who was pursuing his doctorate in applied physics, specializing in acoustics.  We became acquainted through commentary on my girlfriend’s page on a social media website during a discussion of evolutionary science and creationist dogma, during which debate this gentleman revealed that, despite his scientific training, he was a young earth creationist and that, further, he believed physics supported his position.  Amongst his misunderstandings were claims that because the Sun is burning up, it should be getting smaller, and a belief that Einstein’s theory of special relativity suggests that as an object approaches the speed of light, it loses mass (when in reality, objects approaching light-speed approach infinite mass).  I mention this frustrating conversation because until now, it was the greatest misunderstanding of science I had ever heard from someone claiming any degree of professional training in the sciences.  Dr. Mitchell has the dubious honor of having surpassed that creationist’s achievement.  This creationist, at least, made a show of doing real science and claiming the evidence supported his arguments (however misguided those claims were).  Dr. Mitchell’s approach to science, if I dare call it an approach to science, appears to suggest that any study failing to confirm the experimenter’s hypothesis is useless.

For those of you who aren’t already either rolling off your chair in fits of uncontrollable laughter at Dr. Mitchell’s expense or banging your head against your desk in frustration for much the same reason, I will pause for a moment to explain the ludicrousness of Dr. Mitchell’s position (and offer the promise of further hilarity to follow).

To begin with, the “hand-wringing,” as Dr. Mitchell dismissively calls a growing collective concern amongst scientists, is very well-deserved.  If you follow the scientific world, you may have heard of something called “publication bias.”  The idea is that journals tend to like to publish positive results of exciting experiments because those grab headlines and help sell the publication to professional readers.  There’s nothing particularly evil about this on its face, except when you realize that replication is a key part of the scientific process for reasons we’ll discuss in greater depth later on (but it basically comes down to being sure that a published result wasn’t just a phantom due to random chance or experimenter error), and that these replication studies (being the “un-sexy” sort of work that just sets out to question or to establish the credibility of previously published work) find extremely limited venues for their publication.  When they are published, and there is certainly no guarantee they will be, it is often in obscure journals that fail to reach even a sizeable fraction of the readership of the original paper.  The result of this, concern over which is dismissed by Dr. Mitchell as “pointless” and “hand-wringing,” is that erroneous papers which reach publication (yes, despite all the best efforts, erroneous information does get published either due to oversight or, more rarely, deliberate misrepresentation of research in order to get published) may wait a considerable time before they are corrected--if, indeed, they are ever corrected.  This means there is a distinct possibility (nay: probability) that some indeterminate amount of the information accepted into the body of scientific knowledge is wrong.
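
To put rough numbers on that last claim, here is a minimal back-of-the-envelope sketch in Python.  The function name and every figure in it (the share of tested hypotheses that are actually true, the statistical power, the significance threshold) are my own illustrative assumptions, not data from any study or from Dr. Mitchell’s paper.  It simply asks: if journals published only positive results, what fraction of those published results would be false positives?

```python
def false_positive_share(prior_true, power, alpha):
    """Fraction of positive (publishable) results that are false positives.

    prior_true: assumed share of tested hypotheses that are actually true
    power:      assumed chance a study detects a true effect
    alpha:      assumed chance a study "detects" an effect that is not there
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical inputs: 10% of hypotheses true, 50% power, 5% threshold.
    share = false_positive_share(prior_true=0.10, power=0.50, alpha=0.05)
    print(f"Share of positive results that are wrong: {share:.0%}")  # about 47%
```

Under those made-up inputs, nearly half of the positive findings would be wrong, and without published replications or negative results nothing in the literature would flag them.  Different assumptions give different figures; the point is only that the share need not be small.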

None of this is intended to cast doubt upon science as a method of knowing. Indeed, the scientific method, when properly applied, is specifically designed to avoid just this sort of situation.  The problem we currently face with the issue of publication bias in the sciences is not a problem with the science, but with the politics that have come to dominate within the halls of academia, and to which science unfortunately often takes a backseat in the minds of the administrators who perpetuate the problem.  This, however, is not intended to be a referendum on politics in academia, but a discussion of the flaws with Dr. Mitchell’s little paper, so I will refrain from heading down the rabbit hole (some might call it a black hole) of academic politics.

Even if replication studies were not of any importance, however--even if Dr. Mitchell’s apparent assumption that original research is always flawless were completely and undeniably true--there would still be much to find fault with in just this first bullet point.  He claims that “unsuccessful experiments have no meaningful scientific value.”  There is a bit of an ambiguity in that statement, and the Principle of Charity would compel me to address the best possible interpretation of his claim.  I will do so, though I will then explore the more troubling interpretation because I actually believe the more troubling interpretation to be the interpretation Dr. Mitchell originally intended.

The ambiguity has to do with the phrase “unsuccessful experiments.”  By that does Dr. Mitchell mean an experiment which has been compromised by error?  Or does he mean an experiment which yields negative results?

Let us examine the former.  If he does indeed mean to discuss experiments which have gone wrong, and yielded inaccurate information due to some experimental error (or even chance fluctuations), then he is arguably correct (though barely so) in suggesting that these experiments have no meaningful scientific value.  The problem, however, is that by conflating this statement with a condemnation of replication studies, he betrays an assumption that original research is always performed with greater accuracy than replication studies.  To be sure, this is sometimes the case.  I am by no means suggesting that a replication study is of greater merit than its predecessor.  What I am saying, and what I believe any competent scientist would say, is that when two studies show up with contradictory results, it indicates that at least one of them contains some kind of error.  It is then for the scientific community to conduct further examination (whether that is a closer reexamination of the data or a completely new experiment) in order to determine which.  Certainly it is of scientific value to determine which of two contradictory studies is invalid, even if that means we then determine that this particular study is completely invalid and without value.  Unless we assume the infallibility of original research, these negative replication studies do provide scientific value because they help us to determine which of the original studies need to be reexamined.  Furthermore, even completely failed experiments often lead scientists to explore new, previously unconsidered hypotheses, so there is indirect scientific value in that way as well.

I do not, however, suspect that this is what Dr. Mitchell intended to say.  Rather, it is my assumption, based on phrasing later in the article equating the term “scientific failure” with “an experiment [that] is expected to yield certain results, and yet… fails to do so,” that Dr. Mitchell means an “unsuccessful experiment” to refer to any experiment which fails to support the researcher’s hypothesis.  This is a much more troubling interpretation of his words, however, for two primary reasons.

The first, and arguably less important (though it is of particular importance to me personally as a student of not only the practice but the philosophy of science) problem with this statement is that it equates the negative result with a failure. Yes, we all become attached to our pet hypotheses, but a negative experiment, if viewed through the proper lens of pure scientific inquiry, is not a failure. It is a monumental success, for it has shown that the experimenter’s assumptions had been incorrect. There is something else at work. There is something new to learn. Isaac Asimov famously said that “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka’ but ‘That's funny....’” What he meant by that is that true scientific discovery stems not from experiments that confirm what we already suspect to be true, but from those that show us there is something entirely unexpected, just waiting to be discovered. Science would be a sorry practice indeed if we all just went around trying desperately to prove ourselves right without the slightest consideration that there might be more to the universe than we suspected. And so it is the negative experimental result which often leads us in those unexpected but fruitful paths upon which the most profound discoveries are made. Surely Dr. Mitchell is familiar with this philosophical approach to pure scientific inquiry, but his paper gives no indication of it.

Of greater significance is the fact that, putting philosophy aside, his statement is just plain wrong. Negative results are of great “meaningful scientific value.” Science is as much about figuring out what isn’t so as it is about figuring out what is. Indeed, the very essence of the scientific method, apparently taught more thoroughly to fifth-grade science fair competitors than to Harvard researchers, is the practice of formulating testable hypotheses and then attempting to falsify them in order to determine the likelihood of their accuracy. The hypothesis that is not falsified may be tentatively accepted as true (though subject, much to Dr. Mitchell’s apparent displeasure, to further testing and review), while the hypothesis that is falsified is discarded so the scientist may move on to more fruitful pastures. This is the most basic principle of scientific research, and to have to explain it in a paper in response to a credentialed professor at a prestigious center of learning is troublesome to say the least. Negative experimental results indicate falsified hypotheses. Yes, false negatives can occur, so it is worth replicating even negative results, but that certainly doesn’t mean they’re of no scientific value.

Perhaps Dr. Mitchell or someone of his opinion would counter by saying something along the lines of, “Well, that’s all very good, but it’s not important to publish the negative findings. Falsified results may direct a researcher away from a point of inquiry, but are of no value to the larger community in and of themselves.” Obviously this is not so. Knowing of work that has not been supported is valuable to the scientific community at large for precisely the same reason it is important to the individual researcher: it helps us to direct further research. Even putting aside scientific curiosity and a drive to understand the world as much as we possibly can, there is a very good economic reason to desire greater publication of negative results. Grant money is notoriously hard to come by. Even Dr. Mitchell makes a nod toward this fact when he writes, “Science is a tough place to make a living. Our experiments fail much of the time, and even the best scientists meet with a steady drum of rejections from journals, grant panels, and search committees.” This is all very true, and having it spelled out in Dr. Mitchell’s own essay saves me the trouble of having to make exactly the same point in opposition to his thesis. Science is, as Dr. Mitchell says, a tough business. It is very difficult to get grant money. The more involved the work, the more difficult it is to fund. This is Economics 101. So why, oh why, should we want to endlessly reinvent the wheel? Replication studies are essential to avoid both false positives and false negatives, but they are specifically designed as replications. Imagine if Scientist A falsifies his hypothesis after ten years of hard work and then, either by choice or because publications shy away from such things, his work is not published. Later, Scientist B stumbles upon a similar (or identical) hypothesis. She then applies for and receives a grant to look into it. 
She spends her six figures of grant money and ten years of her life, and finds an identical result. Had Scientist A published his findings, she might never have made the investment.

Make no mistake, if Scientist B wishes to conduct the study as a replication study, she is well-advised to do so.  Replication is essential.  It’s very possible that Scientist A made some mistake in his original experiment, and Scientist B might be able to correct that mistake.  However, such replications become meaningless when negative results are never published.  This view that negative results are of no scientific value dooms generations of scientists to endlessly follow the same dead-end trails.  It slows scientific progress, costs millions of dollars of grant money which could be better spent elsewhere, and wastes the productive time of countless scientists.  Let’s not pretend we have an overabundance of qualified scientists, either.  Every man-hour is precious, especially in a world where so much of the general population is far more content to spend their lives watching television than working in a laboratory.

I will close this discussion of Dr. Mitchell’s first bullet-point (oh yes, we still have five more of his inane bullet-points, plus several points from the main body of the article to get through before we draw this discussion to an end) with a personal story.  Some years back, I was asked to participate as a judge for a local private school’s science fair, a duty I was happy to perform.  While wandering from presentation to presentation with my fellow judges, I noticed something of a trend amongst the entries.  Namely, most were very traditional (one might be tempted to say clichéd) science fair projects.  This is no less than one would expect from a school limited to kindergarten through eighth grade, so I did not judge particularly harshly, but I did make a mental note that for many of the students, the science fair was about producing a flashy display.  There was a remote-controlled robot or two, several volcanoes, and many presentations along those lines.  The quality of display was occasionally impressive, but there was very little science actually being done.  Then I happened across one of the last entries of the day.  It was from a student whose family had recently immigrated from Mexico.  His English, though far more impressive than my Spanish would be given a similar amount of time to study, was extremely limited, and his family had very little money with which to purchase supplies, but he wanted to enter the science fair nonetheless.  Unable to afford flashy props, he did a simple experiment.  He filled basketballs to various levels of air pressure to determine which was the most bouncy.  He hypothesized that the fullest ball would be the bounciest.  To test this, he filled one ball to regulation pressure, overfilled one, and underfilled another.  He found, contrary to his hypothesis, that the medium-filled ball was actually the bounciest.  
Granted, this was not a rigorously controlled scientific experiment that would be worthy of publication in even the most lenient of journals.  However, this student was the only one of the many entries to actually do real science.  He conducted a proper experiment, achieved a result that did not support his hypothesis, and wrote up his display (with his teacher’s help to get his English right) to tell us all about what he had found.  I do not recall the results of the science fair once all the judges’ scores were compiled, but he received my highest marks.  If he had taken Dr. Mitchell’s postulate that “failed” experiments are of no scientific value to heart, that would never have taken place.

2) “Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.  Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them.”

Upon reading this statement, I held out some hope that clarification would be forthcoming in the body of the text; clarification that might serve to negate the glaring oversight in Dr. Mitchell’s claim.  Indeed, further clarification was provided, but instead of negating his error, Dr. Mitchell doubled down on his mistake.

Lest I get ahead of myself as I explore this idea (albeit in much briefer terms than the previous point), allow me to bludgeon you, dear reader, with the obvious: Dr. Mitchell fails to account for the fact that the replicator may be a more skilled experimenter than the scientist who produced the original finding.

Dr. Mitchell is correct about one thing in this analysis.  It is clearly possible that the replicator might have “bungled something along the way.”  It happens.  As humans, we err.  This is undeniable and hardly worth pointing out.  Except, it seems that Dr. Mitchell struggles not only with the philosophical side of science, but also with the self-evident traits of humanity.  Certainly, this is a forgivable oversight, however.  He is, after all, only a scientist working in a discipline dedicated to understanding the traits of humanity.  But I digress.

The problem is that the statement can easily be reversed.  Let me give it a try: “Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any positive experimental result will always be that the researcher bungled something along the way.  Unless original research is conducted by flawless experimenters, nothing interesting can ever be learned from it.”  If that sounds to you like absolute garbage, you are absolutely correct.  Dr. Mitchell’s great failure is in assuming inerrancy on the part of original researchers and incompetence on the part of replicators.  In reality, replicators and original researchers are often the very same people.  It should be part of every reputable researcher’s job to do both original research and replication studies as the need arises for either.  There would be nothing wrong with specializing in one or the other, but a well-balanced approach to research by doing some of both is probably the best way to advance not only the collective scientific knowledge but one’s personal knowledge of one’s own discipline.  Putting aside the old bugaboo of academic politics, I would think the best way to advance the goal, not necessarily of career advancement but of scientific advancement, would be to do a bit of both.  Never mind all that, though.  Let’s assume for the moment that we have entered into a fantasy world where scientists are allowed to do either original research or replications but not both.  Is there some magical force that bestows competence disproportionately upon one rather than the other?  Of course not.  There will be incompetents and geniuses on both sides, and the average will always be average.

Dr. Mitchell is correct that experimental error is a problem that needs to be addressed in any replication study, and though he seems to forget that the same is true of original research, he is correct to suggest that examining replications for experimental error is a worthwhile pursuit.

What Dr. Mitchell seems not to understand is that replication is not an argument that an experiment is somehow better the second time it is performed or when done in a different laboratory than in the first case.  The point of replication is that, just as he argues that there can be mistakes in replication experiments, there are mistakes or unknown factors in original research, too.  Replication is essential to determine the robustness of a finding.  If ten studies show a finding to be valid and a new study fails to replicate it, we still examine all eleven, though we do so with the assumption that the fault most likely lies in the new study.  However, if only two studies have been done, we must examine both very carefully to determine which is more likely correct.  There is the further possibility that all of the studies, even with their conflicting results, can be valid, and that there is just some small change in experimental conditions that renders the studies different.  This could lead to entirely new discoveries.
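
The difference between the eleven-study case and the two-study case can be made concrete with a small Bayesian sketch.  Everything here is an assumption for illustration: the function name, the 50% starting belief, the 80% power, the 5% false-positive rate, and above all the premise that the studies are independent and all get published.  It is a toy model, not a description of how any real literature behaves.

```python
def posterior_real(prior, positives, negatives, power=0.8, alpha=0.05):
    """Probability the effect is real after a set of independent studies.

    prior:     assumed starting belief that the effect is real
    positives: number of studies that found the effect
    negatives: number of studies that did not
    power:     assumed chance a study finds a real effect
    alpha:     assumed chance a study "finds" an effect that is not real
    """
    # Likelihood of this exact mix of results if the effect is real...
    like_real = power ** positives * (1 - power) ** negatives
    # ...and if there is no effect at all.
    like_null = alpha ** positives * (1 - alpha) ** negatives
    return prior * like_real / (prior * like_real + (1 - prior) * like_null)


if __name__ == "__main__":
    # One original study and one failed replication: far from settled.
    print(round(posterior_real(0.5, positives=1, negatives=1), 2))  # about 0.77
    # Ten supporting studies and one failed replication: very close to 1.
    print(posterior_real(0.5, positives=10, negatives=1))
```

With only two conflicting studies, the toy model leaves real doubt in both directions, which is exactly why both deserve careful scrutiny.  With ten supporting studies, a single failure barely moves the needle, though it is still worth asking why it occurred.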

I will illustrate with this example (note: these studies are fictitious and not based on any real data of any kind).  Let us imagine that Scientist X from the University of Timbuktu conducts an experiment and finds that when given 12-volt electric shocks, people perform better at chess than a control group.  Then, Scientist Y from the University of Nantucket conducts a replication trial.  The experiment is performed in exactly the same conditions, but Scientist Y finds no such effect.  What could be happening? Scientist Z from the University of Neverland reads both papers.  He writes letters to both scientists to make sure the experiments were identical, and reexamines the raw data from both experiments to determine which of the studies was wrong, but he finds no experimental error on either side, no problems with data entry, certainly no fraud, and nothing at all to indicate which study was correct.  Can you solve this little problem?  Certainly it would seem that Dr. Mitchell would immediately assume that Scientist X is correct and Scientist Y has made some undetectable mistake.  However, perhaps the real solution is that they are both correct.  There is no flaw in the University of Timbuktu study, but it is incomplete.  It fails to account for the fact that, in Nantucket, they rather enjoy electric shocks due to some previously undiscovered environmental factor, so they are immune to the effects of the experimental manipulation in the study by Scientist X.  Of course it’s a stupid example, but I think it vividly illustrates the point that Dr. Mitchell, for all his laudable attempts to avoid experimental error reaching the literature, has ignored the possibility that replication studies can bring new insights in addition to oversight.

If an original study is superior to the replication study that finds different results, it should be very easy for the original researchers to defend their work.  They could point out the flaws in the replication, or they could conduct further research or call for independent research.  Any of these approaches could vindicate the original study and show the replication to be incorrect.  Instead of taking this proper approach, Dr. Mitchell suggests that we should ignore replication entirely because sometimes a replicator might get it wrong.  He forgets that in science, truth is determined not by who published first but by who has the best evidence.  All of his anticipated problems with replication are easily dismissed simply by providing the evidence that shows the original study correct.

3) “Three standard rejoinders to this critique are considered and rejected.  Despite claims to the contrary, failed replications do not provide meaningful information if they closely follow original methodology; they do not necessarily identify effects that may be too small or flimsy to be worth studying; and they cannot contribute to a cumulative understanding of scientific phenomena.”

More so than the other five points, this one relies heavily on the body of the essay to understand its meaning.  The basic idea is that Dr. Mitchell is considering three responses to his critique.  While I’m sure that these responses are real ones, I question his selection because they were not the first three that came to my mind.  Could Dr. Mitchell be attempting to subtly erect a straw man?  At the very least, he seems not to be arguing against the best form of his opponents’ arguments.  Nevertheless, these three points are worth examining.

The first point is one which I must, unfortunately, rely upon quoting in its entirety, so that you may fully appreciate the ineptitude of the argument:

There are three standard rejoinders to these points.  The first is to argue that because the replicator is closely copying the method set out in an earlier experiment, the original description must in some way be insufficient or otherwise defective.  After all, the argument goes, if someone cannot reproduce your results when following your recipe, something must be wrong with either the original method or in the findings it generated. 

This is a barren defense.  I have a particular cookbook that I love, and even though I follow the recipes as closely as I can, the food somehow never quite looks as good as it does in the photos. Does this mean that the recipes are deficient, perhaps even that the authors have misrepresented the quality of their food?  Or could it be that there is more to great cooking than simply following a recipe?  I do wish the authors would specify how many millimeters constitutes a “thinly” sliced onion, or the maximum torque allowed when “fluffing” rice, or even just the acceptable range in degrees Fahrenheit for “medium” heat.  They don’t, because they assume that I share tacit knowledge of certain culinary conventions and techniques; they also do not tell me that the onion needs to be peeled and that the chicken should be plucked free of feathers before browning.  If I do not possess this tacit know-how--perhaps because I am globally incompetent, or am relatively new to cooking, or even just new to cooking Middle Eastern food specifically--then naturally, my outcomes will differ from theirs.

Likewise, there is more to being a successful experimenter than merely following what’s printed in a method section.  Experimenters develop a sense, honed over many years, of how to use a method successfully.  Much of this knowledge is implicit.  Collecting meaningful neuroimaging data, for example, requires that participants remain near-motionless during scanning, and thus in my lab, we go through great lengths to encourage participants to keep still.  We whine about how we will have spent a lot of money for nothing if they move, we plead with them not to sneeze or cough or wiggle their foot while in the scanner, and we deliver frequent pep talks and reminders throughout the session.  These experimental events, and countless more like them, go unreported in our method section for the simple fact that they are part of the shared, tacit know-how of competent researchers in my field; we also fail to report that the experimenters wore clothes and refrained from smoking throughout the session.  Someone without full possession of such know-how--perhaps because he is globally incompetent, or new to science, or even just new to neuroimaging specifically--could well be expected to bungle one or more of these important, yet unstated, experimental details.  And because there are many more ways to do an experiment badly than to do one well, recipe-following will commonly result in failure to replicate.

Of course, the myriad problems with Dr. Mitchell’s analogy should not require great lengths to expose.

The first problem is the same problem encountered above.  Dr. Mitchell assumes that all providers of original research are, as if by some divine right, more competent practitioners than providers of replication studies.  This is simply not so.  It should be clearly stated that cooking and science are two entirely different practices and that any analogy is bound to be imperfect (cooking is, after all, much more of an art than a science).  However, in the interest of proceeding along established terms, allow me to offer a better analogy.  Dr. Mitchell compared replication studies to his amateur attempts to reproduce recipes from his favorite cookbook.  I fancy myself a rather good cook, but I can sympathize--my food doesn’t always come out looking as good as the photo in the cookbook.  Do I think that this means the authors misrepresented their recipes?  No.  Dr. Mitchell is right to think not.  As an amateur, he is not expected to cook as well as the professionals who wrote his cookbook.  However, if Chef Gordon Ramsay or Chef Wolfgang Puck (or whoever your favorite chef might be) attempted to recreate the recipes, following them precisely and combining the detailed descriptions with the established knowledge of culinary practices that Dr. Mitchell points out are generally understood but not explicitly stated, and the food still came out significantly worse than the photograph would indicate, then I might begin to suspect that the cookbook has some flaw.  Dr. Mitchell assumes in his argument that he is the one trying to recreate the recipe.  The reality of replication is that it could just as easily be Chef Ramsay.

None of this is to say that science should be judged based on the fame or credentials of the scientist.  No, scientific questions must be determined based on the evidence.  But it is the height of both arrogance and short-sightedness to assume that anyone who would bother to replicate a study must be new to science and thus less worthy of attention than the author of the original paper.

Replication is essential precisely because (among other reasons) people who are new to a particular discipline conduct original research as well, and their mistakes could lead to erroneous papers.

However, there is another claim within this section worthy of attention.  This is the idea that some of the “real work” (to borrow a phrase from the magicians) is not explicitly published.  There is both truth and falsehood to this.  It is certainly true that the most mundane details of experimental practice are not explicitly stated in every paper.  However, if there is a practice which is not expected to be common knowledge, it should be explicitly stated.  Dr. Mitchell explains that subjects must remain near-motionless during neuroimaging scans, and alludes to techniques used in his lab to make sure this is the case.  It needn’t be stated, because anyone doing such a scan will already know, that the subject needs to remain motionless.  However, specific actions taken to ensure this motionless state should be noted, either in the paper reporting original research or in a separate paper establishing experimental methodology, which can be cited when that methodology is used in such research.  I do not suspect this to be the case with the methods detailed in Dr. Mitchell’s footnote (in which he lists several such techniques which are never mentioned explicitly in the methods section of his papers), but it is an ever-present possibility that an experimental result could be affected by conditions the experimenter finds unimportant.  If such notes make a paper too long for publication, they should be published elsewhere (perhaps on the same website that would be better used for experimental methodological tips than mindless ramblings about how useless replication is), so that both potential replicators and the merely interested can fully understand the experimental procedure in place during any experiment upon which they will base a scientific belief.  In Dr. Mitchell’s case, it is common knowledge and needn’t be stated that the subject must remain still.  The phrasing used to achieve this, while apparently innocent enough, can vary from laboratory to laboratory and should probably be noted somewhere so that no errors are made.  Similarly, though Dr. Mitchell’s cookbook probably doesn’t say so, I’m sure there is a publication somewhere that would gladly specify the important detail that a bird must be plucked of feathers prior to cooking.

The second argument is that a phenomenon which has a small effect size or is difficult to replicate might nonetheless be real.  True.  But how does one determine that?  Through further studies.  The studies should be replicated using both the same techniques and new ones to tease out the reality of the situation.  No one has ever suggested that a failed replication necessarily means an unreal phenomenon in every case.  It means an attempt at replication has failed, nothing more and nothing less.  The implications of that failure are a subject both for discussion and for further experimental investigation.  Dr. Mitchell’s examples fall short because in the very same paragraph where he decries replication because it might have “killed” fields of inquiry we now know to be important, he makes reference to further study validating the original findings.  It would seem that Dr. Mitchell only objects to replication when it falsifies original research, and frankly, that’s just bad science.

It’s also worth noting that if there is flimsy evidence, it would be unwise to believe a claim.  That doesn’t mean it’s wrong, but the scientific method is based upon skeptical inquiry.  We should have been skeptical about those findings Dr. Mitchell uses as his examples because evidence was flimsy in the early days.  It wasn’t until new methods were found to investigate these phenomena (as Dr. Mitchell points out) that the original studies were vindicated.  So the time to believe them is now that the evidence is in.  The time to believe them was not early on when they were little more than promising hypotheses.  But it is not our side that is trying to shut down inquiry.  It is Dr. Mitchell’s side (if indeed there is more than one lone misguided soul who subscribes to his view) that would seek to stifle inquiry by tacitly accepting original research without even the consideration of its replicability.  Replicability is not the only factor that makes a theory robust, but it is certainly an important factor.

The final counterargument that Dr. Mitchell attempts to address is, I think, one of the stronger points.  As I mentioned earlier when I explained publication bias, there is an asymmetry between positive and negative results, even in studies of the very same phenomenon.  Dr. Mitchell claims that science requires an asymmetry between positive and negative results, harking back to that old chestnut that absence of evidence is not evidence of absence.  He claims that no matter how many papers might be published claiming that swans are only white, it only took one study to prove that there can be black ones.  This is all very true, but a better analogy would be Sasquatch (or Bigfoot or Yeti, depending upon your region).  Would Dr. Mitchell seriously suggest that if one person publishes a photograph of a Sasquatch that we should immediately ignore any paper which argues to the contrary?  Certainly it is true that there could be such a being, but it, like everything else in science, should be treated with the same skepticism that is necessary for science to work. We believe in claims when there is sufficient robustness of evidence to outweigh the skeptical counterarguments.  No one is saying that we should believe scientific claims based entirely upon the number of papers suggesting one position or the other (although certainly that is an important factor to bear in mind when formulating opinions).  But it is certainly important to read those papers that show a published effect might not really exist.  If the evidence in one paper is stronger than the other, believe that one.  If the evidence in one is not clearly stronger than the other, we need a new experiment.  But we can’t possibly begin to even consider all of this until replication has been attempted and either succeeded or failed.

Dr. Mitchell then offers this nugget of wisdom: “After all, the argument goes, if an effect has been reported twice, but hundreds of other studies have failed to obtain it, isn’t it important to publicize that fact? No, it isn’t.”  Actually, that’s exactly the kind of information the scientific community needs.  We needn’t know the numbers of studies on one side or the other.  We need to know the quality of research on both sides, and we can only do that when all of that research is published.  It’s quite possible there could be two great positive studies and hundreds of other studies all of which were conducted by idiots or baboons.  It’s more likely that either two researchers made a mistake, or that there is some other factor causing the difference.  If the latter is the case, it’s important to have all of the information on the table, so we can attempt to isolate that other factor.

4) “Replication efforts appear to reflect strong prior expectations that published findings are not reliable, and as such, do not constitute scientific output.”

Well, I didn’t realize that a scientist’s intentions were how we judged whether or not a paper constituted scientific output.  I thought scientific claims’ validity was judged based on the strength of the evidence.  Silly me.

The basis of this argument is that, if a belief in the hypothesis can result in a bias in favor of positive results, then if the replicator believes the result to be invalid, this can result in a bias toward negative results.  These biases are real.  And it is possible that many replicators are interested only in falsifying results that disagree with their preconceptions, though Dr. Mitchell seems to have an abnormally low view of scientists when he assumes that this is almost universally the case.  Indeed, the two main reasons to replicate a study are either to detect possible errors if one thinks the study was in error or to offer further independent support if one thinks the original work was valid.  But the scientific process is specifically designed to minimize the impacts of these biases.

Once again, I must allow Dr. Mitchell’s own words to condemn him: “But consider how the replication project inverts this procedure--instead of trying to locate the sources of experimental failure, the replicators and other skeptics are busy trying to locate the sources of experimental success.  It is hard to imagine how this makes any sense unless one has a strong prior expectation that the effect does not, in fact, obtain. When an experiment fails, one will work hard to figure out why if she has strong expectations that it should succeed.  When an experiment succeeds, one will work hard to figure out why to the extent that she has strong expectations that it should fail.  In other words, scientists try to explain their failures when they have prior expectations of observing a phenomenon, and try to explain away their successes when they have prior expectations of that phenomenon’s nonoccurrence.”

It is perfectly valid to explore the causes of either positive or negative results (I refuse to consider this in terms of experimental success or failure for reasons detailed above).  The point of the experiment is to isolate cause and effect, so if there is another possible cause for an effect (whether that effect is a positive or a negative result), it is within the proper purview of the scientist to try to find it.  This is a good thing.  Dr. Mitchell seems to think that the point of science is to offer proof of one’s predetermined conclusions, but this is not the case at all.  While supporting a pet hypothesis or falsifying a rival hypothesis may be the initial motivation to embark upon a study, any reputable scientist places truth above personal preference and seeks the best explanation for a given phenomenon.

I am reminded of a story once told by Richard Dawkins (who is actually a proper scientist, in the real sense of the word). Dawkins writes: “I have previously told the story of a respected elder statesman of the Zoology Department at Oxford when I was an undergraduate. For years he had passionately believed, and taught, that the Golgi Apparatus (a microscopic feature of the interior of cells) was not real: an artifact, an illusion. Every Monday afternoon it was the custom for the whole department to listen to a research talk by a visiting lecturer. One Monday, the visitor was an American cell biologist who presented completely convincing evidence that the Golgi Apparatus was real. At the end of the lecture, the old man strode to the front of the hall, shook the American by the hand and said--with passion--"My dear fellow, I wish to thank you. I have been wrong these fifteen years." We clapped our hands red. No fundamentalist would ever say that. In practice, not all scientists would. But all scientists pay lip service to it as an ideal--unlike, say, politicians who would probably condemn it as flip-flopping. The memory of the incident I have described still brings a lump to my throat.” (This quote is taken from http://www.beliefnet.com/Faiths/Secular-Philosophies/Why-I-Am-Hostile-Toward-Religion.aspx?p=2).

Unfortunately, Dr. Mitchell has shown Professor Dawkins wrong on one small point.  Apparently not all scientists even bother to pay lip-service to the scientific ideal.  Real scientists have no interest in explaining away results they dislike, whether positive or negative.  They may be initially skeptical, and they certainly demand evidence, and they may even embark upon a replication study in order to further examine that evidence.  But once that evidence is in, if it conflicts with their views, they must admit they had been wrong.

5) “The field of social psychology can be improved, but not by the publication of negative findings.  Experimenters should be encouraged to restrict their "degrees of freedom," for example, by specifying designs in advance.”

Actually, putting aside a few phrases, Dr. Mitchell is to be commended for this small section of his essay.  For the reasons already discussed and for the reasons I will discuss in the continuance of this conversation below, he’s dead wrong about his opposition to the publication of negative findings.  However, except for suggesting that this is not the way to improve the field of social psychology, the suggestions he does make are quite reasonable ones.  I won’t rehash everything he said in that section here, but it boils down to increased standards for published research.  On that point, we can all agree.

There is a phrase that bothers me a bit, though, and I want to address it: “All scientists are motivated to find positive results, and social psychologists are no exception.”  This is true, of course, but I think it is problematic and that Dr. Mitchell would have us completely ignore the problem behind it.  Scientists are motivated to find positive results partly because they like to confirm their pet hypotheses.  This is true.  However, this is small motivation indeed when one realizes that most people become scientists because they want to understand the world.  If that means rejecting a pet hypothesis, most scientists (as Richard Dawkins points out) at the very least pay lip-service to the ideal.  For me, rejecting a pet hypothesis may be unpleasant for a day or two, but that emotion soon gives way to the much more profound emotion when I realize that having done so, I have eliminated a false belief and may now substitute a true one.  I think most scientists understand and agree with that desire to follow the evidence wherever it leads and to always seek to discover the truth.

So why, then, are scientists so motivated to find positive results?  Precisely because there is such a bias against publishing negative results.  In academia, if you don’t publish research, your career is doomed to be a short one.  But if you find negative results, you often find yourself with work that can’t find a market in which to publish.  Never mind that this research might be the result of five years’ work involving dozens of collaborators and research assistants--if it’s negative, it doesn’t get published.  So of course there’s a bias toward finding positive results.  But it’s not necessarily a philosophical bias.  Indeed, there are lots of us (I know--I’ve spoken to them) who actually like negative results because they show us there is more to be learned (“My dear fellow, I wish to thank you…”).  But if we’re trying to meet publication requirements for career advancement, negative results are politically (not scientifically) undesirable.

6) “Whether they mean to or not, authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues.  Targets of failed replications are justifiably upset, particularly given the inadequate basis for replicators’ extraordinary claims.”

Whether he means to or not, I think Dr. Mitchell is revealing his true motivation for writing this article here.  He has conflated replication studies with accusations of deliberate misrepresentation of data!  A replication study, even if it is negative, does not impugn anything.  Nor is a replication study an attempted pissing contest between the replicator and the author of the original research.  Indeed, it is possible to perform a replication study while maintaining the greatest of respect for the original author or while having no opinion of him or her at all.  Failed replication does not, need not, and should not be considered an insult to the integrity of the original author unless there is very good reason to suspect deliberate fraud.

Let us imagine a failed replication has been published. What are some possible reasons for this eventuality?

a) The original research is valid, and the replicator made a mistake.
b) The original research is valid, and the replication study failed due to chance.
c) The original research is valid, and the replicator falsified his findings.
d) The original research is invalid; the original author made a mistake.
e) The original research is invalid; the original author falsified his findings.
f) The original research is invalid; the original finding was due to chance.
g) The original research is valid but incomplete; there are other factors at work.

In only one of those situations is the original author’s integrity challenged.  In only one other is his competence even slightly called into question.  It may be uncomfortable to have your work questioned, but that’s just part of science.  It shouldn’t be taken as an attack unless it is coupled with a direct accusation of impropriety.  Those accusations should not be taken lightly.  They should be taken seriously, but false accusations should also be met with strict consequences.  Science is an honorable profession, and fraud is rare but intolerable.  False accusations of fraud are similarly rare but also intolerable.  This is not what replication is about, however.  Replication is simply about determining whether original findings hold up.

By convention, we consider a finding to be statistically significant at a p-level of less than 0.05.  That means that when no real effect exists, we accept a 5% chance of a false positive due simply to statistical chance (not considering experimental error).  That means that, all else being equal, a real share of what gets published could be wrong, just based on accepted standards for publication: roughly one in twenty tests of a nonexistent effect will come out significant anyway.  We could restrict our p-levels to less than 0.01 if we wanted to, but that still leaves a 1% chance of a false positive every time a nonexistent effect is tested.  Replication, if nothing else, is about minimizing those probabilities by re-running the experiments to see if the same results happen again.  Even if we put aside all possibility of experimental error, misrepresentation, or incomplete understanding of contributory factors, we must replicate research in order to weed out statistical anomalies.  Restricting p-levels to prohibitively low probabilities won’t do, either, because the more restrictive our statistical tests, the more likely we are to reject findings that are actually real.  That’s just as bad.  So what do we do?  We replicate.
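
For readers who like to see the arithmetic, here is a minimal sketch of that reasoning as a simulation.  Everything in it is an assumption made for illustration: the two-group design, the sample size of 30 per group, the known unit variance (which permits a simple z-test), and the function names p_value and simulate are all hypothetical and come from neither Dr. Mitchell’s essay nor any particular study.  It simply counts how often a single study crosses the significance threshold and how often an original study and an independent replication both do.

```python
import random
from statistics import NormalDist, mean


def p_value(effect, n, rng):
    """Two-sided z-test p-value for one simulated two-group study.

    Assumes both groups have a known standard deviation of 1, so the
    difference in means has standard error sqrt(2 / n).
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n)]
    z = (mean(treated) - mean(control)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate(effect, n=30, studies=4000, alpha=0.05, seed=1):
    """Return (share of originals significant, share significant twice)."""
    rng = random.Random(seed)
    single = both = 0
    for _ in range(studies):
        original = p_value(effect, n, rng) < alpha
        replication = p_value(effect, n, rng) < alpha
        single += original                  # the original study alone
        both += original and replication    # original AND its replication
    return single / studies, both / studies


if __name__ == "__main__":
    for label, effect in (("no real effect", 0.0), ("real effect", 0.8)):
        single, both = simulate(effect)
        print(f"{label}: significant once {single:.3f}, twice {both:.3f}")
```

When there is no real effect, each study has a 5% chance of a spurious hit by construction, so an original and an independent replication both hit with probability 0.05 × 0.05, or 0.25%.  A real effect of reasonable size, by contrast, survives the second test most of the time.  The sketch ignores the direction of the effect and every source of error other than chance, so it illustrates only the statistical part of the argument.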

Dr. Mitchell himself points out, “On the occasions that our work does succeed, we expect others to criticize it mercilessly, in public and often in our presence.”  No doubt, it can be quite uncomfortable.  Science is hard work, and it’s a tough business.  If someone thinks you’re wrong, they have no problem saying so, and they expect the same of you.  That’s the way it should be.  There’s no ill will about it--it’s just a matter of subjecting all claims to the strictest of scrutiny.  Anyone who has ever so much as presented a poster understands the feeling of coming under fire.  Anyone who has defended a thesis knows it better than the rest.  When we think someone is wrong, we say so.  When we aren’t sure, we test it, and then we say what the results were.  There’s very little coddling or hand-holding in this field, and there needn’t be.  Scientists are adults, and as such should be able to take professional criticism for what it is and avoid taking it personally.  Replication studies are one more type of potential criticism (though they can also support the original research, as Dr. Mitchell regularly forgets).

He concludes his essay with the following line: “One senses either a profound naiveté or a chilling mean-spiritedness at work, neither of which will improve social psychology.”

It seems that exactly one senses such things at work here and that one is called Dr. Jason Mitchell.  The rest of the scientific community seems to understand that replication is not a mean-spirited personal attack, but just part of the job.  Dr. Mitchell’s complaints seem, though I admittedly speak only of a general impression and not from any sort of evidence here, to be the whiny complaints of someone whose pet theory has been called into question.  Instead of calling replicators (who, need I remind you, are just other scientists, just like anyone else, and most often also producers of their own original research) “mean-spirited,” the mature scientist realizes that replication is an essential component of the scientific process and that we neglect it at our peril.

This essay prompted science journalist Ben Lillie to take to Twitter with this comment (quoted in: http://io9.com/if-you-love-science-this-will-make-you-lose-your-sh-t-1601429885?utm_campaign=socialflow_io9_facebook&utm_source=io9_facebook&utm_medium=socialflow ): “Do you get points in social psychology for publicly declaring you have no idea how science works?”  I think that sums up the quality of Dr. Mitchell’s essay quite nicely, though I object to the association of Dr. Mitchell with the rest of the field of social psychology.  The social and behavioral sciences have struggled long and hard to achieve strict scientific standards.  Ill-informed tirades like Dr. Mitchell’s contribute to a popular misconception that these fields are not “true” sciences.  They are and they should be.  It is unfortunate that many of their practitioners seem to disagree, but let us not besmirch the image of entire fields based on the “contributions” of a few of their members who prefer not to follow the rules of science.

Throughout this response, harsh though I may have been (though I assure you, my commentary is no more biting than what is generally expected of any controversial statement among scientists), I have striven to avoid making any sort of personal attack or commentary about Dr. Mitchell.  I don’t know him personally, so it would be improper to do so.  I have attempted to restrict my commentary to his arguments themselves and to his apparent lack of understanding of the scientific process.  However, since he chose to close his article by calling scientists who conduct replication studies “naïve” and “mean-spirited,” I feel no guilt at closing my response by pointing out one additional quotation buried in Dr. Mitchell’s essay: “I was mainly educated in Catholic schools….”

Yeah, we can tell.  Which might explain why Dr. Mitchell prefers to treat social psychology as a religion rather than a science.


Mark Warschauer said...

Brilliant response. How ironic that a Colorado magician understands science 1000 times better than a Harvard professor!!

Unknown said...

Thanks, Mark.

In fairness, I think of myself as a scientist even above and beyond being a magician. I'm still at university, so one would have expected a Harvard professor's knowledge to be more advanced, though. But it was, actually, my interest in psychological science that is largely responsible for getting me started in magic. I wear a lot of different hats (scientist, writer, magician, etc), but they all seem to work quite well together for me.

Ben Lillie said...

Extraordinarily good response. Very, very well done. Also wanted to add that I completely agree with your comments on my tweet. That was done in an ill-advised moment of snark, and I didn't think it would get nearly the exposure that it did. I did follow it up with a number of tweets about how good most psych researchers are, but of course those didn't get much attention. C'est la Twitter.

Unknown said...

Thanks for this, Bob! What an exceptionally bright response!