The Reproducibility Crisis in Research

Why many published findings can't be replicated and what universities are doing to improve research reliability.

What Is the Crisis?

The reproducibility crisis refers to the widespread finding, documented across multiple scientific disciplines over the past fifteen years, that a substantial fraction of published scientific findings cannot be reproduced when other researchers independently repeat the original studies. This finding has prompted deep questioning of the reliability of the scientific literature and the institutional practices that produce it.

The crisis gained widespread attention following the Open Science Collaboration's Reproducibility Project: Psychology, published in Science in 2015, which attempted to reproduce 100 studies drawn from three leading psychology journals. Only 36 to 39 percent of the replications produced statistically significant results consistent with the original papers, depending on how success was defined. The study's conclusion was stark: the majority of published findings sampled from leading psychology journals did not replicate when other researchers tried.

Subsequent replication projects in preclinical cancer biology (the Reproducibility Project: Cancer Biology), social science, neuroscience, economics, and other fields have produced similarly troubling results. While the rate of successful replication varies by field and methodology, the pattern is consistent: published findings are systematically more positive and larger in effect size than independent replications suggest is warranted.

The peer review system, long regarded as science's primary quality-control mechanism, has not prevented the accumulation of an unreliable literature. Understanding why requires examining the incentive structures, methodological practices, and publication norms that systematically produce inflated and unreliable findings even in the absence of deliberate misconduct.

Scale of the Problem

Quantifying the scale of the reproducibility crisis is methodologically challenging — the denominator (how many findings would fail to replicate if tested) is unknown because only a small fraction of findings have been subjected to independent replication. The numerator (confirmed failed replications) is itself an underestimate because failed replications face publication barriers.

Survey data provides a complementary perspective. A 2016 Nature survey of 1,576 researchers found that over 70 percent had tried and failed to reproduce another scientist's experiment, and over half had failed to reproduce one of their own experiments. Across all disciplines surveyed, roughly 90 percent of respondents agreed there is at least a slight reproducibility crisis in science.

The economic costs of irreproducible research are substantial. A 2015 analysis in PLOS Biology estimated that the cost of irreproducible preclinical research alone in the United States exceeded $28 billion per year — money spent by companies attempting to build on academic findings that could not be confirmed, and by academic labs attempting to extend work that later proved unreliable. When downstream clinical trials fail because they are based on irreproducible preclinical findings, the costs — in money, time, and patient welfare — are even larger.

Not all failures to replicate reflect problems with the original research. Replication studies face their own methodological challenges: samples may differ in ways that legitimately produce different results, effect sizes may vary across populations and contexts, and small samples in both original and replication studies create statistical uncertainty in both directions. Responsible replication research requires distinguishing genuine failures of reproducibility from apparent failures generated by legitimate heterogeneity in effects.

Causes

The reproducibility crisis does not primarily reflect deliberate misconduct — it reflects incentive structures and methodological practices that systematically bias the literature toward positive, large, and novel findings regardless of their ultimate reliability.

Publication bias is the primary structural cause. Journal editors and peer reviewers systematically prefer positive findings (studies showing that an intervention works, a drug is effective, a psychological manipulation changes behavior) over null findings, which show no effect. This preference distorts the literature: for every published positive finding, multiple null results languish in file drawers, never submitted or rejected when submitted. Meta-analyses that should aggregate all evidence on a question instead aggregate only the biased published subset.

p-hacking (also called data dredging or selective reporting) refers to the practice of analyzing data in multiple ways until a statistically significant result is found, then reporting only the analysis that crossed the significance threshold. The formal statistical framework underlying most published research assumes that the analyst formulated their hypothesis before seeing the data and tested it once; those conditions are violated whenever researchers try multiple analytical approaches or multiple outcomes and report only the significant one.
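The arithmetic behind p-hacking is easy to demonstrate. The sketch below is a minimal Python simulation, not drawn from any cited study; the five outcomes and 30 participants per group are arbitrary assumptions. It shows that a researcher who measures several independent outcomes and reports the study as positive if any of them clears p < 0.05 will obtain a "significant" result far more often than 5 percent of the time, even when every true effect is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_SIMS = 10_000   # simulated "studies"
N_PER_GROUP = 30  # participants per group (arbitrary assumption)
N_OUTCOMES = 5    # outcomes measured per study (arbitrary assumption)

false_positives = 0
for _ in range(N_SIMS):
    # Null world: treatment and control are drawn from the same
    # distribution, so the true effect on every outcome is exactly zero.
    treatment = rng.normal(size=(N_OUTCOMES, N_PER_GROUP))
    control = rng.normal(size=(N_OUTCOMES, N_PER_GROUP))
    pvals = stats.ttest_ind(treatment, control, axis=1).pvalue
    # The p-hacking step: call the study "significant" if ANY of the
    # outcomes happens to cross the 0.05 threshold.
    if pvals.min() < 0.05:
        false_positives += 1

print("Nominal false positive rate with 1 outcome: 0.05")
print(f"Observed rate with {N_OUTCOMES} outcomes: "
      f"{false_positives / N_SIMS:.3f}")  # about 1 - 0.95**5 = 0.23
```

With five independent chances at significance, the effective false positive rate is roughly 1 - 0.95^5, or about 23 percent; reporting only the analysis that "worked" hides the other four attempts from readers and reviewers.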

HARKing — Hypothesizing After Results are Known — refers to the practice of presenting post-hoc hypotheses as if they were pre-registered predictions. A researcher who runs an experiment, observes an unexpected significant correlation, and writes their paper as if they predicted that correlation from the outset has engaged in HARKing. The resulting paper appears to provide confirmatory evidence for a hypothesis that was actually generated by the data it purports to confirm.

Underpowered studies — studies with sample sizes too small to reliably detect the effects they are designed to measure — exacerbate all these problems. A study with only 50% statistical power to detect a real effect will, when it does detect a significant result, produce inflated effect size estimates because only the largest observed effects clear the significance threshold. The result is a literature where small studies' positive findings systematically overestimate effect sizes.
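This "winner's curse" can be simulated directly. The sketch below is illustrative Python under assumed numbers (a true effect of d = 0.4 and 50 participants per group, which yields roughly 50 percent power at alpha = 0.05); it compares the average effect size across all simulated studies with the average among only those that reached significance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

TRUE_D = 0.4      # true standardized effect size (assumed)
N_PER_GROUP = 50  # gives roughly 50% power for d = 0.4 at alpha = 0.05
N_SIMS = 10_000

observed_d, significant = [], []
for _ in range(N_SIMS):
    treatment = rng.normal(loc=TRUE_D, size=N_PER_GROUP)
    control = rng.normal(loc=0.0, size=N_PER_GROUP)
    # Cohen's d estimated from the pooled standard deviation.
    pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
    significant.append(stats.ttest_ind(treatment, control).pvalue < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)

print(f"Power (share reaching p < 0.05):  {significant.mean():.2f}")
print(f"True effect size:                 d = {TRUE_D:.2f}")
print(f"Mean d across all studies:        d = {observed_d.mean():.2f}")
print(f"Mean d, significant studies only: d = {observed_d[significant].mean():.2f}")
```

In this setup the significant subset averages around d = 0.55, roughly 40 percent above the true effect, even though the unfiltered average is unbiased. Because journals publish mostly the significant subset, the same filtering explains why the published literature, and the meta-analyses built on it, overestimate effect sizes.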

Discipline Variations

The reproducibility crisis manifests differently across scientific disciplines, reflecting variation in sample sizes, measurement quality, effect size expectations, and methodological norms.

Psychology and social psychology have been at the center of high-profile replication failures. Areas including social priming (subtle cues allegedly influencing complex social behavior), power posing (adopting expansive body postures allegedly raising testosterone and confidence), and ego depletion (a theory of limited willpower) have all failed to replicate in large-scale efforts. These failures have fundamentally reshaped certain research areas and prompted heated debates about methodological standards.

Preclinical biomedical research (studies in cells and animals intended to support drug development) has a well-documented irreproducibility problem. Industry groups including Bayer and Amgen have published analyses suggesting that a majority of landmark academic findings in cancer biology and oncology could not be confirmed when they attempted to build on them. These failures are particularly consequential because they feed into drug development pipelines and may be contributing to the historically high failure rate of late-stage clinical trials.

Economics has demonstrated relatively stronger reproducibility in some analyses — perhaps because economics commonly uses administrative datasets that can be accessed by others and analysis code is increasingly shared — but significant problems remain in experimental economics and areas relying on survey data.

Physics and chemistry show stronger reproducibility in part because measurement is more precise, effect sizes are typically larger (atomic properties don't vary by subject population), and methodological standards are more stringent. But even in these fields, high-profile failures — including initial reports of cold fusion and of faster-than-light neutrinos — remind the community that extraordinary claims require extraordinary replication efforts.

Proposed Solutions

The scientific community's response to the reproducibility crisis has been substantial and multidimensional, producing a portfolio of methodological and policy reforms targeting different contributors to the problem.

Open data and materials sharing requirements from journals and funders address the basic obstacle to replication: inability to access the data and procedures needed to attempt reproduction. Journals including PLOS ONE and many others now require data sharing as a condition of publication. The NIH Data Management and Sharing Policy, effective in 2023, requires that all funded research produce a data management plan with concrete sharing provisions. Resistance from researchers, on grounds of competitive advantage, participant privacy, and data preparation burden, remains, but the trend toward open data is accelerating.

Pre-registration of studies — depositing a detailed research plan including hypotheses, methods, and analysis strategy before data collection begins — directly addresses p-hacking and HARKing by creating a public record of intended analyses. The Open Science Framework and ClinicalTrials.gov provide pre-registration infrastructure. Pre-registration does not prevent researchers from conducting exploratory analyses; it simply requires that exploratory analyses be clearly distinguished from confirmatory tests of pre-specified hypotheses.

Multisite replication projects, in which the same study is conducted simultaneously at multiple institutions with shared protocols, address concerns that replication failures reflect contextual differences rather than problems with the original findings. The Many Labs and Psychological Science Accelerator projects coordinate large-scale replications across dozens of sites in many countries, providing much stronger evidence about the generalizability of findings than any single replication study could.

Revised statistical standards are under active discussion. The p < 0.05 significance threshold, arbitrary from its inception, has been critiqued as creating a false sense of certainty and incentivizing the borderline p-hacking practices that contaminate the literature. Proposals include raising the significance threshold to p < 0.005 for new discoveries, shifting to Bayesian analysis methods that quantify evidence rather than dichotomizing it, and focusing on effect size estimation with confidence intervals rather than binary significance testing.
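The case for a stricter threshold becomes concrete when combined with base rates and power. The sketch below is an illustrative Python simulation under loudly assumed numbers (10 percent of tested hypotheses are true, true effects are d = 0.4, and 50 participants per group); it tallies what fraction of "discoveries" are false positives at p < 0.05 versus p < 0.005:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

PRIOR_TRUE = 0.10  # share of tested hypotheses that are true (assumed)
TRUE_D = 0.4       # effect size when the hypothesis is true (assumed)
N_PER_GROUP = 50   # roughly 50% power for d = 0.4 at alpha = 0.05
N_SIMS = 20_000

# Each simulated study tests either a true effect (with probability
# PRIOR_TRUE) or a null effect; the analyst does not know which.
is_true = rng.random(N_SIMS) < PRIOR_TRUE
effects = np.where(is_true, TRUE_D, 0.0)
treatment = rng.normal(loc=effects[:, None], size=(N_SIMS, N_PER_GROUP))
control = rng.normal(loc=0.0, size=(N_SIMS, N_PER_GROUP))
pvals = stats.ttest_ind(treatment, control, axis=1).pvalue

for alpha in (0.05, 0.005):
    sig = pvals < alpha
    false_share = (~is_true[sig]).mean()
    print(f"alpha = {alpha}: {sig.sum():5d} significant results, "
          f"{false_share:.0%} of them false positives")
```

Under these assumptions, nearly half of the findings that clear p < 0.05 are false positives, while the p < 0.005 threshold cuts that share sharply (at the cost of fewer detections). The numbers shift with the assumed base rate and power, which is exactly the argument critics make against treating p < 0.05 as a marker of reliable discovery.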

Registered Reports and Pre-Registration

Registered Reports represent the most systematic journal-level response to the reproducibility crisis — a publication format specifically designed to decouple editorial decisions from results and eliminate publication bias at its source.

In the traditional journal model, editorial and peer review decisions are made after research is completed and results are known. This creates the publication bias problem: editors and reviewers, consciously or unconsciously, respond differently to positive versus null results. The Registered Report format interrupts this dynamic by conducting peer review before data collection begins.

In a Registered Report, researchers submit an introduction, methods section, and analysis plan to a journal before collecting data. The journal conducts peer review on this Stage 1 submission, evaluating the importance of the research question, the rigor of the design, and the appropriateness of the planned analyses. If accepted in principle, the journal commits to publish the completed paper regardless of results — provided the authors adhered to their pre-registered protocol.

After data collection and analysis, authors submit the Stage 2 manuscript with their results and discussion. Further peer review ensures that the reported analyses match the pre-registered plan and that the interpretation is appropriate. Preprint versions of the manuscript are often posted while Stage 2 review is underway.

Registered Reports are now offered by over 300 journals across disciplines. Meta-analyses comparing Registered Reports with traditional articles consistently find that Registered Reports produce a much higher rate of null results (around 40 to 50 percent versus 5 to 10 percent in traditional literature) — strong evidence that the traditional literature is systematically publication-biased and that Registered Reports correct for this bias.

The research ethics implications of the reproducibility crisis extend beyond methodology to institutional culture. Universities that reward publication count and journal prestige over reproducibility and transparency create environments where the practices contributing to the crisis are rational individual responses to institutional incentives. Reform requires not just changing statistical methods but changing what universities measure, reward, and celebrate in research careers: a harder and slower institutional transformation than any technical fix.