How Data Becomes Reliable Evidence

There are moments in the history of statistics that sound like charming trivia at first and then turn out to be intellectual bombshells. One of the most famous took place not in a lecture hall, a laboratory, or in front of a blackboard covered with formulas, but over a cup of tea. A lady claimed she could taste whether the tea or the milk had been poured into the cup first. For most of those present, this would likely have remained a social curiosity. For Ronald Aylmer Fisher, it was a methodological question. His focus was not on the claim itself, but on how to test it so that an anecdote could become evidence. How many cups are needed? How should they be arranged? How does one prevent expectation, bias, or mere chance from skewing the result? This scene already foreshadows what makes Fisher so important to the modern era: statistics is not primarily calculation, but the art of carefully establishing the conditions for judgment.

More than a Statistician

Ronald Aylmer Fisher was born in London in 1890 and is one of those rare figures in the history of science whose influence can hardly be confined to a single discipline. He was a statistician, geneticist, evolutionary biologist, and theorist of the scientific experiment all at once. Encyclopaedia Britannica succinctly describes him as the man who decisively advanced the application of statistical methods to the design of scientific experiments. That is correct—but almost too modest. Fisher did not merely provide tools; he changed the rules by which empirical knowledge is generated in the first place.

The historian of statistics Anders Hald [see Hald 2007, p. 147] therefore judged that "Fisher was a genius who almost single-handedly created the foundations for modern statistical science." And the evolutionary biologist Richard Dawkins [see Dawkins 1995, p. 38] called him "Darwin's greatest twentieth-century successor." Taken together, these two assessments convey something important: Fisher was not merely a technician of data analysis, but a thinker who deeply intertwined statistics, biology, and epistemology.

Consequently, his name is attached to an unusually large number of methods, problems, and distributions: Fisher information, Fisher's exact test, the F-distribution and the F-test, Fisher's discriminant function, the Fisher–Tippett distribution, the Cornish–Fisher expansion, the Fisher–Yates shuffle, and the Behrens–Fisher problem. Yet the sheer length of this list invites a misunderstanding. Fisher was not great primarily because so many concepts bear his name. He was great because he recognized a common fundamental problem underlying all these methods: how does one draw reasonable conclusions from finite, noisy, and often poorly constructed data?

The Lady with the Tea—and What She Really Taught

The famous "lady tasting tea" episode is so instructive because, at its core, it is a lesson in data quality. In "The Design of Experiments" [see Fisher 1935], Fisher demonstrated that a claim can only be seriously tested if the experiment is designed so that alternative explanations are systematically ruled out. Randomization, replication, comparability of units, and clear decision rules are not decorative elements, but the very conditions that make a test result interpretable in the first place.
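The arithmetic behind such a design can be sketched in a few lines of Python. The sketch assumes the setup Fisher describes in "The Design of Experiments": eight cups, four prepared each way, presented in random order, with the taster asked to pick the four milk-first cups. The function name `p_at_least` is illustrative only.

```python
from fractions import Fraction
from math import comb


def p_at_least(k_correct: int, cups: int = 8, milk_first: int = 4) -> Fraction:
    """Probability of picking at least k_correct milk-first cups by pure guessing.

    Under the null hypothesis (no discriminating ability) every choice of
    `milk_first` cups out of `cups` is equally likely, so the number of
    correct picks follows a hypergeometric distribution.
    """
    total = comb(cups, milk_first)
    favourable = sum(
        comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
        for k in range(k_correct, milk_first + 1)
    )
    return Fraction(favourable, total)


if __name__ == "__main__":
    print(p_at_least(4))  # all four correct: 1/70
    print(p_at_least(3))  # at least three correct: 17/70
```

With eight cups, only a perfect result is rare under guessing (1/70, about 0.014). Three or more correct picks occur by chance in 17 of 70 cases, roughly a quarter of the time. The number of cups and the decision rule therefore have to be fixed before the first cup is tasted.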

This is where Fisher's enduring modernity lies. Many of his successors were regarded primarily as developers of statistical tests. Fisher's thinking began before the formula. For him, statistics began with the design of the observation. A poorly defined research question, a biased sample, an unclear measurement rule, or an uncontrolled confounding variable often destroys the validity of the results before any test is calculated. Good mathematics cannot remedy such shortcomings. At best, it can mask them more elegantly.

Likelihood: What Data Can Actually Achieve

This fundamental attitude also explains why Fisher assigned such a central role to likelihood. In his works on theoretical statistics, particularly "On the Mathematical Foundations of Theoretical Statistics" [Fisher 1922] and "Theory of Statistical Estimation" [Fisher 1925], he fundamentally shifted the perspective on statistical estimation. For him, the crucial question was no longer which parameter value is "true" and how it can be guessed, but rather: for which values of a parameter would the actually observed data be most plausible?

Formally, this means: given a statistical model with a density or probability function f(x∣θ), the same function is, once the data x have been observed, read as a function of the parameter θ. This function

$$L(\theta) = f(x \mid \theta)$$

is called the likelihood function. It is not a probability of the parameter, but a measure of how well different parameter values fit the data already observed. The maximum likelihood estimator is then the value

$$\hat{\theta} = \arg\max_{\theta} L(\theta),$$

that is, the parameter value under which the observed data appear most plausible within the chosen model.

The idea is so elegant that one easily overlooks what it depends on. Likelihood does not operate in a vacuum. It is always tied to a model and to concrete data. If the model is incorrectly specified, if observations have been selected in a biased manner, if measurement processes are unstable, or if the data are not truly comparable with one another, then even the most beautiful likelihood will only find the optimum within a world that is incorrectly described. Mathematical precision then guarantees no objective truth.

This is where the central claim of this article gains its sharpness: good models do not save bad data. Fisher's greatness therefore lies not only in the formulation of a powerful estimation principle, but in the insight that statistical precision is no substitute for methodological discipline. A model is only robust if its empirical foundation is sound: clean data collection, clear definitions, controlled confounding factors, and reliable comparability of observations.

Fig. 01: Likelihood function based on a simple binomial example

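Fig. 01 shows a simple binomial example with n=10 observations and x=7 successes. The likelihood function

$$L(\theta) = \binom{10}{7}\,\theta^{7}(1-\theta)^{3}$$

reaches its maximum at

$$\hat{\theta} = \frac{x}{n} = \frac{7}{10} = 0.7.$$

This is the maximum likelihood estimator.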
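The binomial example can be checked numerically. The following is a minimal sketch, assuming a simple grid search over the success probability rather than any particular optimizer. The names `binomial_likelihood` and `grid_mle` are illustrative.

```python
from math import comb


def binomial_likelihood(theta: float, x: int, n: int) -> float:
    """Likelihood of success probability theta given x successes in n trials."""
    return comb(n, x) * theta**x * (1.0 - theta) ** (n - x)


def grid_mle(x: int, n: int, steps: int = 1000) -> float:
    """Return the grid point in [0, 1] at which the likelihood is largest."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda theta: binomial_likelihood(theta, x, n))


if __name__ == "__main__":
    theta_hat = grid_mle(x=7, n=10)
    print(theta_hat)                              # location of the grid maximum
    print(binomial_likelihood(theta_hat, 7, 10))  # likelihood at the maximum
```

For n=10 and x=7 the grid maximum lies at 0.7, which agrees with the closed-form estimator x/n. A grid search is crude, but it makes the point visible: the likelihood is evaluated for many candidate parameter values against one fixed set of observed data.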

Fig. 02 illustrates a methodologically more important point: the same model can lead to very different likelihood curves if the data change. For a representative sample with x=55 successes from n=100 observations, the maximum is at 0.55. For a biased or selective sample with x=70 successes from n=100 observations, the maximum shifts to 0.70. The likelihood function thus does exactly what it is supposed to do: it fits the model to the data. But it cannot detect whether the data themselves are biased or methodologically questionable.
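A short sketch shows why the fitted curve gives no warning. It computes, for both samples, the estimator x/n and an approximate standard error derived from the Fisher information of the binomial model, I(θ) = n / (θ(1−θ)). The function name is illustrative, and the sketch assumes 0 < x < n.

```python
from math import sqrt


def mle_and_standard_error(x: int, n: int) -> tuple[float, float]:
    """Binomial MLE x/n and its approximate standard error (requires 0 < x < n).

    The standard error is 1 / sqrt(I(theta_hat)), with the Fisher information
    I(theta) = n / (theta * (1 - theta)) evaluated at the MLE.
    """
    theta_hat = x / n
    fisher_information = n / (theta_hat * (1.0 - theta_hat))
    return theta_hat, sqrt(1.0 / fisher_information)


if __name__ == "__main__":
    for label, x in [("representative", 55), ("selective", 70)]:
        theta_hat, se = mle_and_standard_error(x, n=100)
        print(f"{label:>14}: theta_hat = {theta_hat:.2f}, se = {se:.4f}")
```

Under these assumptions the selective sample yields a slightly smaller standard error (about 0.046 versus about 0.050). Judged by the curve alone, the biased estimate looks marginally more precise. The warning has to come from knowledge of how the data were collected, not from the fit.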

Fig. 02: Same model – different data

The Year 1924: A New Grammar of Distributions

A particularly elegant expression of this line of thought can be found in Fisher's 1924 paper "On a Distribution Yielding the Error Functions of Several Well-Known Statistics" [Fisher 1924]. There, he placed Karl Pearson's chi-square distribution and Student's t-distribution in the same probabilistic context as the normal distribution and the z-distribution, from which the modern F-distribution later emerged. What seems self-evident to many users today was, at the time, a profound reorganization of statistical thinking.

Fisher thus demonstrated that many of the well-known testing and estimation problems need not be understood as a loose collection of individual tricks, but rather as a coherent system of distributions, transformations, and inference rules. This unification was far more than mathematical elegance. It created a mathematical grammar in which researchers could speak with greater clarity about variance, fit, significance, and model comparison.
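The best-known links in this system can be stated compactly. The notation below follows modern textbook convention rather than Fisher's original paper, which is an assumption of this summary; ν, ν₁, and ν₂ denote degrees of freedom.

```latex
\begin{aligned}
F &= \frac{\chi^{2}_{\nu_1}/\nu_1}{\chi^{2}_{\nu_2}/\nu_2}
  \quad\text{for independent } \chi^{2}_{\nu_1},\ \chi^{2}_{\nu_2}, \\
z &= \tfrac{1}{2}\ln F, \\
t_{\nu}^{2} &\sim F_{1,\nu}, \\
F_{\nu_1,\nu_2} &\xrightarrow{\;d\;} \frac{\chi^{2}_{\nu_1}}{\nu_1}
  \quad\text{as } \nu_2 \to \infty, \\
t_{\nu} &\xrightarrow{\;d\;} N(0,1)
  \quad\text{as } \nu \to \infty.
\end{aligned}
```

Read this way, the chi-square, t, and normal distributions appear as special or limiting cases of one family, which is the sense in which the separate tests form a single system.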

At this point it also becomes clear why Fisher is so often misunderstood. He was not merely a man of the p-value or of test statistics. He was the architect of a coherent probabilistic framework in which model, data, test, and estimation problem depend on one another. His theory of distributions was not an ornament, but the scaffolding on which modern applied statistics rested for a long time.

Rothamsted: Where Statistics Became Operational

Fisher became famous not least because of the Rothamsted Experimental Station, the agricultural research station where he worked starting in 1919. There, his theory encountered a reality that cared little for aesthetic formulas: fields differed in soil quality, weather effects interfered with fertilizer trials, measurements were flawed, and biological materials varied. It was precisely under such conditions that Fisher's doctrine of randomization, replication, and block design emerged. He recognized that experiments do not simply contain information, but only release it through a clever arrangement.
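How randomization and blocking work together can be sketched for a randomized complete block design. This is a minimal illustration, not a reconstruction of an actual Rothamsted trial: the field names, the treatment names, and the function `randomized_block_design` are all hypothetical.

```python
import random
from typing import Optional


def randomized_block_design(
    blocks: list[str], treatments: list[str], seed: Optional[int] = None
) -> dict[str, list[str]]:
    """Assign every treatment once per block, in an independently shuffled order."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)  # copy so the caller's list is untouched
        rng.shuffle(order)        # random plot order within this block
        layout[block] = order
    return layout


if __name__ == "__main__":
    layout = randomized_block_design(
        blocks=["field_A", "field_B", "field_C"],        # hypothetical blocks
        treatments=["control", "fertilizer_1", "fertilizer_2"],
        seed=1919,
    )
    for block, order in layout.items():
        print(block, order)
```

Each block, for example a strip of land with similar soil, receives every treatment exactly once, so differences between blocks cannot masquerade as treatment effects. The random order within a block guards against systematic patterns such as a fertility gradient along the field.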

That sounds obvious today, but it certainly wasn't back then. Before Fisher, experiments were often treated as if their results could be statistically "rescued" after the fact. Fisher turned the approach on its head. It is not the analysis that cures poor design; rather, the design determines which analysis makes sense in the first place. In doing so, he created a logic that extends far beyond agriculture—into medicine, industry, psychology, biology, and indeed into today's risk management.

Why Good Models Cannot Save Bad Data

This idea can be illustrated with a practical example. A company wants to model the risk of supplier failures. It collects historical incidents, estimates probabilities and frequencies, builds regression models, generates loss scenarios, and derives early warning indicators. On paper, everything looks sound. But what if the dataset only includes failures that were formally reported, while near-misses resolved informally are missing? What if classifications have changed over the years? What if country risks, second-round effects, or political escalation paths are not visible in the historical data at all? Then the model may calculate precisely, but on an unsound data foundation.
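The effect of such a reporting filter can be made concrete with a toy dataset. Everything in the sketch below is hypothetical: the record type `SupplierYear`, the ten records, and the assumption that only one of three incidents was formally reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupplierYear:
    supplier: str
    had_incident: bool        # a failure or near-miss occurred
    formally_reported: bool   # the incident made it into the official dataset


def incident_rate(records: list[SupplierYear], reported_only: bool) -> float:
    """Share of supplier-years with an incident, optionally counting only reported ones."""
    hits = sum(
        1 for r in records
        if r.had_incident and (r.formally_reported or not reported_only)
    )
    return hits / len(records)


# Hypothetical records: 3 of 10 supplier-years had an incident, only 1 was reported.
RECORDS = [
    SupplierYear("S01", True, True),
    SupplierYear("S02", True, False),
    SupplierYear("S03", True, False),
] + [SupplierYear(f"S{i:02d}", False, False) for i in range(4, 11)]

if __name__ == "__main__":
    print(incident_rate(RECORDS, reported_only=True))   # 0.1
    print(incident_rate(RECORDS, reported_only=False))  # 0.3
```

Any model fitted to the reported cases alone starts from a rate of 0.1 instead of 0.3. In a real dataset the second number is unknown, which is exactly the problem: no amount of downstream modelling recovers incidents that were never recorded.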

This is exactly where Fisher would have started. He would not have asked first about the calculation method, but about the construction of the evidence. How were the cases defined? Which units are even comparable? Which confounding factors distort the findings? Which selection mechanisms determine what appears in the dataset and what remains invisible? A risk model that ignores these preliminary questions does not gain in truth through mathematical sophistication. It gains only in apparent precision.

This is an uncomfortable insight because it destroys the comfort of technical complexity. Those who deploy enough variables, enough computing power, and enough visualization would like to believe they have intellectually mastered risk. Fisher reminds us that the quality of the decision begins long before the formula. It begins in the definition of the problem, in the quality of the observation, in the representativeness of the data, and in the control of those influences that distort the picture unnoticed.

Fisher in Risk Management

This is precisely why Fisher remains so relevant to risk management today. In many organizations, risks are still quantified in a way that appears methodologically sound but remains empirically fragile. Operational losses are classified differently, near misses are not recorded consistently, project risks are smoothed in calculations under political pressure, cyber incidents are only partially reported, and historical time series are mixed with expert-based scenario assessments without clearly identifying the differences in the state of knowledge. The result is metrics that appear formally correct yet may still be analytically weak.

Fisher's thinking helps here in two ways. First, it reminds us that data quality is not a secondary technical issue, but the core of any inference [see Romeike/Wieczorek 2026]. Second, it shows that model selection, test logic, and decision rules are only robust if the empirical formulation of the problem remains transparent. A Value-at-Risk (VaR) or Expected Shortfall (ES) figure, a loss distribution, a stress test, or a scoring model is not good simply because it is mathematically sophisticated. It is good if it is based on data obtained under controlled, traceable, and substantively meaningful conditions.
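How strongly such metrics depend on what was recorded can be shown with a small sketch. It assumes one common empirical convention (VaR as an order statistic of the loss sample, ES as the mean of losses at or beyond VaR) and uses purely hypothetical loss figures. Neither the convention nor the numbers come from a specific regulatory standard.

```python
import math


def historical_var(losses: list[float], level: float = 0.95) -> float:
    """Empirical Value-at-Risk: smallest observed loss l with P(L <= l) >= level."""
    ordered = sorted(losses)
    # round() guards against floating-point noise before taking the ceiling
    index = math.ceil(round(level * len(ordered), 9)) - 1
    return ordered[max(index, 0)]


def expected_shortfall(losses: list[float], level: float = 0.95) -> float:
    """Mean of the observed losses at or beyond the empirical VaR."""
    var = historical_var(losses, level)
    tail = [loss for loss in losses if loss >= var]
    return sum(tail) / len(tail)


# Hypothetical losses 1..20; the "recorded" set lacks the two largest events.
FULL = [float(v) for v in range(1, 21)]
RECORDED = FULL[:-2]

if __name__ == "__main__":
    for label, data in [("full", FULL), ("recorded", RECORDED)]:
        print(label, historical_var(data, 0.9), expected_shortfall(data, 0.9))
```

At the 90 % level the full sample gives a VaR of 18 and an ES of 19, while the sample without the two largest events gives 17 and 17.5. Both calculations are formally correct. Only one of them describes the losses that actually occurred.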

It is precisely in this sense that Fisher's work can be read as an early lesson in methodological humility. Models are necessary. But they must never obscure the fact that every number has a history—and that this history is often more important than the third decimal place.

A Controversial Figure—and His Lasting Legacy

Fisher was not only scientifically productive but also combative. His disputes with contemporaries, his sharp judgments, and his intellectual intransigence were part of his style. This did not always make him pleasant, but it was often extraordinarily effective. He wanted statistics to be understood not as a convention but as a discipline. In a certain sense, he was a purist of evidence.

Perhaps this also explains why his work remains so vibrant today. It demands not only technical mastery but also a certain attitude: skepticism toward comfortable certainty, precision in observation, clarity in assumptions, and methodological rigor in the construction of data. Anyone who takes this attitude seriously quickly realizes that Fisher's most important message is not encapsulated in the F-test, nor in likelihood, nor in any single formula. Rather, it is this: one can only draw reasonable conclusions from data if one takes the path to that data just as seriously as its mathematical treatment.

Conclusion: The Path Back to the Cup of Tea

In the end, the path leads back to the cup of tea. The scene is so memorable because it shows in miniature what Fisher was always concerned with. A claim is easily made, a model quickly calculated, a metric swiftly produced. But whether this becomes knowledge is decided earlier: in the design of the experiment, in the selection of the material, in the clarity of the comparative logic, and in the honesty regarding one's own ignorance.

Ronald A. Fisher gave statistics a new rigor. Not the rigor of cynicism, but the rigor of methodological discipline. Anyone who wants to understand risks needs good models. But good models alone are not enough. They can only be reliable if the data on which they rest were constructed with the same care with which they are later analyzed. This is where Fisher's enduring relevance lies—in the sober, sometimes uncomfortable insight that scientific precision does not begin with the formula, but long before it.

Bibliography and further reading

Box, Joan Fisher (1978): R. A. Fisher: The Life of a Scientist, Wiley, New York 1978.
Dawkins, Richard (1995): River out of Eden: A Darwinian View of Life, Weidenfeld & Nicolson, London 1995.
Fisher, Ronald A. (1915): Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population. In: Biometrika, 10(4), pp. 507–521.
Fisher, Ronald A. (1918): The Correlation Between Relatives on the Supposition of Mendelian Inheritance. In: Transactions of the Royal Society of Edinburgh, 52, pp. 399–433.
Fisher, Ronald A. (1921): On the Probable Error of a Coefficient of Correlation Deduced from a Small Sample. In: Metron, 1, pp. 3–32.
Fisher, Ronald A. (1922): On the Mathematical Foundations of Theoretical Statistics. In: Philosophical Transactions of the Royal Society A, 222, pp. 309–368.
Fisher, Ronald A. (1924): On a Distribution Yielding the Error Functions of Several Well-Known Statistics. In: Proceedings of the International Congress of Mathematics, Toronto, 2, pp. 805–813.
Fisher, Ronald A. (1925): Theory of Statistical Estimation. In: Mathematical Proceedings of the Cambridge Philosophical Society, 22(5), pp. 700–725.
Fisher, Ronald A. (1935): The Design of Experiments, Oliver and Boyd, Edinburgh 1935.
Fisher, Ronald A. (1956): Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh 1956.
Hald, Anders (1998): A History of Mathematical Statistics from 1750 to 1930, Wiley, New York 1998.
Hald, Anders (2007): A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935, Springer, New York 2007.
Romeike, Frank/Wieczorek, Gabriele (2026): Data Analytics in Risk Management – Descriptive Analytics – Diagnostic Analytics – Predictive Analytics, Springer Verlag, Wiesbaden 2026.
