Miscellaneous – Weev.Blog

Yesterday on NPR I heard a story about medicine, treatment, and “Big Data” where someone said:

…the scientific method itself is growing obsolete.

I’ve heard this same sentiment from some of the statisticians where I work. Before I get into why it’s wrong, let’s get some definitions. First, the scientific method, from Dictionary.com:

a method of research in which a problem is identified, relevant data are gathered, a hypothesis is formulated from these data, and the hypothesis is empirically tested.

And now the definition of statistics, again from Dictionary.com:

the science that deals with the collection, classification, analysis, and interpretation of numerical facts or data, and that, by use of mathematical theories of probability, imposes order and regularity on aggregates of more or less disparate elements.

Time to cherry-pick parts of those definitions to support the point I’m trying to make: the scientific method provides empirical evidence for a hypothesis via testing, whereas statistics interprets data using probability. In short: statistics tells you whether or not something might be true, while the scientific method (experimentation) tells you whether or not something is true.

That’s probably oversimplifying, since even within science there are levels of certainty – we know things are probably true or probably not true based on the data generated from experimentation. And that shows that experimentation and statistics aren’t separate things: they rely on each other. But I’ve been hearing some people (people I would assume are normally intelligent) forget that statistics deals with probability, not certainty, seemingly assuming the data we have is all the data we need.

Where I work, someone actually said: “We can know whether or not a test will pass without ever running the test!”, as if statistics were some sort of magic crystal ball. A colleague corrected them, saying the analysis would indicate only that a test would probably pass or probably fail – the only way to know a test would pass is to run the test. Imagine buying a product on which no testing was performed, one designed entirely via a statistical analysis of already-collected data. I’d much rather someone do some testing first to make sure that product is safe before it’s sold to me. That’s the only way to be certain it’s safe. (Or at least mostly safe.)
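To make the "probably, not certainly" point concrete, here is a minimal sketch in Python. It assumes a made-up history of test runs (the counts are hypothetical, not data from my workplace) and uses the Wilson score interval, one common way to put a range around an observed pass rate. The function name and numbers are mine, chosen only for illustration.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Approximate 95% Wilson score interval for the true pass rate.

    passes: number of recorded runs that passed
    runs:   total number of recorded runs
    z:      normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if runs <= 0:
        raise ValueError("need at least one recorded run")
    p_hat = passes / runs
    denom = 1 + z**2 / runs
    centre = (p_hat + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical history: the test passed in all 50 of the last 50 runs.
low, high = wilson_interval(passes=50, runs=50)
print(f"Plausible pass rate for the next run: {low:.1%} to {high:.1%}")
```

Even with a perfect record of 50 passes out of 50, the lower end of the interval sits at roughly 93%, not 100%. The history says the next run will probably pass; only running it says whether it does.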

The Merriam-Webster encyclopedia definition of statistics says it very well:

Statistics provides ways to design efficient experiments that eliminate time-consuming trial and error.

And there it is: statistics (data analysis) doesn’t eliminate the need for experimentation or make it obsolete; statistics helps experimentation by weeding out poor hypotheses and by lending support to good ones. Without the experiments, you don’t get any more data. Thinking we have all the data we need and don’t need to collect any more is a dangerous idea. Experimentation is vital for statistics. Likewise, statistics is vital for experimentation, because results are only useful if we can interpret them. If we all keep in mind that statistics can guide experimentation but not replace it, I think we’re on the right track.