Putting the methods you use into context
It may come as a surprise, but the way you were probably taught statistics during your undergraduate years is not the only way statistics is done. There are a number of different ways of thinking about and doing statistics. It might be disconcerting to learn that there is no consensus amongst statisticians about what a probability even is, for example (a subjective degree of belief, or an objective long-run frequency?). Typically, scientists are exposed only to the frequentist school, which has been criticised on a number of grounds (discussed briefly below), and this is a major shortcoming of standard science education. Not knowing the big picture about other schools of thought, methods of analysis, or ways of interpreting evidence is a serious limitation for anyone who conducts experiments and interprets the results. The various schools and their associated methods have their own advantages and disadvantages, and it is important to know the limitations of the methods that one uses. A comparison can be made with having been educated only in the use of light microscopes, without even being aware that confocal or electron microscopes exist and what they can be used for. The diagram below shows the relationship between the different schools; the descriptions are meant to be brief overviews and thus ignore many subtleties.
Bayesian methods are arguably the oldest; the Rev. Thomas Bayes published his theorem (posthumously) in 1764. Bayesian methods fell out of favour at the beginning of the 20th century as frequentist methods took over, but saw a resurgence once computers became widely available, as many Bayesian analyses are not easy to do "by hand" and rely on computer simulations. The basic idea behind Bayesian statistics is that you combine the results of an experiment (expressed as a likelihood, see below) with some prior information (for example, the results of a previous experiment) to get what is called a posterior probability. The controversial part of Bayesian methodology is the prior information (which is expressed as a distribution and often just referred to as "a prior"), because two people may have different prior knowledge or information and thus end up with different results. In other words, individual knowledge (or subjective belief) is combined with the experimental data to produce the final result. Probabilities are therefore viewed as personal beliefs, which naturally differ between individuals. This doesn't sit well with many scientists because they want the data "to speak for themselves": the data should give only one answer, not as many answers as there are people analysing them! It should be noted that there are also methods where the prior is derived from the data itself; these are referred to as empirical Bayesian methods and do not have the problem of subjectivity. In addition, it is possible to specify vague or non-informative priors, which do allow the data to speak for themselves. Bayesians would also argue that in many practical situations the prior has little influence on the final result, especially when the sample size is large. One advantage of Bayesian methods is that they can include other relevant information and are therefore useful for integrating the results of many experiments. In addition, the results of a Bayesian analysis are usually what scientists actually want: the posterior probability of a hypothesis given the data is the quantity that p-values and confidence intervals are commonly (and incorrectly) interpreted as providing.
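To make the prior-to-posterior step concrete, here is a minimal sketch (in Python, with made-up numbers) of Bayesian updating for a coin-flip experiment. It uses the conjugate Beta-binomial pairing so the arithmetic stays simple; the specific priors and counts are purely illustrative:

```python
# Bayesian updating with a conjugate Beta prior for a binomial experiment.
# A Beta(a, b) prior combined with k successes in n trials gives a
# Beta(a + k, b + n - k) posterior.

def beta_binomial_update(a, b, k, n):
    """Return the posterior Beta parameters after k successes in n trials."""
    return a + k, b + (n - k)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Vague (uniform) prior, Beta(1, 1): "letting the data speak for themselves".
# Hypothetical experiment: 7 successes in 10 trials.
a1, b1 = beta_binomial_update(1, 1, 7, 10)
print(beta_mean(a1, b1))  # posterior mean under the uniform prior

# An informative prior (say Beta(20, 20), as if from a previous experiment)
# pulls the posterior back towards 0.5; its influence shrinks as n grows.
a2, b2 = beta_binomial_update(20, 20, 7, 10)
print(beta_mean(a2, b2))  # posterior mean under the informative prior
```

With the uniform prior the posterior mean is 8/12 (about 0.67); with the informative prior it is 27/50 (0.54). Rerunning the same update with a larger hypothetical sample shows the two answers converging, which is the "large sample size" argument in the paragraph above.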
The frequentist or classical school is associated with Sir Ronald Fisher (who gave us the null hypothesis and the p-value as evidence against the null) and with Neyman and Pearson (who gave us Type I and II errors, statistical power, alternative hypotheses, and the decision to reject or not reject based on an alpha level). Fisher and Neyman/Pearson had different views on statistics, and what is taught today in the typical "statistics for scientists" course is a jumbled combination of both, which can lead to much confusion (Hubbard & Bayarri, 2003; Christensen, 2005). The procedure is familiar:
- Start with a null hypothesis (H0), which states that there is no effect or relationship
- State an alternative hypothesis (H1), which could be simply that H0 is not true (i.e. there is some effect or relationship)
- Choose an alpha-level (typically the magic 0.05)
- Perform the statistical test and calculate a p-value
- If the p-value is smaller than the alpha-level, call the results "significant"
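The steps above can be sketched in a few lines of Python. The example below uses a one-sample z-test (population standard deviation assumed known) purely because it needs nothing beyond the standard library; the data, the null value, and sigma are invented for illustration:

```python
from statistics import NormalDist, mean

def one_sample_z_test(data, mu0, sigma):
    """Two-sided one-sample z-test (population sigma assumed known).
    Returns the z statistic and the p-value."""
    n = len(data)
    z = (mean(data) - mu0) / (sigma / n ** 0.5)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical measurements; H0: the population mean is 100, sigma taken as 15.
data = [108, 112, 96, 104, 110, 101, 99, 115]
alpha = 0.05  # the "magic" threshold

z, p = one_sample_z_test(data, mu0=100, sigma=15)
print(f"z = {z:.2f}, p = {p:.3f}, 'significant': {p < alpha}")
```

Note that the only output of the procedure is a p-value compared against alpha: a binary "significant / not significant" verdict, which is exactly the yes/no thinking criticised below.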
This approach, often referred to as hypothesis testing or null hypothesis significance testing (NHST), has been criticised for a number of reasons (Cohen, 1994; Loftus, 1996; Kline, 2004; Gigerenzer, 2004; Goodman, 2008; Stang et al., 2010). A brief list includes:
- The p-value doesn't tell scientists what they want (it is the probability of the data given that H0 is true, and scientists would like the probability of H0 or H1 given the data)
- H0 is often known to be false
- P-values are widely misunderstood
- Leads to binary yes/no thinking
- Prior information is never taken into account (Bayesian argument)
- A small p-value could reflect a very large sample size rather than a meaningful difference
- Leads to publication bias, because significant results (i.e. p < 0.05) are more likely to be published
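The sample-size criticism is easy to demonstrate numerically. In the sketch below (invented numbers, a z-test with sigma assumed known), the observed mean difference is held fixed at a practically trivial 0.1 units, and only the assumed sample size changes:

```python
from statistics import NormalDist

def z_test_p(effect, sigma, n):
    """Two-sided p-value for a fixed observed mean difference `effect`,
    with known sigma and sample size n."""
    z = effect / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny difference (0.1 units, sigma = 10) goes from clearly
# "non-significant" to "significant" purely by increasing n.
for n in (100, 10_000, 100_000):
    print(n, round(z_test_p(effect=0.1, sigma=10, n=n), 4))
```

At n = 100 the p-value is above 0.9; at n = 100,000 it drops below 0.05, even though the difference itself never changed and may be of no practical importance.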
The third school is the information-theoretic approach, which builds on earlier work by Akaike, Kullback, Leibler and others, and has recently been popularised by Burnham and Anderson (2002; and Anderson, 2008). These methods are referred to as "information-theoretic" because they are based on concepts from information theory (and thermodynamics), entropy in particular. The basic idea is to compare alternative models according to how well they capture the information in the data, which can be determined by the amount of unexplained variation (i.e. the residual sum of squares). This is instead expressed as a likelihood: the higher the likelihood, the better the fit. However, more complex models will fit the data better than simpler models (a wiggly line has greater freedom to approximate the data points than a straight line), and therefore a trade-off has to be made between fit (likelihood) and complexity (the number of parameters in the model). This is done using various "information criteria" such as Akaike's Information Criterion (AIC), defined as AIC = -2*log(Likelihood) + 2K, where K is the number of parameters in the model. The log-likelihood is multiplied by -2, so a model with a high likelihood (good fit) makes the AIC smaller, while the 2K term means that a more complex model makes the AIC larger; the trade-off between goodness-of-fit and complexity can be seen directly in the formula. The AIC values of the candidate models are compared and the model with the lowest AIC is considered the best (no significance tests are done). This approach is typically used when there are many independent variables, the sample size is small, and there is little theory to suggest which variables should be included in the analysis. Data of this type are often found in the social sciences, epidemiology and ecology, and so far are uncommon in laboratory experiments. Anderson (2008) provides a very readable introduction.
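As a sketch of how an AIC comparison works in practice: for least-squares fits with Gaussian errors, -2*log(Likelihood) reduces (up to an additive constant that cancels when comparing models) to n*ln(RSS/n), so AIC can be computed directly from the residual sum of squares. The data below are invented, and the parameter counts follow the common convention of including the error variance:

```python
from math import log

def aic_gaussian(rss, n, k):
    """AIC for a least-squares fit with Gaussian errors, where
    -2*log(Likelihood) reduces (up to a shared constant) to n*ln(RSS/n).
    k counts the fitted parameters, including the error variance."""
    return n * log(rss / n) + 2 * k

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical data with a roughly linear trend plus noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(xs)

# Model 1: intercept only (k = 2: mean + variance).
mean_y = sum(ys) / n
rss1 = sum((y - mean_y) ** 2 for y in ys)

# Model 2: straight line (k = 3: intercept, slope, variance).
a, b = fit_line(xs, ys)
rss2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

aic1 = aic_gaussian(rss1, n, k=2)
aic2 = aic_gaussian(rss2, n, k=3)
print(f"intercept-only AIC: {aic1:.1f}, straight-line AIC: {aic2:.1f}")
# The line wins (lower AIC): its one extra parameter buys a large drop in RSS.
```

If the extra parameter bought only a trivial reduction in RSS, the 2K penalty would leave the simpler model with the lower AIC, which is the fit-versus-complexity trade-off described above.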
As can be seen from the figure, likelihood is a common (and perhaps unifying) aspect of the previous methods. The term was coined by Fisher in the 1920s, and he developed the method of maximum likelihood (ML), which underpins much of classical statistics. It is also the uncontroversial part of Bayesian statistics, and, as a measure of model fit, it is central to the information-theoretic approach. A likelihood is similar to a probability, but the area under a likelihood curve does not sum to one as it does for a probability density. A useful feature of this approach is that it treats the data as fixed (rather than as a random variable), and the likelihoods of two different models can be compared. For example, the null model assumes that the difference between the means of two groups is zero, while an alternative model uses the observed difference between the means. One can then calculate how much more likely one model (or hypothesis) is relative to the other, given the data. This is what scientists want: you do an experiment, collect the data (which is now fixed and no longer treated as a random variable), and you want to know which hypothesis is better supported by the data, the null hypothesis of "no difference" or the alternative. This is NOT what a traditional p-value gives you. Two models can be compared by taking the ratio of their likelihoods (the natural log of the likelihoods is used in practice because it makes the computations easier), and a test of significance can be performed. A recent study has shown that medical students' interpretations of results are more compatible with likelihood ratios than with p-values (Perneger & Courvoisier, 2010). In other words, likelihood ratios are a more natural way of understanding evidence. A short introduction can be found in Goodman & Royall (1988), and a longer (and slightly more mathematical) treatment in Royall (1997). Some of the advantages of working with likelihoods are:
- It is central to almost all of statistics
- It treats the data as fixed (once the experiment is complete, the data ARE fixed)
- Allows one to compare hypotheses given the data (this is of direct interest to scientists)
- Captures the evidence in the data
- Likelihoods can be easily combined, for example from two independent studies
- Prior information can easily be included (Bayesian analysis)
- Seems to be the way we normally think (Perneger & Courvoisier, 2010)
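A minimal sketch of the two-model comparison described above, assuming normally distributed data with a known standard deviation (the numbers are invented): the null model fixes the mean difference at zero, the alternative uses the observed mean, and the ratio of their likelihoods says how many times better the alternative explains the data:

```python
from math import exp, log
from statistics import NormalDist, mean

def log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a Normal(mu, sigma) model."""
    d = NormalDist(mu, sigma)
    return sum(log(d.pdf(x)) for x in data)

# Hypothetical treatment-minus-control differences; sigma assumed known.
diffs = [1.2, 0.8, 1.5, 0.3, 1.1, 0.9, 1.4, 0.6]
sigma = 1.0

ll_null = log_likelihood(diffs, mu=0.0, sigma=sigma)         # H0: no difference
ll_alt = log_likelihood(diffs, mu=mean(diffs), sigma=sigma)  # H1: observed mean

# Working on the log scale and exponentiating the difference is the usual
# computational shortcut mentioned in the text.
lr = exp(ll_alt - ll_null)
print(f"likelihood ratio (H1 vs H0): {lr:.1f}")
```

The resulting ratio is a direct statement about the relative support the (fixed) data give the two hypotheses, which is the quantity the text argues scientists actually want, rather than a p-value computed under H0 alone.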
This article provided only a very basic discussion of various methods and schools of thought and much has been omitted, although hopefully not to the point of providing only caricatures. More information can be found in the references below, which do not require advanced statistical knowledge to understand, and are highly recommended. Knowing the big picture allows you to reflect on the methods you use, and ask whether they are appropriate for the task. It is also a useful antidote to some of the dogmatism associated with statistical analyses (you don't have to do something one way just because that's how you saw others do it).
Anderson DR (2008). Model Based Inference in the Life Sciences: A Primer on Evidence. Springer: New York, NY. [Amazon]
Burnham KP, Anderson DR (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer: New York, NY. [Amazon]
Christensen R (2005). Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician 59(2):121–126. [PDF]
Cohen J (1994). The earth is round (p < .05). American Psychologist 49(12):997–1003. [PDF]
Gigerenzer G (2004). Mindless statistics. J Socio-Econom 33:587–606. [PDF]
Goodman S (2008). A dirty dozen: twelve p-value misconceptions. Semin Hematol 45(3):135–140. [Pubmed]
Goodman SN, Royall R (1988). Evidence and scientific research. American Journal of Public Health 78(12):1568–1574. [Pubmed]
Hubbard R, Bayarri MJ (2003). Confusion over measures of evidence (p's) versus errors (alpha's) in classical statistical testing. The American Statistician 57(3):171–178. [PDF]
Kline RB (2004). Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. APA: Washington, DC. [Amazon]
Loftus GR (1996). Psychology will be a much better science when we change the way we analyze data. Curr Dir Psychol Sci 5:161–171. [PDF]
Perneger TV, Courvoisier DS (2010). Interpretation of evidence in data by untrained medical students: a scenario-based study. BMC Med Res Methodol 10:78. [Pubmed]
Royall R (1997). Statistical Evidence: A Likelihood Paradigm. Chapman & Hall/CRC: Boca Raton, FL. [Amazon]
Stang A, Poole C, Kuss O (2010). The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 25:225–230. [Pubmed]