I was recently trying to explain Bayesian logic to a friend, and came up with the following analogy. I would be interested to hear feedback on it.
---
Imagine everyone in the USA gets sudden amnesia. We want to find out who the President is, but no one can remember.
A scientist comes up with a test to determine if someone is the President.
If they are the President, there is a 100% chance the test will say they are the President and a 0% chance the test will say they are not the President.
If they are not the President, there is a 99.999% chance the test will say they are not the President, and a 0.001% chance the test will falsely say they are the President.
Giving the test to the person sitting in the big chair in the Oval Office is useful, because it's already quite likely this person is the President. If the test is positive for Presidency, it's extremely likely that person is the president.
Giving the test to the 10 people nearest the Oval Office is useful, because it's fairly likely the President is one of these people. A positive result will indicate strongly that that person is the President, and if no one in that group is actually the President, there's a 99.99% chance the test will say so.
Giving the test to the 1000 people in the White House is pretty useful, because it's pretty likely the President is in the White House, and if none of these people are the president, there's still a 99% chance the test will be correct. A positive result for any one person will indicate quite strongly that that person is the President.
But giving the test to everyone in America is not very useful at all, because it's very unlikely that any particular person is the President, and we can expect the test will give a positive result for around 3200 people. For any particular person in this group, it's much more likely they're not the President than they are.
---
Is this a broadly correct, if non-rigorous, analogy? I realize most HNers will be much more familiar with this stuff than I am; I'm interested chiefly in whether or not I misled my friend.
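The figures in the analogy can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a false positive rate of 0.001% (1e-5) per test, independent tests, and a population of about 314 million (the figure used elsewhere in this thread); the function names are made up for illustration.

```python
FALSE_POSITIVE_RATE = 1e-5  # 0.001% chance that a non-President tests positive


def prob_no_false_positive(n_people, fp_rate=FALSE_POSITIVE_RATE):
    """Chance that n non-Presidents all correctly test negative."""
    return (1.0 - fp_rate) ** n_people


def expected_false_positives(n_people, fp_rate=FALSE_POSITIVE_RATE):
    """Average number of non-Presidents who falsely test positive."""
    return n_people * fp_rate


for n in (1, 10, 1000):
    print(n, "people:", round(prob_no_false_positive(n) * 100, 3), "% all negative")

# Assumed population of roughly 314 million.
print("expected false positives nationwide:", expected_false_positives(314_000_000))
```

Under these assumptions the group-level figures come out at roughly 99.99% for 10 people and about 99.0% for 1000, with around 3140 expected false positives nationwide, which is in line with the post's "around 3200".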
Right. And the frequentist version is also useful.
If you test one person and the test is positive, then that person is the president (p=0.00001).
If you test a thousand people, and the test is positive for one of them, then that person is the president (p=0.01).
So you don't really need Bayesian logic to reason that you should test fewer people if you want a more significant result. (Note I'm not saying you don't need Bayes' Theorem, which everyone uses.)
Edit: I think most people on HN get their knowledge of frequentist and Bayesian statistics from XKCD #1132. That's sad.
> So you don't really need Bayesian logic to reason that you should test fewer people if you want a more significant result.
That's misleading because of the word "significant", which apparently means something different in frequentism than it does in normal speech. I certainly wouldn't use "significant" in that way as a non-frequentist; I would instead rephrase what you said as:
> So you don't really need Bayesian logic to reason that you should test fewer people if you want more confirmation bias in your result.
And that's a statement I can definitely get behind!
It can definitely be used misleadingly, but it's not too out of line with normal scientific usage of the term. The "significant" in "significant figures" is the same: if a number has "11 significant figures", it doesn't mean the 11th digit is significant in the sense of being important or having a big impact, just that the 11th digit is within the measurement precision (as propagated through any subsequent calculations).
That's a good point. Conversely, when you run a test and get a result which is "not statistically significant", that doesn't mean that the measured effect is not significant. For example, if I test a drug and say "it does not cause cancer (p=0.17)", that's wrong. Yes, the correlation with cancer is not statistically significant, but it is significant in the normal sense of the word.
The point is not the number of people that are tested, but the prior probability that the person tested is the president. The person wandering around the White House already has a decent probability of being president. Some guy found in the middle of the country wearing everyday clothes, not so much.
>If you test one person and the test is positive, then that person is the president (p=0.00001).
>If you test a thousand people, and the test is positive for one of them, then that person is the president (p=0.01).
I am frequently a frequentist, but this particular test doesn't make sense to me.
Without any prior belief, the person you tested is still no more likely to be the president than any of the other 3199 who would test positive.
If you test one person at random and the test is positive, the probability that this person is the President equals the probability of a true positive divided by the probability that the test gives a positive result (i.e. the sum of the probabilities of true and false positives). The probability of a true positive is the prior probability that the test subject is the President times the conditional probability that the subject tests positive given that they are the President. If the subject is chosen at random from the population of the US, this is about 1/314M times a conditional probability of testing positive of 1. The probability of a false positive is the prior probability that a random person is not the President, times the conditional probability of a non-President testing positive. By hypothesis, this is (1-1/314M)*0.00001. So the probability that a randomly chosen member of the US population who tests positive is the President must be approximately (1/314M)/((1/314M)+(1-1/314M)*0.00001), which simplifies to 1/(1+(314M-1)*0.00001), or about 1/3141.
The ~99.97% chance that the test result is a false positive for a subject chosen at random is a consequence of the prior probability of that subject being the President in the first place being over 3000 times lower than the probability of a positive test result. In other words, though the sensitivity of the test is perfect, the specificity of the test is insufficient to isolate a condition as rare as the Presidency. It has nothing to do with testing one person being inherently more decisive than testing more than one.
Only narrowing the pool of candidates in a way which will increase the prior probability that the subject is the President will improve the situation, not random decimation to a single test subject. The GP gave several correct examples such as limiting the test to those who were much more likely to be the President in the first place (eg. a person randomly found sitting in the President's chair at the Oval Office might have a prior probability of being the President of greater than 0.01, millions of times greater than that of a randomly selected person in the US).
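The calculation above can be sketched directly from Bayes' theorem. This is an illustrative sketch, not anyone's official method: the function name is made up, the 1/314M prior and the 0.00001 false positive rate are the thread's figures, and the 0.01 prior for the person in the President's chair is the lower bound suggested above.

```python
def posterior_president(prior, sensitivity=1.0, false_positive_rate=1e-5):
    """P(President | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1.0 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)


random_citizen = posterior_president(1 / 314_000_000)
oval_office = posterior_president(0.01)

print(f"random citizen: about 1 in {1 / random_citizen:.0f}")
print(f"person in the Oval Office chair: {oval_office:.4f}")
```

The same positive result moves a random citizen to roughly 1 in 3141, while it moves the person in the chair to about 0.999. Only the prior differs between the two calls.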
It seems awkward that p-value does not account for the setting in which the test is performed.
That is, if I test 1000 people at a high school football game in Peoria, Illinois, and one of them comes out positive (p=0.01), there is a very different likelihood that I have found the true president than if I test 1000 people at an official dinner in the White House and find a match (also p=0.01). In fact, I think I'd be willing to wager more on the chance that the individual tested in the White House is actually the president than the random football fan in Peoria, even if only a single person in Peoria was tested (p=0.00001).
Of course, this is only an issue if p-value is being used to express relative confidence in a conclusion, which it shouldn't (?) be. Still, how does a frequentist account for a choice of venue like this?
The frequentist approach would be to design an experiment, determine rules for interpreting the data, and then consider the probability that the experiment leads to the correct conclusion.
In this case, the experiment is really "who is president?" and not "is person X president?" The "who is president?" question does not really have a null hypothesis, so it makes no sense to talk about p-values. Instead, we can talk about two different results: identified the president correctly, and identified the president incorrectly.
We then have to design the experiment in such a way that probabilities can be calculated. But this is not possible if you say the repeated experiment is testing a bunch of people to see if one is president--because the probability depends on who actually is president, and that's not known.
So you consider the experiment to be "the president goes about their business and ends up in some random place, we get amnesia, and then run the test." We can simulate this experiment because we can come up with some probability distribution for where the president is. We can't come up with a probability for who the president is, because that's unknown and either 0% or 100%--only Bayesians would let you do that.
Then you can choose the order of people to test to maximize the probability that the president will be identified correctly.
And then you say things like,
* The test had a 92% chance of succeeding.
* We only had to test 14 people, and in such cases, the test had a 98% chance of succeeding.
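One possible reading of this procedure is: test people in a fixed order and declare the first positive result to be the president. Below is a minimal sketch of the resulting success probability. The location probabilities and the exaggerated 10% false positive rate are hypothetical, chosen only so that the effect of the testing order is visible.

```python
def success_probability(position_probs, false_positive_rate=1e-5):
    """Chance that the first positive result belongs to the real president.

    position_probs[k] is the assumed probability that the president is the
    person tested at position k. The test never misses the real president,
    so the procedure fails only if an earlier non-president tests positive
    or the president is not in the list at all.
    """
    return sum(
        p * (1.0 - false_positive_rate) ** k
        for k, p in enumerate(position_probs)
    )


# Hypothetical distribution over four candidates.
probs = [0.05, 0.15, 0.6, 0.2]
likely_first = sorted(probs, reverse=True)

print(success_probability(probs, false_positive_rate=0.1))
print(success_probability(likely_first, false_positive_rate=0.1))
```

Testing the most likely candidates first gives the higher success probability, because fewer non-presidents get a chance to produce a false positive before the real president is reached.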
P-values are not comparable across different null hypotheses, they are a property of a particular sample given a null hypothesis. You can compare p-values only by making very strong assumptions that the samples come from the same data-generating process. In almost all cases this is unlikely.
A frequentist usually builds a more descriptive model by adding more variables (accounting for obvious factors that contribute to differences in observed frequencies) and increasing sample size to increase statistical power (reduce false negatives). This is not applicable in this case because there is only one president and the tests have low power. The test is bogus because you can't 'sample' and determine the probability of X==president effectively.
I think you're mistaken about what frequentist p-values mean. The definition you're using is the one we would like to have - the probability that X is president given the data. But what a p-value actually gives you is the chance of seeing the given data given that X is president. And if X is the president, we have p=0, since there are no false negatives. So a p-value is useless in this example, although better frequentist measures can handle it.
Less wrong, but still off. The p-value is the probability of committing a type-1 error (false positive: we declare X to be president when he really isn't) given a sample and a null hypothesis. P-values are not the 'chances' of seeing the given data; a p-value is a test statistic that describes how reliable your probability estimates are for a given sample and null.
Actually I think we're still a little off.
The p-value is the probability of seeing a test statistic at least as extreme as the observed sample statistic, under the assumption that the null hypothesis is true. This can be restated in terms of tests and errors.
The p-value is the probability of committing a type-1 error for a statistical test, where the threshold of that test is chosen to be equal to the sample statistic. In more intuitive terms, it's the probability of a type-1 error under the null hypothesis for the most extreme test that the sample data is able to pass.
In your explanation, where you said "given a sample" you should say "given a sample size", to make it clear that the type-1 error probability is not conditional on the sample data. If you condition on the observed data, then the probability of passing a statistical test is going to be 1 or 0 depending on whether or not the data passes the test. It is the test itself, not the error probability, that depends on the sample data. Which is also something you should have specified.
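To make the "at least as extreme" definition concrete, here is a small sketch using a coin-flip example that is not from this thread: the null hypothesis is a fair coin, the test statistic is the number of heads, and the one-sided p-value is the probability of seeing that many heads or more under the null. The function name is made up.

```python
from math import comb


def one_sided_p_value(heads, flips, p_null=0.5):
    """P(X >= heads) for X ~ Binomial(flips, p_null), i.e. under the null."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# 8 heads in 10 flips of a supposedly fair coin.
print(one_sided_p_value(8, 10))  # 56/1024 = 0.0546875
```

Note that nothing in the function conditions on the coin actually being biased: the result depends only on the null hypothesis and on where the observed statistic falls.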
This is the clearest answer I've read so far. The p-value is the minimum probability of ending up with a test statistic that lies in the rejection region (for a given null hypothesis and critical value).
Please re-read onetwofiveten's explanation. All I've done is paraphrase it. Alpha is just an arbitrary cutoff p-value corresponding to a threshold t-value for a given null hypothesis.
Furthermore, practicing biomedical researchers rarely, if ever, take a single study as "proof" of a hypothesis, no matter what the P-value is.
Replication and plausibility in the context of other studies are taken into account, however these additional parameters are difficult to reduce to numbers.
A recent blog post (link below) describes the actual practice of (good) biomedical research fairly well.
A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is.
Say you are looking for genes that might influence the rate of occurrence of a particular disease. There might be genes that influence this rate, or there might not; it could be entirely environmental, or it could be entirely genetic, or something in between. In any case, you do genome-wide studies, and find that certain gene variants occur more often in your diseased population than in your control population. You apply frequentist statistics, using some corrections for multiple hypothesis testing, and get some kind of "significant" result. This gets published in Nature (you lucky thing!).
Are your conclusions correct? Do the genes you identified really modify the course of the disease you studied? Bayesian statistics won't give you the answer.
The only way to get the answer is to do experimental science, i.e. deliberately modify the gene(s) in question and show that your modifications change the occurrence or course of the disease.
Unfortunately, that is not always feasible, for either technical or ethical reasons, so we have to fall back on the poor cousin of experimental science that is population statistics.
Both frequentist and Bayesian decision making ultimately rest on arbitrary prior assumptions. In Bayesian statistics it is explicit. In frequentist stats it is less acknowledged, but still there. Where do your cutoff values for alpha and beta (p-value and power) come from? Same place the Bayesians get their priors.
If you are hypothesizing that genetics influences the incidence of a disease, without any prior knowledge as to whether there is any influence of genetics on a disease, how can you have a prior probability that there is a genetic influence on the disease in question?
- Most trivially, your prior distribution can assign equal probabilities to different outcomes, if you have no reason to do otherwise. Beta(1, 1) for modeling a coin's bias, for example, if you have no prior information about its bias.
- There are more advanced tools in Bayesian analysis such as the Jeffreys prior (one of the so-called uninformative priors, look it up).
- As was mentioned in other responses the same "big problems" exist in every other statistical and mathematical modeling approach, namely that you have to make assumptions and your results are going to be crap if your assumptions are crap.
- Generally, Bayesian stats got a late start due to high computational resource costs, not some theoretical limitations. The issue with priors that gets repeated by philosophers and some statisticians does not stop the huge, monumental progress Bayesian statistics has had in a ton of applied fields, from computer science / machine learning all the way to economics and political science.
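As an illustration of the uniform Beta(1, 1) prior mentioned in the list above, here is a minimal sketch of the standard Beta-Binomial update for a coin's bias. The observed counts (7 heads, 3 tails) are made up for the example.

```python
def beta_posterior(heads, tails, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) parameters after observing coin flips.

    With a Beta prior and binomial data, the update is simply to add the
    observed heads to a and the observed tails to b.
    """
    return prior_a + heads, prior_b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


a, b = beta_posterior(heads=7, tails=3)
print(a, b, beta_mean(a, b))  # Beta(8, 4), mean 8/12
```

The flat prior acts like one imaginary head and one imaginary tail, so its influence shrinks quickly as real observations accumulate.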
You said "A big problem with Bayesian statistics (as I see it) it that it not always possible to have any sense of what the prior probability is." This is a common complaint about Bayesian stats -- the choice of priors seems arbitrary. That's what I was responding to.
So, based on this data, having the gene increases your probability of having the disease by some 11% (0.109/0.0979) or 1.1 percentage points (0.109 - 0.0979).
Or you might want to calculate some other figures from these. If you just want the ratio, then P(sick) does not have a bearing on it. You could perform some sensitivity analysis etc...
I am not a statistician either, apart from a few courses, but your argument starts with:
P(gene|healthy) = 0.010 and P(no_gene|healthy) = 0.99
P(gene|sick) = 0.011 and P(no_gene|sick) = 0.989
My point is that in many cases there is no justification for placing any number whatsoever on a prior hypothesis. You can't simply say that the prior probability of a particular gene being involved in a disease is 1%, or 0.00001% or 10%, or whatever.
Edit: I'm not saying that Bayesian statistics is without uses, it is very useful in epidemiology, for example. However it is not appropriate for determining molecular and genetic mechanisms.
P(gene|healthy), P(gene) and P(healthy) are something we can often measure.
P(healthy|gene) is something we can calculate with the Bayes formula from the above values.
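A sketch of that calculation, using the P(gene|sick) = 0.011 and P(gene|healthy) = 0.010 figures quoted earlier. The base rate P(sick) = 0.1 is an assumption made only for this example, and the function name is made up.

```python
def p_condition_given_gene(p_gene_given_sick, p_gene_given_healthy, p_sick):
    """Return (P(sick | gene), P(healthy | gene)) via Bayes' formula."""
    p_healthy = 1.0 - p_sick
    # Law of total probability gives the overall carrier rate P(gene).
    p_gene = p_gene_given_sick * p_sick + p_gene_given_healthy * p_healthy
    p_sick_given_gene = p_gene_given_sick * p_sick / p_gene
    return p_sick_given_gene, 1.0 - p_sick_given_gene


# P(sick) = 0.1 is an assumed base rate, not a measured value.
print(p_condition_given_gene(0.011, 0.010, p_sick=0.1))
```

Changing the assumed P(sick) changes both outputs, which is the point being made: the measured conditional probabilities alone do not determine P(healthy|gene).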
More generic version:
We want to know P(model|data). That's what we always want. What is the probability of some model, based on this data that we measured.
But we only have P(model), P(data) and P(data|model). So we use the Bayes formula to get the answer to the interesting conditional probability. It needs all three inputs. We must estimate if we don't have some.
Just presenting P(data|model) and not constraining P(model) or P(data) in any way means that we can't say anything about P(model|data).
It's like if we start with x + a + b = 5.
If we don't know or estimate a and b, it's impossible to say anything about x. Bayes' formula is like saying then, solve it like this: x = 5 - a - b.
So, if you say there is no justification for placing any constraint on a or b, then that's just saying we clearly do not have enough data to say anything about x either. There is no way out of it.
> it is not always possible to have any sense of what the prior probability is
And in those cases, frequentist statistics is no better off, because you have no sense of what the sample space is.
However, there are cases where there is no well-defined sample space, but you can still assign reasonable priors; so Bayesian statistics covers a range of cases that is a superset of the range of cases that frequentist statistics covers. E. T. Jaynes goes into this in some detail in his book Probability Theory: The Logic of Science.
As I understand it, using Bayesian statistics, you can report the posterior probability given various reasonable priors. It seems useful to know how sensitive the conclusion is to the choice of prior. (With a strong result, it should make relatively little difference.)
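A rough sketch of such a sensitivity check, reusing the President test from the top of the thread. The list of priors and the second, much stronger test (false positive rate 1e-12) are hypothetical, added only to show the contrast between weak and strong evidence.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(hypothesis | data) for a binary hypothesis."""
    numerator = prior * likelihood_if_true
    return numerator / (numerator + (1.0 - prior) * likelihood_if_false)


priors = (1e-8, 1e-6, 1e-4, 1e-2, 0.5)

# A positive result has likelihood 1.0 if the person is the President,
# and equals the false positive rate otherwise.
for fp_rate in (1e-5, 1e-12):
    print(f"false positive rate {fp_rate:g}")
    for prior in priors:
        print(f"  prior={prior:g}  posterior={posterior(prior, 1.0, fp_rate):.4f}")
```

With the original test the posterior swings from well under 1% to nearly 100% across these priors, so the conclusion is prior-dependent. With the hypothetical stronger test, every prior in the list ends up above 99%.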
>A scientist comes up with a test to determine if someone is the President.
It's a poor analogy, because it's not clear to people why such a test is "natural." It's not clear how your specific test could be broken in the peculiar way that it would have 99.999% chance to confirm that someone who isn't the president is indeed not the president.
And people would get caught up with what you mean by "If they are" and "If they are not," since it's not clear how you would know the error of your test without a real president around to identify.
False positives or false negatives are not at all intuitive to people who have never done experimental design. Most people would get stuck at percentages anyhow.
So just to get a handle on this stuff, the two problems you have if you do a test and only look at a low p are:
1) Unusual things do occur. If a million people do the same test, it's obvious they'll come up with some wrong values. It's less obvious that a similar number of wrong values will come if a million people do a million different tests with a similar small chance of bogus results.
2) The pattern of results may indeed be unusual, but not necessarily in the fashion you think it is. The data may have a non-random pattern that is not due to your particular hypothesis, yet a "this is not random" result may seem to say that your hypothesis does explain the data.
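The first point can be put in numbers. This is a minimal sketch, assuming every test is independent, every null hypothesis is actually true, and a conventional threshold of 0.05; the function names are made up.

```python
def prob_any_false_positive(n_tests, alpha):
    """Chance that at least one of n independent true-null tests looks significant."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_spurious_results(n_tests, alpha):
    """Average number of tests that look significant by chance alone."""
    return n_tests * alpha


for n in (1, 20, 100):
    print(n, "tests:", round(prob_any_false_positive(n, 0.05), 4))

print("a million tests:", expected_spurious_results(1_000_000, 0.05))
```

The arithmetic does not care whether the million tests are the same test repeated or a million different ones, as long as each has the same small chance of a bogus result.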
This has probably been addressed millions of times, but just to give a response to why this is a misleading comic:
If you were equally mocking of both Bayesian and Frequentist, the Bayesian would arguably come up with basically the same conclusion. In this scenario where the Frequentist ignores previous data based on human history and our understanding of the lifecycle of the sun based on physics, the Bayesian should do the same if we are fair. Then his prior distribution would likely be 50% belief that the sun would explode (Why favor either outcome? Thus the prior distribution is split down the middle between both outcomes.). With the new information given from the detector, his posterior would be 35/36 probability that the sun has exploded.
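The 35/36 figure can be reproduced with exact fractions. In the comic the detector lies when two dice both come up six, so it lies with probability 1/36. The sketch below uses the 50% prior from this comment, plus a second, far more skeptical prior of one in a billion that is made up for contrast.

```python
from fractions import Fraction


def posterior_exploded(prior_exploded, p_lie=Fraction(1, 36)):
    """P(sun exploded | detector says yes)."""
    says_yes_and_exploded = prior_exploded * (1 - p_lie)
    says_yes_and_fine = (1 - prior_exploded) * p_lie
    return says_yes_and_exploded / (says_yes_and_exploded + says_yes_and_fine)


print(posterior_exploded(Fraction(1, 2)))  # 35/36
# Hypothetical skeptical prior: one in a billion.
print(float(posterior_exploded(Fraction(1, 10**9))))
```

With the 50/50 prior the Bayesian lands on the same 35/36 as the detector's reliability, while the skeptical prior keeps the posterior tiny.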
Maybe the Bayesian approach provides a more explicit way of incorporating previous information, and without that, it's cause for some misuse with a Frequentist approach, but that doesn't mean Frequentists need be fundamentally ignorant in their approach.
The comic's good for a laugh. Just don't take it too seriously as a criticism.
That's the most famous comparison between frequentist and bayesian statistics, and it's a shame because the frequentist interpretation depicted in the comic is really a straw man argument. See: http://stats.stackexchange.com/questions/43339/whats-wrong-w...
I'm not sure it's a straw man, there seems to be plenty of examples of people publishing results based on p < 0.05 findings even when it wasn't justified... just like the person mentioned in the Nature article almost did.
In a world where all statisticians were Bayesian, you'd still see results like this published. It would simply read differently: "We saw something that seems pretty unlikely, care to confirm?"
Perhaps, or perhaps the scientist would try to replicate the results themselves first (as the person in the article did).
But even if they published them with the verbiage you mentioned, that would be a better world because reporters would be less likely to turn it into a story with a headline like "Scientists say ...". The prevalence of articles like this erodes trust in science, and fills the public's head with misconceptions.