
A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is.

Say you are looking for genes that might influence the rate of occurrence of a particular disease. There might be genes that influence this rate, or there might not: the cause could be entirely environmental, entirely genetic, or something in between. In any case, you do genome-wide studies and find that certain gene variants occur more often in your diseased population than in your control population. You apply frequentist statistics, using some corrections for multiple hypothesis testing, and get some kind of "significant" result. This gets published in Nature (you lucky thing!).

Are your conclusions correct? Do the genes you identified really modify the course of the disease you studied? Bayesian statistics won't give you the answer.

The only way to get the answer is to do experimental science, i.e. deliberately modify the gene(s) in question and show that your modifications change the occurrence or course of the disease.

Unfortunately, that is not always feasible, for either technical or ethical reasons, so we have to fall back on the poor cousin of experimental science that is population statistics.



Both frequentist and Bayesian decision making ultimately rest on arbitrary prior assumptions. In Bayesian statistics it is explicit. In frequentist stats it is less acknowledged, but still there. Where do your cutoff values for alpha and beta (the significance level and the type II error rate, i.e. one minus power) come from? Same place the Bayesians get their priors.


My point was that in most cases of medical research statistics alone will not get you the answer to the question of cause-and-effect.


Ok, but your first sentence said something very different.


I don't see how?

If you are hypothesizing that genetics influences the incidence of a disease, without any prior knowledge as to whether there is any influence of genetics on a disease, how can you have a prior probability that there is a genetic influence on the disease in question?


- Most trivially, your prior distribution can assign equal probabilities to different outcomes, if you have no reason to do otherwise. Beta(1, 1) for modeling a coin's bias, for example, if you have no prior information about its bias.

- There are more advanced tools in Bayesian analysis, such as the Jeffreys prior (one of the so-called uninformative priors, look it up).

- As was mentioned in other responses the same "big problems" exist in every other statistical and mathematical modeling approach, namely that you have to make assumptions and your results are going to be crap if your assumptions are crap.

- Generally, Bayesian stats got a late start due to high computational resource costs, not some theoretical limitations. The issue with priors that gets repeated by philosophers and some statisticians does not stop the huge, monumental progress Bayesian statistics has had in a ton of applied fields, from computer science / machine learning all the way to economics and political science.

Edit: language
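
To make the Beta(1, 1) example concrete, here is a minimal sketch of the conjugate update for a coin's bias. The flip counts are made up for illustration, and the helper names are hypothetical; the Jeffreys prior for a coin, Beta(0.5, 0.5), is included for comparison.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: a Beta(alpha, beta) prior plus observed coin flips
    gives a Beta(alpha + heads, beta + tails) posterior."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Illustration data (an assumption, not from the thread): 7 heads, 3 tails.
heads, tails = 7, 3

# Flat prior Beta(1, 1): every bias in [0, 1] is equally likely a priori.
flat_post = beta_update(1.0, 1.0, heads, tails)

# Jeffreys prior for a Bernoulli parameter is Beta(0.5, 0.5).
jeffreys_post = beta_update(0.5, 0.5, heads, tails)

flat_mean = beta_mean(*flat_post)
jeffreys_mean = beta_mean(*jeffreys_post)

print(f"flat prior     -> Beta{flat_post}, posterior mean {flat_mean:.3f}")
print(f"Jeffreys prior -> Beta{jeffreys_post}, posterior mean {jeffreys_mean:.3f}")
```

With only ten flips the two priors already give similar posterior means, and the gap shrinks as more data comes in.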


You said "A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is." This is a common complaint about Bayesian stats -- the choice of priors seems arbitrary. That's what I was responding to.


You could look at the proportion of similar diseases that have had genetic influences.


My loose understanding is that you can get at causality without intervention, although I'm not very far into Pearl's Causality yet...


> A big problem with Bayesian statistics (as I see it) is that it is not always possible to have any sense of what the prior probability is.

Reminds me of this quote: (These values are known as priors, which is ironic because Bayesians inevitably pull them out of their posteriors.)

http://plover.net/~bonds/cultofbayes.html


If you learn about Bayes from crazy people, what is the likelihood that you will think Bayesian reasoning is always used in crazy ways?

Last week I did some Bayesian (!!) work on finding a pdf for the direction of a certain quantity. My priors were... [0,2*pi).


Thanks for that, it was entertaining to read.


I'm not sure I follow. Say you have

  P(gene|healthy) = 0.010 and P(no_gene|healthy) = 0.99
  P(gene|sick) = 0.011 and P(no_gene|sick) = 0.989
Then you do large population studies (testing for sick vs healthy is cheaper than testing for genes). You get

  P(healthy) = 0.90 and P(sick) = 0.10.
From the small sample you get the average of the gene's expression frequency:

  P(gene) = 0.0101 and P(no_gene) = 0.9899.
You are interested in what is the probability of a person being sick when they have the gene

  P(sick|gene) = 
  P(sick,gene) / P(gene) = 
  P(gene|sick) * P(sick) / P(gene) =
  0.011 * 0.10 / 0.0101 = 
  0.109
And conversely, the probability of being sick without the gene:

  P(sick|no_gene) =
  P(no_gene|sick) * P(sick) / P(no_gene) =
  0.989 * 0.10 / 0.9899 =
  0.0999
So, based on this data, having the gene increases your probability of having the disease by some 9% in relative terms (0.109/0.0999), or 0.9 percentage points in absolute terms (0.109-0.0999).

Or you might want to calculate some other figures from these. If you just want the ratio, then P(sick) does not have a bearing on it. You could perform some sensitivity analysis etc...

(I am not a statistician by far)
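
For anyone who wants to check the arithmetic, the calculation above can be sketched in a few lines of Python. The inputs are the illustrative numbers from this comment, not real data, and the variable names are my own.

```python
# Illustrative inputs from the comment above (not real measurements).
p_gene_given_healthy = 0.010
p_gene_given_sick = 0.011
p_sick = 0.10
p_healthy = 1.0 - p_sick

# Law of total probability: overall frequency of the gene.
p_gene = p_gene_given_sick * p_sick + p_gene_given_healthy * p_healthy
p_no_gene = 1.0 - p_gene

# Bayes' formula for both groups.
p_sick_given_gene = p_gene_given_sick * p_sick / p_gene
p_sick_given_no_gene = (1.0 - p_gene_given_sick) * p_sick / p_no_gene

# Two common ways to summarize the effect.
relative_risk = p_sick_given_gene / p_sick_given_no_gene
risk_difference = p_sick_given_gene - p_sick_given_no_gene

print(f"P(gene)           = {p_gene:.4f}")
print(f"P(sick | gene)    = {p_sick_given_gene:.4f}")
print(f"P(sick | no gene) = {p_sick_given_no_gene:.4f}")
print(f"relative risk     = {relative_risk:.3f}")
print(f"risk difference   = {risk_difference:.4f}")
```

Changing `p_sick` and rerunning is a cheap way to do the sensitivity analysis mentioned above.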


I am not a statistician either, apart from a few courses, but your argument starts with:

P(gene|healthy) = 0.010 and P(no_gene|healthy) = 0.99

P(gene|sick) = 0.011 and P(no_gene|sick) = 0.989

My point is that in many cases there is no justification for placing any number whatsoever on a prior hypothesis. You can't simply say that the prior probability of a particular gene being involved in a disease is 1%, or 0.00001% or 10%, or whatever.

Edit: I'm not saying that Bayesian statistics is without uses; it is very useful in epidemiology, for example. However, it is not appropriate for determining molecular and genetic mechanisms.


I think you misunderstand.

P(gene|healthy), P(gene) and P(healthy) are something we can often measure.

P(healthy|gene) is something we can calculate with the Bayes formula from the above values.

More generic version:

We want to know P(model|data). That's what we always want. What is the probability of some model, based on this data that we measured.

But we only have P(model), P(data) and P(data|model). So we use the Bayes formula to get the answer to the interesting conditional probability. It needs all three inputs. We must estimate if we don't have some.

Just presenting P(data|model) and not constraining P(model) or P(data) in any way means that we can't say anything about P(model|data).

It's like if we start with x + a + b = 5.

If we don't know or estimate a and b, it's impossible to say anything about x. Bayes' formula is like saying then, solve it like this: x = 5 - a - b.

So, if you say there is no justification for placing any constraint on a or b, then you are just saying we clearly do not have enough data to say anything about x either. There is no way out of it.


> it is not always possible to have any sense of what the prior probability is

And in those cases, frequentist statistics is no better off, because you have no sense of what the sample space is.

However, there are cases where there is no well-defined sample space, but you can still assign reasonable priors; so Bayesian statistics covers a range of cases that is a superset of the range of cases that frequentist statistics covers. E. T. Jaynes goes into this in some detail in his book Probability Theory: The Logic of Science.


As I understand it, using Bayesian statistics, you can report the posterior probability given various reasonable priors. It seems useful to know how sensitive the conclusion is to the choice of prior. (With a strong result, it should make relatively little difference.)
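
A rough sketch of that kind of sensitivity check, assuming just two competing hypotheses so that the data enter only through a likelihood ratio P(data|H1) / P(data|H0). The priors and likelihood ratios below are made-up illustration values, and `posterior` is a hypothetical helper.

```python
def posterior(prior, likelihood_ratio):
    """P(H1 | data) for two hypotheses H1 and H0, given the prior P(H1)
    and the likelihood ratio P(data | H1) / P(data | H0)."""
    weighted = prior * likelihood_ratio
    return weighted / (weighted + (1.0 - prior))


# A range of "reasonable" priors, from skeptical to indifferent (assumed values).
priors = [0.01, 0.1, 0.5]

# Weak versus strong evidence (assumed values).
for lr in (3.0, 1000.0):
    print(f"likelihood ratio = {lr:g}")
    for p in priors:
        print(f"  prior {p:>4} -> posterior {posterior(p, lr):.3f}")
```

With the weak evidence the posterior swings widely across the priors, while with the strong evidence every prior in the list lands above 0.9.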



