When making Venn diagrams to look at the overlap of sets, I often wonder how significant a given amount of overlap is. What is the likelihood of seeing a given amount of overlap between two sets simply by chance?

One way to assess this is to use the hypergeometric distribution. The R language has a nice function for calculating the p-value, but the explanation of how to use it involves an urn of black and white balls.

phyper(q, m, n, k, lower.tail=FALSE)

q = the number of white balls drawn from the urn (without replacement)

m = the number of white balls in the urn

n = the number of black balls in the urn

k = the number of balls drawn from the urn (sample size)
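The urn vocabulary maps onto two overlapping sets fairly directly: one set plays the white balls, the rest of the universe plays the black balls, and the other set is the sample drawn. A minimal sketch of that mapping follows. The function name overlap_pvalue and its argument names are hypothetical, chosen for illustration only; phyper is the only call taken from R itself. Since lower.tail=FALSE returns P(X > q), the sketch passes overlap - 1 so that the observed overlap is included in the tail.

```r
# Hypothetical helper (not part of base R): probability of seeing an overlap
# at least as large as the one observed between two sets drawn from the
# same universe of `total` items.
overlap_pvalue <- function(overlap, size_a, size_b, total) {
  stopifnot(overlap <= min(size_a, size_b), max(size_a, size_b) <= total)
  # Set A = white balls, everything else = black balls, set B = the sample.
  # phyper(..., lower.tail = FALSE) is P(X > q), so q = overlap - 1
  # yields P(X >= overlap).
  phyper(overlap - 1, size_a, total - size_a, size_b, lower.tail = FALSE)
}

# Tiny check by hand: universe of 10, both sets of size 5, complete overlap.
# Only one of the choose(10, 5) = 252 possible samples gives that overlap.
overlap_pvalue(5, 5, 5, 10)
```

Which set is treated as the white balls and which as the sample does not matter; swapping size_a and size_b gives the same probability.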

Example comparing gene sets

Let's say you want to compare sets of genes identified in two independent experiments. For instance, in experiment one, you identify 1000 genes up-regulated under a given condition. In experiment two, you identify 2872 genes with promoters bound by a transcription factor. Now you want to compare the two experiments to see if the up-regulated genes are also those bound by the transcription factor. A Venn diagram between the experiments indicates that the two sets (1000 up-regulated genes and 2872 TF-bound genes) have an intersection of 448. Is this significant? The total number of genes in the experiment is 14,800.

q = 448

m = 1000

n = 14800 - 1000

k = 2872

phyper(448, 1000, 13800, 2872, lower.tail=FALSE)   # P(overlap > 448); use q = 447 for P(overlap >= 448)
[1] 1.906314e-81
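To put a p-value this small in context, it can help to see how much overlap chance alone would produce. The sketch below uses the numbers from the example; the variable names are assumptions made for illustration. It computes the expected overlap (the mean of the hypergeometric distribution, about 194 genes here), the fold enrichment, and the upper tail in two ways as a consistency check.

```r
# Numbers from the example above; variable names are illustrative only.
total   <- 14800   # all genes in the experiment
up      <- 1000    # up-regulated genes (white balls)
bound   <- 2872    # TF-bound genes (sample drawn from the urn)
overlap <- 448     # genes in both sets

# Overlap expected by chance: mean of the hypergeometric distribution.
expected <- up * bound / total
fold     <- overlap / expected

# Probability of an overlap of 448 or more (q = overlap - 1 because
# lower.tail = FALSE returns P(X > q)).
p_at_least <- phyper(overlap - 1, up, total - up, bound, lower.tail = FALSE)

# The same tail, summed from the individual point probabilities.
p_sum <- sum(dhyper(overlap:min(up, bound), up, total - up, bound))

print(c(expected = expected, fold = fold))
print(c(p_at_least = p_at_least, p_sum = p_sum))
```

An observed overlap more than twice the size expected by chance is consistent with the very small p-value reported above.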

Making Venn Diagrams

* I wrote a utility for making Venn diagrams: venn diagrams

* But someone else wrote a better one recently: venny

VennSignificance (last edited 2011-08-29 17:54:13 by ChrisSeidel)