Piffer's Freewas
I am starting this thread to have one place for our discussions and to keep my work-email free from non-work related things.

My current todo list is:

  • Clean up the code so that it is possible to add new analyses and steps.
  • Implement a bin-method that takes care of the LD-problem.
  • Add the ancestral allele analysis to the pipeline instead of doing that as a separate analysis afterwards.
  • Make sure the ancestral analysis is correct...
  • Do the Watson/Venter analysis only on ancestral alleles.


Things that have come to me during my time off. These might be good ideas, it is up to you to decide:

  • Now that we are switching to a 300 kb "bin"-based method, using the reference panels in ALFRED seems like it could work better; even if we get many fewer SNPs there will still be many SNPs within each bin I suspect.
  • What about using all SNPs with passable R2s, but weighting them somehow, so that a SNP with correlation 0.45 is used, but not given as much weight as one with an R2 of 0.9? I imagine this might give better signal; as we saw, the results were good for lower R2 SNPs too.
  • Related to the last point; perhaps we should just count all plus and minus alleles in each 300kb-bin instead of looking much at R2?
  • How about crowdfunding a programmer (not me; I am lacking time and energy, not money)*? There are plenty of blogs that might link to such a charity.


*Furthermore, your offer to pay was kind, but it would be a serious breach of ethics for me to use the school server to carry out analyses and then receive payment for it.
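The R2-weighting idea in the list above could look roughly like the Python sketch below. The `Snp` record, the `min_r2` cutoff of 0.4 and the choice of using R2 itself as the weight are all assumptions made for illustration; nothing in the thread fixes them.

```python
from collections import namedtuple

# Hypothetical record: r2 is the SNP's squared correlation with the PC;
# beneficial/detrimental are allele copies seen in one genome or population.
Snp = namedtuple("Snp", ["snp_id", "r2", "beneficial", "detrimental"])


def weighted_allele_counts(snps, min_r2=0.4):
    """Sum allele counts, weighting each SNP by its r2 (assumed weighting scheme)."""
    plus = 0.0
    minus = 0.0
    for snp in snps:
        if snp.r2 < min_r2:
            continue  # below the "passable" cutoff, itself an assumption
        plus += snp.r2 * snp.beneficial
        minus += snp.r2 * snp.detrimental
    return plus, minus


# Made-up example: the r2 = 0.45 SNP contributes half as much as the r2 = 0.9 one.
example = [
    Snp("rsA", 0.90, 2, 0),
    Snp("rsB", 0.45, 1, 1),
    Snp("rsC", 0.10, 0, 2),  # dropped: r2 below min_r2
]
plus, minus = weighted_allele_counts(example)
print(plus, minus)
```

One caveat: weighted counts are no longer integers, so they could not go straight into Fisher's exact test without rounding or a different test.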
I am a bit uncertain about this e-mail:
A criticism that is bound to come up sooner or later is that our positive results are simply due to our r>0.9 sample consisting of more derived than ancestral alleles.
That is, our results could be explained by the fact that intelligence genes tend to be derived and that our PC is biased towards derived alleles because derived alleles are distributed across populations in a similar way.
If that were the case, we'd have only proved that intelligence genes tend to be derived (a noteworthy discovery, probably not published before, but pretty obvious and not surprising).
We need to make sure this is not the case.
In order to do this we need to run the W and V comparisons using only the ancestral alleles.
Take the set of SNPs with r>0.9 (or 0.8 if the sample isn't big enough) and fill out the 2x2 contingency table with beneficial and detrimental alleles for Watson or Venter and the CEU population.
Discard the derived alleles and keep only the ancestral alleles from Watson or Venter and CEU's genomes.
Run Fisher's exact test using only ancestral alleles.
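A minimal Python sketch of the table-building step described above, under one reading of it: for each SNP only copies of the ancestral allele are counted, and they go in the "beneficial" column when the ancestral allele is the beneficial one, otherwise in the "detrimental" column. The input dictionaries and their layout are hypothetical, and the counts are made up.

```python
def ancestral_only_table(snps, genome_counts, ceu_counts):
    """Return [[genome_ben, genome_det], [ceu_ben, ceu_det]] using ancestral alleles only.

    snps: dict snp_id -> (ancestral_allele, beneficial_allele)
    genome_counts, ceu_counts: dict snp_id -> dict allele -> copies observed
    """
    table = [[0, 0], [0, 0]]
    for snp_id, (ancestral, beneficial) in snps.items():
        # Column 0 = ancestral allele is beneficial, column 1 = detrimental.
        col = 0 if ancestral == beneficial else 1
        for row, counts in enumerate((genome_counts, ceu_counts)):
            # Derived-allele copies are simply never added.
            table[row][col] += counts.get(snp_id, {}).get(ancestral, 0)
    return table


# Made-up illustration with three SNPs.
snps = {"rs1": ("A", "A"), "rs2": ("G", "T"), "rs3": ("C", "C")}
genome = {"rs1": {"A": 2}, "rs2": {"G": 1, "T": 1}, "rs3": {"T": 2}}
ceu = {
    "rs1": {"A": 120, "G": 78},
    "rs2": {"G": 90, "T": 108},
    "rs3": {"C": 150, "T": 48},
}
table = ancestral_only_table(snps, genome, ceu)
print(table)
```

The resulting table is what would then be passed to Fisher's exact test.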


What result do we hope to see, and why?
If these analyses turn out to be correct for different traits, should we hint that a new law of BHG might be around the corner, something like an analogue of the first law of BHG, namely that all differences between populations are partly hereditary? Infamy for both of us (should probably be another paper)...
My email was probably too pessimistic. It's actually good that we're finding more derived alleles among our set of r>0.8 SNPs. However, it's possible that derived alleles are not distributed equally across populations or races. If by some odd coincidence my PCs closely follow that geographical/ethnic distribution of alleles, it may be that we're finding more derived alleles not because the PC is a signal of selection for IQ, but because by chance it happens to be a signal of a population with more derived alleles.
Since among the alleles with r>0.8 there are many alleles with ancestral status, we could check whether the ancestral alleles have higher frequencies in Watson and Venter. This would test the hypothesis that our sample of alleles predicts intelligence not simply because the PC picks more derived than ancestral alleles. The signal should also be present among the ancestral alleles (unless, of course, all intelligence alleles are derived, in which case our concerns would be unjustified).
1) Wouldn't it be better to finish this work on 1KG first? Otherwise we won't know how much our extremely low p values are affected by linkage.
2) Using more SNPs (lowering the correlation threshold) can be done, but it's not necessary now because we've got enough SNPs. However, if after controlling for linkage we realize that our sample is not large enough, we can do it.
3)"Related to the last point; perhaps we should just count all plus and minus alleles in each 300kb-bin instead of looking much at R2?" I am not sure I understand this.
What do you mean by plus or minus? Is it the sign of the correlation with the PC?
I suppose you mean that instead of counting all alleles like we've done so far, for each 300kb region we use the difference between beneficial and detrimental alleles. So if in a 300kb region there are 20 beneficial and 10 detrimental alleles, we count it as 10 beneficial (20-10). Vice versa, if there are 10 beneficial and 20 detrimental alleles, we count it as 10 detrimental (10-20), etc.
4) Crowdfunding a programmer is a good idea. But I do not really know how to attract the funds.
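The net-count reading in point 3 can be sketched in Python as follows. The tuple layout, the function names and the use of fixed, non-overlapping 300 kb windows are assumptions for illustration; the example positions and counts are made up so that the two bins reproduce the 20/10 and 10/20 cases from the text.

```python
from collections import defaultdict

BIN_SIZE = 300_000  # 300 kb, as discussed in the thread


def net_bin_counts(snps, bin_size=BIN_SIZE):
    """snps: iterable of (chromosome, position, beneficial_count, detrimental_count).

    Returns dict (chromosome, bin_index) -> beneficial minus detrimental.
    """
    net = defaultdict(int)
    for chrom, pos, ben, det in snps:
        net[(chrom, pos // bin_size)] += ben - det
    return dict(net)


def summarize(net):
    """Collapse per-bin net counts into total (beneficial, detrimental)."""
    beneficial = sum(v for v in net.values() if v > 0)
    detrimental = sum(-v for v in net.values() if v < 0)
    return beneficial, detrimental


example = [
    ("1", 10_000, 12, 4),
    ("1", 250_000, 8, 6),   # bin 0 totals 20 beneficial, 10 detrimental
    ("1", 310_000, 3, 9),
    ("1", 590_000, 7, 11),  # bin 1 totals 10 beneficial, 20 detrimental
]
net = net_bin_counts(example)
print(net, summarize(net))
```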
Now I understand.

I think our result that height alleles are ancestral is very odd; humans have gotten taller, height alleles should be derived AFAICS.
Venter results, the three new SNPs, bad r2:

Venter 80870 61422
CEU mean: 81007 61232
CEU 8019740 6062056
try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.3746
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9847731 1.0057928
sample estimates:
odds ratio
0.9952289
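For anyone who wants to re-check the numbers above without R, here is a small Python sketch of a two-sided Fisher's exact test applied to the Venter and CEU rows. It is a stand-in for R's `fisher.test`, not a copy of it: it sums the hypergeometric probabilities that are no larger than the observed one, and it reports the plain sample odds ratio, whereas R reports a conditional maximum-likelihood estimate, so the last digits can differ slightly.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_denom = log_choose(total, col1)

    def log_pmf(x):
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denom

    cutoff = log_pmf(a) + 1e-7  # small tolerance for floating-point ties
    logs = [log_pmf(x) for x in range(lo, hi + 1)]
    p_value = min(1.0, sum(exp(lp) for lp in logs if lp <= cutoff))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Venter and CEU rows as posted above.
odds_ratio, p_value = fisher_two_sided(80870, 61422, 8019740, 6062056)
print(round(odds_ratio, 5), round(p_value, 4))
```

The printed values should land close to the odds ratio and p-value in the R output.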
Humans have gotten taller only in the last century due to a dramatic boost in nutrition and health. Throughout the Neolithic and until the 1970s they were much shorter than in the Paleolithic.
TBH, I do not think that controlling for allele status should alter the results. For the 7 SNPs altogether, we didn't find an overrepresentation of derived alleles among those with r>0.8, yet they were overrepresented in the W+V genomes.
Let's say we analyze a couple more genomes from high IQ people that confirm our results obtained from W and V.
Then we can be pretty confident that our set of r>0.8 alleles contains some genes that increase intelligence. How are we gonna find out which genes? Most of them will probably have no or little effect. We need a sieve that picks out the best alleles. The general idea was that of starting with a set of candidate genes much smaller than the entire genome, so as to reduce multiple comparisons and Bonferroni's curse.
The best way to select the best alleles would probably be correlating the number of copies of each allele (0, 1, 2) of each individual with their IQ score. However, the r>0.8 sample is too big to do this (N=100-300K). Perhaps before we proceed with the actual GWAS, it'd be good to select a smaller sample, say that with r>0.95 or r>0.98. Hopefully Fisher's test will give good odds ratios and then we can proceed.
Just out of curiosity, it'd be good to get a random and representative (n>1000, distributed across all chromosomes) sample of SNPs and pick the derived alleles, then see their average frequencies in each of the 26 1KG populations. If these are highly correlated with our PCs, then we may be picking up the derived signal rather than the intelligence signal. I suppose this should be pretty easy to do.
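The derived-allele check could be sketched in Python like this: average the derived allele frequency per population over the sampled SNPs, then correlate those averages with the PC scores. The data layout and function names are hypothetical, and the four populations and all numbers below are made up purely to show the shape of the computation (the real check would use the 26 1KG populations).

```python
from math import sqrt


def mean_derived_frequency(freqs_by_snp):
    """freqs_by_snp: list of dicts population -> derived allele frequency, one per SNP.

    Returns dict population -> mean derived allele frequency.
    """
    pops = list(freqs_by_snp[0])
    n = len(freqs_by_snp)
    return {p: sum(f[p] for f in freqs_by_snp) / n for p in pops}


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up frequencies for two SNPs in four placeholder populations.
freqs = [
    {"pop1": 0.2, "pop2": 0.4, "pop3": 0.6, "pop4": 0.8},
    {"pop1": 0.4, "pop2": 0.4, "pop3": 0.8, "pop4": 0.6},
]
pc_scores = {"pop1": 0.1, "pop2": 0.2, "pop3": 0.9, "pop4": 1.0}  # made up

means = mean_derived_frequency(freqs)
pops = sorted(means)
r = pearson_r([means[p] for p in pops], [pc_scores[p] for p in pops])
print(means, r)
```

A high correlation here would point towards the PC tracking derived-allele load rather than the trait.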
3)"Related to the last point; perhaps we should just count all plus and minus alleles in each 300kb-bin instead of looking much at R2?" I am not sure I understand this.
What do you mean by plus or minus? Is it the sign of the correlation with the PC?
I suppose you mean that instead of counting all alleles like we've done so far, for each 300kb region we use the difference between beneficial and detrimental alleles. So if in a 300kb region there are 20 beneficial and 10 detrimental alleles, we count it as 10 beneficial (20-10). Vice versa, if there are 10 beneficial and 20 detrimental alleles, we count it as 10 detrimental (10-20), etc.


Yes, or some variation of this, but I do not know whether it is a good idea.
This thread isn't pword protected for me.

Anyways, this is my list of todo-items. Feel free to add stuff or re-prioritize:

  1. Implement a bin-method that takes care of the LD-problem.
  2. Add the ancestral allele analysis to the pipeline instead of doing that as a separate analysis afterwards.
  3. Make sure the ancestral analysis is correct...
  4. Do the Watson/Venter analysis only on ancestral alleles.
  5. Pick random SNPs and see the frequencies of derived alleles in each population (couldn't I just pick all SNPs for which we have ancestral data)?
How are you going to find a programmer you trust? Are you only going to give him/her tasks for which you have already published preprints, or what?
Yes, this is all correct. And yes, you can pick all SNPs for which we have ancestral data.
Also, I'd like to have a set of alleles with r>X, with X chosen so that around 10 SNPs are above the threshold. We cannot run a GWAS on too big a sample of alleles. I didn't do a power calculation, but I guess 10 SNPs would enable us to run it on a small sample of genomes.
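Finding the threshold X that leaves about 10 SNPs amounts to ranking the SNPs by r and reading off the 10th value. A minimal Python sketch, assuming the correlations are available as a dict from SNP id to r (a hypothetical layout) and ignoring ties at the cutoff:

```python
def threshold_for_top_n(correlations, n=10):
    """Return (X, snp_ids) where snp_ids are the n SNPs with the highest r
    and X is the lowest r among them."""
    if not correlations:
        raise ValueError("no SNPs given")
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:n]
    return top[-1][1], [snp_id for snp_id, _ in top]


# Made-up correlations; n=3 keeps the example short.
toy = {"rsA": 0.991, "rsB": 0.97, "rsC": 0.995, "rsD": 0.80, "rsE": 0.999}
x, chosen = threshold_for_top_n(toy, n=3)
print(x, chosen)
```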
1) Another thing to try (after all the other stuff) is to pick the SNPs closest to those in http://iqdb.cbi.pku.edu.cn/ among the SNPs we analysed and look at their R2 scores. If some of them are high it is a strong indication that we have found some good SNPs, but the contrary is not true; if none of the correlations are good, it is very likely due to the fact that the list contains so few SNPs and many of them might also be false positives.

2) Yet another thing is to do functional annotation of all the SNPs with an R2 over e.g. 0.8 and see if they are overrepresented in a certain functional category. Functional annotation is a Fisher's exact test where we check if the SNPs we find are overrepresented in some category (molecular function, physiological function, pathway, etc.). We should hope for the category "brain function" or something similar to be overrepresented.
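The overrepresentation check in point 2 can be written as a hypergeometric upper tail, which is the same quantity a one-sided Fisher's exact test computes. The Python sketch below assumes the selected SNPs are a subset of the background set and that each SNP is annotated as inside or outside the category; the parameter names and the toy numbers are made up. It needs Python 3.8+ for `math.comb` and is only meant for modest set sizes.

```python
from math import comb


def enrichment_p_value(hits_in_category, hits_total,
                       background_in_category, background_total):
    """P(X >= hits_in_category) when hits_total SNPs are drawn at random
    from a background with background_in_category SNPs in the category."""
    upper = min(hits_total, background_in_category)
    outside = background_total - background_in_category
    tail = sum(
        comb(background_in_category, x) * comb(outside, hits_total - x)
        for x in range(hits_in_category, upper + 1)
    )
    return tail / comb(background_total, hits_total)


# Toy numbers: 20 background SNPs, 5 in the category; 4 selected, 3 in the category.
p = enrichment_p_value(3, 4, 5, 20)
print(p)
```

A small p-value for a brain-related category would be the hoped-for outcome; with many categories tested, some multiple-comparison correction would be needed.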

If either of these two methods yields good results, it would be strange not to include them in our final paper, since they use common bioinformatic reasoning, but feel free to disagree.

I am just writing down the ideas here for my own sake; when I have an idea without writing it down, I cannot come up with new ones, I just keep thinking of the same thing.
These are good ideas and I think we should do that.
When you have time, can you find all the SNPs with r>0.99 or something similar that yields a very small set of SNPs? I'll need that for our GWAS.
Do you need it for the preprint? I can do that first thing tomorrow (I work from approx 14 to at most 16 with this, but I do not mind writing on the forums at other times).

And the top SNPs from which analysis? Using all 7, 4 or 3?
4 and 3 separately
I am reading your paper and liking it so far. These are people we can contact for a comment and hope they reply:

Greg Cochran
Razib Khan
Henry Harpending
Meisenberg

Should I or you do it? I am thinking you, since I am anonymous and they will not respect that as much...