[Archive] Free GWAS project

Piffer's Freewas
I am reading your paper and liking it so far. These are people we can contact for a comment and hope they reply:

Greg Cochran
Razib Khan
Henry Harpending
Meisenberg

Should I do it, or you? I am thinking you, since I am anonymous and they will not respect that as much...


I will do it. People (and academics in particular) are often too closed-minded and too quick to rush to final judgments for me to weigh their opinions that heavily....
The top SNPs for the 3-SNP and 4-SNP vectors, respectively.

endrebak@tang:~/playground/freewas$ awk '{if ($4*$4 > .915) {print $0}}' r2_with_3
rs2099216 A C -0.959649577798
rs2099215 A G -0.960493788886
rs1525884 T C -0.958820050904
rs6855458 A G 0.959045445452
rs1986590 G A -0.961737000187
rs4385038 G A -0.963011558514
rs7655799 G A -0.959543820653
rs1320005 C T -0.958372489173
rs2798247 G A -0.9584732923
rs8047203 A G -0.965597488944
rs7187075 T C -0.966140066399
rs7187300 T C -0.966570768548
rs11648052 G A -0.961615135283
rs1178560 T C -0.960871489383

endrebak@tang:~/playground/freewas$ awk '{if ($4*$4 > .9392) {print $0}}' r2_with_4
rs12620887 C A -0.971524150849
rs34025731 G T 0.973760720618
rs35585936 A G 0.974903312488
rs7614673 A G -0.969233179219
rs2672578 A G -0.96918989041
rs2568865 A G -0.969267860537
rs184503 C T -0.97423454355
rs2595924 C T -0.973239739549
rs12926178 T C 0.972201477711
rs596551 G A -0.973994794732
rs6507250 T C 0.969397056245


Thanks! We'll need these for the Freewas proper.
How about posting a comment at drjamesthompson? He has a relevant post right now.... http://drjamesthompson.blogspot.no/2014/09/gene-hunters-and-gene-saboteurs.html


Done


It'd be interesting to run Fisher's exact test on this set for Watson and Venter (W+V)
I attach Emily's genome here. Her IQ (certified) is 150. It would be good to analyze it too, as you did W and V, when you have time.
Download it from here:
https://drive.google.com/file/d/0B7hcznd4DKKQTko2V1RFQkJCYms/edit?usp=sharing
It's possible to download a list of Rietveld's SNPs with p-values. It'd be good to download their frequencies from 1KG Phase 3 and see whether their p-values correlate with their correlations with the PC (a meta-correlation, i.e. the method of correlated vectors).

http://ssgac.org/Data.php
Admin
1)

Expanding on the above.

Correlate the p-values with the population-frequency correlations, using the first factor extracted from the 4 confirmed SNPs for g. A positive correlation would indicate that the selection pressure on g genes has been general rather than specific.

-

2)

Do races differ more on the genes most highly associated with g? Correlate the p-values with the SD of frequencies in the cross-population data. If positive, this would indicate recent selection on the g genes beyond mere drift.

I think.
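Points (1) and (2) could be sketched roughly like this. Everything below — the p-values, the PC correlations, the frequency SDs — is an invented placeholder, not actual Rietveld or 1KG data; with real data these columns would come from the Rietveld supplement and 1000 Genomes frequencies.

```python
# Toy sketch of items (1) and (2): correlate each SNP's GWAS signal
# (-log10 p, larger = stronger association) with (a) the correlation
# between its population frequencies and the 4-SNP g factor, and
# (b) the cross-population SD of its frequencies.
from math import log10, sqrt

def pearson(xs, ys):
    # plain Pearson correlation, no external dependencies
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# invented SNPs: (GWAS p-value, r with the g-factor PC, SD of pop. frequencies)
snps = [(1e-8, 0.91, 0.18), (1e-6, 0.85, 0.15), (1e-4, 0.40, 0.10),
        (1e-3, 0.10, 0.07), (0.02, -0.05, 0.05)]
neglogp = [-log10(p) for p, _, _ in snps]
pc_r = [r for _, r, _ in snps]
freq_sd = [s for _, _, s in snps]

print(pearson(neglogp, pc_r))    # item 1: positive -> selection was general
print(pearson(neglogp, freq_sd)) # item 2: positive -> selection beyond drift
```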

-

3)

Their polygenic-score method is also applicable to entire populations. The predicted educational-attainment levels should fit with national IQs. This method cannot be attacked for sampling error in the identified SNPs.
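A population-level polygenic score, as in point (3), is just a frequency-weighted sum of effect sizes. A minimal sketch with invented frequencies and betas (not real Rietveld effect sizes or 1KG frequencies):

```python
# Hypothetical effect sizes (betas) for a few education-associated alleles
betas = [0.02, 0.015, 0.01, 0.025]

# Invented allele frequencies per population (same SNP order as betas)
freqs = {
    "pop_A": [0.60, 0.55, 0.70, 0.40],
    "pop_B": [0.45, 0.50, 0.60, 0.35],
    "pop_C": [0.30, 0.40, 0.50, 0.25],
}

def polygenic_score(freq_row, betas):
    # expected score of a random diploid individual: sum of 2 * p * beta
    return sum(2 * p * b for p, b in zip(freq_row, betas))

scores = {pop: polygenic_score(f, betas) for pop, f in freqs.items()}
print(scores)  # these predicted levels would then be correlated with national IQs
```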
I wasn't able to finish completely this week, but I am 95% there; all that remains is computing the Fisher's exact (FE) tests for the "bad" alleles (the original program was so poorly written that it wasn't possible to change or extend it without breaking everything). I will be able to start on our to-do list sometime next week, unless something extraordinary happens.

Any replies or feedback from those researchers?


No, I've not contacted them yet. We may try and contact BGI people when we've completed our to-do list.
I don't want external reviewers to interfere with our project whilst it's still developing.
I finished redoing all the stuff that was rushed.

Now that we are switching to bins instead of SNPs, we need to decide how to judge whether a bin is beneficial or detrimental. Should I pick the highest-R2 SNP in each bin and call the bin beneficial if that allele is beneficial, and vice versa? I don't think this will work well: imagine two bins, one from 0-300K and one from 300K-600K. If there is a SNP with very high R2 just left of the 300K border, there is probably going to be another SNP with high R2 just right of it, so this method is heavily influenced by linkage.

Another method would be to look at all SNPs over a certain R2 level within each bin and then set the bin to beneficial if most of the SNPs within it are beneficial, and vice versa. I think this is what you suggested here:

...[D]ivide the genomes in chunks of 500kb and treat each one as a single SNP to control for linkage. This will greatly reduce the significance of the good alleles but will also make the bad alleles (p<0.005) non significant so it's gonna weaken and strengthen our results and in the end it shouldn't make them weaker.


This might not work too well either, since you would expect there to be mostly beneficial SNPs within each bin in the reference population too, just fewer than for the brainiacs.

How should I solve this?
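To be concrete, this is what I mean by the majority-vote version (positions and r values below are invented; I'm assuming SNPs arrive as (position, r) pairs with the sign of r marking the allele beneficial or detrimental):

```python
BIN_SIZE = 500_000
R2_CUTOFF = 0.81  # only count SNPs whose r^2 exceeds this

# invented (position, r) pairs on one chromosome
snps = [(120_000, 0.95), (310_000, -0.92), (320_000, -0.93), (610_000, 0.91)]

# group high-r^2 SNPs into fixed-width bins
bins = {}
for pos, r in snps:
    if r * r <= R2_CUTOFF:
        continue
    bins.setdefault(pos // BIN_SIZE, []).append(r)

# a bin is "beneficial" when most of its high-r^2 SNPs are beneficial
calls = {b: ("beneficial" if sum(1 for r in rs if r > 0) > len(rs) / 2
             else "detrimental")
         for b, rs in bins.items()}
print(calls)
```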
I have this strange thing where I can't stay away from this project. I'm using lots of time I could have spent on my PhD, so I'm hoping it'll lead to a publication (although there are no guarantees in life).

I'm thinking the current results look _much_ better. I have more faith in this code, and it seems that Venter and Watson actually have many fewer of the "beneficial" alleles with |r| < 0.005 (wasn't it the opposite before?).

And Venter even scores higher than Watson both in real IQ and our odds-ratio!

Venter low R2

80700 81166
try({print(fem)})
[,1] [,2]
[1,] 80700 61318
[2,] 81166 61072

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.1978
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.975642 1.005111
sample estimates:
odds ratio
0.9902685

Watson low R2

80955 81166
try({print(fem)})
[,1] [,2]
[1,] 80955 61069
[2,] 81166 61072

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.7388
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9827136 1.0123879
sample estimates:
odds ratio
0.9974493


Venter high R2

7253 6783
try({print(fem)})
[,1] [,2]
[1,] 7253 1731
[2,] 6783 2200

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.264875 1.460214
sample estimates:
odds ratio
1.358971


Watson high R2

try({print(fem)})
[,1] [,2]
[1,] 7174 1810
[2,] 6783 2200

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 2.723e-12
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.197322 1.380392
sample estimates:
odds ratio
1.285508


Edit: Adding the middling scores (between 0.005 and 0.9 in absolute value). I'm thinking it's good that we get good results here, because this list includes some pretty decent r2 SNPs:

Venter first, then Watson:

1649364 1575101
try({print(fem)})
[,1] [,2]
[1,] 1649364 991226
[2,] 1575101 1069563

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.125977 1.133851
sample estimates:
odds ratio
1.129903

1646085 1575101
try({print(fem)})
[,1] [,2]
[1,] 1646085 994353
[2,] 1575101 1069563

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.120164 1.128060
sample estimates:
odds ratio
1.12411
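As a sanity check, the odds ratios above can be reproduced (approximately) from the 2x2 tables with the plain cross-product estimate. Note that R's fisher.test reports the conditional MLE odds ratio, which differs slightly from this estimate:

```python
def sample_odds_ratio(a, b, c, d):
    # cross-product (sample) odds ratio for the 2x2 table [[a, b], [c, d]]
    return (a * d) / (b * c)

# Venter high-R2 table from the fisher.test output above
print(sample_odds_ratio(7253, 1731, 6783, 2200))  # ~1.359, matching fisher.test
```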
Done:
  • Clean up the code so that it is possible to add new analyses and steps.

Todo:
  • Add the ancestral allele analysis to the pipeline instead of doing that as a separate analysis afterwards.
  • Do the Watson/Venter analysis only on ancestral alleles, only on derived alleles, and on both.
  • Correlate the 69 SNPs with the 4-SNP IQ PC
  • Correlate p-values of SNPs with population SD in Rietveld
  • Correlate Edu. Att. levels with national IQ?
  • Implement a bin-method that takes care of the LD-problem.


Appended the SNP list from Rietveld so that it is easily accessible.
These results are perfect: higher r corresponds to higher odds ratios, and neutral r corresponds to odds ratios around 1. And Venter's IQ is likely higher than Watson's (a boring bureaucrat).


I would move "Implement a bin-method that takes care of the LD-problem" to the top of the list.


Can't you just use the average R2 of the beneficial alleles within the 500kb chunk?

This is what I found about averaging Pearson's r: for Pearson correlation coefficients, it is generally appropriate to transform the r values using the Fisher z transformation, then average the z values and convert the average back to an r value. (http://stats.stackexchange.com/questions/8019/averaging-correlation-values)
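In Python the same z-averaging would look like this (math.atanh is the Fisher z transform and tanh its inverse; the r values are placeholders for the beneficial-allele correlations within one 500kb chunk):

```python
from math import atanh, tanh

def average_r(rs):
    zs = [atanh(r) for r in rs]     # r -> z (Fisher transform)
    return tanh(sum(zs) / len(zs))  # average in z space, then back to r

print(average_r([0.92, 0.95, 0.90]))  # slightly above the naive mean
```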
Okidoki, I just needed to know how to do it. LD now on top of list.

Seems like the R psych package can do this for me:

library(psych)            # provides fisherz() and fisherz2r()
cors <- seq(-.9, .9, .1)  # example r values
zs <- fisherz(cors)       # Fisher z transform
rs <- fisherz2r(zs)       # and back to r


If the average r for the bin is positive, the block is beneficial; otherwise it is detrimental.


Well, it depends on the threshold we choose, right? If we set the threshold for "beneficial" at 0.9, then the average r will have to be >0.9 in order to be called "beneficial".

Also, I am a bit confused: in the previous calculations you used 0.8 as the threshold, but in the last ones it seems like you've used 0.9? Is this right?
Changed back to 0.9.

Okidoki, fine. I do not see any blocks having an average over 0.8 or 0.9, but we will see.