Piffer's Freewas
Here are a few example SNPs I have computed ancestral alleles for:

rs115724926 ancestral allele: C a1, a2: T C
rs76558199 ancestral allele: C a1, a2: T C
rs115097218 ancestral allele: A a1, a2: C A
rs73912893 ancestral allele: T a1, a2: C T
rs148989228 ancestral allele: C a1, a2: T C
rs188353846 ancestral allele: C a1, a2: A C
rs183990933 ancestral allele: A a1, a2: G A
rs150703965 ancestral allele: C a1, a2: T C
rs12754304 ancestral allele: G a1, a2: A G


Could you check a few of them to see that my calculations are correct? I'm getting curious results like:

ACB 0.915314274671
ASW 0.91517916277
BEB 0.917049238231
CDX 0.916604144085
CEU 0.916312368517
CHB 0.916396070154
CHS 0.916384565614
CLM 0.916188882557
ESN 0.915038231054
FIN 0.916286699506
GBR 0.916317467703
GIH 0.917061446611
GWD 0.915371030784
IBS 0.916431791717
ITU 0.917123471431
JPT 0.916327029174
KHV 0.916603507552
LWK 0.914944244915
MSL 0.915354857881
MXL 0.916200977742
PEL 0.916411334999
PJL 0.91699048296
PUR 0.916128942459
STU 0.917130458765
TSI 0.91637858723
YRI 0.914866869338


I thought it was known that AFR were more ancestral?


I checked them using the 1KG browser and yes, they are right. Honestly, I thought the frequency of ancestral alleles would be lower, but that is because the other studies I read used completely different polymorphisms; I guess they used polymorphisms with a much higher mutation rate. They had found higher frequencies of derived alleles among non-Africans and explained this in terms of genetic drift. However, they are wrong, because genetic drift predicts that ancestral alleles are higher among non-Africans. In fact, the lower the frequency of a mutation, the higher the probability that it will disappear from a population if the population is smaller (non-African) and subject to drift. So yes, this is what I had expected to find.
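(A toy illustration of the drift argument, not part of the project's pipeline: a minimal Wright-Fisher sketch in R with arbitrary numbers. Over a fixed number of generations a rare neutral allele is lost more often in a small population than in a large one, simply because drift moves frequencies faster when N is small; the eventual loss probability is 1 - p in both cases.)

set.seed(1)

prop_lost <- function(N, p0 = 0.01, generations = 500, reps = 2000) {
  lost <- replicate(reps, {
    p <- p0
    for (g in seq_len(generations)) {
      p <- rbinom(1, 2 * N, p) / (2 * N)  # binomial sampling of 2N gametes
      if (p == 0 || p == 1) break
    }
    p == 0
  })
  mean(lost)  # fraction of runs in which the allele was lost
}

prop_lost(N = 1000)    # small population: the rare allele is usually gone by now
prop_lost(N = 100000)  # large population: loss is much slower over the same horizon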

Well, if these frequencies are real, controlling for them is not gonna weaken our results. On the contrary...
Final result, all SNPs used:

ACB 0.911957905014
ASW 0.911821411242
BEB 0.913758743373
CDX 0.913363069592
CEU 0.913033540557
CHB 0.913130089097
CHS 0.913136142363
CLM 0.912902307909
ESN 0.911663237048
FIN 0.913018271957
GBR 0.913031825069
GIH 0.913740343616
GWD 0.911970657186
IBS 0.913129602339
ITU 0.913810253254
JPT 0.91306370025
KHV 0.913346030274
LWK 0.911599103984
MSL 0.911971566053
MXL 0.912948149276
PEL 0.913162956438
PJL 0.913683004599
PUR 0.912843840412
STU 0.913825134562
TSI 0.913073561496
YRI 0.911495222672
Number of SNPs not matching: 1,391,462 (4%; a1 and a2 were both different from the ancestral allele)
Number of SNPs used for the calculations: 32,331,521
No ancestral allele for 3,902,580 SNPs (12% of the SNPs looked at had no ancestral info)


The ancestral info is only available for the rs SNPs, while many (most?) SNPs in 1KG have no name.

Furthermore, these are only the frequencies for the SNPs where ancestral info existed and was consistent (a1 or a2 equaled the ancestral allele), not the ancestral frequencies over all SNPs. If we divided by the number of all SNPs, the ancestral allele frequencies would be much lower. But it would be impossible to find the right denominator; for many SNPs there probably are ancestral versions, but nobody has bothered to check for them.

Does it matter that we can't get good ancestral ratios for all SNPs? These data are approximately correct for the SNPs we are using (we are not using any non-rs SNPs).
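(For concreteness, a minimal R sketch of how the per-population figures above could be computed; the data frame and column names are hypothetical, not the actual script. One row per SNP for a given population, with columns a1, a2, ancestral, and freq_a1 = frequency of a1 in that population.)

ancestral_freq <- function(snps) {
  # drop SNPs with no ancestral info
  snps <- snps[!is.na(snps$ancestral), ]
  # drop SNPs where neither a1 nor a2 matches the ancestral allele
  snps <- snps[snps$a1 == snps$ancestral | snps$a2 == snps$ancestral, ]
  # frequency of the ancestral allele at each SNP, averaged over SNPs
  anc <- ifelse(snps$a1 == snps$ancestral, snps$freq_a1, 1 - snps$freq_a1)
  mean(anc)
}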

P.s. The files I have been using (SNPAncestralAllele.bcp.gz, Allele.bcp.gz) are from ftp://ftp.ncbi.nih.gov/snp/database/organism_data/human_9606/ and seem to have been updated in May this year.

P.p.s. Perhaps Hsu could check my ancestral allele computation script, if it sounds interesting to him? It is completely independent of the freewas.


The correlation of ancestral frequency with national IQs is 0.57. The correlations with the 4-SNP and 3-SNP PCs are 0.79 and 0.857.

It's weird that the percentage of ancestral alleles is positively correlated to measures of intelligence. Perhaps it just indicates genetic distance from Africa... I will run a couple of regressions and get back to you.
Admin
It should be negative, not positive, if there has been recent selection for higher g in the higher-g countries. That is, unless humans used to be smarter, then got dumber, and then got smarter again.
Ok, what we get is pretty good. I ran a regression with country IQ as the dependent variable and Ancestral, 3-SNP and 4-SNP PCs as predictors.
The effects of the 4-SNP and 3-SNP PCs on IQ are not affected; actually, when entering "Ancestral" in the regression the Beta increases, going up to 1.2 for the 4-SNP PC.
Another really good thing (because it fits with our model) happens: whilst the Beta of Ancestral on IQ is positive when considered alone, when entered in the regression the coefficient becomes negative. It is negatively correlated with IQ after partialling out the effects of our PCs. This makes our case stronger, because we get the expected negative effect of ancestral alleles on IQ after partialling out the IQ PCs, and it doesn't reduce the predictive power of our PCs. SPSS output attached.
It looks like the ancestral issue is a paper tiger... it's not gonna affect our results. It's actually strengthening our case.
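(The regression was run in SPSS; for reference, a minimal R sketch of the same model with hypothetical variable names — iq, anc_freq, pc_3snp, pc_4snp, one value per population. Standardizing everything first gives coefficients comparable to the Betas discussed above.)

d <- data.frame(IQ = iq, Ancestral = anc_freq, PC3 = pc_3snp, PC4 = pc_4snp)
d_std <- as.data.frame(scale(d))          # z-score everything -> standardized betas
fit <- lm(IQ ~ Ancestral + PC3 + PC4, data = d_std)
summary(fit)                              # check the sign of the Ancestral coefficient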


Cochran has this interesting theory:

http://isteve.blogspot.no/2012/07/things-fall-apart-greg-cochrans-new.html

It fits as far as I can see!!!


Read my previous post. It addresses Emil's concern.
Our concern was that the PC might pick up more derived alleles, and that this would have been what was driving up the odds ratios in W and V, but the data now show this is not possible, as the PC is actually positively correlated with the frequency of ancestral alleles. So actually our PC is picking up more ancestral alleles. Controlling for allele status should actually make our odds ratios even bigger.
So instead of running FE with only ancestral alleles, we now have to run it on only derived alleles, because we may be picking up fewer derived alleles due to the negative correlation with the ancestral frequency vector.
So if the IQ alleles tend to be derived, and we control for ancestral/derived status, we'll get stronger results. Vice versa, if the IQ alleles are ancestral, we'll get weaker results. But we've got good reasons to expect that the IQ alleles are derived, so our results should be strengthened by running FE on derived alleles only!
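(A minimal sketch of what "FE on derived alleles only" could look like in R. The layout of the 2x2 table and the column names — status, target_hits, target_misses, ref_hits, ref_misses — are assumptions for illustration, not the actual pipeline.)

derived <- subset(snps, status == "derived")   # keep only SNPs whose IQ allele is derived

fem <- matrix(c(sum(derived$target_hits), sum(derived$target_misses),
                sum(derived$ref_hits),    sum(derived$ref_misses)),
              nrow = 2, byrow = TRUE)

fisher.test(fem, alternative = "two.sided")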
1000 Genomes Phase 5 version, chromosome 12 included.

data/another_dummy_file_Watson_middling
1736539 1660415
try({print(fem)})
[,1] [,2]
[1,] 1736539 1043481
[2,] 1660415 1123867

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.122547 1.130258
sample estimates:
odds ratio
1.126413


data/another_dummy_file_Venter_middling
1739325 1660415
try({print(fem)})
[,1] [,2]
[1,] 1739325 1040867
[2,] 1660415 1123867

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.127218 1.134924
sample estimates:
odds ratio
1.131053

data/another_dummy_file_Watson_low
85082 85368
try({print(fem)})
[,1] [,2]
[1,] 85082 64232
[2,] 85368 64166

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.5544
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9812888 1.0101938
sample estimates:
odds ratio
0.9956255


data/another_dummy_file_Venter_low
84975 85368
try({print(fem)})
[,1] [,2]
[1,] 84975 64335
[2,] 85368 64166

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.3275
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9784773 1.0072862
sample estimates:
odds ratio
0.9927809



data/another_dummy_file_Venter_high
7492 7029
try({print(fem)})
[,1] [,2]
[1,] 7492 1846
[2,] 7029 2307

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 4.724e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.242070 1.428589
sample estimates:
odds ratio
1.332017

data/another_dummy_file_Watson_high
7427 7029
try({print(fem)})
[,1] [,2]
[1,] 7427 1911
[2,] 7029 2307

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 4.1e-12
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.190016 1.367394
sample estimates:
odds ratio
1.275553
I am on holiday so I am on the go, but these are excellent results and I will update our report when I get back home. Can you run FE on Emily's genome? If we cannot reproduce the results on the majority of other genomes, it could mean that Watson and Venter are smart beyond anybody else (and IQ tests would then be a poor measure of real intelligence), or that Piffer's method is not very strong. But hopefully things will work on other genomes as well. Emily's IQ is 140/150 and I uploaded her genome in this thread. It'd be good to see what odds ratios she gets.
As I said before, the pattern of high > middling > low correlations with the PC matching the odds ratios is excellent and points towards a Jensen effect. We should carry out the method of correlated vectors, that is, take smaller ranges of correlations with the PC and compute their odds ratios, then see if the odds ratios are correlated with the PC correlation. If the more strongly correlated sets of alleles have higher odds ratios, we get a positive Jensen effect. I guess N=40 would suffice. So for example, run FE on alleles with PC correlations of 0-0.025, 0.025-0.05, 0.05-0.075, 0.075-0.1, 0.1-0.125, etc., then see if the odds ratios are correlated with the PC correlation's midpoint (for 0-0.025, that is 0.0125). Put this on the to-do list with whatever priority suits you.
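(A minimal R sketch of the correlated-vectors idea described above; the column names — pc_cor, target_hits, target_misses, ref_hits, ref_misses — are hypothetical, and it assumes the PC correlations fall in [0, 0.5] and that most bins are non-empty.)

breaks <- seq(0, 0.5, by = 0.025)
mids   <- head(breaks, -1) + diff(breaks) / 2          # 0.0125, 0.0375, ...
snp_cor$bin <- cut(snp_cor$pc_cor, breaks, include.lowest = TRUE)

or_by_bin <- sapply(levels(snp_cor$bin), function(b) {
  d <- snp_cor[which(snp_cor$bin == b), ]
  if (nrow(d) == 0) return(NA)                         # skip empty bins
  fem <- matrix(c(sum(d$target_hits), sum(d$target_misses),
                  sum(d$ref_hits),    sum(d$ref_misses)),
                nrow = 2, byrow = TRUE)
  fisher.test(fem)$estimate                            # odds ratio for this bin
})

cor(mids, or_by_bin, use = "complete.obs")             # positive -> Jensen effect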


Okay, next order of business is calculating Emily's score. I guess I can try that Italian lawyer too.

Is this a higher priority than doing FE on ancestral and derived alleles?
Were you able to enlist that other Hsu btw? I am just curious, I am not going to meddle.


Yes, it is a higher priority. And yes, the next order of business is calculating the FE score on Emily and the Italian lawyer. And yes, I was able to enlist Jeff Hsu. He's gonna be part (unless he defects) of my crowdfunding team on Indiegogo.
Neat. You should try to get his genome, if he agrees.

Furthermore, the previous analysis was done on the 3-SNP PC. I should probably always use the 7-SNP PC from now on, unless you specify otherwise, right?

I imagine Hsu will get into trouble. IQ/race is incredibly taboo in the US, and workers' rights are not so strong there. He could easily be fired.

Ps. I couldn't find the lawyer genome anywhere. Do you have a link?


So far I have reported the results for the 3-SNP and 4-SNP PCs SEPARATELY. Can we continue doing that? The 3 SNPs have not been replicated but the 4 SNPs have. I'd like to be immune from attacks such as "some of the SNPs they used were not replicated". If we keep the two PCs separate, we can always point to the results from the replicated SNPs.

I guess Hsu is really brave, especially since he lives in the US. The US is a horrible country... populated by... hmm, I'd better stop here.
Emily's scores on the 7-SNP PC. Not promising. Will try the 4-SNP PC.

data/another_dummy_file_Emily_middling
500005 501056
try({print(fem)})
[,1] [,2]
[1,] 500005 488747
[2,] 501056 487737

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.1429
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9903193 1.0014145
sample estimates:
odds ratio
0.99584


7 of 10 steps (70%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Emily_low.fe, /local/home/endrebak/playground/freewas/data/CEU_low.fe
output: data/another_dummy_file_Emily_low
data/another_dummy_file_Emily_low
23569 23495
try({print(fem)})
[,1] [,2]
[1,] 23569 24313
[2,] 23495 24391

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.6278
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9811037 1.0322560
sample estimates:
odds ratio
1.006368


8 of 10 steps (80%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Emily_high.fe, /local/home/endrebak/playground/freewas/data/CEU_high.fe
output: data/another_dummy_file_Emily_high
data/another_dummy_file_Emily_high
5310 5469
try({print(fem)})
[,1] [,2]
[1,] 5310 2052
[2,] 5469 1895

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.003488
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8329371 0.9651712
sample estimates:
odds ratio
0.8966482
The 4-SNP results are also bad:

data/another_dummy_file_Emily_middling
499049 500338
try({print(fem)})
[,1] [,2]
[1,] 499049 489481
[2,] 500338 488239

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.07195
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.989373 1.000457
sample estimates:
odds ratio
0.9948925


7 of 10 steps (70%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Emily_low.fe, /local/home/endrebak/playground/freewas/data/CEU_low.fe
output: data/another_dummy_file_Emily_low
data/another_dummy_file_Emily_low
24557 24545
try({print(fem)})
[,1] [,2]
[1,] 24557 25619
[2,] 24545 25630

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.9446
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9763643 1.0260905
sample estimates:
odds ratio
1.000918


8 of 10 steps (80%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Emily_high.fe, /local/home/endrebak/playground/freewas/data/CEU_high.fe
output: data/another_dummy_file_Emily_high
data/another_dummy_file_Emily_high
3446 3529
try({print(fem)})
[,1] [,2]
[1,] 3446 1844
[2,] 3529 1761

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.09256
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.8597678 1.0114838
sample estimates:
odds ratio
0.9325439


My main guess would be that the Watson/Venter results are a fluke and that the method is not so strong.

Other possibilities: there is some mismatch due to the genome versions (GRCh37 vs GRCh38) being slightly different. But I doubt it, since Razib and Emil scored badly when we used GRCh37 on them. Plus, it really shouldn't matter, since we only use rs SNPs, which should be backwards compatible.

I will keep the script and retry this on 23andMe data when they update their genome version. I've spent too much time on this to give up completely.
Other possibility: these are educational attainment genes and not IQ genes. Or Emily's scoring lower is a fluke. The use of different genome versions may be a problem too, but before discarding the 23andMe data, try it on my genome and the Italian lawyer's. If they do not work, then we will have to wait for 23andMe to update their genome version. Or I'll use a private firm to sequence the genomes from my Indiegogo campaign.
Other possibility: online IQ tests are not very reliable, and we'll have to run our study only on famous scientists or Fields Medal winners. But before that, let's try a few more genomes of common mortals such as myself and the Italian lawyer.
I do not think it is wise to raise money for this, given the current results:

data/another_dummy_file_Lawyeresse_middling
496751 496380
try({print(fem)})
[,1] [,2]
[1,] 496751 491437
[2,] 496380 491860

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.5722
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9960235 1.0072118
sample estimates:
odds ratio
1.001609


48 of 51 steps (94%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Lawyeresse_low.fe, /local/home/endrebak/playground/freewas/data/TSI_low.fe
output: data/another_dummy_file_Lawyeresse_low
data/another_dummy_file_Lawyeresse_low
24270 24512
try({print(fem)})
[,1] [,2]
[1,] 24270 25876
[2,] 24512 25635

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.1279
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.956819 1.005597
sample estimates:
odds ratio
0.9809003


49 of 51 steps (96%) done
rule compute_fisher_exact:
input: /local/home/endrebak/playground/freewas/data/Lawyeresse_high.fe, /local/home/endrebak/playground/freewas/data/TSI_high.fe
output: data/another_dummy_file_Lawyeresse_high
data/another_dummy_file_Lawyeresse_high
3487 3472
try({print(fem)})
[,1] [,2]
[1,] 3487 1815
[2,] 3472 1831

try({fisher.test(fem, alternative="two.sided")})

Fisher's Exact Test for Count Data

data: fem
p-value = 0.7591
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.9343777 1.0986160
sample estimates:
odds ratio
1.013175