A free GWAS based on Piffer's method: hunting for IQ alleles at no cost
Will do.

Should I use those SNPs that have a mean MAF > .01 across all populations? I can also use the SNPs that have a freq of > .01 for at least one population, but the latter is more work.
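For what it's worth, the two filtering options can be sketched in a few lines of Python. This is only an illustration: the SNP names and frequencies below are made up, the function names are hypothetical, and the threshold is a parameter so a looser cutoff can be tried.

```python
from statistics import mean

def passes_mean_maf(mafs, threshold=0.01):
    """Option 1: keep the SNP if its mean MAF across populations exceeds the threshold."""
    return mean(mafs) > threshold

def passes_any_maf(mafs, threshold=0.01):
    """Option 2: keep the SNP if at least one population exceeds the threshold."""
    return any(m > threshold for m in mafs)

# Hypothetical per-population MAFs (not real data), three populations per SNP
example = {
    "snp_common": [0.20, 0.05, 0.31],
    "snp_local": [0.0, 0.0, 0.025],   # only present in one population
    "snp_rare": [0.001, 0.0, 0.002],
}

kept_by_mean = [snp for snp, mafs in example.items() if passes_mean_maf(mafs)]
kept_by_any = [snp for snp, mafs in example.items() if passes_any_maf(mafs)]
print(kept_by_mean)
print(kept_by_any)
```

The "at least one population" rule keeps SNPs that are common in a single population but rare everywhere else, which the mean-based rule drops unless the threshold is lowered.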

Please first make sure that freq data is available for the 26 populations because I suspect it's available only for 14.


The VCF files do not contain frequency data; I compute the population means myself. Should post a guide on doing this sometime.
Should I use those SNPs that have a mean MAF > .01 across all populations?


Well, in that case 0.01 may be too stringent a criterion. I'd use > 0.005.
Phase 3 was released on June 24, 2014 and should include frequency for 26 populations: http://www.1000genomes.org/home
It looks like 1KG have not updated the FAQ, because here they say they have variants only for 14 populations: http://www.1000genomes.org/faq/which-populations-are-part-your-study
So please do send me the frequencies of the 4 alleles for the 26 populations. I am really curious to see what it looks like!
I'll do it on Monday. I hack 24/7 with weekends off.

E: Derp. Meant 24/5. Down the road of sitting at the computer 24/7 lies madness.
Are you sure the identifier rs1584700 is correct? The only google hit on it is your paper...
It's rs11584700. Spelling mistake in the table.
Will continue tomorrow. This is a lot more work than it sounds like; the SNPs are not at the positions indicated in the dbSNP so I can't look them up in an index, but have to hunt them down by searching the files. The files are 50 gigs in size so this is really time-consuming.
I attach the .xls table with the correct spelling (I double checked and the other 3 were correct and it was just a spelling mistake luckily because I had downloaded the frequencies for the right alleles). I guess next part of the project shall be less time consuming but right now it's important to have frequencies for all 26 populations.
Here is the data for the SNPs

rs9320913
rs11584700
rs236330
rs4851266

in that order.

The populations are ACB ASW BEB CDX CEU CHB CHS CLM ESN FIN GBR GIH GWD IBS ITU JPT KHV LWK MSL MXL PEL PJL PUR STU TSI YRI, in that order.

0.05208 0.07377 0.25 0.3172 0.2071 0.3058 0.2762 0.1011 0.06061 0.2778 0.2692 0.2913 0.0354 0.2243 0.2696 0.3798 0.3081 0.07576 0.05882 0.101 0.02941 0.2188 0.1346 0.2402 0.1729 0.05556
0.401 0.4016 0.1221 0.1505 0.1869 0.08738 0.1095 0.2021 0.3434 0.2323 0.2967 0.08738 0.3451 0.2523 0.06863 0.2212 0.1212 0.2929 0.3353 0.156 0.05882 0.1146 0.2981 0.09804 0.2009 0.287
0.1042 0.08197 0.1395 0.4946 0.4141 0.4417 0.4333 0.2394 0.0404 0.3535 0.4341 0.3155 0.06637 0.3645 0.2402 0.476 0.3788 0.02525 0.08824 0.328 0.3588 0.3125 0.2788 0.2157 0.3738 0.05093
0.1615 0.2295 0.2907 0.4409 0.5 0.4369 0.4 0.4149 0.1869 0.4697 0.489 0.3252 0.1549 0.4907 0.1765 0.3462 0.4545 0.1364 0.2353 0.281 0.2 0.2396 0.4231 0.2108 0.486 0.1852

I'll paste them into the document tomorrow, but if you want them today you've got to do it yourself.
[hr]
Whoops, the above might be unusable. I need to find out which allele the frequency is for. Plink (https://www.cog-genomics.org/plink2/) reports the frequency of the minor allele, and the minor allele might differ between populations, so the above data might not be consistent. Will look into it tomorrow.

Yes indeed they are unusable. I compared the values to the 14 populations from phase 1 of 1KG (for rs9320913 and rs11584700) and they're different as you can see here: https://docs.google.com/spreadsheets/d/1yIyeyalb9GV11tmUz7YcGeKZjmQ7_P2M24MPW82G6Kc/edit?usp=sharing

This means that it's not the same allele for all the populations. Or worse, these could be different SNPs. Take for example rs9320913...the minor allele is clearly the same in both datasets for these populations: ASW 0.073 vs 0.23; LWK 0.075 vs 0.17; YRI 0.005 vs 0.18; IBS 0.22 vs 0.43
Yet their frequencies are too different for it to be the same SNP!
There is something fishy here.
Yes, I see. I have plenty of ideas for tomorrow. But it seems that SNPs can jump around between genome builds and that plenty of other things can go wrong: http://www.ncbi.nlm.nih.gov/sites/books/NBK44467/

Since I'm using the "freq" function in plink, I doubt that the calculated averages are wrong. https://www.cog-genomics.org/plink2/basic_stats#freq

As a first sanity check, I will try my method on older data. If what I am doing is correct, the older (2012?) data should give results similar to what you got in your paper.

I'll also describe what I did so we can try to brainstorm what went wrong. But I just did something like


plink --vcf file.vcf --snp rs9320913 --freq


and it would be very surprising if plink made a mistake in the calculations, so I'm betting differences in the data are what is producing the fishiness.

Ps. these kinds of SNP confusions are unfortunately very common.
[hr]
Ps. in the stuff I posted I realized that the minor allele might be different between populations. This is because when plink computes frequencies of an allele, it computes the frequency for the minor allele. Since what is a minor allele might differ between populations, one average might represent the frequency of allele A for one population, but the frequency of allele B in another.

I'll correct this tomorrow by using the https://www.cog-genomics.org/plink2/input#within modifier to compute all the alleles at the same time (sorry if this is too much techno-speak).
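To make the allele problem concrete, here is a rough Python sketch of putting every population on the same allele after the fact. It assumes biallelic SNPs and assumes you know which allele each frequency refers to; the function name is hypothetical, and the "C" label for the ASW value is my guess at what the first run reported for rs236330.

```python
def align_to_allele(rows, reference_allele):
    """Express every population's frequency in terms of one reference allele.

    rows: iterable of (population, reported_allele, frequency).
    Assumes a biallelic SNP, so the other allele's frequency is 1 - f.
    """
    aligned = {}
    for pop, allele, freq in rows:
        aligned[pop] = freq if allele == reference_allele else 1.0 - freq
    return aligned

# rs236330 values from the first posting; which allele each refers to is an assumption
rows = [("ASW", "C", 0.4016), ("CEU", "T", 0.1869)]
aligned = align_to_allele(rows, "T")
for pop, freq in aligned.items():
    print(pop, round(freq, 4))
```

Computing the frequencies with a fixed allele order in plink avoids the need for this step altogether; the sketch only shows why the raw minor-allele numbers cannot be compared directly.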
Well, it looks like even the MAFs are extremely different (ASW 0.073 vs 0.23; LWK 0.075 vs 0.17; YRI 0.005 vs 0.18; IBS 0.22 vs 0.43), so the problem is unlikely to be due to MAF swapping.
The older data that you can see here (https://docs.google.com/spreadsheets/d/1-iPXIPJ847uGLaho0Jb2uP0Uotl1y-fPmmKRkwZKPnA/edit?usp=sharing) is correct. I can say this because it matches HapMap and ALFRED. If the new data you get doesn't match it for the 14 populations, it's most likely wrong.
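A quick way to run this kind of check over all shared populations is sketched below in Python. The four pairs are the ones quoted above; the 0.05 tolerance is an arbitrary assumption, and the function name is hypothetical.

```python
def flag_mismatches(old, new, tolerance=0.05):
    """Return {population: (old, new)} where the two frequencies differ by more than tolerance.

    Only populations present in both datasets are compared.
    """
    shared = sorted(set(old) & set(new))
    return {p: (old[p], new[p]) for p in shared if abs(old[p] - new[p]) > tolerance}

# Pairs quoted in this thread: newly computed values vs the older reference values
new_values = {"ASW": 0.073, "LWK": 0.075, "YRI": 0.005, "IBS": 0.22}
old_values = {"ASW": 0.23, "LWK": 0.17, "YRI": 0.18, "IBS": 0.43}

mismatches = flag_mismatches(old_values, new_values)
for pop, (old, new) in mismatches.items():
    print(pop, old, new)
```

Small differences between releases are expected, so the tolerance should be loose enough to ignore sampling noise but tight enough to catch a wrong allele or a wrong SNP.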
In the file below I have used 1kg phase1 and plink:


CHR SNP CLST A1 A2 MAF MAC NCHROBS
1 rs236330 ASW T C 0.5984 73 122
1 rs236330 CEU T C 0.1882 32 170
1 rs236330 CHB T C 0.08247 16 194
1 rs236330 CHS T C 0.11 22 200
1 rs236330 CLM T C 0.1583 19 120
1 rs236330 FIN T C 0.2473 46 186
1 rs236330 GBR T C 0.2865 51 178
1 rs236330 IBS T C 0.3929 11 28
1 rs236330 JPT T C 0.2079 37 178
1 rs236330 LWK T C 0.6907 134 194
1 rs236330 MXL T C 0.1515 20 132
1 rs236330 PUR T C 0.3182 35 110
1 rs236330 TSI T C 0.1837 36 196
1 rs236330 YRI T C 0.7216 127 176
1 rs11584700 ASW G A 0.07377 9 122
1 rs11584700 CEU G A 0.2059 35 170
1 rs11584700 CHB G A 0.3041 59 194
1 rs11584700 CHS G A 0.25 50 200
1 rs11584700 CLM G A 0.09167 11 120
1 rs11584700 FIN G A 0.2688 50 186
1 rs11584700 GBR G A 0.2584 46 178
1 rs11584700 IBS G A 0.2143 6 28
1 rs11584700 JPT G A 0.3764 67 178
1 rs11584700 LWK G A 0.09278 18 194
1 rs11584700 MXL G A 0.09091 12 132
1 rs11584700 PUR G A 0.1455 16 110
1 rs11584700 TSI G A 0.1786 35 196
1 rs11584700 YRI G A 0.0625 11 176


I only checked a few populations, but the results seem to be the same as in the paper: Factor Analysis of Population Allele Frequencies as a Simple, Novel Method of Detecting Signals of Recent Polygenic Selection: The Example of Educational Attainment and IQ.

Success!

My method:


# Download files
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.panel

# Convert vcf to plink binary format
plink --vcf ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz --make-bed --out phase1

# Add population info to the fam file to enable stratified frq computations
cut -f 2 phase1_integrated_calls.20101123.ALL.panel > pop_phase1
cut -d " " -f 2- phase1.fam > phase1_temp.fam
paste -d " " pop_phase1 phase1_temp.fam > phase1.fam

# Compute frq for the two IQ SNPs on chromosome 1
plink --bfile phase1 --family --snps rs11584700 rs236330 --keep-allele-order --freq --out piffer


I'll try my method with the phase3 data tomorrow, safe in the knowledge that if my method gives different results, it is due to the data changing somehow.

Ninja edit: my results seem to match the ones you just posted. You need to invert the numbers for rs236330 (1-x), but otherwise they seem to be the same.


That's good. I hope the phase 3 results will be in line. Otherwise we'll have to stick to the phase 1 data and carry on our GWAS with it!
Admin
For those who are wondering which the new populations are, as I was, they are these. Info is from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131013_ashg2013_tutorial/20131023_the_1000genomes_project_erik.pdf

I'm excited to see how the height and g analyses will turn out.


Aren't Caribbeans the ones with the highest incidence of schizophrenia? If so, good to have them included for the next Piffer study...

Btw, how do you get an "SQ" (schizophrenia quotient) for each population?
[hr]

Here is data for the chromosome one SNPs for phase three (they are very similar to the ones for phase 1, but not identical which is to be expected):


CHR SNP CLST A1 A2 MAF MAC NCHROBS
1 rs236330 ACB T C 0.599 115 192
1 rs236330 ASW T C 0.5984 73 122
1 rs236330 BEB T C 0.1221 21 172
1 rs236330 CDX T C 0.1505 28 186
1 rs236330 CEU T C 0.1869 37 198
1 rs236330 CHB T C 0.08738 18 206
1 rs236330 CHS T C 0.1095 23 210
1 rs236330 CLM T C 0.2021 38 188
1 rs236330 ESN T C 0.6566 130 198
1 rs236330 FIN T C 0.2323 46 198
1 rs236330 GBR T C 0.2967 54 182
1 rs236330 GIH T C 0.08738 18 206
1 rs236330 GWD T C 0.6549 148 226
1 rs236330 IBS T C 0.2523 54 214
1 rs236330 ITU T C 0.06863 14 204
1 rs236330 JPT T C 0.2212 46 208
1 rs236330 KHV T C 0.1212 24 198
1 rs236330 LWK T C 0.7071 140 198
1 rs236330 MSL T C 0.6647 113 170
1 rs236330 MXL T C 0.1562 20 128
1 rs236330 PEL T C 0.05882 10 170
1 rs236330 PJL T C 0.1146 22 192
1 rs236330 PUR T C 0.2981 62 208
1 rs236330 STU T C 0.09804 20 204
1 rs236330 TSI T C 0.2009 43 214
1 rs236330 YRI T C 0.713 154 216
1 rs11584700 ACB G A 0.05208 10 192
1 rs11584700 ASW G A 0.07377 9 122
1 rs11584700 BEB G A 0.25 43 172
1 rs11584700 CDX G A 0.3172 59 186
1 rs11584700 CEU G A 0.2071 41 198
1 rs11584700 CHB G A 0.3058 63 206
1 rs11584700 CHS G A 0.2762 58 210
1 rs11584700 CLM G A 0.1011 19 188
1 rs11584700 ESN G A 0.06061 12 198
1 rs11584700 FIN G A 0.2778 55 198
1 rs11584700 GBR G A 0.2692 49 182
1 rs11584700 GIH G A 0.2913 60 206
1 rs11584700 GWD G A 0.0354 8 226
1 rs11584700 IBS G A 0.2243 48 214
1 rs11584700 ITU G A 0.2696 55 204
1 rs11584700 JPT G A 0.3798 79 208
1 rs11584700 KHV G A 0.3081 61 198
1 rs11584700 LWK G A 0.07576 15 198
1 rs11584700 MSL G A 0.05882 10 170
1 rs11584700 MXL G A 0.1016 13 128
1 rs11584700 PEL G A 0.02941 5 170
1 rs11584700 PJL G A 0.2188 42 192
1 rs11584700 PUR G A 0.1346 28 208
1 rs11584700 STU G A 0.2402 49 204
1 rs11584700 TSI G A 0.1729 37 214
1 rs11584700 YRI G A 0.05556 12 216


Will compute for the other two SNPs today and fill out that form.
[hr]
Here are the two other SNPs (if you want it in that form I have to write a script converting the data to csv, which I'll do tomorrow, if desirable):


CHR SNP CLST A1 A2 MAF MAC NCHROBS
6 rs9320913 ACB A C 0.1615 31 192
6 rs9320913 ASW A C 0.2295 28 122
6 rs9320913 BEB A C 0.2907 50 172
6 rs9320913 CDX A C 0.4409 82 186
6 rs9320913 CEU A C 0.5 99 198
6 rs9320913 CHB A C 0.4369 90 206
6 rs9320913 CHS A C 0.4 84 210
6 rs9320913 CLM A C 0.4149 78 188
6 rs9320913 ESN A C 0.1869 37 198
6 rs9320913 FIN A C 0.5303 105 198
6 rs9320913 GBR A C 0.489 89 182
6 rs9320913 GIH A C 0.3252 67 206
6 rs9320913 GWD A C 0.1549 35 226
6 rs9320913 IBS A C 0.5093 109 214
6 rs9320913 ITU A C 0.1765 36 204
6 rs9320913 JPT A C 0.3462 72 208
6 rs9320913 KHV A C 0.4545 90 198
6 rs9320913 LWK A C 0.1364 27 198
6 rs9320913 MSL A C 0.2353 40 170
6 rs9320913 MXL A C 0.2812 36 128
6 rs9320913 PEL A C 0.2 34 170
6 rs9320913 PJL A C 0.2396 46 192
6 rs9320913 PUR A C 0.4231 88 208
6 rs9320913 STU A C 0.2108 43 204
6 rs9320913 TSI A C 0.514 110 214
6 rs9320913 YRI A C 0.1852 40 216
CHR SNP CLST A1 A2 MAF MAC NCHROBS
2 rs4851266 ACB T C 0.1042 20 192
2 rs4851266 ASW T C 0.08197 10 122
2 rs4851266 BEB T C 0.1395 24 172
2 rs4851266 CDX T C 0.4946 92 186
2 rs4851266 CEU T C 0.4141 82 198
2 rs4851266 CHB T C 0.5583 115 206
2 rs4851266 CHS T C 0.5667 119 210
2 rs4851266 CLM T C 0.2394 45 188
2 rs4851266 ESN T C 0.0404 8 198
2 rs4851266 FIN T C 0.3535 70 198
2 rs4851266 GBR T C 0.4341 79 182
2 rs4851266 GIH T C 0.3155 65 206
2 rs4851266 GWD T C 0.06637 15 226
2 rs4851266 IBS T C 0.3645 78 214
2 rs4851266 ITU T C 0.2402 49 204
2 rs4851266 JPT T C 0.524 109 208
2 rs4851266 KHV T C 0.6212 123 198
2 rs4851266 LWK T C 0.02525 5 198
2 rs4851266 MSL T C 0.08824 15 170
2 rs4851266 MXL T C 0.3281 42 128
2 rs4851266 PEL T C 0.3588 61 170
2 rs4851266 PJL T C 0.3125 60 192
2 rs4851266 PUR T C 0.2788 58 208
2 rs4851266 STU T C 0.2157 44 204
2 rs4851266 TSI T C 0.3738 80 214
2 rs4851266 YRI T C 0.05093 11 216
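A conversion script along these lines could look roughly like the Python sketch below. It assumes the whitespace-separated layout shown above (CHR SNP CLST A1 A2 MAF MAC NCHROBS), skips any repeated header lines, and writes one row per SNP with one column per population; the function name is hypothetical.

```python
import csv
import io

def frq_strat_to_csv(text):
    """Pivot a plink-style stratified frequency table into wide CSV text.

    Output: one row per SNP, one column per population (CLST), cells hold the MAF column.
    """
    table = {}  # snp -> {population: maf string}
    pops = []   # populations in order of first appearance
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and (repeated) header lines
        if len(fields) != 8 or fields[0] == "CHR":
            continue
        _chrom, snp, pop, _a1, _a2, maf = fields[:6]
        table.setdefault(snp, {})[pop] = maf
        if pop not in pops:
            pops.append(pop)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["SNP"] + pops)
    for snp, by_pop in table.items():
        writer.writerow([snp] + [by_pop.get(p, "") for p in pops])
    return out.getvalue()

# First rows of the two tables above, including the repeated header
sample = """CHR SNP CLST A1 A2 MAF MAC NCHROBS
6 rs9320913 ACB A C 0.1615 31 192
6 rs9320913 ASW A C 0.2295 28 122
CHR SNP CLST A1 A2 MAF MAC NCHROBS
2 rs4851266 ACB T C 0.1042 20 192
2 rs4851266 ASW T C 0.08197 10 122
"""
csv_text = frq_strat_to_csv(sample)
print(csv_text)
```

The frequencies are kept as strings so the values land in the CSV exactly as plink printed them.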
Admin
There is this: http://en.wikipedia.org/wiki/Schizophrenia#Epidemiology

It is not prevalence, but DALY by country. Treatment is perhaps a strong environmental effect that obscures the genetic signal. I don't know. Perhaps there are data for prevalence somewhere.

It has been said that psychoticism is related to genius and psychoticism is related to schizophrenia. If it can explain the general lack of Asian geniuses despite high g, then it must be the case that Asians have lower schizophrenia allele frequencies.

http://en.wikipedia.org/wiki/Psychoticism
Unable to make docs, but here is a csv which should be importable in your software: