One of the uses of my method is as a tool in gene discovery. The factor score that I extracted from the 4 SNPs should be correlated with SNP allele frequencies throughout the genome. The SNPs with the highest correlation to the principal component I extracted from the 4 hits are the best candidates for IQ-increasing alleles. This strategy can greatly decrease the number of SNPs that are included in a genome-wide association study, reducing the multiple-testing problem and the required sample size. Then we could check the frequencies of all these alleles in Craig Venter's and James Watson's genomes (which are published online) and carry out a chi-square test to see if their frequencies are higher than in the normal population. Once we've selected the best alleles, the sample size required to carry out a GWAS will be much smaller because a massive Bonferroni correction is avoided.
It'd be helpful if someone could build software that does this automatically.
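To put rough numbers on the multiple-testing point: under a Bonferroni correction the per-SNP significance threshold is the family-wise alpha divided by the number of SNPs tested. The SNP counts below are hypothetical and chosen only for illustration, and the sketch says nothing about whether pre-selecting SNPs this way biases the later test.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Hypothetical counts, for illustration only (not taken from any dataset).
genome_wide_tests = 1_000_000
preselected_tests = 1_000

genome_wide = bonferroni_threshold(0.05, genome_wide_tests)
preselected = bonferroni_threshold(0.05, preselected_tests)
print(f"{genome_wide:.1e}")   # threshold when testing every SNP
print(f"{preselected:.1e}")   # threshold when testing only the pre-selected SNPs
```

With these made-up counts the per-SNP threshold is a thousand times less strict for the pre-selected set, which is where the smaller required sample size would come from.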
How about asking about this on forums or comment sections on blogs interested in HBD?
I suggest HBD reddit.
I'd do it eventually, but am currently busy with implementing the stats tools paper.
I am a bit worried we'd risk someone running the GWAS without giving us proper credit. It's better to keep this among us.
I do not understand exactly what you intend to do. Could you write it down as steps?
Ps. JW claimed to have a 122 IQ in his DNA book.
Craig Venter claimed to have an IQ of 142 in his autobiography ("A Life Decoded"). I've heard that Watson's IQ is not very high and thought it was in the 120s but didn't know the actual figure. When I suggested this in a forum, people said it wasn't true and that his IQ is much higher. I've not read his book but I'll take your word for it. However, given his achievements (he got a PhD very young, won a Nobel prize and all that), it's likely that his IQ score underestimates his true level of intelligence, so we can probably make an exception in his case.
Anyway, this is not the real problem, we can actually get the genomes from other individuals who have taken the 23andMe test (myself for example or Emil's).
What I want to do is very simple indeed:
1) Download the frequencies of all the SNPs present on 1KG for the 14 populations. For each SNP, pick one of the two alleles at random.
2) Calculate the correlation between each SNP's allele and the Principal Component I extracted from the 4 top IQ SNPs for the 14 populations: see table 1, col. 2 "IQPC" in Piffer, doi: http://dx.doi.org/10.1101/008011
3) Pick the SNPs with the strongest correlation (r>0.9) to IQPC. This is a reverse-engineering of the method of correlated vectors. If the SNPs with the lowest genome-wide p values have higher loadings on the PC, then finding the SNPs with higher loadings on the PC will lead to selecting those with the lowest genome-wide p values.
4) Select genomes from people with high IQ (a small sample shall suffice). See if the average frequency of the selected alleles is higher among them than the average frequency for their reference population (that reported in 1KG for their country of origin, e.g. if they're from England, the GBR frequency). Carry out a chi-square test to assess the significance of the frequency difference.
5) Select the SNPs with the highest frequency difference between the 1KG population and the high IQ sample.
6) Get another sample of high IQ individuals and see if this finding replicates.
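The chi-square comparison in step 4 could look roughly like the sketch below. It assumes the data have been reduced to a 2x2 table of allele counts (selected allele vs. other allele, high-IQ sample vs. reference population); the function name and the counts in the example are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for a 2x2 table.

    a, b: counts of the selected allele / other allele in the high-IQ sample
    c, d: counts of the selected allele / other allele in the reference population
    Returns (chi2, p_value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 20 high-IQ genomes (40 alleles) vs. 100 reference genomes.
chi2, p = chi_square_2x2(30, 10, 100, 100)
print(round(chi2, 3), p)
```

With a very small sample the expected counts can get too low for the chi-square approximation, in which case an exact test (e.g. Fisher's) would be the safer choice.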
Oh, I should have said that I believe his 122 estimate almost as little as I believe Feynman's 125 estimate.
He claimed to have found out by looking at some papers on his teacher's desk when she wasn't looking. So it sounds like it was at a grade school age and hence likely to change, but it might not even be true to begin with. He also made this claim in the last part of his book, which was very schmaltzy ("genes aren't destiny, everybody used to think the Irish were dumb but look at them now, etc etc").
I guess we'll never know. Do you understand my proposal now?
I'll look more closely at it after work today, but I think so. Step two is just: for each SNP I'll make a vector of length 14 with allele frequencies and correlate that with the length-14 vector of IQ PCA scores. Right?
It seems like this might work for those SNPs where the actual alleles that are boosting/lessening the trait are not reported, only the SNPs.
Yes that's right.
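A minimal sketch of that step, assuming the frequencies are already in memory as one list per SNP with the populations in the same order as the score vector. The function names, SNP IDs and numbers are made up for illustration; the real IQPC values are the ones in the paper's table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # e.g. an allele at frequency 0 in every population
    return sxy / math.sqrt(sxx * syy)

def select_snps(freqs_by_snp, scores, r_min=0.9):
    """Return {snp_id: r} for SNPs whose frequency vector correlates above r_min."""
    selected = {}
    for snp_id, freqs in freqs_by_snp.items():
        r = pearson_r(freqs, scores)
        if r is not None and r > r_min:
            selected[snp_id] = r
    return selected

# Toy data with 4 populations instead of 14 (all values hypothetical).
scores = [-1.0, 0.0, 0.5, 1.5]
freqs_by_snp = {
    "snp_a": [0.1, 0.3, 0.4, 0.6],   # rises with the score
    "snp_b": [0.5, 0.5, 0.5, 0.5],   # constant, correlation undefined
    "snp_c": [0.6, 0.4, 0.3, 0.1],   # falls with the score
}
selected = select_snps(freqs_by_snp, scores)
print(selected)
```

Because the allele is picked at random in step 1, a strongly negative r just means the other allele of that SNP has the strongly positive one (its frequency is 1 minus this one), so in practice one might want to keep |r| > 0.9 and record the sign. The sketch keeps the literal r > 0.9 rule.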
I want to use this for the entire set of SNPs published on 1KGenomes, not only the candidate SNPs reported in other GWAS. It's possible to download the entire 1KG dataset from their website.
I wanna run a new GWAS from scratch. The big advantage of this method is the drastic reduction in sample size needed, because the selected SNPs are much more likely to be GWAS hits compared to random SNPs.
By the way, I think this project is much simpler to carry out than the one you're working on at the moment. It's just a matter of downloading all the SNPs and correlating them to the IQPC.
They are in VCF (Variant Call Format) and can be downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/
I do not know how to import a VCF file into R or Excel though. Do you know how to do it?
I never load them in R or Excel myself, but this tool claims to be able to load them into R:
http://zhanxw.com/vcf2geno/
Here is their updated version of the tool http://zhanxw.com/seqminer/
As I haven't used them I can't recommend them.
Well, if you can do this:
1. Read SNPs in list of SNPs affecting trait value (Input 1) from 1kg VCF (Input 3)
2. Compute frequencies of SNPs in each population, creating a matrix of rows with SNPs and columns with populations; the data in the matrix are simply the allele frequencies
You can read all SNPs from the 1KG VCF, right?
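For anyone wanting to try the two steps above without extra tools, here is a rough pure-Python sketch. It assumes a standard VCF layout (sample genotypes from the tenth column on, with GT as the first per-sample field) and a tab-separated panel file with the sample ID in column 1 and the population code in column 2 after a header line. The function names and the toy records are hypothetical, and the sketch reports the frequency of the ALT allele, which is not necessarily the minor allele.

```python
from collections import defaultdict

def read_panel(lines):
    """Map sample ID -> population code from tab-separated panel lines (header skipped)."""
    sample_to_pop = {}
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0 or len(fields) < 2:
            continue  # first line is the header ("sample", "pop", ...)
        sample_to_pop[fields[0]] = fields[1]
    return sample_to_pop

def population_frequencies(vcf_lines, sample_to_pop):
    """Yield (snp_id, {pop: ALT allele frequency}) for each biallelic VCF record."""
    samples = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("##"):
            continue  # blank or meta-information line
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs start at column 10
            continue
        if "," in fields[4]:
            continue  # skip multi-allelic sites in this sketch
        alt = defaultdict(int)
        total = defaultdict(int)
        for sample, call in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue  # sample not listed in the panel
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == ".":
                    continue  # missing call
                total[pop] += 1
                alt[pop] += allele == "1"
        yield fields[2], {pop: alt[pop] / total[pop] for pop in total}

# Toy input (sample IDs, position and SNP ID are made up).
panel = ["sample\tpop\tsuper_pop\tgender\n",
         "S1\tGBR\tEUR\tmale\n", "S2\tGBR\tEUR\tfemale\n", "S3\tYRI\tAFR\tmale\n"]
vcf = ["##fileformat=VCFv4.1\n",
       "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
       "21\t100\trs_demo\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0\n"]
rows = dict(population_frequencies(vcf, read_panel(panel)))
print(rows)
```

The real files are gzip-compressed, so the lines would come from something like `gzip.open(path, "rt")` rather than from lists, and a whole chromosome will be slow in pure Python compared with dedicated tools.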
Yeah, I have no problem doing this, I thought you were asking for your own sake.
Well ideally I'd like to be able to do it on my own, but it's not a priority. If you can do that, then we can work on this GWAS together. Considering that labs spend millions of dollars to run a GWAS and we can do it for free, it's no small feat!
Normally these things aren't easily accessible, you often need some UNIX command line fu to get stuff done in bioinformatics (even many of our lab biologists need to learn it).
Ok got it. Well if you could do this, it'd be brilliant. I think it's much easier than the other project.
Yes, will do.
One thing, couldn't I also correlate these SNP-frequencies against a vector of phenotype scores, that is actual IQ (potentially adjusting/normalizing them in some way)? Since your IQ PC is based on so few SNPs, the actual IQ scores might be closer to the real "genotypic IQs" than those you estimated (not a critique of your method, but the scores are based on very few datapoints). I think I will do this since I don't need to write a different program for this, I only need to input different data to the program. And that is trivial to do twice.
And is there any reason to expect that a SNP affecting a trait will be distributed like the average genotypic scores for that trait across populations? I see that this is an interesting thing to try; low/no cost and _might_ produce valuable results, but I'm still wondering if there is any theoretical reason to expect it to yield results?
Well yes, you can use IQ as well, even if it won't change much because the IQ PC is strongly correlated with country IQs (r=0.93). Remember though, IQ is not solely due to genetics. In my paper I didn't find strong evidence for environmental determinants, but there was a weak predictive power of GDP and Human Development on the residuals, which suggested that phenotypic IQs are a little bit affected by the environment. Moreover, when we use ALFRED, there are IQ data for only about 15 out of 50 populations. So in that case it's much more powerful to use the genotypic IQ I estimated. So yes, for 1KGenomes we can use both genotypic IQ and phenotypic IQ, but for ALFRED we'll have to use only genotypic IQ.
No single SNP affecting a trait will be perfectly distributed like the average phenotypic scores for that trait across populations, but on average a SNP affecting a trait will be distributed more similarly to the genotypic scores (due to polygenic selection) than a random SNP. This is why they are so well correlated as to produce a principal component accounting for over 50% of the variance.
Now I've written a script that computes r2 for snp-freqs and pca scores; will start using it on chromosome 21 on Monday.
Should the vector of allele frequencies be normalized or massaged in any way?
It seems that most alleles are driven to fixation, so there are a lot of zeroes here. Example from the frequency file (if you want me to use major instead of minor allele frequencies just tell me):
0 0 0 0.005376 0 0 0 0 0 0 0 0 0 0
0 0.008197 0 0 0.005051 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0.004425 0
0 0 0 0 0 0 0 0.005319 0 0 0 0 0 0
0.01562 0.02459 0 0 0 0 0 0 0.005051 0 0 0 0.004425 0
0.02604 0.01639 0 0 0 0 0 0 0.005051 0 0 0 0.01327 0
0 0 0 0 0 0 0 0 0 0 0 0.009709 0 0
0 0 0 0 0 0 0 0.01064 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ps. 1kg now contains 26 populations it seems:
Gilfoyles-MacBook-Pro:freewas bertramgilfoyle$ cut -f 2 data/integrated_call_samples.20130502.ALL.panel | sort | uniq
ACB
ASW
BEB
CDX
CEU
CHB
CHS
CLM
ESN
FIN
GBR
GIH
GWD
IBS
ITU
JPT
KHV
LWK
MSL
MXL
PEL
PJL
PUR
STU
TSI
YRI
pop
Gilfoyles-MacBook-Pro:freewas bertramgilfoyle$ cut -f 2 data/integrated_call_samples.20130502.ALL.panel | sort | uniq | wc -l
27
(Deduct one because pop is a header not a population.)
Ps. ps. where can I get the high-iq genomes to compare?
I know that GWAS are carried out on SNPs with a minor allele frequency >1%, so you should exclude all the SNPs whose minor allele frequency is <0.01. Most of these are de-novo mutations and are not captured by GWAS.
Apart from this important point, I think it's fine if you use minor allele frequencies.
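The cutoff could be applied with something like the sketch below. The post doesn't say whether the 1% applies per population or to the pooled sample; this assumes, purely for illustration, the unweighted mean across populations, which is only a stand-in for a properly pooled frequency. The function names are hypothetical; the first example row is one of the rows from the frequency file quoted earlier, the second is made up.

```python
def minor_allele_frequency(freq):
    """Fold an allele frequency so it refers to the rarer of the two alleles."""
    return min(freq, 1.0 - freq)

def passes_maf_filter(pop_freqs, threshold=0.01):
    """True if the minor allele frequency, averaged over populations, is >= threshold."""
    mean_freq = sum(pop_freqs) / len(pop_freqs)
    return minor_allele_frequency(mean_freq) >= threshold

# A row from the frequency file quoted in the thread (14 populations).
rare = [0.01562, 0.02459, 0, 0, 0, 0, 0, 0, 0.005051, 0, 0, 0, 0.004425, 0]
# A hypothetical common SNP, for contrast.
common = [0.31, 0.28, 0.40, 0.35, 0.22, 0.30, 0.45,
          0.38, 0.27, 0.33, 0.29, 0.41, 0.36, 0.25]

print(passes_maf_filter(rare), passes_maf_filter(common))
```

Folding with `min(freq, 1 - freq)` means the filter behaves the same whichever of the two alleles the frequency file happens to report.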
It's cool that 1KG now has frequencies for 26 populations. These are not accessible through their browser.
Now I wonder whether we should carry out a PCA on the 26 populations instead of the 14 I used.
Can you get the allele frequencies for the 26 populations for these SNPs (allele symbol in parentheses) and send me back the file in Excel (with the SNPs in 4 columns and the populations in 26 rows)? Assuming there is frequency data for all 26 populations, it'd be good to get frequencies for all of them before carrying out the GWAS.
I will then carry out a PCA and send the results back to you.
rs9320913 (A)
rs1584700 (G)
rs4851266 (T)
rs236330 (C)
I attach a template for the excel file.
Please first make sure that freq data is available for the 26 populations because I suspect it's available only for 14.
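For reference, the PCA step on such a table (one column of frequencies per SNP, one row per population) can be sketched as follows. This is not the analysis from the paper, just an illustration: it takes the first principal component of the correlation matrix by power iteration, and the function names and toy frequencies are made up. A statistics package would normally do this for you.

```python
import math

def standardize(column):
    """Return the column as z-scores (population standard deviation)."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    if sd == 0:
        raise ValueError("constant column cannot be standardized")
    return [(v - mean) / sd for v in column]

def first_principal_component(columns, iterations=500):
    """First PC of the correlation matrix via power iteration.

    columns: one list of allele frequencies per SNP, all of equal length
             (one entry per population).
    Returns (loadings, scores, explained_share).
    """
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    vec = [float(i + 1) for i in range(k)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the leading component")
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(corr[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    # One score per population: standardized frequencies weighted by the loadings.
    scores = [sum(vec[i] * z[i][p] for i in range(k)) for p in range(n)]
    return vec, scores, eigenvalue / k

# Toy table: 3 SNPs x 5 populations, perfectly correlated on purpose (hypothetical).
columns = [[0.10, 0.20, 0.30, 0.40, 0.50],
           [0.20, 0.40, 0.60, 0.80, 1.00],
           [0.15, 0.25, 0.35, 0.45, 0.55]]
loadings, scores, share = first_principal_component(columns)
print(loadings, scores, share)
```

The sign of a principal component is arbitrary, so with real data it is worth checking that the loadings of the four alleles point the expected way before using the scores.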