
Proposal: Piffer method as software
Thanks. I'll probably use PLINK for this.

It has an LD option which outputs both r2 and D'. Which one should I use?

And how do I partial out the effects of LD?

Edit: Seems like I'll be safe if I avoid SNPs that are within 500 kb of each other. But I'm sure there are better ways. Perhaps I should exclude those above a certain r2 and use the rest?


Yes, I adopted that approach: I excluded SNPs above a certain r2. I think as long as you stay below r2 = 0.2 you'll be safe. A distance of 500 kb is also sensible. If you included SNPs with stronger LD you'd increase statistical power only a little, at the expense of much bigger computational demands (controlling for all the inter-correlations is not gonna be easy). Piffer's method essentially provides you with an estimate of a meta-LD, which transcends physical distance on the chromosome (in contrast to standard LD). I call it Piffer's LD, meta-LD, or genome-wide LD.
I guess you could control for traditional LD by multiplying Piffer's LD by (1 - r2). With r2 = 0.9, the Piffer correlation net of LD effects would be p = (1 - 0.9) * P, where P is the Piffer correlation that includes LD effects.
If P = 0.3 and LD is r2 = 0.9, the resulting net Piffer correlation is p = 0.3 * 0.1 = 0.03.
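In code, this tentative correction is a one-liner. A minimal sketch (the function name is mine, and the correction itself is, as I said, unverified):

def net_piffer_correlation(P, r2):
    """Tentative correction: shrink the Piffer (meta-LD) correlation P
    by the share of variance already explained by local LD (r2)."""
    return (1 - r2) * P

print(net_piffer_correlation(0.3, 0.9))  # the worked example above, ~0.03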

Nobody has ever attempted this (because Piffer's method is novel) and this is just my tentative approach (I am not even sure it's correct). Personally, at the first stage of your research I'd not bother and would simply adopt random exclusion of SNPs with a distance < 500 kb.
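A minimal sketch of that exclusion step, assuming a pandas DataFrame with 'chrom' and 'pos' columns (my naming, not a fixed format):

import pandas as pd

def prune_by_distance(snps, min_dist=500000):
    """Greedy scan in position order: keep a SNP only if it is at least
    min_dist bp from the last kept SNP on the same chromosome. Which member
    of a close pair is dropped is arbitrary here; one could shuffle within
    clusters to make the exclusion random."""
    kept, last_pos = [], {}
    for _, row in snps.sort_values(['chrom', 'pos']).iterrows():
        if row['pos'] - last_pos.get(row['chrom'], -min_dist) >= min_dist:
            kept.append(row)
            last_pos[row['chrom']] = row['pos']
    return pd.DataFrame(kept)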
https://bitbucket.org/hbdtools/popmatrix

Pipeline that takes in a snpfile and outputs population-specific frequencies. You need vcftools and tabix on your path. Remaining: make the csv matrix, but that will have to wait until the weekend.

Edit: after this I only need to use numpy/R so that will hopefully go quicker.
[hr]
I hate bumping this thread with every update. Should computer stuff have its own forum?
The Allele Frequency Database has data for a lot more populations than 1000 Genomes, but a much smaller number of SNPs. It may be worth a try, however: http://alfred.med.yale.edu/
Have you gotten anyone to criticize your method, by the way? I do not have any knowledge in this domain (I'm just a programmer), but I'm going to assume the method has several major methodological and/or data problems (always a good null hypothesis in bioinformatics). Furthermore, I know that when analysing just one population you usually have to include a boatload of checks to see that the data are good and appropriate for the method. When you do cross-population analyses, I'd imagine the number of such checks grows combinatorially.

Rather than just making a method that outputs statistics (which is my main goal in the end), I'd love to include as many method/sanity checks as possible. It would be good to find somebody critical of the method to list all the possible flaws they can think of.

Here is my suggestion for one, but I do not know anything about genetics so would appreciate feedback:

Begin with the fact that current tinkertoy models of human cognition imply roughly 10k causal variants for IQ, and that it takes about 30 SNPs to raise IQ by one standard deviation. Then if there is a large population difference in the frequency of the known IQ-boosting alleles, those SNPs cannot be representative of the other SNPs affecting the trait. Let's say you find a 10% difference in the frequency of a known IQ SNP between populations. One SNP is worth half an IQ point, so a ten percent difference equals 0.05 IQ points between the populations. Extrapolating the 10% difference to the other SNPs would then imply a 9999 * 0.05 ≈ 500 IQ point difference. Proof by contradiction: the found SNPs cannot be representative of the other IQ-boosting alleles. Which would make the results very questionable in my view.
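In code, the check is just arithmetic; a small sketch using the numbers above (all the constants are the stated model assumptions, not estimates):

N_CAUSAL = 10000        # assumed number of causal variants for IQ
SD_PER_SNP = 15.0 / 30  # 30 SNPs per SD of 15 IQ points -> 0.5 points per SNP
FREQ_DIFF = 0.10        # observed frequency difference for one known SNP

per_snp_gap = FREQ_DIFF * SD_PER_SNP        # 0.05 IQ points per SNP
implied_gap = (N_CAUSAL - 1) * per_snp_gap  # ~500 IQ points if representative
print(implied_gap)  # an absurd value -> the found SNPs are not representative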

This might be wrong; if so let me know how and why. And if you can think of other checks, please tell me.
[hr]
Oh, and I seem to remember S. Hsu posting a paper that he thought strongly indicated that the height differences between Northern and Southern Europeans are genetic. Does anyone have a link or a PDF? I guess it would be a good paper to read for domain background.
Admin
Normally, we repeat the analyses on height as well. Height and g are genetically similar (both highly polygenic), but height is not very controversial. So if one can show that the method works for height, this gives some confidence in the method.
Admin
But SNPs from different studies have high intercorrelations even though they were found by different authors for different phenotypes (educational attainment vs. more cognitive measures). This makes the drift scenario pretty unlikely. As the number of SNPs keeps increasing, the drift scenario becomes less and less likely.

Speaking against drift is also the fact that the brain is extremely expensive, accounting for some 20% of total metabolism. This means there is natural economic selection against genes that make the brain use more energy.

Perhaps we can get James Lee to make a user here and post his criticism. The eyes of another geneticist are certainly welcome.


I think most critics are just out to have fun and spit BS on people's theories without even knowing what they're talking about. You often read silly criticisms that make you want to jump off a cliff and wish you'd never published your paper; you probably would not have published it had you known beforehand what kind of silly people were gonna read it.
I see that you're playing devil's advocate here and pretending to be one of those silly critics, even though you're actually a lot smarter than them. So I will bother replying.
My method does not look at only one SNP, so you cannot extrapolate from a frequency difference for one SNP to all other SNPs; I never claimed that you could. Only after you get a good sample of SNPs, and a t-test or ANOVA finds a statistically significant difference between two or more populations, can you use that sample of SNPs to extrapolate to frequency differences in the whole set.
My IBC paper says: "In the second ANOVA with 10 educational attainment alleles and the addition of 2 IQ-increasing alleles, the frequencies differed significantly across the three races, F(2, 33) = 4.127, P = 0.025. Tukey post-hoc comparisons of the three groups indicated that the East Asian group (M = 47.41, 95% CI [25.63, 69.20]) had significantly higher allele frequencies than the African group (M = 18.50, 95% CI [7.70, 29.29]), P = 0.024."


So yes, you could infer from this that the average frequency difference between East Asians and Africans is 47.41 - 18.50 = 28.91 percentage points.
I guess one could "reverse engineer" this difference and extrapolate it to other SNPs, but that can be done only if we know the average effect of these SNPs on IQ and how many SNPs are involved in cognition. Also keep in mind that the positive hits very likely have a much bigger effect on IQ, so the SNPs still to be discovered are likely to have a smaller effect. The found SNPs cannot be representative of all the others for the simple technical reason that the first SNPs to be discovered have bigger effects, since they show up with lower p values in GWAS. So your simplistic equation of 9999 * 0.05 has to factor this in: there is a conditional probability involved, such that SNPs yet to be discovered have a smaller effect than your 0.05. Replace your 9999 * 0.05 with 9999 * 0.01, or 0.005, or much less.

The paper you asked about is this: Turchin, M. C., Chiang, C. W., Palmer, C. D., Sankararaman, S., Reich, D., Genetic Investigation of ANthropometric Traits (GIANT) Consortium, & Hirschhorn, J. N. (2012). Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nature Genetics, 44, 1015-1019. I had even cited it in my MQ paper.
Another weakness in the above criticism: not only will the SNPs yet to be discovered have a smaller effect on IQ than the ones already identified; they will also show smaller frequency differences between populations, because alleles with stronger effects will have undergone stronger selection, which increases the differences between populations (see the method of correlated vectors, whereby I found that alleles with lower p values had larger frequency differences between populations). So it's also wrong to extrapolate from a frequency difference obtained with the current set of available SNPs to frequency differences in general; those will be lower than the 28.91% difference I found.
Admin
I think they were from the newest of those books:

Lynn, R., & Vanhanen, T. (2012). Intelligence: A unifying construct for the social sciences. Ulster Institute for Social Research.

The problem is that no one has yet created another IQ dataset free of the many errors in the LV2012 dataset (and its earlier versions). Malloy has been working through specific countries slowly, but has not yet produced a list for all countries.

However, there is another large dataset with a proxy for g (or G), namely the Altinok educational achievement dataset, which John Fuerst and I have been using. It changes very little, however, since the correlation (LV12 NIQs x Altinok) is about .91. In the previous analyses I have done, including those in my paper currently in review (International General Socioeconomic Factor), I used both, and the results are very similar.

Nevertheless, I think it is wise to include both g proxies, at least just to avoid giving ill-informed critics an easy card to play in the rhetoric game.
Your findings will most likely go against the prior beliefs of most editors, reviewers, and readers. As a tactical matter, I'd imagine it would be better to be humble: list everything that might be wrong with the method and present it as an interesting finding that needs an explanation. I imagine this would make more mainstream journals likelier to accept it and would give the paper a wider audience.

And I finished the script that converts the SNP list into a csv: https://bitbucket.org/hbdtools/pop_snp_csv/

It works for any SNP dataset (e.g. from a GWAS study). As of yet it only looks up the SNPs in 1kg.

Now I can actually try to implement Emil's pseudocode by the end of this week.


Most editors and reviewers are morons or dishonest. I've just rerun PCA and principal axis factoring using only the top 3 SNPs from Rietveld, given that Ward et al. (2014, PLOS ONE) have just replicated their effects in a sample of school children. In my MQ paper I did the analysis with all 10 of them, but it seems better to use only the 3 most reliable SNPs. The results from the 50 ALFRED populations look very good, with better within-continent resolution.

Columns E and F report the results with PCA and PAF (only a single PC/factor with an eigenvalue > 1 was identified, accounting for about 46% of the variance).

https://docs.google.com/spreadsheets/d/1GEO3RRnEoOPuWS3ia5hF4gvx2PuZy3aDVybAJaAbo2U/edit?usp=sharing

Now, can someone help me find good IQ estimates for these 50 populations? I want to find out the correlation. I will fill out my own column, called Piffer, but other users can create their own columns. I will send you the editing link via email if you are interested.

Edit 1: I've also asked Lynn to help me find the IQs for these populations.
Playing around with R/Python: https://bitbucket.org/hbdtools/emil_method/src/8e87edea734d503c82ed37b913e383068c25b97e/test.py?at=master

Now I need you to describe what factor analysis you want me to use, Emil. In my script I just use PCA:


import pandas as pd
import numpy as np
import pyper as pr

INPUT_MATRIX = 'external_files/matrix_file.csv'

# The SNP-name column must be listed in usecols, otherwise index_col=0
# would silently turn the AFR column into the row index.
pop_matrix = pd.read_csv(INPUT_MATRIX, index_col=0,
                         usecols=['SNP_NAME', 'AFR', 'ASN', 'EUR'])

NB_ITERATIONS = 10
NB_SNPS_CHOSEN = 6  # each group needs at least as many SNPs as populations for princomp
half_nb_snps_chosen = NB_SNPS_CHOSEN // 2  # integer division

r = pr.R(use_pandas=True)

for i in range(NB_ITERATIONS):
    # Shuffle the SNP rows, then take the first and last half_nb_snps_chosen
    # rows as two disjoint random groups.
    random_snps = np.random.permutation(pop_matrix)
    snp_group_1 = random_snps[:half_nb_snps_chosen]
    snp_group_2 = random_snps[-half_nb_snps_chosen:]

    r.assign("s1", snp_group_1)
    r.assign("s2", snp_group_2)
    print(r['s1'], r['s2'])

    r('result1 = princomp(s1)')
    r('result2 = princomp(s2)')
    print(r['result1'], r['result2'])


To get my code plus the example files, just run git clone https://dineshchugtai@bitbucket.org/hbdtools/emil_method.git in the terminal.

You need to conda install PypeR, pandas, and numpy to use this script.


I'd stick to PCA. Factor analysis is recommended when there is measurement error and we want to find the "latent factor" without error variance, but error is negligible in the case of gene frequencies.
Oki, then I'll start on this part tomorrow:
...save results (mean abs. loading, loadings for each SNP, size of the first factor, correlation between both factors extracted)


If there are any details there I should know, please tell me.


We had agreed to delete the redundant SNPs (those located within 500 kb of each other). However, I think it's much better to use the average frequency of the SNPs located within 500 kb, so as to get a better estimate of the frequency at that genetic locus.
Remember that most SNPs found by GWAS are not directly responsible for the effects; they are correlated with neighbouring SNPs which have the direct effect. Averaging them would thus give a more reliable estimate and further reduce measurement error due to sampling.
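A rough sketch of that averaging step (hypothetical column names, and fixed 500 kb windows as a cruder stand-in for true pairwise distances):

import pandas as pd

def average_local_snps(freqs, window=500000):
    """Average the population frequencies of SNPs falling in the same
    500 kb window per chromosome, as a proxy for the frequency at that
    locus. Assumes columns 'chrom' and 'pos' plus one frequency column
    per population."""
    freqs = freqs.copy()
    freqs['bin'] = freqs['pos'] // window
    pop_cols = [c for c in freqs.columns if c not in ('chrom', 'pos', 'bin')]
    return freqs.groupby(['chrom', 'bin'])[pop_cols].mean()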
Admin


What I meant was that instead of looping over the extraction and saving only the loadings or correlations in each sampling, it is better just to save all the data in one big matrix. This also means one can use numpy-style operations to make it go much faster. I think; I haven't done it yet.

I will perhaps try to do it later today, but I'm not very competent at data manipulation with R or python yet.
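Roughly, something like this numpy sketch of the storage pattern (the PCA here is a bare SVD stand-in on random data, and all names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_iter, n_snps, n_pops = 1000, 6, 3

# One big preallocated array: first-PC loadings per iteration, per group,
# instead of appending to Python lists inside the loop.
loadings = np.empty((n_iter, 2, n_pops))

for i in range(n_iter):
    sample = rng.random((n_snps, n_pops))  # stand-in for sampled SNP frequencies
    for g, grp in enumerate((sample[:n_snps // 2], sample[n_snps // 2:])):
        centered = grp - grp.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        loadings[i, g] = vt[0]  # first principal component direction

# Summaries are then single vectorized operations over the whole array:
mean_abs_loading = np.abs(loadings).mean(axis=0)  # shape (2, n_pops)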
I have the SNPs in a matrix like so:

SNP_NAME,AFR,AMR,ASN,EUR,SAN
3_63833050_C,0.676248,0.370317,0.531746,0.40159,0.370143
5_140143664_C,0.310893,0.564841,0.526786,0.532803,0.519427
15_84706461_C,0.988654,0.821326,0.946429,0.72664,0.817996
15_84699214_G,1,1,0.998016,1,1
4_103146890_A,0.996974,0.951009,1,0.907555,0.993865
15_84689272_A,1,1,0.998016,1,1
22_41587556_T,0.847958,0.502882,0.944444,0.734592,0.672802
17_17958402_G,0.529501,0.533141,0.904762,0.37674,0.778119
19_50091199_T,0.481846,0.708934,0.806548,0.782306,0.552147
1_177280121_C,0.853253,0.806916,0.809524,0.799205,0.773006
3_52845105_T,0.295008,0.399135,0.569444,0.5,0.605317
22_42603814_C,0.343419,0.51585,0.338294,0.574553,0.49182


This is fed into R's princomp function. Correct so far?

Then I get the loadings for the two groups:

This outputs something like (for each of the two groups, for each iteration):

Loadings:
Comp.1 Comp.2
[1,] 0.608 -0.794
[2,] 0.794 0.608

Comp.1 Comp.2
SS loadings 1.0 1.0
Proportion Var 0.5 0.5
Cumulative Var 0.5 1.0


I think that takes care of the size of the factors.

Is finding the correlation between the factors as simple as running cor(factor1_vector, factor2_vector)?

And how do you find the loading for each SNP?
[hr]
P.S. You can get automatic code-analysis tools like pep8 and pylint to give suggestions for improving your Python code:

conda install pep8, same for pylint.

Then you just run pep8 or pylint with the name of your Python file and get plenty of suggestions for improvement. Best to start this on a new file, though, since the analysers are fussy and you do not want too much output at once.
Before I begin for real, I'd just like to see that we are all on the same page (no point in writing this code if you expect something very different from what is possible or feasible to do). If you don't think this project seems worthwhile after you have read my description, or if something seems off, please tell me.

I have read "Simple statistical tools to detect signals of recent polygenic selection" and will try to implement that pipeline. The paper can be found here: http://www.academia.edu/6127615/Simple_statistical_tools_to_detect_signals_of_recent_polygenic_selection

What I have been doing is checking that all of the steps described there are possible to do with R/Python, and that I am able to make them work.

I only intend to write the pipeline for 1000 Genomes to begin with, because the ALFRED database is harder to use programmatically. If you think the program seems worthwhile for 1kg, I'll add what is needed to make it work with ALFRED.

I will also skip the check for whether alleles with larger frequency differences between populations have greater p-values. That can be added later, if desired.

What I propose to do is written in the outline below.

PROPOSAL:

This will be a command line tool, i.e. you write a command like piffer --snpfile your_snpfile.txt --genotype_file genotypes_scores.txt --output_file the_file_where_you_want_to_store_your_results.txt and the program then runs without any more user input. I intend to print the results to the screen or to a file, depending on the user's wishes.

The planned structure of the output is shown below. What I intend to do to achieve the desired output is written as steps.

1 ====[ ANOVA ]============================================

1.1) Run ANOVA on continental groups, add results to output file
1.2) Run ANOVA on populations, add results to output file

2.1 ====[ PCA ]============================================

2.1.1) Sort the list of SNPs by p-value, divide the SNPs into groups of x, where x is set by the user.
2.1.2) Run PCA and reduce to two factors, with the SNP groups as the variables.

2.2 ====[ PCA ]====[ CHECK WHETHER PCA APPROPRIATE ]=======

2.2.1) Check KMO and Bartlett's test to see that the data are suitable for PCA.

2.3 ====[ PCA ]====[ CHECK THAT THE RESULT IS VALID/INTERPRETABLE ]===
2.3.1) Check whether there is one interpretable factor, i.e. a factor with a good correlation between PC loadings and p-values.
2.3.2) Check that the components are not correlated (i.e. cannot be considered an overarching factor).

2.4 ====[ PCA ]=====[ CHECK RESULT ]============================================

2.4.1) Then test whether lower p-values predict higher polygenic scores, using a Spearman rank correlation (see the sketch after this outline).

============================================
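As referenced in 2.4.1, a minimal sketch of that check (assumes scipy is available; the argument names are illustrative):

from scipy.stats import spearmanr

def check_pvalue_score_relation(pvalues, polygenic_scores):
    """Check 2.4.1: if lower GWAS p-values go with higher polygenic
    scores, the Spearman rank correlation should be clearly negative."""
    rho, p = spearmanr(pvalues, polygenic_scores)
    return rho, p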

APPENDIX: WHY ALFRED IS HARDER TO WORK WITH:

The main reason ALFRED is harder to work with is that many of the phenotype SNPs are not in it. We therefore have to find substitute SNPs with the SNAP tool, which is not hard, but I see no way of finding out which allele (trait-increasing or trait-decreasing) of the proxy SNP corresponds to which allele of the original SNP.

QUESTION:
Are the height SNPs available in a structured file somewhere, or do I have to look up the supplementary material for each study and convert the data to a common format?


Your plan is fine. One note, though: it's important to use only SNPs with genome-wide significant p values (p < 5 x 10^-8, or even p < 10^-8). Otherwise you'll just introduce more noise than signal and your output will not be interpretable. In the case of height there are several loci that meet this condition, but not so for intelligence, where only 4 loci meet this requirement (please see my latest paper in the submission thread, titled "Estimating genotypic IQ.."). For height you can try the attached 46-SNP dataset I used for my paper. For other phenotypes, there is an easily accessible database here (I do not know how frequently it's updated, though): http://www.genome.gov/gwastudies/
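In code, that filter is a one-liner; a minimal sketch assuming a pandas DataFrame of GWAS hits with a 'pval' column (the column name is mine):

import pandas as pd

def genome_wide_hits(snps, threshold=5e-8):
    """Keep only SNPs below the genome-wide significance threshold."""
    return snps[snps['pval'] < threshold]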

Regarding your concern about ALFRED, it's really easy to see which allele of the proxy SNP corresponds to the reference allele: it's always the one with a very similar frequency distribution across populations. As a rule of thumb, if two alleles are in LD at r > 0.8, their cross-population frequencies will be correlated at about that level, hence highly similar, because linkage is usually well preserved across human groups, apart from rare cases where LD breaks down.
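A minimal sketch of that rule of thumb (function and example values are mine): correlate one proxy allele's cross-population frequencies with the reference allele's; a strongly positive r means it is the in-phase allele, a strongly negative r means it is the complementary one.

import numpy as np

def proxy_allele_correlation(ref_freqs, proxy_freqs):
    """Correlate the frequencies of one proxy allele with the reference
    allele's frequencies across the same populations. For LD of r > 0.8,
    the in-phase allele should correlate at roughly that level."""
    return np.corrcoef(ref_freqs, proxy_freqs)[0, 1]

# Hypothetical frequencies in, say, AFR / ASN / EUR:
print(proxy_allele_correlation([0.68, 0.37, 0.53], [0.65, 0.40, 0.50]))  # ~ +1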
Admin
http://database.oxfordjournals.org/content/2013/bat063.full

IQdb: an intelligence quotient score-associated gene resource for human intelligence
Abstract

Intelligence quotient (IQ) is the most widely used phenotype to characterize human cognitive abilities. Recent advances in studies on human intelligence have identified many new susceptibility genes. However, the genetic mechanisms involved in IQ score and the relationship between IQ score and the risk of mental disorders have won little attention. To address the genetic complexity of IQ score, we have developed IQdb (http://IQdb.cbi.pku.edu.cn), a publicly available database for exploring IQ-associated human genes. In total, we collected 158 experimental verified genes from literature as a core dataset in IQdb. In addition, 46 genomic regions related to IQ score have been curated from literature. Based on the core dataset and 46 confirmed linked genomic regions, more than 6932 potential IQ-related genes are expanded using data of protein–protein interactions. A systematic gene ranking approach was applied to all the collected and expanded genes to represent the relative importance of all the 7090 genes in IQdb. Our further systematic pathway analysis reveals that IQ-associated genes are significantly enriched in multiple signal events, especially related to cognitive systems. Of the 158 genes in the core dataset, 81 are involved in various psychotic and mental disorders. This comprehensive gene resource illustrates the importance of IQdb to our understanding on human intelligence, and highlights the utility of IQdb for elucidating the functions of IQ-associated genes and the cross-talk mechanisms among cognition-related pathways in some mental disorders for community.