Back to [Archive] Other discussions

Proposal: Piffer method as software
Hi--

I was just reading the papers by Davide Piffer using allele counting + factor analysis and it seemed to me a simple method for disentangling the most knotty of nature/nurture questions, that of group differences.

Furthermore, it struck me as a method that could and should be used again and again as reference panels become bigger/more inclusive and more SNPs are found for each phenotype (furthermore, there does exist SNPs for many untested phenotypes). Therefore I think it would be worthwhile to automate the process and create software for doing the analysis. This post is really just a proposal for a software pipeline, and a call for help/guidance with the statistics (which I do not understand well).

What follows is my proposal for the software pipeline. I have written how I think the pipeline should proceed in the "methods" part below. If anything there is murky or wrong, please tell me. And if you have other comments/insights/thoughts please post.

Main question: I'd really like to know what you think the output should be? What statistic(s) would best tell if there had been differential selection?

Input:
  1. List of SNPs affecting the phenotype
  2. List of phenotype scores per population
  3. Thousand Genomes VCF file

Output:

  1. If one factor seems like a plausible candidate for a factor that "likely represents a nonrandom evolutionary force such as natural selection", output ***see main question above***.
  2. Otherwise, output that none of the factors seemed like plausible candidates for a selection factor.
  3. In both cases also print the statistical values gotten from the analysis of whether one factor represented evolution (how many of the snps loaded on the factor, were the SNPs with the lowest p-values the ones that loaded the highest etc.)


Method:


1. Read SNPs in list of SNPs affecting trait value (Input 1) from 1kg VCF (Input 3)
2. Compute frequencies of SNPs in each population, creating a matrix of rows with SNPs and columns with populations; the data in the matrix are simply the allele frequencies.
3. Compute PCA of aforementioned matrix, with reduction to two dimensions.
4. Find if one of these vectors are suitable as selection factor
5. Use the vector suitable as selection factor to see if those SNPs with the highest scores on this are more frequent in populations with the highest phenotype trait.


I'm thinking of doing this in Ruby, but interfacing with R to do the statistical analyses. I know how to code and do bioinformatics, but would need some help to understand the statistics and logic used in the Piffer papers. Ideally, someone would just give feedback in this thread and I could ask questions once I hit a snag.

I'm thinking of starting with the above, but adding more bells and whistles later, once I get the simple stuff above running. Open Source Software often grows by itself if you just get something decent up and running.
[hr]
Some errors.
Admin
I have also thought about it and it is necessary to automatic this process.

For instance, here's some data I want to try: http://www.nature.com/nature/journal/v511/n7510/full/nature13595.html "Biological insights from 108 schizophrenia-associated genetic loci". First we want to confirm that these really measure the same. This can be done with factor analysis (FA). If they don't load on a general factor (or perhaps a few factors), then I don't see how they could be true positives. We want to see if these fit with national rates of schizophrenia per capita (surely such data exist). It has also been claimed that schizoid personality is related to creativity and genius. So countries with more geniuses (say, use Nobel prizes per capita or something), should have higher rates.

It gets more complicated because there are too few populations to use so many SNPs on at once for FA. One will have to do repeated sampling like I did in my paper that's in review now.

It is also unwise to use PCA with small numbers of SNPs. It is better to use some other method of extraction, see also my paper in review.

Ruby I don't know. But I know Python and R. I guess I could take the Codeacademy course in Ruby.
Nice, that was one of the datasets I was thinking of too.

You can read and understand most of Ruby if you know Python (like Swedish and Danish). I know Python well, Ruby only a little, so we can use Python too (I just like to learn new technologies in my personal hack-time). But we can't do many of the statistical analyses in Python (I looked up oblimin rotation, e.g.), so we'd have to interface with R.

Okay, so it seems my proposal was too simplistic then. I'll still throw together something that is a beginning and we can riff off that.

A pipeline is necessary to be taken seriously, since a pipeline allows stuff to be reproduced just with running a command. When st is done manually, exact reproduction is often impossible.
Admin
Why not just use R then? I don't see why we need Python/Ruby at all.

Yes, I agree. (I didn't know that term. http://en.wikipedia.org/wiki/Pipeline_%28computing%29)

Basically, we want to do something like this:

For n in range(1000):
....pick 10 random SNPs
....divide those into two groups
....perform FA on each group
....save results (mean abs. loading, loadings for each SNP, size of the first factor, correlation between both factors extracted)
I prefer to use R for only the stuff I need it for because Python/Ruby code is much easier to read, maintain and extend. Programmers are rejoicing over Python getting more and more of Rs functionality (ggplot, pandas), and that is with good reason.

Furthermore, Python and Ruby are much better for writing actual software. Programs often need to do many other things than stats. In bioinformatics that other thing is parsing text files which are formatted ad-hoc and that is much easier with Ruby/Python. In addition Python/Ruby has libraries like that make producing software much easier.

Using both languages might seem like a hassle, but using R from Python is trivial. And requires nothing else from the user of the software than having R installed. Example code.

I'll begin on this project within the end of the week.

I don't know if your OBG accepts "Applications Note" (See "Types of manuscript" here), but if so, that is what I will be aiming for.

Comments from others still welcome.
Admin
I haven't used R from Python, but I agree Python is superior in ease of writing to R (and to PHP and all other languages I've used). From looking at your link, it does indeed seem trivial to use R from Python.

Your link to the PDF is broken (404). I don't see any reason why one could not publish an application as well.

Clearly you have more experience in working with this than I do.
Hi--

I was just reading the papers by Davide Piffer using allele counting + factor analysis and it seemed to me a simple method for disentangling the most knotty of nature/nurture questions, that of group differences.

Furthermore, it struck me as a method that could and should be used again and again as reference panels become bigger/more inclusive and more SNPs are found for each phenotype (furthermore, there does exist SNPs for many untested phenotypes). Therefore I think it would be worthwhile to automate the process and create software for doing the analysis. This post is really just a proposal for a software pipeline, and a call for help/guidance with the statistics (which I do not understand well).

What follows is my proposal for the software pipeline. I have written how I think the pipeline should proceed in the "methods" part below. If anything there is murky or wrong, please tell me. And if you have other comments/insights/thoughts please post.

Main question: I'd really like to know what you think the output should be? What statistic(s) would best tell if there had been differential selection?

Input:
  1. List of SNPs affecting the phenotype
  2. List of phenotype scores per population
  3. Thousand Genomes VCF file

Output:

  1. If one factor seems like a plausible candidate for a factor that "likely represents a nonrandom evolutionary force such as natural selection", output ***see main question above***.
  2. Otherwise, output that none of the factors seemed like plausible candidates for a selection factor.
  3. In both cases also print the statistical values gotten from the analysis of whether one factor represented evolution (how many of the snps loaded on the factor, were the SNPs with the lowest p-values the ones that loaded the highest etc.)


Method:


1. Read SNPs in list of SNPs affecting trait value (Input 1) from 1kg VCF (Input 3)
2. Compute frequencies of SNPs in each population, creating a matrix of rows with SNPs and columns with populations; the data in the matrix are simply the allele frequencies.
3. Compute PCA of aforementioned matrix, with reduction to two dimensions.
4. Find if one of these vectors are suitable as selection factor
5. Use the vector suitable as selection factor to see if those SNPs with the highest scores on this are more frequent in populations with the highest phenotype trait.


I'm thinking of doing this in Ruby, but interfacing with R to do the statistical analyses. I know how to code and do bioinformatics, but would need some help to understand the statistics and logic used in the Piffer papers. Ideally, someone would just give feedback in this thread and I could ask questions once I hit a snag.

I'm thinking of starting with the above, but adding more bells and whistles later, once I get the simple stuff above running. Open Source Software often grows by itself if you just get something decent up and running.
[hr]
Some errors.


This is a brilliant idea. As creator of the method, count me in for theoretical, statistical and methodological support. I am afraid I will not be able to help you with R though as I do not use R (yet). Make sure you use both ANOVA (when you compare frequencies from a high number of alleles across a few racial groups) and PCA/FA when you have a sample of many populations. Davide Piffer
Updating this post with info for anyone that wants to tag along:

Will be using Python with pandas for 95% of the application and outsourcing the factor analysis to R using pyper. Currently using bitbucket for hosting. I made a repo, but haven't added anything to it. You will need OS X, Linux or cygwin if on Windows. Will be unit testing the tricky parts so that we can have confidence it works.

Research so far:

Instead of forcing the users to download humongous vcfs, we can simply run tabix, like so:

tabix -f -h ftp://ftp.1000genomes.ebi.ac.uk/blabla.vcf.gz 12:101266000-101422000 to get particular SNPs.

1kg haven't made the average population allele frequencies available, but we can extract them ourselves using vcftools.

The first order of business will simply be creating the part that reads the input SNP list file, gets the required SNPs from 1kg and computes the allele frequencies per population.

Questions:

I want to make it possible for anyone to individually check each step of the calculations with whatever statistical software it is you guys use (SPSS)? If I output every intermediate matrix as a CSV will that be possible?

Ps. starting with Emils method in http://www.openpsych.net/forum/showthread.php?tid=77 since I have some code to help me there. But will take a while before I get that far.
[hr]
----
Edit (appended): Parenthetically, its strange that you have so many more "social science" types than programmers, since the latter seems so much more unorthodox on average. Paul Graham noted the same tendency in his essay "great hackers":

One difference I've noticed between great hackers and smart people in general is that hackers are more politically incorrect.


Anyways, my point is that it might be worthwhile to try to find programmers willing to work on interesting projects you have. Perhaps you could try to request help on your blogs or whatever. My bet is at least 10% of those who read them are very good with computers. And 'pro bono' is very common among programmers; its called open source.
Admin
I also think this is an excellent idea, but I don't have the experience with R or Python to be much help with that aspect of it.

Over the past few years, there have been several different genome-wide association studies that identified alleles association with intelligence. I think the next step should be to apply Piffer's method to as many of these studies as possible, to show that the alleles identified by all of them are grouped similarly to the ones from the Rietvald study that Piffer used.


We already did this for all the SNPs we could find. We continue to look for GWAS papers that report any. So far Piffer has been using mainly those with p values <1e-8 because of the high risk of false positives above that. However I think it should be possible to use factor analyses to see if a candidate SNP fits with the others. The theory is that selection on SNPs for g has been pretty uniform, so that they all are positively correlated, which means that a FA will pick up a general factor. In the case of SNPs that boost g but have some detrimental effect, then uniform selection cannot be assumed and it is not certain that it will load on a general factor.

One would need to be systematic about this, so I'm very happy that someone with more programming skills than me as come along. If only for the reason that we can speed things up. Two programmers work faster than one. :)
Admin
Most people use SPSS I think. I don't think I will go back after I have learned R. It's much better like this. I cross checked my initial results from R using SPSS for my recent paper.

I have been collecting programmers in my network to work projects with me for years, actually. :) For instance, I built a logic tree prover with Laird Shaw. http://creativeandcritical.net/prooftools/

My favorite Paul Graham essay is: http://www.paulgraham.com/say.html
I also performed my analysis on 46 height increasing alleles.The results are published here: http://openpsych.net/OBG/wp-content/uploads/2014/05/Piffer2014oppositeselectionFinal.pdf
In table 2 here, how do I interpret the last two worksheets (brain_eQTL, blood_eQTL)?

I'm currently using the following format for the SNP input list:

SNP_name chromosome position allele_increasing_pheno allele_decreasing_pheno p_value

Hence, example output from worksheet one ("coding variants") of that .xlsx file is:

rs56873913 19 50091199 T G 2.188E-007
rs6670165 1 177280121 T C 1.156E-007
chr5_140143664_I 5 140143664 I 1 3.596E-007


How do I interpret the eQTL worksheets to create the same table for them? The problem is filling in the allele_increasing_pheno, allele_decreasing_pheno values.

I guess I could just use the most frequent allele for good phenotypes and least frequent allele for bad phenotypes, but this feels hackish to me.
I guess I could just use the most frequent allele for good phenotypes and least frequent allele for bad phenotypes, but this feels hackish to me.


No, absolutely do not do that. Many studies do not report the reference allele, but report effect for minor allele which may or may not be positive. You've got to figure out which allele is associated with the trait, it's usually not straightforward but it can be done after some detective work. I will have a look at the paper and try to figure it out!
I've attached the file adding the risk allele in column N. I illustrate how to proceed for future reference. Ok so here I selected 2 SNPs: rs56873913 and rs6670165 from suppl.table4 (see attached file) as exampes. Frequency of allele A1 is reported for cases and for controls. In both, the risk increasing allele is A1 because its frequency is higher in cases (0.775 and 0.196) than in controls (0.766 and 0.184). Allele A1 is T in both, as can be seen from column H (A1A2), where we've got TG and TC( Allele A1 is the allele reported first and allele 2 is the second, so it's G and C respectively)

FRQ_A1 (cases) FRQ_A1 (con)
0,775 0,766
0,196 0,184


For SNP rs9607782 (row 6), allele A2 is risk increasing because A1's frequency is higher in controls and lower in cases. So the risk allele is the T allele because A2 is T (column H: AT).
I had already written code to extract those :)

The result is here.

It seems that there are a few mistakes in your table? It is always preferable to automate the parsing...

Example: rs9607782 A T 0,25 0,232

Here A should be the trait increasing allele since the frequency is .25.

What I was wondering about was the SNPs you see in the attachment. How do you get the phenotype increasing alleles for those?
I had already written code to extract those :)



It seems that there are a few mistakes in your table? It is always preferable to automate the parsing...
Example: rs9607782 A T 0,25 0,232

Here A should be the trait increasing allele since the frequency is .25.

What I was wondering about was the SNPs you see in the attachment. How do you get the phenotype increasing alleles for those?


Yes, my mistake, I did it really quickly and manually.However, it is rs20551 not rs9607782.

I'll have a better look at the paper and see if I can answer your question.
I am afraid it's impossible to infer the risk allele from the table. In this case it's best to contact the authors, but no guarantee they'll reply. In the meantime you could run the analysis with the coding variants from worksheet 1.
Repo here: https://bitbucket.org/hbdtools/

Btw, I'm making a toolbox of command line scripts to perform various tasks. This is according to the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.


Will finish up the first script soon (all it does is take in a list of SNPs like shown earlier in this thread and output a .csv file of trait increasing allele frequencies with SNPs as rows and populations as columns.)
[hr][hr][hr]
(My new post was appended as an edit:) This will be slow but steady, btw. I'm lucky enough to get paid to hack biology, so I'm giving a 120% there and spend approx. 30 minutes on this each day after work. But since I can do most of this stuff in my sleep we'll make steady progress.

And does anyone understand stats well enough to do a more quantitative analysis of the Piffer method, st like this: Conditions for the validity of SNP-based heritability estimation? (I know the method discussed is very different).

Ps. I do not really believe the current Piffer-method results are terribly informative, as we only have approx. one permille of the SNPs affecting complex traits. (I think SNP counting is the way to go to get a great many nature/nurture questions settled eventually though, which is why I'm bothering to do this.)
I'm removing the public repo and have created a private one. People interested in testing or other programmers can apply for access. But that will be later.

Btw, I'm making a toolbox of command line scripts to perform various tasks. This is according to the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.


Will finish up the first script soon (all it does is take in a list of SNPs like shown earlier in this thread and output a .csv file of trait increasing allele frequencies with SNPs as rows and populations as columns.)
[hr][hr][hr]
(My new post was appended as an edit:) This will be slow but steady, btw. I'm lucky enough to get paid to hack biology, so I'm giving a 120% there and spend approx. 30 minutes on this each day after work. But since I can do most of this stuff in my sleep we'll make steady progress.

And does anyone understand stats well enough to do a more quantitative analysis of the Piffer method, st like this: Conditions for the validity of SNP-based heritability estimation? (I know the method discussed is very different).

Ps. I do not really believe the current Piffer-method results are terribly informative, as we only have approx. one permille of the SNPs affecting complex traits. (I think SNP counting is the way to go to get a great many nature/nurture questions settled eventually though, which is why I'm bothering to do this.)


Do not forget to control for linkage disequilibrium. Ideally SNPs should have zero linkage (be located on different chromosomes) but if you want to use all SNPs (even those located on same chromosome), you'll have to calculate LD and partial out its effects. An LD calculator can be found here: https://www.broadinstitute.org/mpg/snap/ldsearchpw.php
Not controlling for LD will inflate correlations between SNPs.
Thanks. I'll probably use Plink for this.

It has an LD option which outputs both r2 and D'. Which one should I use?

And how do I partial out the effects of LD?

E: Seems like if I avoid those with <500kb between each other I'll be safe. But I'm sure there are better ways. Perhaps I should avoid those that are above a certain r2, but use the rest?