I was just reading the papers by Davide Piffer that use allele counting plus factor analysis, and the approach struck me as a simple method for disentangling the knottiest of nature/nurture questions: that of group differences.
It also struck me as a method that could and should be rerun as reference panels become bigger and more inclusive and as more SNPs are found for each phenotype (SNPs also already exist for many untested phenotypes). Therefore I think it would be worthwhile to automate the process and create software for doing the analysis. This post is really just a proposal for a software pipeline, and a call for help/guidance with the statistics (which I do not understand well).
What follows is my proposal for the software pipeline. I have written how I think the pipeline should proceed in the "Method" part below. If anything there is murky or wrong, please tell me. And if you have other comments/insights/thoughts, please post.
Main question: I'd really like to know what you think the output should be. What statistic(s) would best tell whether there has been differential selection?
Input:
- List of SNPs affecting the phenotype
- List of phenotype scores per population
- Thousand Genomes VCF file
Output:
- If one factor seems like a plausible candidate for a factor that "likely represents a nonrandom evolutionary force such as natural selection", output ***see main question above***.
- Otherwise, output that none of the factors seemed like plausible candidates for a selection factor.
- In both cases, also print the statistics obtained when testing whether a factor represents selection (how many of the SNPs loaded on the factor, whether the SNPs with the lowest p-values were the ones that loaded the highest, etc.)
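To make the two diagnostics in the last bullet concrete, here is a minimal sketch in plain Python (the same logic would port to Ruby or R). It assumes you already have one loading per SNP and one GWAS p-value per SNP; the function names `ranks`, `spearman` and `loading_diagnostics` and the toy numbers are made up for illustration and do not come from the Piffer papers. It also assumes the loadings have been oriented so that positive means "loads in the trait-increasing direction", since the sign of a principal component is arbitrary.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def loading_diagnostics(loadings, p_values):
    """Summarise how the SNPs relate to a candidate factor."""
    share_positive = sum(1 for v in loadings if v > 0) / len(loadings)
    return {
        # Fraction of SNPs loading in the expected direction.
        "share_positive": share_positive,
        # Negative rho: SNPs with lower p-values tend to load higher.
        "p_value_vs_loading_rho": spearman(p_values, loadings),
    }


# Made-up toy values, only to show the call.
toy_loadings = [0.9, 0.7, 0.4, -0.1]
toy_p_values = [1e-12, 1e-9, 1e-8, 1e-6]
diagnostics = loading_diagnostics(toy_loadings, toy_p_values)
print(diagnostics)
```

A rank correlation is used here because p-values span many orders of magnitude; whether this is the right statistic for the question is exactly the kind of thing I would like feedback on.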
Method:
1. Read the SNPs listed as affecting the trait (Input 1) from the 1000 Genomes VCF (Input 3)
2. Compute frequencies of SNPs in each population, creating a matrix of rows with SNPs and columns with populations; the data in the matrix are simply the allele frequencies.
3. Compute PCA of aforementioned matrix, with reduction to two dimensions.
4. Determine whether one of these two components is suitable as a selection factor
5. Use the component chosen as the selection factor to see whether the SNPs with the highest loadings on it are more frequent in the populations with the highest phenotype scores.
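The steps above can be sketched roughly as follows, in plain Python with no external libraries. This is a sketch under several assumptions, not the method from the papers: it treats every SNP as biallelic and counts allele `1` (ALT), whereas a real pipeline would probably need to count the trait-increasing allele; it extracts only the first principal component, using power iteration instead of a full PCA or factor analysis; and it takes "suitable as a selection factor" to mean "correlates with the phenotype", which is my guess. The function names, the `popA`-style labels and all numbers are invented for illustration.

```python
import math
from collections import defaultdict


def alt_allele_frequencies(vcf_line, sample_pops):
    """Return {population: ALT allele frequency} for one VCF data line.

    sample_pops gives the population of each sample column, in column order.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    alt_count = defaultdict(int)
    total = defaultdict(int)
    for pop, sample in zip(sample_pops, fields[9:]):
        genotype = sample.split(":")[0]  # GT is the first sub-field
        for allele in genotype.replace("/", "|").split("|"):
            if allele != ".":  # skip missing calls
                total[pop] += 1
                alt_count[pop] += allele == "1"
    return {pop: alt_count[pop] / total[pop] for pop in total}


def first_component(freqs):
    """freqs: rows = SNPs, columns = populations.

    Returns (loadings per SNP, scores per population) for the first
    principal component. The sign of the component is arbitrary.
    """
    n_snps, n_pops = len(freqs), len(freqs[0])
    centred = []
    for row in freqs:
        mean = sum(row) / n_pops
        centred.append([x - mean for x in row])
    # SNP-by-SNP covariance matrix across populations.
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (n_pops - 1)
            for rj in centred] for ri in centred]
    # Power iteration converges to the leading eigenvector of cov.
    vec = [float(i + 1) for i in range(n_snps)]
    for _ in range(500):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    scores = [sum(vec[i] * centred[i][j] for i in range(n_snps))
              for j in range(n_pops)]
    return vec, scores


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up toy matrix: 3 SNPs (rows) x 4 populations (columns).
toy_freqs = [
    [0.30, 0.40, 0.60, 0.70],
    [0.40, 0.45, 0.55, 0.60],
    [0.10, 0.30, 0.70, 0.90],
]
toy_phenotype = [90.0, 95.0, 105.0, 110.0]  # made-up scores per population

loadings, scores = first_component(toy_freqs)
r = pearson(scores, toy_phenotype)
print("SNP loadings:", [round(v, 3) for v in loadings])
print("population scores:", [round(s, 3) for s in scores])
print("r(score, phenotype):", round(r, 3))
```

In the real pipeline, `alt_allele_frequencies` would be called once per VCF line for the SNPs of interest to fill the matrix, and the PCA itself would more sensibly be handed to R than done by hand like this.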
I'm thinking of doing this in Ruby, but interfacing with R to do the statistical analyses. I know how to code and do bioinformatics, but would need some help to understand the statistics and logic used in the Piffer papers. Ideally, someone would just give feedback in this thread and I could ask questions once I hit a snag.
I'm thinking of starting with the above and adding more bells and whistles later, once I get the simple stuff running. Open-source software often grows by itself if you just get something decent up and running.