Before I begin for real, I'd like to make sure that we are all on the same page (there is no point in writing this
code if you expect something very different from what is possible or feasible). If this project does not seem
worthwhile after you have read my description, or if something seems off, please tell me.
I have read "Simple statistical tools to detect signals of recent polygenic selection" and will try to
implement that pipeline. The paper can be found here:
http://www.academia.edu/6127615/Simple_statistical_tools_to_detect_signals_of_recent_polygenic_selection
So far I have been checking that all of the steps described there can be done with R/Python,
and that I am able to make them work.
I only intend to write the pipeline for 1000 Genomes (1kg) to begin with, because the ALFRED database is harder
to use programmatically. If you think the program seems worthwhile for 1kg, I'll add support for ALFRED.
I will also skip the check for whether alleles with larger frequency differences between populations have
greater p-values. That will be possible to add later, if desired.
What I propose to do is written in the outline below.
PROPOSAL:
This will be a command line tool, i.e. you write a command like
  piffer --snpfile your_snpfile.txt --genotype_file genotypes_scores.txt --output_file the_file_where_you_want_to_store_your_results.txt
and then the program runs without any further user input. The results are printed to the screen or written to a
file, depending on the user's wishes.
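A minimal sketch of how the command line interface could be set up in Python with argparse. The three flag names come from the example command above; the help texts, the function names build_parser and write_results, and the rule that a missing --output_file means "print to screen" are my assumptions, not a fixed design.

```python
import argparse


def build_parser():
    """Build the argument parser for the (hypothetical) piffer command."""
    parser = argparse.ArgumentParser(
        prog="piffer",
        description="Pipeline for detecting signals of recent polygenic selection.",
    )
    parser.add_argument("--snpfile", required=True,
                        help="file listing the trait SNPs")
    parser.add_argument("--genotype_file", required=True,
                        help="file with genotype data or scores")
    # Assumption: leaving out --output_file means results go to the screen.
    parser.add_argument("--output_file", default=None,
                        help="where to store the results (default: print to screen)")
    return parser


def write_results(text, output_file=None):
    """Print the results, or write them to output_file if one was given."""
    if output_file is None:
        print(text)
    else:
        with open(output_file, "w") as handle:
            handle.write(text + "\n")
```

The entry point would then call build_parser().parse_args(), run the steps, and pass the formatted text to write_results.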
The planned structure of the output is shown below. Under each heading I list the steps I intend to take to
produce that part of the output.
1 ====[ ANOVA ]============================================
1.1) Run ANOVA on continental groups, add results to output file
1.2) Run ANOVA on populations, add results to output file
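A minimal sketch of steps 1.1 and 1.2 in Python, using scipy.stats.f_oneway for a one-way ANOVA. I am assuming here that the input is a list of values (e.g. trait allele frequencies) per group; the numbers and the function name run_anova are made up for illustration, and the same function would be called once with continental groups and once with populations.

```python
from scipy.stats import f_oneway

# Illustrative values only: one list of values per continental group
# (AFR, EUR, EAS are 1000 Genomes super-population codes).
values_by_group = {
    "AFR": [0.41, 0.43, 0.40],
    "EUR": [0.52, 0.55, 0.53],
    "EAS": [0.60, 0.58, 0.61],
}


def run_anova(groups):
    """One-way ANOVA across the groups; returns (F statistic, p-value)."""
    f_stat, p_value = f_oneway(*groups.values())
    return float(f_stat), float(p_value)


f_stat, p_value = run_anova(values_by_group)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```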
2.1 ====[ PCA ]============================================
2.1.1) Sort the list of SNPs by p-value and divide them into groups of x SNPs, where x is set by the user.
2.1.2) Run PCA and reduce to two factors, with the SNP groups as different variables.
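A rough sketch of steps 2.1.1 and 2.1.2 in Python. The SNP ids are hypothetical, and I am assuming that each SNP group is turned into one variable (for example the mean trait allele frequency of the group) with one row per population; the PCA itself is done with a plain SVD in numpy rather than any particular package.

```python
import numpy as np


def group_snps_by_pvalue(snp_pvalues, group_size):
    """Sort SNP ids by ascending p-value and split them into groups of group_size."""
    ordered = sorted(snp_pvalues, key=snp_pvalues.get)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def pca_two_components(data):
    """PCA on a (populations x SNP groups) matrix, keeping two components.

    Returns component scores, loadings (correlation of each standardized
    variable with each component) and the explained variance ratios.
    """
    x = np.asarray(data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)  # standardize columns
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :2] * s[:2]
    loadings = vt[:2].T * s[:2] / np.sqrt(z.shape[0] - 1)
    explained = (s ** 2 / np.sum(s ** 2))[:2]
    return scores, loadings, explained


# Hypothetical SNP ids and p-values, grouped two at a time.
example = {"rs1": 0.03, "rs2": 0.001, "rs3": 0.2, "rs4": 0.01, "rs5": 0.5}
print(group_snps_by_pvalue(example, 2))
```

The last group can be smaller than x when the number of SNPs is not a multiple of x; whether to keep or drop it is an open design choice.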
2.2 ====[ PCA ]====[ CHECK WHETHER PCA APPROPRIATE ]=======
2.2.1) Check the KMO measure and Bartlett's test to see that the data are suitable for PCA.
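A sketch of how step 2.2.1 could be computed directly with numpy and scipy, assuming the usual textbook definitions: Bartlett's test of sphericity (chi-square statistic from the log-determinant of the correlation matrix) and the overall Kaiser-Meyer-Olkin measure (squared correlations relative to squared correlations plus squared partial correlations). The function names are my own, and the data layout (rows are populations, columns are SNP groups) is an assumption.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(data):
    """Bartlett's test of sphericity; returns (chi-square statistic, p-value)."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    corr = np.corrcoef(x, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return float(statistic), float(chi2.sf(statistic, dof))


def kmo_overall(data):
    """Overall KMO measure of sampling adequacy (between 0 and 1)."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)  # partial correlations (off-diagonal)
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    a2 = np.sum(partial[off_diag] ** 2)
    return float(r2 / (r2 + a2))
```

A small Bartlett p-value and a KMO well above 0.5 are the conventional signs that PCA is reasonable; the exact cut-offs to report would still need to be agreed on.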
2.3 ====[ PCA ]====[ CHECK THAT THE RESULT IS VALID/INTERPRETABLE ]===
2.3.1) Check whether there is one interpretable factor, i.e. a factor with a good correlation between PC
loadings and p-values.
2.3.2) Check that the components are not correlated (i.e. cannot be considered an overarching factor).
2.4 ====[ PCA ]=====[ CHECK RESULT ]============================================
2.4.1) Then test whether lower p-values predict higher polygenic scores. Use Spearman rank correlation.
============================================
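Step 2.4.1 could be sketched in Python with scipy.stats.spearmanr. I am assuming one p-value and one polygenic score per unit being compared; the numbers and the function name are invented for illustration. A negative rho would mean that lower p-values go together with higher scores.

```python
from scipy.stats import spearmanr


def pvalue_score_correlation(p_values, scores):
    """Spearman rank correlation between p-values and polygenic scores.

    Returns (rho, p-value of the correlation test).
    """
    rho, p = spearmanr(p_values, scores)
    return float(rho), float(p)


# Illustrative numbers only: scores fall as p-values rise.
example_p_values = [0.001, 0.01, 0.05, 0.2, 0.5]
example_scores = [0.9, 0.7, 0.6, 0.3, 0.1]
rho, p = pvalue_score_correlation(example_p_values, example_scores)
print(f"rho = {rho:.2f}")
```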
APPENDIX: WHY ALFRED IS HARDER TO WORK WITH:
The main reason why ALFRED is harder to work with is that many of the phenotype SNPs are not in it.
Therefore we have to find substitute (proxy) SNPs with the SNAP tool, which is not hard, but I see no way of finding
out which allele of the proxy SNP corresponds to the trait-increasing or trait-decreasing allele of the original SNP.
QUESTION:
Are the height SNPs available in a structured file somewhere, or do I have to look up the additional material for
each study and convert the data to a common format?