I know imputation fairly well, so if you want, I can point you toward the most suitable methods. Not all imputation methods are equivalent; some are adequate only for certain types of data. Which data are you using? Is it the file named "dataset.csv" that you uploaded here? I ask because I'm not sure I can recommend imputation in your case. I have explained that here. Your variables must be substantially correlated (around 0.40 at the least). The ratio of subjects to variables entered in the imputation procedure should not fall below 10:1. Alternatively, Hardt et al. (2012) recommend a maximum ratio of 1:3 for variables (with or without auxiliaries) to complete cases; that is, with 60 people having complete data, up to 20 variables could be used. Auxiliary variables are those that can stand in for a variable because they correlate highly with it, such as the same variable measured at different points in time (i.e., repeated measures). If the percentage of missing cases is too high, as in your variables ViolentCrimeNorway, LarcenyNorway, ViolentCrimeFinland, and LarcenyFinland, you will run into serious trouble.
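The two rules of thumb above can be checked mechanically. A minimal sketch, assuming a hypothetical helper `check.ratios` (not part of any package) and made-up sample sizes apart from the 60/20 example taken from the text:

```r
# Hypothetical helper: checks the 10:1 subjects-to-variables rule and the
# 1:3 variables-to-complete-cases ratio attributed to Hardt et al. (2012).
check.ratios <- function(n.subjects, n.complete, n.vars) {
  c(subjects.per.variable.ok = n.subjects >= 10 * n.vars,
    hardt.ratio.ok           = 3 * n.vars <= n.complete)
}

# 60 complete cases and 20 variables is the example from the text;
# the total of 200 subjects is an assumption for illustration.
res <- check.ratios(n.subjects = 200, n.complete = 60, n.vars = 20)
print(res)
```

Integer comparisons are used instead of divisions so that borderline cases such as exactly 1:3 are not affected by floating-point rounding.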
Besides, the superiority of imputation over complete-case analysis is not restricted to FA; it extends to all kinds of analyses. For example, multiple regression on complete cases only is inefficient, which would mean that nearly 100% of such analyses published in various other journals are suspect, because they almost never apply imputation, either because the authors don't know about it or because it's time-consuming. A minimum of 5 imputations is recommended, but more may be needed depending on the features of your data. With 5 imputations, you need to run the analysis five times, once on each imputed data set, and then average the results. As is recommended, you should also report the standard error, CI, or standard deviation, so that readers know how much the estimates vary across imputations. If your estimates vary too much, that is a sign that they are not stable and that you may need more imputations: 10, 20, 30, etc. But repeating the analysis 30 times on 30 data sets is something researchers don't want to do. And I understand that...
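Running the analysis on each imputed data set, averaging the estimates, and reporting how much they vary is usually done with Rubin's rules. A minimal base-R sketch, assuming a hypothetical list `imputed` of m completed data frames; the simulated data below only stand in for real imputations, and all object names are illustrative:

```r
# Stand-in for m imputed data sets (simulated; NOT real imputations).
set.seed(1)
m <- 5
imputed <- lapply(seq_len(m), function(i) {
  x <- rnorm(50)
  data.frame(x = x, y = 0.5 * x + rnorm(50))
})

# Run the same regression on every data set.
fits <- lapply(imputed, function(d) lm(y ~ x, data = d))
est  <- sapply(fits, function(f) coef(f)[["x"]])      # m point estimates
se2  <- sapply(fits, function(f) vcov(f)["x", "x"])   # m squared standard errors

# Rubin's rules for a single coefficient.
pooled.est <- mean(est)                   # pooled point estimate
W          <- mean(se2)                   # within-imputation variance
B          <- var(est)                    # between-imputation variance
total.var  <- W + (1 + 1 / m) * B         # total variance
pooled.se  <- sqrt(total.var)

print(c(estimate = pooled.est, se = pooled.se))
```

A large between-imputation variance `B` relative to `W` is the instability mentioned above, and is one signal that more imputations may be needed.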
Personally, I prefer maximum likelihood (ML) estimation, because with multiple imputation (MI), choosing the right variant for the data at hand gives me a headache. If you have AMOS or R, you can easily use ML.
*****
In the last sentence of your paper, "All datasets and source code is available in the supplementary materials.", it's optional, but I recommend adding "on OpenPsych Forum".
Here is my current code. Updated dataset file is attached.
setwd("Z:/Code/database")
read = read.csv("Megadataset_v1.4.csv")
colnames(read)
library(Hmisc) # for rcorr
library(car) # for scatterplot
library(stats) #for automatic multiple regression
library(psych) #for r.test
library(XLConnect) #writing to xls
library(nFactors) #how many factors to extract
library(mi) #imputation
DF.work = cbind(read["Norway.OutOfWork.2010Q2.men"], #for work data
read["Norway.OutOfWork.2011Q2.men"],
read["Norway.OutOfWork.2012Q2.men"],
read["Norway.OutOfWork.2013Q2.men"],
read["Norway.OutOfWork.2014Q2.men"],
read["Norway.OutOfWork.2010Q2.women"],
read["Norway.OutOfWork.2011Q2.women"],
read["Norway.OutOfWork.2012Q2.women"],
read["Norway.OutOfWork.2013Q2.women"],
read["Norway.OutOfWork.2014Q2.women"])
DF.income = cbind(read["Norway.Income.index.2009"], #for income data
read["Norway.Income.index.2010"],
read["Norway.Income.index.2011"],
read["Norway.Income.index.2012"])
#make DF
DF = cbind(read["LV2012estimatedIQ"],
read["Altinok2013ACH"],
read["IslamPewResearch2010"],
log(read["GDPpercapitaWorldBank2013"]),
read["S.scores"],
read["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"],
read["FinlandViolentCrimeAdjustedOddsRatioSkardhamar2014"],
read["NorwayLarcenyAdjustedOddsRatioSkardhamar2014"],
read["FinlandLarcenyAdjustedOddsRatioSkardhamar2014"],
read["Norway.tertiary.edu.att.2013"],
read["Norway.tertiary.edu.att.bigsamples.2013"])
#get 5 year means
DF["OutOfWork.2010to2014.men"] = apply(DF.work[1:5],1,mean,na.rm=T) #get means, ignore missing
DF["OutOfWork.2010to2014.women"] = apply(DF.work[6:10],1,mean,na.rm=T) #get means, ignore missing
#get means for income and add to DF
DF["Income.index.2009to2012"] = apply(DF.income[1:4],1,mean,na.rm=T) #get means, ignore missing
#compare islam vars
DF.islam = as.data.frame(cbind(read["IslamPewResearch1990"],
read["IslamPewResearch2010"],
read["IslamPewResearch2030Projected"],
read["Islam"]))
#correlation matrix
DF.cor = rcorr(as.matrix(DF)) #create correlation matrix with pairwise miss data deleted
round(DF.cor$r,2)
#write results to xlsx file
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = round(DF.cor$r,2), sheet = "cors")
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = round(DF.cor$P,4), sheet = "p")
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = DF.cor$n, sheet = "n")
#are the same vars hard/easy to predict across predictors?
IQ.cors = DF.cor$r[6:nrow(DF.cor$r),1]
Altinok.cors = DF.cor$r[6:nrow(DF.cor$r),2]
Islam.cors = DF.cor$r[6:nrow(DF.cor$r),3]
GDP.cors = DF.cor$r[6:nrow(DF.cor$r),4]
S.cors = DF.cor$r[6:nrow(DF.cor$r),5]
DF.predict = as.data.frame(cbind(IQ.cors,Altinok.cors,Islam.cors,GDP.cors,S.cors))
DF.predict.cor = rcorr(as.matrix(DF.predict)) #are there clear patterns in how well the vars are predictable? YES!
#factor analysis
#subset data
DF.norway = DF[c("NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014", #only norwegian vars
"NorwayLarcenyAdjustedOddsRatioSkardhamar2014",
"Norway.tertiary.edu.att.bigsamples.2013", #skip the small ed. att var
"OutOfWork.2010to2014.men",
"OutOfWork.2010to2014.women",
"Income.index.2009to2012")]
#handle missing values
DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15
#count NA's
DF.norway.missing = rowSums(is.na(DF.norway)) #number of missing values per case (row)
DF.norway.missing.table = table(DF.norway.missing) #tabulates cases by their number of missing values
#initial info
mi.info = mi.info(DF.norway)
missing.pattern.plot(DF.norway) #visual depiction of missing values
mp.plot(DF.norway, y.order = TRUE, x.order = TRUE)
#subsets
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26
#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)
#compare desc. stats
DF.desc.stats = as.data.frame(rbind(describe(DF.norway),
describe(DF.norway.complete),
describe(DF.norway.miss.1.imputed),
describe(DF.norway.miss.2.imputed)))
DF.desc.stats.ordered = DF.desc.stats[with(DF.desc.stats, order(vars)), ] #reorder
write.csv(DF.desc.stats.ordered, "desc_stats.csv")
#sampling tests
#bartlett's test
cortest.bartlett(DF.norway.complete)
cortest.bartlett(DF.norway.miss.1.imputed)
cortest.bartlett(DF.norway.miss.2.imputed)
#KMO function
kmo <- function(x)
{
x <- subset(x, complete.cases(x)) # Omit missing values
r <- cor(x) # Correlation matrix
r2 <- r^2 # Squared correlation coefficients
i <- solve(r) # Inverse matrix of correlation matrix
d <- diag(i) # Diagonal elements of inverse matrix
p2 <- (-i/sqrt(outer(d, d)))^2 # Squared partial correlation coefficients
diag(r2) <- diag(p2) <- 0 # Delete diagonal elements
KMO <- sum(r2)/(sum(r2)+sum(p2))
MSA <- colSums(r2)/(colSums(r2)+colSums(p2))
return(list(KMO=KMO, MSA=MSA))
}
#get KMO
kmo(DF.norway.complete)$KMO
kmo(DF.norway.miss.1.imputed)$KMO
kmo(DF.norway.miss.2.imputed)$KMO
#nfactors
nScree(DF.norway.complete)
nScree(DF.norway.miss.1.imputed)
nScree(DF.norway.miss.2.imputed)
#PA and ML
DF.norway.complete.pa = fa(DF.norway.complete, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.complete.ml = fa(DF.norway.complete, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.complete.pa,DF.norway.complete.ml) #identical
DF.norway.miss.1.imputed.pa = fa(DF.norway.miss.1.imputed, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.miss.1.imputed.ml = fa(DF.norway.miss.1.imputed, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.miss.1.imputed.pa,DF.norway.miss.1.imputed.ml) #identical
DF.norway.miss.2.imputed.pa = fa(DF.norway.miss.2.imputed, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.miss.2.imputed.ml = fa(DF.norway.miss.2.imputed, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.miss.2.imputed.pa,DF.norway.miss.2.imputed.ml) #identical
DF.norway.miss.2.imputed.pa2 = fa(DF.norway.miss.2.imputed, nfactors=2,rotate="none",scores="regression",fm="pa")
DF.norway.miss.2.imputed.ml2 = fa(DF.norway.miss.2.imputed, nfactors=2,rotate="none",scores="regression",fm="ml")
#Strength of general factor
omega(DF.norway.complete)
omega(DF.norway.miss.1.imputed)
omega(DF.norway.miss.2.imputed)
#put the factor scores back into the big dataset - for SPI
Sfactor.scores = rep(NA,nrow(DF)) #make an empty vector of the right size for the factor scores
Sfactor.scores1 = rep(NA,nrow(DF))
Sfactor.scores2 = rep(NA,nrow(DF))
dims = as.integer(dimnames(DF.norway.complete.pa$scores)[[1]]) #converts the dimnames to integers
dims1 = as.integer(dimnames(DF.norway.miss.1.imputed.pa$scores)[[1]]) #converts the dimnames to integers
dims2 = as.integer(dimnames(DF.norway.miss.2.imputed.pa$scores)[[1]]) #converts the dimnames to integers
#this puts the factor scores back into the big dataset, using the row numbers as positions
Sfactor.scores[dims] = DF.norway.complete.pa$scores[,1]
Sfactor.scores1[dims1] = DF.norway.miss.1.imputed.pa$scores[,1]
Sfactor.scores2[dims2] = DF.norway.miss.2.imputed.pa$scores[,1]
#reverse Sfactor scores
Sfactor.scores = Sfactor.scores*-1
Sfactor.scores1 = Sfactor.scores1*-1
Sfactor.scores2 = Sfactor.scores2*-1
#make DF
DF2 = cbind(read["LV2012estimatedIQ"],
read["Altinok2013ACH"],
read["IslamPewResearch2010"],
log(read["GDPpercapitaWorldBank2013"]),
read["S.scores"],
Sfactor.scores,
Sfactor.scores1,
Sfactor.scores2)
DF2.cor = rcorr(as.matrix(DF2)) #create correlation matrix with pairwise miss data deleted
round(DF2.cor$r,2)
#visualize
scatterplot(OutOfWork.2010to2014.men ~ LV2012estimatedIQ, #predict employment from IQ
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they dont work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "Lynn and Vanhanen national IQ",
ylab="% of men unemployed, 2010-2014 average")
scatterplot(OutOfWork.2010to2014.women ~ IslamPewResearch2010, #predict employment from Islam
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they dont work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "Prevalence of Islam",
ylab="% of women unemployed, 2010-2014 average")
scatterplot(Norway.tertiary.edu.att.bigsamples.2013 ~ S.scores, #predict tertiary education from S scores
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they dont work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "S factor",
ylab="fraction with long tertiary education")
scatterplot(Sfactor.scores2 ~ LV2012estimatedIQ, #predict S factor from IQ
data=DF2,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they dont work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "National IQ in home country",
ylab="General socioeconomic factor (S) in Norway",
main = "National IQ predicts immigrant group performance at the group level in Norway")
scatterplot(Sfactor.scores2 ~ IslamPewResearch2010, #predict S factor from Islam
data=DF2,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they dont work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "National prevalence of Islam in home country",
ylab="General socioeconomic factor (S) in Norway",
main = "National Islam prevalence predicts immigrant group performance at the group level in Norway")
As you can see, I am using multiple imputation. There seems to be a bug because MI messes up one of the variables, but only that one (Norway.tertiary.edu.att.bigsamples.2013). The error has no effect on the factor analysis (S factors from the complete-cases dataset correlate .99 or 1.00 with the ones from the imputed datasets). However, it has changed the values of the cases with actual data, and changed the mean and the standard deviation. Everything else seems fine. Odd?
It can be seen in the descriptive statistics for each dataset (all data, complete cases, impute 1, impute 2).
> round(DF.desc.stats.ordered,2)
vars n mean sd median trimmed mad min max
NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014 1 26 1.31 0.87 1.25 1.24 1.11 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20141 1 15 1.41 0.99 1.50 1.36 1.19 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20142 1 18 1.33 0.93 1.15 1.28 0.96 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20143 1 26 1.16 0.86 0.85 1.08 0.89 -0.01 3.20
NorwayLarcenyAdjustedOddsRatioSkardhamar2014 2 26 0.77 0.56 0.60 0.74 0.59 0.10 2.00
NorwayLarcenyAdjustedOddsRatioSkardhamar20141 2 15 0.78 0.55 0.50 0.76 0.44 0.20 1.60
NorwayLarcenyAdjustedOddsRatioSkardhamar20142 2 18 0.72 0.53 0.55 0.71 0.44 0.10 1.60
NorwayLarcenyAdjustedOddsRatioSkardhamar20143 2 26 0.60 0.51 0.47 0.58 0.33 -0.26 1.60
Norway.tertiary.edu.att.bigsamples.2013 3 67 0.12 0.08 0.11 0.12 0.09 0.01 0.31
Norway.tertiary.edu.att.bigsamples.20131 3 15 0.10 0.07 0.09 0.10 0.07 0.01 0.23
Norway.tertiary.edu.att.bigsamples.20132 3 18 0.52 0.02 0.52 0.52 0.02 0.50 0.56
Norway.tertiary.edu.att.bigsamples.20133 3 26 0.52 0.02 0.52 0.52 0.02 0.50 0.56
OutOfWork.2010to2014.men 4 120 7.05 4.18 5.96 6.51 3.62 1.38 22.08
OutOfWork.2010to2014.men1 4 15 7.40 5.36 5.96 6.64 4.00 2.68 22.08
OutOfWork.2010to2014.men2 4 18 7.31 4.89 6.53 6.68 4.26 2.68 22.08
OutOfWork.2010to2014.men3 4 26 6.88 4.30 6.32 6.20 3.38 2.66 22.08
OutOfWork.2010to2014.women 5 120 7.50 4.97 6.30 6.75 2.95 1.32 31.82
OutOfWork.2010to2014.women1 5 15 8.90 6.20 7.30 8.40 4.69 1.98 22.42
OutOfWork.2010to2014.women2 5 18 8.17 5.93 6.48 7.67 4.46 1.90 22.42
OutOfWork.2010to2014.women3 5 26 7.40 5.25 6.48 6.69 3.53 1.56 22.42
Income.index.2009to2012 6 23 79.86 14.58 80.25 79.87 13.71 53.25 108.25
Income.index.2009to20121 6 15 78.78 14.78 78.25 78.48 10.75 53.25 108.25
Income.index.2009to20122 6 18 6852.23 2415.20 6500.53 6799.16 2890.42 2835.56 11718.06
Income.index.2009to20123 6 26 6689.35 2268.99 6500.53 6635.00 2031.16 2835.56 11718.06
range skew kurtosis se
NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014 3.00 0.55 -0.83 0.17
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20141 3.00 0.39 -1.25 0.26
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20142 3.00 0.57 -0.94 0.22
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20143 3.21 0.77 -0.28 0.17
NorwayLarcenyAdjustedOddsRatioSkardhamar2014 1.90 0.56 -1.09 0.11
NorwayLarcenyAdjustedOddsRatioSkardhamar20141 1.40 0.38 -1.74 0.14
NorwayLarcenyAdjustedOddsRatioSkardhamar20142 1.50 0.55 -1.42 0.12
NorwayLarcenyAdjustedOddsRatioSkardhamar20143 1.86 0.70 -0.58 0.10
Norway.tertiary.edu.att.bigsamples.2013 0.30 0.42 -0.91 0.01
Norway.tertiary.edu.att.bigsamples.20131 0.22 0.39 -1.32 0.02
Norway.tertiary.edu.att.bigsamples.20132 0.05 0.50 -1.12 0.00
Norway.tertiary.edu.att.bigsamples.20133 0.06 0.75 -0.64 0.00
OutOfWork.2010to2014.men 20.70 1.26 1.69 0.38
OutOfWork.2010to2014.men1 19.40 1.38 1.18 1.38
OutOfWork.2010to2014.men2 19.40 1.57 2.16 1.15
OutOfWork.2010to2014.men3 19.42 1.80 3.72 0.84
OutOfWork.2010to2014.women 30.50 1.92 5.11 0.45
OutOfWork.2010to2014.women1 20.44 0.83 -0.58 1.60
OutOfWork.2010to2014.women2 20.52 1.03 -0.04 1.40
OutOfWork.2010to2014.women3 20.86 1.30 1.14 1.03
Income.index.2009to2012 55.00 -0.01 -0.92 3.04
Income.index.2009to20121 55.00 0.19 -0.78 3.82
Income.index.2009to20122 8882.50 0.20 -0.97 569.27
Income.index.2009to20123 8882.50 0.24 -0.74 444.99
If someone wants to use multiple imputation (MI), I have nothing against it. But (s)he really needs to read a lot about MI first. I'm serious. When I read up on it, it really gave me a headache. I have also read some data analysts (Paul D Allison, maybe?) saying the same thing: if you are not sure what you are doing, don't do it. A lot of researchers don't know the proper way to use MI, and some authors have reported that others used sub-optimal imputation options.
That said, can you tell me what this syntax means?
#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)
Why are there norway.miss.1 and norway.miss.2? Does that mean you use 2 imputations? I ask because in the literature the number of imputed data sets is normally called "m", and in your code m=3. So does that mean you use 3*2=6 imputations? [edit: no, I understand now. You had 2 data sets, one with N=18 and the other with N=26, so in fact you use 3 imputations]
Also, can you tell me the % of missing values per variable? Normally, the more missing values you have, the more imputations you need.
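The percentage of missing values per variable can be read off directly in base R. A minimal sketch on a made-up data frame `toy`, which merely stands in for DF.norway (the values are invented):

```r
# Toy data standing in for DF.norway; values are made up.
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(NA, NA, 3, 4),
                  c = c(1, 2, 3, 4))

pct.missing        <- colMeans(is.na(toy)) * 100  # % missing per variable
n.missing.per.case <- rowSums(is.na(toy))         # missing values per case

print(round(pct.missing, 1))
print(table(n.missing.per.case))
```

The same two calls applied to the real data frame would give the per-variable percentages asked for here and the per-case counts used to build the subsets.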
As you can see from the code, I reduced the full dataset to three subsets. One with complete cases:
DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15
And two with missing values
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26
The comments explain what I did.
The lines
#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)
The first and third lines run the imputation, with 200 iterations and 3 imputations. The second and fourth lines extract the imputed datasets.
We can try ML imputation too.
I switched to using the VIM package instead for imputing. This produces results with bugs.
#subsets
DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26
DF.norway.miss.1.imp.vim = irmi(DF.norway.miss.1) #impute using VIM
DF.norway.miss.2.imp.vim = irmi(DF.norway.miss.2) #impute using VIM
The current draft is 10 pages, but I still need to add more. Perhaps 12-15 pages total when done.
Ok, I managed to keep it at 8 pages due to dropping some of the scatter-plots.
New version is attached as well as updated supplementary material.
You should state explicitly how many imputations you used (m=3, or more). And remember that the more missing values you have, the more imputations you need.
http://www.statisticalhorizons.com/more-imputations
You didn't make the corrections regarding the intro which I noted previously.
Here is a new draft. I have made most of the changes suggested by Chuck and Dalliard. A native speaker has corrected the language in the paper. I have added a table in the appendix with S factor scores in NO and DK. I have added another dataset with up to 3 imputed values per case, which results in N=67 but more uncertain data. Various other smaller changes.
Ideally, I'd like to switch to deterministic imputation so that my analyses are completely reproducible, but I haven't found an easy way to do this in R so far. It will very likely make little difference to the predictive results, which are already very strong (|.55| - |.78| for the fully imputed dataset NO, |.51| - |.71| for DK). They are also similar to the results from the unimputed datasets for those variables with sufficient sample size and variance.
The essence of imputation is that you can never get the same result twice. You only need to state that in the paper, as you did. But next time, I recommend using more imputations. Ideally, the number of imputations should be a function of the % of missing values.
One can use deterministic imputation with multiple regression. For a given variable, one uses every other variable in the dataset to predict its values, perhaps including interactions. Then, in the cases where a value is missing, one imputes the predicted value.
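A minimal base-R sketch of this deterministic regression imputation, on simulated data. The data frame `toy`, the helper `impute.det`, and the choice of a single incomplete variable `y` are assumptions for illustration, not the paper's procedure:

```r
# Simulated data; the seed only fixes the toy data, not the imputation.
set.seed(42)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + 0.8 * toy$x1 - 0.5 * toy$x2 + rnorm(30, sd = 0.5)
toy$y[c(3, 11, 27)] <- NA                 # create some missing values

# Hypothetical helper: fill missing y with the regression prediction.
impute.det <- function(d) {
  miss <- is.na(d$y)
  fit  <- lm(y ~ x1 + x2, data = d)       # lm drops incomplete rows by default
  d$y[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
  d
}

run1 <- impute.det(toy)
run2 <- impute.det(toy)
print(identical(run1, run2))              # no random component, so runs agree
```

Because the imputed values lie exactly on the regression surface, they carry no residual variation.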
I remember the first time I used imputation; it was in AMOS. I requested several data sets, one by one, and when working with each of them I got identical results. I discovered afterwards that I had been using the option "regression imputation", while the recommended option would have been "stochastic regression imputation" (even though AMOS is a bad tool for making imputed data sets). With the latter option, you cannot get identical data sets. And all data analysts will tell you not to use imputation that is not "stochastic". In the first case, there is no random component (i.e., no error term), and you will underestimate standard errors. When I said it's not possible to get identical data sets, I was referring to imputation with a random component, as is usually recommended.
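Stochastic regression imputation can still be made reproducible by fixing the random seed. A simplified base-R sketch on simulated data; `make.toy` and `impute.stoch` are hypothetical helpers, and a full MI procedure would also draw the regression parameters, which this sketch does not do:

```r
# Simulated data with 80 of 200 values of y missing.
make.toy <- function() {
  set.seed(42)
  d <- data.frame(x1 = rnorm(200))
  d$y <- 1 + 0.8 * d$x1 + rnorm(200)
  d$y[1:80] <- NA
  d
}

# Hypothetical helper: prediction plus a random residual draw.
impute.stoch <- function(d, seed) {
  set.seed(seed)                                   # fixes the random draws
  miss <- is.na(d$y)
  fit  <- lm(y ~ x1, data = d)
  pred <- predict(fit, newdata = d[miss, , drop = FALSE])
  d$y[miss] <- pred + rnorm(sum(miss), sd = sigma(fit))  # add residual noise
  d
}

toy <- make.toy()
a  <- impute.stoch(toy, seed = 1)
b  <- impute.stoch(toy, seed = 1)
c2 <- impute.stoch(toy, seed = 2)
print(identical(a, b))    # same seed: same imputed data set
print(identical(a, c2))   # different seed: different imputed data set
```

Reporting the seeds used keeps the random component while letting others regenerate exactly the same imputed data sets.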
1) "Recent studies show that criminality and other useful socioeconomic traits"
I don't think criminality is a "useful socioeconomic trait." Perhaps "important social and economic characteristics"?
2) Use the equal or greater than sign (≥) rather than >=.
3) "Are some predictors just generally better at predicting than others, or is there an interaction effect between predictor and variables?"
Not sure what you mean by interaction here. The question is whether any of the predictors have unique predictive power.
4) "using multiple imputation8 to impute data to cases with 1 or fewer missing values"
How is it possible to have fewer than 1 missing values?
5) "Table 4 shows description statistics"
Descriptive statistics.
6) "the squared multiple correlation of regression the first factor on the original variables"
Word missing or something.
Thank you for reviewing again, Dalliard.
1) "Recent studies show that criminality and other useful socioeconomic traits"
Removed "useful". The editor must have added it.
2) Use the equal or greater than sign (≥) rather than >=.
Done.
3) "Are some predictors just generally better at predicting than others, or is there an interaction effect between predictor and variables?"
Not sure what you mean by interaction here. The question is whether any of the predictors have unique predictive power.
An interaction would be that a given predictor P1 is better at predicting variable V1 than P2, but that P2 is better at predicting V2 than P1. A predictor x outcome variable interaction. This shows up as lower than |1| correlations between the prediction correlation vectors. However, surprisingly, they were all close or closish to 1.
For instance, one might have posited that Islam prevalence should be good at predicting crime due to incompatible religion/culture, but that it has little influence on male unemployment. This wasn't found. The IQ x Islam vectors correlate |.99| (shown in Table 2). Surprising to me. I remember when I first got my hands on the Danish data, I checked for this. Obviously, having read a lot about intelligence and education, I would expect IQ to be a better predictor than Islam, but perhaps the other way around for crime. However, it was pretty linear there too. Actually, I didn't think of this way of testing it before now. Perhaps I should add this to the reanalysis of the Danish data. The Danish data is much more suited for testing it since the number of outcome variables is much larger (9 vs. 25).
How would you like me to change this?
Islam does not correlate highly with the others (data from the DF.cor object in R):
Islam x IQ = -.27
x Altinok = -.43
x GDP = -.14
x International S = -.33
so it can be combined fruitfully in multiple regression. For instance, I tried with IQ+Islam to predict S scores in Norway (imp. 3). R2 adjusted is .59, i.e. r=0.77 which is higher than any of the predictors alone (if one doesn't consider S scores from Denmark a predictor, it has .78).
4) "using multiple imputation8 to impute data to cases with 1 or fewer missing values"
How is it possible to have fewer than 1 missing values?
If there are no missing values for a case. Check the relevant code:
#count NA's
DF.norway.missing = apply(DF.norway, 1, is.na) #produces a matrix with one T/F column per case
DF.norway.missing = apply(DF.norway.missing, 2, sum) #sums the number of missing values per case
DF.norway.missing.table = table(DF.norway.missing) #tabulates them
#subsets
DF.norway.complete = DF.norway[DF.norway.missing <= 0,] #complete cases only, reduces N to 15
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26
DF.norway.miss.3 = DF.norway[DF.norway.missing <= 3,] #keep data with 3 or less missing values, N=67
5) "Table 4 shows description statistics"
Fixed.
6) "the squared multiple correlation of regression the first factor on the original variables"
Word missing or something.
Fixed to: "the squared multiple correlation of regressing the first factor on the original variables".
Emil, could you post the latest version?
"Factor analytic methods require that there are no missing values. The easiest and most common way to deal with this is to limit the data to the subset with complete cases. This however produces biased results if the data are not missing completely at random ...For the above reasons, I used three methods for handling missing cases"
One of the MI assumptions is that data is MAR. Why would the possibility of MNAR then be reason to use it (as opposed to deletion), as your wording suggests?
An interaction would be that a given predictor P1 is better at predicting variable V1 than P2, but that P2 is better at predicting V2 than P1. A predictor x outcome variable interaction. This shows up as lower than |1| correlations between the prediction correlation vectors. However, surprisingly, they were all close or closish to 1.
Perhaps someone here can translate for me ? I don't understand the entire sentence.
P1 = predictor 1, P2 = predictor 2, etc.
V1 = outcome var 1, V2 = outcome var 2, etc.
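A minimal sketch of the check in base R, with made-up correlations. Each column of the hypothetical `pred.cors` holds one predictor's correlations with the outcome variables V1 to V5; correlating two columns gives the "prediction vector" correlation:

```r
# Rows = outcome variables, columns = predictors; all values are invented.
pred.cors <- data.frame(
  P1 = c(0.60, 0.45, 0.30, -0.50, -0.20),
  P2 = c(-0.55, -0.40, -0.28, 0.47, 0.15)
)
rownames(pred.cors) <- paste0("V", 1:5)

# Correlation between the two prediction vectors.
vec.cor <- cor(pred.cors$P1, pred.cors$P2)
print(round(vec.cor, 2))
```

An absolute value near 1 means the outcomes that P1 predicts well are the same ones P2 predicts well, i.e. no predictor x outcome interaction; a value well below |1| would indicate that the predictors differ in which outcomes they predict best.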
I performed the analogous analysis on the Danish data as mentioned above.
> DF.Denmark.predict.cor
IQ Altinok Islam logGDP S.score
IQ 1.00 0.99 -0.96 0.98 0.99
Altinok 0.99 1.00 -0.94 0.98 0.98
Islam -0.96 -0.94 1.00 -0.94 -0.96
logGDP 0.98 0.98 -0.94 1.00 0.99
S.score 0.99 0.98 -0.96 0.99 1.00
The results are even better than in the Norwegian data. The Danish data is larger (25 variables) and higher quality (age and sex controlled). Apparently predictors are substantially general, not very specific. Theories of their explanatory power should be general, not specific.
Anyways, as I'm fine with the paper, I approve publication.
Here's a new revision.
It has a lot of smaller language changes, and the changes mentioned above. I have also added a short explanation of the predictor x outcome variable interaction idea that Dalliard noted wasn't well-enough explained. The appendix now contains a list of the variables in the Danish dataset too so that readers don't have to find the prior Danish study to know what I analyzed there.
Reviewers who ok'd this paper before the addition of the Danish re-analyses should re-read the submission and see if they still agree with publication.