
[ODP] Crime, income and employment among immigrant groups in Norway and Finland
I have good knowledge of imputation. If you want, I can guide you toward the best methods. Not all imputations are equivalent; some are adequate only for certain types of data. What data are you using? Is it the file named "dataset.csv" that you uploaded here? Because I'm not sure I can recommend the use of imputation. I have explained that here. Your variables must be highly correlated (approaching at least 0.40 or so). The ratio of subjects to variables (entered into the imputation procedure) should not fall below 10:1. Alternatively, Hardt et al. (2012) recommend a maximum ratio of 1:3 for variables (with or without auxiliaries) against complete cases; that is, for 60 people with complete data, up to 20 variables could be used. Auxiliary variables are those that can serve as substitutes because of their high correlation, such as identical variables measured at different points in time (i.e., repeated measures). If the percentage of missing cases is too high, as in your variables ViolentCrimeNorway, LarcenyNorway, ViolentCrimeFinland, and LarcenyFinland, I can tell you that you will have big trouble.

Besides, the superiority of imputation over complete-case analysis is not restricted to FA; it extends to all kinds of analyses. For example, complete-case multiple regression is inefficient, and by that standard nearly 100% of such analyses published in various other journals would be wrong, because they almost never apply imputation, either because the authors don't know anything about it or because it is time consuming. For example, a minimum of 5 imputations is recommended, but more may be needed given the features of your data. With 5 imputations, you need to run the analysis five times, once for each imputed dataset, and then average the results; as is recommended, you should also provide the standard error, CI, or standard deviation, to let readers know how much the estimates vary over the imputations. If your estimates vary too much, that may be a problem, a signal that your estimate is not stable and that you may need more imputations: 10, 20, 30, etc. But repeating the analysis 30 times with 30 datasets is something researchers don't want to do. And I understand that...
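For reference, the pooling step is mechanical. Here is a minimal sketch of Rubin's rules in R; the object imputed.list (a list of m completed data frames) and the regression formula are illustrative assumptions, not part of the code posted below.

fits  = lapply(imputed.list, function(d) lm(Income.index.2009to2012 ~ LV2012estimatedIQ, data = d)) #fit once per imputed dataset
est   = sapply(fits, function(f) coef(f)["LV2012estimatedIQ"]) #per-imputation estimates
se    = sapply(fits, function(f) summary(f)$coefficients["LV2012estimatedIQ", "Std. Error"]) #per-imputation SEs
m     = length(est)
Q.bar = mean(est)              #pooled estimate
W.bar = mean(se^2)             #within-imputation variance
B     = var(est)               #between-imputation variance
T.tot = W.bar + (1 + 1/m) * B  #total variance (Rubin's rules)
c(estimate = Q.bar, se = sqrt(T.tot)) #pooled estimate and standard error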

Personally, I prefer maximum likelihood (ML) estimation, because multiple imputation (MI) gives me a headache about choosing which kind of MI is appropriate for the data at hand. If you have AMOS or R, you can easily use ML.
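In R, the ML route could look roughly like this. A minimal sketch assuming the lavaan package (not used in the code below); the variable names refer to the DF.norway subset defined further down the thread, and the one-factor model is only an illustration.

library(lavaan) #assumed to be installed
model = 'S =~ NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014 +
              NorwayLarcenyAdjustedOddsRatioSkardhamar2014 +
              Norway.tertiary.edu.att.bigsamples.2013 +
              OutOfWork.2010to2014.men +
              OutOfWork.2010to2014.women +
              Income.index.2009to2012'
fit = cfa(model, data = DF.norway, missing = "fiml", estimator = "ML") #FIML uses the incomplete cases directly
summary(fit, standardized = TRUE)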

*****

Regarding the last sentence of your paper, "All datasets and source code is available in the supplementary materials.", it's optional, but I recommend adding "on the OpenPsych forum".
Admin
Here is my current code. The updated dataset file is attached.

setwd("Z:/Code/database")
read = read.csv("Megadataset_v1.4.csv")
colnames(read)

library(Hmisc) # for rcorr
library(car) # for scatterplot
library(stats) #for automatic multiple regression
library(psych) #for r.test, describe, fa, omega
library(XLConnect) #writing to xls
library(nFactors) #how many factors to extract
library(mi) #imputation


DF.work = cbind(read["Norway.OutOfWork.2010Q2.men"], #for work data
read["Norway.OutOfWork.2011Q2.men"],
read["Norway.OutOfWork.2012Q2.men"],
read["Norway.OutOfWork.2013Q2.men"],
read["Norway.OutOfWork.2014Q2.men"],
read["Norway.OutOfWork.2010Q2.women"],
read["Norway.OutOfWork.2011Q2.women"],
read["Norway.OutOfWork.2012Q2.women"],
read["Norway.OutOfWork.2013Q2.women"],
read["Norway.OutOfWork.2014Q2.women"])

DF.income = cbind(read["Norway.Income.index.2009"], #for income data
read["Norway.Income.index.2010"],
read["Norway.Income.index.2011"],
read["Norway.Income.index.2012"])

#make DF
DF = cbind(read["LV2012estimatedIQ"],
read["Altinok2013ACH"],
read["IslamPewResearch2010"],
log(read["GDPpercapitaWorldBank2013"]),
read["S.scores"],
read["NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014"],
read["FinlandViolentCrimeAdjustedOddsRatioSkardhamar2014"],
read["NorwayLarcenyAdjustedOddsRatioSkardhamar2014"],
read["FinlandLarcenyAdjustedOddsRatioSkardhamar2014"],
read["Norway.tertiary.edu.att.2013"],
read["Norway.tertiary.edu.att.bigsamples.2013"])


#get 5 year means
DF["OutOfWork.2010to2014.men"] = apply(DF.work[1:5],1,mean,na.rm=T) #get means, ignore missing
DF["OutOfWork.2010to2014.women"] = apply(DF.work[6:10],1,mean,na.rm=T) #get means, ignore missing

#get means for income and add to DF
DF["Income.index.2009to2012"] = apply(DF.income[1:4],1,mean,na.rm=T) #get means, ignore missing

#compare islam vars
DF.islam = as.data.frame(cbind(read["IslamPewResearch1990"],
read["IslamPewResearch2010"],
read["IslamPewResearch2030Projected"],
read["Islam"]))

#correlation matrix
DF.cor = rcorr(as.matrix(DF)) #create correlation matrix with pairwise miss data deleted
round(DF.cor$r,2)

#write results to xlsx file
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = round(DF.cor$r,2), sheet = "cors")
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = round(DF.cor$P,4), sheet = "p")
writeWorksheetToFile(file = "correlations_Norway2014.xlsx", data = DF.cor$n, sheet = "n")

#are the same vars hard/easy to predict across predictors?
IQ.cors = DF.cor$r[6:nrow(DF.cor$r),1]
Altinok.cors = DF.cor$r[6:nrow(DF.cor$r),2]
Islam.cors = DF.cor$r[6:nrow(DF.cor$r),3]
GDP.cors = DF.cor$r[6:nrow(DF.cor$r),4]
S.cors = DF.cor$r[6:nrow(DF.cor$r),5]

DF.predict = as.data.frame(cbind(IQ.cors,Altinok.cors,Islam.cors,GDP.cors,S.cors))
DF.predict.cor = rcorr(as.matrix(DF.predict)) #are there clear patterns in how well the vars are predictable? YES!

#factor analysis
#subset data
DF.norway = DF[c("NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014", #only norwegian vars
"NorwayLarcenyAdjustedOddsRatioSkardhamar2014",
"Norway.tertiary.edu.att.bigsamples.2013", #skip the small ed. att var
"OutOfWork.2010to2014.men",
"OutOfWork.2010to2014.women",
"Income.index.2009to2012")]

#handle missing values
DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15

#count NA's
DF.norway.missing = apply(DF.norway, 1, is.na) #matrix of TRUE/FALSE values, one column per case
DF.norway.missing = apply(DF.norway.missing, 2, sum) #number of missing values per case
DF.norway.missing.table = table(DF.norway.missing) #tabulates them

#initial info
mi.info = mi.info(DF.norway)
missing.pattern.plot(DF.norway) #visual depiction of missing values
mp.plot(DF.norway, y.order = TRUE, x.order = TRUE)

#subsets
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26

#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)

#compare desc. stats
DF.desc.stats = as.data.frame(rbind(describe(DF.norway),
describe(DF.norway.complete),
describe(DF.norway.miss.1.imputed),
describe(DF.norway.miss.2.imputed)))
DF.desc.stats.ordered = DF.desc.stats[with(DF.desc.stats, order(vars)), ] #reorder

write.csv(DF.desc.stats.ordered, "desc_stats.csv")

#sampling tests
#bartlett's test
cortest.bartlett(DF.norway.complete)
cortest.bartlett(DF.norway.miss.1.imputed)
cortest.bartlett(DF.norway.miss.2.imputed)

#KMO function
kmo <- function(x)
{
x <- subset(x, complete.cases(x)) # Omit missing values
r <- cor(x) # Correlation matrix
r2 <- r^2 # Squared correlation coefficients
i <- solve(r) # Inverse matrix of correlation matrix
d <- diag(i) # Diagonal elements of inverse matrix
p2 <- (-i/sqrt(outer(d, d)))^2 # Squared partial correlation coefficients
diag(r2) <- diag(p2) <- 0 # Delete diagonal elements
KMO <- sum(r2)/(sum(r2)+sum(p2))
MSA <- colSums(r2)/(colSums(r2)+colSums(p2))
return(list(KMO=KMO, MSA=MSA))
}

#get KMO
kmo(DF.norway.complete)$KMO
kmo(DF.norway.miss.1.imputed)$KMO
kmo(DF.norway.miss.2.imputed)$KMO

#nfactors
nScree(DF.norway.complete)
nScree(DF.norway.miss.1.imputed)
nScree(DF.norway.miss.2.imputed)

#PA and ML
DF.norway.complete.pa = fa(DF.norway.complete, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.complete.ml = fa(DF.norway.complete, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.complete.pa,DF.norway.complete.ml) #identical

DF.norway.miss.1.imputed.pa = fa(DF.norway.miss.1.imputed, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.miss.1.imputed.ml = fa(DF.norway.miss.1.imputed, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.miss.1.imputed.pa,DF.norway.miss.1.imputed.ml) #identical

DF.norway.miss.2.imputed.pa = fa(DF.norway.miss.2.imputed, nfactors=1,rotate="none",scores="regression",fm="pa")
DF.norway.miss.2.imputed.ml = fa(DF.norway.miss.2.imputed, nfactors=1,rotate="none",scores="regression",fm="ml")
factor.congruence(DF.norway.miss.2.imputed.pa,DF.norway.miss.2.imputed.ml) #identical

DF.norway.miss.2.imputed.pa2 = fa(DF.norway.miss.2.imputed, nfactors=2,rotate="none",scores="regression",fm="pa")
DF.norway.miss.2.imputed.ml2 = fa(DF.norway.miss.2.imputed, nfactors=2,rotate="none",scores="regression",fm="ml")


#Strength of general factor
omega(DF.norway.complete)
omega(DF.norway.miss.1.imputed)
omega(DF.norway.miss.2.imputed)

#put the factor scores back into the big dataset - for SPI
Sfactor.scores = rep(NA,nrow(DF)) #make an empty vector of the right size for the factor scores
Sfactor.scores1 = rep(NA,nrow(DF))
Sfactor.scores2 = rep(NA,nrow(DF))

dims = as.integer(dimnames(DF.norway.complete.pa$scores)[[1]]) #converts the dimnames to integers
dims1 = as.integer(dimnames(DF.norway.miss.1.imputed.pa$scores)[[1]]) #converts the dimnames to integers
dims2 = as.integer(dimnames(DF.norway.miss.2.imputed.pa$scores)[[1]]) #converts the dimnames to integers

for (n in 1:nrow(DF)) #this puts the factor scores back into the big dataset
{
Sfactor.scores[dims[n]] = DF.norway.complete.pa$scores[n]
Sfactor.scores1[dims1[n]] = DF.norway.miss.1.imputed.pa$scores[n]
Sfactor.scores2[dims2[n]] = DF.norway.miss.2.imputed.pa$scores[n]
}

#reverse Sfactor scores
Sfactor.scores = Sfactor.scores*-1
Sfactor.scores1 = Sfactor.scores1*-1
Sfactor.scores2 = Sfactor.scores2*-1

#make DF
DF2 = cbind(read["LV2012estimatedIQ"],
read["Altinok2013ACH"],
read["IslamPewResearch2010"],
log(read["GDPpercapitaWorldBank2013"]),
read["S.scores"],
Sfactor.scores,
Sfactor.scores1,
Sfactor.scores2)

DF2.cor = rcorr(as.matrix(DF2)) #create correlation matrix with pairwise miss data deleted
round(DF2.cor$r,2)


#visualize
scatterplot(OutOfWork.2010to2014.men ~ LV2012estimatedIQ, #predict employment from IQ
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they don't work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "Lynn and Vanhanen national IQ",
ylab="% of men unemployed, 2010-2014 average")


scatterplot(OutOfWork.2010to2014.women ~ IslamPewResearch2010, #predict employment from Islam
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they don't work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "Prevalence of Islam",
ylab="% of women unemployed, 2010-2014 average")

scatterplot(Norway.tertiary.edu.att.bigsamples.2013 ~ S.scores, #predict ed. attainment from S factor
data=DF,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they don't work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "S factor",
ylab="fraction with long tertiary education")

scatterplot(Sfactor.scores2 ~ LV2012estimatedIQ, #predict S factor from IQ
data=DF2,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they don't work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "National IQ in home country",
ylab="General socioeconomic factor (S) in Norway",
main = "National IQ predicts immigrant group performance at the group level in Norway")

scatterplot(Sfactor.scores2 ~ IslamPewResearch2010, #predict S factor from Islam
data=DF2,
smoother=NULL, #no moving average
labels=unlist(read["ID"]), #labels, but they don't work
id.n = length(unlist(read["ID"])), #pointless, but needed
xlab = "National prevalence of Islam in home country",
ylab="General socioeconomic factor (S) in Norway",
main = "National Islam prevalence predicts immigrant group performance at the group level in Norway")


As you can see, I am using multiple imputation. There seems to be a bug, because MI messes up one of the variables, but only that one (Norway.tertiary.edu.att.bigsamples.2013). The error has no effect on the factor analysis (S factors from the complete-cases dataset correlate .99 or 1.00 with those from the imputed datasets). However, it has changed the values of cases that had actual data, and changed the mean and the standard deviation. Everything else seems fine. Odd?
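A quick way to verify that observed values were altered, assuming mi.data.frame() preserves the row order of its input (a minimal check, not part of the paper's code):

#observed values should be unchanged by imputation
obs = !is.na(DF.norway.miss.2[["Norway.tertiary.edu.att.bigsamples.2013"]])
all.equal(DF.norway.miss.2[["Norway.tertiary.edu.att.bigsamples.2013"]][obs],
          DF.norway.miss.2.imputed[["Norway.tertiary.edu.att.bigsamples.2013"]][obs])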

It can be seen in the descriptive statistics for each dataset (all data, complete cases, impute 1, impute 2).

> round(DF.desc.stats.ordered,2)
vars n mean sd median trimmed mad min max
NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014 1 26 1.31 0.87 1.25 1.24 1.11 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20141 1 15 1.41 0.99 1.50 1.36 1.19 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20142 1 18 1.33 0.93 1.15 1.28 0.96 0.20 3.20
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20143 1 26 1.16 0.86 0.85 1.08 0.89 -0.01 3.20
NorwayLarcenyAdjustedOddsRatioSkardhamar2014 2 26 0.77 0.56 0.60 0.74 0.59 0.10 2.00
NorwayLarcenyAdjustedOddsRatioSkardhamar20141 2 15 0.78 0.55 0.50 0.76 0.44 0.20 1.60
NorwayLarcenyAdjustedOddsRatioSkardhamar20142 2 18 0.72 0.53 0.55 0.71 0.44 0.10 1.60
NorwayLarcenyAdjustedOddsRatioSkardhamar20143 2 26 0.60 0.51 0.47 0.58 0.33 -0.26 1.60
Norway.tertiary.edu.att.bigsamples.2013 3 67 0.12 0.08 0.11 0.12 0.09 0.01 0.31
Norway.tertiary.edu.att.bigsamples.20131 3 15 0.10 0.07 0.09 0.10 0.07 0.01 0.23
Norway.tertiary.edu.att.bigsamples.20132 3 18 0.52 0.02 0.52 0.52 0.02 0.50 0.56
Norway.tertiary.edu.att.bigsamples.20133 3 26 0.52 0.02 0.52 0.52 0.02 0.50 0.56
OutOfWork.2010to2014.men 4 120 7.05 4.18 5.96 6.51 3.62 1.38 22.08
OutOfWork.2010to2014.men1 4 15 7.40 5.36 5.96 6.64 4.00 2.68 22.08
OutOfWork.2010to2014.men2 4 18 7.31 4.89 6.53 6.68 4.26 2.68 22.08
OutOfWork.2010to2014.men3 4 26 6.88 4.30 6.32 6.20 3.38 2.66 22.08
OutOfWork.2010to2014.women 5 120 7.50 4.97 6.30 6.75 2.95 1.32 31.82
OutOfWork.2010to2014.women1 5 15 8.90 6.20 7.30 8.40 4.69 1.98 22.42
OutOfWork.2010to2014.women2 5 18 8.17 5.93 6.48 7.67 4.46 1.90 22.42
OutOfWork.2010to2014.women3 5 26 7.40 5.25 6.48 6.69 3.53 1.56 22.42
Income.index.2009to2012 6 23 79.86 14.58 80.25 79.87 13.71 53.25 108.25
Income.index.2009to20121 6 15 78.78 14.78 78.25 78.48 10.75 53.25 108.25
Income.index.2009to20122 6 18 6852.23 2415.20 6500.53 6799.16 2890.42 2835.56 11718.06
Income.index.2009to20123 6 26 6689.35 2268.99 6500.53 6635.00 2031.16 2835.56 11718.06
range skew kurtosis se
NorwayViolentCrimeAdjustedOddsRatioSkardhamar2014 3.00 0.55 -0.83 0.17
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20141 3.00 0.39 -1.25 0.26
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20142 3.00 0.57 -0.94 0.22
NorwayViolentCrimeAdjustedOddsRatioSkardhamar20143 3.21 0.77 -0.28 0.17
NorwayLarcenyAdjustedOddsRatioSkardhamar2014 1.90 0.56 -1.09 0.11
NorwayLarcenyAdjustedOddsRatioSkardhamar20141 1.40 0.38 -1.74 0.14
NorwayLarcenyAdjustedOddsRatioSkardhamar20142 1.50 0.55 -1.42 0.12
NorwayLarcenyAdjustedOddsRatioSkardhamar20143 1.86 0.70 -0.58 0.10
Norway.tertiary.edu.att.bigsamples.2013 0.30 0.42 -0.91 0.01
Norway.tertiary.edu.att.bigsamples.20131 0.22 0.39 -1.32 0.02
Norway.tertiary.edu.att.bigsamples.20132 0.05 0.50 -1.12 0.00
Norway.tertiary.edu.att.bigsamples.20133 0.06 0.75 -0.64 0.00
OutOfWork.2010to2014.men 20.70 1.26 1.69 0.38
OutOfWork.2010to2014.men1 19.40 1.38 1.18 1.38
OutOfWork.2010to2014.men2 19.40 1.57 2.16 1.15
OutOfWork.2010to2014.men3 19.42 1.80 3.72 0.84
OutOfWork.2010to2014.women 30.50 1.92 5.11 0.45
OutOfWork.2010to2014.women1 20.44 0.83 -0.58 1.60
OutOfWork.2010to2014.women2 20.52 1.03 -0.04 1.40
OutOfWork.2010to2014.women3 20.86 1.30 1.14 1.03
Income.index.2009to2012 55.00 -0.01 -0.92 3.04
Income.index.2009to20121 55.00 0.19 -0.78 3.82
Income.index.2009to20122 8882.50 0.20 -0.97 569.27
Income.index.2009to20123 8882.50 0.24 -0.74 444.99
If someone wants to use multiple imputation (MI), I have nothing against it. But he or she seriously needs to read a lot about MI. I'm serious. When I read about it, it really gave me a headache. And I have also read some data analysts (Paul D. Allison, maybe?) saying the same thing: if you are not very sure about what you are doing, then don't do it. A lot of researchers don't know the proper way to use MI, and some have reported that others used sub-optimal imputation options.

That said, can you tell me the meaning of this syntax?

#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)

Why are there norway.miss.1 and norway.miss.2? Does that mean you use 2 data imputations? I ask because normally, in the literature, the number of imputed datasets is called "m", and in your code m=3. So does that mean you use 3*2=6 imputations? [edit: no, I understand now. You had 2 datasets, one with N=18 and the other with N=26, so in fact you use 3 imputations for each.]

Also, can you tell me the % of missing values per variable? Normally, the more missing values you have, the more imputations you need.
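Something like this would do it (a minimal sketch using your DF.norway object):

#percent of missing values per variable
round(100 * colMeans(is.na(DF.norway)), 1)
#percent of cases that are incomplete
round(100 * mean(!complete.cases(DF.norway)), 1)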
Admin
As you can see from the code, I reduced the full dataset to three subsets. One with complete cases

DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15

And two with missing values


DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26


The comments explain what I did.

The lines
#impute
DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations
DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)
DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations
DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)


The first and third lines run the imputation, with 200 iterations and 3 imputations. The second and fourth lines extract the imputed datasets.

We can try ML imputation too.
Admin
I switched to using the VIM package instead for imputing. This avoids the buggy results described above.

#subsets
DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26

DF.norway.miss.1.imp.vim = irmi(DF.norway.miss.1) #impute using VIM
DF.norway.miss.2.imp.vim = irmi(DF.norway.miss.2) #impute using VIM


The current draft is 10 pages, but I still need to add more. Perhaps 12-15 pages total when done.
Admin
Ok, I managed to keep it at 8 pages by dropping some of the scatterplots.

New version is attached as well as updated supplementary material.
You should make explicit how many imputations you had (m=3, or more). And remember that the more missing values you have, the more imputations you need.
http://www.statisticalhorizons.com/more-imputations
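For reference, one commonly cited rule of thumb (White et al., 2011) is to use at least as many imputations as the percentage of incomplete cases. A minimal sketch on the DF.norway subset from the code above:

pct.incomplete = 100 * mean(!complete.cases(DF.norway)) #percent of incomplete cases
ceiling(pct.incomplete) #suggested minimum number of imputations (m)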
Ok, I managed to keep it at 8 pages by dropping some of the scatterplots.

New version is attached as well as updated supplementary material.


You didn't make the corrections regarding the intro [url=bio-ecological]which I noted previously[/url].
Admin
Here is a new draft. I have made most of the changes suggested by Chuck and Dalliard. A native speaker has corrected the language in the paper. I have added a table in the appendix with S factor scores in NO and DK. I have also added another dataset that imputes cases with up to 3 missing values, which results in N=67 but more uncertain data. There are various other smaller changes.

Ideally, I'd like to switch to deterministic imputation so that my analyses are completely reproducible, but I haven't found an easy way to do this in R so far. It will very likely make little difference to the predictive results, which are already very strong (|.55| to |.78| for the fully imputed NO dataset, |.51| to |.71| for DK). They are also similar to the results from the unimputed datasets for those variables with sufficient sample size and variance.
The essence of imputation is that you never get exactly the same result twice. You only need to note that in the paper, as you did. But next time, I recommend using more imputations. Ideally, the number of imputations should be a function of the % of missing values.
Admin
One can use deterministic imputation with multiple regression. For a given variable, one uses every other variable to predict its values, perhaps including interactions. Then, in the cases where a value is missing, one imputes the predicted value.
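A minimal sketch of that idea in base R, a crude single pass without interactions; it only fills in a value when the case's other variables are observed, and the starting data frame is illustrative (names follow the code above).

DF.det = DF.norway.miss.2 #start from the subset with <= 2 missing values
for (v in names(DF.det)) {
  miss = is.na(DF.det[[v]])
  if (!any(miss)) next
  fit = lm(as.formula(paste(v, "~ .")), data = DF.det) #predict v from all other variables
  DF.det[[v]][miss] = predict(fit, newdata = DF.det[miss, , drop = FALSE]) #plug in predicted values where available
}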
I remember the first time I used imputation; it was in AMOS. I requested several datasets, one by one, and when working with each of them, I got identical results. I discovered afterwards that I had been using the "regression imputation" option, while the recommended option would have been "stochastic regression imputation" (even though AMOS is a bad tool for making imputed datasets). With the latter option, you cannot get identical datasets. And all data analysts will tell you not to work with imputation that is not stochastic: in the first case, there is no random component (i.e., no error term), and you will under-estimate the standard errors. When I said it's not possible to get identical datasets, I was referring to imputation with a random component, as is usually recommended.
1) "Recent studies show that criminality and other useful socioeconomic traits"

I don't think criminality is a "useful socioeconomic trait." Perhaps "important social and economic characteristics"?

2) Use the equal or greater than sign (≥) rather than >=.

3) "Are some predictors just generally better at predicting than others, or is there an interaction effect between predictor and variables?"

Not sure what you mean by interaction here. The question is whether any of the predictors have unique predictive power.

4) "using multiple imputation8 to impute data to cases with 1 or fewer missing values"

How is it possible to have fewer than 1 missing values?

5) "Table 4 shows description statistics"

Descriptive statistics.

6) "the squared multiple correlation of regression the first factor on the original variables"

Word missing or something.
Admin
Thank you for reviewing again, Dalliard.

1) "Recent studies show that criminality and other useful socioeconomic traits"


Removed "useful". The editor must have added it.

2) Use the equal or greater than sign (≥) rather than >=.


Done.

3) "Are some predictors just generally better at predicting than others, or is there an interaction effect between predictor and variables?"

Not sure what you mean by interaction here. The question is whether any of the predictors have unique predictive power.


An interaction would be that a given predictor P1 is better at predicting variable V1 than P2 is, while P2 is better at predicting V2 than P1 is: a predictor x outcome variable interaction. This shows up as correlations below |1| between the prediction correlation vectors. However, surprisingly, they were all close or fairly close to |1|.

For instance, one might have posited that Islam prevalence should be good at predicting crime due to an incompatible religion/culture, but that it has little influence on male unemployment. This wasn't found. The IQ x Islam vectors correlate |.99| (shown in Table 2). Surprising to me. I remember that when I first got my hands on the Danish data, I checked for this. Obviously, having read a lot about intelligence and education, I would expect IQ to be a better predictor than Islam, but perhaps the other way around for crime. However, it was pretty linear there too. Actually, I didn't think of this way of testing it before now. Perhaps I should add it to the reanalysis of the Danish data. The Danish data are much better suited for testing it since the number of outcome variables is much larger (9 in the present data vs. 25 in the Danish data).
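To make the test concrete, it mirrors the DF.predict.cor step in the main script (names as in the code above; only two predictors are shown here):

#each predictor's correlations with the outcome variables form a vector;
#correlating two such vectors tests for a predictor x outcome interaction
IQ.vec    = DF.cor$r[6:nrow(DF.cor$r), "LV2012estimatedIQ"]
Islam.vec = DF.cor$r[6:nrow(DF.cor$r), "IslamPewResearch2010"]
cor(IQ.vec, Islam.vec) #an absolute value near 1 means no interaction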

How would you like me to change this?

Islam does not correlate highly with the others (data from the DF.cor object in R):

Islam x IQ = -.27
x Altinok = -.43
x GDP = -.14
x International S = -.33

so it can be combined fruitfully with them in multiple regression. For instance, I tried IQ + Islam to predict S scores in Norway (imp. 3). Adjusted R2 is .59, i.e., r = .77, which is higher than any of the predictors alone (not counting the S scores from Denmark as a predictor; those reach .78).
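For illustration, that regression would look roughly like this (a sketch; Sfactor.scores3, the S scores from the dataset imputed with up to 3 missing values, is assumed to have been added to DF2 in the same way as the other score vectors):

fit = lm(Sfactor.scores3 ~ LV2012estimatedIQ + IslamPewResearch2010, data = DF2)
summary(fit)$adj.r.squared       #adjusted R2, ~ .59 as reported above
sqrt(summary(fit)$adj.r.squared) #multiple correlation, ~ .77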

4) "using multiple imputation8 to impute data to cases with 1 or fewer missing values"

How is it possible to have fewer than 1 missing values?


If a case has no missing values at all, it has fewer than 1. Check the relevant code:

#count NA's
DF.norway.missing = apply(DF.norway, 1, is.na) #matrix of TRUE/FALSE values, one column per case
DF.norway.missing = apply(DF.norway.missing, 2, sum) #number of missing values per case
DF.norway.missing.table = table(DF.norway.missing) #tabulates them

#subsets
DF.norway.complete = DF.norway[DF.norway.missing <= 0,] #complete cases only, reduces N to 15
DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18
DF.norway.miss.2 = DF.norway[DF.norway.missing <= 2,] #keep data with 2 or less missing values, N=26
DF.norway.miss.3 = DF.norway[DF.norway.missing <= 3,] #keep data with 3 or less missing values, N=67


5) "Table 4 shows description statistics"


Fixed.

6) "the squared multiple correlation of regression the first factor on the original variables"

Word missing or something.


Fixed to: "the squared multiple correlation of regressing the first factor on the original variables".
Emil, could you post the latest version?

"Factor analytic methods require that there are no missing values. The easiest and most common way to deal with this is to limit the data to the subset with complete cases. This however produces biased results if the data are not missing completely at random ...For the above reasons, I used three methods for handling missing cases"

One of the MI assumptions is that the data are MAR. Why would the possibility of MNAR then be a reason to use it (as opposed to deletion), as your wording suggests?
An interaction would be that a given predictor P1 is better at predicting variable V1 than P2 is, while P2 is better at predicting V2 than P1 is: a predictor x outcome variable interaction. This shows up as correlations below |1| between the prediction correlation vectors. However, surprisingly, they were all close or fairly close to |1|.


Perhaps someone here can translate for me? I don't understand the entire sentence.
Admin
P1 = predictor 1, P2 = predictor 2, etc.
V1 = outcome var 1, V2 = outcome var 2, etc.
Admin
I performed the analogous analysis on the Danish data as mentioned above.

> DF.Denmark.predict.cor
IQ Altinok Islam logGDP S.score
IQ 1.00 0.99 -0.96 0.98 0.99
Altinok 0.99 1.00 -0.94 0.98 0.98
Islam -0.96 -0.94 1.00 -0.94 -0.96
logGDP 0.98 0.98 -0.94 1.00 0.99
S.score 0.99 0.98 -0.96 0.99 1.00


The results are even better than in the Norwegian data. The Danish data are larger (25 variables) and of higher quality (age and sex controlled). Apparently the predictors' power is substantially general, not specific to particular outcomes, so theories of their explanatory power should likewise be general, not specific.
I performed the analogous analysis on the Danish data as mentioned above.

> DF.Denmark.predict.cor
IQ Altinok Islam logGDP S.score
IQ 1.00 0.99 -0.96 0.98 0.99
Altinok 0.99 1.00 -0.94 0.98 0.98
Islam -0.96 -0.94 1.00 -0.94 -0.96
logGDP 0.98 0.98 -0.94 1.00 0.99
S.score 0.99 0.98 -0.96 0.99 1.00


The results are even better than in the Norwegian data. The Danish data are larger (25 variables) and of higher quality (age and sex controlled). Apparently the predictors' power is substantially general, not specific to particular outcomes, so theories of their explanatory power should likewise be general, not specific.


Anyways, as I'm fine with the paper, I approve publication.
Admin
Here's a new revision.

It has a lot of smaller language changes, and the changes mentioned above. I have also added a short explanation of the predictor x outcome variable interaction idea that Dalliard noted wasn't well-enough explained. The appendix now contains a list of the variables in the Danish dataset too so that readers don't have to find the prior Danish study to know what I analyzed there.

Reviewers who ok'd this paper before the addition of the Danish re-analyses should re-read the submission and see if they still agree with publication.