[ODP] Crime, income and employment among immigrant groups in Norway and Finland - Printable Version

+- OpenPsych forums (https://www.openpsych.net/forum)
+-- Forum: Forums (https://www.openpsych.net/forum/forumdisplay.php?fid=1)
+--- Forum: Post-review discussions (https://www.openpsych.net/forum/forumdisplay.php?fid=5)
+--- Thread: [ODP] Crime, income and employment among immigrant groups in Norway and Finland (/showthread.php?tid=136)

RE: [ODP] Crime, income and employment among immigrant groups in Norway and Finland - Meng Hu - 2014-Sep-11
I have a good knowledge of imputation. If you want, I can guide you toward the best methods. Not all imputations are equivalent; some are adequate only for some types of data. What data are you using? Is it the file named "dataset.csv" that you uploaded here? Because I'm not sure I can recommend the use of imputation. I have explained that here. Your variables must be highly correlated (approaching at least 0.40 or so). The ratio of subjects to variables (entered into the imputation procedure) should not fall below 10:1. Alternatively, Hardt et al. (2012) recommend a maximum ratio of 1:3 for variables (with or without auxiliaries) against complete cases; that is, for 60 people with complete data, up to 20 variables could be used. Auxiliary variables are those that can substitute for others because of their high correlation, such as identical variables measured at different points in time (i.e., repeated measures).

If the % of missing cases is too high, as in your variables ViolentCrimeNorway, LarcenyNorway, ViolentCrimeFinland, and LarcenyFinland, I can tell you'll have big trouble.

Besides, the superiority of imputation over complete-case analysis is not restricted to factor analysis, but extends to all kinds of analyses. For example, multiple regression on complete cases is inefficient, and then nearly 100% of such analyses published in various other journals would be wrong, because they almost never apply imputation, either because the authors don't know a thing about it or because it's time consuming. A minimum of 5 imputations is recommended, but more may be needed given the features of your data. With 5 imputations, you will need to run the analysis five times, once on each imputed dataset, and then average the results. As is recommended, you should also provide the standard error, CI, or standard deviation, to let readers know how much the estimates vary over the imputations.
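The run-the-analysis-m-times-then-average workflow described above is Rubin's pooling rule. A minimal base-R sketch of it (the five estimates and standard errors below are invented numbers standing in for the results of five per-imputation analyses):

```r
# Rubin's rules by hand: pool m per-imputation estimates and standard errors.
# (Illustrative numbers; in practice these come from running the same
# analysis once on each imputed dataset.)
est <- c(0.52, 0.48, 0.55, 0.50, 0.45)   # estimate from each of m = 5 imputations
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)   # its standard error in each run
m   <- length(est)

q_bar <- mean(est)                 # pooled point estimate
W     <- mean(se^2)                # within-imputation variance
B     <- var(est)                  # between-imputation variance
T_var <- W + (1 + 1/m) * B         # total variance (Rubin, 1987)
se_pooled <- sqrt(T_var)           # pooled standard error to report

round(c(estimate = q_bar, se = se_pooled), 3)
```

A large `B` relative to `W` is exactly the "estimates vary too much over the imputations" warning sign discussed below, and the usual cue to raise m.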
If your estimates vary too much, that may be a problem: a signal that your estimate is not stable and that you may need more imputations (10, 20, 30, etc.). But repeating the analysis 30 times with your 30 datasets is something researchers don't want to do, and I understand that... Personally, I prefer maximum likelihood (ML) estimation, because multiple imputation (MI) gives me a headache about choosing which kind of MI is appropriate for the data at hand. If you have AMOS, or R, you can easily use ML.

*****

Regarding the last sentence of your paper, "All datasets and source code is available in the supplementary materials.", it's optional, but I recommend you add "on OpenPsych Forum".

Imputation - Emil - 2014-Sep-11
Here is my current code. The updated dataset file is attached.

Code: `setwd("Z:/Code/database")`

As you can see, I am using multiple imputation. There seems to be a bug, because MI mangles one of the variables, but only that one (Norway.tertiary.edu.att.bigsamples.2013). The mangling has no effect on the factor analysis (S factors from the complete-cases dataset correlate .99 or 1.00 with those from the imputed datasets). However, it has changed the data of the cases with actual data, and changed the mean and the standard deviation. Everything else seems fine. Odd? It can be seen in the descriptive statistics for each dataset (all data, complete cases, impute 1, impute 2).

Code: `> round(DF.desc.stats.ordered,2)`

RE: [ODP] Crime, income and employment among immigrant groups in Norway and Finland - Meng Hu - 2014-Sep-11
If someone wants to use multiple imputation (MI), I have nothing against it. But (s)he seriously needs to read a lot about MI. I'm serious. When I read about it, it really gave me a headache. And I have also read some data analysts (Paul D. Allison, maybe?) saying the same thing: if you are not very sure about what you are doing, then don't do it. A lot of researchers don't know the proper way to use MI, and some have reported that others used a sub-optimal imputation option.

This said, can you tell me the meaning of this syntax?

Code: `#impute`
Code: `DF.norway.miss.1.impute = mi(DF.norway.miss.1, n.iter=200) #imputes, needs more iterations`
Code: `DF.norway.miss.1.imputed = mi.data.frame(DF.norway.miss.1.impute, m = 3)`
Code: `DF.norway.miss.2.impute = mi(DF.norway.miss.2, n.iter=200) #imputes, needs more iterations`
Code: `DF.norway.miss.2.imputed = mi.data.frame(DF.norway.miss.2.impute, m = 3)`

Why is there norway.miss.1 and norway.miss.2? Does that mean you use 2 data imputations? I ask this because normally, in the literature, the number of imputed datasets is called "m", and in your code, m = 3. So, does that mean you use 3*2 = 6 imputations? [edit: no, I understand now. You had 2 datasets, one with N=18, the other with N=26, so in fact you use 3 imputations.]

Also, can you tell me the % of missing values per variable? Normally, the more missing values you have, the more imputations you need.

Number of imputations - Emil - 2014-Sep-11
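The per-variable missing percentage asked about above takes one line of base R. A sketch on a toy data frame standing in for the real dataset (the variable names and values here are hypothetical, for illustration only):

```r
# Toy stand-in for the real dataset; names and values are invented
df <- data.frame(
  ViolentCrime = c(1.2, 0.8, NA, 2.1, NA),
  Larceny      = c(0.9, NA, 1.4, 1.8, 0.7),
  Income       = c(310, 280, 300, 250, 330)
)

# Percent missing per variable: the higher it is, the more imputations needed
pct_missing <- 100 * colMeans(is.na(df))
print(round(pct_missing, 1))
```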
As you can see from the code, I reduced the full dataset to three subsets. One with complete cases:

Code: `DF.norway.complete = DF.norway[complete.cases(DF.norway),] #complete cases only, reduces N to 15`

And two with missing values:

Code: `DF.norway.miss.1 = DF.norway[DF.norway.missing <= 1,] #keep data with 1 or less missing values, N=18`

The comments explain what I did. The lines

Code: `#impute`

The first and third lines run the imputation, with 200 iterations and 3 imputations. The second and fourth lines extract the imputed datasets. We can try ML imputation too.

VIM - Emil - 2014-Sep-11
I switched to using the VIM package instead for imputing. This produces results with bugs.

Code: `#subsets`

The current draft is 10 pages, but I still need to add more. Perhaps 12-15 pages total when done.

New draft - Emil - 2014-Sep-12
Ok, I managed to keep it at 8 pages by dropping some of the scatter plots. The new version is attached, as well as updated supplementary material.

RE: [ODP] Crime, income and employment among immigrant groups in Norway and Finland - Meng Hu - 2014-Sep-12
You should make explicit how many imputations you used (m=3, or more). And remember that the more missing values you have, the more imputations you need. http://www.statisticalhorizons.com/more-imputations

RE: [ODP] Crime, income and employment among immigrant groups in Norway and Finland - Chuck - 2014-Sep-12
(2014-Sep-12, 04:16:01) Emil Wrote: Ok, I managed to keep it at 8 pages due to dropping some of the scatter-plots.

You didn't make the corrections regarding the intro which I noted previously.

New draft, 16th sep. - Emil - 2014-Sep-16
Here is a new draft. I have made most of the changes suggested by Chuck and Dalliard. A native speaker has corrected the language in the paper. I have added a table in the appendix with S factor scores in NO and DK. I have added another dataset with up to 3 imputations, which results in N=67 but more uncertain data. Various other smaller changes.

Optimally, I'd like to switch to deterministic imputation so that my analyses are completely reproducible, but I haven't found an easy way to do this in R so far. It will very likely make little difference to the predictive results, which are already very strong (|.55| - |.78| for the fully imputed NO dataset, |.51| - |.71| for DK). They are also similar to the unimputed datasets for those variables with sufficient sample size and variance.

RE: [ODP] Crime, income and employment among immigrant groups in Norway and Finland - Meng Hu - 2014-Sep-16
The essence of stochastic imputation is that you never get exactly the same result twice. You only need to state that in the paper, as you did. But next time, I recommend using more imputations. Ideally, the number of imputations should be a function of the % of missing values.
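A note on "never get exactly the same result twice": the run-to-run variation comes entirely from the random draws inside the imputation, so fixing the RNG seed before imputing makes a stochastic imputation exactly reproducible. A base-R sketch, with `rnorm` standing in for an imputer's random draws:

```r
# Stochastic imputation differs between runs only because of the RNG.
# Fixing the seed before each run reproduces the draws exactly;
# the same set.seed() call before mi() would make its output repeatable.
set.seed(42)
a <- rnorm(5)    # first "run"
set.seed(42)
b <- rnorm(5)    # second "run" with the same seed
identical(a, b)  # TRUE: same seed, same draws
```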