[ODP] Examining the ICAR and CRT tests in a Danish student sample
Admin
Title
Examining the ICAR and CRT tests in a Danish student sample

Authors
Emil O. W. Kirkegaard
Oliver Nordbjerg

Abstract
We translated the International Cognitive Ability Resource sample test (ICAR16) and the Cognitive Reflection Test (CRT) into Danish. We administered the test online to a student sample (N=72, mean age 17.4). Factor analysis revealed a general factor. The summed score of all test items correlated .42 with GPA. Item difficulties correlated .85 with those reported in the Internet norming sample. Method of correlated vectors analysis showed positive relationships between g-loading of items/subtests and their correlation with GPA (r=.53/.85).

Files
https://osf.io/udgms/files/
Unlike some people, I'm not impressed by Bayesian methods. The Bayes factor, so often used and praised by, e.g., Sprenger et al. (2013) among many others, has the same problem as the p value and other significance tests. The Bayes factor is a function of effect size and sample size. The higher the sample size, the greater the difference in "fit" between the models (or hypotheses). In other words: new methods, same problems.

This is similar to a problem in hypothesis testing with CFA/SEM. Some fit indices, e.g., AIC and BIC, too easily show a difference. When that happens, a quick look at them gives the impression that the two models differ "significantly". But when looking at, e.g., CFI, Mc, ECVI, and RMSEA, one sometimes sees that the difference is really small, or too small to be of "practical significance".

If one believes that Bayes is superior to traditional methods of significance testing, he will be right. But if one believes that Bayes removes the problems inherent to significance testing, he will be wrong.
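
To make the sample-size dependence concrete, here is a minimal sketch in R. It relies on the BIC approximation to the Bayes factor for a single correlation, BF10 ≈ exp((BIC0 - BIC1)/2). That approximation is an assumption chosen for simplicity and is not the prior used by dedicated Bayes factor software; the function name bf10_bic is hypothetical.

```r
# BIC approximation to the Bayes factor for H1 (r != 0) against H0 (r = 0).
# For a simple linear regression, BIC0 - BIC1 = -n * log(1 - r^2) - log(n),
# so BF10 is approximately exp((BIC0 - BIC1) / 2).
bf10_bic <- function(r, n) {
  exp((-n * log(1 - r^2) - log(n)) / 2)
}

n  <- c(50, 100, 200, 500)
bf <- bf10_bic(r = 0.2, n = n)  # same effect size, growing sample

print(data.frame(n = n, bf10 = signif(bf, 3)))
```

With the effect size held fixed at r = .2, the approximate Bayes factor keeps growing as n increases, which is the behavior described above.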

Concerning item correlations, I recommend the following readings: Raven (2010), Wicherts & Johnson (2009), Rushton & Jensen (2010, p. 16). The first one is the most important for getting a grasp of the problem.

Raven, John (2010). Testing the Spearman-Jensen Hypothesis Using the Items of the RPM.
Wicherts, J. M., & Johnson, W. (2009). Group differences in the heritability of items and test scores. Proceedings of the Royal Society B: Biological Sciences, 276(1667), 2675-2683.
Rushton, J. P., & Jensen, A. R. (2010). Race and IQ: A theory-based review of the research in Richard Nisbett’s Intelligence and How to Get It. The Open Psychology Journal, 3(1), 9-35.

The use of MCV with a low number of subtests is problematic. And with just 5, MCV has no credibility in my eyes. I have explained that here and here.
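
As a rough illustration of why so few subtests are a problem, the sketch below computes a Fisher z confidence interval for the subtest-level MCV correlation reported in the abstract (r = .85 across 5 subtests). The choice of the Fisher interval is an assumption made here, and it is only approximate with so few data points; the function name r_ci is hypothetical.

```r
# Fisher z confidence interval for a correlation based on k data points.
# In MCV, k is the number of subtests (or items), not the number of people.
r_ci <- function(r, k, level = 0.95) {
  z  <- atanh(r)                      # Fisher transformation
  se <- 1 / sqrt(k - 3)               # standard error on the z scale
  q  <- qnorm(1 - (1 - level) / 2)
  tanh(c(lower = z - q * se, upper = z + q * se))
}

ci <- r_ci(r = 0.85, k = 5)  # r and k taken from the abstract above
print(round(ci, 2))
```

Under these assumptions the interval stretches from slightly below zero to almost 1, so a vector correlation from 5 subtests says very little on its own.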

The CRT seems to be a test first devised by Frederick (2005), and I first discovered it by reading his intriguing and fascinating article. One thing that strikes me is the shortness of the test. Another problem is that the answers to the test can be found by any curious internet user.
Admin
I've read about halfway through Kruschke's Bayesian textbook, and I'm not too impressed. However, it does allow specifying models for model comparison more explicitly.

https://www.goodreads.com/book/show/9003187-doing-bayesian-data-analysis

I also liked CFA/SEM methods more before I learned more about them. Maybe it's a general finding that the more you learn about some method, the less you like it...

-

Our analysis here has low power (N=72), so the results are not very convincing regarding the incremental validity of the CRT for GPA. Still, it is a first step. There is a good chance that I can get an agreement with my former 'high school', and then I can perhaps test their students every year, e.g. test them when they start and follow their grades for the next 3 years. I don't know yet, but this could yield more data as well as a truly predictive design instead of the cross-sectional one in this paper.

We used only NHST for the MCV results. I can remove it, as I think it is rather uninteresting. We didn't do any testing with the Bayes factors, just noted the results. I wanted to compute 'confidence intervals' (or some uncertainty measure) for these Bayes factors, but I couldn't find a way to do that. Also note that the ratio between the two BFs is not that great, weak evidence by the usual standards:

> BFs
                    bf        error                     time        code
CRT           3.344825 4.740419e-05 Wed Mar 18 07:26:34 2015 529c8d256f9
ICAR16 + CRT 23.402299 6.170106e-07 Wed Mar 18 07:26:34 2015 529c93723da
ICAR16       65.109398 6.403466e-07 Wed Mar 18 07:26:34 2015 529c4fe10e3
> 65.1/23.4
[1] 2.782051


So according to the interpretation keys on Wikipedia, this evidence strength is in the category "barely worth mentioning". https://en.wikipedia.org/wiki/Bayes_factor#Interpretation
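
The same comparison can be written out as a short R sketch. The Bayes factors are copied from the output above, and the cut points are the Jeffreys scale as listed in the linked Wikipedia article; the object names are only illustrative.

```r
# Bayes factors copied from the output above.
bf <- c(CRT = 3.344825, "ICAR16 + CRT" = 23.402299, ICAR16 = 65.109398)

# Evidence for the ICAR16-only model relative to the ICAR16 + CRT model.
ratio <- bf[["ICAR16"]] / bf[["ICAR16 + CRT"]]

# Jeffreys' scale as listed in the Wikipedia article linked above.
label <- cut(ratio,
             breaks = c(0, 1, 10^0.5, 10, 10^1.5, 100, Inf),
             labels = c("negative", "barely worth mentioning", "substantial",
                        "strong", "very strong", "decisive"))

cat(round(ratio, 2), "->", as.character(label), "\n")
```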



As for the item-level MCV, using Raven's items alone doesn't seem like a good idea without a large sample. Too little diversity. The Wicherts dataset has the Raven's items (N=500ish).

library(psych) #provides fa()
#read.mega(), rcorr2() and MCV.plot() are custom helper functions (assumed; not part of base R or psych)

wicherts = read.mega("wicherts_data.csv")

ravens.items = wicherts[,grep("rav\\d",colnames(wicherts))] #ravens items
ravens.items = ravens.items[,-36] #remove last item (no answers)
ravens.items$GPA = wicherts$Zgpa #add GPA
ravens.cor = rcorr2(ravens.items)$r #cors

ravens.fa = fa(ravens.items[,1:35]) #factor analyze the 35 items only (GPA excluded)
loadings = ravens.fa$loadings #loadings
GPA.item.cors = ravens.cor[36,1:35] #item x GPA cors

MCV.plot(loadings,GPA.item.cors)


See attached.
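
For readers without the Wicherts data or the helper functions, the logic of the item-level analysis above can be sketched with simulated data in base R. Everything here is an assumption made for illustration: the sample size, the true loadings, the strength of the GPA relation, and the use of factanal() in place of psych's fa().

```r
set.seed(1)
n_people <- 2000
n_items  <- 20

g        <- rnorm(n_people)                        # latent general ability
lambda   <- seq(0.3, 0.8, length.out = n_items)    # true item loadings (assumed)
latent   <- outer(g, lambda) +
            sapply(sqrt(1 - lambda^2), function(s) rnorm(n_people, sd = s))
items    <- (latent > 0) * 1                       # dichotomous pass/fail items
gpa      <- 0.5 * g + rnorm(n_people)              # criterion related to g

# Vector 1: loadings on a single factor (base R maximum likelihood FA).
loadings <- factanal(items, factors = 1)$loadings[, 1]
if (mean(loadings) < 0) loadings <- -loadings      # fix arbitrary sign

# Vector 2: each item's correlation with the criterion.
item_gpa <- as.vector(cor(items, gpa))

mcv_r <- cor(loadings, item_gpa)                   # the MCV correlation
print(round(mcv_r, 2))
```

Because the criterion is related to the items only through g in this simulation, the two vectors should correlate positively; real item data are less tidy.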

As for the CRT, it is interesting in its shortness. But the items are getting too common now, so that there is a serious chance of testees being acquainted with them from some other source.
1) Regarding MH's (and Raven's) concerns about MCV, the expectation is that MCV correlations with item data will be lower than with subtest data because the former are less reliable and more heterogeneous in terms of reliable variance. This does not mean that MCV should not be used with item data, just that the results may be weaker, on the average.

2) "He posted it twice with a about a week's delay"

remove the extra 'a'

3) "Gymnasie is a secondary education taken by approximately two thirds of a generation. In years of education, it is grade 10-13 or 9-12."

'High school' appears to be the closest equivalent of 'gymnasie', which should be mentioned for clarity's sake.

4) "92% of the students attended STX, with the rest attending various other kinds (HTX 1, HF, 4, HHX 1)."

What are these acronyms?

5) "Since no fact-sheet was provided"

By 'fact-sheet', do you mean an answer key?

6) "traditional factor analysis and item response theory factor analysis"

Specify/explain the algorithms used.
The paper gives very little info about the nature of the test. What exactly do the subtests measure, and what do the acronyms in Tables 1 and 3 and Figure 2 (VR, LN etc) mean?

In what way are VR and LN so similar that they contribute most to the g factor of this test, and in which way is R3D different from the others, to give it a lower g loading? If R3D is 3-dimensional rotation and the others are mainly tests of "crystallized" abilities, that would be the explanation.

Page 2, footnote 3: Here you should specify that gymnasie is the academic branch of the educational system, so you expect people with higher IQ and a theoretical rather than practical slant in this school type.

The abbreviations on page 2 should be explained: HTX1, HF etc.

A few typos, for example Wechschler test in 2nd line of intro and least sqauare in 2nd line of section 6.1.
Admin
We will work out a new draft based on the above. Thanks for reviewing.

I won't have time until after the conference in London, 7-11 May, http://emilkirkegaard.dk/LCI15/. (See you there!)
Admin
Using the Jensen method (I think this is a more suitable name than MCV, which is unspecific, as all correlations involve two vectors...) on homogeneous item-level data probably just results in spurious positive findings. I will add this to a future revision of my paper here https://thewinnower.com/papers/spearman-s-hypothesis-on-item-level-data-from-raven-s-standard-progressive-matrices-a-replication-and-extension Basically, doing this on purely Raven's items should always be positive due to a statistical artifact, regardless of whether SH is true for the groups or not. But my mega-analysis shows that it does hold, so I guess others can use their own interpretation of that. I have 2 more samples not added to the analysis, which will bring the total up to 66 IIRC.

Back to the present article. Sorry I have been busy with other things, but also I just get distracted easily with new projects.

Dalliard,

2)
Fixed.

3)
Changed footnote to
Gymnasie is a secondary education taken by approximately two thirds of a generation. In years of education, it is grade 10-13 or 9-12 depending on whether the student took the optional 10th grade or not. In US terms, it is similar but not identical to high school.


4)
Added a footnote:
In Denmark there are 4 main types of gymnasier. STX is the standard type, HTX is technology-oriented, HHX is trade-oriented. Lastly, HF is a shorter 2-year which gives a roughly equivalent degree.


5)
By 'fact-sheet', do you mean an answer key?

Yes.

6)
Specify/explain the algorithms used.


Here or in the paper?

I changed the paragraph to:

To examine the internal structure, we used both traditional factor analysis and item response theory factor analysis on all the cognitive items to extract 1 factor. Although popular, principal components analysis has been shown to give misleading results in some cases\cite{Kirkegaard2014The}. For this reason, we used another extraction method which by default is MinRes, but it does not appear to make a large difference which method is used\cite{Kirkegaard2014The}. The functions fa() and irt.fa() from the psych package were used for extraction.

Factor loadings are shown in Table \ref{itemloadings}. The factor congruence was 1.00.


I don't know how IRT works in detail, but it is designed to work on item-level data (i.e. dichotomous) whereas traditional FA is designed to work on continuous data. The main goal was to show the two methods give similar results.
Admin
The paper gives very little info about the nature of the test. What exactly do the subtests measure, and what do the acronyms in Tables 1 and 3 and Figure 2 (VR, LN etc) mean?


We cite the Condon and Revelle paper for the test in the introduction. You can find the test in their supplementary material. We use the 16-item abbreviated version, translated to Danish.

The subtests are:
LN = Alphanumerical series. e.g. A B C D _ fill in E.
VR = Verbal reasoning.
R3D = 3D rotation.
MR = matrix reasoning. Similar to Raven's.
CRT = Cognitive Reflection Test. 3 questions supposed to measure whether one is a reflective thinker. You can find this easily all over the internet.

In what way are VR and LN so similar that they contribute most to the g factor of this test, and in which way is R3D different from the others, to give it a lower g loading? If R3D is 3-dimensional rotation and the others are mainly tests of "crystallized" abilities, that would be the explanation.


They are not particularly similar.

I don't think it is appropriate to speculate about the exact loadings of the items due to the small sample size. The main point in showing the loadings is to show that they are all positive as they should be.

You could use the large sample of 1300 or so cases built into the psych package (called ability) to examine factor loadings if you want to try to make sense of them. This was not a major goal of this paper.

Page 2, footnote 3: Here you should specify that gymnasie is the academic branch of the educational system, so you expect people with higher IQ and a theoretical rather than practical slant in this school type.


See the reply to Dalliard above. Added:
The gymnasie is meant as a preparation for further education, so it is somewhat selected for academic ability and hence general intelligence.


The abbreviations on page 2 should be explained: HTX1, HF etc.


Has been done.

A few typos, for example Wechschler test in 2nd line of intro and least sqauare in 2nd line of section 6.1.


Fixed the grammar error in the intro.

Fixed the OLS typo.

I will see if I can upload the new version.
Admin
I have put the article on Authorea as a test. The figures are missing and tables somewhat messed up, but the text has been updated.

https://www.authorea.com/users/24740/articles/40233/_show_article
1) "92% of the students attended STX, with the rest attending various other kinds (HTX 1, HF, 4, HHX 1)."

What do those numbers (1, 4, 1) stand for, percentages?

2) "To examine the internal structure, we used both traditional factor analysis and item response theory factor analysis on all the cognitive items to extract 1 factor."

According to the explanation here (http://www.inside-r.org/packages/cran/psych/docs/irt.fa), IRT factor analysis means that first a tetrachoric correlation matrix is calculated from item data and then the correlation matrix is factor-analyzed in the usual way. In other words, it is assumed that normally distributed continuous latent variables underlie dichotomous item responses. Add this explanation to the paper because it's not obvious otherwise.

3) "Next we compared the item means and SDs (shown in Table undefined) with those published by Condon and Revelle(Condon 2014). The correlations were .85 and .63 indicating high construct reliability across languages and samples."

On what basis are those values high? Unless you have some good standard to compare them against, I would say that the values suggest reasonable or good congruence between languages, especially as your sample is small.

4) "Both predictors had positive betas, however, the beta for CRT was only .11 and with a relatively wide confidence interval because of the small sample size (72), so we cannot be very certain about its real value"

The CIs are the same for both tests, so the amount of uncertainty is the same. I would just say that the 95% CI for CRT includes zero.
Admin
Dalliard,

Thanks for the comments. Quotations below are from you unless otherwise stated.

What do those numbers (1, 4, 1) stand for, percentages?


They were raw numbers. I have changed them to percentages and checked that they sum to 100%.

According to the explanation here, IRT factor analysis means that first a tetrachoric correlation matrix is calculated from item data and then the correlation matrix is factor-analyzed in the usual way. In other words, it is assumed that normally distributed continuous latent variables underlie dichotomous item responses. Add this explanation to the paper because it's not obvious otherwise.


Yes, it uses tetrachoric correlations, which estimate what the Pearson correlations would be if the variables weren't dichotomous. I have added:

To examine the internal structure, we used both traditional factor analysis (FA) and item response theory factor analysis (IRT FA) on all the cognitive items to extract 1 factor. Although popular, principal components analysis has been shown to give misleading results in some cases.[6] For this reason, we used another extraction method which by default is MinRes, but it does not appear to make a large difference which method is used.[6] The functions fa() and irt.fa() from the psych package were used for extraction.[7] The difference between FA and IRT FA is that latter is done on the correlation matrix calculated using tetrachronic correlations. A tetrachronic correlation is designed to estimate the regular Pearson correlation when used on dichotomous variables such as correct/incorrect items


On what basis are those values high? Unless you have some good standard to compare them against, I would say that the values suggest reasonable or good congruence between languages, especially as your sample is small.


In my analysis of item-level SPM studies, I found a mean item pass rate correlation of .88 across 66 comparisons. These were very diverse samples (Roma, White, Colored, Black, Indian, North African). The SPM has 60 items, making it much easier to attain high correlations than with only 16 items. Thus, the pass rate correlation of .85 obtained in this study seems remarkably good given that this is a partly verbal translated test.

http://emilkirkegaard.dk/en/?p=4971

I'm not aware of any standard of comparison for standard deviations, but I know that they fluctuate more than the item pass rates. A value of .63 thus seems good to me.

The CIs are the same for both tests, so the amount of uncertainty is the same. I would just say that the 95% CI for CRT includes zero.


I have changed the text to:

Both predictors had positive betas; however, the beta for CRT was only .11 and the confidence interval included 0. Due to the small sample, this may either be because it has no incremental validity or because power was too low to detect it.


---

I have updated the OSF with the new files. PDF draft #7

https://osf.io/2ipgb/
1) "A large fraction of cognitive and personality tests are privately owned and are usually very expensive to obtain legally (e.g. the Wechschler test is owned by Pearson)."

It's spelled Wechsler. Given that individually administered tests like Wechsler's comprise booklets and physical stimulus materials (http://mla-s2-p.mlstatic.com/wisc-iiiequipo-completo-en-bolso-y-con-cajitas-de-acrilico-4061-MLA120359773_1740-F.jpg), the suggestion that you could obtain them illegally is unusual (i.e., you'd have to physically steal them).

2) "The difference between FA and IRT FA is that latter is done on the correlation matrix calculated using tetrachronic correlations. A tetrachronic correlation is designed to estimate the regular Pearson correlation when used on dichotomous variables such as correct/incorrect items."

This is not quite correct. Pearson correlations can be calculated for dichotomous data -- that's what the "traditional factor analysis" in your paper is about. IRT FA is based on the assumption that underlying the observed dichotomous data are normally distributed continuous latent variables. Tetrachoric (not tetrachronic!) correlations are used to estimate correlations between these latent variables.

I would put it like this:

"The difference between FA and IRT FA is that the latter is done on a correlation matrix calculated using tetrachoric correlations. A tetrachoric correlation estimates the Pearson correlation between two normally distributed continuous latent variables that are assumed to underlie dichotomous variables such as correct/incorrect items."

3) The caption of Table 6 should read "... Parameter ESTIMATES from OLS regression."

4) You apparently have the subjects' ages. How do they influence test scores? It would make sense to adjust the scores for age.

5) There are, in principle, ethical issues when you recruit human subjects. How did you present your project to the participants?
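
The distinction drawn in point 2 above can be illustrated with a small simulation in base R. This is only a sketch: the latent correlation of .5 and the median-split thresholds are assumptions, and the cosine-pi formula is a simple approximation used here as a stand-in for a maximum likelihood tetrachoric estimate.

```r
set.seed(1)
n   <- 20000
rho <- 0.5                                  # latent correlation (assumed)

x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)   # bivariate normal latent variables

xb <- as.integer(x > 0)                     # observed correct/incorrect items
yb <- as.integer(y > 0)

phi <- cor(xb, yb)                          # Pearson correlation on 0/1 data

# Cosine-pi approximation to the tetrachoric correlation from the 2x2 table.
tab <- table(xb, yb) / n
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
tet <- cos(pi / (1 + sqrt(odds_ratio)))

print(round(c(phi = phi, tetrachoric = tet), 2))
```

The Pearson correlation computed on the dichotomous responses comes out clearly below the latent correlation, while the tetrachoric approximation recovers something close to it.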
Admin
Dalliard,

Thanks for commenting again.

It's spelled Wechsler. Given that individually administered tests like Wechsler's comprise booklets and physical stimulus materials, the suggestion that you could obtain them illegally is unusual (i.e., you'd have to physically steal them).


You seem to neglect the possibility that one could simply copy them, which is what I was alluding to. This wouldn't be stealing, but it would involve breaching intellectual monopoly laws (copyright in this case).

This is not quite correct. Pearson correlations can be calculated for dichotomous data -- that's what the "traditional factor analysis" in your paper is about. IRT FA is based on the assumption that underlying the observed dichotomous data are normally distributed continuous latent variables. Tetrachoric (not tetrachronic!) correlations are used to estimate correlations between these latent variables.


We don't seem to be in disagreement, aside from the spelling error. I will swap to your preferred phrasing as it makes no difference to me.

3) The caption of Table 6 should read "... Parameter ESTIMATES from OLS regression."


Ok.

4) You apparently have the subjects' ages. How do they influence test scores? It would make sense to adjust the scores for age.


We have the variables listed in section 2. You could download the data yourself if you want to examine age. It was not a planned research question for us and besides the variation in age is fairly small (sd = 1.25), so the expected effect size of age is not very large and sample size is way too small to detect it reliably.

I tried now that you asked. The correlation with factor scores is .11. Running a partial correlation with factor scores and GPA controlling for age resulted in r = .404, or pretty much the same as without controlling for age.

5) There are, in principle, ethical issues when you recruit human subjects. How did you present your project to the participants?


The questionnaire can be found here: https://docs.google.com/forms/d/1QYXN9OL-_BKhVZleLul6zEZWBMcsdWDDt6qJjB4cQog/edit#

The description:

Vi ønsker at studere gymnasieelevers karakterer og deres sammenhæng med andre faktorer. Derfor har vi sammensat dette spørgeskema som vi håber du vil være med til at svare på.

Alle besvarelser er anonyme. Spørgeskemaet er struktureret som følger.

Først spørger vi efter grundliggende demografisk information. Derefter spørger vi om karakterer.

Derefter kommer en kognitive test i 5 dele. Del 1 måler din evne til at reflektere, del 2 måler din evne til sproglig ræsonnering, del 3 måler din evne til finde alfanumeriske mønstrer, del 4 måler din evne til se mønstrer i figurer, og del 5 måler din evne til at rotere figurer i 3D.

Der er INGEN TIDSBEGRÆNSNING på opgaverne, så giv dig god tid uden at blive forstyrret. Det tager ca. 10-15 minutter at besvare.


In English (just a quick translation):

We are interested in studying the grades of gymnasie students and their relationship to other factors. For this reason we have put together this questionnaire which we hope that you will take part in.

All responses are anonymous. The questionnaire is structured as follows:

First we ask about basic demographic information. After that we ask about grades.

Then comes a cognitive test in 5 parts. Part 1 measures your ability to reflect, part 2 measures your ability to reason verbally, part 3 measures your ability to find alphanumeric patterns, part 4 measures your ability to see patterns in figures and part 5 measures your ability to rotate figures in 3D.

There is NO TIME LIMIT on the tasks, so give yourself good time without being distracted. It takes about 10-15 minutes to answer.


We did not seek any kind of pre-approval for this, either from the schools or my university as that seemed to be a waste of time.

---

Changes:
* Fixed spelling of Wechsler.
* Swapped to Dalliard's preferred phrasing re. FA and IRT FA.
* Changed Table 6 caption.
* Fixed "international reliability measures" in Table 4 caption.

I have updated the file on OSF: https://osf.io/2ipgb/
You seem to neglect the possibility that one could simply copy them, which is what I was alluding to. This wouldn't be stealing, but it would involve breaching intellectual monopoly laws (copyright in this case).


It takes a dedicated pirate to copy, for example, the plastic blocks (http://log24.com/log/images/020831-wechsler.jpg) used in Wechsler's block design test, but okay.

It's not true that there's "no good reason" for psychometric tests not being free. The obvious advantage of a commercial IQ test is the accompanying normative data based on a representative national sample. It's not cheap for psychologists to administer tests one-on-one to thousands of people. The ICAR doesn't have this kind of normative data. Group differences cannot be properly studied without random samples, for one thing.

We have the variables listed in section 2. You could download the data yourself if you want to examine age. It was not a planned research question for us and besides the variation in age is fairly small (sd = 1.25), so the expected effect size of age is not very large and sample size is way too small to detect it reliably.

I tried now that you asked. The correlation with factor scores is .11. Running a partial correlation with factor scores and GPA controlling for age resulted in r = .404, or pretty much the same as without controlling for age.


It's standard practice to residualize IQ scores for age before factor analysis, and this should have been done in your paper, too. However, given the small age range in your data, age effects cannot be large, so I don't think it's a big problem. You should add a mention of at least the GPA-IQ correlation adjusting for age.

I approve publication once you add some discussion of the effect of the age variable in the paper.
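
The age adjustment discussed above can be sketched in R as follows. The data are simulated, so the numbers are not those of the study; only N = 72 and the age mean and SD are taken from the thread, and the effect sizes are assumptions. The function name partial_cor is hypothetical.

```r
set.seed(1)
n   <- 72                                   # same N as the study; data are simulated
age <- rnorm(n, mean = 17.4, sd = 1.25)
iq  <- 0.1 * age + rnorm(n)                 # weak age effect (assumed)
gpa <- 0.4 * iq + rnorm(n)

# Partial correlation of x and y controlling for z, from the three correlations.
partial_cor <- function(x, y, z) {
  rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
  (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
}

# Equivalent route: residualize both variables for age, then correlate.
r_resid <- cor(resid(lm(iq ~ age)), resid(lm(gpa ~ age)))

print(round(c(raw = cor(iq, gpa), partial = partial_cor(iq, gpa, age),
              residualized = r_resid), 3))
```

The formula and the residualization route give the same partial correlation, so either can be reported. Note that this residualizes both variables for age, whereas residualizing only the test scores would give a semipartial correlation.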
Admin
Dalliard,

I have added the following to the discussion section at the end:

One reviewer criticized the study for not regressing out the effect of age. One could do this, but given the small variation (SD = 1.25) of age in the sample, the expected effect size of age was minute. However, we did calculate the correlation of the IRT FA scores with age, which was .11. The partial correlation of the IRT FA scores with GPA controlling for age was virtually identical at .404.


I have updated the OSF with the new files.

---

It takes a dedicated pirate to copy, for example, the plastic blocks used in Wechsler's block design test, but okay.

It's not true that there's "no good reason" for psychometric tests not being free. The obvious advantage of a commercial IQ test is the accompanying normative data based on a representative national sample. It's not cheap for psychologists to administer tests one-on-one to thousands of people. The ICAR doesn't have this kind of normative data. Group differences cannot be properly studied without random samples, for one thing.


Oh, pirates are very dedicated. There's an entire Asian industry for fake smart phones for instance.

The things you mention could easily be done with less money than is currently spent on buying tests from the testing companies. The government can easily obtain random samples. Cognitive tests are not exactly difficult to make; pretty much no matter what you do, you end up with a g test, if I may cite Dalliard 2013. ;)

Besides, test companies are pretty reluctant to share their data, especially with regards to group differences. I recall that at least one testing company has a moratorium on sharing data for studying racial differences. If the government sponsored the data, it would be publicly available for any purpose.
I approve publication.

While it would make sense to have freely available, publicly funded IQ tests with good normative data, I like the fact that there are competing commercial operators making and selling tests. If the construction and standardization of tests was a government monopoly, the whole process could become politicized.
This looks okay now. Ready for publishing.
Last paragraph of introduction is sloppy. What you are doing is a psychometric validation of a Danish translation of psychometric tests originally made for English speakers. So instead of saying “we decided”, I’d say “the aim of this study is…” and instead of “to translate the above two tests into Danish and administer them to a student sample to verify that the tests function as expected”, say “to psychometrically validate the Danish translation of two tests using a student sample”. For example, see http://www.ncbi.nlm.nih.gov/pubmed/17516705 or http://www.pec-journal.com/article/S0738-3991(13)00035-9/abstract
The title should be edited accordingly. Instead of “examine”, say “Validating the Danish translation of the ICAR and CRT tests in a student sample”.
“we used another extraction method which by default is MinRes”: non R users do not know what minres stands for (unweighted least squares). This should be specified.
Admin
Piffer,

Thanks for taking the time to review this paper.

Last paragraph of introduction is sloppy. What you are doing is a psychometric validation of a Danish translation of psychometric tests originally made for English speakers. So instead of saying “we decided”, I’d say “the aim of this study is…” and instead of “to translate the above two tests into Danish and administer them to a student sample to verify that the tests function as expected”, say “to psychometrically validate the Danish translation of two tests using a student sample”. For example, see http://www.ncbi.nlm.nih.gov/pubmed/17516705 or http://www.pec-journal.com/article/S0738-3991(13)00035-9/abstract
The title should be edited accordingly. Instead of “examine”, say “Validating the Danish translation of the ICAR and CRT tests in a student sample”.
“we used another extraction method which by default is MinRes”: non R users do not know what minres stands for (unweighted least squares). This should be specified.


Title changed to:

Validating a Danish translation of the International Cognitive Ability Resource sample test and Cognitive Reflection Test in a student sample


Changed paragraph to:

Since we want to contribute to the on-going development of free psychology tools and have a Danish language test to use for future projects, the aim of this study was to psychometrically validate the Danish translation of two tests using a student sample.


Changed sentence to:

For this reason, we used another extraction method which by default is MinRes (minimum residual), but it does not appear to make a large difference which method is used.\cite{Kirkegaard2014The}


Let me know whether these edits are okay, then I will upload the new version.


Yes these are ok. I look forward to reading the new version so that I can give my approval.