[ODP] Examining the ICAR and CRT tests in a Danish student sample

#1
Title
Examining the ICAR and CRT tests in a Danish student sample

Authors
Emil O. W. Kirkegaard
Oliver Nordbjerg

Abstract
We translated the International Cognitive Ability Resource sample test (ICAR16) and the Cognitive Reflection Test (CRT) into Danish. We administered the test online to a student sample (N=72, mean age 17.4). Factor analysis revealed a general factor. The summed score of all test items correlated .42 with GPA. Item difficulties correlated .85 with those reported in the Internet norming sample. Method of correlated vectors analysis showed positive relationships between g-loading of items/subtests and their correlation with GPA (r=.53/.85).

Files
https://osf.io/udgms/files/
#2
Unlike some people, I'm not impressed by Bayesian methods. The Bayes factor, so often used and praised by people such as Sprenger et al. (2013), among many others, has the same problem as the p value and other significance tests. The Bayes factor is a function of effect size and sample size. The larger the sample size, the greater the difference in "fit" between the models (or hypotheses). In other words: new methods, same problems.

That is a problem similar to hypothesis testing in CFA/SEM. Some fit indices, e.g., AIC and BIC, show "significant" differences too easily. When that happens, a quick look at them gives the feeling that the two models differ "significantly". But when looking at, e.g., CFI, Mc, ECVI, and RMSEA, one sometimes sees that the difference is really small, or too small to be of "practical significance".

If one believes that Bayes is superior to traditional methods of significance testing, one will be right. But if one believes that Bayes removes the problems inherent in significance testing, one will be wrong.
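
The dependence on sample size can be illustrated with a rough sketch. It assumes the BIC approximation to the Bayes factor, which is not the method used for the Bayes factors reported in this thread, and the function name bf10_bic is made up for illustration. For a fixed, nonzero observed correlation, the approximate evidence against the null grows as n grows:

```r
# BIC approximation to the Bayes factor: BF10 ~ exp((BIC_null - BIC_alt) / 2).
# For a simple linear regression with observed correlation r and sample size n,
# BIC_null - BIC_alt = -n * log(1 - r^2) - log(n).
# bf10_bic is a hypothetical helper, not part of any package.
bf10_bic <- function(r, n) {
  exp((-n * log(1 - r^2) - log(n)) / 2)
}

n <- c(50, 100, 200, 400)      # increasing sample sizes
bf <- bf10_bic(r = 0.3, n = n) # same observed effect size throughout

print(data.frame(n = n, BF10 = signif(bf, 3)))
```

The effect size is held constant here, so any change in the approximate Bayes factor across rows comes from n alone.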

Concerning item correlations, I must recommend the following readings: Raven (2010), Wicherts & Johnson (2009), Rushton & Jensen (2010, p. 16). The first one is the most important for getting a grasp of the problem.

Raven, John (2010). Testing the Spearman-Jensen Hypothesis Using the Items of the RPM.
Wicherts, J. M., & Johnson, W. (2009). Group differences in the heritability of items and test scores. Proceedings of the Royal Society B: Biological Sciences, 276(1667), 2675-2683.
Rushton, J. P., & Jensen, A. R. (2010). Race and IQ: A theory-based review of the research in Richard Nisbett’s Intelligence and How to Get It. The Open Psychology Journal, 3(1), 9-35.

The use of MCV with a low number of subtests is problematic. And with just 5, MCV has no credibility in my eyes. I have explained that here and here.
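
To give a sense of how little 5 data points constrain a correlation, here is a minimal sketch using the Fisher z approximation. The approximation is itself rough at n = 5, and the function name fisher_ci is made up for illustration. It is applied to the subtest-level MCV value of .85 reported in the abstract:

```r
# Approximate confidence interval for a correlation via the Fisher z transform.
# fisher_ci is a hypothetical helper, not part of any package.
fisher_ci <- function(r, n, level = 0.95) {
  z <- atanh(r)                       # Fisher z transform of r
  se <- 1 / sqrt(n - 3)               # approximate standard error of z
  crit <- qnorm(1 - (1 - level) / 2)  # normal critical value
  tanh(z + c(-1, 1) * crit * se)      # back-transform the interval to r
}

# MCV correlation across the 5 subtests, as reported in the abstract
ci_subtests <- fisher_ci(r = 0.85, n = 5)
print(round(ci_subtests, 2))
```

With only 5 vectors the interval spans zero even for a correlation as high as .85.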

The CRT seems to be a test first devised by Frederick (2005), and I first discovered it by reading his intriguing and fascinating article. One thing that strikes me is the shortness of the test. Another problem is that the answers to the test can be accessed by any curious internet user.
#3
I've read about halfway through Kruschke's Bayesian textbook, and I'm not too impressed. However, it does allow specifying models for model comparison more explicitly.

https://www.goodreads.com/book/show/9003...a-analysis

I also liked CFA/SEM methods more before I learned more about them. Maybe it's a general finding that the more you learn about some method, the less you like it...

-

Our analysis here has low power (N=72), so the results are not very convincing regarding the incremental validity of the CRT for GPA. Still, it is a first step. There is a good chance that I can get an agreement with my former 'high school', and then I can perhaps get to test their students every year. E.g. test them when they start, and follow their grades for the next 3 years. I don't know yet, but this could yield more data as well as a truly predictive design instead of the cross-sectional one in the paper here.

We used only NHST for the MCV results. I can remove it as I think it is rather uninteresting. We didn't do any testing with the Bayes factor, just noted the results. I wanted to compute 'confidence intervals' (or some uncertainty measure) for these Bayes factors, but I couldn't find a way to do that. Also note that the ratio between the two BFs is not that great, weak evidence by the usual standards:

Quote:
> BFs
                    bf        error                     time        code
CRT           3.344825 4.740419e-05 Wed Mar 18 07:26:34 2015 529c8d256f9
ICAR16 + CRT 23.402299 6.170106e-07 Wed Mar 18 07:26:34 2015 529c93723da
ICAR16       65.109398 6.403466e-07 Wed Mar 18 07:26:34 2015 529c4fe10e3
> 65.1/23.4
[1] 2.782051

So according to the interpretation keys on Wikipedia, this evidence strength is in the category "barely worth mentioning". https://en.wikipedia.org/wiki/Bayes_fact...rpretation



As for the item-level MCV, using Raven's items alone doesn't seem like a good idea without a large sample: too little diversity. The Wicherts dataset has the Raven's items (N=500ish).

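Code:
library(psych) #provides fa()
#read.mega(), rcorr2() and MCV.plot() are not in base R or psych;
#they are assumed to be the author's own helper functions, loaded beforehand

wicherts = read.mega("wicherts_data.csv") #load the Wicherts dataset

ravens.items = wicherts[,grep("rav\\d",colnames(wicherts))] #ravens items
ravens.items = ravens.items[,-36] #remove last item (no answers)
ravens.items$GPA = wicherts$Zgpa #add GPA as column 36
ravens.cor = rcorr2(ravens.items)$r #cors

ravens.fa = fa(ravens.items[,1:35]) #factor analyze the 35 items only, leaving GPA out
loadings = ravens.fa$loadings #item loadings on the single factor
GPA.item.cors = ravens.cor[36,1:35] #item x GPA cors

MCV.plot(loadings,GPA.item.cors) #plot loadings against item x GPA cors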


See attached.

As for the CRT, it is interesting in its shortness. But the items are getting too common now, so that there is a serious chance of testees being acquainted with them from some other source.


#4
1) Regarding MH's (and Raven's) concerns about MCV, the expectation is that MCV correlations with item data will be lower than with subtest data because the former are less reliable and more heterogeneous in terms of reliable variance. This does not mean that MCV should not be used with item data, just that the results may be weaker on average.

2) "He posted it twice with a about a week's delay"

remove the extra 'a'

3) "Gymnasie is a secondary education taken by approximately two thirds of a generation. In years of education, it is grade 10-13 or 9-12."

'High school' appears to be the closest equivalent of 'gymnasie', which should be mentioned for clarity's sake.

4) "92% of the students attended STX, with the rest attending various other kinds (HTX 1, HF, 4, HHX 1)."

What are these acronyms?

5) "Since no fact-sheet was provided"

By 'fact-sheet', do you mean an answer key?

6) "traditional factor analysis and item response theory factor analysis"

Specify/explain the algorithms used.
#5
The paper gives very little info about the nature of the test. What exactly do the subtests measure, and what do the acronyms in Tables 1 and 3 and Figure 2 (VR, LN etc) mean?

In what way are VR and LN so similar that they contribute most to the g factor of this test, and in which way is R3D different from the others, to give it a lower g loading? If R3D is 3-dimensional rotation and the others are mainly tests of "crystallized" abilities, that would be the explanation.

Page 2, footnote 3: Here you should specify that gymnasie is the academic branch of the educational system, so you expect people with higher IQ and a theoretical rather than practical slant in this school type.

The abbreviations on page 2 should be explained: HTX1, HF etc.

A few typos, for example Wechschler test in 2nd line of intro and least sqauare in 2nd line of section 6.1.
#6
We will work out a new draft based on the above. Thanks for reviewing.

I won't have time until after the conference in London, 7-11 May, http://emilkirkegaard.dk/LCI15/. (See you there!)
#7
Using the Jensen method (I think this is a more suitable name than MCV, which is unspecific: all correlations involve two vectors...) on homogeneous item-level data probably just results in spurious positive findings. I will add this to a future revision of my paper here https://thewinnower.com/papers/spearman-...-extension Basically, doing this on purely Raven's items should always be positive due to a statistical artifact, regardless of whether SH is true for the groups or not. But my mega-analysis shows that it does hold, so I guess others can use their own interpretation of that. I have 2 more samples not added to the analysis, which will bring the total up to 66 IIRC.

Back to the present article. Sorry I have been busy with other things, but also I just get distracted easily with new projects.

Dalliard,

2)
Fixed.

3)
Changed footnote to
Quote:Gymnasie is a secondary education taken by approximately two thirds of a generation. In years of education, it is grade 10-13 or 9-12 depending on whether the student took the optional 10th grade or not. In US terms, it is similar but not identical to high school.

4)
Added a footnote:
Quote:In Denmark there are 4 main types of gymnasier. STX is the standard type, HTX is technology-oriented, HHX is trade-oriented. Lastly, HF is a shorter 2-year which gives a roughly equivalent degree.

5)
Quote:By 'fact-sheet', do you mean an answer key?
Yes.

6)
Quote:Specify/explain the algorithms used.

Here or in the paper?

I changed the paragraph to:

Quote:To examine the internal structure, we used both traditional factor analysis and item response theory factor analysis on all the cognitive items to extract 1 factor. Although popular, principal components analysis has been shown to give misleading results in some cases\cite{Kirkegaard2014The}. For this reason, we used another extraction method which by default is MinRes, but it does not appear to make a large difference which method is used\cite{Kirkegaard2014The}. The functions fa() and irt.fa() from the psych package were used for extraction.

Factor loadings are shown in Table \ref{itemloadings}. The factor congruence was 1.00.

I don't know how IRT works in detail, but it is designed to work on item-level data (i.e. dichotomous) whereas traditional FA is designed to work on continuous data. The main goal was to show the two methods give similar results.
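
As a rough illustration of that comparison, here is a sketch on simulated data, not the data from the paper. It assumes, following the psych documentation, that irt.fa() works by factor-analyzing tetrachoric correlations, and it performs that step explicitly with tetrachoric() instead of calling irt.fa() itself. The loadings, thresholds and sample size are made up:

```r
library(psych)  # fa(), tetrachoric(), factor.congruence()

set.seed(1)
n <- 1000                               # simulated sample size (made up)
k <- 8                                  # number of items (made up)
lambda <- seq(0.4, 0.8, length.out = k) # true loadings (made up)

# Continuous latent responses driven by one general factor
g <- rnorm(n)
latent <- sapply(lambda, function(l) l * g + sqrt(1 - l^2) * rnorm(n))

# Dichotomize each item at its own threshold to mimic pass/fail items
thresholds <- seq(-1, 1, length.out = k)
items <- sweep(latent, 2, thresholds, ">") * 1
colnames(items) <- paste0("item", 1:k)

# "Traditional" factor analysis on Pearson correlations of the 0/1 items
fa_pearson <- fa(items, nfactors = 1)

# Factor analysis of the tetrachoric correlation matrix
rho <- tetrachoric(items)$rho
fa_tetra <- fa(rho, nfactors = 1, n.obs = n)

# Congruence coefficient between the two sets of loadings
congruence <- factor.congruence(fa_pearson, fa_tetra)
print(congruence)
```

The tetrachoric loadings are usually larger in absolute size than the Pearson ones, but the congruence coefficient only reflects the pattern of loadings, which is why the two methods can agree closely on it.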
#8
(2015-May-02, 14:49:12)Gmeisenberg Wrote: The paper gives very little info about the nature of the test. What exactly do the subtests measure, and what do the acronyms in Tables 1 and 3 and Figure 2 (VR, LN etc) mean?


We cite the Condon and Revelle paper for the test in the introduction. You can find the test in their supplementary material. We use the 16-item abbreviated version, translated to Danish.

The subtests are:
LN = Alphanumerical series, e.g. A B C D _ (fill in E).
VR = Verbal reasoning.
R3D = 3D rotation.
MR = matrix reasoning. Similar to Raven's.
CRT = Cognitive Reflection Test. 3 questions supposed to measure whether one is a reflective thinker. You can find this easily all over the internet.

Quote:In what way are VR and LN so similar that they contribute most to the g factor of this test, and in which way is R3D different from the others, to give it a lower g loading? If R3D is 3-dimensional rotation and the others are mainly tests of "crystallized" abilities, that would be the explanation.

They are not particularly similar.

I don't think it is appropriate to speculate about the exact loadings of the items due to the small sample size. The main point in showing the loadings is to show that they are all positive as they should be.

You could use the large sample of 1300 or so cases built into the psych package (called ability) to examine factor loadings if you want to try to make sense of them. This was not a major goal of this paper.

Quote:Page 2, footnote 3: Here you should specify that gymnasie is the academic branch of the educational system, so you expect people with higher IQ and a theoretical rather than practical slant in this school type.

See the reply to Dalliard above. Added:
Quote:The gymnasie is meant as a preparation for further education, so it is somewhat selected for academic ability and hence general intelligence.

Quote:The abbreviations on page 2 should be explained: HTX1, HF etc.

Has been done.

Quote:A few typos, for example Wechschler test in 2nd line of intro and least sqauare in 2nd line of section 6.1.

Fixed the spelling error in the intro.

Fixed OLS typo.

I will see if I can upload the new version.
#9
I have put the article on Authorea as a test. The figures are missing and tables somewhat messed up, but the text has been updated.

https://www.authorea.com/users/24740/art...ow_article
#10
1) "92% of the students attended STX, with the rest attending various other kinds (HTX 1, HF, 4, HHX 1)."

What do those numbers (1, 4, 1) stand for, percentages?

2) "To examine the internal structure, we used both traditional factor analysis and item response theory factor analysis on all the cognitive items to extract 1 factor."

According to the explanation here (http://www.inside-r.org/packages/cran/psych/docs/irt.fa), IRT factor analysis means that first a tetrachoric correlation matrix is calculated from item data and then the correlation matrix is factor-analyzed in the usual way. In other words, it is assumed that normally distributed continuous latent variables underlie dichotomous item responses. Add this explanation to the paper because it's not obvious otherwise.

3) "Next we compared the item means and SDs (shown in Table undefined) with those published by Condon and Revelle(Condon 2014). The correlations were .85 and .63 indicating high construct reliability across languages and samples."

On what basis are those values high? Unless you have some good standard to compare them against, I would say that the values suggest 'reasonable' or 'good' congruence between languages, especially as your sample is small.

4) "Both predictors had positive betas, however, the beta for CRT was only .11 and with a relatively wide confidence interval because of the small sample size (72), so we cannot be very certain about its real value"

The CIs are the same for both tests, so the amount of uncertainty is the same. I would just say that the 95% CI for CRT includes zero.
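
A minimal sketch of that check on simulated data, not the data from the paper (the variable names and effect sizes are made up). In a regression with exactly two standardized predictors, the standard errors of the two betas are necessarily equal, which would be consistent with the intervals having the same width:

```r
set.seed(1)
n <- 72  # same sample size as in the paper; the data themselves are simulated

g <- rnorm(n)               # shared ability factor (made up)
icar <- g + rnorm(n)        # stand-in for the ICAR16 sum score
crt <- g + rnorm(n)         # stand-in for the CRT sum score
gpa <- 0.5 * g + rnorm(n)   # stand-in for GPA

# Standardize everything so the coefficients are betas
d <- data.frame(gpa = scale(gpa)[, 1],
                icar = scale(icar)[, 1],
                crt = scale(crt)[, 1])

fit <- lm(gpa ~ icar + crt, data = d)
ci <- confint(fit)[c("icar", "crt"), ]   # 95% CIs for the two betas

widths <- ci[, 2] - ci[, 1]                    # interval widths
includes_zero <- ci[, 1] < 0 & ci[, 2] > 0     # does each CI contain zero?

print(round(ci, 2))
print(widths)
print(includes_zero)
```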