Journal:
Open Differential Psychology.
Authors:
Emil O. W. Kirkegaard
Julius D. Bjerrekær
Title:
ICAR5: a 5-item public domain cognitive test
Abstract:
A 5-item abbreviation of the ICAR (International Cognitive Ability Resource) 16-item sample test was created thru exhaustive search. The 5-item version (ICAR5) was optimized for correlation with the 16-item version and for administration time.
To validate the test, it was given to students in 6th to 10th grade in two Danish schools (N=236). Age was used as a criterion variable and showed the expected positive relationship (r=.43). Results furthermore showed that the abbreviated test was too difficult for the younger students (6th and 7th grades), but not for the older students. One item was found not to be very discriminative, so it should be replaced with a more suitable item.
Key words:
ICAR, international cognitive ability resource, cognitive ability, intelligence, IQ, abbreviation, age, Danish
Length:
3421 words, excluding references.
Files:
https://osf.io/yqe6p/files/
External reviewers:
We will attempt to get one of the ICAR designers to review the paper. These are: William Revelle and David Condon.
This looks like a worthwhile project, given the difficulty in working with research participants with microscopic attention spans. I don't know enough about the methods to make suggestions for improvements, though.
Find attached version with track changes and comments.
Overall, it's a good paper. A major limitation is that it's based on only one criterion variable for item selection, and 2 criteria for validation.
Piffer,
Thank you for reviewing. I have made the requested corrections. I have added your name to the acknowledgements section.
I have changed the title to "ICAR5: design and validation of a 5-item public domain cognitive ability test" as a compromise between your suggestion and the old title.
Please add a sentence explaining what this criterion correlation is. You mentioned it is correlation with age only in the abstract but not in the text. What I understand is that you calculated, for each combination of 5 items, the correlation of the raw score to Age. You did this for each dataset. Then you correlated this correlation across them. It’s a rather long process that requires a longer explanation. You also need to justify this process, e.g. why you decided to correlate them across the two datasets. Then you have suddenly moved on to Figure 2, showing the criterion correlations. You need to make clear that these are not intercorrelations, or that the others were intercorrelations. At first I was puzzled to see such a big difference in magnitude, then I realized you were talking about 2 different things. This should be made clearer in the text. Actually, I think it’d be better if you first report the criterion correlations, then you report the intercorrelations. This would make it less confusing.
You have misunderstood. The criterion correlation in Section 2 is the correlation between the 5-item test and the 16-item test. We used this criterion correlation to find the abbreviated tests that had the highest correlation with the 16-item test. The inter-correlations in Table 1 are the correlations, across the three datasets, between these 5-item x 16-item test correlations. I.e., do the datasets agree on which 5-item tests have the highest correlations with the 16-item test? Yes, to some degree (r's .15 to .30), but there is a lot of noise with the two small datasets.
Figure 2 has the distribution of all the criterion correlations using the large dataset.
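To make the two quantities concrete, here is a minimal sketch in Python. It is not the code used for the paper: the data are simulated from a simple one-factor model, and the function names (simulate_items, criterion_correlations) are hypothetical. It computes the criterion correlation for every possible 5-item subset of a 16-item test, and then correlates those criterion correlations across two simulated datasets, which is the kind of inter-correlation discussed for Table 1.

```python
from itertools import combinations

import numpy as np


def simulate_items(n_persons=500, n_items=16, seed=0):
    """Simulate 0/1 item responses from a one-factor model (synthetic data only)."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=(n_persons, 1))
    difficulty = np.linspace(-1.5, 1.5, n_items)
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((n_persons, n_items)) < p_correct).astype(float)


def criterion_correlations(items, k=5):
    """Correlate every k-item sum score with the full-test sum score."""
    full_score = items.sum(axis=1)
    results = {}
    for subset in combinations(range(items.shape[1]), k):
        short_score = items[:, list(subset)].sum(axis=1)
        results[subset] = np.corrcoef(short_score, full_score)[0, 1]
    return results


# Exhaustive search in one simulated dataset.
crit_a = criterion_correlations(simulate_items(seed=0))
best_subset = max(crit_a, key=crit_a.get)

# Do two datasets agree on which subsets have high criterion correlations?
crit_b = criterion_correlations(simulate_items(seed=1))
subsets = sorted(crit_a)
agreement = np.corrcoef(
    [crit_a[s] for s in subsets], [crit_b[s] for s in subsets]
)[0, 1]

print(len(crit_a), best_subset, round(crit_a[best_subset], 3), round(agreement, 3))
```

With 16 items there are 4368 possible 5-item subsets, so the exhaustive search is cheap. The distribution of the values in crit_a is the simulated analogue of what Figure 2 shows for the large dataset.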
The (external) validation with age first appears in Section 3. GPA would have been better and we did try to get this data, but the school administrator denied us access to this information. Age has a proven relationship to cognitive ability below the age of 20 or so, so it is a useful criterion to use.
How would you like us to make it more clear?
Revision #4 uploaded: https://osf.io/j9k7g/
Overall:
0/ From a practical standpoint, I find this paper to be a very useful contribution: The real world has real constraints on the measures that can be administered (a priori) and the types of data/items that could be assembled to reflect the measure of a construct (post hoc).
Substance:
1/ The abstract mentions that reliability and administration time were 'optimized' but (a) there are multiple ways to 'optimize' the inter-measure correlation and the 5-item measure administration time; (b) administration time per item was not measured, and the 'optimized' test is based on a reasonable guess about verbal items generally taking more time to complete - the limitations/future directions could suggest timing the items along w/ an algorithm for optimization.
2/ There are free cognitive measures available (albeit few in number). It's great that you mention the International Cognitive Ability Resource (ICAR), which comprises 60 items. A development study administered the ICAR in 16-item subsets to almost 97,000 people; see https://sapa-project.org/dmc/docs/CRICAR2014.pdf
3/ What is the specific evidence supporting the claim that "we found that some participants thought that taking the test took too long"? It is not provided here; doing so would help the reader understand the extent of the problem (against the problem of reliability that might arise from an even-shorter measure).
4/ You used exhaustive search vs. a genetic algorithm for your 16 item set; similarly, here's a paper that also took an exhaustive search for 3 items within each of 3 types of working memory: http://englelab.gatech.edu/2014/Foster%20et%20al%202014.pdf
This paper (Foster et al.) also balances reliability with administration time, so this could be an especially interesting paper (see comment #1).
5/ Why would having more matrix reasoning or dice rotation items correlate negatively with the criterion score? A few possibilities come to mind: (a) analyses somehow captured the distinction between fluid intelligence (Gf: matrix, dice) and crystallized intelligence (Gc: verbal, alphanumeric) -- see Carroll, 1993, which authors should cite (or Cattell-Horn-Carroll theory -- or some theory of cognitive ability); (b) maybe the verbal and alphanumeric items had higher average item correlations with one another so that the part-whole correlation between the 5- and 16-item scales would be stronger with more of these types of items; perhaps age is a third variable here that would drive the Gc but not Gf correlation (b/c knowledge accumulates with age; you have some nice item analysis of this in your Danish school sample)? These might not be the best explanations for the findings; nonetheless, the pattern of positive and negative correlations deserves greater consideration in the paper. Also note in Table 2 that N=4638, and I suspect even the .08 correlation has a low standard error (and is statistically significant).
6/ Note that the 5- vs. 16-item composite correlations will be highest when the 5 overlapping items have high average correlations with one another and higher variance; the remaining 11 items have lower average correlations with one another and lower variance; and the average over the 5x11=55 cross-correlations is higher... I think. You could decompose the correlations into these 3 partitions to take a look (compare the highest and lowest correlations across these partitions).
7/ Limitations could be relabeled as limitations and future directions -- also include what criteria you would measure in future research (vs. merely a 'broader collection' being needed).
Methods:
8/ Provide the approximate standard errors for correlations, perhaps on the small data set but certainly on the large data set for Table 1; also note that you deleted data listwise, which implies that users of your measure should do the same. You could reanalyze using MI or FIML for missing data; results will likely not meaningfully change much if at all, but still, you make use of all the data.
9/ Consider a loess or smoother in Fig 6 (where the sample is large); remove the loess in Fig 7 (where there are only 11 grade/class data points).
Mechanics:
10/ Please number the pages for easier reference by reviewers (or anyone).
11/ Spelling: Thru and altho are viewed as informal (vs. through and although).
Hope this review, while imperfect, is helpful to the effort.
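The decomposition suggested in comment 6/ can be sketched as follows. This is an illustration on simulated data, not an analysis of the ICAR data, and the function names and the example subset are hypothetical. The part-whole correlation is rebuilt from three blocks of the item covariance matrix: the 5x5 block (variance of the short score), the 11x11 block (variance of the remaining items' score), and the 5x11 cross block (covariance between the two), using Var(S16) = Var(S5) + Var(S11) + 2 Cov(S5, S11) and Cov(S5, S16) = Var(S5) + Cov(S5, S11).

```python
import numpy as np


def simulate_items(n_persons=1000, n_items=16, seed=0):
    """Simulate 0/1 item responses from a one-factor model (synthetic data only)."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=(n_persons, 1))
    difficulty = np.linspace(-1.5, 1.5, n_items)
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((n_persons, n_items)) < p_correct).astype(float)


def part_whole_direct(items, subset):
    """Correlation between the short sum score and the full sum score."""
    short_score = items[:, list(subset)].sum(axis=1)
    return np.corrcoef(short_score, items.sum(axis=1))[0, 1]


def part_whole_from_blocks(items, subset):
    """Same correlation, rebuilt from the three covariance partitions."""
    cov = np.cov(items, rowvar=False)
    in_short = np.zeros(cov.shape[0], dtype=bool)
    in_short[list(subset)] = True
    var_short = cov[np.ix_(in_short, in_short)].sum()   # 5x5 block
    var_rest = cov[np.ix_(~in_short, ~in_short)].sum()  # 11x11 block
    cross = cov[np.ix_(in_short, ~in_short)].sum()      # 5x11 block
    var_full = var_short + var_rest + 2.0 * cross
    return (var_short + cross) / np.sqrt(var_short * var_full)


items = simulate_items()
subset = (0, 3, 7, 11, 15)  # arbitrary example subset
r_direct = part_whole_direct(items, subset)
r_blocks = part_whole_from_blocks(items, subset)
print(round(r_direct, 6), round(r_blocks, 6))
```

Because the 11x11 block enters only through the denominator, a smaller variance among the remaining items raises the part-whole correlation when the other two blocks are held fixed, which is in line with the reasoning in the comment.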
Figure 2 has the distribution of all the criterion correlations using the large dataset.
How would you like us to make it more clear?
I think it's still confusing that you jump from the inter-correlations (across datasets) to the correlations within one dataset.
Before jumping to the figure without introduction, I'd add a sentence such as:
"Again, we calculated the correlation between all the possible 5-item tests and the ICAR16 (the criterion correlation) using the large dataset. Figure 2 shows..."
Nick,
I have followed most of your suggestions.
1/ The abstract mentions that reliability and administration time were 'optimized' but (a) there are multiple ways to 'optimize' the inter-measure correlation and the 5-item measure administration time; (b) administration time per item was not measured, and the 'optimized' test is based on a reasonable guess about verbal items generally taking more time to complete - the limitations/future directions could suggest timing the items along w/ an algorithm for optimization.
I have added a paragraph in the discussion about this limitation.
2/ There are free cognitive measures available (albeit few in number). It's great that you mention the International Cognitive Ability Resource (ICAR), which comprises 60 items. A development study administered the ICAR in 16-item subsets to almost 97,000 people; see https://sapa-project.org/dmc/docs/CRICAR2014.pdf
That's right. However, this dataset was not known to us when we did this study. We relied on a dataset we had available. However, using a larger dataset will not change much because there are strong diminishing returns to increasing the sample size further.
3/ What is the specific evidence supporting the claim that "we found that some participants thought that taking the test took too long"? It is not provided here; doing so would help the reader understand the extent of the problem (against the problem of reliability that might arise from an even-shorter measure).
The two previous studies had an open ended comment field at the end and some participants noted that the survey took too much time. We did not do any systematic study of this matter.
5/ Why would having more matrix reasoning or dice rotation items correlate negatively with the criterion score? A few possibilities come to mind: (a) analyses somehow captured the distinction between fluid intelligence (Gf: matrix, dice) and crystallized intelligence (Gc: verbal, alphanumeric) -- see Carroll, 1993, which authors should cite (or Cattell-Horn-Carroll theory -- or some theory of cognitive ability); (b) maybe the verbal and alphanumeric items had higher average item correlations with one another so that the part-whole correlation between the 5- and 16-item scales would be stronger with more of these types of items; perhaps age is a third variable here that would drive the Gc but not Gf correlation (b/c knowledge accumulates with age; you have some nice item analysis of this in your Danish school sample)? These might not be the best explanations for the findings; nonetheless, the pattern of positive and negative correlations deserves greater consideration in the paper. Also note in Table 2 that N=4638, and I suspect even the .08 correlation has a low standard error (and is statistically significant).
I confess that I don't have any particular theories about this pattern and it was not the topic of this study. Given the small variation in item types (only 4 types), it's probably best not to conclude much from this. It would be more interesting to do if one had external criterion variables to optimize against and more variation in item types. One could do this with the recently released SAPA dataset I think.
7/ Limitations could be relabeled as limitations and future directions -- also include what criteria you would measure in future research (vs. merely a 'broader collection' being needed).
I added two examples of other criterion variables one could use (GPA and parental educational attainment).
8/ Provide the approximate standard errors for correlations, perhaps on the small data set but certainly on the large data set for Table 1; also note that you deleted data listwise, which implies that users of your measure should do the same. You could reanalyze using MI or FIML for missing data; results will likely not meaningfully change much if at all, but still, you make use of all the data.
I don't know how to calculate standard errors for the numbers in Table 1, except perhaps by bootstrapping. But that would take a very long time to run and add little to the paper.
I have added a note in the caption about the approximate standard error for the correlations in Table 2. This is indeed very small (about .015), but we are not really interested in statistical significance (in fact, I'm an opponent of NHST).
One could impute the missing data or use some missing-data-robust method, but it was deemed unnecessary. In fact, an earlier version of this write-up did use an imputed version of the ICAR16 data. The results, however, were almost identical (r>.99) and we didn't want to complicate matters unnecessarily, so we just used the case-wise complete dataset.
9/ Consider a loess or smoother in Fig 6 (where the sample is large); remove the loess in Fig 7 (where there are only 11 grade/class data points).
This has been done.
10/ Please number the pages for easier reference by reviewers (or anyone).
I don't see the point because the PDF reader already provides this functionality. However, I have added page numbers as requested.
11/ Spelling: Thru and altho are viewed as informal (vs. through and although).
I use them on purpose because I support spelling reform.
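The approximate standard error of about .015 mentioned in the reply to comment 8/ can be checked with a short calculation. The thread does not say which formula was used, so the two common large-sample approximations below are assumptions; N=4638 and r=.08 are the values the reviewer cites for Table 2.

```python
import math


def approx_se_r(r, n):
    """Large-sample approximation: SE(r) ~ (1 - r^2) / sqrt(n - 1)."""
    return (1.0 - r ** 2) / math.sqrt(n - 1)


def fisher_z_se(n):
    """Standard error of Fisher's z-transformed correlation: 1 / sqrt(n - 3)."""
    return 1.0 / math.sqrt(n - 3)


n = 4638  # sample size cited by the reviewer for Table 2
se_small_r = approx_se_r(0.08, n)
se_z = fisher_z_se(n)
print(round(se_small_r, 4), round(se_z, 4))
```

Both approximations land close to .015 at this sample size, so the choice between them matters little here.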
----
Davide,
I think it's still confusing that you jump from the inter-correlations (across datasets) to the correlations within one dataset.
Before jumping to the figure without introduction, I'd add a sentence such as:
"Again, we calculated the correlation between all the possible 5-item tests and the ICAR16 (the criterion correlation) using the large dataset. Figure 2 shows..."
I have changed the text to:
Figure 2 shows the distribution of criterion correlations, that is, the correlations between all the possible 5-item tests and the ICAR16.
Hope this is clear enough now.
----
Updated the paper to version #6. This version also now uses APA citation style and has some other minor fixes.
https://osf.io/j9k7g/
I approve publication
All decisions/edits are reasonable, and I approve this paper for publication. Note that I am also on the bandwagon against statistical significance testing (comment #8), yet one should still be concerned about the precision of statistical estimates (so approximate standard errors in Table 2 as provided are helpful).
I approve the last version for publication. I would like to offer a couple of minor suggestions:
On page 2, item 2 Abbreviating the test
"Whether this will happen or not depends on how the space 'looks like'; ..."
To me, it reads better with "how" changed to "what."
Page 15, item 4 Discussion and conclusion
Although the test times were not measured, it would be helpful to have some idea of how long the students took to take the test. This could be a range, or something like 70% completed it in X_minutes or less. The reason for a test that is highly abbreviated is that it is less time consuming. It would also be helpful to readers to see a general comparison between the typical test taking times of ICAR5 and ICAR16.
Thanks for your reviews.
We have added the last changes suggested and fixed some other minor variants.
---
http://openpsych.net/ODP/2016/07/icar5-design-and-validation-of-a-5-item-public-domain-cognitive-ability-test/
The paper has been published. Moving thread to post-publication.