An examination of the openpsychometrics.org vocabulary test

Submission status
Accepted

Submission Editor
Submission editor not assigned yet.

Author
Emil O. W. Kirkegaard

Title
An examination of the openpsychometrics.org vocabulary test

Abstract

We examined data from the popular free online 45-item “Vocabulary IQ Test” from https://openpsychometrics.org/tests/VIQT/. We used data from native English speakers (n = 9,278). Item response theory (IRT) analysis showed that most items had substantial g-loadings (mean = .59, sd = .22), but that some were problematic (4 items loading below .25). Nevertheless, we find that using the site’s scoring rules (which include a penalty for incorrect answers) gives results that correlate very strongly (r = .92) with IRT-derived scores. This is also true when using nominal IRT. The empirical reliability was estimated to be about .90. Median test completion time was 9 minutes (median absolute deviation = 3.5) and was mostly unrelated to the score obtained (r = -.02).

The test scores correlated well with the self-reported criterion variables educational attainment (r = .44) and age (r = .40). To examine the test for measurement bias, we employed both Jensen’s method and differential item functioning (DIF) testing. With Jensen’s method, we see strong associations with education (r = .89) and age (r = .88), and less so for sex (r = .32). With differential item functioning, we only tested the sex difference for bias. We find that some items display moderate biases in favor of one sex (13 items showed evidence of bias at Bonferroni-adjusted p < .05). However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test-level bias is negligible (|d| < 0.05). Overall, the test seems mostly well-constructed, and is recommended for use with native English speakers.

Keywords
intelligence, cognitive ability, method of correlated vectors, measurement invariance, sex difference, Jensen’s method, sex bias, openpsychometrics.org, online testing, vocabulary, differential item functioning

Supplemental materials link
https://osf.io/vpn2a/

Reviewers ( 0 / 0 / 2 )
Reviewer 1: Accept
Reviewer 2: Accept

Mon 25 Jan 2021 23:57

Reviewer

Finally got around to reviewing this manuscript. Here are my thoughts on how to improve it:

  • Abstract: When you use the phrase "English natives", do you refer to native-born British individuals, or native English language speakers? These are two different populations, and it's not clear which one(s) you are referring to. (Based on the end of the abstract, I think you're referring to native English language speakers, but it is best to eliminate ambiguity.)
  • Abstract: "mostly unrelated"? Why not report a correlation coefficient between time to completion and score(s)?
  • Page 4: Revise ". . . too small to care about" to ". . . trivial." This sounds more formal and professional.
  • Page 5: What is the metric for the item-level bias reported at the bottom of the page?
  • Page 5: Revise ". . . positive values means male-favored . . ." to ". . . positive values indicate items that favor males . . ." Again, this sounds more professional, plus it doesn't make people think of the mean (i.e., central tendency) of a variable.
  • Page 11: Revise "Using the test to compare scores of men and women is thus not an issue" to read "This test can justifiably be used to compare scores of male and female examinees" (or similar language).
  • Page 12: You have a typo. Change "This confounding this spuriously . . ." to "This confounding thus spuriously . . ."
  • Page 12: Don't include an author's given name(s) in a citation.
  • Please state whether your data analysis was pre-registered or not.
  • The noticeable sex difference favoring males is unusual. Please contrast this with other studies on sex differences in vocabulary knowledge and/or verbal ability in more representative samples. You have done an excellent job showing that the higher male average is not due to test bias. But this result still requires some more discussion, because it is highly unusual.
  • Optional change: Add a sentence or two stating how the strong correlations among the scores derived from different scoring methods (all r = .87 or higher) show that--at least for this test--the Classical Test Theory assumption that item-level error is random holds up reasonably well. The gains in accuracy from IRT-based scoring are very modest. This could be an additional 1-3 sentences at the end of the manuscript.

Good luck with the revisions.

Author | Admin
Replying to Mon 25 Jan 2021 23:57

Thank you for these good suggestions. Changes:

  1. All the language suggestions were implemented.
  2. Correlation between time use and score added to abstract (r = -0.02, given in the table on page 3).
  3. "What is the metric for the item-level bias reported at the bottom of the page?". It is Cohen's d. This metric is from https://rdrr.io/cran/mirt/man/empirical_ES.html "ESSD: Expected Score Standardized Difference. Cohen's D for difference in expected scores." I have added a note to the table caption.
  4. "Page 12: Don't include an author's given name(s) in a citation." It's a bug from the citation manager (Zotero) that I can't fix. The typesetter should before it before publication.
  5. "Please state whether your data analysis was pre-registered or not." Nothing was pre-registered for this archival data study.
  6. I compiled results for vocabulary differences for adults from large samples. Added a new table with these results. As R1 says, it is unusual; the gap is usually about 0 in European-ancestry samples. I think the difference here has to do with differential self-selection by males and females. I expanded a paragraph in the discussion to include these data.
  7. The discussion, paragraph 3 already discusses the IRT and simple sum scores. However, I added a bit more to this.
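
For concreteness, here is a minimal R sketch of how the ESSD values can be obtained with the mirt package. The object names (resp for the 45 scored items, sex for the grouping variable, anchors for the names of items assumed DIF-free) are hypothetical stand-ins, not the paper's actual code:

  library(mirt)

  # Two-group 2PL fit: anchor items constrained equal across the sexes,
  # latent means/variances free, remaining items free so DIF can show up.
  mg <- multipleGroup(resp, model = 1, group = sex,
                      invariance = c(anchors, "free_means", "free_var"))

  # empirical_ES() reports several DIF effect sizes per item; the ESSD
  # column is Cohen's d for the difference in expected item scores.
  empirical_ES(mg)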

There appears to be a bug on the site; I will upload the new PDF as soon as possible.

Author | Admin

New version uploaded (after bugfix).

Reviewer

Thank you for accommodating my original requests. I have a few more minor suggestions to make the language seem more natural and/or diplomatic:

  • Page 2: change ". . . pick 2 of the 5 . . ." to ". . . pick 2 of the 5 options . . ."
  • Footnote 4: change ". . . still considered fine . . ." to ". . . still considered valid . . ." or ". . . still considered acceptable . . ."
  • Page 11: change "Here it should be noted . . ." to "However, it should be noted . . ."
  • Page 11: change ". . . smarter or duller . . ." to ". . . smarter or less intelligent than the general population . . ."
  • Page 13: change ". . . duller men . . ." to ". . . less intelligent men . . ."
  • Page 13: Add a semicolon after "Pirastu et al., 2021"

These are optional suggestions that have no bearing on the methodological or scientific quality of the manuscript. Good work.

Author | Admin
Replying to Thu 11 Mar 2021 19:40

Implemented suggestions 1-3 and 6. I prefer the plain-language simple comparative "duller" over the formal two-word "less intelligent".

Bot

Authors have updated the submission to version #3

Author | Admin

Richard Lynn has coincidentally published a relevant meta-analysis on sex differences in verbal ability. He finds the same results as I do, but used many more datasets. https://mankindquarterly.org/archive/issue/61-3/16

I will update to add a reference to this work once Reviewer 2 has given their comments.

Bot

Authors have updated the submission to version #4

Reviewer

This is a professionally prepared and highly technical paper.

Some suggestions:
Add page numbers.
Add table and figure numbering.
r should be in italics.

Scale scoring, four different methods: Which method is theoretically the best? Describe the advantages and disadvantages of each method. Give a recommendation based only on theoretical considerations before analyzing empirical relationships.
“Table X. Correlations between test scores and criterion variables” -> “Table X. Correlations between four different variants of test scores and criterion variables”
What is “mirt”? Explain.
“Of the 45 items, not all are good items, as scored using the site’s key.” What are the criteria? Explain.
“Of the 45 items, 13 showed evidence of sex-bias” – what is the criterion, describe/explain.
“may be smarter or duller” -> “may be smarter or less smart”
How can such a result emerge? “On this test, males obtained somewhat higher scores, d = 0.28 (4.2 IQ points). We found evidence of sex bias in 13 of the 45 items. However, the directions of bias were roughly balanced (6 and 7 items) such that the test level bias was near zero.” – Explain.
Table X: “studies of sex difference” -> “studies of sex differences”
For your empirical analysis of which scoring method is better, you have used only two criteria. This should be critically mentioned.


Author | Admin
Replying to Reviewer 2
Thanks for the review.

Add page numbers.
Add table and figure numbering.
r should be in italics.

These will be done at the typesetting step.

Scale scoring, four different methods: Which method is theoretically the best? Describe the advantages and disadvantages of each method. Give a recommendation based only on theoretical considerations before analyzing empirical relationships.

We added some discussion of these:

The simple sum is the most commonly used method, and can be interpreted as a latent variable model with equal loadings (McNeish & Wolf, 2020). The advantage here is the simplicity of use, especially for manual scoring by hand, and the fact that one does not need to estimate factor loadings. Estimation of factor loadings in small samples produces unreliable results, and it may be better to simply assume equal loadings (Gorsuch, 2015; Ree et al., 1998). The simple sum with subtraction for incorrect responses attempts to deal with differences in guessing rates by subtracting the expected score gains from guessing. This method should produce better estimates if all guessing is done completely at random and individuals simply vary in how much they guess. This assumption is not likely to be accurate, so it is unclear how this correction will affect estimates. The binary 2-parameter logistic model (2PL) allows items to vary in difficulty and factor loading. Thus, items that are more informative for a subject are given more weight in the scoring, and there is no bias from the binary nature of the data. This model should produce more accurate estimates than the simple sum when items actually vary in factor loadings, which almost any collection of items will do to a large degree. The nominal model further extends this by allowing that different incorrect responses may be differentially informative. In the binary models, each response is assumed to be informative in only two degrees: whether it is correct or incorrect. In the nominal model, some incorrect responses are deemed more incorrect than others, and this is used to estimate the ability. This approach should be slightly more effective if a large sample is available for the model training (Storme et al., 2019).
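
For readers who want to see the distinction concretely, here is a hedged R sketch of the four scoring variants. The object names (resp for the scored 0/1 items, raw for the original nominal responses) are hypothetical, and the guessing penalty shown assumes a simple 5-option single-answer format for illustration, which is not exactly this test's rule:

  library(mirt)

  # 1. Simple sum of correct answers.
  sum_score <- rowSums(resp)

  # 2. Sum with penalty for incorrect answers (formula scoring): with 5
  #    options, a random guess is right 1/5 of the time, so each wrong
  #    answer costs 1/4 point in expectation.
  penalized <- sum_score - rowSums(resp == 0) / 4

  # 3. Binary 2PL: items vary in difficulty and loading.
  fit_2pl   <- mirt(resp, model = 1, itemtype = "2PL")
  theta_2pl <- fscores(fit_2pl, method = "EAP")[, 1]

  # 4. Nominal model: distractor choices carry information too.
  fit_nom   <- mirt(raw, model = 1, itemtype = "nominal")
  theta_nom <- fscores(fit_nom, method = "EAP")[, 1]

  round(cor(cbind(sum_score, penalized, theta_2pl, theta_nom)), 2)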

“Table X. Correlations between test scores and criterion variables” -> “Table X. Correlations between four different variants of test scores and criterion variables”

Updated.

What is “mirt”? Explain.

Added:

(MIRT = Multidimensional Item Response Theory)

We only used unidimensional IRT here; "mirt" is just the software package's name. https://cran.r-project.org/web/packages/mirt/index.html

“Of the 45 items, not all are good items, as scored using the site’s key.” What are the criteria? Explain.

I don't understand the comment. The next sentence already explains this. The factor loadings are very low.

 “Of the 45 items, 13 showed evidence of sex-bias” – what is the criterion, describe/explain.

I don't understand the comment. This was explained in the previous section of the study, which showed the item response functions. Male bias is seen in the table as well; the effect size is given in Cohen's d, so it is easy to understand. E.g., item 8 favors women by 0.23 d, and item 9 men by 0.12 d.
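
As a concrete sketch (hypothetical object names again), an item-level test of this kind can be run in mirt roughly as follows: fit an unconstrained two-group model, then run likelihood-ratio DIF tests per item with a Bonferroni correction across the 45 items:

  library(mirt)

  # Configural baseline: item parameters free across the sexes.
  mg <- multipleGroup(resp, model = 1, group = sex)

  # Test each item by constraining its slope and intercept equal
  # across groups and checking the change in model fit.
  dif <- DIF(mg, which.par = c("a1", "d"))

  # Bonferroni correction across the 45 tests.
  p_adj <- p.adjust(dif$p, method = "bonferroni")
  sum(p_adj < .05)  # number of items flagged as biased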

“may be smarter or duller” -> “may be smarter or less smart”

I prefer the simpler language.

How can such a result emerge? “On this test, males obtained somewhat higher scores, d = 0.28 (4.2 IQ points). We found evidence of sex bias in 13 of the 45 items. However, the directions of bias were roughly balanced (6 and 7 items) such that the test level bias was near zero.” – Explain.

I don't know what needs to be explained. Some items favor one sex, others favor the other. Their opposite effects cancel out, so the test score does not have any notable sex bias.

Table X: “studies of sex difference” -> “studies of sex differences”

Fixed.

For your empirical analysis of which scoring method is better, you have used only two criteria. This should be critically mentioned.

Added some discussion of this limitation. Unfortunately, there are no more criterion variables to use here:

The main limitation of this criterion analysis is that we only have 2 variables to investigate. It would be preferable to repeat this method comparison using a wider range of criterion variables, and preferably in a larger dataset, so that precision would be sufficiently high to detect even small differences between correlations (e.g., r = .20 vs. .22).
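
To illustrate the precision point with a back-of-the-envelope calculation (this uses the independent-samples Fisher-z approximation for simplicity; dependent correlations on the same sample would need somewhat different math):

  # Sample size needed to distinguish r = .20 from r = .22 with
  # 80% power at two-sided alpha = .05.
  dz <- atanh(.22) - atanh(.20)   # difference in Fisher z, ~0.021
  k  <- qnorm(.975) + qnorm(.80)  # ~2.80
  se <- dz / k                    # required SE of the difference
  n  <- 2 / se^2 + 3              # from SE = sqrt(2 / (n - 3))
  ceiling(n)                      # ~36,000 per group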


Bot

Author has updated the submission to version #5

Reviewer

Many thanks for the revision!


Formal aspects:

This would make it easier for the reviewer, author and reader:

  1. A manuscript PDF that includes the review at the beginning (part 1), then the author's answer to the review (part 2), and finally the revised manuscript (part 3). (Standard procedure, as in the Elsevier manuscript review process.)
  2. Page numbers (page 1 at the first page of the reviews; page 1 at the first page of the answer to the reviews; page 1 at the first page of the manuscript).
  3. Date of the answer to the reviews and date of the manuscript completion.
  4. Correct numbering of tables and figures.

I suggest presenting correlations without a zero. ”Do not use a zero before a decimal when the statistic cannot be greater than 1 (proportion, correlation, level of statistical significance).” So no leading zero for correlations (and p-values and standardized betas). See: https://apastyle.apa.org/instructional-aids/numbers-statistics-guide.pdf

In the discussion, ”Rushton J. Philippe et al., 2007” still includes the author's given name in the citation.

”Less smart” sounds friendlier than ”duller”.


Content:

Quote from the abstract: ”With Jensen’s method, we see ... association ... with sex (r = .32). ... However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test level bias is negligible (|d| < 0.05).”

Here are two messages: 1. Males have better results. 2. If there are sex differences in items, about 50% favor males and 50% favor females. This is (at least at first glance) contradictory. I also noticed this in my first review, and you wrote:

”I don't know what needs to be explained. Some items favor one sex, others favor the other. Their opposite effects cancel out, so the test score does not have any notable sex bias.”

That is not an answer. I think the answer is: items in which males have an advantage also show higher g-loadings than items in which females show better results. This ultimately leads to a higher male mean. Correct? Clarify this, give numbers, and explain why this is. [See also your table with regressions.]

What is ”model training”?

Write one to three sentences about ”LOESS”, explain it and its advantages.

”(when the lines are not overlapping, e.g., item 37 has male-bias).” -> ”(when the lines are not overlapping, e.g., item 37 has male-bias and item 21 female-bias).”

Author | Admin

R2,

The requests concerning the forum software are not best handled here.

Content:

Quote from the abstract: ”With Jensen’s method, we see ... association ... with sex (r = .32). ... However, the item pool contains roughly even numbers of male-favored and female-favored items, so the test level bias is negligible (|d| < 0.05).”

Here are two messages: 1. Males have better results. 2. If there are sex differences in items, about 50% favor males and 50% favor females. This is (at least at first glance) contradictory. I also noticed this in my first review, and you wrote:

”I don't know what needs to be explained. Some items favor one sex, others favor the other. Their opposite effects cancel out, so the test score does not have any notable sex bias.”

That is not an answer. I think the answer is: items in which males have an advantage also show higher g-loadings than items in which females show better results. This ultimately leads to a higher male mean. Correct? Clarify this, give numbers, and explain why this is. [See also your table with regressions.]

You are missing that the quoted part talks about item bias. The latter part you quote concerns items with a sex bias in favor of one sex, not a sex difference in favor of that sex (the two may be highly correlated (here r = .71) but are conceptually distinct). The first part is talking about differences, not biases. There is no conflict here.

The better items (higher g-loadings) show larger male advantages, even when adjusting for item difficulty (males do better on the harder items as well; a greater male variance effect?) and item directional bias. That is what model 3 in the regression table shows.
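
In code terms, the regression is of the kind sketched below (the data frame items and its column names are hypothetical stand-ins for the per-item statistics: male-female gap in d, g-loading, difficulty, and directional ESSD bias):

  # Model 3-style item-level regression: does the male advantage grow
  # with the g-loading, net of difficulty and directional bias?
  m3 <- lm(male_d ~ g_loading + difficulty + essd_bias, data = items)
  summary(m3)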

What is ”model training”?

Fitting a model (finding the best parameters for that model on a given dataset). In machine learning, the more common term is training the algorithm or the model. These are disciplinary semantic differences.

Write one to three sentences about ”LOESS”, explain it and its advantages.

Included a footnote: "LOESS = locally estimated scatterplot smoothing, a method for deriving a moving average that captures nonlinear effects. This is the most common algorithm for handling the simple case of two continuous variables." I don't know whether it has any particular advantages over the alternatives, but it is widely used and the default algorithm in the ggplot2 plotting package. https://ggplot2.tidyverse.org/reference/geom_smooth.html defaults to LOESS for n < 1000 (because it is slow for larger n, not because it is inaccurate); it is done with the base R function https://rdrr.io/r/stats/loess.html.
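
A minimal illustration of both routes (df with columns time and score is a hypothetical stand-in):

  library(ggplot2)

  # Base R: local regression fit.
  lo <- loess(score ~ time, data = df)

  # ggplot2: geom_smooth() picks method = "loess" automatically for
  # n < 1000; specifying it explicitly works for any n.
  ggplot(df, aes(time, score)) +
    geom_point(alpha = 0.2) +
    geom_smooth(method = "loess")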

”(when the lines are not overlapping, e.g., item 37 has male-bias).” -> ”(when the lines are not overlapping, e.g., item 37 has male-bias and item 21 female-bias).”

Added.

Bot

Author has updated the submission to version #6

Bot

The submission was accepted for publication.

Bot

Author has updated the submission to version #8