
[ODP] Country of origin and use of social benefits: A pilot study of stereotype accur
Admin
Journal:
Open Differential Psychology.

Authors:
Emil O. W. Kirkegaard
Julius D. Bjerrekær

Title:
Country of origin and use of social benefits: A pilot study of stereotype accuracy in Denmark

Abstract:
We asked a broad online sample of Danes (N=60; N=48 after quality control) to estimate the use of social benefits for persons grouped after country of origin. The median personal stereotype accuracy correlation was .55 [CI95: .46 to .58]. The aggregate stereotype accuracy was .70 [Ncountries=71, CI95: .56 to .80].
The study was underpowered to detect relationships to many predictors, but some plausible predictors were found including being male d = .86 [CI95: .17 to 1.56], being older r=.56 [CI95: .33 to .73], Nationalism r=.34 [CI95: .07 to .57], personal liberalism, r=.32 [CI95: .04 to .55] and cognitive ability (r=.23 [CI95: -.06 to .48]).
The study was preregistered.

Key words:
stereotypes, stereotype accuracy, Denmark, immigrants, social benefits, group differences, Muslims

Length:
3700 words, excluding references.

Files:
https://osf.io/ecpzj/files/

External reviewers:
We will attempt to get one of the stereotype accuracy researchers to review the paper as well as 2 in-house reviewers. These are: Lee Jussim, Jarret T. Crawford, and Rachel S. Rubinstein. They wrote a recent review paper (preprint).
That's really just a pilot study, given the small sample size. Too bad that it is so hard to recruit survey participants, not to mention getting an acceptable response rate. Still, even as a pilot study with a convenience sample it has some interesting results. Inclusion of elevation and dispersion bias in addition to correlation is especially useful.
Only some minor points:
1. In 4.3 you write "by taking the mean ... and subtracting that of the real values." Do you subtract the estimates from the real values, or do you subtract the real values from the estimates? I guess you did the latter. Make sure you state that more clearly.
2. In Figure 10, X axis: Is this log 10 or natural logarithm? Log 10 looks unlikely, unless there are 100 billion Turks in Denmark.

Otherwise, it's fine.
Admin
Gerhard,

Thanks for commenting. We had originally thought we would get more responses, especially since we spent money on advertisement on Reddit. However, that did not work well and subjects complained about the length of the cognitive test, probably reducing the sample further (and the reliability of the cognitive test).

We had in mind doing a larger study using a website we would make, but I simply do not have enough time to work on that for now. But we didn't want to not publish our findings at all, even if they are only suggestive (small, unrepresentative sample).

--

I will edit it to make it more clear with regards to the elevation. We used estimates minus real values, such that positive values reflect too high estimates.

It is the natural logarithm, as you expected. It makes no difference which one is used (I have tried both), but log_10 is more interpretable (because calculating with e in the head is difficult!), so I will change it to log_10.
1) "We asked a broad online sample of Danes (N=60; N=48 after quality control)"

It's a small convenience sample, not broad in any way.

2) "The study was underpowered to detect relationships to many predictors, but some plausible predictors were found including being male d = .86"

This should say what the dependent variable is, e.g. "The study was underpowered to detect relationships between the accuracy of beliefs and many predictors..."

For consistency, you should report the male effect size as a correlation (r) like the other effect sizes.

3) "The data and R source code is publicly available "

are publicly available

4) "decided to post the survey multiple places"

to multiple places

5) "used by free-market, nationalists and conservatives"

free-market who?

6) "a class of students at Mercantec"

What's Mercantec?

7) "Experience with ghettos (6 options)"

How do you define a ghetto in this study?

8) "were asked to estimate the percent of persons"

percentage, not percent

9) "would not receive the student scholarship that all students are eligible to in Denmark."

'Scholarship' is something that you get based on merit. If everyone gets it, it's 'student allowance' or something like that.

10) "The criteria data were"

criterion data

11) "Delta scores are sensitive to both elevation and dispersion bias"

This is an unclear sentence. Do you mean to say that delta scores can be used to study those biases, or that delta scores are distorted by those biases? I assume the former.

12) "while in third we are aggregating estimates and then calculating accuracy scores"

Clarify how you actually aggregate them.

13) Figure 1 needs more explication. Is it based on correlations between welfare use estimated by individuals for all countries of origin and the actual welfare use by people from each country of origin? Whatever it is, say it explicitly. What is the scale of the y axis, e.g., what does it mean when y=0.5? What's the difference between the bar graph and the line graph?

14) "yielding a mean of .48 and median of .55 [CI95: .46 to .58]3. Using Jussim et al (2015)'s cutoffs of .30 and .50 for levels of accuracy"

Is the CI for the median? Are Jussim's cutoffs means or medians? How were the cutoffs decided? How can both be 'accurate'?

15) Table 1 needs more explication. I assume these are all individual measures, which should be stated. What are 'sd error abs' and 'sd error' exactly? Don't use abbreviations without defining them. Also, remove the redundant entries above the diagonal.

16) "unweighted sums are more resistant to sampling error and was chosen instead"

were chosen

17) Could you add significance asterisks or CIs to Table 2?

18) Add descriptive statistics for the predictors (IQ, age, sex, etc.).

19) "Examining accuracy at the aggregate-level begs the question of how one should aggregate the estimates."

invites the question, not begs

See: http://idioms.thefreedictionary.com/beg+the+question

To keep the different levels of analysis clear, you should clarify here what you are actually aggregating.

20) altho

ugh

This error is repeated elsewhere in the paper.

21) "(Arthur Robert Jensen, 1980)"

this citation is not needed, you can assume that the reader knows what a systematic error is

22) "we downloaded GDP (per capita, in dollars)"

Is this PPP?

23) "immigrants from countries with more Muslims tend to perform poorly"

perform poorly on what?

24) "Due to the non-random sampling and relationships between predictors"

What do you mean by relationships between predictors?
Admin
Dalliard,

Very thorough review as usual.

Here are the changes. They are numbered in reply to Meisenberg's and your points.

a) Elevation bias
We wrote:
“This can be calculated by taking the mean (or another central tendency measure) of the estimates and subtracting that of the real values.”

Possibly this could be more clear. What about?

“This can be calculated by taking the mean (or another central tendency measure) of the estimates and subtracting from that the real values.”
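To make the direction of the subtraction concrete, here is a minimal sketch of elevation bias as described in this thread (estimates minus real values, so that positive values mean the estimates were too high). The function name and the numbers are made up for illustration and are not taken from the paper's code.

```python
from statistics import mean

def elevation_bias(estimates, real_values):
    """Mean of the estimates minus mean of the real values.

    Positive result: the rater's estimates are too high on average.
    """
    return mean(estimates) - mean(real_values)

# Hypothetical percentages for three origin groups.
real_values = [20, 30, 40]
estimates = [30, 40, 50]

bias = elevation_bias(estimates, real_values)

# Equivalent formulation: the mean of the pairwise differences.
pairwise = mean(e - r for e, r in zip(estimates, real_values))
```

Both formulations give the same number for complete data, so the wording in the paper only needs to fix the sign convention.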

--

b) Logarithms
We used the natural logarithm (this is the default used by R). However, it makes more sense to use base-10 logarithm because the numbers are more interpretable without using a calculator. We have changed the numbers in the paper to use log10. Updated the figures and the figure captions.
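A small sketch of why the choice of base does not affect the results: the two logarithms differ only by a constant factor, ln(10), so any correlation with the log-transformed variable is unchanged. The population counts below are invented for illustration.

```python
import math

# Hypothetical population counts for five origin groups.
counts = [120, 950, 4300, 31000, 62000]

natural = [math.log(c) for c in counts]    # base e (R's default for log())
base10 = [math.log10(c) for c in counts]   # base 10

# The ratio is the same constant for every value: ln(10), about 2.3026.
ratios = [a / b for a, b in zip(natural, base10)]

# What a value of 11 on each scale would mean as a raw count.
as_natural = math.exp(11)   # roughly 60 thousand
as_base10 = 10 ** 11        # 100 billion
```

The last two lines illustrate the reviewer's point: an axis value around 11 is plausible as a natural logarithm but not as a base-10 logarithm.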

--

1) "We asked a broad online sample of Danes (N=60; N=48 after quality control)"

It's a small convenience sample, not broad in any way.


It is a small convenience sample, but it is broad. Remember that most samples in psychology are university students (usually introductory students in psychology too; see the cited reference about WEIRD subjects), which vary little in e.g. age. Compared to this, our sample is broad because it has a large age variation and includes both secondary school children, persons from university, Mensians, persons reading a national conservative news aggregator, our personal Facebooks, and more.

However, I have added “small” to the abstract, altho the sample size was already visible in the parenthesis.

--

2) "The study was underpowered to detect relationships to many predictors, but some plausible predictors were found including being male d = .86"

This should say what the dependent variable is, e.g. "The study was underpowered to detect relationships between the accuracy of beliefs and many predictors..."

For consistency, you should report the male effect size as a correlation (r) like the other effect sizes.


You are right about the dependent variable. Changed to:
“The study was underpowered to detect relationships between the accuracy of beliefs and many predictors”

I disagree with regards to the use of correlations. I think correlations make little sense for dichotomous variables (like gender) in which case d values (or some other standardized mean difference metric) should be used. It's a consistency vs. interpretability of metric trade-off. In this case, it is a minor problem because d≈2r.
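For readers who want to move between the two metrics, the standard conversion can be sketched as follows. It assumes two groups of equal size, which may not hold in this sample, so the output is only approximate here. The function names are illustrative.

```python
import math

def d_to_r(d):
    """Convert Cohen's d to a point-biserial r, assuming equal group sizes."""
    return d / math.sqrt(d * d + 4)

def r_to_d(r):
    """Inverse conversion under the same equal-group-size assumption."""
    return 2 * r / math.sqrt(1 - r * r)

r_for_sex = d_to_r(0.86)    # the d reported in the abstract; about .40
r_for_large = d_to_r(0.80)  # Cohen's "large" d; about .37
```

For small effects the denominator in `r_to_d` is close to 1, which is where the d≈2r rule of thumb comes from. It becomes less exact as r grows.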

--

3) "The data and R source code is publicly available "

are publicly available

Fixed.

--

4) "decided to post the survey multiple places"

to multiple places


Fixed.

--

5) "used by free-market, nationalists and conservatives"

free-market who?


Added “proponents”.

--

6) "a class of students at Mercantec"

What's Mercantec?


We should have introduced this. However, on second thought it is perhaps best not to mention the school by name. It could cause problems for the school in the worst case scenario that some activists pick up this article and start making noise.

Changed to:
“Furthermore, with the help of a teacher, we gave the survey to a class of students at a gymnasie in Viborg, central Jutland.”

--

7) "Experience with ghettos (6 options)"

How do you define a ghetto in this study?


The question asked:

“Hvad er din erfaring med ghettoer?

Med "ghetto" forstås et boligområde hvor der 1) bor mange indvandrere, 2) er en høj arbejdsløshed og 3) har en høj kriminalitetsrate.”

Translated:

“What is your experience with ghettos?

By “ghetto” we mean a residential area where 1) many immigrants live, 2) there is a high unemployment rate, and 3) there is a high crime rate.”

We followed the (original) definition used by the Danish government in 2010 but without using the numerical thresholds. The left-wing government that took over in 2011 renamed the ghetto list to the list “of særligt udsatte boligområder”, which means the list of particularly vulnerable residential areas. They then changed the criteria to include two more requirements: one about income and one about education. (One can speculate that the reason for doing this was to avoid the immigrant-heavy areas by adding more criteria not based on national origin; it did not work well.)

Due to small yearly fluctuations and the new criteria, this somewhat changed which areas met the inclusion criteria. Denmark got a new right-wing government in 2015, which has support from the anti-immigration nationalist party, Danish People's Party, so it is possible that the list will be renamed or changed again soon. One can find the current list (per 2014, December) here (Danish): http://www.mbbl.dk/sites/mbbl.dk/files/dokumenter/publikationer/liste_over_saerligt_udsatte_boligomraader_1._dec._2014_1.pdf

Previously, I had tried to get the dataset the list was made from by using our Freedom of Information Act. However, apparently there is a clause in the law that makes it possible for them to deny such requests to scientific data used for public statistics. I complained to the Ombudsman, but he may agree with their decision.

--

8) "were asked to estimate the percent of persons"

percentage, not percent


I looked up the issue. There seems to be little point in using the longer version. In that case we follow the principle of using the shorter word/spelling (compare: math vs. mathematics; color vs. colour).

--

9) "would not receive the student scholarship that all students are eligible to in Denmark."

'Scholarship' is something that you get based on merit. If everyone gets it, it's 'student allowance' or something like that.


It is this: http://www.su.dk/english/state-educational-grant-and-loan-scheme-su/
They apparently call it State Educational Grant, so we will call it that as well.

--

10) "The criteria data were"

criterion data


Fixed in two places.

--

11) "Delta scores are sensitive to both elevation and dispersion bias"

This is an unclear sentence. Do you mean to say that delta scores can be used to study those biases, or that delta scores are distorted by those biases? I assume the former.


Changed to:
“Because delta scores concern any deviation from perfect accuracy, they include both the elevation and dispersion bias components.”

--

12) "while in third we are aggregating estimates and then calculating accuracy scores"

Clarify how you actually aggregate them.


It is clarified in the later section (Section 6). However, we also rewrote it to:

“One can examine stereotype accuracy at three main levels (Jussim, 2012, p. 317): 1) individual-level (also called personal stereotypes), 2) individual-level on average and 3) aggregate-level (also called consensual stereotypes). The difference between the second and the third is that in the second we calculate accuracy scores for each individual and then average, while in the third we aggregate estimates and then calculate accuracy scores.”
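The difference between the second and third level can be sketched with a toy example. The data are made up, the helper `pearson` is written out here only to keep the example self-contained, and the mean is used as the aggregation rule purely for illustration.

```python
from statistics import mean, median

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical real values for five origin groups and three raters' estimates.
real = [10, 20, 30, 40, 50]
estimates = {
    "rater_a": [15, 18, 35, 30, 55],
    "rater_b": [30, 20, 25, 45, 40],
    "rater_c": [5, 30, 20, 50, 45],
}

# Level 1: one accuracy correlation per individual.
individual = {name: pearson(est, real) for name, est in estimates.items()}

# Level 2: score each individual first, then average the accuracy scores.
mean_individual = mean(individual.values())
median_individual = median(individual.values())

# Level 3: aggregate the estimates per group first, then score once.
aggregate_estimates = [mean(col) for col in zip(*estimates.values())]
aggregate_accuracy = pearson(aggregate_estimates, real)
```

In this toy data the aggregate correlation comes out higher than the mean individual correlation, because averaging across raters cancels some of their idiosyncratic errors. That is the same direction as the pattern in the abstract (.70 aggregate versus a median of .55 individual).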

--

13) Figure 1 needs more explication. Is it based on correlations between welfare use estimated by individuals for all countries of origin and the actual welfare use by people from each country of origin? Whatever it is, say it explicitly. What is the scale of the y axis, e.g., what does it mean when y=0.5? What's the difference between the bar graph and the line graph?


The method is explained in Section 4.1 where the measure is explained. It is the individual (Pearson) correlations of accuracy. That is, each subject gave a set of estimates which were correlated with the real values to produce a correlation (a single number). Then, the distribution of these numbers was plotted.

The density curve (“line graph”) shows the empirically estimated distribution. The details of this are found in the ggplot2 package (http://docs.ggplot2.org/0.9.3.1/geom_density.html, and leads to this brief and mathematical explanation https://stat.ethz.ch/R-manual/R-devel/library/stats/html/density.html). The values of the y-axis are not very interpretable, but useful for relative comparison. It would be better if ggplot2 would give a more useful y-axis, but it appears not to be a concern of many. I have tried to get a more useful scale (e.g. with proportion on the y-axis), but it appears that others are fine with the way it is. See details here: http://stackoverflow.com/questions/32412805/ggplot2-histogram-with-density-curve-that-sums-to-1
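As a rough illustration of what a density-scaled y-axis means, the sketch below bins some made-up accuracy correlations by hand and computes both a proportion scale and a density scale. This is not how ggplot2 computes its smoothed curve; it only shows the relation between the two scales for a histogram.

```python
def histogram(values, lo, hi, bins):
    """Return bin width, per-bin proportions, and per-bin densities."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the top edge
        counts[idx] += 1
    n = len(values)
    proportion = [c / n for c in counts]          # heights sum to 1
    density = [c / (n * width) for c in counts]   # areas sum to 1
    return width, proportion, density

# Hypothetical individual accuracy correlations.
accuracy_r = [0.12, 0.35, 0.41, 0.48, 0.52, 0.55, 0.57, 0.61, 0.66, 0.72]

width, proportion, density = histogram(accuracy_r, 0.0, 1.0, 10)
total_area = sum(d * width for d in density)
```

On the density scale it is the area (height times bin width) that sums to 1, so heights can exceed 1 when the values fall in a narrow range. Multiplying a density height by the bin width gives the proportion of subjects in that bin.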

How would you like us to make it more clear?

--

14) "yielding a mean of .48 and median of .55 [CI95: .46 to .58]3. Using Jussim et al (2015)'s cutoffs of .30 and .50 for levels of accuracy"

Is the CI for the median? Are Jussim's cutoffs means or medians? How were the cutoffs decided? How can both be 'accurate'?


Yes, altho I can see how that was not clear. I have added “CI95 for the median”.

I don't understand the question. They are not means or medians. They are the number of persons who got over a particular accuracy threshold.

It is not clear how Jussim chose those values. A brief explanation of them appears in a caption in the cited paper:

“Note: Percentages of stereotype accuracy correlations exceeding .30 and .50 are presented because only 24% and 5%, respectively, of all effects in social psychology exceed correlations of .30 and .50 (Richard, Bond, & Stokes-Zoota, 2003).”

The values do not appear in his 2012 book, where he uses the value r = .40 instead (see p. 320; the book is on libgen: http://gen.lib.rus.ec/search.php?req=social+reality+jussim&open=0&res=25&view=simple&phrase=1&column=def). This was based off Cohen's guidelines, where d >= .80 corresponds to a large effect. This corresponds to a correlation of about .40 (.3714).

--

15) Table 1 needs more explication. I assume these are all individual measures, which should be stated. What are 'sd error abs' and 'sd error' exactly? Don't use abbreviations without defining them. Also, remove the redundant entries above the diagonal.


Added “individual” to the caption, and “abs = absolute”.
I prefer to keep the redundant values because it makes it easier to look up correlations, i.e. one can follow a row or a column to find all the correlations with that variable.

--

16) "unweighted sums are more resistant to sampling error and was chosen instead"

were chosen


Fixed.

--

17) Could you add significance asterisks or CIs to Table 2?


Adding CIs would clutter it up somewhat, but I see the benefit. I have spent a number of hours writing a function for R that supports calculating a correlation matrix with CIs. It also supports weighted correlations and rounding digits. After that I updated the table (it is now Table 3 because there is a new table with descriptive statistics).
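The R function itself is not shown in this thread. As a hedged illustration of one common way to get an analytic confidence interval for a single correlation, here is the standard Fisher z approach. The authors' function may differ in details. The N of 48 is taken from the abstract and assumes no missing data on the variables involved.

```python
import math

def correlation_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z transformation."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# The age correlation from the abstract: r = .56 with N = 48.
lower, upper = correlation_ci(0.56, 48)
```

For r = .56 and N = 48 this gives an interval that rounds to .33 to .73, the same as the interval reported in the abstract for age.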

--

18) Add descriptive statistics for the predictors (IQ, age, sex, etc.).


Added a new subsubsection in Section 3 with descriptive statistics of the analysis sample.

--

19) "Examining accuracy at the aggregate-level begs the question of how one should aggregate the estimates."

invites the question, not begs

See: http://idioms.thefreedictionary.com/beg+the+question

To keep the different levels of analysis clear, you should clarify here what you are actually aggregating.


The dictionary is incorrect: the meaning of a phrase is what it is commonly used to mean, so a common misuse of meaning is impossible. Still, it can be useful to be more clear, and in this case your choice of phrasing is superior. An even better one is “raises the question”, which we changed it to.

--

20) altho

ugh

This error is repeated elsewhere in the paper.


I do it on purpose since I am in favor of spelling reform, e.g. https://en.wikipedia.org/wiki/Cut_Spelling

--

21) "(Arthur Robert Jensen, 1980)"

this citation is not needed, you can assume that the reader knows what a systematic error is


I think it is still a good reference work on the meaning of systematic error and bias in mental testing in general. I don't see the harm in citing it. However, the reference was somewhat malformatted for some strange reason. This has been fixed.

--

22) "we downloaded GDP (per capita, in dollars)"

Is this PPP?


Good question. The IMF does not say on their website directly: data explorer.

However, from their menu where variables are chosen (step 3), there is an option to use PPP values, so my guess is that we are using the nominal GDP per capita values. The option I chose is “Gross domestic product per capita, current prices, in US dollars”.

It may not matter much. I downloaded the nominal and PPP data from Wikipedia (years 2014 and 2015). They correlate .90.

https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28PPP%29_per_capita
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29_per_capita


--

23) "immigrants from countries with more Muslims tend to perform poorly"

perform poorly on what?


More or less any metric. I had in mind social outcomes in general. I have rewritten it to:

“Public debate in Denmark about immigrants often concern the role of Islam or Muslims (Engelbreth Larsen, 2009; Fruensgaard, 2012; Holstein & Jenvall, 2014, 2014; Lassen, 2014), probably because immigrants from countries with more Muslims tend to fare poorly as measured by standard socioeconomic outcomes such as education, income, crime and use of social benefits. The correlation between percent Muslim in the origin country and a general index of socioeconomic outcomes is -.78 (N=58) (Kirkegaard & Fuerst, 2014, Table 12).”

--

24) "Due to the non-random sampling and relationships between predictors"

What do you mean by relationships between predictors?


Non-random sampling can lead to correlations between predictors that would not normally be found (or to stronger or weaker correlations). For instance, the nationalist conservative website from which we had many responses generally has an older and more male readership. If stereotype accuracy is related to nationalist beliefs, then this could have induced a spuriously large effect of age. The plot below shows mean age by collection site (error bars = analytic CI95).

We added a sentence about this.

--

Many other small improvements (hopefully!) to the text.


--

Files updated! New draft version is 5. https://osf.io/bwtg8/

--

We have since found some more things in need of improvement:
- Consistent capitalization for gymnasie.
- Accuracy predictor table caption needs a note that numbers in brackets are confidence intervals.
- We should add a note in the paper regarding the definition of "ghetto".

These will be fixed in the next revision.
This was a nice study. What was the correlation between the perceptions, though? It's nice when studies present both e.g., section 4.1 here. If the inter-rater reliability of stereotypes is high, then the use of aggregate scores is more justified.

"we could have used were 'most Europeans are taller than most East Asians'"

Some people might actually not know this. They might be generally poor stereotypers. How did the inclusion of these participants affect the correlations?

Language issues:

"of social benefits for persons grouped after country of origin"

grouped by

"We wanted to avoid a narrow student sample (Henrich, Heine, & Norenzayan, 2010) and thus decided to post the survey to multiple places on the Internet."

Try, for example: "We wanted to avoid a narrow student sample (for the reasons discussed in: Henrich, Heine, & Norenzayan, 2010) and thus decided to post the survey to multiple places on the Internet."

"The survey can be found at"

The survey can be found at the following site:

"This age group was chosen because they are old enough to be finished with education and thus would not receive the State Educational Grant that all students are eligible to in Denmark (see
http://www.su.dk/english/state-educational-grant-and-loan-scheme-su/),
and not old enough that many of them would be receiving retirement benefits"

"This age group was chosen because members of it are ..."

"We chose this measure of socioeconomic performance because it is on a ratio scale (has a true zero) and is simple to understand."

(having a true zero)

"This is probably a reasonable proxy for percent Muslims among these groups in Denmark, altho we are not aware of any studies of this. "

Although

"Stereotypes are often mentioned in public discussion about immigration in Denmark, often as a cause of group differences (Sareen, 2011) or unfair treatment such as housing discrimination (Ekberg, 2015; Hussein, 2014). "

Rewrite.

"and are postulated as a cause of group differences"

"in case of the aggregate estimates somewhat underestimated them (by -3.43 percent points)."

they somewhat

"Altho plausible..."

Rewrite. What results? Also, altho is unorth(odox).

"than the actual GDP values"

than to
Admin
John,

Thanks for taking the time to review this. We will work on a new version fixing these problems.
Admin
John,

Inter-stereotype correlations
Based on the final data (users who failed control questions excluded and the one outlier excluded, N=48). Using alpha() from the psych package, the standardized alpha coefficient of consistency is .98 and the mean inter-correlation is .47. To make sure nothing was strange, I also plotted the intercorrelations. The distribution shows a long left tail (towards 0), so the median is somewhat higher at .51.
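The two numbers above are linked by the Spearman-Brown relation: standardized alpha follows from the number of raters and their mean inter-correlation. The sketch below uses that textbook formula; it is not the code of `alpha()` in psych.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from k raters/items and their mean inter-correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# 48 raters with a mean inter-rater correlation of .47, as reported above.
alpha_48 = standardized_alpha(48, 0.47)

# The same mean correlation with only 5 raters gives a much lower value.
alpha_5 = standardized_alpha(5, 0.47)
```

With 48 raters, even a moderate mean inter-correlation of .47 yields an alpha of about .98, which is why the aggregate estimates are so reliable despite modest agreement between any two individuals.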

In personal communication with me, you asked for intraclass correlations (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/ ). I have calculated those using ICC() from psych (http://personality-project.org/r/html/ICC.html ).

Intraclass correlation coefficients
                         type   ICC   F   df1  df2   p  lower bound  upper bound
Single_raters_absolute   ICC1   0.30  22  70   3337  0  0.24         0.39
Single_random_raters     ICC2   0.31  36  70   3290  0  0.24         0.40
Single_fixed_raters      ICC3   0.42  36  70   3290  0  0.35         0.52
Average_raters_absolute  ICC1k  0.95  22  70   3337  0  0.94         0.97
Average_random_raters    ICC2k  0.96  36  70   3290  0  0.94         0.97
Average_fixed_raters     ICC3k  0.97  36  70   3290  0  0.96         0.98

Number of subjects = 71 Number of Judges = 48


Reviewing the type of study (fully crossed), ICC2 is the correct type to use here. This gives .31. The question is then whether this is high or low. Usually, papers on ICC do not report interpretation guidelines. However, I found one guideline:

“Cicchetti (1994) provides commonly-cited cutoffs for qualitative ratings of agreement based on ICC values, with IRR being poor for ICC values less than .40, fair for values between .40 and .59, good for values between .60 and .74, and excellent for values between .75 and 1.0.” (from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3402032/ )

So, inter-rater agreement is poor according to this interpretation guideline. I note that ICC is sensitive to the mean levels of each rater (see equation 5 in the cited paper). Thus, ICC is a kind of absolute deviance measure (similar to that employed in the paper with respect to stereotype accuracy). If one standardizes the ratings first (thus setting their mean to the same value), the ICC goes up to .47 which is the same value as the mean inter-rater correlation.
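To illustrate why ICC2 is sensitive to rater mean levels, here is a minimal sketch that computes the single-rater, two-way random effects ICC from ANOVA mean squares (the Shrout and Fleiss formulation) on a made-up ratings matrix, before and after standardizing each rater. It is not the implementation used by `ICC()` in psych, and the data are invented.

```python
from statistics import mean, pstdev

def icc2(ratings):
    """ICC(2,1): rows are targets (e.g. countries), columns are raters."""
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-target mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def standardize_raters(ratings):
    """Z-score each rater's column so all raters share mean 0 and SD 1."""
    k = len(ratings[0])
    cols = [[row[j] for row in ratings] for j in range(k)]
    z_cols = [[(v - mean(c)) / pstdev(c) for v in c] for c in cols]
    return [[z_cols[j][i] for j in range(k)] for i in range(len(ratings))]

# Hypothetical data: rater 2 equals rater 1 plus 20 (pure elevation difference).
raw = [
    [10, 30, 12],
    [20, 40, 18],
    [30, 50, 33],
    [40, 60, 41],
]
icc_raw = icc2(raw)
icc_standardized = icc2(standardize_raters(raw))
```

In this toy matrix the raters rank the targets almost identically, yet the raw ICC2 is only moderate because one rater is shifted upward. After standardization the ICC2 is close to 1, which mirrors the rise from .31 to .47 described above.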

I have added a new subsection (3.1.6) that reports some of these results.

Effect of excluding participants
[stereotype check for Asians and European]
Some people might actually not know this. They might be generally poor stereotypers. How did the inclusion of these participants affect the correlations?


Many participants that fulfilled one exclusion criterion also fulfilled one of the others, thus some that failed the Eurasian height stereotype, would be excluded following one of the other criteria anyway. In general, results including the excluded cases were somewhat weaker. If one excludes only the case with massive missing data, the mean Pearson correlation accuracy is .41 and the median is .51. This contrasts with .48 and .55 if they are excluded.

The general problem is that of data integrity vs. exclusion of outliers. To what degree should the excluded cases be regarded as merely outliers, and to what degree as improper data (e.g. participants who were trying to screw up the results)? We chose fairly strict inclusion criteria, but others may disagree. The data are open, so if someone wants to use different exclusion criteria, they are more than welcome to do so.

Language edits
grouped by


Fixed.

Try, for example: "We wanted to avoid a narrow student sample (for the reasons discussed in: Henrich, Heine, & Norenzayan, 2010) and thus decided to post the survey to multiple places on the Internet."


Changed to “We wanted to avoid a narrow student sample (see discussion in Henrich, Heine, & Norenzayan, 2010)”

The survey can be found at the following site:


Added.

This age group was chosen because members of it are ...


Fixed.

(having a true zero)


I don't see the problem.

Although


This is intentional.

Rewrite.

Changed to: “Stereotypes are often mentioned in public discussion about immigration in Denmark. They have been proposed as a cause of group differences (Sareen, 2011) and unfair treatment such as housing discrimination (Ekberg, 2015; Hussein, 2014).”

they somewhat


Added.

Rewrite. What results? Also, altho is unorth(odox).


Changed to “We found evidence that stereotypes were derived from participants' estimates of the countries of origin's wealth (GDP per capita). We see two ways to further examine this question.”.

Altho is intentional too. See e.g. https://simplerspelling.wordpress.com/thru-with-through/ I'm a spelling reformist.

than to


Added.

---

Updated the draft to version 6 (2016-01-30 02:50 AM).
This also includes the new figure and the ODT file, as well as some slight changes to the R code.
The subject is interesting, but something may need to be cleared up. How, exactly, do you calculate the absolute delta score? I ask because I can't figure out the relationship between the name "delta discrepancy score" and the method applied as written in your text "is to calculate the absolute difference between each real value and each estimate".
Admin
Hi Meng Hu,

The absolute delta scores were calculated as: abs(real value - estimate value). The programmatic details are found in score_accuracy() in the kirkegaard package (for R). This is a fairly comprehensive function I wrote for the purpose of automatically calculating a bunch of accuracy scores given a vector of criteria values and a dataset of estimates.

https://github.com/Deleetdk/kirkegaard/blob/master/R/statistics.R#L495
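For readers who do not use R, the calculation described above can be sketched as follows. This is an illustration only, not a port of `score_accuracy()`. Averaging the absolute differences into one score per rater is an assumption made here, and the numbers are made up.

```python
from statistics import mean

def mean_abs_delta(estimates, real_values):
    """Mean of abs(real value - estimate) across groups for one rater."""
    return mean(abs(r - e) for r, e in zip(real_values, estimates))

# Hypothetical real percentages for four origin groups.
real = [20, 30, 40, 50]

exact = [20, 30, 40, 50]      # perfect accuracy
shifted = [30, 40, 50, 60]    # elevation bias only: every estimate 10 too high
stretched = [5, 25, 45, 65]   # dispersion bias only: same mean, too spread out

delta_exact = mean_abs_delta(exact, real)
delta_shifted = mean_abs_delta(shifted, real)
delta_stretched = mean_abs_delta(stretched, real)
```

Both biased raters correlate perfectly with the real values, yet both get a nonzero delta score. This is the sense in which delta scores include the elevation and dispersion components, while a correlation-based accuracy score ignores them.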
Admin
Jarret Crawford says that he does not have time to review this paper.

Lee Jussim's email has an auto-reply saying that he is very busy.

Have not heard back from the others so far.
My comments on this paper:

1. Is the word 'stereotype' appropriate? A stereotype is a belief that is passed on unchanged from one person to another with little personal input. A stereotype is not a product of personal experience or reflection. In the case of this study, the term 'stereotype' seems especially inappropriate given the low mean inter-rater correlation and the correlation with age (both of which suggest that personal experience and/or reflection are important inputs).

I understand many people now use the word 'stereotype' in this way, i.e., negative opinions of an out-group regardless of whether the opinions are the product of experience and reflection.

2. The sample size is very small. There is also the risk of selection bias, i.e., the people who are motivated to fill out this survey may not be representative. It would be better to confine this survey to a closed group, e.g., students in a classroom.

3. p. 5 "were pushed towards" should be "tended towards"

4. "Participants were asked to estimate the percent of persons aged 30-39 who were receiving social benefits for each country of origin"

I would question the assumption that participant responses could not be more accurate than the actual data. 'Country of origin' corresponds imperfectly to ethnicity. For example, many immigrants from Turkey are not ethnic Turks. They may actually be of Greek, Armenian, Jewish, or Russian background (Istanbul used to have a large Russian community). But the survey participants will interpret that question as pertaining only to ethnic Turks. So their assessment of ethnic Turkish behavior may actually be more accurate than what the official data lead us to believe.

I see this problem with much of the official data. How many Vietnamese immigrants are ethnic Vietnamese? Many are in fact ethnic Chinese. How many Kenyan immigrants are ethnic Kenyans? Most are of South Asian origin. The same is true for Ugandan immigrants. Yet when people are asked about the behavior of Kenyan immigrants, they will interpret the question in ethnic terms.

Same problem with many European countries. How many Romanian immigrants are ethnic Romanians? I suspect a large proportion are Roma. Yet most people, Roma and non-Roma alike, do not perceive Romanians and Roma as forming a single category.

You will get this problem even if you put this sort of question to the immigrants themselves. From my experience, very few East African Asians identify primarily as East Africans. Keep in mind that many of these countries are recent creations that date to the 20th century. Before 1918, there was no "Turkey." There was only the Ottoman Empire, which was home to people of many national and religious origins.
Admin
Peter,

Thank you for your review. Below, we reply to your comments point by point.

1. Is the word 'stereotype' appropriate? A stereotype is a belief that is passed on unchanged from one person to another with little personal input. A stereotype is not a product of personal experience or reflection. In the case of this study, the term 'stereotype' seems especially inappropriate given the low mean inter-rater correlation and the correlation with age (both of which suggest that personal experience and/or reflection are important inputs).

I understand many people now use the word 'stereotype' in this way, i.e., negative opinions of an out-group regardless of whether the opinions are the product of experience and reflection.


Stereotype is used by different people with somewhat different meanings. Dictionary.com gives the following definitions:

noun
1.
a process, now often replaced by more advanced methods, for making metal printing plates by taking a mold of composed type or the like in papier-mâché or other material and then taking from this mold a cast in type metal.
2.
a plate made by this process.
3.
a set form; convention.
4.
Sociology. a simplified and standardized conception or image invested with special meaning and held in common by members of a group


Wiktionary gives:

stereotype ‎(plural stereotypes)
- A conventional, formulaic, and often oversimplified or exaggerated conception, opinion, or image of (a person).
- (psychology) A person who is regarded as embodying or conforming to a set image or type.
- (printing) A metal printing plate cast from a matrix moulded from a raised printing surface.
- (computing, UML) An extensibility mechanism of the Unified Modeling Language.


So, as you can see, there are multiple meanings of the word in current use. The 4th meaning from Dictionary.com and 1st meaning from Wiktionary are related to our use.

Lee Jussim (with and without co-authors) spends quite a bit of time debating the meaning of the word in his book and in his review papers on the topic. In his meaning, a stereotype is merely a belief about a group. The belief may be inaccurate or accurate. Social psychologists tend to regard stereotypes as inaccurate, but this belief is itself inaccurate as past research shows stereotypes (that is, beliefs about groups) to be generally but not always fairly accurate. We have no intention of going into this discussion at length in the paper and merely follow the practice of many other researchers in using stereotype in this neutral fashion.

We have added a footnote to the introduction to clarify our use of stereotype.

--

2. The sample size is very small. There is also the risk of selection bias, i.e., the people who are motivated to fill out this survey may not be representative. It would be better to confine this survey to a closed group, e.g., students in a classroom.


We don't have access to a large body of students that we can force to fill out the questionnaire. Neither of the authors is a teacher or otherwise in charge of a large pool of willing persons.

However, we did not want to leave the question unresearched, so we gathered what evidence we could online. As you say, this may result in selection bias. At present, we cannot know. We tried to recruit participants from many different online locations and succeeded to some degree.

In defense of the sample, it is much broader (closer to a total population sample) than typical student samples are. This may be relevant to stereotype research.

3. p. 5 "were pushed towards" should be "tended towards"


Fixed.

4. "Participants were asked to estimate the percent of persons aged 30-39 who were receiving social benefits for each country of origin"

I would question the assumption that participant responses could not be more accurate than the actual data. 'Country of origin' corresponds imperfectly to ethnicity. For example, many immigrants from Turkey are not ethnic Turks. They may actually be of Greek, Armenian, Jewish, or Russian background (Istanbul used to have a large Russian community). But the survey participants will interpret that question as pertaining only to ethnic Turks. So their assessment of ethnic Turkish behavior may actually be more accurate than what the official data lead us to believe.

I see this problem with much of the official data. How many Vietnamese immigrants are ethnic Vietnamese? Many are in fact ethnic Chinese. How many Kenyan immigrants are ethnic Kenyans? Most are of South Asian origin. The same is true for Ugandan immigrants. Yet when people are asked about the behavior of Kenyan immigrants, they will interpret the question in ethnic terms.

Same problem with many European countries. How many Romanian immigrants are ethnic Romanians? I suspect a large proportion are Roma. Yet most people, Roma and non-Roma alike, do not perceive Romanians and Roma as forming a single category.

You will get this problem even if you put this sort of question to the immigrants themselves. From my experience, very few East African Asians identify primarily as East Africans. Keep in mind that many of these countries are recent creations that date to the 20th century. Before 1918, there was no "Turkey." There was only the Ottoman Empire, which was home to people of many national and religious origins.


Our study did not concern ethnic or racial groups, only national groups (the words do not appear in the paper). We did not ask participants to estimate the performance of ethnic/racial groups, just persons by country of origin. As you note, country of origin and ethnic/racial group are strongly but imperfectly related.

Some countries have populations that are diverse in terms of ethnic and racial groups. If ethnicity/race is used as a tool in estimating group outcomes, more diversity might be negatively related to the accuracy of the stereotype. Because you brought up the idea, we did an analysis of diversity and aggregate stereotype accuracy. For this purpose, we used 3 different types of diversity (ethnic, linguistic and religious) drawn from two different sources, giving 5 measures in total. We also used a general score based on these.

There was not much to see for the directional error, but there was a reasonably strong effect for the absolute error. I attach the two plots of the general diversity factor and accuracy.

[attachment=705]
[attachment=706]

The correlation table looks like this:


Predictor                               aggregate estimate error    aggregate estimate error abs
EthnicFractionizationIndexFearon03      0.01                        0.41
CulturalDiversityIndexFearon03          0.07                        0.32
EthnicFractionalizationAlesina03        0.15                        0.39
LinguisticFractionalizationAlesina03    0.29                        0.31
ReligiousFractionalizationAlesina03     0.15                        0.28
general diversity                       0.09                        0.42
aggregate estimate error                1                           -0.22
aggregate estimate error abs            -0.22                       1


We have written a new subsection (6.1.4) of the paper to add this analysis.
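
For readers following the thread, the two error columns in the table above can be sketched roughly as follows. This is a minimal illustration and not the analysis code used for the paper: the per-country numbers are invented, the variable names are hypothetical, and the sign convention (estimate minus real value) is an assumption.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-country values, invented for illustration only.
real = [10.0, 20.0, 30.0, 40.0]       # actual % receiving social benefits
estimate = [12.0, 15.0, 33.0, 50.0]   # aggregate (mean) estimate per country
diversity = [0.1, 0.4, 0.2, 0.8]      # some diversity index per country

# Directional error keeps the sign (over- vs. underestimation);
# assumed here to be estimate minus real value.
error = [e - r for e, r in zip(estimate, real)]

# Absolute error drops the sign and measures only the size of the miss.
error_abs = [abs(d) for d in error]

# Correlate the diversity index with each kind of error across countries.
r_directional = pearson(diversity, error)
r_absolute = pearson(diversity, error_abs)
```

A diversity measure can be nearly unrelated to the directional error and still be related to the absolute error, because over- and underestimates cancel out in the former but not in the latter.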

In general, publicly available statistics for European countries do not provide ethnicity information, so it is usually not possible to do analysis of ethnicity/race. One must work with country of origin or even worse, citizenship data. The notable exception is the United Kingdom. We are not aware of any public datasets for Denmark that provide data for ethnicity or race and as such, we could not use that for this study.

--

We have uploaded revision 7 of the paper to OSF: https://osf.io/bwtg8/
I am fine with the content of this version. I approve publication, though I would suggest another proof reading.
Admin
Sean Stevens, a stereotype researcher, has agreed to review this paper (external reviewer). He said he would have a review ready on Sunday.

https://www.researchgate.net/profile/Sean_Stevens/publications
I am fine with the content of this version. I approve publication, though I would suggest another proof reading.


Emil and I have begun to work on a number of projects together. As this could lead to the impression of partiality on my part, I rescind my approval and recuse myself from this paper's review process.
I give my approval for this paper. I would like to see, however, some acknowledgment that the findings are preliminary. Perhaps the term "research note" could be added to the title.
To me the paper looks publishable. I approve.
Sorry for the delay, I forgot about this.

1) "This age group was chosen because members of it are are old enough to be finished with"

remove superfluous 'are'

2) Could you state all the answer choices you had for the control questions? Unless the other choices are obviously wrong, a failure to answer that East Asians are shorter could simply be due to ignorance.

3) There's still no explanation of what a ghetto is.

4) "Party voting desires tended towards national, conservative"

By national do you mean nationalist?

5) Why is the section "Inter-rater consistency" before section 4? The inter-rater correlations concern (unspecified) measures that are introduced in section 4.

6) In Figure 2, the scale of the Y axis is still a complete mystery as is the meaning of the curve. Why can't you use a simple histogram like this one (without the normal curve): http://humanvarieties.org/wp-content/uploads/2016/01/permanent_income_distribution1.png

7) "Using Jussim et al (2015)'s cutoffs of .30 and .50 for levels of accuracy"

Better: Using the cutoffs of .30 and .50 for levels of accuracy, as recommended in Jussim et al. (2015)

8) In Table 3, add a note about the use of CIs.

9) "the log 10 value"

call it the common logarithm or the base-10 logarithm or log[subscript]10[/subscript]

10) A research paper like this is not the right place for orthographic innovations. Given that the English in the paper isn't that fluent to begin with, those altho's look like misspellings.

11) In the section "Levels of analysis", be more explicit about how the accuracy measures are constructed. Give examples.
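
As an illustration of the kind of example point 11 asks for, the difference between the individual and the aggregate level of analysis could be sketched as below. This is only a sketch under stated assumptions: the ratings are invented, the rater names are hypothetical, and the paper's own construction may differ in detail.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: real values for four countries and three raters'
# estimates for the same countries (invented for illustration only).
real = [10.0, 20.0, 30.0, 40.0]
ratings = {
    "rater_a": [15.0, 10.0, 35.0, 30.0],
    "rater_b": [5.0, 30.0, 20.0, 45.0],
    "rater_c": [20.0, 25.0, 40.0, 50.0],
}

# Individual level: one accuracy correlation per participant,
# summarized here by the median across participants.
personal = {name: pearson(est, real) for name, est in ratings.items()}
median_personal = median(personal.values())

# Aggregate level: average the estimates per country first,
# then correlate the averaged estimates with the real values.
aggregate_estimate = [mean(col) for col in zip(*ratings.values())]
aggregate_accuracy = pearson(aggregate_estimate, real)
```

Averaging across raters cancels part of each rater's idiosyncratic error, which is why the aggregate correlation tends to be higher than the typical personal one.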
Review: Country of origin and use of social benefits: A pilot study of stereotype accuracy in Denmark.

Overall I think this is an interesting study. I was happy the authors preregistered the study and were transparent about it being a fortuitous one of convenience. Because of this, I found their discussion of the current findings to be measured. I also think that the authors' assessment of stereotype accuracy in a variety of ways (via correlations, discrepancy scores, elevation bias, and dispersion bias), at both the individual and group levels, is a strength of the paper.
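
For readers less familiar with these measures, elevation bias and dispersion bias can be sketched as follows. This is a minimal illustration with invented numbers. The definitions used (difference of means and difference of standard deviations, each computed as estimate minus real value) are assumptions about how such measures are commonly built, not a statement of the paper's exact formulas.

```python
from statistics import mean, stdev

# Hypothetical values for four countries (invented for illustration only).
real = [10.0, 20.0, 30.0, 40.0]       # actual % receiving social benefits
estimate = [20.0, 25.0, 30.0, 35.0]   # one rater's (or the aggregate) estimate

# Elevation bias: are the estimates too high or too low on average?
# Assumed sign convention: mean of estimates minus mean of real values.
elevation_bias = mean(estimate) - mean(real)

# Dispersion bias: are the estimated group differences too large or too small?
# Assumed definition: SD of estimates minus SD of real values.
dispersion_bias = stdev(estimate) - stdev(real)

# Discrepancy scores: per-country signed and absolute differences.
discrepancy = [e - r for e, r in zip(estimate, real)]
mean_abs_discrepancy = mean(abs(d) for d in discrepancy)
```

In this invented example the estimates are slightly too high on average (positive elevation bias) and too compressed (negative dispersion bias), even though they order the countries perfectly.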

I would like to see an elaboration on how the current findings fit into the existing literature on national character stereotypes (e.g., Costa & McCrae, 2008; Costa, Terracciano, & McCrae, 2001; McCrae & Allik, 2002; McCrae & Terracciano, 2005; Terracciano et al., 2005). Much of this literature has assessed national character/personality stereotypes and the evidence for accuracy is mixed (for reviews see Jussim et al., 2009; Jussim, Crawford, Anglin, Chambers, Stevens, & Cohen, 2015). One possible explanation for the comparatively high accuracy found here is that the current study assesses the perception of a behavior (i.e., the use of governmental social benefits) and does not ask for personality assessments. Indeed, Heine, Buchtel, and Norenzayan (2008) challenged the “no accuracy in national character stereotypes” conclusion by comparing stereotypes to behavior potentially reflecting conscientiousness. When behavior (GDP, longevity, walking speed, clock accuracy, and postal worker speed) rather than self-reports on Big Five personality questionnaires was used as the criterion for accuracy, the correlations between consensual stereotypes and behavior averaged about .60.

I would also like to see more discussion of the role of gender in the current study. After the removal of 12 subjects, for various justified reasons, the sample is 75% male (36 males vs. 12 females). Yet, the study also reports a strong effect of gender on stereotype accuracy (d = .86). I wonder how much, if any, of the accuracy in the current study is related to gender and the greater number of males in the sample. While the authors may not be able to answer that question directly, it would be nice to see it addressed in the discussion.
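
To make the effect size in question concrete, a standardized mean difference with unequal group sizes can be sketched as below. This is a hedged illustration only: it uses the common pooled-SD form of Cohen's d, which is an assumption about the paper's method, and the function name and input values are hypothetical.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation (assumed form)."""
    n_a, n_b = len(group_a), len(group_b)
    # Weight each group's sample variance by its degrees of freedom,
    # so the larger group contributes more to the pooled estimate.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Invented accuracy scores for two groups of unequal size.
larger_group = [2.0, 4.0, 6.0, 4.0, 2.0, 6.0]
smaller_group = [1.0, 3.0, 5.0]
d = cohens_d(larger_group, smaller_group)
```

With a small group of 12, the estimate of that group's mean and variance is noisy, which is reflected in the wide confidence interval reported for d.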

Finally, I found myself wondering about the conjecture on page 12 that the "media are probably likelier to discuss and report about members of the larger groups." By larger groups do the authors mean there are more members of these groups in Denmark or does it refer to the overall population numbers for that group across the world?

I also wonder about other potential media effects. For instance, the figure on page 11 presents a scatterplot of aggregate estimates and the actual proportion of people receiving social benefits. The top right corner contains data points for Somalia, Syria, Iraq, Afghanistan, and Lebanon. There are currently ongoing armed conflicts in all of those countries, with the exception of Lebanon, which borders Syria. Might the media be covering those conflicts and thus those countries more than some of the others? Admittedly this point is speculative and I do not think the authors can assess this possibility with the current data, so this is likely more of a suggested future direction.

Before recommending publication, I would like to look over the questionnaire used and thus would require an English translation of the measure, which is currently in Danish.

References:

Costa, P. T., Jr., & McCrae, R. R. (2008). The revised NEO Personality Inventory (NEO-PI-R). In G. J. Boyle, G. Matthews, & D. H. Saklofske (Eds.), The Sage Handbook of Personality Theory and Assessment: Personality Measurement and Testing, Volume 2. London: Sage Publications.

Costa, P. T., Jr., Terracciano, A., & McCrae, R. R. (2001). Gender differences in personality traits across cultures: Robust and surprising findings. Journal of Personality and Social Psychology, 81, 322-331.

Heine, S. J., Buchtel, E. E., & Norenzayan, A. (2008). What do cross-national comparisons of personality traits tell us? The case of conscientiousness. Psychological Science, 19, 309-313.

Jussim, L., Cain, T., Crawford, J., Harber, K., & Cohen, F. (2009). The unbearable accuracy of stereotypes. In T. Nelson (Ed.), Handbook of prejudice, stereotyping, and discrimination (pp. 199-227). Hillsdale, NJ: Erlbaum.

Jussim, L., Crawford, J. T., Anglin, S. M., Chambers, J., Stevens, S. T., & Cohen, F. Stereotype accuracy: One of the largest relationships in all of social psychology. To appear in T. Nelson (Ed.), Handbook of prejudice, stereotyping, and discrimination (2nd ed.). Hillsdale, NJ: Erlbaum.

McCrae, R. R., & Allik, J. (Eds.). (2002). The five-factor model of personality across cultures. New York: Kluwer Academic/Plenum Publishers.

McCrae, R. R., & Terracciano, A. (2005). Universal features of personality traits from the observer’s perspective: Data from 50 cultures. Journal of Personality and Social Psychology, 88, 547-561.

Terracciano, A., Abdel-Khalek, A. M., Adam, N., Adamovova, L., Ahn, C., Ahn, H. N., … & Meshcheriakov, B. (2005). National character does not reflect mean personality trait levels in 49 cultures. Science, 310, 96-100.