Back to [Archive] Post-review discussions

[ODP] Country of origin and use of social benefits: A pilot study of stereotype accuracy
Admin
Sean,

Thanks for the detailed review. Julius (co-author) has translated the questionnaire and placed it in the supplementary materials.

https://osf.io/ecpzj/files/

--

We will work on a revision that takes into account the comments given by both Dalliard (above) and you. It may take a few days.
Admin
Peter,

I give my approval for this paper. I would like to see, however, some acknowledgment that the findings are preliminary. Perhaps the term "research note" could be added to the title.


The paper already has “pilot study” in its title. We also discussed this in the discussion section, under limitations:

● Due to the non-random sampling and relationships between predictors, the findings should be seen as preliminary. Non-random sampling can produce or eliminate relationships between predictors, which can result in spurious correlations or suppress real ones.
● The limited sample size makes conclusions uncertain, especially about correlates of (in)accuracy.


Thus, it seems to us that we have sufficiently noted that the findings are preliminary.

--

Dalliard,

Fixed (1)

2) Could you state all the answer choices you had for the control questions? Unless the other choices are obviously wrong, it's not clear that not answering that East Asians are shorter is not due to ignorance.


Sean requested an English translation of the questionnaire (not done by an experienced translator), which we have supplied. In it, you can find the question and answer options for the second control question. Namely:

What is the relationship between Europeans' and East Asians' height? (East Asians include Chinese, Japanese and Koreans.):
● All Europeans are taller than all East Asians.
● Most Europeans are taller than most East Asians.
● Europeans and East Asians are equally tall.
● Most East Asians are taller than most Europeans.
● All East Asians are taller than all Europeans.
● Don’t know / other.


Here, we only accepted “Most Europeans are taller than most East Asians.” as the correct answer. All the other options are clearly false, except for the last, in which case the participant was either confused by the question or did not know that East Asians are generally shorter than Europeans.

3) There's still no explanation of what a ghetto is.


We added the following to Section 2.2. It reads:
A ghetto was defined as an area where 1) many immigrants live, 2) the unemployment rate is high, and 3) the crime rate is high. This is a non-quantitative definition that follows the definition of ghetto used by the former Ministry for Cities, Housing and Rural Districts.

4) "Party voting desires tended towards national, conservative"
By national do you mean nationalist?


Yes. We have changed it to “nationalist”.

5) Why is the section "Inter-rater consistency" before section 4? The inter-rater correlations concern (unspecified) measures that are introduced in section 4.


Inter-rater consistency (or reliability) is typically regarded as a descriptive statistic and reported in the methodology section. The measures reported in Section 3.1.6 do not depend on any measures introduced in Section 4, so we don't understand this comment. They are based on the raw estimates themselves, not on scored versions of the estimates.

6) In Figure 2, the scale of the Y axis is still a complete mystery as is the meaning of the curve. Why can't you use a simple histogram like this one (without the normal curve): http://humanvarieties.org/wp-content/upl...ution1.png


The scale of the Y-axis is not important, only the relative differences. Compare with Figure 1: the scales don't match, but this does not matter for the interpretation. I have tried to put a useful scale on these sorts of plots, but it does not appear to be easy. http://stackoverflow.com/questions/32412805/ggplot2-histogram-with-density-curve-that-sums-to-1

One can work out the meaning of the Y-axis numbers, but it requires diving into the code base of the plotting function.

We don't use SPSS, so we cannot produce the plot you suggest, and we prefer the current plots anyway. The density curve in your example assumes a normal distribution, whereas those in our plots do not assume any particular distribution.

7) "Using Jussim et al (2015)'s cutoffs of .30 and .50 for levels of accuracy"

Better: Using the cutoffs of .30 and .50 for levels of accuracy, as recommended in Jussim et al. (2015)


Changed to your version.

8) In Table 3, add a note about the use of CIs.


Good idea. Added “The numbers in brackets show the 95% analytic confidence intervals.”

9) "the log 10 value"

call it the common logarithm or the base-10 logarithm or log[subscript]10[/subscript]


Changed to “We transformed the population data using the base-10 logarithm to make them more normal.”

10) A research paper like this is not the right place for orthographic innovations. Given that the English in the paper isn't that fluent to begin with, those altho's look like misspellings.


We disagree. Given the strong conservatism about spelling, the only way to effect a change is to use alternative spellings in refined writing, such as academic papers. Using a few alternative spellings is not seriously confusing, as it would be to use e.g. Cut Spelling (https://en.wikipedia.org/wiki/Cut_Spelling).

11) In the section "Levels of analysis", be more explicit about how the accuracy measures are constructed. Give examples.


We have added an example:

For instance, suppose we have two raters who each rate 5 groups on some trait. Their estimates are {10, 5, 7, 8, 5} and {12, 7, 8, 5, 8}. Suppose further that the true values are {15, 4, 10, 5, 6}. Using the Pearson correlation as the measure of accuracy, the raters' accuracy scores are .78 and .89, and their inter-correlation is .51 (individual-level). Their average accuracy is .83 (individual-level on average). If instead we average their estimates first, they become {11, 6, 7.5, 6.5, 6.5}. The accuracy of the aggregated estimates is .97 (aggregate-level).

To be sure, one could collapse levels 1-2 into one. Ultimately, it does not matter much which exact way it is conceptualized.
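The worked example above can be reproduced with a short script. A minimal sketch (Python used here for illustration; the study's own analyses are in R):

```python
import statistics

def pearson(x, y):
    # Plain Pearson correlation, no external dependencies.
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

rater1 = [10, 5, 7, 8, 5]
rater2 = [12, 7, 8, 5, 8]
truth  = [15, 4, 10, 5, 6]

acc1 = pearson(rater1, truth)        # about .78, individual-level accuracy
acc2 = pearson(rater2, truth)        # about .89
inter = pearson(rater1, rater2)      # about .51, inter-rater correlation
avg_acc = (acc1 + acc2) / 2          # about .83, individual-level on average

# Aggregate-level: average the estimates first, then score the average.
aggregate = [(a + b) / 2 for a, b in zip(rater1, rater2)]
agg_acc = pearson(aggregate, truth)  # about .97
```

Note how the aggregate-level accuracy exceeds both individual accuracies: averaging the estimates cancels some of the raters' idiosyncratic errors before scoring.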

--

Sean,

I would like to see an elaboration on how the current findings fit into the existing literature on national character stereotypes (e.g., Costa & McCrae, 2008; Costa, Terracciano, & McCrae, 2001; McCrae & Allik, 2002, McCrae & Terracciano, 2005; Terracciano et al., 2005). Much of this literature has assessed national character/personality stereotypes and the evidence for accuracy is mixed (for reviews see Jussim et al., 2009; Jussim, Crawford, Anglin, Chambers, Stevens, & Cohen, 2015). One possibility is that the current study assesses the perception of a behavior (i.e., the use of governmental social benefits) and does not ask for personality assessments. Indeed, Heine, Buchtel, and Norenzayan (2008) challenged the “no accuracy in national character stereotypes” conclusion by comparing stereotypes to behavior potentially reflecting conscientiousness. When behavior (GDP, longevity, walking speed, clock accuracy, and postal worker speed) rather than self-reports on Big Five personality questionnaires were used as the criteria for accuracy, the correlations between consensual stereotypes and behavior averaged about .60.


While one could elaborate further on how the present findings fit with other findings, it seems to us that given the preliminary nature of the results, it is not wise to speculate on this matter. Patterns found in small studies are often not replicated in larger samples (sampling error and publication bias). We will try to replicate the study using new data before trying to synthesize with the broader literature.

I would also like to see more discussion of the role of gender in the current study. After the removal of 12 subjects, for various justified reasons, the sample is 75% male (36 males vs. 12 females). Yet, the study also reports a strong effect of gender on stereotype accuracy (d = .86). I wonder how much, if any, of the accuracy in the current study is related to gender and the greater number of males in the sample. While the authors may not be able to answer that question directly, it would be nice to see it addressed in the discussion.


The gender effect is possibly a fluke or a confound and we don't make much of it. Note that d values depend on the observed standard deviations within groups and do not necessarily mean that a difference is large in absolute size. In this case, the mean correlation for males is .52 and that for females is .37, so the difference is not large.

Age is confounded with gender in this sample: the mean ages are 35 and 23 for males and females respectively. Using both age and gender in a linear model shrinks the beta for gender by about half. R output:


lm("pearson_r ~ Gender", data = std_df(d_crit)) %>% lm_CI()
$coefs
           Beta   SE CI.lower CI.upper
GenderMale 0.81 0.31      0.2     1.43

$effect_size
       R2   R2 adj.
0.1334814 0.1146441

lm("pearson_r ~ Gender + Age", data = std_df(d_crit)) %>% lm_CI()
$coefs
           Beta   SE CI.lower CI.upper
GenderMale 0.42 0.29    -0.16     1.00
Age        0.49 0.13     0.23     0.75

$effect_size
       R2   R2 adj.
0.3460710 0.3170075


The CI for Gender spans 0 after the inclusion of age, while that for age still does not span 0. Thus, the gender effect is possibly just a confound due to age and non-random sampling.

We did not conduct these kinds of multiple regression analyses in the paper because the sample size is too small for reliable results.
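The shrinking of the gender beta once age is included is the classic signature of confounding. A minimal simulated sketch of the phenomenon (entirely hypothetical data, not the study's; Python for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical setup: gender is correlated with age,
# but only age truly affects the outcome.
male = rng.integers(0, 2, n).astype(float)
age = 25 + 10 * male + rng.normal(0, 5, n)      # males older on average
accuracy = 0.01 * age + rng.normal(0, 0.1, n)   # driven by age alone

def ols_betas(predictors, y):
    # Least-squares coefficients with an intercept column prepended.
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_gender_only = ols_betas([male], accuracy)[1]       # spuriously large
b_gender_adj = ols_betas([male, age], accuracy)[1]   # shrinks toward 0
```

With age omitted, gender picks up the age effect it proxies for; controlling for age removes most of the apparent gender difference.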

Finally, I found myself wondering about the conjecture on page 12 that the "media are probably likelier to discuss and report about members of the larger groups." By larger groups do the authors mean there are more members of these groups in Denmark or does it refer to the overall population numbers for that group across the world?


We meant that the media are likelier to report on members of the larger groups in Denmark. We added “in Denmark” to the section header to make this clearer.

I also wonder about other potential media effects. For instance, on page 11 Figure presents a scatterplot of aggregate estimates and the actual proportion of people receiving social benefits. The top right corner contains data points for Somalia, Syria, Iraq, Afghanistan, and Lebanon. There are currently ongoing armed conflicts in all of those countries, with the exception of Lebanon which borders Syria. Might the media be covering those conflicts and thus those countries more than some of the others? Admittedly this point is speculative and I do not think the authors can assess this possibility with the current data, so this is likely more of a suggested future direction.


There are currently large-scale immigration waves from Syria and nearby countries (http://www.bbc.com/news/world-europe-34131911), so yes, the media will talk more about these. However, looking into the effects of this is very speculative and not really within our power to find out.

---

We have updated the files given these changes. https://osf.io/ecpzj/files/ This is revision #8.
"The paper already has “pilot study” in its title."

Emil,

At the risk of splitting hairs, the term "research note" seems more appropriate than "pilot study." The term "pilot study" implies no judgment on the quality of the findings. Indeed, most pilot studies never produce any publishable findings. The purpose is simply to determine the conditions under which a study should be carried out, in particular the optimal sample size and methodology. This hasn't been done.

Please, let's be specific here. What will be the sample size of the subsequent study -- the real study? What will be the methodology? If you have no answers, you shouldn't be using the term "pilot study."

I will defer to the other reviewers on this point. From my experience, the term "research note" is a more appropriate term for a situation where the nature of further research is far from clear and where it is far from clear that such research will ever be done. The term "pilot project" implies that stage II is already in the works and that the parameters of stage II have been established.
Admin
Peter,

We are planning to collect more data from three sources:

First, we are planning on finding funding to pay a pollster to collect responses from a random sample. I have a sample size of about 200 in mind. We will possibly have to reduce the sample of countries in order to shorten the questionnaire (we had in mind using the 20 largest). It is very expensive to ask pollsters to collect data and I don't have access to research grants because I'm not working in academia. However, I will try to convince a few people to pool the necessary money.

Second, we have both made contacts with several teachers of secondary education in our extended networks. This allows us to collect more data from students, as was done in the pilot study. Naturally, using students has problems. For instance, they show very little age variability, and we found age to be a strong predictor. On the other hand, students are fairly representative with regard to political opinions, which the pilot study sample was not.

Third, we will set up a website where one can answer the stereotype questions. Participants then receive instant feedback on their accuracy that they can share on social media. Hopefully, this ease of access will result in a lot of responses. It will be harder to collect other information this way, however.

So to conclude, this is a study that we intend to replicate and expand upon in the future. Wikipedia cites the following definition of "pilot study" (https://en.wikipedia.org/wiki/Pilot_experiment):

A pilot study, pilot project or pilot experiment is a small scale preliminary study conducted in order to evaluate feasibility, time, cost, adverse events, and effect size (statistical variability) in an attempt to predict an appropriate sample size and improve upon the study design prior to performance of a full-scale research project.
The difference between the second and the third is that in the second we calculate accuracy scores for each individual and then average, while in the third we aggregate estimates and then calculate accuracy scores.


You can shorten the sentence by just saying the 3rd is done in the reverse way.

Nationalism, personal liberalism and especially age seem to be somewhat useful predictors of correlational accuracy.


Nothing especially wrong, but just a reminder. When you mention age effect, be careful that sometimes an age effect is confounded with cohort effect.

Concerning Table 3, I don't understand why both conservatism and liberalism have positive r with both elevation and dispersion bias; that conservatism and liberalism tend to exaggerate the group differences is something that I didn't expect.

I believe there are many aspects that are covered here : many types of bias (elevation/dispersion) and the possibility of confoundings (e.g., GDP*real estimates corr. removed from GDP*stereotype accuracy scores), which is a good thing. In general I agree with your general conclusion.

There's, in my opinion another little thing that bothers me. I agree with one reviewer here that the role of gender should not be obscured; as your page 10 reveals, the d is not just smaller for females but also tends toward the opposite direction.

I disagree with the following comment :

In this case, the mean correlation for males is .52 and that for females is .37, so the difference is not large.


Because it looks like you think these means can be reliably trusted. It would be yes if the dispersion (or SD) was not really large. And for females, I find the SD being really too large (and looking at figure 3, the numbers are far from being normally distributed). So, I don't trust the mean of 0.37. Maybe there's a gender effect, maybe no. You need more samples. Still, I will not be surprised if you can find a gender effect in other, larger samples.

I think I will give you my approval, since it has no big flaws, but I want to know first what you do about the above-mentioned point.
Admin
Meng Hu,

Nationalism, personal liberalism and especially age seem to be somewhat useful predictors of correlational accuracy.


Nothing especially wrong, but just a reminder. When you mention age effect, be careful that sometimes an age effect is confounded with cohort effect.


You are correct. We will be more clear. It could be a cohort effect.

Concerning Table 3, I don't understand why both conservatism and liberalism have positive r with both elevation and dispersion bias; that conservatism and liberalism tend to exaggerate the group differences is something that I didn't expect.


It is personal liberalism, as in favoring more personal freedoms, not the vague concept of liberalism as used in the US. The question was (translated):

How important is personal liberty to you?
(By “personal liberty” we mean the freedom to do things in private that do not hurt others. These include the freedom to eat unhealthy food, not exercise, drink alcohol, smoke cigarettes, smoke marijuana, take ecstasy or heroin, eat mushrooms, practice extreme sports, sunbathe, have sex outside of marriage, have sex with people of the same gender as yourself, marry more than one person, buy/sell sexual services, refuse military service, and the like.)
1-7
1 is not important at all and 7 is very important.


The translation is poor at places (I have changed the one above a little). We will work on a better version. The translation was done by my co-author who is not used to this kind of work.

I would not try to make much of the correlates of accuracy when the sample size is so small and unrepresentative. We offer the correlations only as suggestive: worthy of further study.

I believe there are many aspects that are covered here : many types of bias (elevation/dispersion) and the possibility of confoundings (e.g., GDP*real estimates corr. removed from GDP*stereotype accuracy scores), which is a good thing. In general I agree with your general conclusion.

There's, in my opinion another little thing that bothers me. I agree with one reviewer here that the role of gender should not be obscured; as your page 10 reveals, the d is not just smaller for females but also tends toward the opposite direction.


I don't understand. The gender difference is in the same direction in both cases, towards higher accuracy for males. Note that for mean absolute errors, a smaller value is better, so d=-.22 is .22 in favor of male accuracy. For Pearson correlations, positive values signal higher accuracy, so d=.86 means higher male accuracy in this sample.

I disagree with the following comment :

In this case, the mean correlation for males is .52 and that for females is .37, so the difference is not large.


Because it looks like you think these means can be reliably trusted. It would be yes if the dispersion (or SD) was not really large. And for females, I find the SD being really too large (and looking at figure 3, the numbers are far from being normally distributed). So, I don't trust the mean of 0.37. Maybe there's a gender effect, maybe no. You need more samples. Still, I will not be surprised if you can find a gender effect in other, larger samples.


We already noted in the text in Section 5.1.2 that "The female distribution of results was peculiar, so some caution is advised. We note that the difference was much smaller using mean absolute errors (d=-.22 [CI95: -.89 to 0.45])."

I think I will give you my approval, since it has no big flaws, but I want to know first what you do about the above-mentioned point.


Let us know what caveat you would like us to add. We did not draw any strong conclusions in the study regarding gender; we just presented the differences and their CIs and advised caution.

The purpose of this study is not so much to draw conclusions, just to showcase the methods one could use in a larger sample. Hence, pilot study.
Admin
I have added a new revision, version 9. Files updated https://osf.io/ecpzj/files/

Changes:
- Added a new paragraph in the conclusion:
We found some evidence that some variables are associated with stereotype accuracy. For instance, observed age had a correlation of 0.56 [CI95: 0.32 0.81] with correlational accuracy. If real, it is unclear whether this is an age or cohort effect. In general, we do not draw strong conclusions from the analyses of accuracy predictors because the sample was both small and unrepresentative.


- Edited the paragraph discussing the gender differences to be more clear:
Men had higher accuracy: .52 vs. .37. This is actually a large standardized difference (d=.86 [CI95: 0.17 to 1.56]; using pooled SD). The female distribution of results is peculiar, so some caution is advised. We also note that the difference was much smaller but in the same direction using mean absolute errors (d=-.22 [CI95: -.89 to 0.45]). Recall that lower values of mean absolute errors signal higher accuracy, so a d of -.22 is .22 in favor of male accuracy.


- Added clarification in Section 5.1.1 about the meaning of personal liberalism. ("(preference for more personal freedoms)")

- Various other small changes.
It is personal liberalism, as in favoring more personal freedoms, not the vague concept of liberalism as used in the US. The question was (translated):

How important is personal liberty to you?
(By “personal liberty” we mean the freedom to do things in private that do not hurt others. These include the freedom to eat unhealthy food, not exercise, drink alcohol, smoke cigarettes, smoke marijuana, take ecstasy or heroin, eat mushrooms, practice extreme sports, sunbathe, have sex outside of marriage, have sex with people of the same gender as yourself, marry more than one person, buy/sell sexual services, refuse military service, and the like.)
1-7
1 is not important at all and 7 is very important.


Yes, my mistake. I don't know why I was thinking about liberalism as understood in the US. Automatism, probably...

I don't understand. The gender difference is in the same direction in both cases, towards higher accuracy for males. Note that for mean absolute errors, a smaller value is better, so d=-.22 is .22 in favor of male accuracy. For Pearson correlations, positive values signal higher accuracy, so d=.86 means higher male accuracy in this sample.


No problem.

We already noted in the text in Section 5.1.2 that "The female distribution of results was peculiar, so some caution is advised. We note that the difference was much smaller using mean absolute errors (d=-.22 [CI95: -.89 to 0.45])."


I know. But you are missing the implication of this finding, i.e., that such an abnormal SD causes the mean (here, the mean correlation) to be unreliable. So when you say "yes, the difference in mean correlation is not so large" I would rather say "yes but...".

Let us know what caveat you would like us to add.


I do not necessarily want you to modify something, particularly if it's not needed. I just wanted to make sure that you understand my point about the gender effect.

For instance, as ststevens pointed out, there might be a gender effect in your data, but the SD for females is so weird that perhaps it may lack representativity, such that any comparison between males and females here can be taken with a pinch of salt. Thus, this causes no problem for any of the points you've made in the discussion section.

That's why I don't need you to modify/add anything about gender effect. I am giving you my approval anyway.
Admin
Meng Hu,

The point you seem to be making is that the mean is a poor summary statistic for a distribution as peculiar as that seen for the female accuracy correlations. I agree, but the number of datapoints is so small that it is hard to say which summary statistic would be better. The median? A trimmed mean? A robust measure of dispersion too (mean absolute deviation)?

Here are the numbers:
Means + SDs: -0.865
Medians + SDs: -0.863
5% trimmed mean + SDs: -0.914
Medians + MADs: -0.736

So, it does not matter too much which combination of statistics is chosen in this case.
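For illustration, this kind of robustness check can be sketched as follows (with made-up accuracy correlations, not the study's data; MAD is computed here as the median absolute deviation, one common convention; Python for illustration):

```python
import statistics

def trimmed_mean(x, prop=0.15):
    # Mean after dropping the top and bottom `prop` fraction of values.
    xs = sorted(x)
    k = int(len(xs) * prop)
    return statistics.mean(xs[k:len(xs) - k] if k else xs)

def mad(x):
    # Median absolute deviation from the median.
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

def d_value(a, b, center, spread):
    # Standardized difference using the chosen center and spread statistics
    # (pooled spread, weighted by degrees of freedom).
    pooled = ((spread(a) ** 2 * (len(a) - 1) + spread(b) ** 2 * (len(b) - 1))
              / (len(a) + len(b) - 2)) ** 0.5
    return (center(a) - center(b)) / pooled

# Hypothetical accuracy correlations for two groups.
females = [0.1, 0.2, 0.35, 0.4, 0.45, 0.5, 0.6]
males = [0.3, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7]

d_mean_sd = d_value(females, males, statistics.mean, statistics.stdev)
d_trim_sd = d_value(females, males, trimmed_mean, statistics.stdev)
d_median_mad = d_value(females, males, statistics.median, mad)
```

If all the combinations point the same way, as in the thread's numbers, the choice of summary statistic is unlikely to change the conclusion.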

Furthermore, the population of female correlational accuracy scores will probably be fairly normal (with a long tail). The odd pattern here is likely some combination of sampling error and non-random sampling.

Would it satisfy you if I added a small note that the results were similar using other measures of central tendency and dispersion?

In any case, this gender difference is not important to our paper. Just an exploratory analysis.
My point is that the mean is unreliable if the SD is too large, not just because of a non-normal distribution. If you have a mean correlation of zero, but a lot of moderate-to-high negative correlations and a lot of moderate-to-high positive correlations, what should you make of the mean? Even trimmed means and medians are helpless. In this situation, the mean is uninformative, and there is a probability that some confounding factors are involved in this curious pattern.

In any case, you can modify your paper as you wish, or not modify it at all. As I said, it's not something really important, especially since you don't have the sample to answer this delicate question.
1) "The structure of the survey is as following:"

... was as follows:

2) Likert should be capitalized.

3) "There are several measures of inter-rater consistency. Perhaps the simplest is to calculate the mean correlation between raters. Figure 1 shows the distribution of rater intercorrelations."

No explanation is given what is being correlated here.

4) "One method is to correlate the estimates with the real values (Jussim, 2012, p. 205)."

I would clarify, "... to correlate participants' estimates of group values with the real group values..."

5) "Using the Pearson correlation as the measure of accuracy, the raters' accuracy scores are .78 and .89, and their inter-correlation is .51 (individual-level)."

Isn't that .51 just the inter-rater correlation which tells us nothing about accuracy? What does it do there? Very confusing. Also I'd use values in that example that yield accuracy scores more similar to those in the real data (.78 and .89 are much higher than real accuracy scores).

6) "Figure 2 shows the distribution of Pearson correlations."

Clarify, e.g., "... of Pearson correlations, each data point being a correlation between a participant's estimates of group values and the real group values." (Assuming I get this right.)

7) Regarding Table 2, why do you report correlations between individual accuracy measures rather than the accuracy measures themselves? I would merge section 5.2 to the beginning of section 5 so that at least mean/median individual accuracy measures would be presented before correlations between them.

8) "(systematic error; (Jensen, 1980))"

(systematic error; Jensen, 1980)

9) "Because the population of Danish was an extreme outlier"

the native Danish population

10) "In reviewing the paper (http://openpsych.net/forum/showthread.php?tid=256&pid=3888#pid3888), Peter Frost (http://evoandproud.blogspot.dk/)"

I don't think there's a need to link to the forum comment. At least the link should be placed in a footnote. There's no reason to link to Frost's blog.

11) "Rarely is it considered that they instead reflect group differences"

... that they may instead reflect genuine group differences
Admin
Dalliard,

Thank you for another set of good suggestions. We have implemented most of them.

1)
Fixed.

2)
Switched to using "scale".

3)
Changed to "There are several measures of inter-rater consistency. Perhaps the simplest is to calculate all the correlations between raters' estimates."

4)
Added your version.

5)
Yes. It is there for comparison purposes.

I have changed the numbers so they are less accurate: mean correlation = .51, aggregate = .68.

6)
Used your version.

7)
I don't understand what you want. We have 48 subjects in the final sample, so reporting the accuracy scores by subject would result in a table with 48 rows.

The reason the summary statistics are presented at the end is that after inspecting the Pearson distribution plot, one further outlier is removed. If the summary statistics were presented first, they would not reflect this extra exclusion.

8)
Fixed.

9)
Fixed.

10)
Moved to footnote. Removed link to Peter's blog.

11)
Added "may".

---

Sean has informed me by email that he approves of the paper after having reviewed the questionnaire. He should make a post here soon about it.

---

New version with the above updates. Version 10.

https://osf.io/bwtg8/

Files updated:
  • paper.odt
  • paper.pdf
  • scripts/example.R
Thank you Emil for posting a translated version of the questionnaire. I have reviewed the items and I am recommending publication of the article.

My only real concern was simply being thorough. As my previous review stated, I feel the authors did a good job being measured in their conclusions, given that the study was a pilot test. I feel it is important for us as a field to publish most if not all of the studies we as researchers conduct. This helps to reduce publication bias and the inflation of effect sizes in meta-analyses.

I look forward to seeing what the larger replication attempt finds.
Okay, I approve the paper.