Journal:
Open Differential Psychology.
Authors:
Emil O. W. Kirkegaard
Julius D. Bjerrekær
Title:
Are stereotypes about immigrants accurate in Denmark?: a large, preregistered study
Abstract:
The study was preregistered, with almost all analyses specified before data collection began.
A nationally representative sample was asked to estimate the percentage of persons aged 30-39 receiving social benefits for 70 countries of origin (N = 766). After extensive quality control procedures, a sample of 484 persons was available for analysis. Stereotypes were scored for accuracy by comparing the estimated values to values obtained from an official source. Individual stereotypes were found to be moderately accurate (median/mean correlation with criterion values = .48/.43), while the aggregate stereotype was found to be very accurate (r = .70). Both individual- and aggregate-level stereotypes tended to underestimate the percentages of persons receiving social benefits and to underestimate real group differences.
In bivariate analysis, stereotype correlational accuracy was found to be predicted by a variety of predictors at above chance levels, including conservatism (r = .13), nationalism (r = .11), some immigration critical beliefs/preferences, agreement with a few political parties, educational attainment (r = .20), being male (d = .19) and cognitive ability (r = .22). Agreement with most political parties, experience with ghettos, age, and policy positions on immigrant questions had little or no predictive validity.
In multivariate predictive analysis using OLS and LASSO regression, correlational accuracy was found to be predicted with even moderate reliability only by cognitive ability and educational attainment. In general, stereotype accuracy was not easy to predict, even using 24 predictors (k-fold cross-validated R2 = 4%).
We examined whether stereotypes were more or less accurate for the more Muslim groups. Stereotypes were found to be less accurate in that they underestimated the percentages of persons from more Muslim groups receiving social benefits (mean estimation error for Muslim groups relative to overall elevation error = -8.09 %points).
Key words:
stereotypes, stereotype accuracy, Denmark, immigrants, social benefits, group differences, Muslims, replication, preregistered, open data
Length:
~7100 words, excluding references. PDF is 29 pages including references and appendix.
Files:
https://osf.io/wxqma/
Note that the questionnaire is currently not available. It will be made available shortly (pending email reply from the survey handler).
External reviewers:
Sean T. Stevens, who has published on stereotype research before and who reviewed the pilot study, has agreed to review this paper as well.
The paper is interesting and is an example of carrying out a study plan with attention to detail. My primary concern with the paper involves the nature of what a stereotype is and whether the study actually measured stereotype accuracy or instead measured aggregate guessing, possibly influenced by stereotypes. The relatively low correlations suggest the latter.
The usual usage of "stereotype" is in the context of social memes about groups. As we all know, these exist for most identifiable groups (races, nationalities, social groups, etc.) and have persisted for long periods; much of the comedy we enjoy is based on stereotypes. In this study, asking people about the amount of government support received by a group falls into a category that I believe does not involve stereotyping, because it is information that almost no one in any country knows accurately.
If people were asked to describe Muslims (because this group is central to the study), the descriptions are likely to involve stereotypes pertaining to appearance and behavior. I would expect this category of stereotypes to have a much higher accuracy than the items addressed by the paper.
As a matter of curiosity, I searched Google Scholar for "stereotype -threat" to exclude stereotype threat studies. The search returned a fairly large number of papers. Three that may be worth skimming (I don't have access to them):
http://jcr.oxfordjournals.org/content/21/2/354
http://jcr.oxfordjournals.org/content/21/2/354.abstract
http://psycnet.apa.org/psycinfo/2003-88371-012
---
Solutions
The issue described above may be resolved by a careful description of the objective of the study, noting that the questions asked were intentionally selected to require respondents to call upon their images of groups (immigrants) to guess an answer. Or, the stereotype might be identified separately from the response; for example, the immigrant is associated with a prosperous or non-prosperous group (the stereotype), and this indicates the direction of the guess about how much government assistance the group receives. I think the second approach is better.
The authors could simply stick to the plan and explain that there are very large numbers of stereotypes and the ones they are examining come from a rarely probed category. This approach seems less attractive to me, but it might work with some extra discussion of objective, expectations (magnitude of correlation or accuracy measure), and the notion of strong versus weak stereotyping.
I looked at dictionary definitions of "stereotype" and found a wider range than I expected. One that I think works is from Dictionary.com: "a simplified and standardized conception or image invested with special meaning and held in common by members of a group." I think this definition works well for the things most people consider to be stereotypes.
---
"Participants were asked to rate themselves (0-100) on four scales: conservatism,
nationalism, economic liberalism and personal liberalism."
Were the definitions given to the participants? Were there instructions about how to choose from 0-100? Is the use of 100 points more meaningful than a narrower scale (5, 9, etc.)? I ask because a person is likely to select a number somewhat randomly, without feeling that it is any more or less accurate than nearby numbers. I believe that people simply don't have very fine-grained resolution of their reactions to questions.
---
"Participants were asked to rate their agreement (0-100) with each of the main political
parties in Denmark (15 parties), including parties outside parliament. They had the option
of stating that they had no preference or no knowledge about a particular party. This was
because the list included relatively unknown parties outside parliament."
Were party platforms given to participants? If not, are they understood by the range of people who are being asked? This item is somewhat difficult to understand from my perspective, because we have only two serious parties. I understand that more parties exist in other nations, but have no idea how they are understood by their citizens.
Text comments
These are suggestions for wording that may enhance the readability of the paper.
We examined whether stereotypes were more or less accurate for the more* Muslim groups. Stereotypes
were found to be less accurate in that they underestimated the percentages of persons from more**
Muslim groups receiving social benefits (mean estimation error for Muslim groups relative to overall
elevation error = -8.09 %points).
* change more to -- predominantly
** "more" is unclear here. Deleting the word seems to work, or another wording may clarify the meaning.
---
It found that stereotypes were moderately accurate (median correlational accuracy score = .51), but the
results are hard to generalize to the general* population.
* Following "generalize" the word "general" should be changed. Suggestions: national, or overall.
---
"We also noted that some users'
estimates were reverse of reality, indicating that they did not understand the task or that they were
purposefully filling it out in reversely." *
* suggest for last two words: "in reverse" or "backwards"
---
"Users that * fail these questions could then easily
be filtered afterwards. Additionally, we changed the order of the cognitive items to be random so that
the presenting order** could not have a systematic effect on the responses."
"For our analyses, we excluded all participants that * failed one of the first 7 controls or who gave
reversed answers"
* The use of "that" is not a serious error, but it is likely to be disconcerting to most readers. There are many grammatical discussions of "who" vs. "that" on the web. This one is worthwhile:
http://blog.apastyle.org/apastyle/2012/06/who-versus-that.html
** change "presenting order" to "order of presentation" or "presentation order"
---
"If people rely upon GDP per capita to
estimate immigrant performance, then their estimates will be highly correlated with these * which they
are (r = -.79)."
* Something is needed to set the last three words off as a phrase. A comma at the point of the * may work.
Bob,
Thanks for the review. We will think about it and get back to you.
This is far more informative than the average stereotype threat study. I was impressed by the care taken to clean the data and filter out noncompliant respondents. My only suggestion is that the term stereotype should be briefly defined in the introduction.
Gerhard,
Thank you for taking the time to read it over. It is pretty long. We spent many hours in email discussions with the pollster trying to filter out the noncompliants. We will probably be returning to this issue, as it is somewhat important for the interpretation of these kinds of studies.
We will expand the introduction to better deal with the term "stereotype" which was also a suggestion by Bob Williams (above).
Bob,
The paper is interesting and is an example of carrying out a study plan with attention to detail. My primary concern with the paper involves the nature of what a stereotype is and whether the study actually measured stereotype accuracy or instead measured aggregate guessing, possibly influenced by stereotypes. The relatively low correlations suggest the latter.
There are two parts to this: semantic/conceptual and empirical.
Semantic. We use the word stereotype in the meaning used by many other researchers and recommended by Lee Jussim, the current foremost expert on stereotypes in my opinion (one of the papers you link to is in fact an old review by Lee Jussim!). In this context, a stereotype is a belief about a group. The beliefs may be, and mostly are, statistical in nature, not absolute (many/few, not all/no). Some have used the word in other senses, usually defining stereotypes as inaccurate or exaggerated beliefs about groups (you give an example of this in your post). However, this leaves us with the problem of what to call beliefs about groups that are accurate (such as the beliefs about clothing that you mention), and also results in various logical problems that are dealt with at length in Jussim's 2012 book. I recommend reading the book. It can be freely downloaded from Libgen (http://gen.lib.rus.ec/book/index.php?md5=B1E521E63710CFED147BF3885DFB25CA ).
The use of stereotype in this way is also in line with the popular stereotype threat hypothesis (mentioned by you), which is that making stereotypes about a group's lower performance salient depresses that group's performance via some threat response/emotional stress. The two most commonly researched cases are the lower scholastic/cognitive ability of African Americans and the lower math performance of girls. The stereotypes being made salient are of course the common beliefs that African Americans and females perform worse on these tasks.
Per request (you and Gerhard), I have added a few lines explaining the use of the word to readers who are unfamiliar with this use.
Empirical. The stereotype accuracy correlations reported in this study are 1) strong compared to most social science effects, especially social psychology (http://neuron4.psych.ubc.ca/~schaller/Psyc591Readings/RichardBondStokes-Zoota2003.pdf ), and 2) very similar to other studies (see various reviews by Jussim).
"Participants were asked to rate themselves (0-100) on four scales: conservatism, nationalism, economic liberalism and personal liberalism."
Were the definitions given to the participants? Were there instructions about how to choose from 0-100?
Yes and yes. You can verify this by reading the questionnaire in the project files.
Is the use of 100 points more meaningful than a narrower scale (5, 9, etc.)? I ask because a person is likely to select a number somewhat randomly and not feel that it is more or less accurate than a number over a range. I believe that people simply don't have very fine grained resolution of their reactions to questions.
The use of a scale with more options reduces the impact of the violation of statistical assumptions, i.e. that of continuity. I will be exploring the effect of using a 7-level (1-7) scale vs. a 101-level (0-100) scale in an upcoming study (with Noah Carl). It's an empirical question of how fine-grained people's self-ratings are if one gives them the chance. As far as I know, no one knows because I looked but could not find a study exploring this matter. However, one finding from the Good Judgment Project was that better forecasters used more fine-grained predictions. You can read about this in Tetlock's book Superforecasting: The Art and Science of Prediction. https://en.wikipedia.org/wiki/Superforecasting
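To make the continuity point concrete, here is a minimal simulation sketch (an illustration with made-up numbers and an assumed noise model, not an analysis from the paper or the planned study) of how discretizing a continuous self-rating onto coarser scales attenuates an observed correlation:

```python
# Minimal sketch (assumed noise model, made-up numbers): discretizing a
# continuous self-rating onto coarser scales attenuates correlations.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
latent = rng.normal(size=n)                      # true underlying attitude
rating = latent + rng.normal(scale=0.7, size=n)  # noisy internal self-rating

# Map the rating onto a 101-level (0-100) and a 7-level (1-7) scale.
scale_101 = np.clip(np.round((rating + 3) / 6 * 100), 0, 100)
scale_7 = np.clip(np.round(rating + 3) + 1, 1, 7)

# The finer scale preserves slightly more of the latent correlation.
print(np.corrcoef(latent, scale_101)[0, 1])
print(np.corrcoef(latent, scale_7)[0, 1])
```

Under these assumptions the attenuation from 7 levels is already small, which illustrates why the size of the effect is an empirical question rather than a foregone conclusion.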
"Participants were asked to rate their agreement (0-100) with each of the main political parties in Denmark (15 parties), including parties outside parliament. They had the option of stating that they had no preference or no knowledge about a particular party. This was because the list included relatively unknown parties outside parliament."
Were party platforms given to participants? If not, are they understood by the range of people who are being asked? This item is somewhat difficult to understand from my perspective, because we have only two serious parties. I understand that more parties exist in other nations, but have no idea how they are understood by their citizens.
No and yes. All the main parties are widely known to the public, as are their approximate positions on policy matters. I suggest you spend some time reading about multi-party systems on Wikipedia. The US is the outlier here; all other Western countries have more than two parties, even the UK, which uses a similar electoral system (first past the post).
-
We examined whether stereotypes were more or less accurate for the more* Muslim groups. Stereotypes were found to be less accurate in that they underestimated the percentages of persons from more** Muslim groups receiving social benefits (mean estimation error for Muslim groups relative to overall elevation error = -8.09 %points).
* change more to -- predominantly
** "more" is unclear here. Deleting the word seems to work, or another wording may clarify the meaning.
The use of gradual words is intentional. Being Muslim is not dichotomous for groups. For instance, Nigerians are about 50% Muslim, 50% Christian. Take a look at Pew Research's work on the topic, which is where the Islam data comes from. http://www.pewforum.org/2011/01/27/table-muslim-population-by-country/
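As an aside for readers trying to parse the "-8.09 %points" figure: here is a rough sketch of how such a relative elevation error could be computed (a reconstruction under assumptions, not the code from the paper; in particular, how groups count as "Muslim" given the continuous Pew proportions may be operationalized differently):

```python
# Rough reconstruction (assumptions, not the paper's code) of a
# "mean estimation error for Muslim groups relative to overall elevation error".
import numpy as np

# Hypothetical data: aggregate estimates vs. register values per origin group,
# plus each group's Muslim proportion (continuous, Pew-style).
estimates  = np.array([35., 20., 55., 60., 15.])  # aggregate stereotype, % on benefits
criterion  = np.array([40., 18., 70., 75., 12.])  # official criterion values
muslim_pct = np.array([0.05, 0.01, 0.95, 0.99, 0.50])

errors = estimates - criterion          # signed estimation error per group
elevation = errors.mean()               # overall elevation (bias) error

# One possible operationalization: groups above 50% Muslim are "Muslim groups".
muslim = muslim_pct > 0.50
print(errors[muslim].mean() - elevation)  # negative => underestimated beyond general bias
```

A negative value means that estimates for the more Muslim groups fall short of reality even after subtracting the general tendency to underestimate all groups.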
It found that stereotypes were moderately accurate (median correlational accuracy score = .51), but the results are hard to generalize to the general* population.
* Following "generalize" the word "general" should be changed. Suggestions: national, or overall.
Changed to overall.
"We also noted that some users' estimates were reverse of reality, indicating that they did not understand the task or that they were purposefully filling it out in reversely." *
* suggest for last two words: "in reverse" or "backwards"
Changed to: "We also noted that some users' estimates were opposite of reality, indicating that they did not understand the task or that they were purposefully answering dishonestly."
"Users that * fail these questions could then easily be filtered afterwards. Additionally, we changed the order of the cognitive items to be random so that the presenting order** could not have a systematic effect on the responses."
"For our analyses, we excluded all participants that * failed one of the first 7 controls or who gave reversed answers"
* The use of "that" is not a serious error, but it is likely to be disconcerting to most readers. There are many grammatical discussions of "who" vs. "that" on the web. This one is worthwhile:
http://blog.apastyle.org/apastyle/2012/06/who-versus-that.html
** change "presenting order" to "order of presentation" or "presentation order"
Changed to "who" and to "order of presentation".
"If people rely upon GDP per capita to estimate immigrant performance, then their estimates will be highly correlated with these * which they are (r = -.79)."
* Something is needed to set the last three words off as a phrase. A comma at the point of the * may work.
Added a comma.
---
Updated the files to version 13 (paper.pdf and paper.odt).
We examined whether stereotypes were more or less accurate for the more* Muslim groups. Stereotypes were found to be less accurate in that they underestimated the percentages of persons from more** Muslim groups receiving social benefits (mean estimation error for Muslim groups relative to overall elevation error = -8.09 %points).
* change more to -- predominantly
** "more" is unclear here. Deleting the word seems to work, or another wording may clarify the meaning.
The use of gradual words is intentional. Being Muslim is not dichotomous for groups. For instance, Nigerians are about 50% Muslim, 50% Christian. Take a look at Pew Research's work on the topic which is where the Islam data comes from. http://www.pewforum.org/2011/01/27/table-muslim-population-by-country/
The problem remains that the wording used above will not be clear to many readers. The intent could be explained, using a parenthesis or a footnote, or worded so that the intent is clear. A reader could read "more Muslim groups" as meaning:
►a larger number of distinct Muslim groups
►more intense or observant Muslim groups
►additional Muslim groups
Other than this small item, I think you have addressed the things I mentioned and think the paper is well written and interesting.
Bob,
The context seems to rule out your (1) and (3) interpretations, e.g. the context for the first use rules out (1) due to the definite article, and the second use rules out (3). (2) is possible, however.
I have changed the wording to a longer less ambiguous phrasing "groups with larger proportions of Muslims" and variants. Also rephrased the abstract slightly.
Files updated (version 14).
Emil - Thanks. I think the sentence is clear now. In fact, I think I was guessing incorrectly on all three of my interpretations.
I see nothing else to change. I approve for publication.
This is a study conducted lege artis and a very well prepared paper.
Some points can be improved in the manuscript and some aspects in further studies.
First, there is nearly no introduction and no theory. It's a very Ockham-British-Lynn-like, data-driven paper. More verbalization, and generally better formatting of ODP papers, would be helpful.
Next, there is a lot of statistical analysis, and it seems the authors have analyzed everything that is possible with the given data. However, data analysis should be driven by research questions and theory.
Introduction: You wrote "stereotypes were moderately accurate (median correlational accuracy score = .51)". Usually, taking Cohen's levels for interpretation, r = .10 is a small correlation, .30 a medium one and .50 a large one. For the validity of tests, r = .50 is taken as a large correlation (Fisseni).
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Fisseni, H.-J. (2004/1997/1990). Lehrbuch der psychologischen Diagnostik. Göttingen: Verlag für Psychologie.
However, for reliability measures higher correlations are expected (r = .80 medium, r = .90 high; Fisseni). And as I remember, averaging Jussim's (2012, his Tables 17-1 to 17-3) results gives an average correlation between stereotypes and criteria of about r = .81.
So you should discuss possible benchmarks/norms and justify the benchmark you chose for interpretation.
I see two aspects that could be improved in the survey, hopefully in a future/further study:
1. Disclosure. You start your questionnaire (translated to English) with:
“When people have expectations about how one or more groups are or behaves, it’s often called a “prejudice” or “stereotype”. Often it’s said that they are inaccurate, even without examining this. The purpose of this study is to examine how precise the Danes’ stereotypes of immigrants are. Which is why we ask of you to evaluate how well a plethora of immigrant groups in Denmark perform, without looking up the numbers beforehand.”
In this way the surveyed persons know what is going on, and that may bias their answers and your results. You should frame it more neutrally and distract attention from the stereotype subject.
2. Ask more questions. Not only about migrants and public assistance (welfare), but also about crime, about divorce and family stability, about job creation (anything positive on immigrants), or about student performance. And a totally different subject, such as differences between men and women.
In the paper itself the question has to be mentioned; this is central! Currently the paper cannot be understood without it. ("People who live in Denmark originate from many countries. We would like you to evaluate how many people among the 30-39 year olds that you think are on public assistance, from each country of origin.")
Very, very good: Your control questions and the procedure to clean the data. This is exemplary!
Always add page numbers in manuscripts. Always add page numbers in papers published by OpenPsych. Papers should receive the kind of final formatting used by professional publishers (Cambridge, Elsevier, Sage, Springer). The more professional, the better.
I do not understand the utility of the analyses in chapter 6 (“Inter-rater agreement”). Start your analyses and paper with theory and research questions and then add only the necessary analyses.
Very good and important: what predicts stereotype accuracy. Here you found that conservatism leads to higher accuracy of judgments. However, in areas of left-leaning questions (e.g. gender differences), maybe progressivism would lead to better stereotype accuracy.
As far as I remember, Lee Jussim has also done some analyses on predictors/determinants of stereotype accuracy. Compare your results with his and bring this into the introduction and discussion.
Table 5 in chapter 7.1: never use unexplained abbreviations; I do not understand what "DF" etc. mean.
Below Figure 6 (why are there no page numbers?): d = .19 has to be written as d = 0.19. Values (p, r, usually beta) that can only lie between -1 and 1 should be written as ".19"; values that can be larger (d) should be written as "0.19".
Finally, choose different outlets. Also include standard APS, APA and Elsevier journals.
Heiner,
Thank you for the review. We will get back to you with a reply.
Heiner,
First, there is nearly no introduction and no theory. It's a very Ockham-British-Lynn-like, data-driven paper. More verbalization, and generally better formatting of ODP papers, would be helpful.
Personally, I find it annoying to read papers with long introductions. I tend to just skim or skip them entirely. I write my own papers in the style that I would like to read: straightforward and to the point. Other reviewers, like John Fuerst, have applauded this presentation style previously, and it is thus impossible to satisfy both kinds of reviewers.
Next, there is a lot of statistical analysis, and it seems the authors have analyzed everything that is possible with the given data. However, data analysis should be driven by research questions and theory.
I prefer to put more analyses in one paper rather than publishing multiple papers. This saves work because one does not have to write two similar introductions, repeat descriptive analyses and go through peer review twice, which can take months. The result is that papers tend to be longer than normal and of the target-paper type, such as the very long paper recently published in Mankind Quarterly (>100 pages with a ~50 page reply; http://mankindquarterly.org/archive/volume.php?v=107 ). For the benefit of the reader, it keeps related analyses together, so one doesn't have to download two papers to find all the analyses done on the same dataset (until/unless someone reuses the data for some other study, that is).
Introduction: You wrote "stereotypes were moderately accurate (median correlational accuracy score = .51)". Usually, taking Cohen's levels for interpretation, r = .10 is a small correlation, .30 a medium one and .50 a large one. For the validity of tests, r = .50 is taken as a large correlation (Fisseni).
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Fisseni, H.-J. (2004/1997/1990). Lehrbuch der psychologischen Diagnostik. Göttingen: Verlag für Psychologie.
However, for reliability measures higher correlations are expected (r = .80 medium, r = .90 high; Fisseni). And as I remember, averaging Jussim's (2012, his Tables 17-1 to 17-3) results gives an average correlation between stereotypes and criteria of about r = .81.
So you should discuss possible benchmarks/norms and justify the benchmark you chose for interpretation.
We have changed the wording to fairly accurate. The correlations in the area of .80 are consensual/aggregate stereotypes, not individual/personal stereotypes. E.g. Jussim's Table 17-1 has .42, .36 and .69 as individual-level accuracy correlations (he does not say which kind of average was used, so I presume it's the arithmetic mean; we used medians due to the skewed distribution). We found mean/median individual accuracy of .43/.48, so the values are in the same ballpark. His aggregate accuracy correlations in the same table are .60, .93, .88, .93, .53, .77, .77, .68, .72. This is close to our aggregate accuracy finding of .70. The previous studies used small, unrepresentative samples (N's 60-90ish), so not too much weight should be put into them. We have expanded on our discussion to include a discussion of previous numeric results.
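To make the individual/aggregate distinction concrete, here is a small sketch (made-up data and an assumed error model, not the code from our paper): individual accuracy correlates each respondent's estimates for the 70 groups with the criterion and summarizes these correlations with the median, while aggregate (consensual) accuracy first averages the estimates across respondents and then correlates.

```python
# Sketch (made-up data, assumed error model) of individual-level vs.
# aggregate (consensual) stereotype accuracy.
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_resp = 70, 484
criterion = rng.uniform(5, 60, size=n_groups)       # true % on benefits per group

shared_bias = rng.normal(scale=12, size=n_groups)   # misperceptions everyone shares
noise = rng.normal(scale=20, size=(n_resp, n_groups))  # idiosyncratic error
estimates = criterion + shared_bias + noise         # one row per respondent

# Individual (personal) accuracy: one correlation per respondent,
# summarized by the median because the distribution is skewed.
indiv_r = [np.corrcoef(row, criterion)[0, 1] for row in estimates]
print(np.median(indiv_r))                           # moderate

# Aggregate (consensual) accuracy: average the estimates first, then correlate.
print(np.corrcoef(estimates.mean(axis=0), criterion)[0, 1])  # higher
```

Averaging cancels the idiosyncratic error but not the shared misperceptions, which is why the aggregate correlation (.70 in our data) exceeds the median individual correlation (.48) without reaching 1.0.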
1. Disclosure. You start your questionnaire (translated to English) with:
“When people have expectations about how one or more groups are or behaves, it’s often called a “prejudice” or “stereotype”. Often it’s said that they are inaccurate, even without examining this. The purpose of this study is to examine how precise the Danes’ stereotypes of immigrants are. Which is why we ask of you to evaluate how well a plethora of immigrant groups in Denmark perform, without looking up the numbers beforehand.”
In this way the surveyed persons know what is going on, and that may bias their answers and your results. You should frame it more neutrally and distract attention from the stereotype subject.
Explaining the purpose of the study was done at the suggestion of the pollster, who thought it might help get higher quality data. Note that we had a fair number of problems with getting people to understand the assignment and/or fill it out honestly. It's hard to say whether this introduction helped or not. We have more stereotype accuracy studies planned, so we can test whether explaining the purpose makes a difference.
We have added a paragraph in the discussion about this fact and how it may have biased findings.
2. Ask more questions. Not only about migrants and public assistance (welfare), but also about crime, about divorce and family stability, about job creation (anything positive on immigrants), or about student performance. And a totally different subject, such as differences between men and women.
Stereotype accuracy studies of gender and political labels are planned to be done 'in the next year or so', depending on time and monetary limitations. We have data for the immigrant groups for other sociological outcomes: income, criminality, educational attainment, so it is possible to carry out a follow-up immigrant stereotype accuracy study using another outcome.
There are two dimensions to stereotype accuracy studies: the number of groups and the number of attributes. The present study is extreme in that it has a very large number of groups but only one attribute (a 70x1 design). However, for gender, there are only two main groups (the proportion who do not consider themselves male/female is tiny, .3% in the OKCupid dataset), but we have many attributes. We plan on using about 50, making it a 2x50 design. Our attribute data are based on a collection published by a Danish newspaper, which bought gendered statistics for some 250 categories from a number of pollsters. These are all recent, nationally representative samples of the Danish population. The outcomes are very varied.
For political labels, we have not decided on how many attributes to use yet or how many labels, but perhaps the numbers will be about 20x4, so the design will be intermediate between the immigrant and gender studies. The attribute data will come from another planned study where we ask people questions about their political preferences and ask them which labels they self-identify with.
After one has done the above, one can look for general stereotype accuracy across participants. I'm not sure anyone has investigated that yet. In general, previous studies of stereotype accuracy have not been very systematic or large-scale, so there is plenty of room for improvement.
In the paper itself the question has to be mentioned; this is central! Currently the paper cannot be understood without it. ("People who live in Denmark originate from many countries. We would like you to evaluate how many people among the 30-39 year olds that you think are on public assistance, from each country of origin.")
We have included the central question in the text (Section 2.2).
Always add page numbers in manuscripts. Always add page numbers in papers published by OpenPsych. Papers should receive the kind of final formatting used by professional publishers (Cambridge, Elsevier, Sage, Springer). The more professional, the better.
PDF readers supply page numbers, so if one reads the paper electronically, they are redundant. However, some people print papers, in which case they are not redundant. We have added page numbers.
With regards to a more professional style, have a look at this recently published paper: http://openpsych.net/ODP/2016/07/putting-spearmans-hypothesis-to-work-job-iq-as-a-predictor-of-employee-racial-composition/
Julius is offering to style up papers like this for a small fee (depending on the length of the paper). Since he is a co-author, he will style up this paper as well. I think that this markedly improves the respectability of the finalized papers and may help convince more conservative colleagues. Personally, I focus on the content, not the presentation.
I do not understand the utility of the analyses in chapter 6 (“Inter-rater agreement”). Start your analyses and paper with theory and research questions and then add only the necessary analyses.
In the review of the pilot study one or more reviewers asked for this information and hence we provide it here as well. One cannot please everybody! :)
Very good and important: what predicts stereotype accuracy. Here you found that conservatism leads to higher accuracy of judgments. However, in areas of left-leaning questions (e.g. gender differences), maybe progressivism would lead to better stereotype accuracy.
In general, we didn't find many useful predictors of accuracy. The only moderately good ones were cognitive ability (r=.22) and education (r=.20). But in some contexts (e.g. with Muslims), some political preferences predicted higher accuracy (e.g. nationalism, r=-.12). You are right that in the main analyses, conservatism was slightly correlated with correlational accuracy (r=.13) but not with absolute accuracy (r=.02) and not in multivariate LASSO regression either, so the predictive validity is questionable, not general and very small to begin with.
The previous studies that Jussim discusses that looked into correlates of accuracy were seriously underpowered and probably useless to draw conclusions from. E.g., he summarizes the findings of Ashton and Esses (1999), which had a measly sample of N = 94 university students. He seems unaware that these kinds of interaction effects cannot be reasonably established by studies of that size.
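For readers unfamiliar with the cross-validated R2 figure quoted in the abstract (4% from 24 predictors), here is a minimal sketch of the general procedure (a scikit-learn illustration on made-up data; the variable names, numbers and data model are assumptions, not our actual pipeline):

```python
# Sketch (made-up data, not the paper's pipeline): k-fold cross-validated R^2
# for LASSO-predicted stereotype accuracy from a matrix of 24 predictors.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, k = 484, 24                                   # sample size and predictor count
X = rng.normal(size=(n, k))                      # hypothetical: ability, education, politics, ...
y = 0.2 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)  # weak true signal (assumption)

# Inner CV picks the LASSO penalty; outer CV estimates out-of-sample R^2.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
scores = cross_val_score(model, X, y, scoring="r2",
                         cv=KFold(n_splits=10, shuffle=True, random_state=1))
print(scores.mean())  # typically a small value, illustrating weak predictability
```

The out-of-sample R2 stays small even though two predictors carry real signal, which is the same pattern we report: accuracy is weakly predictable at best.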
Table 5 in chapter 7.1: never use unexplained abbreviations; I do not understand what "DF" etc. mean.
It is necessary to use abbreviations because the party names are too long and also in Danish. The appendix (after the references) has information about the parties. The abbreviations are explained in Section 2.2, but for ease of finding the information, we have added a brief explanation to the caption of each table that features the parties. DF, by the way, is Dansk Folkeparti (the Danish People's Party), the main nationalist, conservative, immigration-skeptical party. However, three smaller, more extreme parties are now gathering signatures to run in the next election, so things may change in the next few years.
Below Figure 6 (why are there no page numbers?): d = .19 has to be written as d = 0.19. Values (p, r, usually beta) that can only lie between -1 and 1 should be written as ".19"; values that can be larger (d) should be written as "0.19".
It is common practice in many areas to omit redundant digits, particularly in programming. The general reason to omit leading zeroes is the same as that to avoid padding zeroes. One could write .1 as 0.1 or 00.1 or 0.100 or 000.1000 etc. The shortest version that conveys the information is .1.
The rule of including the 0 when the number is not bounded between -1 and 1 is just something APA made up. Just as Wikipedia made up some other rules, like requiring leading zeroes for numbers bounded between -1 and 1... except for baseball batting averages, where they prefer omitting them, and for 12-hour clock times (2:30, not 02:30) (https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Dates_and_numbers).
Finally, choose different outlets. Also include standard APS, APA and Elsevier journals.
When these publishers stop leeching money from the public (journal subscriptions) and stop demanding ridiculous charges for publishing (article processing charges), we will consider it. Since they are quite content to abuse the scientific reputational system for their own monetary benefit, I refuse to work for them for free. "The mountain must come to Muhammad."
We are aware that this study would probably get quite a bit of attention if it was published in a mainstream journal. However, since we consider this to be unethical, we will have to pursue other means of getting attention. For instance, by sending the paper to Jussim/other colleagues, posting it on Twitter, Researchgate, Facebook, etc.
--
The project files have been updated with the changes mentioned above (version 15).
We are still waiting for a reply from Sean who said he would have time to review this paper. Last reminder sent to him Aug. 18.
I have sent a reminder email to Rindermann to let him know we are waiting for his second comment.
I think we will give up getting a review from Sean Stevens as he did not reply to my email. However, with Rindermann, there will be 3 approvals (Bob Williams, Gerhard Meisenberg, Heiner Rindermann).
Rindermann sent his review to me and I post it here:
Thanks for the revision.
You sometimes distinguish (in the abstract, the text, and the answers to the reviewers), both for Jussim's data and your own, between "mean/median individual accuracy" and "aggregate accuracy". This is important: explain it in detail and give an example of what it means.
”So far we have examined individual-level (in)accuracy and its correlates (also called personal stereotypes). However, one can also aggregate the estimates and then examine (in)accuracy and its correlates (consensual stereotypes) (Jussim, 2012).”
Yes, but this is just words without an explanation of the meaning. Give examples!
Please amend your abstract:
old:
A nationally representative sample was asked to estimate the percentage of persons aged 30-39 receiving social benefits for 70 countries of origin
new:
A nationally representative Danish sample was asked to estimate the percentage of persons aged 30-39 living in Denmark receiving social benefits for 70 countries of origin
Table 5: Still acronyms – write them out in full!
The paper has been revised.
1.
I looked over the paper. This distinction is explained briefly in Section 8. However, I have added a worked example as well.
2.
We have updated the abstract to use your wording.
3.
It is necessary to use acronyms because the real party names (translated or not) would take up too much space. We give the full list of parties in the appendix, so readers can consult that table.
Typo, p. 15: Tho individuals
The distinction between individual and aggregate accuracy is now clear. (It is similar to inter-rater reliability: while the agreement between two randomly chosen individual raters is low, about r=.20 to .30 for students' evaluations of instruction, the "reliability" or "objectivity" of the mean of two raters is about r=.25 to .35. It can be calculated simply using the Spearman-Brown formula: the more raters, the higher the reliability/objectivity of the mean of the raters.)
However, it is still not clear what they mean.
I suggest adding sentences similar to the following (but write them in your own words!):
Individual stereotypes represent the thinking of a single person. Individual accuracy stands for the accuracy of the thinking of the average single person, here about immigrants. However, this is not what is really interesting. Much more important is the accuracy of collective thinking, the accuracy of generally shared stereotypes. When the term "stereotypes" is used, it usually refers not to the stereotypes of single persons but to patterns of thinking that are widely spread in a society and that influence individuals in their thinking and behavior, e.g. about sex differences in mathematics vs. language or about differences in ability and crime between races. These, the collective ones, are the relevant stereotypes. They have an impact on society, culture and people. We use the term aggregate stereotypes for them, and aggregate accuracy stands for the accuracy of the typical thinking in society.
Heiner,
Thanks for reviewing.
The typo is fixed.
I rewrote part of the Discussion to be:
We observed relatively high levels of accuracy. The accuracy of the aggregate stereotype was much higher (r = .70) than the median individual accuracy (r = .48), as expected based on the Spearman-Brown formula. In thinking about stereotypes, the aggregate stereotypes are usually the important ones to focus on, because they represent the typical or average expectations of the population; the idiosyncratic beliefs, and any resultant actions, of single persons average out.
In general, the present results are similar to those found in the pilot study. The only findings that did not replicate were the strong predictive validities of age and gender observed in the pilot study.
The findings fit well with the general literature on stereotype accuracy (Jussim, 2012; Jussim et al., 2015). The average correlation in social psychology has been estimated to be around .20 (Richard, Bond, & Stokes-Zoota, 2003),[1] while we found that 78% of participants had accuracy correlations above .30 and 45% had scores above .50. Previous studies of racial/ethnic stereotypes reported average accuracies between .36 and .69 for individual-level stereotypes and between .53 and .93 for aggregate-level stereotypes (Jussim, 2012, p. 327).
Footnote:
[1] This value is very likely to be too large. The estimate is based on a large number of meta-analyses which mostly did not correct for the endemic publication bias in this field (Open Science Collaboration, 2015).
Let me know whether this is satisfactory.
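As an aside, to make the Spearman-Brown point concrete, here is a toy simulation in Python (all numbers are made up; none of this is our data). With the formula r_k = k*r1 / (1 + (k-1)*r1), independent errors wash out as raters are added, so the aggregate correlation climbs toward 1.0; it is the shared biases, such as the underestimation for the Muslim groups, that keep the real aggregate accuracy at .70.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_raters = 70, 484

# Toy criterion values and noisy individual estimates (independent errors).
criterion = rng.normal(size=n_groups)
noise_sd = 1.9  # tuned so the typical individual r is roughly .45
estimates = criterion + rng.normal(0.0, noise_sd, size=(n_raters, n_groups))

# Individual accuracy: each rater's correlation with the criterion.
indiv_r = np.array([np.corrcoef(e, criterion)[0, 1] for e in estimates])

# Aggregate accuracy: correlation of the mean estimate with the criterion.
agg_r = np.corrcoef(estimates.mean(axis=0), criterion)[0, 1]

print(f"median individual r = {np.median(indiv_r):.2f}")  # roughly .45
print(f"aggregate r = {agg_r:.2f}")  # near 1.0, since errors here are independent
```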
---
Files updated.
Dear Emil,
Good,
fine with me.
Heiner
I read briefly through the paper and found it to be OK. I think it is ready to be published in its present form.