[ODP] The OKCupid dataset: A very large public dataset of dating site users
Admin
Journal:
Open Differential Psychology.

Authors:
Emil O. W. Kirkegaard
Julius D. Bjerrekær

Title:
The OKCupid dataset: A very large public dataset of dating site users

Abstract:
A very large dataset (N=68,371, 2,620 variables) from the dating site OKCupid is presented and made publicly available for use by others.
As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items. To validate the dataset and the test, the relationship of cognitive ability to religious beliefs and political interest/participation is examined. Cognitive ability is found to be negatively related to all measures of religious belief (latent correlations -.26 to -.35), and found to be positively related to all measures of political interest and participation (latent correlations .19 to .32).
To further validate the dataset, we examined the relationship between Zodiac sign and every other variable. We found very scant evidence of any influence (the distribution of p-values from chi square tests was flat).
Limitations of the dataset are discussed.

Key words:
open dataset, big data, open science, OKCupid, dating site, cognitive ability, IQ, intelligence, g-factor, scale construction, religiosity, politics

Length:
4300 words, excluding references.

Files:
https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg

External reviewers:
None suggested right now.
Admin
Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.
Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.


The thread is actually here.
Admin
Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.


The thread is actually here.


Thanks. I found out why the link keeps breaking. Whenever I move posts to merge the threads, it creates a new thread with the result, which gets a new ID and hence breaks all links to it.
Figures 6-10 should have clearer captions.
Are the black dots the average Z-scores of people who answered e.g. "Extremely important"? Do the error bars represent uncertainty related to the value of the average?
Admin
Figures 6-10 should have clearer captions.
Are the black dots the average Z-scores of people who answered e.g. "Extremely important"? Do the error bars represent uncertainty related to the value of the average?


Hi hvc,

You are right. I have updated these captions to be a bit more clear, e.g.:

Figure 6: Mean cognitive ability by stated level of importance of religion in life. Error bars are 99.9% confidence intervals.

Now it should be more clear that the dots represent the mean. The error bars are the confidence intervals of the means.
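
As a rough illustration of what the dots and error bars represent, here is a minimal Python sketch. It assumes a normal approximation for the interval and uses made-up z-scores; the function name mean_with_ci and the data are hypothetical and are not taken from the paper's analysis code.

```python
from statistics import NormalDist, mean, stdev

def mean_with_ci(scores, level=0.999):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    m = mean(scores)
    se = stdev(scores) / len(scores) ** 0.5    # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return m, m - z * se, m + z * se

# Hypothetical cognitive ability z-scores grouped by answer option.
groups = {
    "Extremely important": [-0.9, -0.2, -0.5, 0.1, -0.6, -0.3],
    "Not at all important": [0.4, 0.1, 0.6, -0.1, 0.5, 0.3],
}
for answer, scores in groups.items():
    m, lower, upper = mean_with_ci(scores)
    print(f"{answer}: mean={m:.2f}, 99.9% CI=[{lower:.2f}, {upper:.2f}]")
```

Each dot corresponds to m and each error bar to the span from lower to upper for one answer group.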

Due to the legal attack, the paper is currently not available on OSF, but I have updated the version on ResearchGate.

https://www.researchgate.net/project/The-OKCupid-dataset-A-very-large-public-dataset-of-dating-site-users_573b1d34ed99e1051c2b2953
Admin
JUL 03, 2016 | 04:51AM EDT
Original message
Emil wrote:
Hi OSF,

Any news regarding the OKCupid affair? I ask because I have a talk accepted at a conference in July about this paper. I need to know if I have to upload the dataset elsewhere to avoid spurious copyright claims.

-Emil

JUL 06, 2016 | 01:32PM EDT
Sara Bowman replied:

Hi Emil,

Thanks for your email. We have responded to OKCupid’s legal requests and the issue has been resolved. The data will remain inaccessible via the OSF. I recommend that you be in touch with OKCupid directly about any plans for sharing and resolving any legal restrictions.

Best,
Sara


It seems that the repository will be permanently offline at OSF due to the DMCA claim. However, the project files can now be found at https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg.

This includes a slight update to the paper to reflect the new location of the project files.

The datafile no longer includes usernames or cities as these were the main privacy concern and are of little relevance to most researchers.
I have no major objections to this paper aside from a few typos and a couple of issues.
E.g.: "merely presents it is a more useful form": Change "is" to "in"
An issue that I can see coming up is related to this sentence: "Gathering the photos would have taken up a lot of hard drive space but could be done in a future scraping". You should mention possible legal problems arising from releasing users' photos.
Also you should update this sentence: "Presenting the data at city-level would however take too much space because there are 8,570 different cities in the dataset". As there are no city-level data in the updated dataset, you cannot present it in the paper (or else you'd violate the open sharing philosophy that you're embracing and revert to the old system). Instead, state the real reason why you cannot present the data at city-level, that is it had to be removed from the data file due to legal issues related to privacy laws.
I didn't see the new data file but I suppose it still has the country-level info, otherwise it'd not make sense to publish table 1.
Admin
Davide,

Thanks for these comments. We will update the paper shortly.
Admin
I have no major objections to this paper aside from a few typos and a couple of issues.
E.g.: "merely presents it is a more useful form": Change "is" to "in"


Fixed.

An issue that I can see coming up is related to this sentence: "Gathering the photos would have taken up a lot of hard drive space but could be done in a future scraping". You should mention possible legal problems arising from releasing users' photos.


Added: Be advised that scraping and releasing users' photos may be illegal due to copyright or privacy laws.

Also you should update this sentence: "Presenting the data at city-level would however take too much space because there are 8,570 different cities in the dataset". As there are no city-level data in the updated dataset, you cannot present it in the paper (or else you'd violate the open sharing philosophy that you're embracing and revert to the old system).
Instead, state the real reason why you cannot present the data at city-level, that is it had to be removed from the data file due to legal issues related to privacy laws.


Added: Due to privacy concerns (Hackett, 2016), the username and city variables were removed from the published version of the dataset.

The citation is: http://fortune.com/2016/05/18/okcupid-data-research/

Removed the mention of the number of cities.

I didn't see the new data file but I suppose it still has the country-level info, otherwise it'd not make sense to publish table 1.


It does.

---

Updated the paper files.
Perhaps I missed something, but this paper seems to me to be a continuation of your actual research, i.e., correlations between g and social variables. Thus I'm somewhat disconcerted by the title of the paper and section 1, Introduction, which is only about data sharing. Why is there nothing about your previous research on g and social variables and your S-factor? And even in the discussion section, there is still no word about your previous research. So, what is the purpose of the paper? To illustrate the importance of data sharing? Or to show that this data has some use for psychological research (which is debatable considering its limitations)?

To examine the effect of using only a smaller number of items to increase the sample with complete data, we also created tests with 2-13 items.


I'm skeptical. I wouldn't call something composed of only 2-4 items a test. I think you should write about this in the limitations section.

It can be seen that there is strong stability of estimates across different test compositions


Stability means that the correlation between a variable measured at time1 and the same variable measured at time2 is high.

The difference between the most and least religious is -.67 d


That would be much clearer if you write "between the most and least religious groups".

We see a linear negative relationship between the rated importance of religion/God in life and cognitive ability.


Even if the graph (and some of the following ones) suggests this conclusion, I won't use such wording "linear relationship" when the variable is not nominal. If you have a 3-category variable, 1 "no", 2 "neither", 3 "yes", a line that looks linear shouldn't be qualified as a linear relationship in my opinion.

As expected, we see that people willing to help out more had higher cognitive ability. It's not possible to calculate the latent correlation because the order of the options is not clear: is time or money the greater sacrifice?


What do you mean ?

it is possible that there are effects of time of birth in the year


"time of birth in the year" ?

I will leave another comment later, I think, because I don't understand something about section 5.3. Which is, the use of p-values...
By the way, can you explain what the null hypothesis is about ? NH of what ? Specify it in your paper, also. It helps to clarify things.
Admin
Hi Meng,

Thanks for reviewing. The quotes below are from you unless otherwise specified.

Perhaps I missed something, but this paper seems to me to be a continuation of your actual research, i.e., correlations between g and social variables. Thus I'm somewhat disconcerted by the title of the paper and section 1, Introduction, which is only about data sharing. Why is there nothing about your previous research on g and social variables and your S-factor? And even in the discussion section, there is still no word about your previous research. So, what is the purpose of the paper? To illustrate the importance of data sharing? Or to show that this data has some use for psychological research (which is debatable considering its limitations)?


The paper is about presenting a new dataset. This is why the introduction mentions this topic, the title is about this and we don't cite any of the research related to S factor, no S factor analysis was carried out, nor were any of the typical socioeconomic data analyzed (such as education or income or criminality).

The analyses presented in the paper are only presented to showcase what kind of analyses one can do with the dataset and show that one finds known results when doing so (successful calibration).

I don't understand how this was not clear to you. Let me know if you have any suggestions for how to make this more clear if you think it should be.

I'm skeptical. I wouldn't call something composed of only 2-4 items a test. I think you should write about this in the limitations section.


There are 14 useable questions that can be used as items in a test. The matrix shows the intercorrelations between using tests with different numbers of these items. The trade-off is that using more of the questions results in more missing data but also more precise measurement. IRT is able to estimate scores for persons with missing data, so the trade-off is less grave than it would have been if one had used a method that required full data (such as ordinary factor analysis). If you are interested in the items, you can find them in the supplementary materials (data/test_items.csv).

I added a paragraph in the limitations section about the items:

The cognitive ability data is limited to about 14 items with sufficient amount of data. This necessarily limits the reliability of the measurement. Furthermore, as far as we know, these items have not been validated against known test batteries or used in any other studies.

Let me know if this is satisfactory to you.

That would be much clearer if you write "between the most and least religious groups".


I have added groups.

Even if the graph (and some of the following ones) suggests this conclusion, I won't use such wording "linear relationship" when the variable is not nominal. If you have a 3-category variable, 1 "no", 2 "neither", 3 "yes", a line that looks linear shouldn't be qualified as a linear relationship in my opinion.


I think you meant to say continuous. I think it's alright to say it's linear if the scale is a Likert or similar which is plausibly interpreted as being close to interval. I think this is the case for the analyses we present. For instance, I think the 4 point scale in Figure 6 is pretty plausibly interpreted as being interval scale or close to:

  • Extremely important
  • Somewhat important
  • Not very important
  • Not at all important


Note that a violation of interval scale would be unlikely to result in a linear relationship as seen. It's easier to make a relationship non-linear than linear.

Furthermore, note that the analysis in Figure 8 does not display a linear relationship despite using the same answer options. Thus, it's possible to get both linear and non-linear looking results with these answer options.

What do you mean ?


To calculate a correlation, one must be able to rank the possible values. However, how should one rank the answers "I would donate time" and "I would donate money"? It's not clear which one is the greatest sacrifice.

I note that one should probably reorder the groups on the plot so that the None-answer is on the left. This was already done for the plots found in the supplementary materials, but the figure in the paper was not updated. This has been done now.

"time of birth in the year" ?


Effects of when time of birth falls within a year, e.g. January vs. February. The last clause is necessary because otherwise one might think it includes the difference between being born in 1962 vs. 1970, a cohort or age effect.

I will leave another comment later, I think, because I don't understand something about section 5.3. Which is, the use of p-values...
By the way, can you explain what the null hypothesis is about ? NH of what ? Specify it in your paper, also. It helps to clarify things.


What do you not understand about it?

The null hypothesis for a chi square test is always that the samples come from populations with the same distribution, so it seems redundant to specify it explicitly. However, because you requested it, I have done it. The text now reads:

It is possible to do a large-scale test of astrology using the OKCupid dataset by examining whether Zodiac sign is related to every question in the dataset. Zodiac sign is arguably a nominal variable and the questions are either ordinal (possibly interval-like) or nominal. Thus, to use all the questions, a test that can handle nominal x nominal variables was needed. We settled on using the standard chi square test because the goal was to look for any signal at all, not estimate effect sizes. This is a strong test because it is possible that there are effects of time of birth within a given year which are unrelated to Zodiac sign. For instance, being born in summer may be related to which kind of activities one takes part in at age 3 due to limitations of the weather, and the experiences from these activities may have a causal impact on one’s later personality.

To clarify, the null hypothesis tested by the chi square test here is that the answers have the same frequency for all the 12 Zodiac populations. Figure 11 shows a density-histogram of the p-values.


Let me know if this is satisfactory.

---

I noted that there was some odd whitespace on page 9. I have fixed this.

I have added page numbers.

---

A new version will be uploaded shortly.
Admin
I also added the names of the reviewers (Piffer, Hu).

The files have been updated.
To calculate a correlation, one must be able to rank the possible values. However, how should one rank the answers "I would donate time" and "I would donate money"? It's not clear which one is the greatest sacrifice.


Ok, I understand a little bit more now. Maybe try to describe the problem another way : "order of the options" wasn't clear enough. But something like, say, "rank ordering of the answers" is better. I think the problem with this variable is that it is just not a linear/continuous one.

Effects of when time of birth falls within a year, e.g. January vs. February. The last clause is necessary because otherwise one might think it includes the difference between being born in 1962 vs. 1970, a cohort or age effect.


Understood. What about "time of birth within a year" ?

What do you not understand about it?


I thought you would know pretty well my opinion on this, given how many times I said it in the past. P-value is a mixture of sample size and effect size, thus it adds nothing at all above what information is provided by both sample size and effect size. If your research is about "examining whether Zodiac sign is related to every question in the dataset", i.e., "yes" or "no" there is a relationship, then p-value is no more informative than an effect size. And the effect size doesn't have the problem of the p-value, which depends on the sample size. Effect size and p-values can provide different answers sometimes. But I think you should already know that.

I don't see why people continue to rely on the p-values (whatever the research and studied questions are). That's totally useless. And of course, I strongly disagree with your following statement : "This (i.e., the significance test) is a stronger test because it is possible that there are effects of time of birth in the year which would be unrelated to Zodiac sign".

There are some comments I didn't answer, but that's because I don't have much to say (e.g., no objection or equivocal).

The correlation matrix can found in the supplementary materials


can be found
Admin
Hi MH,

Ok, I understand a little bit more now. Maybe try to describe the problem another way : "order of the options" wasn't clear enough. But something like, say, "rank ordering of the answers" is better. I think the problem with this variable is that it is just not a linear/continuous one.


I have changed it to:

It's not possible to calculate the latent correlation because the rank ordering of the answers is not clear: is time or money the greater sacrifice?

Understood. What about "time of birth within a year" ?


I have changed it to:

This is a strong test because it is possible that there are effects of time of birth within a given year (e.g. spring vs. summer) which are unrelated to Zodiac sign. For instance, being born in summer may be related to which kind of activities one takes part in at age 3 due to limitations of the weather, and the experiences from these activities may have a causal impact on one’s later personality (for a possible example of something of this sort, see Gobet and Chassy (2008)).

I thought you would know pretty well my opinion on this, given how many times I said it in the past. P-value is a mixture of sample size and effect size, thus it adds nothing at all above what information is provided by both sample size and effect size. If your research is about "examining whether Zodiac sign is related to every question in the dataset", i.e., "yes" or "no" there is a relationship, then p-value is no more informative than an effect size. And the effect size doesn't have the problem of the p-value, which depends on the sample size. Effect size and p-values can provide different answers sometimes. But I think you should already know that.

I don't see why people continue to rely on the p-values (whatever the research and studied questions are). That's totally useless. And of course, I strongly disagree with your following statement : "This (i.e., the significance test) is a stronger test because it is possible that there are effects of time of birth in the year which would be unrelated to Zodiac sign".

There are some comments I didn't answer, but that's because I don't have much to say (e.g., no objection or equivocal).


While in general I dislike the use of NHST, I think this is a case where they are used well. We are not trying to measure the effect size of Zodiac sign, we are trying to test the null hypothesis that there are no effects at all. Such a test when carried out on many variables leads to a very clear prediction about what the distribution of p values should look like, i.e. uniform. This is also the observed distribution to a close approximation.
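
To illustrate the prediction, here is a minimal simulation sketch in Python. It is not the analysis code from the paper: the sample sizes, answer probabilities and function names are made up, and it assumes questions with 3 answer options so that the chi square test has 22 degrees of freedom, for which the p-value has a simple closed form.

```python
import math
import random

N_SIGNS = 12

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi square variable; valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form only holds for even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def chi_square_p(table):
    """Pearson chi square test of independence for a sign x answer count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2_sf_even_df(stat, df)

def simulate_null_p_values(n_questions=200, n_people=600, seed=1):
    """Simulate questions whose answers are independent of Zodiac sign."""
    rng = random.Random(seed)
    options = [0, 1, 2]
    answer_probs = [0.5, 0.3, 0.2]  # hypothetical answer frequencies
    p_values = []
    for _ in range(n_questions):
        table = [[0] * len(options) for _ in range(N_SIGNS)]
        for _ in range(n_people):
            sign = rng.randrange(N_SIGNS)
            answer = rng.choices(options, weights=answer_probs)[0]
            table[sign][answer] += 1
        p_values.append(chi_square_p(table))
    return p_values

def decile_counts(p_values):
    """Count p-values falling in each tenth of the 0-1 range."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return counts

p_values = simulate_null_p_values()
print(decile_counts(p_values))
```

With data generated under the null, the ten counts should come out roughly equal, which corresponds to a flat histogram of p-values.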

As mentioned in the text, since we are dealing with nominal x nominal variables, it is not easy to calculate an effect size. Most effect sizes require that one can rank order the options, which one by definition cannot do with nominal data.

How would you test the null hypothesis here? One complicated idea is to find some kind of effect size that works, calculate it for all the questions and note some summary statistics about the distribution of effects. Then simulate null hypothesis data many times with the same sample sizes and calculate summary statistics of these distributions. Then finally, compare the summary statistics of the real data with those from simulated null data. This would provide just about the same evidence as the current method used I think.
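
A minimal sketch of this idea in Python, under several assumptions that are not from the paper: Cramér's V is used as the effect size (one option that works for nominal x nominal tables), the summary statistic is the mean V across questions, null data are generated by shuffling the Zodiac signs across persons, and the data are randomly generated placeholders with no missing values.

```python
import math
import random

def cramers_v(signs, answers):
    """Cramér's V for two equal-length lists of category labels."""
    n = len(signs)
    sign_levels = sorted(set(signs))
    answer_levels = sorted(set(answers))
    counts = {}
    for s, a in zip(signs, answers):
        counts[(s, a)] = counts.get((s, a), 0) + 1
    sign_totals = {s: signs.count(s) for s in sign_levels}
    answer_totals = {a: answers.count(a) for a in answer_levels}
    chi2 = 0.0
    for s in sign_levels:
        for a in answer_levels:
            expected = sign_totals[s] * answer_totals[a] / n
            chi2 += (counts.get((s, a), 0) - expected) ** 2 / expected
    k = min(len(sign_levels), len(answer_levels)) - 1
    return math.sqrt(chi2 / (n * k))

def mean_v(signs, questions):
    """Summary statistic: mean effect size across all questions."""
    return sum(cramers_v(signs, q) for q in questions) / len(questions)

def null_distribution(signs, questions, n_sims=200, seed=0):
    """Mean V under the null, obtained by shuffling signs across persons."""
    rng = random.Random(seed)
    shuffled = list(signs)
    sims = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        sims.append(mean_v(shuffled, questions))
    return sims

# Placeholder data: 600 persons, 10 questions with 3 answer options each.
rng = random.Random(42)
signs = [rng.randrange(12) for _ in range(600)]
questions = [[rng.randrange(3) for _ in range(600)] for _ in range(10)]

observed = mean_v(signs, questions)
sims = null_distribution(signs, questions)
share = sum(s >= observed for s in sims) / len(sims)
print(f"observed mean V = {observed:.3f}, share of null sims at least as large = {share:.2f}")
```

If the observed mean V sits well inside the simulated null values, there is no sign of an effect, which is the same kind of evidence the p-value distribution gives.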

can be found


Fixed.

---

The files were updated.
Ok with your changes.
While in general I dislike the use of NHST, I think this is a case where they are used well. We are not trying to measure the effect size of Zodiac sign, we are trying to test the null hypothesis that there are no effects at all.

As mentioned in the text, since we are dealing with nominal x nominal variables, it is not easy to calculate an effect size. Most effect sizes require that one can rank order the options, which one by definition cannot do with nominal data.

How would you test the null hypothesis here? One complicated idea is to find some kind of effect size that works, calculate it for all the questions and note some summary statistics about the distribution of effects. Then simulate null hypothesis data many times with the same sample sizes and calculate summary statistics of these distributions. Then finally, compare the summary statistics of the real data with those from simulated null data. This would provide just about the same evidence as the current method used I think.


Concerning the above, I said that I know you don't care about effect size, but remember : p-value is a mixture of sample size and effect size. Also, like I've said, p-values can lead to different conclusions than those produced by effect size. For instance, for my most recent research, on MGCFA testing of Spearman's Hypothesis and internal bias, p-values show always significant changes, while indices such as RMSEA, Mc, CFI, don't. It's not possible that p-values can be reliable.

How would I test the null hypothesis ? It depends on which principle it relies upon. If it requires the use of p-value, and its corresponding "higher than 0.05 being not significant", NH is not even worth testing. p-value has never been reliable. And if you're interested in the distribution of p-values, why again it would be more useful than calculating the distribution of effect size ? After all, effect size is a component of p-value.

Also, if you think getting effect sizes such as correlation or d for categorical data is a little bit problematic, you should know there are other types of effect sizes, such as the odds ratio, which is appropriate for categorical data. Instead of measuring the strength of relationship with correlation (r), you measure the probability of answering #2 as opposed to #1. In any case, p-value is not a better alternative to effect size such as r or d. If r and d are both inappropriate for categorical data, the resulting p-value from r and d estimates should be always wrong as well.
Admin
I can't think of a better way to test whether Zodiac sign has any predictive validity for this dataset across all the questions, than to look at the p-curve. If you can think of one, then please try yours yourself and report back what you find. The data are public. I don't think there is anything statistically wrong with the present analysis.

Odds ratios do not work well for nom. x nom. data with >2 levels for both variables. For instance, for the questions with 4 answer options, this results in 4 x 12 probabilities being calculated. The data are also problematically hierarchical when analyzed this way because the questions have varying numbers of answers (2-4), which results in different numbers of probabilities: 2 x 12, 3 x 12, 4 x 12. One will have to aggregate within question before aggregating across questions. More complications for little gain...

Could you suggest changes regarding this section that you think are mandatory before you would approve the paper?
You do not answer my comment here. I said that I don't see why the p-value is better than the effect size in answering the question of whether there is an effect or not, since effect size is a component of the p-value, and the p-value is biased by sample size.

Concerning ORs, I don't know what you mean by x 12 probabilities. What I was thinking is that, looking at your article, given that the IQ variable is a continuous one, you can use OLS, with IQ as the dependent var and the various categorical independent vars entered as dummy vars. If a categorical var has 4 categories, you end up with 3 independent vars to enter in your regression equation (if answer #1 is the reference category, you'll get dummy vars #1 vs #2, #1 vs #3, #1 vs #4).
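
For a single categorical predictor, this kind of dummy regression reduces to comparing group means: the intercept is the mean of the reference category and each dummy coefficient is that category's mean minus the reference mean. A minimal Python sketch with made-up scores (the function name and the data are hypothetical):

```python
from statistics import mean

def dummy_regression(scores, answers, reference):
    """OLS of a score on dummies for one categorical variable, via group means."""
    groups = {}
    for score, answer in zip(scores, answers):
        groups.setdefault(answer, []).append(score)
    intercept = mean(groups[reference])  # mean of the reference category
    coefs = {
        f"{reference} vs {level}": mean(values) - intercept
        for level, values in sorted(groups.items())
        if level != reference
    }
    return intercept, coefs

# Hypothetical IQ z-scores and answers (options 1-4) for eight persons.
scores = [0.2, 0.4, -0.1, 0.1, -0.5, -0.3, 0.6, 0.8]
answers = [1, 1, 2, 2, 3, 3, 4, 4]
intercept, coefs = dummy_regression(scores, answers, reference=1)
print(intercept, coefs)
```

A variable with 4 categories gives 3 coefficients here, one per non-reference answer.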

There is nothing mandatory here. A year or two ago, I think I would have made answering (and dealing with!) my question of p-value vs effect size mandatory before giving approval. So at that stage I would have disapproved if the author didn't answer my questions. But today, I have time constraints (having many other things to do) and I don't want to make a fuss anymore about something that many other reviewers won't really care about (i.e., p-value and effect size). Furthermore, although I don't agree with you on the two issues mentioned in this post, I think there is no big, fatal flaw in the paper (even if there are obviously some errors and ways to improve the paper).

So, if you think you don't want to modify anything and think it's OK, then I give my approval.
Admin
I have answered it twice. However, let me try again. The reason to use p-values over effect sizes is that using p-values allows for a direct test of the global null hypothesis. Using effect sizes does not allow for a direct test.

Please state if you can think of any other issues, aside from the p-value one.