[ODP] The OKCupid dataset: A very large public dataset of dating site users
2016-May-08, 13:22:30, (This post was last modified: 2016-Aug-03, 22:20:58 by Emil.)
#1
Journal:
Open Differential Psychology.

Authors:
Emil O. W. Kirkegaard
Julius D. Bjerrekær

Title:
The OKCupid dataset: A very large public dataset of dating site users

Abstract:
A very large dataset (N=68,371, 2,620 variables) from the dating site OKCupid is presented and made publicly available for use by others.
As an example of the analyses one can do with the dataset, a cognitive ability test is constructed from 14 suitable items. To validate the dataset and the test, the relationship of cognitive ability to religious beliefs and political interest/participation is examined. Cognitive ability is found to be negatively related to all measures of religious belief (latent correlations -.26 to -.35), and found to be positively related to all measures of political interest and participation (latent correlations .19 to .32).
To further validate the dataset, we examined the relationship between Zodiac sign and every other variable. We found very scant evidence of any influence (the distribution of p-values from chi-square tests was flat).
Limitations of the dataset are discussed.
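
The Zodiac check described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are simulated, the answer options "A" to "D", the sample sizes, and all function names are assumptions, and the p-value comes from a hand-written series for the chi-square distribution so that only the standard library is needed. If Zodiac sign is unrelated to the other variables, the p-values should spread roughly evenly over the ten bins.

```python
import math
import random
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half / (a + n)
        total += term
        n += 1
    log_lower = a * math.log(half) - half - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_lower))


def chi2_independence(rows, cols):
    """Pearson chi-square test of independence for two lists of category labels."""
    n = len(rows)
    row_n, col_n, cell_n = Counter(rows), Counter(cols), Counter(zip(rows, cols))
    stat = 0.0
    for r in row_n:
        for c in col_n:
            expected = row_n[r] * col_n[c] / n
            stat += (cell_n[(r, c)] - expected) ** 2 / expected
    df = (len(row_n) - 1) * (len(col_n) - 1)
    return stat, chi2_sf(stat, df)


# Simulated stand-in for the real data: Zodiac sign is independent of every answer.
random.seed(1)
SIGNS = ["Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra",
         "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"]
n_users, n_vars = 2000, 200
zodiac = [random.choice(SIGNS) for _ in range(n_users)]

p_values = []
for _ in range(n_vars):
    answers = [random.choice("ABCD") for _ in range(n_users)]
    p_values.append(chi2_independence(zodiac, answers)[1])

# Ten equal-width bins over [0, 1]; a flat histogram suggests no association.
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
share_below_05 = sum(p < 0.05 for p in p_values) / n_vars
print("p-value histogram:", bins)
print("share of p < .05:", share_below_05)
```

With real data, one would replace the simulated lists with the Zodiac column and each other categorical column, skipping rows with missing answers.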

Key words:
open dataset, big data, open science, OKCupid, dating site, cognitive ability, IQ, intelligence, g-factor, scale construction, religiosity, politics

Length:
4300 words, excluding references.

Files:
https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg

External reviewers:
None suggested right now.
2016-May-13, 10:30:15, (This post was last modified: 2016-May-18, 01:06:04 by Emil.)
#2
RE: [ODP] The OKCupid dataset: A very large public dataset of dating site users
Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.
2016-May-18, 10:21:03,
#3
(2016-May-13, 10:30:15)Emil Wrote: Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.

The thread is actually here.
2016-May-18, 11:20:15,
#4
(2016-May-18, 10:21:03)pdehaye Wrote:
(2016-May-13, 10:30:15)Emil Wrote: Note: non-scientific discussion of the dataset is moved to this thread. Peer review threads are for just that, actual scientific peer review. Yes, that does mean that your discussion posts about that topic go into that thread.

The thread is actually here.

Thanks. I found out why the link keeps breaking. Whenever I move posts to merge the threads, the forum creates a new thread with the result, which gets a new ID and hence breaks all links to it.
2016-May-23, 13:06:36,
#5
Figures 6-10 should have clearer captions.
Are the black dots the average Z-scores of people who answered e.g. "Extremely important"? Do the error bars represent uncertainty related to the value of the average?
2016-May-23, 13:42:59, (This post was last modified: 2016-May-23, 13:45:17 by Emil.)
#6
(2016-May-23, 13:06:36)hvc Wrote: Figures 6-10 should have clearer captions.
Are the black dots the average Z-scores of people who answered e.g. "Extremely important"? Do the error bars represent uncertainty related to the value of the average?

Hi hvc,

You are right. I have updated these captions to be clearer, e.g.:

Figure 6: Mean cognitive ability by stated level of importance of religion in life. Error bars are 99.9% confidence intervals.

Now it should be clearer that the dots represent the means. The error bars are the confidence intervals of the means.
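
For readers who want to see what such a plot summarizes, here is a minimal sketch of a group mean with a 99.9% confidence interval. It assumes a normal-approximation interval, which may differ from the method used in the paper; the scores and the second answer option are made up for illustration, and `mean_with_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(values, level=0.999):
    """Return (mean, lower, upper) for a normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


# Illustrative z-scored cognitive ability values, grouped by answer option.
# These numbers are invented, not taken from the dataset.
scores_by_answer = {
    "Extremely important": [-0.9, -0.4, -0.6, 0.1, -0.3, -0.7],
    "Not at all important": [0.5, 0.2, 0.8, -0.1, 0.4, 0.6],
}

for answer, scores in scores_by_answer.items():
    m, lower, upper = mean_with_ci(scores)
    print(f"{answer}: mean={m:.2f}, 99.9% CI=[{lower:.2f}, {upper:.2f}]")
```

With groups as small as in this toy example, a t-based interval would be more appropriate; the normal approximation is only reasonable for large groups.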

Due to the legal attack, the paper is currently not available on OSF, but I have updated the version on ResearchGate.

https://www.researchgate.net/project/The...051c2b2953
2016-Jul-20, 00:26:43, (This post was last modified: 2016-Jul-20, 00:29:50 by Emil.)
#7
Quote:JUL 03, 2016 | 04:51AM EDT
Original message
Emil wrote:
Hi OSF,

Any news regarding the OKCupid affair? I ask because I have a talk accepted at a conference in July about this paper. I need to know if I have to upload the dataset elsewhere to avoid spurious copyright claims.

-Emil

JUL 06, 2016 | 01:32PM EDT
Sara Bowman replied:

Hi Emil,

Thanks for your email. We have responded to OKCupid’s legal requests and the issue has been resolved. The data will remain inaccessible via the OSF. I recommend that you be in touch with OKCupid directly about any plans for sharing and resolving any legal restrictions.

Best,
Sara

It seems that the repository will be permanently offline at OSF due to the DMCA claim. However, the project files can now be found at https://mega.nz/#F!QIpXkL4Q!b3QXepE6tgyZ3zDhWbv1eg.

This includes a slight update to the paper to reflect the new location of the project files.

The datafile no longer includes usernames or cities as these were the main privacy concern and are of little relevance to most researchers.
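
As a rough illustration of this kind of cleanup (not the script actually used), identifying columns can be dropped from a CSV file as sketched below. The column names `username`, `city`, `country`, and `age`, the sample row, and the helper `drop_columns` are all assumptions for the example.

```python
import csv
import io


def drop_columns(src, dst, columns_to_drop):
    """Copy CSV rows from src to dst, omitting the named columns."""
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    # extrasaction="ignore" silently skips the dropped fields in each row
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Tiny in-memory example with a made-up row; real use would pass open file handles.
raw = io.StringIO("username,city,country,age\nalice,Aarhus,Denmark,30\n")
cleaned = io.StringIO()
drop_columns(raw, cleaned, {"username", "city"})
print(cleaned.getvalue())
```

Dropping columns this way removes direct identifiers only; it does not by itself guarantee that the remaining variables cannot be combined to re-identify someone.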
2016-Aug-02, 19:30:35, (This post was last modified: 2016-Sep-06, 10:15:47 by Duxide.)
#8
I have no major objections to this paper aside from a few typos and a couple of issues.
E.g.: "merely presents it is a more useful form": Change "is" to "in".
An issue that I can see coming up is related to this sentence: "Gathering the photos would have taken up a lot of hard drive space but could be done in a future scraping". You should mention possible legal problems arising from releasing users' photos.
Also, you should update this sentence: "Presenting the data at city-level would however take too much space because there are 8,570 different cities in the dataset". As there are no city-level data in the updated dataset, you cannot present them in the paper (or else you'd violate the open sharing philosophy that you're embracing and revert to the old system). Instead, state the real reason why you cannot present the data at city-level, namely that they had to be removed from the data file due to legal issues related to privacy laws.
I didn't see the new data file, but I suppose it still has the country-level info; otherwise it'd not make sense to publish Table 1.
2016-Aug-03, 17:18:24,
#9
Davide,

Thanks for these comments. We will update the paper shortly.
2016-Aug-04, 04:59:07,
#10
(2016-Aug-02, 19:30:35)Duxide Wrote: I have no major objections to this paper aside from a few typos and a couple of issues.
E.g.: "merely presents it is a more useful form": Change "is" to "in".

Fixed.

Quote:An issue that I can see coming up is related to this sentence: "Gathering the photos would have taken up a lot of hard drive space but could be done in a future scraping". You should mention possible legal problems arising from releasing users' photos.

Added: Be advised that scraping and releasing users' photos may be illegal due to copyright or privacy laws.

Quote:Also you should update this sentence: "Presenting the data at city-level would however take too much space because there are 8,570 different cities in the dataset". As there are no city-level data in the updated dataset, you cannot present it in the paper (or else you'd violate the open sharing philosophy that you're embracing and revert to the old system).
Instead, state the real reason why you cannot present the data at city-level, that is it had to be removed from the data file due to legal issues related to privacy laws.

Added: Due to privacy concerns (Hackett, 2016), the username and city variables were removed from the published version of the dataset.

The citation is: http://fortune.com/2016/05/18/okcupid-data-research/

Removed the mention of the number of cities.

Quote:I didn't see the new data file but I suppose it still has the country-level info, otherwise it'd not make sense to publish table 1.

It does.

---

Updated the paper files.