Data analysis of GitHub contributions reveals unexpected gender bias

Women's contributions to open source are more likely to be accepted than men's.

With more than 12 million users, GitHub is one of the largest online communities for collaborating on development projects. Now a team of researchers has done an exhaustive analysis of millions of GitHub pull requests for open source projects, trying to discover whether the contributions of women were accepted less often than the contributions of men. What they discovered was that women's contributions were actually accepted more often than men's—but only if the women had gender-neutral profiles. Women whose GitHub profiles revealed their genders had a much harder time.

The researchers are American computer scientists whose work was approved by an Institutional Review Board (IRB), a group that determines whether experiments on human subjects are ethical or not. They've published a pre-print of their GitHub analysis on PeerJ Preprints today and offered a deep look at how they did it.

Finding men and women on GitHub

First, they needed a dataset. Luckily, the GHTorrent dataset contains public data on GitHub users, pull requests, and projects up to April 1, 2015. The group writes that they "augmented this GHTorrent data by mining GitHub’s webpages for information about each pull request status, description, and comments." But they had one problem: GitHub profiles do not include gender information. So the researchers determined the genders of over 1.4 million users by linking their e-mail addresses with Google+ (G+) profiles that list a gender. They write:

Specifically, we extract users’ email addresses from GHTorrent, look up that email address on the Google+ social network, then, if that user has a profile, extract gender information from these users’ profiles. Out of 4,037,953 GitHub user profiles with email addresses, we were able to identify 1,426,121 (35.3%) of them as men or women through their public Google+ profiles. We are the first to use this technique, to our knowledge.
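The matching step described in that passage amounts to a lookup keyed on e-mail address. What follows is a minimal sketch on invented records, not the researchers' code: the `link_gender` and `identified_share` helpers, the record layout, and the sample addresses are all hypothetical. The last line simply rechecks the 35.3 percent figure from the two counts quoted above.

```python
def link_gender(github_users, social_profiles):
    """Attach a gender to each GitHub login whose e-mail has a public profile listing one.

    Hypothetical inputs: `github_users` is a list of (login, email) pairs and
    `social_profiles` maps a lowercase e-mail address to a gender string.
    """
    linked = {}
    for login, email in github_users:
        gender = social_profiles.get(email.lower())
        if gender in ("male", "female"):
            linked[login] = gender
    return linked


def identified_share(identified, total):
    """Percentage of profiles with an e-mail address that could be identified."""
    return 100.0 * identified / total


# Toy records, invented for illustration only.
users = [
    ("alice", "alice@example.com"),
    ("bob", "Bob@example.com"),
    ("fuzzlewump", "fz@example.com"),  # no matching profile, so no gender
]
profiles = {"alice@example.com": "female", "bob@example.com": "male"}
linked = link_gender(users, profiles)

# Counts quoted in the preprint excerpt: 1,426,121 of 4,037,953 profiles.
share = round(identified_share(1_426_121, 4_037_953), 1)
print(linked, share)
```

Users without a matching profile are simply left out of the result, which is why only about a third of the e-mail-bearing accounts ended up in the gendered dataset.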

There are privacy concerns with this method, admit the researchers, but they have not published a list of the genders of those GitHub users. Also, the researchers say they contacted Google to let them know that it was very easy to discover people's genders via G+, but Google said it wasn't a problem and was covered by their terms of service.

At last, the researchers had a huge dataset of GitHub users, identified by gender under relatively ethical circumstances. Still, there was one more piece of information they needed. Could other GitHub users, who weren't snooping around on people's G+ profiles, identify the gender of a person contributing a pull request? To determine that, they created a sample of random profiles and used a combination of tools (including a panel of human judges) to infer gender from profile names and pictures. A profile was considered gender-neutral if it had a non-gendered user name like "fuzzlewump" and an identicon instead of a photo. Others were grouped into male and female.
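The gender-neutral rule in that paragraph can be written down as a small function. This is only a sketch under stated assumptions: the `profile` dictionary and its keys (`name_gender`, `picture_gender`, `uses_identicon`) are hypothetical stand-ins for whatever the study's tools and human judges actually produced, and the "unclear" fallback is an addition for illustration rather than something the article describes.

```python
def visible_gender(profile):
    """Classify a profile the way another GitHub user might perceive it.

    Returns "neutral", "male", "female", or "unclear".
    `profile` is a hypothetical dict: `name_gender` and `picture_gender` hold a
    judged gender cue (or None), and `uses_identicon` marks a generated avatar.
    """
    name_cue = profile.get("name_gender")
    uses_identicon = profile.get("uses_identicon", False)

    # Neutral: no gender cue in the user name and an identicon instead of a photo.
    if name_cue is None and uses_identicon:
        return "neutral"

    # Otherwise fall back on whichever cue exists; an identicon carries none.
    picture_cue = None if uses_identicon else profile.get("picture_gender")
    return name_cue or picture_cue or "unclear"


print(visible_gender({"name_gender": None, "uses_identicon": True}))
```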

It was time to do the analysis.

Coding while female

The researchers admit that they expected to find that women's pull requests were accepted less often. Based on previous work done on women in computer science, which has revealed that women consistently earn lower salaries than men and have to prove their worth more often, the researchers hypothesized that open source project leaders would incorporate fewer contributions from women into their code. In a sense, they were right. Far fewer women participate in open source development than men, so in terms of raw numbers there are always going to be fewer contributions from women.

But when they looked at the "merge rate" of women's contributions, they were shocked to find that 78.6 percent of women's pull requests were actually accepted and merged into the code, while only 74.4 percent of men's pull requests were.

Results of a survey of how often pull requests were accepted, grouped by gender.

Not only that, but 25 percent of women had almost 100 percent of their pull requests accepted, while only about 13.5 percent of men came that close to a perfect acceptance rate.

Acceptance rates for individual GitHub contributors. Note that a quarter of women have almost a 100 percent acceptance rate, far higher than men.
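Both figures rest on the same arithmetic: merged pull requests divided by total pull requests, grouped either by gender or by individual contributor. Here is a rough sketch on invented data; the tuple layout and the helper names `merge_rate` and `per_author_rate` are assumptions, and the toy numbers are not the study's.

```python
from collections import defaultdict


def merge_rate(pull_requests):
    """Percentage of pull requests merged, per gender.

    `pull_requests` is an iterable of (author, gender, merged) tuples,
    a hypothetical layout chosen for this example.
    """
    merged = defaultdict(int)
    total = defaultdict(int)
    for _author, gender, was_merged in pull_requests:
        total[gender] += 1
        merged[gender] += int(was_merged)
    return {g: 100.0 * merged[g] / total[g] for g in total}


def per_author_rate(pull_requests):
    """Acceptance rate for each individual author, as a percentage."""
    merged = defaultdict(int)
    total = defaultdict(int)
    for author, _gender, was_merged in pull_requests:
        total[author] += 1
        merged[author] += int(was_merged)
    return {a: 100.0 * merged[a] / total[a] for a in total}


# Invented toy data; these are not the study's numbers.
prs = [
    ("ann", "female", True), ("ann", "female", True),
    ("bea", "female", True), ("bea", "female", False),
    ("carl", "male", True), ("carl", "male", False),
    ("dan", "male", False), ("dan", "male", True),
]
by_gender = merge_rate(prs)
by_author = per_author_rate(prs)

# Gap between the two merge rates reported in the article, in percentage points.
reported_gap = round(78.6 - 74.4, 1)
print(by_gender, by_author, reported_gap)
```

The per-gender rate weights prolific contributors more heavily, while the per-author rate treats every contributor equally, which is why the article reports both.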

What could be causing this extraordinary and unexpected statistic? The researchers went to work analyzing the kinds of contributions women were making to see whether that might be skewing the outcomes. First of all, they wondered whether they were seeing the outcome of "survivorship bias," where people who last longer tend to make more contributions. Were these women who had run the open source gauntlet outpacing men because they stuck it out? If that were true, you'd expect to see women making more contributions over time as the non-survivors dropped out. But that wasn't what they found—women contributed more often than men no matter how long they had been involved in open source.

They also checked other things, like whether women were contributing "more valuable" pull requests in response to known issues. But women actually responded less to known issues or bugs than men did. Women outpaced men regardless of what programming language they were writing in. Plus, the changes proposed by women typically included more lines of code than men's, so they weren't just submitting smaller contributions either. What could account for this discrepancy? Was it ...

Bias against men?

The researchers finally had to ask whether they had discovered some kind of bias against men. Perhaps men involved in open source had a "helper" complex, and they wanted to bring women into the fold so badly that they merged women's pull requests more often than they did men's. The only way to know for sure was to look at what happened when open source project leaders actually knew the gender of the contributors. If project leaders knew a contributor was male and accepted his pull requests less often than those of a known female contributor, the researchers would be seeing bias in action.

Except that's not what they found.

Here you can see that fewer contributions from identifiable women were accepted if they were outsiders to an open source project.

When a woman offered a pull request on an open source project where she was an outsider—in other words, where none of the project leads knew her—her contributions were far less likely to be accepted than ones from outsider men. Far from showing bias against men, this showed a bias against women. The higher rate of acceptance the researchers found elsewhere was likely because project leads didn't know the gender of the contributor, or they already knew and trusted her.

All things being equal, contributions from unknown women were accepted less often than contributions from unknown men.
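The comparison behind that finding is an acceptance rate computed separately for each combination of insider status, gender, and profile type. As a sketch only: the dictionary keys below (`insider`, `gender`, `profile`, `merged`) and the `acceptance_by_group` helper are hypothetical, and the sample records are invented rather than taken from the study.

```python
from collections import defaultdict


def acceptance_by_group(pull_requests):
    """Acceptance rate (percent) per (insider status, gender, profile type).

    Each pull request is a hypothetical dict with keys `insider` (bool),
    `gender`, `profile` ("gendered" or "neutral"), and `merged` (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [merged, total]
    for pr in pull_requests:
        status = "insider" if pr["insider"] else "outsider"
        key = (status, pr["gender"], pr["profile"])
        counts[key][0] += int(pr["merged"])
        counts[key][1] += 1
    return {key: 100.0 * m / t for key, (m, t) in counts.items()}


# Invented toy records; these are not the study's numbers.
sample = [
    {"insider": False, "gender": "female", "profile": "gendered", "merged": False},
    {"insider": False, "gender": "female", "profile": "gendered", "merged": True},
    {"insider": False, "gender": "female", "profile": "neutral", "merged": True},
    {"insider": True, "gender": "male", "profile": "gendered", "merged": True},
]
rates = acceptance_by_group(sample)
print(rates)
```

Splitting on all three attributes at once is what lets the gap between gendered and neutral profiles show up for outsiders without being averaged away by insiders.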

A hopeful sign

So why are women in open source more competent than men? Given that there is no "computer science gene" that occurs more often in women than in men, there has to be a social bias at work. Obviously, both sexes are equally good at computer science, but women are doing something differently. The researchers offer a few possibilities:

One explanation is survivorship bias: as women continue their formal and informal education in computer science, the less competent ones may change fields or otherwise drop out. Then, only more competent women remain by the time they begin to contribute to open source. In contrast, less competent men may continue ... Another explanation is self-selection bias: the average woman in open source may be better prepared than the average man, which is supported by the finding that women in open source are more likely to hold Master’s and PhD degrees. Yet another explanation is that women are held to higher performance standards than men.

No matter what the explanation—and it's likely some combination—we have further evidence that there is measurable bias against women in computer science. Women who work in CS have to be better prepared and perform more competently than men in order to survive, and therefore it should come as no surprise that the few women who contribute to open source projects are more skilled than their male counterparts.

Perhaps what's most interesting about this study, however, is the way sexist bias seems to disappear when men know the women who are contributing to an open source project. This is a very hopeful sign, because it means women's participation in projects is helping them overcome existing bias.

PeerJ Preprints, 2016. DOI: 10.7287/peerj.preprints.1733v1
