Counting Stars on GitHub

I’ve been working on a nerd ethnography project with the GitHub API. There’s so much fun data to play with there that it’s inevitable that I’ll get a little distracted…

One distraction was the realization that I could use the search API to get a massive list of the top repos ordered by star count. Once I started looking at the results, I realized that star data is an interesting alternative metric for evaluating language popularity. Instead of looking at which languages people are using to write new projects, we can see which languages are used for the most popular projects.

What are stars?

In August 2012, GitHub announced a new version of their notification system that allowed users to easily mark a repository as interesting by “starring” it:

GitHub star UI

Stars are essentially lightweight bookmarks that are publicly visible. Even though they were introduced just over a year ago, all “watches” were converted to stars so there’s plenty of data.

Which are the most starred repos?

Let’s start by looking at the top 20:

Rank Repository Language Stars
1 twbs/bootstrap JavaScript 62111
2 jquery/jquery JavaScript 27082
3 joyent/node JavaScript 26352
4 h5bp/html5-boilerplate CSS 23355
5 mbostock/d3 JavaScript 20715
6 rails/rails Ruby 20284
7 FortAwesome/Font-Awesome CSS 19506
8 bartaz/impress.js JavaScript 18637
9 angular/angular.js JavaScript 17994
10 jashkenas/backbone JavaScript 16502
11 Homebrew/homebrew Ruby 15065
12 zurb/foundation JavaScript 14944
13 blueimp/jQuery-File-Upload JavaScript 14312
14 harvesthq/chosen JavaScript 14232
15 mrdoob/three.js JavaScript 13686
16 vhf/free-programming-books Unknown 13658
17 adobe/brackets JavaScript 13557
18 robbyrussell/oh-my-zsh Shell 13337
19 jekyll/jekyll Ruby 13283
20 github/gitignore Unknown 13128

If you want to play with the data yourself, I’ve put a cache of the top 5000 repositories here. I’ve also posted the Clojure code I wrote to collect the data at adereth/counting-stars.
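
For readers who would rather not read Clojure, here is a minimal Python sketch of the kind of request involved. It is not the code from adereth/counting-stars. It assumes GitHub’s repository search endpoint with sort=stars, and the function names (build_search_url, parse_items, fetch_page) are made up for illustration. Repos with no detected language are mapped to "Unknown", which is my guess at how the table above was labeled. Note that the search API only returns the first 1000 results for any one query, so getting to 5000 would take several queries over narrower star ranges.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(min_stars=1, page=1, per_page=100):
    """Build a search URL for repos with at least `min_stars`, most starred first."""
    query = {
        "q": f"stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(query)}"


def parse_items(payload):
    """Turn a decoded search response into (full_name, language, stars) tuples."""
    return [
        # Assumption: a null language is reported as "Unknown".
        (item["full_name"], item.get("language") or "Unknown", item["stargazers_count"])
        for item in payload["items"]
    ]


def fetch_page(page=1, min_stars=1):
    """Fetch one page of results (network call, subject to API rate limits)."""
    request = Request(
        build_search_url(min_stars, page),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "counting-stars-sketch",  # hypothetical client name
        },
    )
    with urlopen(request) as response:
        return parse_items(json.load(response))
```

Calling fetch_page(1) should return the 100 most starred repos at the time of the request, so the numbers will not match the snapshot in this post.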

Which languages have the top spots?

In Adam Bard’s Top Github Languages for 2013 (so far), he counted repo creation and found that JavaScript and Ruby were pretty close. The top star counts tell a very different story, with JavaScript taking 7 of the top 10 spots. CSS was in 11th place in his analysis, but it holds 2 of the top 10 spots here.

Observing that 7 of the top 10 spots are JavaScript gives a sense of both the volume and the relative ranking of JavaScript in that range of the leaderboard, but just seeing that another language holds 50 of the top 5000 spots doesn’t tell us nearly as much.

One approach is to look at the number of repos in different ranges for each language:

Language 1-10 1-100 1-1000 1-5000 Top Repository (Rank)
JavaScript 7 54 385 1605 twbs/bootstrap (1)
CSS 2 8 41 174 h5bp/html5-boilerplate (4)
Ruby 1 9 153 786 rails/rails (6)
Python 0 5 64 420 django/django (44)
Unknown 0 5 30 138 vhf/free-programming-books (15)
C++ 0 4 22 108 textmate/textmate (35)
PHP 0 3 38 248 symfony/symfony (58)
Shell 0 3 19 89 robbyrussell/oh-my-zsh (18)
Objective-C 0 2 89 495 AFNetworking/AFNetworking (30)
C 0 2 31 185 torvalds/linux (25)
Go 0 2 13 61 dotcloud/docker (45)
Java 0 1 32 255 nathanmarz/storm (56)
VimL 0 1 23 66 mathiasbynens/dotfiles (57)
CoffeeScript 0 1 22 80 jashkenas/coffee-script (43)
Scala 0 0 13 46 playframework/playframework (178)
C# 0 0 8 65 SignalR/SignalR (205)
Clojure 0 0 2 37 technomancy/leiningen (361)
Perl 0 0 2 26 sitaramc/gitolite (138)
ActionScript 0 0 2 10 mozilla/shumway (606)
Emacs Lisp 0 0 1 20 technomancy/emacs-starter-kit (477)
Erlang 0 0 1 15 erlang/otp (568)
Haskell 0 0 1 12 jgm/pandoc (740)
TypeScript 0 0 1 4 bitcoin/bitcoin (161)
Assembly 0 0 1 3 jmechner/Prince-of-Persia-Apple-II (269)
Elixir 0 0 1 2 elixir-lang/elixir (666)
Objective-J 0 0 1 2 cappuccino/cappuccino (667)
Rust 0 0 1 1 mozilla/rust (225)
Vala 0 0 1 1 p-e-w/finalterm (282)
Julia 0 0 1 1 JuliaLang/julia (356)
Visual Basic 0 0 1 1 bmatzelle/gow (800)
TeX 0 0 0 6 ieure/sicp (2441)
R 0 0 0 5 johnmyleswhite/ML_for_Hackers (2125)
Lua 0 0 0 4 leafo/moonscript (3351)
PowerShell 0 0 0 3 chocolatey/chocolatey (1580)
Prolog 0 0 0 3 onyxfish/csvkit (3498)
XSLT 0 0 0 2 wakaleo/game-of-life (1093)
Matlab 0 0 0 2 zk00006/OpenTLD (1292)
OCaml 0 0 0 2 MLstate/opalang (1380)
Dart 0 0 0 2 dart-lang/spark (1463)
Groovy 0 0 0 2 Netflix/asgard (1489)
Lasso 0 0 0 1 symfony/symfony-docs (2047)
LiveScript 0 0 0 1 gkz/LiveScript (2226)
Scheme 0 0 0 1 eholk/harlan (2648)
Common Lisp 0 0 0 1 google/lisp-koans (2889)
XML 0 0 0 1 kswedberg/jquery-tmbundle (2972)
Mirah 0 0 0 1 mirah/mirah (2985)
Arc 0 0 0 1 arclanguage/anarki (3389)
DOT 0 0 0 1 cplusplus/draft (3583)
Racket 0 0 0 1 plt/racket (3761)
F# 0 0 0 1 fsharp/fsharp (4518)
D 0 0 0 1 D-Programming-Language/phobos (4719)
Ragel in Ruby Host 0 0 0 1 jgarber/redcloth (4829)
Puppet 0 0 0 1 ansible/ansible-examples (4979)
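
Counts like the ones in this table can be computed in a few lines. What follows is a rough Python sketch rather than the Clojure I actually ran. It assumes the repos are available as (name, language, stars) tuples, and count_by_range and TOP_10 are names invented for the example. The sample data is the top 10 from the first table.

```python
RANGES = (10, 100, 1000, 5000)

# Top 10 rows from the first table: (repository, language, stars).
TOP_10 = [
    ("twbs/bootstrap", "JavaScript", 62111),
    ("jquery/jquery", "JavaScript", 27082),
    ("joyent/node", "JavaScript", 26352),
    ("h5bp/html5-boilerplate", "CSS", 23355),
    ("mbostock/d3", "JavaScript", 20715),
    ("rails/rails", "Ruby", 20284),
    ("FortAwesome/Font-Awesome", "CSS", 19506),
    ("bartaz/impress.js", "JavaScript", 18637),
    ("angular/angular.js", "JavaScript", 17994),
    ("jashkenas/backbone", "JavaScript", 16502),
]


def count_by_range(repos, ranges=RANGES):
    """Count repos per language within each top-N range.

    Returns {language: {"counts": [n per range], "top": (name, rank)}}.
    """
    # Rank 1 is the most starred repo.
    ranked = sorted(repos, key=lambda repo: repo[2], reverse=True)
    table = {}
    for rank, (name, language, _stars) in enumerate(ranked, start=1):
        # The first repo seen for a language is its highest ranked one.
        row = table.setdefault(
            language, {"counts": [0] * len(ranges), "top": (name, rank)}
        )
        for i, limit in enumerate(ranges):
            if rank <= limit:
                row["counts"][i] += 1
    return table


if __name__ == "__main__":
    for language, row in count_by_range(TOP_10).items():
        print(language, *row["counts"], "%s (%d)" % row["top"])
```

The ranges are cumulative, so a repo ranked 7th is counted in every column.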

The table is interesting, but it still doesn’t give us a good sense of how the middle languages (C#, Scala, Clojure, Go) compare. It also reveals that there are different star distributions within the languages. For instance, CSS makes a showing in the top 10 but it has way fewer representatives (174) in the top 5000 than PHP (248), Objective-C (495), or Java (255).

Looking at the top repo for each language also exposes a weakness in the methodology: GitHub’s language identification isn’t perfect and there are a number of polyglot projects. The top Java repo is Storm, which uses enough Clojure (20.1% by GitHub’s measure) to make this identification questionable when you take into account how much more concise Clojure is than Java.

What about star counts?

Looking at the results after ranking obscures the actual distribution of stars. Using a squarified treemap with star count for the size and no hierarchy is a compact way of visualizing the ranking while exposing details about the absolute popularity of each repo. The squarified treemap algorithm roughly maintains the order going from one corner to the other.
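
To make the layout idea concrete, here is a small Python sketch of the squarified treemap algorithm as published by Bruls, Huizing and van Wijk. It is not the code behind the charts in this post, and the names (scale_areas, worst_ratio, squarify) are my own for illustration. It assumes positive values sorted from largest to smallest. Each row of boxes is packed along the shorter side of the remaining space and grown until adding another box would make the worst aspect ratio in the row worse.

```python
def scale_areas(values, width, height):
    """Scale positive values (e.g. star counts) so they sum to width * height."""
    factor = width * height / sum(values)
    return [value * factor for value in values]


def worst_ratio(row, side):
    """Largest aspect ratio among the boxes if `row` is packed along `side`."""
    total = sum(row)
    return max(
        max(side * side * area / (total * total), total * total / (side * side * area))
        for area in row
    )


def squarify(areas, x, y, width, height):
    """Lay out `areas` (descending, summing to width * height).

    Returns one (x, y, width, height) box per area, in input order.
    """
    boxes = []
    i = 0
    while i < len(areas):
        side = min(width, height)
        row = [areas[i]]
        i += 1
        # Keep adding boxes while the worst aspect ratio does not get worse.
        while i < len(areas) and worst_ratio(row + [areas[i]], side) <= worst_ratio(row, side):
            row.append(areas[i])
            i += 1
        thickness = sum(row) / side
        if width >= height:
            # Stack the row as a column on the left edge of the free space.
            cursor = y
            for area in row:
                boxes.append((x, cursor, thickness, area / thickness))
                cursor += area / thickness
            x += thickness
            width -= thickness
        else:
            # Lay the row out as a strip along the top edge of the free space.
            cursor = x
            for area in row:
                boxes.append((cursor, y, area / thickness, thickness))
                cursor += area / thickness
            y += thickness
            height -= thickness
    return boxes
```

Something like squarify(scale_areas(stars, 960, 500), 0, 0, 960, 500) would give one box per repo, with the biggest repos starting in the top-left corner and the smallest ending up toward the opposite one.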

Here are the top 1000 repos, using stars for the size and language for the color:

(Language and repository name shown on mouseover, click to visit repository. A bit of a fail on touch devices right now.)

Despite being a little chaotic, we can start to see some of the details of the distributions. It is still difficult to glean information about the middling languages, though. The comparisons become a little easier if we group the boxes by language. That’s pretty easy, since that’s really the intended usage of treemaps.

Here are the top 5000 grouped by language:

Honestly, I’m not really in love with this visualization, but it was a fun experiment. I have some ideas for more effective representations, but I need to work on my d3.js-fu. Hopefully it serves as an inspirational starting point for someone else…

Conclusion

Firstly, GitHub’s API is really cool and can give you some insights that aren’t exposed through their UI. Like I said at the start of this post, I have another project that caused me to look at this API in the first place and I’m really excited for the possibilities with this data.

GitHub’s current UI is really focused on using stars to expose what’s trending and doesn’t really make it easy to see the all-time greatest hits. Perhaps the expectation is that everyone already knows these repos, but I certainly didn’t and I’ve discovered or rediscovered a few gems. My previous post came about because of my discovery of Font Awesome through this investigation.

I’ll close out with a couple questions (with no question marks) for the audience:

  1. Through this lens, JavaScript is way more popular than other metrics seem to indicate. One hypothesis is that we all end up exposing things through the browser, so you end up doing something in JavaScript no matter what your language of choice is. I’m interested in other ideas and would also appreciate thoughts on how to validate them.

  2. It’s not obvious to me how to best aggregate ranking data. I’d love to see someone else take this data and expose something more interesting. Even if you’re not going to do anything with the data, any ideas are appreciated.
