Debsources as a platform

By Nathan Willis
September 2, 2015
DebConf

Debsources is a project that provides a web-based interface into the source code of every package in the Debian software archive—not a small task by any means. But, as Stefano Zacchiroli and Matthieu Caneill explained in their DebConf 2015 session, Debsources is far more than a source-code browsing tool. It provides a searchable viewport into 20 years of free-software history, which makes it viable as a platform for many varieties of research and experimentation.

Big data

Debsources was first developed at the Initiative de Recherche et Innovation sur le Logiciel Libre (IRILL), Zacchiroli began. Initially, the project implemented a web application for browsing the full repository of Debian source packages. The packages indexed cover the stable, unstable, and experimental archives for every Debian release from 1998's "hamm" up through today's current experimental archive, plus all of the backports. For each package, the Debsources database includes every update that has been pushed to the official archive. Thus, while it does not capture every commit made to a package, it does include every upload made by the Debian project.

[Stefano Zacchiroli at DebConf]

The Debsources browsing tool lets users navigate to specific files, in graphical or text-mode web browsers, and provides syntax highlighting for more than 100 languages. For users needing to explore a particular package, he said, this is far faster than using apt-get source to download and unpack the source code locally (where it must then be explored through an editor or other application).

But the developers did not stop at implementing a browsable archive. They implemented full-text search across the entire database, with support for searching on package names, file hashes (using SHA-256), and functional symbols in the source (e.g., functions, classes, and variables). The symbol-searching functionality is implemented using the ctags utility, and it supports searching by both ctags indices and by regular expressions. Every time a new file is uploaded to the Debian archive, ctags is run automatically to add the changes to Debsources.
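To make the symbol search concrete, here is a minimal Python sketch of indexing ctags output and querying it by regular expression. It assumes the classic tab-separated tags-file layout (symbol name, file, ex address ending in `;"`, then optional extension fields). The names `Tag`, `parse_tags`, `search_symbols`, and `SAMPLE_TAGS` are illustrative only and are not taken from the Debsources codebase, which may organize this quite differently.

```python
import re
from typing import Iterable, List, NamedTuple


class Tag(NamedTuple):
    name: str  # symbol name (function, class, variable, ...)
    path: str  # file in which the symbol is defined
    kind: str  # ctags kind such as "f" or "c"; "" when absent


def parse_tags(lines: Iterable[str]) -> List[Tag]:
    """Parse tags-file lines, skipping the '!_TAG_' metadata header."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!_TAG_"):
            continue
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue  # not a well-formed tag line
        name, path, rest = parts
        # Extension fields, if any, follow the ';"' that ends the ex address.
        _address, found, extension = rest.partition(';"\t')
        kind = ""
        if found:
            for field in extension.split("\t"):
                if len(field) == 1:
                    kind = field  # short form: bare one-letter kind
                elif field.startswith("kind:"):
                    kind = field[len("kind:"):]
        tags.append(Tag(name, path, kind))
    return tags


def search_symbols(tags: Iterable[Tag], pattern: str) -> List[Tag]:
    """Return the tags whose symbol name matches the regular expression."""
    regex = re.compile(pattern)
    return [tag for tag in tags if regex.search(tag.name)]


# Hypothetical sample data standing in for real ctags output.
SAMPLE_TAGS = [
    "!_TAG_FILE_FORMAT\t2\t/extended format/",
    "main\tsrc/app.c\t/^int main(void)$/;\"\tf",
    "parse_args\tsrc/cli.c\t/^static int parse_args(int argc)$/;\"\tf",
    "Parser\tlib/parser.py\t/^class Parser:$/;\"\tc",
]

if __name__ == "__main__":
    for tag in search_symbols(parse_tags(SAMPLE_TAGS), r"^[Pp]ars"):
        print(tag.name, tag.path, tag.kind)
```

A production index would keep these records in a database keyed by package and version rather than in a list, but the lookup idea is the same.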

Here again, Zacchiroli explained the distinction between what Debsources does and the functionality already offered by an existing tool—in this case, codesearch.debian.net. The codesearch database, he said, is geared toward bug fixing for upcoming Debian releases; it only indexes the current "unstable" archive, and it is not updated on every push. The Debsources web application is also written to facilitate collaboration: on the site, users can generate and share links to specific lines in a file and can highlight or annotate lines. That allows users to reference and comment on potential bugs at a granular level.

Debsources is also integrated with the codesearch site and with Debian's tracker.debian.org package tracker. Debsources and codesearch share the same regular-expression search engine, with the results being automatically redirected to the site from which the search was performed. On the package tracker, each package page includes links to Debsources marked as "browse source code." Integration with additional parts of the Debian infrastructure is still to come. In addition, all of the features that are exposed in the web interface are also available in a JSON-based API, so even more developers can make use of the Debsources service.

The massive collection of package data and source code is interesting from a statistical perspective as well as a practical one, Zacchiroli said. A wide array of metrics is available at sources.debian.net/stats, including disk space consumed, lines of code, number of files, and number of ctags symbols for every release. Altogether, "sid" currently takes up 228GB across 11.7 million files and over one billion lines of code. Of those lines, about 439 million are in C.

Zacchiroli also discussed some ancillary features of Debsources that make it potentially interesting for other uses. Because it tracks SHA-256 hashes of each file, the database can easily identify duplicate files anywhere in the archive. On each file's page, the user interface includes a link that will bring up every incidence of a duplicate file in the archive. This makes it easy to see, for instance, that there are 4,309 copies of the GPLv3 COPYING file.
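The duplicate detection described here can be sketched in a few lines of Python: hash every file with SHA-256 and group the paths by digest. This is only an illustration of the idea over a local directory tree; Debsources keeps its hashes in a database, and the function names `sha256_of` and `find_duplicates` are assumptions made for this example.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> Dict[str, List[str]]:
    """Map each digest seen more than once under root to its file paths."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Only hash regular files; symlinks would double-count content.
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[sha256_of(path)].append(path)
    return {
        digest: sorted(paths)
        for digest, paths in by_hash.items()
        if len(paths) > 1
    }
```

Calling `find_duplicates` on a directory of unpacked source packages would return one entry per duplicated file content, such as a license text that many packages ship unchanged.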

Ongoing developments

Caneill then took the microphone and discussed recent work, both by existing Debian contributors and by Outreachy or Google Summer of Code (GSoC) interns. The new features include a detailed directory-listings format that includes file sizes and permissions (much like the output of ls -l), plus an in-browser file editor. The editor is implemented as a browser plugin (for Chromium and Firefox/Iceweasel); it lets the user edit any file in Debsources and output the changes as a diff that is ready to send to the package maintainer.

[Matthieu Caneill at DebConf]

Behind the scenes, he said, the interns have done a lot of refactoring of the Debsources application that should make it easier for contributors to add still more functionality. The codebase is a lot more modular now, which has other benefits, too. For example, the file-updating code has been rewritten to be asynchronous at each stage (adding or updating a package, computing the statistics, etc.), which helps performance. The charting module has been rewritten to produce nicer-looking graphs, and Python 3 support was added.

Another new feature—still in development—is the "copyright information" application, which is used to scan and track copyright information in the Debian archive. Some (though not all) packages include machine-readable copyright statements, which the application tracks and computes statistics from. In addition, the application generates a Software Package Data Exchange (SPDX) file for each copyright statement that it finds, and will display it in the Debsources web interface. That application was developed by GSoC student Orestis Ioannou, who is also working on a patch-tracking application that will integrate with Debsources.

Moving forward, Caneill said, the roadmap includes a number of other features: automatically running static-analysis tools, providing more live statistics (such as on license and patch information), and linking every binary package to the corresponding source package in Debsources (which is not currently easy, because a binary package might originate from a variety of source packages with "Provides:" or "Replaces:" rules, for instance). There are also some technical hurdles that still need to be overcome, he said, like being able to unpack and index tarballs within tarballs.

The team also wants to implement file-level deduplication to conserve disk space. That includes not just deduplicating the 4,039 copies of COPYING in the current "unstable" archive, but also deduplicating files over time. There are quite a few files that do not change in any given upload, so storing duplicates of them is an unnecessary use of disk space. The current database uses 1.1TB, which is not enormous on its own, but one year ago it only required 800GB.
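One common way to get file-level deduplication is a content-addressed store: each distinct file content is kept once under its SHA-256 digest, and every package version carries only a manifest mapping paths to digests. The Python sketch below illustrates that general technique in memory. It is not the design the Debsources team has planned, and the class `DedupStore` and its methods are hypothetical names chosen for this example.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """In-memory content-addressed store: one blob per distinct content."""

    def __init__(self) -> None:
        # digest -> file content, stored exactly once
        self.blobs: Dict[str, bytes] = {}
        # (package, version) -> {path: digest}
        self.manifests: Dict[Tuple[str, str], Dict[str, str]] = {}

    def add_version(self, package: str, version: str,
                    files: Dict[str, bytes]) -> None:
        """Record one upload; unchanged files reuse existing blobs."""
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)
            manifest[path] = digest
        self.manifests[(package, version)] = manifest

    def read(self, package: str, version: str, path: str) -> bytes:
        """Return the content of one file in one package version."""
        return self.blobs[self.manifests[(package, version)][path]]

    def stored_bytes(self) -> int:
        """Total size of the unique blobs actually kept."""
        return sum(len(blob) for blob in self.blobs.values())
```

With this layout, a file that does not change between two uploads costs one extra manifest entry rather than a second copy, which is the saving the team is after both across packages and over time. An on-disk version would write blobs to files named by digest instead of holding them in a dictionary.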

Future research

Zacchiroli and Caneill closed out the session by discussing how Debsources is viable as a research platform. It includes twenty years of history for tens of thousands of packages. That makes it possible to statistically analyze, for example, how programming language popularity has evolved over the years or how file sizes have changed on a per-language basis. In response to an audience question, the pair added that statistics about build systems, packaging choices, and other factors could be generated as well. The two have written two papers analyzing the source in the archive, both of which have been presented at academic conferences. They have also been contacted by an outside researcher, although his research has not yet been published.

The audience asked quite a few questions in the time remaining. One attendee wondered if the team had encountered any hash collisions among all of the SHA-256 hashes computed; the pair replied that they had not found any, but that finding one would be fun. Another asked if there was interest in including any Debian derivatives in Debsources. The pair replied that they have a tracking bug open and hope to implement it, but that file-level deduplication needs to be implemented first, "or else it would explode." The Debian project, it seems, is already finding Debsources to be a valuable addition to the project infrastructure, both for tracking statistics and integrating with other tools; given the breadth and depth of the data set it includes, many other projects may find it valuable as well.

[The author would like to thank the Debian project for travel assistance to attend DebConf 2015.]

Index entries for this article
Conference: DebConf/2015


Replacement for Google Code Search?

Posted Sep 3, 2015 8:09 UTC (Thu) by debacle (subscriber, #7114) [Link]

Unfortunately, Google Code Search does not exist anymore. It was not free software, but I liked it anyway. It was great to look for practical examples on how to do things - or not to do them.

Debsources seems, however, to lack some of its features, such as search by programming language, or did I miss it?

One might argue, that Debsources is limited by only covering software packaged for Debian, but remember the wise Confucius quote: "Free software not packaged for Debian does not exist, is not relevant, or will eventually be packaged."

Replacement for Google Code Search?

Posted Sep 3, 2015 17:52 UTC (Thu) by drag (guest, #31333) [Link]

> "Free software not packaged for Debian does not exist, is not relevant, or will eventually be packaged."

Unfortunately that is a VERY untrue statement.

Replacement for Google Code Search?

Posted Sep 9, 2015 15:21 UTC (Wed) by hummassa (guest, #307) [Link]

Do you have a counter-example?

Replacement for Google Code Search?

Posted Sep 9, 2015 23:30 UTC (Wed) by MrWim (subscriber, #47432) [Link]

> Do you have a counter-example?

For example none of the open-source projects I am the original author of are packaged in debian (stb-tester, git-meld, pulsevideo), although we do provide an Ubuntu PPA for stb-tester. You might argue that none of these programs or tools are significant enough to be worthwhile packaging, but certainly some people use them beyond just me.

github claims 26 million projects, Debian claims 43000 packages.

I mostly work in Python but prefer to get my Python packages from Debian, but I often have to use pip instead. For example for python-jira and ansi2html. You might argue that pip's fine for Python libraries, but it does make it more difficult to package and distribute applications that depend on those libraries.

PyPI claims 66000 packages, Debian claims 43000

Another class of applications that are not available in Debian are older versions of current applications. A few years ago I remember being able to customise the visible columns in file-roller. Recently I wanted to check that I'd got the right permissions and mtime, etc. in a tarball I'd just created so I once again reached for file-roller but I can no longer find the option to show the permissions of the contained files. Ideally I would have liked to have quickly installed an older version so I could get my job done, but Debian doesn't have it available any-more.

Ultimately I don't think it's feasible for Debian to scale to these sorts of numbers with its current approach. This is one of the reasons I'm so excited about xdg-app and sandboxing in general.

Replacement for Google Code Search?

Posted Sep 26, 2015 13:39 UTC (Sat) by hummassa (guest, #307) [Link]

Thank you; your answer is quite informative and gives a lot of food for thought.

Replacement for Google Code Search?

Posted Sep 27, 2015 8:45 UTC (Sun) by jond (subscriber, #37669) [Link]

These will all eventually be packaged. (Resistance is futile)

Replacement for Google Code Search?

Posted Sep 28, 2015 23:52 UTC (Mon) by MrWim (subscriber, #47432) [Link]

> These will all eventually be packaged. (Resistance is futile)

Possibly, but I don't think it can happen with the current technology and processes that exist in Debian. Fortunately sandboxing technologies like xdg-app and limba[1] seem to be advancing steadily which may provide some of the technology to enable this. I also think that a lot of the technologies and ideas around reproducible builds currently happening in Debian could help - if binaries are entirely deterministic you don't need centralised build infrastructure.

As far as the processes go I have fewer ideas. If that level of packaging were to come to pass there would certainly be a need for a curator, which is a role Debian is already fulfilling, but I think it might look quite different to how it does now. How QA, bug fixing, legal review and integration fit in to such a world I don't know.

[1] I particularly like how limba allows people to reuse their .travis.yml files for app building: it's a model that's proven to scale well to a lot of projects.

Replacement for Google Code Search?

Posted Nov 18, 2015 11:59 UTC (Wed) by nickleverton (subscriber, #81592) [Link]

>Ideally I would have liked to have quickly installed an older version so I could get my job done, but Debian doesn't have it available any-more

Excuse the late followup, but are you aware of snapshot.debian.net ? It has pretty much every Debian version of everything that wasn't removed for legal reasons.

Replacement for Google Code Search?

Posted Sep 3, 2015 20:06 UTC (Thu) by zack (subscriber, #7062) [Link]

> Debsources seems, however, to lack some of its features, such as search by programming language, or did I miss it?

codesearch.debian.net (the actual regexp-based code search engine) has a filetype selector that allows doing some of it. See: http://codesearch.debian.net/faq#keywords

Remember that, as noted in the article, that covers "only" Debian sid, currently

Replacement for Google Code Search?

Posted Sep 3, 2015 21:11 UTC (Thu) by lsl (guest, #86508) [Link]

> Unfortunately, Google Code Search does not exist anymore. It was not free software, but I liked it anyway.

Russ Cox wrote about the inner workings of Code Search, though[0]. He also released a basic implementation that's not tied to internal Google infrastructure[1]. This was picked up by Michael Stapelberg (see his BSc thesis) who then ran off to build codesearch.debian.net.

So the successor to Google Code Search is, in fact, codesearch.debian.net.

[0] https://swtch.com/~rsc/regexp/regexp4.html
[1] https://github.com/google/codesearch

Biggest debian source packages

Posted Sep 3, 2015 14:48 UTC (Thu) by ededu (guest, #64107) [Link]

Does anyone know how can we see what projects (source packages) are the biggest in terms of SLOC in debian? For example, looking at https://sources.debian.net/src/linux/4.1.6-1 I see that linux 4.1 has almost 13 Mlines of code, and likely libreoffice 5.0.1 has about 4.5 Mlines of code. However, I would like to have a list in decreasing order of size.

Biggest debian source packages

Posted Sep 3, 2015 20:29 UTC (Thu) by zack (subscriber, #7062) [Link]

> Does anyone know how can we see what projects (source packages) are the biggest in terms of SLOC in debian? For example, looking at https://sources.debian.net/src/linux/4.1.6-1 I see that linux 4.1 has almost 13 Mlines of code, and likely libreoffice 5.0.1 has about 4.5 Mlines of code. However, I would like to have a list in decreasing order of size.

There is no interface to do complex queries like the one you're asking for on the web, but your question can easily be answered using the Debsources DB.

The top-ten (not counting different versions of the same packages) is like this (format is sloc|package|version):

15331031|chromium-browser|43.0.2357.65-1~deb8u1
13726566|linux|4.2~rc8-1~exp1
9256058|mono|4.0.2.5+dfsg-2
7273446|icedove|40.0~b1-1
6900913|iceweasel|40.0.3-3
5923212|netbeans|8.0.2+dfsg1-4
5768171|aspectc++|1:1.2+svn20150823-1
5551762|aces3|3.0.8-4
5483128|libreoffice|1:3.5.4+dfsg2-0+deb7u2
5482310|nvidia-cuda-toolkit|6.0.37-5

I've put a list that includes all packages that have at least 100ksloc up here http://upsilon.cc/~zack/stuff/debsources-pigs.20150903.txt

While we haven't yet automated the periodic publishing of Debsources database dumps, one that is ~6 months old is available from Zenodo at https://zenodo.org/record/16106, for those who might want to play with it.

Some aggregate stats extracted from the same database (but in 2014) are available in this paper: https://upsilon.cc/~zack/research/publications/debsources...

Hope this helps,
Cheers.

Biggest debian source packages

Posted Sep 4, 2015 9:46 UTC (Fri) by jwilk (subscriber, #63328) [Link]

No gcc-5 on the list... How come?

Biggest debian source packages

Posted Sep 4, 2015 10:00 UTC (Fri) by zack (subscriber, #7062) [Link]

Here is why: http://sources.debian.net/src/gcc-5/latest/

The GCC package in Debian still uses tarball-in-tarball, which Debsources does not expand (as mentioned in the article), so sloccount returns bogus results on that package.

Debsources as a platform

Posted Sep 3, 2015 20:02 UTC (Thu) by zack (subscriber, #7062) [Link]

Thanks for your article about our DebConf15 Debsources talk Nathan, it's much appreciated!

Just a clarification about this statement in the article:

> the Debsources database includes every update that has been pushed to the official archive

It is true that Debsources is updated at each Debian archive update, and hence that it indexes/publishes new package uploads in near real-time (+ processing time). But, currently, Debsources does not keep all uploaded package versions (as, for instance, snapshot.debian.org does). Debsources has instead a garbage collection phase that will delete old package versions 15 days after they have disappeared from all known Debian releases. We do want to keep all package versions eventually, but we do need to implement file-level deduplication first to make that viable. (I'm sorry we didn't discuss this aspect in detail during the talk!)

Thanks again for your article,
Cheers.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds