Google's Mind-Blowing Big-Data Tool Grows Open Source Twin

Google reinvented data analysis with a sweeping software platform called Dremel. And now, Silicon Valley startup MapR has launched an open source project that seeks to duplicate the platform.
Silicon Valley startup MapR has launched an open source project called Drill, which seeks to mimic a shockingly effective data-analysis tool built by Google.

Mike Olson and John Schroeder shared a stage at a recent meeting of Silicon Valley's celebrated Churchill Club, and they didn't exactly see eye to eye.

Olson is the CEO of a Valley startup called Cloudera, and Schroeder is the boss at MapR, a conspicuous Cloudera rival. Both outfits deal in Hadoop -- a sweeping open source software platform based on data center technologies that underpinned the rise of Google's web-dominating search engine -- but in building their particular businesses, the two startups approached Hadoop from two very different directions.

Whereas Cloudera worked closely with the open source Hadoop project to enhance the software code that's freely available to the world at large, MapR decided to rebuild the platform from the ground up, and when that was done, it sold the new code as proprietary software. On stage last month during a panel discussion dedicated to Hadoop, Olson and Schroeder went toe-to-toe over whose approach made the most sense, and as so often happens in the Valley when open source is the subject at hand, the dispute raised more than a little heat from those sitting in the audience.

>'Drill is a new set of APIs. It's a new system. It really helps to get adoption of new APIs if those APIs are open.'

— Tomer Shiran

Schroeder said that MapR wasn't necessarily opposed to open development. The company took the Hadoop code behind closed doors, he explained, at least in part because those driving the open source project were unwilling to quickly make the changes MapR wanted to make. "There are a lot of politics in the open source community," he said, "and things are different depending on your situation."

As if to prove his point, MapR has now launched a separate open source project meant to serve as a major complement to Hadoop. At the Apache Software Foundation -- the not-for-profit open source outfit that oversees Hadoop -- MapR recently proposed a project that aims to mimic Dremel, a shockingly effective data-analysis tool built and used by Google. The project is called Drill, and according to Tomer Shiran, the MapR employee who oversaw the proposal, it's suited to completely open development in a way that the company's original Hadoop work was not. With Hadoop, MapR was working with an existing project -- with an entrenched community of developers. With Drill, it's starting something new.

Shiran says MapR opened up the development of Drill because it hopes to turn the platform into the de facto standard for rapidly analyzing data stored in Hadoop. In developer speak, the company wants to promote the use of Drill's APIs, or application programming interfaces, which let you plug other tools into the platform.

"It's a new set of APIs. It's a new system," says Shiran, who previously worked in the research arms of both HP and IBM. "It really helps to get adoption of new APIs if those APIs are open."

In building Drill out in the open, the company may also hope to win some points with the world's developers and IT managers -- points it lost in building its own proprietary version of Hadoop. Shiran denies this is the case, but open source politics pop up in so many places -- as last month's panel discussion at the Churchill Club so clearly demonstrated, when Schroeder was practically heckled for saying MapR wasn't concerned with open source "ideology." The reasons for open sourcing software code are almost never straightforward, but clearly, keeping code open is an increasingly important part of doing business in today's software market.

It helps to spread the adoption of software code, but it can also spread goodwill -- something that can be just as important in its own way.

When MapR started working on Hadoop in 2009, the platform was already widely used across the web. Based on research papers describing MapReduce and the Google File System -- two sweeping software platforms that reinvented the way Google built its search index -- Hadoop was built by Yahoo and Facebook and others as a way of crunching vast amounts of data using thousands of dirt-cheap servers. It was hugely effective -- a Facebook engineer once compared it to the air you breathe -- but it was also ill-suited to companies that lacked the engineering expertise of a Yahoo or a Facebook.

MapR resolved to fix many of its flaws -- including a conspicuous "single point of failure" that plagued the file system -- but according to Schroeder and company co-founder M.C. Srivas, those driving the open source project were unwilling to make these changes as quickly as the company would have liked. So, MapR rebuilt the file system on its own, and in 2011, the company released its own proprietary version of Hadoop, intent on reaping the financial benefits of its engineering work.

As Mike Olson points out, the open source Hadoop project has since solved many of the same problems, and he believes that keeping the platform's core code in the open is a far better solution in the long term. "Most of all, you want open source software because it eliminates vendor lock-in," he said during last month's panel discussion. "You can kick the vendor out, and we can't turn off access to your data. We can't turn off access to your analytics. We can't turn off access to your databases."

But Schroeder argues that Olson and Cloudera also offer proprietary software -- in the form of Hadoop management tools -- and he points out that all software companies must find some way of actually making money from their code. There are many ways of doing so, and with Drill, MapR has shown that it too sees the value of open development.

Shiran says that outside developers have already expressed interest in the project, and two outsiders -- Chris Wensel, founder and CEO of a company called Concurrent, and Ryan Rawson, vice president of engineering at Drawn to Scale -- are listed as core developers in the Drill proposal MapR submitted to Apache.

Though Shiran points out that the company has already made open source contributions to Hadoop and various sister projects, Drill is different in that the company intends to build the entire platform in the open. But the way Shiran tells it, this is a necessity. Though Google released a research paper describing Dremel in 2010, the Hadoop community has yet to duplicate its rather astonishing data analysis techniques, and MapR wants to ensure this is done in the "right way." This, he says, is something the company couldn't do with Hadoop itself.

>'You have a SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries -- and you don’t have to do any programming. You just type the query into a command line.'

— Urs Hölzle

Yes, Hadoop already serves as a data analysis tool, thanks to sister projects such as Hive and Pig, but it's a "batch" tool, meaning that a data query takes a fair amount of time. Drill is meant to analyze large amounts of data almost instantly, following in the footsteps of Dremel. According to Google infrastructure guru Urs Hölzle, Dremel can run a query on a petabyte of data in about three seconds.

"You have a SQL-like language that makes it very easy to formulate ad hoc queries or recurring queries — and you don’t have to do any programming. You just type the query into a command line," Hölzle told us last month, referring to the Structured Query Language that has long been used with traditional databases designed to handle much smaller amounts of data.

According to MapR's Shiran, Drill is meant as a complement to Hadoop, not a replacement. Hadoop, he says, is best used to transform a large dataset. You can take a vast collection of webpages, for instance, and build a search index. But Drill lets you very quickly pull smaller pieces of information from that same dataset.

"[Hadoop] can take a petabyte of data and crunch it into a new petabyte," Shiran says. "With Dremel or Drill, you can take a petabyte and produce a terabyte or less." Some MapR customers, he says, already use the company's Hadoop platform in tandem with BigQuery, a Google online service that exposes Dremel to the rest of the world.
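Shiran's distinction can be sketched in miniature. The toy Python snippet below is purely illustrative -- the dataset and field names are invented for the example, and neither MapR nor Google code is involved. It contrasts a Hadoop-style batch transformation, which rewrites the whole dataset, with a Drill- or Dremel-style query, which scans the same data but distills it down to a small aggregate.

```python
# "records" stands in for a huge dataset. In practice it would be
# petabytes spread across a cluster, not an in-memory Python list.
records = [
    {"url": "a.com", "bytes": 1200},
    {"url": "b.com", "bytes": 800},
    {"url": "a.com", "bytes": 400},
]

# Hadoop-style batch job: transform every record, producing output
# roughly as large as the input ("a petabyte into a new petabyte").
transformed = [{"url": r["url"].upper(), "kb": r["bytes"] / 1024} for r in records]

# Drill/Dremel-style query: scan the same records but return only a
# small aggregate ("a petabyte into a terabyte or less").
total_bytes_by_url = {}
for r in records:
    total_bytes_by_url[r["url"]] = total_bytes_by_url.get(r["url"], 0) + r["bytes"]

print(len(transformed))    # as many rows as the input: 3
print(total_bytes_by_url)  # a tiny summary: {'a.com': 1600, 'b.com': 800}
```

The point of the contrast: the batch job's output grows with the input, while the query's output grows with the answer -- which is why an interactive, Dremel-style engine can return results in seconds even over enormous datasets.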

The name Drill, Shiran says, was proposed by a Google employee that MapR has worked with on BigQuery. MapR co-founder M.C. Srivas is a former Googler who was part of the team that built the company's search infrastructure. Google isn't officially involved in Drill. With these massive infrastructure platforms, it tends to do its own thing.

MapR has also been known to do its own thing. But this time, it's not.

Image: Flickr/Mitch Wagner