Real Experiences from a Hadoop Veteran

Jim Scott
Published in The Ramp
7 min read · Nov 10, 2015


Some people say I am biased toward certain technologies. That is a completely true statement! Granted, it does depend on the specific technology. But just because I may be biased toward certain technologies doesn’t mean I’m not objective or fair.

When it comes to Hadoop, Apache Hadoop really is free, as in beer. But in reality, unless you are a huge company with a massive team of engineers, you very likely are NOT going to be patching and building your own distribution of Hadoop to run internally. No matter what anyone says, if you are paying in any way for support for your distribution of Hadoop, then it is not truly free.

The first Hadoop distribution I started with, back in 2009, was Cloudera’s; it was the only supported distribution of Hadoop available at the time. I had a good experience with the people I interacted with at Cloudera, but at the end of the day, there were just inherent flaws in the design of Apache Hadoop that caused REAL pain.

Hadoop showed so much potential so quickly that everyone wanted Hadoop to be the next hammer. I believe that because the RDBMS hammer was in use for so long, it was a natural response for everyone to want Hadoop to be “The Hammer” in the toolbox. After all, allowing a company to prevent data sprawl and to co-locate its data in a single place to perform data processing across all domains is in fact a game-changing technology.

My Experiences and My Future Choices

What I found out very quickly with Hadoop was that there were some severe limitations in the platform. For example, the NameNode was a single point of failure in the design. In my first implementation, we had a NameNode failure and lost 2 days of engineering time for 40 engineers. Additionally, after that implementation started making headway and gained a little visibility, others started putting lots of small files into the system and suddenly the file system was out of storage; we lost another 1–2 days of engineering time cleaning this up and engineering around the problem. MapR solved this issue by getting rid of the NameNode.
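To put the small-files problem in perspective, here is a back-of-the-envelope sketch. It leans on the commonly cited rule of thumb that each file, directory, and block object costs the NameNode on the order of 150 bytes of heap; the file counts below are hypothetical.

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Rule of thumb: each file, directory, and block object consumes
        // roughly 150 bytes of NameNode heap. Counts here are hypothetical.
        long smallFiles = 100_000_000L;   // 100M small files...
        long blocksPerFile = 1;           // ...each still occupying one block
        long namespaceObjects = smallFiles + smallFiles * blocksPerFile;

        double heapGb = namespaceObjects * 150.0 / (1L << 30);
        System.out.printf("Estimated NameNode heap: ~%.0f GB%n", heapGb);
        // ~28 GB of metadata for files that may each hold only a few KB.
    }
}
```

The point is that the metadata cost grows with the number of objects, not the number of bytes, which is why a flood of small files can take down a cluster long before the disks fill up.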

So when I had the chance to do it again, having lived through building a Hadoop cluster once, I wanted to focus on the core competencies of my business and leverage Hadoop as a technology to solve problems. I didn’t want to waste engineering resources working around the flaws or limitations in Hadoop. MapR was out of stealth mode, and it solved all of the issues I had run into while using Cloudera’s Distribution of Hadoop. MapR had taken the concepts Google used to build its company and enabled any business to run at Google scale.

Given the Hadoop experiences I have had, I feel it is worth detailing the differences that I found most important to my business needs and my employer’s requirements for an enterprise platform.

Append-only File Access vs. Random Read-Write File Access

HDFS was severely hampered because it implemented append-only access semantics. That would have been fine if the only task ever run on Hadoop was indexing the web, but it wasn’t. The restriction limited many practical applications and forced downstream projects such as HBase to work around it. HBase implemented concepts like tombstoning and compactions, which could cripple a production system if they occurred during peak periods. MapR solved these issues with the creation of MapR-DB, a real-time, zero-administration database patterned after Google’s BigTable that supports the HBase APIs.
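To make the limitation concrete, here is a minimal sketch of the difference, assuming a vanilla HDFS client and an NFS-mounted MapR-FS path (all paths are placeholders): HDFS lets you create a file or append to its end, while a POSIX file system lets you seek and overwrite in place.

```java
import java.io.RandomAccessFile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnlyDemo {
    public static void main(String[] args) throws Exception {
        // HDFS: a file can be created, or appended to at its end...
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.append(new Path("/logs/events.log"))) {
            out.writeBytes("new record\n"); // always lands at the end of the file
        }
        // ...but there is no API to seek into the middle of an existing file
        // and overwrite bytes in place; the output stream is append-only.

        // A POSIX path (e.g., an NFS-mounted MapR-FS directory) allows exactly that:
        try (RandomAccessFile raf =
                 new RandomAccessFile("/mapr/my.cluster.com/logs/events.log", "rw")) {
            raf.seek(42);              // jump to an arbitrary byte offset
            raf.writeBytes("patched"); // overwrite in place
        }
    }
}
```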

Systems Integration

You need security: not just authentication and authorization, but also wire-level security and true multi-tenancy. I know people who have spent hundreds of hours figuring out how to get security working on Apache Hadoop. Due to the lack of security support in Apache Hadoop, we started experiencing cluster isolation and data sprawl, which are exactly the things Hadoop was designed to prevent in the first place.
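For reference, this is roughly what a secured client login looks like once Kerberos is finally wired up. The principal and keytab path are placeholders, and the matching settings must also be present in core-site.xml across the cluster; this is a sketch, not a full security setup.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClientLogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The same setting must exist in core-site.xml on every node,
        // alongside a working Kerberos (KDC) deployment.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Principal and keytab path are placeholders for illustration.
        UserGroupInformation.loginUserFromKeytab(
            "etl-service@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

        System.out.println("Authenticated as: " + UserGroupInformation.getLoginUser());
    }
}
```

And that snippet is only the client side; getting there means standing up a KDC, cutting keytabs, and configuring every service, which is where those hundreds of hours go.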

There is a reason standards exist: they allow interoperability and the ability to rip and replace. Take NFS or POSIX as examples. HDFS is neither POSIX nor NFS compliant. To see what files are in HDFS, you have to go through the HDFS command-line interface to query the file system. The MapR file system (MapR-FS) is an actual OS-level file system that is POSIX compliant; it’s not a file system sitting on top of another file system. If you want to see the files in the distributed file system, just run the ls command. Want to edit a file? Go for it. No special tools required. Any Linux application that can read or write to an NFS mount can read or write to MapR-FS.

Really, who wants to spend their time figuring out how to make these things work together? MapR offers zero barrier to entry for all of the off-the-shelf applications in your business that use standard file system protocols. What happens when you want to review job output in Apache Hadoop? Generally, you have to copy the data out of HDFS. With MapR you can just access the data in place, because it is a standard file system.
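As a sketch of what that buys you: with MapR-FS mounted over NFS, plain Java file I/O works against the cluster, with no HDFS client and no copy step. The mount point and paths below are assumptions for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ReadJobOutputInPlace {
    public static void main(String[] args) throws Exception {
        // With MapR-FS mounted over NFS, cluster data is just a directory.
        // The mount point and paths here are placeholders for illustration.
        Path outputDir = Paths.get("/mapr/my.cluster.com/jobs/output");

        // The equivalent of running `ls` against the distributed file system:
        try (Stream<Path> files = Files.list(outputDir)) {
            files.forEach(System.out::println);
        }

        // Read a result file in place with plain file I/O -- no copy out needed:
        Files.readAllLines(outputDir.resolve("part-00000"))
             .forEach(System.out::println);
    }
}
```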

Backups and Recovery

Another big area where Apache Hadoop falls short is disaster recovery and backups. We all make mistakes, and we have to plan for them to happen. Hadoop’s data replication protects you from disk failures, not from data corruption or human error. If you destroy a piece of data, the destruction replicates to the other two locations, and you have destroyed it everywhere. If you are about to deploy new software to production, you might want to use a little caution and take a snapshot of your data before you run that new code on it. I don’t mean an Apache Hadoop snapshot, because in standard Hadoop a snapshot is a copy of the metadata. I mean a MapR snapshot, which is a nearly instantaneous copy of the data at a point in time, even when a file is actively being written. This means that if you corrupt your data, or just accidentally blow it away, you can get it back from your snapshot. And yes, you can snapshot MapR-DB tables. These are capabilities that just shouldn’t be overlooked.
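To illustrate, here is a minimal recovery sketch, assuming a snapshot named pre-deploy was taken on a volume exposed through the NFS mount; every name and path below is a placeholder. Because snapshots surface as a read-only .snapshot directory inside the volume, getting a file back is just a copy:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class RestoreFromSnapshot {
    public static void main(String[] args) throws Exception {
        // The snapshot would have been taken before the deploy, e.g. with:
        //   maprcli volume snapshot create -volume projects -snapshotname pre-deploy
        // All names and paths below are placeholders for illustration.
        Files.copy(
            Paths.get("/mapr/my.cluster.com/projects/.snapshot/pre-deploy/data/users.csv"),
            Paths.get("/mapr/my.cluster.com/projects/data/users.csv"),
            StandardCopyOption.REPLACE_EXISTING);

        System.out.println("Restored users.csv from the pre-deploy snapshot");
    }
}
```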

Maintenance and Upgrades

With every good enterprise application, general maintenance, administration, and upgrades must be performed. Hadoop is no exception to the rule. What happens when your distribution has a new release and it only supports one version of Hive? What happens if the version of Hive it supports broke an API you were using? How do you upgrade your platform? If you think you’ve solved that problem, what happens when people on different teams use this technology on the same cluster and you can’t all coordinate when to get your changes in place so you can upgrade? These are real problems! MapR tests multiple versions of many open source software projects against each release, and it is the only distribution that supports running multiple versions of software on the same cluster.

Unbiased Open Source

What place does your Hadoop vendor have in choosing which open source software projects you should or shouldn’t be able to use to solve your business problems? After all, projects competing for the same space in the open source community is really capitalism at its best. You should be empowered to choose the projects you want, whether it is Drill, Pig, Impala, SparkSQL, Hive, or Tez. Don’t let your vendor play politics at your expense. MapR supports open APIs and ships an unbiased set of open source software on Hadoop.

Performance

In 2012, MapR ran the TeraSort benchmark, which measures how quickly 1TB of data can be sorted. MapR set the record at 54 seconds on 1,003 nodes on the Google Compute Engine (virtualized!). Yahoo! previously held the record at 62 seconds on 1,460 nodes. That is roughly 31% less hardware, running about 13% faster.

Then there is the MinuteSort test. Yahoo! had set that record by sorting 1.6TB of data on 2,200 servers in one minute. Early in 2014, a MapR customer ran the MinuteSort test and sorted 1.65TB on their 298-node cluster. That is roughly 1/7th the hardware: a HUGE difference!
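For anyone who wants to check the math, the ratios come straight from the numbers above:

```java
public class BenchmarkMath {
    public static void main(String[] args) {
        // TeraSort: MapR at 54s on 1,003 nodes vs. Yahoo! at 62s on 1,460 nodes.
        System.out.printf("Hardware reduction: %.0f%%%n", (1 - 1003.0 / 1460) * 100); // ~31%
        System.out.printf("Time improvement:   %.0f%%%n", (62 - 54) / 62.0 * 100);    // ~13%

        // MinuteSort: 1.65TB on 298 nodes vs. 1.6TB on 2,200 nodes.
        System.out.printf("Hardware ratio: 1/%.1f%n", 2200.0 / 298);                  // ~1/7.4
    }
}
```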

Final Thoughts

Before joining the MapR team, I was a customer of MapR at two different companies, and for good reason. Now, it is your choice whether you run your enterprise on a free but unsupported distribution of Hadoop; I can’t stop you. If you do choose that route, keep in mind that MapR supports more standards than any other Hadoop distribution, and you and your organization would be better off using the free MapR M3 Community Edition. It is faster and offers you more than any other distribution in the market.

The more experience companies gain with Hadoop, the more they realize they need a platform that can integrate into the enterprise. Enterprise features are critical to the longevity of a business. And don’t forget that while you could make your own beer, there are still costs for the ingredients, the equipment, and the most important resource of all: time. That beer isn’t really free. My recommendation is to focus on your core competencies instead of worrying about how to engineer around your Hadoop platform.

Originally published at www.mapr.com.
