The amount of data is exploding and the tools to store and process it need to keep pace. Companies like Wal-mart and Facebook deal with data on the petabyte scale. Data at this scale does not fit well in traditional relational database models. Internet companies like Oversee.net who wish to analyze all the data that flows through its network can no longer just depend on SQL based solutions like MySQL. Instead we have to turn to solutions like Hadoop, Cassandra, and Pig. There is a lack of good information on the internet on how the BigData ecosystem fits together so I figured I would create a post to help all those out trying to wrap their heads around this new data paradigm.
In the past, when a company needed to store and access more data they had one of two choices. One option would be to buy a bigger machine with more CPU, RAM, disk space, etc. This is known as scaling vertically. Of course, there is a limit to how big of a machine you can actually buy and this does not work when you start talking about internet scale. The other option would be to scale horizontally. This usually meant contacting your vendor (Oracle, EMC, etc.) to buy a bigger solution. These solutions do not come cheap and therefore required a significant investment. This is where the new breed of solutions comes in. They are designed around cheap commodity hardware that is expected to fail. The system takes care of all the complexities of replication, failover, etc. This allows the developer to think about the actual data problem at hand and allows businesses to scale horizontally but at a much cheaper price.
The foundation of the BigData movement comes from a framework that was originally developed by Google to deal with the large amount of search data, MapReduce. As its name suggest there are two parts to MapReduce, a Map phase and a Reduce phase. In the Map phase, a problem is broken out in to smaller chunks. This enables the work to be redistributed to the various nodes in the cluster. In the Reduce phase, the work is collected from all the nodes and then reassembled to come up with the final output.
In 2006, Yahoo started a MapReduce project that would eventually turn into Hadoop. It would become an open-source top level Apache project in 2008. However, Hadoop is only one small part of a much larger ecosystem. This is where a lot of confusion about Hadoop comes in. By itself, Hadoop is just a framework. It depends on other projects and complimentary services to make it the transformational technology that most people are referring to when they just talk about “Hadoop”.
In order to use Hadoop, you need to have a storage system where Hadoop can pull data from and subsequently store its results. Because of the way Hadoop works, by grabbing very large data sets, it needs to pull this data from hundreds of disks in order to be efficient. This requires that it work with a distributed filesystem. Of course, distributed filesystems are more difficult to work with since you have all the challenges associated with network programming. It needs to be fault tolerant, scalable , and portable. Hadoop comes with a distributed filesystem. It is appropriately named HDFS which stands for Hadoop Distributed Fileystem. While it solves some of the more important issues when dealing with a distributed filesystem, there are several things it does not do well. If low latency is a requirement, HDFS is not a good fit. HDFS is designed to handle large file sizes. It does not handle lots of small files. Simple random access is just not possible. To get over some of these limitations, several higher level abstractions have been developed.
Most users of Hadoop are going to want to use it in a database like way. There are several solutions out there which attempt to provide this functionality. Most of them are based around the NoSQL database management model. One of the more popular solutions which is a subproject of the Apache Hadoop project is HBase. HBase is a distributed, column-oriented database that sits on top of HDFS. It provides database like functionality rather than having to deal with the lower level HDFS interface. Other Database like interfaces that exist are Cassandra (used and developed by Facebook), and Hypertable.
It must also be noted that there are other Database solutions that do not rely on Hadoop to do their map reduce functionality. The most notable of these solutions are MongoDB, CouchDB, and Voldemort.
Writing Map Reduce functions is not intuitive or easy. Most analyst and developers interact with a database system through a higher-level query language, like SQL, that helps simplify their requests. While one could write their own abstraction layer to get around the complexities of MapReduce, several open source projects have taken this task on as well. Pig is part of the Apache Hadoop Project. It allows people to specify data flow through a language called Pig Latin. It then translates the query into the appropriate MapReduce jobs and provides the results to the user. Now, don’t misunderstand this. While Pig is simpler than directly accessing the HDFS filesystem, it probably would take someone familiar with SQL some time to really understand. Another answer to this problem is with the Hive Project. Yet again, another project born out of Facebook, it provides a Data Warehousing framework on top of Hadoop. It’s language is a little bit more familiar to those who come from a SQL background and tries to bridge the gap between those users familiar SQL queries and have some scripting background and the much more complex MapReduce programs that need to be written in Java. It supports things like joins and group bys that a lot of other solutions do not.
I have by no means discussed everything there is to know about BigData; an entire blog could be devoted to just this. BigData is still in its early infancy but it is already having a profound effect on technology companies and the way we do business. In future posts I will discuss all the tradoffs one has to make when faced with the numerous solutions that exist out there. Just small differences in how your organization wishes to access the data or the type of data you might want to store can have a dramatic effect in what solution may best fit your needs. I also hope to discuss how Big Data is changing the way Oversee views its business and the impact it can have in the Online Advertising space.