Hadoop, an open source data storage and processing framework inspired by the technologies Google developed to handle its massive data needs, is coming to the enterprise data center. Are you interested?
Behind Hadoop is MapReduce, a programming model and software framework that enables the creation of applications able to rapidly process vast amounts of data in parallel on large clusters of compute nodes. Hadoop is an open source project of the Apache Software Foundation (hadoop.apache.org).
Specifically, Hadoop offers a framework for running applications on large clusters built from commodity hardware. It uses a style of processing called Map/Reduce, which, as Apache explains it, divides an application into many small fragments of work, each of which may be executed on any node in the cluster. A key part of Hadoop is the Hadoop Distributed File System (HDFS), which reliably stores very large files across nodes in the cluster. Both Map/Reduce and HDFS are designed so that node failures are handled automatically by the framework. Each Hadoop node is simply a server with local storage.
Hadoop moves computation to the data itself. Computation consists of a map phase, which produces sorted key/value pairs, and a reduce phase. According to IBM, a distributor of Hadoop, data is initially processed by map functions, which run in parallel across the cluster. The reduce phase then aggregates the map results and completes the job.
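The two phases just described can be sketched in plain Python using the classic word-count example. This is only an illustration of the programming model, not Hadoop's actual Java API; the function names here are made up for clarity:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) key/value pair for every word in every document.
    In Hadoop, many map tasks would run in parallel across the cluster."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group the emitted pairs by key, as the framework
    does automatically between the map and reduce phases."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key to complete the job."""
    for key, values in grouped:
        yield (key, sum(values))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(shuffle(map_phase(docs))))
print(counts)  # word frequencies across all input documents
```

Because each map call and each per-key reduce call is independent, the framework is free to spread them across any nodes in the cluster, which is what makes the model scale.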
HDFS breaks stored data into large blocks and replicates it across the cluster, providing highly available parallel processing and redundancy for both the data and the jobs. Hadoop distributions provide a set of base class libraries for writing Map/Reduce jobs and interacting with HDFS.
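To make the block-and-replica idea concrete, here is a toy sketch of splitting a file into fixed-size blocks and placing copies across nodes. This is a simplification for illustration only; real HDFS uses large blocks (64 MB or 128 MB by default) and rack-aware placement, and these function and node names are invented:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks. A tiny block size
    is used here purely for readability."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.
    Real HDFS placement is rack-aware; this is a simplification."""
    placement = {}
    for b in range(len(blocks)):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 100, block_size=32)
layout = place_replicas(blocks, nodes=["node1", "node2", "node3", "node4"])
# If any single node fails, every block still has copies on the
# surviving nodes, so both the data and the jobs reading it survive.
```

The replication factor is what buys both redundancy and parallelism: any of a block's replicas can serve a read, and a map task can be scheduled on whichever replica's node is least busy.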
The attraction of Hadoop is its ability to quickly find and retrieve data from vast volumes of unstructured content, along with its resilience. Hadoop, or some variation of it, is critical for massive websites like Google, Facebook, Yahoo and others. It also is a component of IBM's Watson. But where would Hadoop play in the enterprise?
Cloudera (www.cloudera.com) has staked out its position as a provider of Apache Hadoop for the enterprise. It primarily targets companies in financial services, Web, telecommunications, and government with Cloudera Enterprise, which includes the tools, platform, and services necessary to use Hadoop in an enterprise production environment, ideally within what amounts to a private cloud.
But there are other players plying the enterprise Hadoop waters. IBM offers its own Hadoop distribution. So does Yahoo. You also can get it directly from the Hadoop Apache community.
So what are those enterprise Hadoop applications likely to be? A few come immediately to mind:
- Large scale analytics
- Processing of massive amounts of sensor or surveillance data
- Private clouds running social media-like applications
- Fraud applications that must analyze massive amounts of dynamic data fast
Hadoop is like other new technologies that emerge. Did your organization know what it might do with the Web, rich media, solid state disk, or the cloud when they first appeared? Not likely, but it probably knows now. It will be the same with Hadoop.