Hadoop, Pig, and Hive.
Appliance Type
Partner
Description
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data. This VM contains a single-node Hadoop+Pig+Hive installation with sample data for trying out the system. You can watch the associated free training videos at:
http://www.cloudera.com/hadoop-training
Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters. For more information about Hadoop, please see the Hadoop wiki:
http://wiki.apache.org/hadoop/
Features & Benefits
Here's what makes Hadoop especially useful:
* Scalable: Hadoop can reliably store and process petabytes.
* Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
* Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
* Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
* Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built in capabilities of the language.
* Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: Ease of programming, Optimization opportunities, and Extensibility.
Pricing
Free
Tags & Keywords
hadoop, clustered storage, map reduce, data warehousing

