Enabling Highly Available and Scalable Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers, and it is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is being used by enterprises across verticals for Big Data analytics, helping them make better business decisions based on large data sets.
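To make the programming model concrete, the sketch below is the classic word-count job written against the Hadoop 1.x MapReduce API: map tasks run next to the data on each node, and reduce tasks aggregate their output across the cluster. The input and output HDFS paths are placeholders supplied on the command line, not values tied to any particular cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split processed on this node.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word produced by all mappers in the cluster.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // Job(conf, name) is the Hadoop 1.x constructor
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}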
VMware enables you to easily and efficiently deploy and use Hadoop on your existing virtual infrastructure. Leverage vSphere to deploy an Apache Hadoop cluster in minutes while improving its availability. VMware is making two open source contributions: Hadoop Virtualization Extensions, to make Hadoop virtualization-aware, and Serengeti, to enable deployment of a Highly Available Hadoop cluster in minutes.
- Achieve better Apache Hadoop performance in virtual environments through Hadoop Virtualization Extensions
- Simplify operations with Serengeti through rapid deployment of Apache Hadoop clusters, including distributions from multiple vendors
- Remove single points of failure through one-click High Availability for the Apache Hadoop NameNode and JobTracker, as well as for Hadoop tools such as Pig and Hive
Serengeti is an open source project initiated by VMware to automate the deployment and management of Apache Hadoop clusters in virtualized environments such as vSphere. Serengeti can deploy and run Hadoop distributions from multiple vendors.
- Deploy clusters with HDFS, MapReduce, Pig, Hive, and Hive server
- One command to deploy and use Hadoop clusters
Fully customizable configuration profile to meet your needs
- Dedicated machines, or shared with other workloads
- Shared or local storage
- Static IP or DHCP network
- Full control over the placement of Hadoop nodes
Manage and use Hadoop easily
- Scale out a Hadoop cluster
- Tune Hadoop configuration (see the property sketch after this list)
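As a small illustration of what "Hadoop configuration" covers here, the sketch below reads and overrides a few common Hadoop 1.0 properties through the standard org.apache.hadoop.conf.Configuration API. The property values are examples only; on a running cluster, such settings live in the core-site.xml, hdfs-site.xml, and mapred-site.xml files on each node, which is the layer a Serengeti configuration profile manages (the profile format itself is not shown here).

import org.apache.hadoop.conf.Configuration;

public class HadoopTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(); // loads any *-site.xml files found on the classpath

    // Job-level overrides a client can set per submission (example values only):
    conf.set("mapred.reduce.tasks", "8");            // number of reduce tasks for a job
    conf.set("mapred.child.java.opts", "-Xmx1024m"); // heap for each map/reduce task JVM
    conf.set("dfs.replication", "3");                // HDFS replication factor for files this client writes

    // Echo the effective values; anything not overridden falls back to the *-site.xml defaults.
    for (String key : new String[] {"mapred.reduce.tasks", "mapred.child.java.opts", "dfs.replication"}) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}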
Speed up time to insight
- Upload/download data and run MapReduce jobs, Pig scripts, and Hive scripts from your PC
- Consume data in HDFS through a Hive server SQL connection using existing tools (see the JDBC sketch after this list)
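A minimal sketch of what that SQL connection can look like from client code, assuming the HiveServer1 JDBC driver that ships with the Hive 0.x releases of the Hadoop 1.0 era; the host name, table, and query below are placeholder examples, not values provided by Serengeti.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer1 JDBC driver (hive-jdbc and its dependencies must be on the classpath).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

    // 10000 is the default Hive server port; "default" is the default database.
    Connection conn = DriverManager.getConnection("jdbc:hive://hive-server-host:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // An ordinary SQL-style query; Hive compiles it to MapReduce jobs over data in HDFS.
    ResultSet rs = stmt.executeQuery("SELECT page, COUNT(1) FROM page_views GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    rs.close();
    stmt.close();
    conn.close();
  }
}

The same jdbc:hive:// connection string can be used from existing JDBC-aware reporting and query tools, which is what the bullet above refers to.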
Grow and shrink compute capacity on demand
- Separate compute nodes from data nodes without losing data locality
- Scale out and shut down compute nodes on demand
- Spin up a compute-only cluster to analyze data in an existing HDFS
Improve availability of the Hadoop cluster
- One-click High Availability for the NameNode and JobTracker to avoid single points of failure
- Fault Tolerance (FT) for the NameNode and JobTracker
- One-click HA for Hadoop tools such as Pig, Hive, and HBase
- VMware vMotion to reduce planned downtime
Support for multiple Hadoop 1.0-based distributions
