Apache Hadoop is 100% open source, and it pioneered a fundamentally new way of
storing and processing data. Instead of relying on expensive, proprietary
hardware and separate systems to store and process data, Hadoop enables
distributed parallel processing of huge amounts of data across inexpensive,
industry-standard servers that both store and process the data, and it scales
out by adding more servers. With Hadoop, no data is too big. And in today’s
hyper-connected world, where more and more data is being created every day,
Hadoop’s advantages mean that businesses and organizations can now find value
in data that was until recently considered useless.
Apache Hadoop is an open source software framework, written in Java, for
distributed storage and distributed processing of very large data sets (Big
Data) on computer clusters built from commodity hardware. It enables businesses
to quickly gain insight from massive amounts of structured and unstructured
data. All the modules in
Hadoop are designed with a fundamental assumption that hardware failures (of
individual machines, or racks of machines) are commonplace and thus should be
automatically handled in software by the framework.
Key Features
1)Scalable
Hadoop is a highly
scalable storage platform, because it can store and distribute very large data
sets across hundreds of inexpensive servers that operate in parallel. Unlike
traditional relational database systems (RDBMS) that can’t scale to process
large amounts of data, Hadoop enables businesses to run applications on
thousands of nodes involving thousands of terabytes of data.
2)Cost effective
Hadoop also offers a cost effective storage solution for businesses’ exploding
data sets. The problem with traditional relational database management systems
is that scaling them to process such massive volumes of data is extremely cost
prohibitive. In an effort to reduce costs, many companies in the past had to
down-sample data and classify it based on assumptions about which data was the
most valuable. The raw data was deleted, as it was too expensive to keep. While
this approach may have worked in the short term, it meant that when business
priorities changed, the complete raw data set was no longer available.
Hadoop, on the other hand, is designed as a scale-out architecture that can affordably store all of a
company’s data for later use. The cost savings are staggering:
instead of costing thousands to tens of thousands of pounds per terabyte,
Hadoop offers computing and storage capabilities for hundreds of pounds per
terabyte.
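As a rough worked example of the arithmetic above, the sketch below compares
the two price ranges for one data set. The per-terabyte prices and the data set
size are assumed values picked for illustration from within the ranges quoted
above; they are not measured costs.

```python
# Illustrative per-terabyte prices in pounds (GBP). These are assumed example
# values chosen from the ranges quoted in the text, not real price quotes.
TRADITIONAL_COST_PER_TB = 10_000  # "thousands to tens of thousands"
HADOOP_COST_PER_TB = 500          # "hundreds of pounds"


def storage_cost(terabytes: int, cost_per_tb: int) -> int:
    """Return the total cost in pounds of storing `terabytes` of data."""
    return terabytes * cost_per_tb


data_tb = 200  # hypothetical size of the raw data set, in terabytes
traditional = storage_cost(data_tb, TRADITIONAL_COST_PER_TB)
hadoop = storage_cost(data_tb, HADOOP_COST_PER_TB)

print(f"Traditional: GBP {traditional:,}")
print(f"Hadoop:      GBP {hadoop:,}")
print(f"Ratio:       {traditional // hadoop}x")
```

With these assumed figures, keeping all 200 terabytes costs 2,000,000 pounds on
the traditional system and 100,000 pounds on Hadoop, which is the kind of gap
that makes keeping the raw data affordable.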
3)Flexible
Hadoop enables
businesses to easily access new data sources and tap into different types of data
(both structured and unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights from data
sources such as social media, email conversations or clickstream data. In
addition, Hadoop can be used for a wide variety of purposes, such as log
processing, recommendation systems, data warehousing, marketing campaign
analysis and fraud detection.
4)Fast
Hadoop’s storage method is based on a distributed file system that keeps track
of where each piece of data is located on the cluster. The tools for data
processing often run on the same servers where the data is located, which
avoids moving data across the network and results in much faster processing. If
you’re dealing with large volumes of unstructured data, Hadoop can efficiently
process terabytes of data in minutes, and petabytes in hours.
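The idea of processing data on the servers that already hold it can be sketched
in a few lines of plain Python. This is a toy simulation, not Hadoop code: the
cluster dictionary, the node names and the two helper functions are all
hypothetical and exist only to show the map-then-merge pattern.

```python
from collections import Counter

# Hypothetical cluster: each node holds some blocks of a log file locally.
cluster = {
    "node-1": ["error warn info", "info info"],
    "node-2": ["error error"],
    "node-3": ["warn info"],
}


def map_on_node(blocks):
    """Count words in the blocks stored on one node (runs where the data is)."""
    counts = Counter()
    for block in blocks:
        counts.update(block.split())
    return counts


def reduce_counts(partials):
    """Merge the small per-node results into one overall total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each node processes only its own blocks; only the small counts are combined.
partials = [map_on_node(blocks) for blocks in cluster.values()]
totals = reduce_counts(partials)
print(dict(totals))
```

Only the small per-node counts travel between machines in this sketch, while
the bulky blocks stay where they are stored.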
5)Resilient to failure
A key advantage of
using Hadoop is its fault tolerance. When data is sent to an individual node,
that data is also replicated to other nodes in the cluster, which means that in
the event of failure, there is another copy available for use.
The MapR distribution goes beyond that by eliminating the NameNode and
replacing it with a distributed no-NameNode architecture that provides true
high availability. Our architecture provides protection from both single and
multiple failures.
When it comes to
handling large data sets in a safe and cost-effective manner, Hadoop has the
advantage over relational database management systems, and its value for any
size business will continue to increase as unstructured data continues to grow.
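The replication behaviour described in this section can be illustrated with a
small simulation. It is a sketch under stated assumptions: it uses three copies
per block (the usual HDFS default) and a simple round-robin placement, whereas
real Hadoop placement is rack-aware. The function and node names are
hypothetical.

```python
def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes for a block (simple round-robin).

    Assumption: real HDFS placement is rack-aware; this is a simplification.
    """
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items() if set(holders) - failed]


nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
placement = {b: place_replicas(b, nodes) for b in range(10)}

# Lose two nodes at once: with three copies, every block keeps a live replica.
failed = {"node-2", "node-3"}
survivors = readable_blocks(placement, failed)
print(len(survivors), "of", len(placement), "blocks readable")
```

With three copies of every block, losing any two nodes still leaves at least
one copy of each block. Losing three nodes can make some blocks unreadable,
which is why failed nodes need to be replaced and blocks re-replicated.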
6)Distributed Metadata
The default Hadoop architecture uses a single NameNode to store the metadata.
This forces all metadata requests through a single bottleneck, and limits
clusters to 50-200 million files. It also creates a single point of failure
(SPOF). If the NameNode were to fail, the entire cluster would be unusable.
Other distributions try to sidestep the problem by using a secondary NameNode.
A secondary NameNode runs as a helper to the primary NameNode, and only copies
metadata from it on a periodic basis. This means that those depending on a
secondary NameNode cannot trust its copy to be up to date.
The only real solution
to the NameNode problem is to remove it. With the MapR Distribution no-NameNode
solution, there are no practical limits to the number of files that can be
stored on MapR. This foundational change in the Hadoop architecture distributes
the metadata amongst several nodes, which is illustrated below.
In addition to its benefits for dependability, distributed metadata also gives
a remarkable database performance boost. With only commodity hardware, you can
gain 10-20 times the performance of other distributions that use a centralized
metadata structure.
This feature is an architectural improvement to Hadoop that MapR introduced
early on. The dependability and performance it adds to our offering set it
apart from competitive offerings.
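To make the contrast in this section concrete, here is a toy comparison of one
central metadata server and metadata spread over several nodes. It is only an
illustration of the general idea, not a description of how MapR implements it:
the hash-based assignment, the node names and both functions are assumptions
made for this sketch.

```python
import hashlib


def owner(path, metadata_nodes):
    """Map a file path to the metadata node responsible for it.

    Assumption: a plain hash of the path picks the node. This is an
    illustrative scheme only, not MapR's actual mechanism.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(metadata_nodes)
    return metadata_nodes[index]


def lookups_served(paths, metadata_nodes, failed):
    """Count the paths whose metadata node is still up after failures."""
    return sum(1 for p in paths if owner(p, metadata_nodes) not in failed)


paths = [f"/data/file-{i}" for i in range(1000)]

central = ["namenode"]                            # single metadata server
spread = ["meta-1", "meta-2", "meta-3", "meta-4"]  # distributed metadata

# One failure takes out every lookup in the centralized layout...
print(lookups_served(paths, central, failed={"namenode"}))
# ...but only the failed node's share of lookups in the distributed layout.
print(lookups_served(paths, spread, failed={"meta-1"}))
```

In this sketch the failed node's share of the metadata is still unavailable, so
a real system would also need to replicate each share of the metadata across
nodes to survive failures completely.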
7)Low Latency
Your Hadoop infrastructure needs to be fast. Equally important, it needs to
stay that way. A dirty secret among many Hadoop distributions is the staggering
volatility in performance and latency. The MapR M7 disk strategy avoids the
compactions and defragmentation that can affect performance. Because of this,
MapR M7 achieves 5x better performance, with low 95th and 99th percentile
latencies. The graph below compares the performance and latency consistency of
the MapR M7 Edition with those of other Hadoop distributions.
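For readers unfamiliar with percentile latencies, the short example below shows
why the 95th and 99th percentiles are reported instead of an average. The
latency samples are made-up numbers, and the helper uses the simple
nearest-rank definition of a percentile; monitoring tools may use other
definitions.

```python
def percentile(samples, p):
    """Nearest-rank percentile for a whole-number p between 1 and 100.

    Returns the smallest sample such that at least p percent of all samples
    are less than or equal to it.
    """
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds: mostly fast, a few slow.
latencies_ms = [5] * 90 + [8] * 5 + [40] * 4 + [900]

mean = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean} ms, p95={p95} ms, p99={p99} ms")
```

Here a single 900 ms outlier pulls the mean up to 15.5 ms even though 95% of
requests finish within 8 ms and 99% within 40 ms. Tracking the high percentiles
shows how slow the worst requests are, which an average alone hides.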
8)High Availability
High availability (HA) refers to the capability of a Hadoop system to continue
functioning despite multiple system failures. For companies running
mission-critical applications, HA is a necessity.
The best way to ensure
that your distributed system is highly available is by using an architecture
that distributes the metadata. The MapR architecture increases performance and
removes the SPOF.
The MapR Distribution for Hadoop
provides high availability with self-healing and support for multiple failures.
This means that your Hadoop infrastructure will be accessible during system
failures, system upgrades and data recoveries.
Snapshots
Other distributions use
the HDFS snapshot system, which has several downsides when compared to the MapR
Distribution for Hadoop:
9)True Point-In-Time
HDFS snapshots only
capture data that is closed at the time the snapshot is taken. If you are using
snapshots as an automated recovery system, you will have no guarantees that the
data is complete. With MapR, you can perform point-in-time recovery of all
files and tables, whether they are open or not.
10)Supports All Applications
MapR Snapshots support
all Hadoop applications by default.
11)No Data Duplication
MapR snapshots never duplicate your data; they share the same storage as your
live data. This allows clients to capture snapshots of a 1 petabyte cluster in
just seconds.
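The general idea of a snapshot that shares storage with live data can be
sketched as follows. This is a toy model of the concept, not MapR's
implementation: the Volume class and its methods are hypothetical, and the
"blocks" are just Python bytes objects.

```python
class Volume:
    """Toy volume: file names map to immutable data blocks."""

    def __init__(self):
        self.files = {}      # live view: file name -> bytes block
        self.snapshots = {}  # snapshot label -> copy of the name map

    def write(self, name, data):
        # Writing points the name at a new block; old blocks are untouched.
        self.files[name] = data

    def snapshot(self, label):
        # Copy only the name -> block references, never the blocks themselves,
        # so the cost depends on the number of files, not on the data size.
        self.snapshots[label] = dict(self.files)


vol = Volume()
vol.write("a.log", b"x" * 1_000_000)
vol.snapshot("monday")

# The snapshot and the live view refer to the very same block.
shared = vol.snapshots["monday"]["a.log"] is vol.files["a.log"]
print(shared)

# Later writes change the live view only; the snapshot keeps the old block.
vol.write("a.log", b"new contents")
print(len(vol.snapshots["monday"]["a.log"]), len(vol.files["a.log"]))
```

Because taking a snapshot copies references instead of data, it is fast
regardless of how much data the volume holds, and storage is only consumed when
live data later diverges from the snapshot.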
