Monday, 26 January 2015

The Hadoop ecosystem

Hadoop has evolved into a complex ecosystem, consisting of several open source products, frameworks, and initiatives under the Apache Software Foundation.

This article aims to provide a structured overview of the ecosystem.

The diagram is organized in four tiers:
  • Management / Configuration: Central functions to manage resources and configurations
  • Storage: Functions for the (physical) storage of data
  • Data Transfer: Functions involved with the transfer of data between nodes
  • Processing: Functions involved in the processing (extracting data/knowledge) from raw data

A brief description of each Apache project is given below the diagram.
For more detailed information, please refer to the apache.org website.

[Diagram: the Hadoop ecosystem, organized into the four tiers listed above]

Management
ZooKeeper = Distributed configuration service, synchronization service, and naming registry for large distributed systems
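
To make ZooKeeper's role concrete, here is a minimal Java sketch that stores a configuration value as a znode and reads it back. The ensemble address, znode path, and value are hypothetical, and error handling is omitted.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect to a ZooKeeper ensemble (hypothetical host:port).
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();  // wait until the session is established

            // Store a configuration value as a persistent znode.
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any other client in the ensemble now sees the same value.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }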
 
Cluster Resource Management
YARN = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Mesos *) = Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively
Ambari = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters

Workflow
Oozie = Workflow scheduler system to manage Hadoop jobs
Cascading = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs

Storage
HDFS = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
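
As an illustration, a minimal sketch of writing a file to HDFS through the Hadoop FileSystem Java API; the path is hypothetical, and fs.defaultFS is assumed to be set in core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS replicates its blocks across the cluster.
            Path file = new Path("/user/demo/hello.txt");  // hypothetical path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello HDFS");
            }

            System.out.println("exists: " + fs.exists(file));
        }
    }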
 
NoSQL
HBase = Non-relational, distributed database (a client sketch in Java follows this list)
Accumulo = Sorted, distributed key/value store based on Google's BigTable technology
Cassandra *) = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
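
As referenced above, a minimal HBase client sketch in Java (the pre-1.0 HTable API, matching the era of this post); the table 'users' and column family 'info' are hypothetical and must already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Assumes a table 'users' with column family 'info' already exists.
            HTable table = new HTable(conf, "users");

            // Write one cell: row 'row1', column 'info:name'.
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));

            table.close();
        }
    }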


Data Transfer
Avro = Remote procedure call and data serialization framework developed within Apache's Hadoop project
Thrift *) = Interface definition language and binary communication protocol used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development"
Kafka *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs. A producer sketch in Java follows this list.
Sqoop = Command-line interface application for transferring data between relational databases and Hadoop
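
To show the producer side of Kafka, a minimal sketch using the Kafka Java producer client; the broker address, topic name, key, and value are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Append one record to the 'events' topic (hypothetical).
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "user42", "page_view"));
            }
        }
    }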


Processing
Solr *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling
Drill *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
Tez = Application framework which allows for a complex directed acyclic graph (DAG) of tasks for processing data. It is currently built atop Apache Hadoop YARN
MapReduce = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
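
The canonical illustration of the MapReduce model is word count: map emits a (word, 1) pair for every word, and reduce sums the counts per word. A minimal sketch of both functions in the Hadoop Java API (job configuration omitted):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum all counts emitted for the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }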
 
Stream Processing
S4 = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
Flume = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Storm = Distributed real-time computation system
 
In Memory Processing
Flink = Fast and reliable large-scale data processing engine
Spark = Fast and general engine for large-scale data processing.
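
For comparison with MapReduce, the same word count expressed in Spark's Java RDD API, where intermediate results stay in memory. A minimal sketch, assuming the Spark 2.x API (in Spark 1.x the flatMap lambda returns an Iterable rather than an Iterator); the input and output paths are hypothetical.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical input path on HDFS.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile("hdfs:///user/demo/output");  // hypothetical path
            sc.close();
        }
    }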
 
Functions on top of MapReduce
RHadoop = Collection of R packages for using MapReduce functions from R
Hive = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis (a JDBC query sketch follows this list)
Crunch = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
Mahout = Scalable machine learning library
Pig = High-level platform for creating MapReduce programs
Giraph = Iterative graph processing system built for high scalability
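
As referenced above, an example of querying Hive from Java via the HiveServer2 JDBC driver; the host, credentials, and the 'sales' table are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) HiveServer2 instance.
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = con.createStatement()) {

                // Hive compiles this SQL into MapReduce (or Tez) jobs.
                ResultSet rs = stmt.executeQuery(
                        "SELECT category, COUNT(*) FROM sales GROUP BY category");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }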
 
*) These Apache projects are not part of Hadoop itself
 
