
Monday, 26 January 2015

The Hadoop ecosystem

Hadoop has evolved into a complex ecosystem, consisting of numerous open source products, frameworks and initiatives under the Apache organization.

This article aims to provide a structured overview of that ecosystem.

The diagram is organized into four tiers:
  • Management / Configuration: Central functions to manage resources and configurations
  • Storage: Functions for the (physical) storage of data
  • Data Transfer: Functions involved with the transfer of data between nodes
  • Processing: Functions involved in processing raw data (extracting information/knowledge from it)

A brief description of each Apache function is provided below the diagram.
For more detailed information, please refer to the apache.org website.

[Diagram: the Hadoop ecosystem, organized into four tiers]

Management
ZooKeeper = Distributed configuration service, synchronization service, and naming registry for large distributed systems
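
As a small illustration, here is a minimal sketch of the ZooKeeper Java client storing and reading back a piece of shared configuration. The connection string, znode path and value are assumptions made for this example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address and timeout are assumptions)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Publish a piece of configuration as a znode
        zk.create("/config-demo", "maxWorkers=8".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read the same value
        byte[] data = zk.getData("/config-demo", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}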
 
Cluster Resource Management
YARN = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Mesos *) = Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
Ambari = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters

Workflow
Oozie = Workflow scheduler for Hadoop jobs
Cascading = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs

Storage
HDFS = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
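
To make the role of HDFS concrete, the sketch below writes and reads a file through the Hadoop FileSystem API. The NameNode URI and file path are assumptions for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally picked up from core-site.xml; this URI is an assumption
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks replicated across DataNodes
        Path path = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back from the cluster
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}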
 
NoSQL
HBase = Non-relational, distributed database modeled on Google's BigTable (a client sketch follows this list)
Accumulo = Sorted, distributed key/value store based on the BigTable technology from Google
Cassandra *) = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
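
As referenced above, the sketch below uses the HBase 1.x Java client to illustrate the key/value model these stores share: a value is addressed by row key, column family and column qualifier. The table name, column family and values are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by the same coordinates
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}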


Data Transfer
Avro = Remote procedure call and data serialization framework developed within Apache's Hadoop project
Thrift *) = Interface definition language and binary communication protocol used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development"
Kafka *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs (a producer sketch follows this list)
Sqoop = Command-line interface application for transferring data between relational databases and Hadoop
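
As referenced above, here is a minimal sketch of a Kafka producer appending records to a topic's log, using the Java producer client (Kafka 0.8.2+). The broker address, topic name and record contents are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an assumption for this sketch
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to a partition of the topic's log;
            // the key ("user42") determines which partition
            producer.send(new ProducerRecord<>("clickstream", "user42", "page=/home"));
        }
    }
}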


Processing
Solr *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling
Drill *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
Tez = Application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN
MapReduce = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
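
The canonical example of the model is word count: the map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word. The sketch below follows the standard Hadoop example; input and output paths are passed as arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}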
 
Stream Processing
S4 = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
Flume = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Storm  = Distributed real-time computation system
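
A minimal Storm topology sketch: a spout emits an unbounded stream of words, and a bolt consumes them. It uses the pre-1.0 backtype.storm package names that were current when this post was written (Storm 1.0 later renamed them to org.apache.storm); the word list and topology name are assumptions.

import java.util.Map;
import java.util.Random;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class StormSketch {

    // Spout: source of an unbounded stream (here: random words)
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"hadoop", "storm", "kafka"};
        private final Random rand = new Random();

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100); // throttle the demo stream
            collector.emit(new Values(words[rand.nextInt(words.length)]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: processes each tuple as it streams past (here: just print it)
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("word"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        builder.setBolt("print", new PrintBolt()).shuffleGrouping("words");

        // Run in-process for demonstration; real topologies are submitted to a cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(5000);
        cluster.shutdown();
    }
}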
 
In Memory Processing
Flink = Fast and reliable large-scale data processing engine
Spark = Fast and general engine for large-scale data processing.
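
For comparison with the MapReduce version above, here is the same word count in Spark's Java API, where intermediate results are kept in memory rather than written to disk between stages. This sketch assumes the Spark 1.x signature of flatMap (returning an Iterable); the input path is an assumption.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs in-process; on a cluster the master comes from the submit command
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt"); // path is an assumption
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" "))) // Spark 1.x: Iterable
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b); // sums stay in memory across stages

        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t._1() + ": " + t._2());
        }
        sc.stop();
    }
}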
 
Functions on top of MR
RHadoop = Libraries for using MapReduce functions from R
Hive = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis (a JDBC query sketch follows this list)
Crunch = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
Mahout = Scalable machine learning library
Pig = High-level platform for creating MapReduce programs
Giraph = Iterative graph processing system built for high scalability
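
As referenced above, here is a minimal sketch of querying Hive over JDBC; Hive compiles the SQL-like query into MapReduce jobs behind the scenes. The HiveServer2 URL, credentials and the weblogs table are assumptions for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, credentials and table are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            // Behind this loop, Hive has run one or more MapReduce jobs
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}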
 
*) These Apache projects are not part of Hadoop itself.