
Monday, 26 January 2015

The Hadoop ecosystem

Hadoop has evolved into a complex ecosystem, consisting of numerous open source products, frameworks and initiatives under the Apache organization.

This article aims to provide a structured overview of that ecosystem.

The diagram is organized into four tiers:
  • Management / Configuration: Central functions to manage resources and configurations
  • Storage: Functions for the (physical) storage of data
  • Data Transfer: Functions involved with the transfer of data between nodes
  • Processing: Functions involved in processing raw data (extracting information/knowledge from it)

A brief description of each Apache function is provided below the diagram.
For more detailed information, please refer to the apache.org website.

[Diagram: the Hadoop ecosystem, organized into four tiers]

Management
ZooKeeper = Distributed configuration service, synchronization service, and naming registry for large distributed systems
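
As a small illustration, here is a minimal sketch of the ZooKeeper Java client storing and reading back a piece of shared configuration. The connection string, znode path and value are assumptions made for this example.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address and timeout are assumptions)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Publish a piece of configuration as a znode
        zk.create("/config-demo", "maxWorkers=8".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read the same value
        byte[] data = zk.getData("/config-demo", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}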
 
Cluster Resource Management
YARN = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Mesos *) = Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
Ambari = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters

Workflow
Oozie = Workflow scheduler for Hadoop jobs
Cascading = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs

Storage
HDFS = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
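
To make the role of HDFS concrete, the sketch below writes and reads a file through the Hadoop FileSystem API. The NameNode URI and file path are assumptions for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally picked up from core-site.xml; this URI is an assumption
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks replicated across DataNodes
        Path path = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back from the cluster
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}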
 
NoSQL
HBase = Non-relational, distributed database modeled on Google's BigTable (a client sketch follows this list)
Accumulo = Sorted, distributed key/value store based on the BigTable technology from Google
Cassandra *) = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
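
As referenced above, the sketch below uses the HBase 1.x Java client to illustrate the key/value model these stores share: a value is addressed by row key, column family and column qualifier. The table name, column family and values are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back by the same coordinates
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}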


Data Transfer
Avro = Remote procedure call and data serialization framework developed within Apache's Hadoop project
Thrift *) = Interface definition language and binary communication protocol used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development"
Kafka *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs (a producer sketch follows this list)
Sqoop = Command-line interface application for transferring data between relational databases and Hadoop
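
As referenced above, here is a minimal sketch of a Kafka producer appending records to a topic's log, using the Java producer client (Kafka 0.8.2+). The broker address, topic name and record contents are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an assumption for this sketch
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is appended to a partition of the topic's log;
            // the key ("user42") determines which partition
            producer.send(new ProducerRecord<>("clickstream", "user42", "page=/home"));
        }
    }
}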


Processing
Solr *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling
Drill *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
Tez = Application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN
MapReduce = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
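
The canonical example of the model is word count: the map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word. The sketch below follows the standard Hadoop example; input and output paths are passed as arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum)); // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}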
 
Stream Processing
S4 = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
Flume = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Storm  = Distributed real-time computation system
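
A minimal Storm topology sketch: a spout emits an unbounded stream of words, and a bolt consumes them. It uses the pre-1.0 backtype.storm package names that were current when this post was written (Storm 1.0 later renamed them to org.apache.storm); the word list and topology name are assumptions.

import java.util.Map;
import java.util.Random;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class StormSketch {

    // Spout: source of an unbounded stream (here: random words)
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"hadoop", "storm", "kafka"};
        private final Random rand = new Random();

        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100); // throttle the demo stream
            collector.emit(new Values(words[rand.nextInt(words.length)]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: processes each tuple as it streams past (here: just print it)
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("word"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        builder.setBolt("print", new PrintBolt()).shuffleGrouping("words");

        // Run in-process for demonstration; real topologies are submitted to a cluster
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(5000);
        cluster.shutdown();
    }
}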
 
In Memory Processing
Flink = Fast and reliable large-scale data processing engine
Spark = Fast and general engine for large-scale data processing.
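
For comparison with the MapReduce version above, here is the same word count in Spark's Java API, where intermediate results are kept in memory rather than written to disk between stages. This sketch assumes the Spark 1.x signature of flatMap (returning an Iterable); the input path is an assumption.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs in-process; on a cluster the master comes from the submit command
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("input.txt"); // path is an assumption
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" "))) // Spark 1.x: Iterable
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b); // sums stay in memory across stages

        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t._1() + ": " + t._2());
        }
        sc.stop();
    }
}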
 
Functions on top of MR
RHadoop = Libraries for using MapReduce functions from R
Hive = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis (a JDBC query sketch follows this list)
Crunch = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
Mahout = Scalable machine learning library
Pig = High-level platform for creating MapReduce programs
Giraph = Iterative graph processing system built for high scalability
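
As referenced above, here is a minimal sketch of querying Hive over JDBC; Hive compiles the SQL-like query into MapReduce jobs behind the scenes. The HiveServer2 URL, credentials and the weblogs table are assumptions for this example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; URL, credentials and table are assumptions
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            // Behind this loop, Hive has run one or more MapReduce jobs
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}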
 
*) These Apache projects are not part of Hadoop itself.