This article gives a structured overview of the Apache Hadoop ecosystem.
The diagram is organized in four tiers:
- Management / Configuration: Central functions to manage resources and configurations
- Storage: Functions for the (physical) storage of data
- Data Transfer: Functions involved with the transfer of data between nodes
- Processing: Functions involved in processing raw data (extracting data/knowledge from it)
A brief description of each Apache function is provided below the diagram.
For more detailed information, I would like to refer you to the apache.org website.
Management
ZooKeeper = Distributed configuration service, synchronization service, and naming registry for large distributed systems
Cluster Resource Management
YARN = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Mesos *) = Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively
Ambari = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters
Workflow
Oozie = Workflow scheduler for Hadoop jobs
Cascading = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs
Storage
HDFS = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
NoSQL
HBase = Non-relational, distributed database
Accumulo = Sorted, distributed key/value store based on the BigTable technology from Google
Cassandra *) = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
Data Transfer
Avro = Remote procedure call and data serialization framework developed within Apache's Hadoop project
Thrift *) = Interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development"
Kafka *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
Sqoop = Command-line interface application for transferring data between relational databases and Hadoop
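To make the remark that Kafka's design is "heavily influenced by transaction logs" a bit more concrete, here is a minimal sketch of the underlying idea: an append-only log in which every record gets a sequential offset and each reader keeps track of its own position. This is not Kafka's API; the class `AppendOnlyLog` and its methods are hypothetical names used purely for illustration.

```python
class AppendOnlyLog:
    """Toy in-memory append-only log (illustration only, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = AppendOnlyLog()
log.append("page_view user=1")
log.append("page_view user=2")

# Each consumer remembers its own offset, so two consumers can read
# the same log independently and at their own pace.
consumer_offset = 0
batch = log.read(consumer_offset)
consumer_offset += len(batch)
```

Because records are never modified in place, a reader that falls behind or restarts can simply resume from its last known offset.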
Processing
Solr *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling
Drill *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
Tez = Application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN
MapReduce = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
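The MapReduce programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only shows the three phases (map, shuffle, reduce); it does not use Hadoop, and the function names are my own choices for illustration, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
```

On a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle step moves the data between them over the network.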
Stream Processing
S4 = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
Flume = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Storm = Distributed real-time computation system
In Memory Processing
Flink = Fast and reliable large-scale data processing engine
Spark = Fast and general engine for large-scale data processing.
Functions on top of MR
R-Hadoop = Library to use MapReduce functions in R
Hive = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Crunch = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
Mahout = Scalable machine learning library
Pig = High-level platform for creating MapReduce programs
Giraph = Iterative graph processing system built for high scalability
*) These Apache functions are not Hadoop functions