maandag 26 januari 2015

The Hadoop ecosystem

Hadoop has evolved into a complex ecosystem, consisting of several open source products , frameworks and initiatives by the Apache organization.

This article aims to provide an overview of the eco-system in a structured way.

The diagram is organized in four tiers:
  • Management / Configuration: Central functions to manage resources and configurations
  • Storage: Functions for the (fysical) storage of data
  • Data Transfer: Functions involved with the transfer of data between nodes
  • Processing: Functions involved in the processing (extracting data/knowledge) from raw data

Of each Apache function the brief description will be provided below the diagram.
For more detailed information I would like to refer the apache.org website.

 

Management
ZooKeeper = Distributed configuration service, synchronization service, and naming registry for large distributed systems
 
Cluster Resource Management
YARN = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
Mesos*)= Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
Ambari = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters

Workflow
Oozie = WorkFlow for Hadoop Jobs
Cascading = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs

Storage
HDFS = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
 
No SQL
HBase  = Non-relational, distributed database
Accumolo = Sorted, distributed key/value store based on the BigTable technology from Google
Cassandra*) = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure


Data Transfer
AVRO = Remote procedure call and data serialization framework developed within Apache's Hadoop project
Thrift *) = Interface definition language and binary communication protocol[1] that is used to define and create services for numerous languages.[2] It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development
Kafka *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
Sqoop = Command-line interface application for transferring data between relational databases and Hadoop


Processing
Solr *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling
Drill *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
Tez = Application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN
MapReduce = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster
 
Stream Processing
S4 = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data
Flume = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Storm  = Distributed real-time computation system
 
In Memory Processing
Flink = Fast and reliable large-scale data processing engine
Spark = Fast and general engine for large-scale data processing.
 
Functions on top of MR
R-Hadoop = Library to use MapReduce functions in R
Hive = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Crunch = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce
Mahout = Scalable machine learning library
Pig = HighLevel platform to create MapReduce programs
Giraph = Iterative graph processing system built for high scalability
 
*): These Apache functions are not Hadoop functions
 

dinsdag 20 januari 2015

How to approach Analytics: A process overview


We are on the verge of the Data Science era and Big Data/Analytics is already a game changer.

Analytics involves a variety of concepts, technologies and competences. Think of Big Data, No SQL databases, In Memory processing, Data Mining and Statistics. To leverage this complex field of technology and concepts a sound Analytics Architecture and Analytics Process are required.
This article will provide a brief overview of both.

 Analytics Architecture

The Analytics Architecture should be an integral part of the Enterprise Architecture.

In TOGAF © ADM for instance the Preliminary and Architecture Vision phases would be extended to describe the global Analytics objectives, principles and required capabilities.
On Business Architecture level Analytics likely impacts the Business Process Model, Organization Model and even Business Functions through the implementation of any Analytics results. For input Analytics will heavily depend on the Business Data Model.
On Information Systems level Analytics may use and extend data structures and define processes for the collection of (raw) data.
The Technology Architecture level describes the infrastructure and platform functions required to extract, transport, store, analyze and visualize Analytics data. (Think of tools like Hadoop HDFS clusters, or no SQL databases like MongoDB for storage, Hadoop MapReduce or Spark for extraction, and R, SAS or IBM/SPSS for analysis).

After describing the current and target Analytics Architectures a roadmap and migration planning could be created to define the project(s) in order to put the Data Science capability in place.

Analytics Process

The proposed Analytics Process consists of four steps:
-        Collect (Raw) Data
-        Analyze
-        Visualize
-        Evaluate & Act
 
 

 
 1.      Raw Data Collection

All Data Science initiatives begin (logically) with raw data.
Obviously we have the structured data from our traditional business systems at our disposal. And usually we have some kind of Data Warehouse with historical (aggregated) data in a dimensional format. In addition to BI does Data Science also consider less structured data like documents, mails and log-files.

Just recently Big Data technologies (like Hadoop) have enabled us to collect, store and process quantities of data of any type that we were not able to capture with conventional means. This has opened to possibility to capture all kinds of additional data from sources like the web, social media, intelligent communicative devices like sensors and wearables (“Internet of Things”) and third parties (for example through open and commercial API-s).

In advance we may not know exactly what data to use and when to use it. Ideally we would like to store everything that has potential informational value. However even with Big Data tooling and affordable infrastructure a clear objective is required to not drown in the data lake we might try to create. This business objective will help us decide which particular data sources and data to select.

Data Collection requires access to all concerned data-sources (either through a data-pull -e.g. API-s- or by push, e.g. data streams), the infrastructure to transport and store the data and the configurations, processes and tools to extract and process the data.

This demands for clear business objectives, knowledge of the data sources, skills of the concerned technology and the ability to tie everything together.

 2.      Data Analysis

When raw data is available we may start extracting knowledge out of it (that is all that Data Science is about, isn’t it?).

The Analyze Data step is a process in itself, consisting of multiple sub-steps:

1.Define Objective

Analysis starts by formulating the business challenges that need to be tackled by the Data Analysis.
Do we need just business insights? Do we want to detect anomalies? Do we require to classify our customers so it is easier to decide on the policy to apply on a new client? Do we aim to analyze their sentiments? Or do we maybe even want predict their purchasing-behavior for the next month?
And if we have the challenges clear: which kinds of measures and KPI-s to we need to be able to tackle them: Incidents? Revenue? Transactions? Margins? Churn? 
This is a job typically to be performed by a business Analyst, together with stakeholders with specific business domain knowledge.

Based on the results a plan ought to be formulated. This could be anything between a simple pragmatic list of steps and a complete project plan, depending on strategic importance, expected effort and risk.

2. Gather Data

Once we know which data we globally need we may extract it from our stored raw data into datasets suitable to use with our analytics tool(s). For instance: We may extract data from Hadoop HDFS using MapReduce into a CSV-file that can be read by SAS. Besides internal data we even could extract data directly from external sources, like social media.
Usually this “gather” activity requires cooperation of the data store administrator(s) and the Data Scientist.

3. Explore Data

Next we need to explore the data using the Analytics tools. We need to understand the semantics of each field and the relevance of them for the analysis. Further we ought to determine other characteristics of each field, like the type, ranges, distributions and the quality of the data.
This exploration is usually done by the Data Scientist.

4. Prepare Data

Depending on our findings in the data exploration phase we may need to prepare the data in such a way it can be used for the actual modelling phase. We may have to clean the data, generate new data fields by calculations or aggregations, transform data into convenient formats and integrate and combine multiple separate datasets.
Once the data is ready to be used the analyst may take samples for “training” (modelling) purposes and for testing.
The data preparation is usually done by the Data Scientist. Usually it takes considerable time related to the other steps in the analysis (A percentage of 50% is not uncommon).

5. Create Model

The Analytics tools usually offer a set of modelling techniques that the Data Scientist Analysis may use. Examples are decision trees, clusters or regression models. The analyst picks one or few of them that seem suitable for the characteristics of the data and the business objectives. After applying and configuring them to the set of training data he/she selects the one that leads to the best results.
The deliverable is usually a pattern or formula that may be used for predictions or classifications in future cases.

6. Test Results

The findings in the Model step need to be evaluated with a different set of data than used to find the pattern to see if it is general applicable.
For that purpose usually one or more test samples are used that were created in the “Prepare Data” step. The findings can also be tested against new real-life cases.

3.      Visualize Results

The results from the Analysis phase should be communicated to decision makers and other stakeholders in a comprehensive way.
This may be through a simple set of charts in a presentation or document, or even through fancy graphics or animations (often delivered by specialized graphical designers) in case the idea needs more “persuasion-power”, or should be distributed over a broader audience.
The key is that the results should be presented in clear way.

4.      Evaluate and Act

The last step is to evaluate the results in real-life situations.

When the models and patterns are sufficiently reliable architects, process owners or other decision makers may decide to alter related business processes, rules or even strategy. This will be separate projects in the BPM profession.
 
The last step in the process is to evaluate the Analytics process itself, usually through a retrospective over the project. This may result in improvements in the Analytics Architecture or Process.

Conclusion

Big Data and Analytics are a complex field of technology and skills.
A sound Architecture and Process –both introduced above- will help to get the best out of it!

We wish you good mining!