Data Science, Analytics en Big Data<br />
Harald van der Weel<br />
<br />
<span style="font-size: large;">The Hadoop ecosystem</span><br />
Published 2015-01-26<br />
<br />
Hadoop has evolved into a complex ecosystem of open source products, frameworks and initiatives, largely from the Apache Software Foundation.<br />
<br />
This article aims to provide a structured overview of the ecosystem.<br />
<br />
The diagram is organized in four tiers:<br />
<ul>
<li><strong>Management / Configuration</strong>: Central functions to manage resources and configurations</li>
<li><strong>Storage</strong>: Functions for the (physical) storage of data</li>
<li><strong>Data Transfer</strong>: Functions involved with the transfer of data between nodes</li>
<li><strong>Processing</strong>: Functions involved in the processing (extracting data/knowledge) from raw data</li>
</ul>
<br />
A brief description of each component is given below the diagram.<br />
For more detailed information, please refer to the apache.org website.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-UavqT_WmrHo/VMZJD-8LZjI/AAAAAAAAAXs/dD6J2S-8I5s/s1600/Hadoop%2Beco%2Boverview%2BPNG.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-UavqT_WmrHo/VMZJD-8LZjI/AAAAAAAAAXs/dD6J2S-8I5s/s1600/Hadoop%2Beco%2Boverview%2BPNG.png" height="507" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-size: large;"></span><br />
<span style="font-size: large;">Management</span><br />
<div>
<strong>ZooKeeper</strong> = Distributed configuration service, synchronization service, and naming registry for large distributed systems</div>
<div>
<strong><em></em></strong> </div>
<div>
<strong><em>Cluster Resource Management</em></strong></div>
<div>
<strong>YARN</strong> = Resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications</div>
<div>
<strong>Mesos *)</strong> = Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to be built easily and run effectively.</div>
<div>
<strong>Ambari</strong> = Software for provisioning, managing, and monitoring Apache Hadoop (YARN) clusters</div>
<div>
<strong><em></em></strong><br />
<strong><em>Workflow</em></strong> </div>
<div>
<strong>Oozie</strong> = Workflow scheduler for Hadoop jobs</div>
<div>
<strong>Cascading</strong> = Software abstraction layer for Apache Hadoop. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs</div>
<br />
<span style="font-size: large;">Storage</span><br />
<div>
<strong>HDFS</strong> = Distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster</div>
<div>
<em><strong></strong></em> </div>
<div>
<em><strong>NoSQL</strong></em></div>
<div>
<strong>HBase</strong> = Non-relational, distributed database</div>
<div>
<strong>Accumulo</strong> = Sorted, distributed key/value store based on the BigTable technology from Google</div>
<div>
<strong>Cassandra*)</strong> = Distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure</div>
<br />
<br />
<span style="font-size: large;">Data Transfer</span><br />
<div>
<strong>Avro</strong> = Remote procedure call and data serialization framework developed within Apache's Hadoop project</div>
<div>
<strong>Thrift *)</strong> = Interface definition language and binary communication protocol that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development"</div>
<div>
<strong>Kafka</strong> *) = Unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.</div>
<div>
<strong>Sqoop</strong> = Command-line interface application for transferring data between relational databases and Hadoop</div>
<br />
<span style="font-size: large;"></span><br />
<span style="font-size: large;">Processing</span><br />
<div>
<strong>Solr</strong> *) = Enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling</div>
<div>
<strong>Drill</strong> *) = Software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets</div>
<div>
<strong>Tez</strong> = Application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN</div>
<div>
<strong>MapReduce</strong> = Programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster</div>
<div>
</div>
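To make the MapReduce programming model described above more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only imitate the three stages that a real framework distributes over a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data between them; the sketch only shows how the three stages fit together.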
<div>
<em><strong>Stream Processing</strong></em></div>
<div>
<strong>S4</strong> = Distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data</div>
<div>
<strong>Flume</strong> = Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data</div>
<div>
<strong>Storm</strong> = Distributed real-time computation system</div>
<div>
</div>
<div>
<strong><em>In Memory Processing</em></strong></div>
<div>
<strong>Flink</strong> = Fast and reliable large-scale data processing engine</div>
<div>
<strong>Spark</strong> = Fast and general engine for large-scale data processing. </div>
<div>
</div>
<div>
<strong><em>Functions on top of MR</em></strong></div>
<div>
<strong>R-Hadoop</strong> = Library to use MapReduce functions in R</div>
<div>
<strong>Hive</strong> = Data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis</div>
<div>
<strong>Crunch</strong> = Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce</div>
<div>
<strong>Mahout</strong> = Scalable machine learning library</div>
<div>
<strong>Pig</strong> = High-level platform for creating MapReduce programs</div>
<div>
<strong>Giraph</strong> = Iterative graph processing system built for high scalability</div>
<div>
</div>
<div>
<em>*): These Apache projects are not part of Hadoop itself</em></div>
<div>
</div>
<span style="font-size: large;">How to approach Analytics: A process overview</span><br />
Published 2015-01-20<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">We are on
the verge of the Data Science era and Big Data/Analytics is already a game
changer. <o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Analytics
involves a variety of concepts, technologies and competences. Think of Big
Data, NoSQL databases, in-memory processing, Data Mining and Statistics. To
make the most of this complex field, a sound Analytics Architecture and a sound
Analytics Process are required.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">This
article will provide a brief overview of both.<o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p><span style="font-family: Calibri;"> </span></o:p></span><b><span lang="EN-US" style="font-size: 14pt; mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Analytics Architecture<o:p></o:p></span></span></b></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The
Analytics Architecture should be an integral part of the Enterprise
Architecture.<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">In TOGAF ©
ADM, for instance, the Preliminary and Architecture Vision phases would be
extended to describe the global Analytics objectives, principles and required
capabilities.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">At the Business
Architecture level, Analytics is likely to affect the Business Process Model, the
Organization Model and even Business Functions once Analytics results are
implemented. For its input, Analytics depends heavily on the Business
Data Model.<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">At the
Information Systems level, Analytics may use and extend data structures and
define processes for the collection of (raw) data.<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The
Technology Architecture level describes the infrastructure and platform
functions required to extract, transport, store, analyze and visualize
Analytics data (think of tools like Hadoop HDFS clusters or NoSQL databases
like MongoDB for storage, Hadoop MapReduce or Spark for extraction, and R, SAS
or IBM SPSS for analysis).<o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">After
describing the current and target Analytics Architectures, a roadmap and
migration plan can be created to define the project(s) needed to put
the Data Science capability in place.</span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;"></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;"><div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<b><span lang="EN-US" style="font-size: 14pt; mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Analytics Process<o:p></o:p></span></span></b></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The proposed Analytics Process consists of four steps:<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">- Collect (Raw) Data<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">- Analyze<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">- Visualize<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">- Evaluate &amp; Act<o:p></o:p></span></span><br />
<o:p></o:p></span> </span><span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p><span style="font-family: Calibri;"></span></o:p></span> </div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-AjyZAKzcszg/VL5SgTy_SiI/AAAAAAAAAXQ/UpZqQdVEawk/s1600/Analytics%2BOverview%2B-%2B3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-AjyZAKzcszg/VL5SgTy_SiI/AAAAAAAAAXQ/UpZqQdVEawk/s1600/Analytics%2BOverview%2B-%2B3.jpg" height="388" width="640" /></a></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p></o:p></span><br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
</div>
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<b><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">1. Raw Data Collection<o:p></o:p></span></span></b></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">All Data
Science initiatives begin (logically) with raw data.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Obviously
we have the structured data from our traditional business systems at our
disposal, and usually we have some kind of Data Warehouse with historical
(aggregated) data in a dimensional format. Going beyond BI, Data Science
also considers less structured data such as documents, e-mails and log files. <o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Only
recently have Big Data technologies (like Hadoop) enabled us to collect, store
and process quantities of data of any type that we could not capture
with conventional means. This has opened the possibility to capture all kinds of
additional data from sources like the web, social media, intelligent
communicating devices like sensors and wearables (“Internet of Things”) and
third parties (for example through open and commercial APIs).<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">In advance
we may not know exactly what data to use and when to use it. Ideally we would
like to store everything that has potential informational value. However even
with Big Data tooling and affordable infrastructure a clear objective is
required to not drown in the data lake we might try to create. This business
objective will help us decide which particular data sources and data to select.<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Data
Collection requires access to all data sources concerned (either by
pull, e.g. through APIs, or by push, e.g. through data streams), the infrastructure to
transport and store the data, and the configurations, processes and tools to
extract and process the data.<o:p></o:p></span></span></div>
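The difference between pull and push collection can be illustrated with a small sketch. The `Collector` class and its method names are hypothetical, and the in-memory list stands in for whatever storage layer is actually used.

```python
class Collector:
    """Toy raw-data collector that supports both collection styles."""

    def __init__(self):
        # Stand-in for the real raw-data store (for example a file system).
        self.store = []

    def pull(self, fetch):
        """Pull: we call the source (for example an API) and ask for its data."""
        records = list(fetch())
        self.store.extend(records)
        return len(records)

    def on_push(self, record):
        """Push: the source (for example a data stream) calls us per record."""
        self.store.append(record)
```

With pull the collector decides when data is fetched; with push the source decides, so the collector has to be available whenever a record arrives.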
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">This
calls for clear business objectives, knowledge of the data sources, skills in
the technology concerned and the ability to tie everything together.<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<b><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">2. Data Analysis<o:p></o:p></span></span></b></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">When raw
data is available we may start extracting knowledge out of it (that is all that
Data Science is about, isn’t it?).<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The Analyze
Data step is a process in itself, consisting of multiple sub-steps:<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<i><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">1. Define
Objective<o:p></o:p></span></span></i></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Analysis
starts by formulating the business challenges that need to be tackled by the
Data Analysis. <o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Do we need
just business insights? Do we want to detect anomalies? Do we need to
classify our customers so that it is easier to decide which policy to apply to a
new client? Do we aim to analyze their sentiments? Or do we maybe even want
to predict their purchasing behavior for the next month? <o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">And once the
challenges are clear: which kinds of measures and KPIs do we need to be
able to tackle them? Incidents? Revenue? Transactions? Margins? Churn? <o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">This is
typically a job for a Business Analyst, together with stakeholders
with specific business domain knowledge. <o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Based on
the results a plan ought to be formulated. This could be anything between a
simple pragmatic list of steps and a complete project plan, depending on
strategic importance, expected effort and risk.<o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<i><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">2. Gather
Data<o:p></o:p></span></span></i></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Once we
know roughly which data we need, we may extract it from our stored raw data
into datasets suitable for use with our analytics tool(s). For instance, we may
extract data from Hadoop HDFS using MapReduce into a CSV file that can be read
by SAS. Besides internal data, we could even extract data directly from external
sources, like social media.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Usually
this “gather” activity requires cooperation of the data store administrator(s)
and the Data Scientist.<o:p></o:p></span></span><br />
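As an illustration of the "gather" step, the sketch below turns raw records into CSV text that an analytics tool could read. The semicolon-separated input layout and the field names in `FIELDS` are assumptions made for this example only; real raw data will look different, and on a cluster this logic would typically run inside a MapReduce or similar job.

```python
import csv
import io

# Hypothetical record layout, assumed for this example only.
FIELDS = ["date", "customer_id", "amount"]


def parse_record(line):
    """Return a dict for a well-formed raw line, or None for a malformed one."""
    parts = line.strip().split(";")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def extract_to_csv_text(raw_lines):
    """Convert all parsable raw lines to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in raw_lines:
        record = parse_record(line)
        if record is not None:  # silently skip malformed lines
            writer.writerow(record)
    return out.getvalue()
```

Skipping malformed lines keeps the example short; in practice it is usually worth counting or logging them, since they say something about data quality.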
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<i><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">3. Explore
Data<o:p></o:p></span></span></i></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Next we
need to explore the data using the Analytics tools. We need to understand the
semantics of each field and the relevance of them for the analysis. Further we
ought to determine other characteristics of each field, like the type, ranges,
distributions and the quality of the data. <o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">This exploration
is usually done by the Data Scientist.<o:p></o:p></span></span><br />
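A first exploration of a field (type, range, distribution, quality) can be sketched in a few lines. The function `profile_field` is a hypothetical helper that assumes a column arrives as a list of raw strings, as it would after reading a CSV file; dedicated analytics tools offer far richer profiling.

```python
import statistics


def profile_field(values):
    """Summarize one column that is given as a list of raw strings."""
    present = [v for v in values if v.strip() != ""]
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            pass  # not a number; the field will be treated as text
    is_numeric = bool(present) and len(numbers) == len(present)
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "type": "numeric" if is_numeric else "text",
    }
    if is_numeric:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["mean"] = statistics.mean(numbers)
    return profile
```

The share of missing values and the number of distinct values are often the quickest indicators of whether a field is usable for the analysis at all.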
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<i><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">4. Prepare
Data <o:p></o:p></span></span></i></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Depending
on our findings in the data exploration phase we may need to prepare the data
in such a way it can be used for the actual modelling phase. We may have to
clean the data, generate new data fields by calculations or aggregations,
transform data into convenient formats and integrate and combine multiple
separate datasets. <o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Once the
data is ready to be used the analyst may take samples for “training”
(modelling) purposes and for testing. <o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Data
preparation is usually done by the Data Scientist. It usually takes
considerable time relative to the other steps in the analysis (a share of
50% is not uncommon).<o:p></o:p></span></span><br />
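The preparation activities mentioned above (cleaning, deriving new fields, taking training and test samples) might look as follows in a minimal form. The field name `amount`, the derived field `is_large`, the threshold of 100 and the 30% test fraction are all assumptions chosen for illustration.

```python
import random


def prepare(rows):
    """Drop rows without a usable amount and derive a new boolean field."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # cleaning: skip rows with a missing or non-numeric amount
        # Derivation: 'is_large' is a hypothetical calculated field.
        cleaned.append({**row, "amount": amount, "is_large": amount >= 100.0})
    return cleaned


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle reproducibly and split the rows into (train, test) samples."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed makes the split reproducible, which helps when results have to be compared between runs.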
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">5. <em>Create
Model<o:p></o:p></em></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The
Analytics tools usually offer a set of modelling techniques that the Data
Scientist may use. Examples are decision trees, clustering or regression
models. The analyst picks one or a few that seem suitable for the
characteristics of the data and the business objectives. After configuring and
applying them to the training data set, he/she selects the one that leads
to the best results.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The
deliverable is usually a pattern or formula that may be used for predictions or
classifications in future cases.<o:p></o:p></span></span><br />
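The idea of fitting several candidate models to the training data and keeping the best one can be sketched as follows. The two candidates (a constant mean and a least-squares straight line), the mean squared error as selection criterion and all function names are choices made for this example, not a description of any particular analytics tool.

```python
def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Simple least-squares regression: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, xs, ys):
    """Fit every candidate on the training data and return the best one."""
    fitted = {name: fit(xs, ys) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: mse(fitted[name], xs, ys))
    return best, fitted[best]
```

For example, `select_model({"mean": fit_mean, "linear": fit_linear}, xs, ys)` returns the name of the winning candidate together with a function that can be applied to future cases.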
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p><span style="font-family: Calibri;">6. </span></o:p></span><i><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Test
Results<o:p></o:p></span></span></i></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The
findings of the Model step need to be evaluated with a different set of data
than the one used to find the pattern, to see whether it is generally applicable.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">For that
purpose usually one or more test samples are used that were created in the
“Prepare Data” step. The findings can also be tested against new real-life
cases.<o:p></o:p></span></span><br />
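Testing a finding on data that was not used to create it can be sketched like this. The helper names and the tolerated accuracy drop of 0.10 are assumptions for illustration; what counts as "sufficiently reliable" depends on the business objective.

```python
def accuracy(predict, samples):
    """Fraction of (features, label) samples that `predict` labels correctly."""
    correct = sum(1 for features, label in samples if predict(features) == label)
    return correct / len(samples)


def generalizes(predict, train, test, max_drop=0.10):
    """True if test accuracy is at most `max_drop` below training accuracy.

    The default of 0.10 is an arbitrary example threshold.
    """
    return accuracy(predict, train) - accuracy(predict, test) <= max_drop
```

A rule that scores well on the training sample but much worse on the test sample has probably picked up peculiarities of the training data rather than a general pattern.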
<b><span lang="EN-US" style="mso-ansi-language: EN-US; mso-bidi-font-family: Calibri; mso-fareast-font-family: Calibri;"><span style="mso-list: Ignore;"><span style="font-family: Calibri;"></span></span></span></b><br />
<b><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">3. Visualize Results<o:p></o:p></span></span></b><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The results
from the Analysis phase should be communicated to decision makers and other
stakeholders in a comprehensible way.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">This may be
through a simple set of charts in a presentation or document, or even through
fancy graphics or animations (often delivered by specialized graphic
designers) in case the idea needs more “persuasion power” or should be
distributed to a broader audience.<o:p></o:p></span></span><br />
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The key is
that the results are presented in a clear way.</span></span><br />
<b><span lang="EN-US" style="mso-ansi-language: EN-US; mso-bidi-font-family: Calibri; mso-fareast-font-family: Calibri;"><span style="mso-list: Ignore;"><span style="font-family: Calibri;"></span></span></span></b><br />
<b><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">4. Evaluate and Act<o:p></o:p></span></span></b><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">The last
step is to evaluate the results in real-life situations. <o:p></o:p></span></span></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">When the
models and patterns are sufficiently reliable, architects, process owners or
other decision makers may decide to alter related business processes, rules or
even strategy. These changes will typically be separate projects in the BPM domain.</span></span><br />
<span style="font-family: Calibri;"></span> </div>
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Finally,
the Analytics process itself should be evaluated, usually
through a retrospective on the project. This may result in improvements to
the Analytics Architecture or Process.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p><span style="font-family: Calibri;"></span></o:p></span><br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<b><span lang="EN-US" style="font-size: 14pt; mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Conclusion<o:p></o:p></span></span></b></div>
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">Big Data
and Analytics are a complex field of technology and skills.<o:p></o:p></span></span></div>
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">A sound
Architecture and Process, both introduced above, will help to get the best out
of it!<o:p></o:p></span></span><br />
<br />
<div class="MsoNormal" style="margin: 0cm 0cm 0pt;">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: Calibri;">We wish you
good mining!<o:p></o:p></span></span></div>
<span style="font-size: large;">Brainport Eindhoven - Data Science capital of the world?</span><br />
Published 2014-12-06<br />
<span style="font-family: Arial, Helvetica, sans-serif;"><em>(Dutch Data Science Summit 2014 – Eindhoven, 4 Dec. 2014)</em></span><br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">It is already clear that Data Science, Big Data and Analytics are true game changers that will generate business worth trillions of euros over the next decade. This paradigm is closely connected with other relatively new concepts like the Internet of Things (IoT), Cloud, Social Media and Mobility. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">The applications of Data Science are wide-ranging. Examples include Decision Management, Customer Intelligence, Preventive Maintenance, Risk Modelling, Operations Optimization, Security and Fraud Detection, and even psychological and social research. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Data Science is still young, though many elements of the practice are not completely new: Business Intelligence, Statistics and Visualization have been around for some time. However, a proper profile of the “Data Scientist” has not yet been defined. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">It is clear that a Data Scientist should be able to move along three skill axes (Data, Business Context, and Tools &amp; Methods) in order to generate usable intelligence. As this role is new and maybe not even well defined, it will be hard to find sufficient experts to meet the demand, and for the next few years a considerable gap between supply and demand of skilled data scientists is expected.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">The Brainport Eindhoven region (Netherlands) has acknowledged this expected scarcity and has defined an ambitious plan: Data Science Center Eindhoven (DSC/e). Together with educational institutions, government and private companies like Philips and ASML, it aims to bridge this gap over the next few years and to become the world leader in the Data Science field within 1000 days from now.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">The program consists of a number of strategic research programs, such as Process Analytics, Customer Journey, Smart Maintenance, Quantified Self, Data Value &amp; Privacy, Smart Cities and Smart Grids. <br />A second focus is on high-end education covering the entire Data Science skill set at all levels (Bachelor, Master and PhD). A collaboration between Eindhoven University of Technology and Tilburg University has resulted in the Brainport International School of Data Science, which will get chairs in Eindhoven (elected Smartest City of the world, 2011), Tilburg and Den Bosch.</span><br />
<span style="font-family: Arial;"></span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Besides the DSC/e, the Fontys University of Applied Sciences in Eindhoven is setting up an ambitious Big Data curriculum of its own, and collaboration between DSC/e and Fontys is not unimaginable. If Fontys participates as well, the odds are even better for Brainport to become the world leader in Data Science, Big Data and Analytics soon!</span>