We are on the verge of the Data Science era and Big Data/Analytics is already a game changer.
Analytics involves a variety of concepts, technologies and competences: think of Big Data, NoSQL databases, in-memory processing, Data Mining and Statistics. To leverage this complex field of technology and concepts, a sound Analytics Architecture and Analytics Process are required.

This article provides a brief overview of both.
Analytics Architecture

The Analytics Architecture should be an integral part of the Enterprise Architecture.
In the TOGAF® ADM, for instance, the Preliminary and Architecture Vision phases would be extended to describe the global Analytics objectives, principles and required capabilities.
At the Business Architecture level, Analytics is likely to impact the Business Process Model, the Organization Model and even Business Functions through the implementation of any Analytics results. For input, Analytics will heavily depend on the Business Data Model.

At the Information Systems level, Analytics may use and extend data structures and define processes for the collection of (raw) data.
The Technology Architecture level describes the infrastructure and platform functions required to extract, transport, store, analyze and visualize Analytics data (think of tools like Hadoop HDFS clusters or NoSQL databases like MongoDB for storage, Hadoop MapReduce or Spark for extraction, and R, SAS or IBM SPSS for analysis).
After describing the current and target Analytics Architectures, a roadmap and migration planning could be created to define the project(s) needed to put the Data Science capability in place.
Analytics Process
The proposed Analytics Process consists of four steps:
- Collect (Raw) Data
- Analyze
- Visualize
- Evaluate & Act
1. Collect (Raw) Data

All Data Science initiatives begin (logically) with raw data.
Obviously we have the structured data from our traditional business systems at our disposal, and usually we have some kind of Data Warehouse with historical (aggregated) data in a dimensional format. In addition to this BI data, Data Science also considers less structured data like documents, e-mails and log files.
Just recently, Big Data technologies (like Hadoop) have enabled us to collect, store and process quantities of data of any type that we were not able to capture with conventional means. This has opened the possibility to capture all kinds of additional data from sources like the web, social media, intelligent communicative devices like sensors and wearables (the “Internet of Things”) and third parties (for example through open and commercial APIs).
In advance we may not know exactly what data to use and when to use it. Ideally we would like to store everything that has potential informational value. However, even with Big Data tooling and affordable infrastructure, a clear objective is required to avoid drowning in the data lake we might try to create. This business objective will help us decide which particular data sources and data to select.
Data Collection requires access to all data sources concerned (either through a data pull, e.g. APIs, or by push, e.g. data streams), the infrastructure to transport and store the data, and the configurations, processes and tools to extract and process the data.
This demands clear business objectives, knowledge of the data sources, skills in the technologies concerned and the ability to tie everything together.
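As a rough sketch of what such a collection step could look like, the Python snippet below pulls records from a hypothetical REST endpoint and stores them unchanged in a MongoDB collection. The endpoint URL, the paging parameter and the database/collection names are assumptions made purely for illustration.

```python
# Minimal collection sketch: pull raw data from a (hypothetical) REST API
# and store it as-is in MongoDB. Endpoint and names are illustrative only.
import requests
from pymongo import MongoClient

API_URL = "https://api.example.com/v1/events"      # hypothetical data source
client = MongoClient("mongodb://localhost:27017")
raw_store = client["analytics"]["raw_events"]      # raw data lands here untouched

page = 1
while True:
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    records = response.json()
    if not records:
        break                                      # nothing left to pull
    raw_store.insert_many(records)                 # interpretation comes later
    page += 1
```

Comparable pull or push mechanisms would be needed for every selected source; the point is that the raw data is stored without interpretation, so it can serve multiple analyses later on.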
2. Analyze Data

When raw data is available we may start extracting knowledge out of it (that is all that Data Science is about, isn’t it?).
The Analyze Data step is a process in itself, consisting of multiple sub-steps:
1. Define Objective
Analysis starts by formulating the business challenges that need to be tackled by the Data Analysis.
Do we need just business insights? Do we want to detect anomalies? Do we need to classify our customers so it is easier to decide on the policy to apply to a new client? Do we aim to analyze their sentiments? Or do we maybe even want to predict their purchasing behavior for the next month? And once the challenges are clear: which kinds of measures and KPIs do we need to be able to tackle them? Incidents? Revenue? Transactions? Margins? Churn?
This is a job typically performed by a Business Analyst, together with stakeholders with specific business domain knowledge.
Based on the results a plan ought to be formulated. This could be anything between a simple pragmatic list of steps and a complete project plan, depending on strategic importance, expected effort and risk.
2. Gather Data
Once we roughly know which data we need, we may extract it from our stored raw data into datasets suitable for use with our analytics tool(s). For instance, we may extract data from Hadoop HDFS using MapReduce into a CSV file that can be read by SAS. Besides internal data we could even extract data directly from external sources, like social media.
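A minimal sketch of such an extraction, here with Spark rather than hand-written MapReduce and with assumed HDFS paths and field names, could look as follows:

```python
# Minimal gather sketch: read raw JSON from HDFS, keep the relevant fields
# and write a CSV that a tool like SAS or R can read.
# The paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gather-customer-events").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/events/*.json")   # collected raw data

dataset = raw.select("customer_id", "event_type", "amount", "event_date")

# One CSV file (with header) for the analytics tool to pick up
dataset.coalesce(1).write.option("header", True).csv("hdfs:///data/analysis/customer_events")

spark.stop()
```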
Usually this “gather” activity requires cooperation of the data store administrator(s) and the Data Scientist.
3. Explore Data
Next we need to explore the data using the Analytics tools. We need to understand the semantics of each field and its relevance for the analysis. Further, we ought to determine other characteristics of each field, like the type, ranges, distributions and the quality of the data.
This exploration is usually done by the Data Scientist.
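A first exploration could be as simple as the sketch below, which assumes the gathered dataset was delivered as a CSV file with the columns used earlier; any analytics tool (R, SAS, SPSS) offers equivalent functions.

```python
# Minimal exploration sketch with pandas: types, ranges, distributions
# and data quality per field. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("customer_events.csv")

print(df.dtypes)                         # type of each field
print(df.describe(include="all"))        # ranges and basic distributions
print(df.isna().sum())                   # missing values per field (data quality)
print(df["event_type"].value_counts())   # frequency of a categorical field
```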
4. Prepare Data
Depending on our findings in the data exploration phase, we may need to prepare the data in such a way that it can be used for the actual modelling phase. We may have to clean the data, generate new data fields by calculations or aggregations, transform data into convenient formats, and integrate and combine multiple separate datasets.
Once the data is ready to be used, the analyst may take samples for “training” (modelling) purposes and for testing. The data preparation is usually done by the Data Scientist, and it usually takes considerable time compared to the other steps in the analysis (a share of 50% is not uncommon).
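A condensed sketch of such a preparation, with assumed column names, cleaning rules, a hypothetical churn-label file to combine with, and a 70/30 split into training and test samples, might look like this:

```python
# Minimal preparation sketch: clean the data, derive new fields by
# aggregation, combine datasets and take training/test samples.
# All names and rules are assumptions for illustration.
import pandas as pd

df = pd.read_csv("customer_events.csv")

# Clean: drop records without a customer, fill missing amounts with 0
df = df.dropna(subset=["customer_id"])
df["amount"] = df["amount"].fillna(0)

# Derive new fields by aggregation: one row per customer
customers = df.groupby("customer_id").agg(
    total_amount=("amount", "sum"),
    n_events=("event_type", "count"),
).reset_index()

# Combine with a separate dataset (here: a hypothetical churn label per customer)
labels = pd.read_csv("churn_labels.csv")          # columns: customer_id, churned
customers = customers.merge(labels, on="customer_id", how="inner")

# Samples for "training" (modelling) and for testing
train = customers.sample(frac=0.7, random_state=42)
test = customers.drop(train.index)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```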
5. Create Model
The Analytics tools usually offer a set of modelling techniques that the Data Scientist may use. Examples are decision trees, clusters or regression models. The analyst picks one or a few of them that seem suitable for the characteristics of the data and the business objectives. After applying and configuring them on the set of training data, he/she selects the one that leads to the best results.
The deliverable is usually a pattern or formula that may be used for predictions or classifications in future cases.
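As a rough sketch of this step, assuming the prepared training and test samples from above and an assumed “churned” target column, two candidate techniques could be compared like this (the final check against the held-out test sample anticipates the evaluation described below):

```python
# Minimal modelling sketch: try two techniques on the training sample,
# select the best one by cross-validation, then check it once against the
# held-out test sample. Column and file names are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
features = ["total_amount", "n_events"]

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Pick the technique that fits the training data best
scores = {name: cross_val_score(model, train[features], train["churned"], cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(train[features], train["churned"])

# Evaluate the selected model against data it has not seen before
test_accuracy = accuracy_score(test["churned"], best_model.predict(test[features]))
print(f"{best_name}: accuracy {test_accuracy:.2f} on the test sample")
```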
The findings in the Model step need to be evaluated with a different set of data than the one used to find the pattern, to see if it is generally applicable.
For that purpose, usually one or more of the test samples created in the “Prepare Data” step are used. The findings can also be tested against new real-life cases.

3. Visualize Results
The results from the Analysis phase should be communicated to decision makers and other stakeholders in a comprehensible way.
This may be through a simple set of charts in a presentation or document, or even through fancy graphics or animations (often delivered by specialized graphical designers) in case the idea needs more “persuasion power” or should be distributed to a broader audience. The key is that the results are presented in a clear way.
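As a trivial example, a single chart generated from the model's output may already be enough for a management presentation; the figures below are illustrative placeholders, not real results:

```python
# Minimal visualization sketch: one bar chart summarizing the model outcome.
# The segment names and churn rates are illustrative placeholders only.
import matplotlib.pyplot as plt

segments = ["New", "Regular", "Loyal"]
predicted_churn = [0.32, 0.18, 0.07]    # assumed example output of the model

plt.bar(segments, predicted_churn, color="steelblue")
plt.ylabel("Predicted churn rate")
plt.title("Predicted churn per customer segment")
plt.savefig("churn_per_segment.png", dpi=150)
```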
4. Evaluate and Act
The last step is to evaluate the results in real-life situations.
When the models and patterns are sufficiently reliable, architects, process owners or other decision makers may decide to alter related business processes, rules or even strategy. These will be separate projects in the BPM profession.
The last part of this step is to evaluate the Analytics process itself, usually through a retrospective on the project. This may result in improvements to the Analytics Architecture or Process.
Conclusion
Big Data and Analytics form a complex field of technology and skills. A sound Architecture and Process (both introduced above) will help to get the best out of it!
We wish you good mining!