We are on the verge of the Data Science era and Big Data/Analytics is already a game changer.
Analytics involves a variety of concepts, technologies and competences: think of Big Data, NoSQL databases, in-memory processing, Data Mining and Statistics. To leverage this complex field of technology and concepts, a sound Analytics Architecture and Analytics Process are required.

This article provides a brief overview of both.
Analytics Architecture

The Analytics Architecture should be an integral part of the Enterprise Architecture.
In the TOGAF® ADM, for instance, the Preliminary and Architecture Vision phases would be extended to describe the global Analytics objectives, principles and required capabilities.
At the Business Architecture level, Analytics is likely to impact the Business Process Model, the Organization Model and even Business Functions through the implementation of any Analytics results. For input, Analytics will heavily depend on the Business Data Model.

At the Information Systems level, Analytics may use and extend data structures and define processes for the collection of (raw) data.
The Technology Architecture level describes the infrastructure and platform functions required to extract, transport, store, analyze and visualize Analytics data (think of tools like Hadoop HDFS clusters or NoSQL databases like MongoDB for storage, Hadoop MapReduce or Spark for extraction, and R, SAS or IBM SPSS for analysis).
After describing the current and target Analytics Architectures, a roadmap and migration planning could be created to define the project(s) needed to put the Data Science capability in place.
Analytics Process
The proposed Analytics Process consists of four steps:
- Collect (Raw) Data
- Analyze
- Visualize
- Evaluate & Act
1. Collect (Raw) Data

All Data Science initiatives begin (logically) with raw data.
Obviously we have the structured data from our traditional business systems at our disposal, and usually we have some kind of Data Warehouse with historical (aggregated) data in a dimensional format. In addition to this BI data, Data Science also considers less structured data like documents, e-mails and log files.
Just recently, Big Data technologies (like Hadoop) have enabled us to collect, store and process quantities of data of any type that we were not able to capture with conventional means. This has opened the possibility to capture all kinds of additional data from sources like the web, social media, intelligent communicative devices like sensors and wearables (the “Internet of Things”) and third parties (for example through open and commercial APIs).
In advance we may not know exactly what data to use and when to use it. Ideally we would like to store everything that has potential informational value. However, even with Big Data tooling and affordable infrastructure, a clear objective is required to avoid drowning in the data lake we might try to create. This business objective will help us decide which particular data sources and data to select.
Data Collection requires access to all data sources concerned (either through a data pull, e.g. APIs, or by push, e.g. data streams), the infrastructure to transport and store the data, and the configurations, processes and tools to extract and process the data.
This demands clear business objectives, knowledge of the data sources, skills in the technologies concerned and the ability to tie everything together.
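As a rough sketch of what such a collection step could look like, the Python snippet below pulls records from a hypothetical REST endpoint and stores them unchanged in a MongoDB collection. The endpoint URL, the paging parameter and the database/collection names are assumptions made purely for illustration.

```python
# Minimal collection sketch: pull raw data from a (hypothetical) REST API
# and store it as-is in MongoDB. Endpoint and names are illustrative only.
import requests
from pymongo import MongoClient

API_URL = "https://api.example.com/v1/events"      # hypothetical data source
client = MongoClient("mongodb://localhost:27017")
raw_store = client["analytics"]["raw_events"]      # raw data lands here untouched

page = 1
while True:
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()
    records = response.json()
    if not records:
        break                                      # nothing left to pull
    raw_store.insert_many(records)                 # interpretation comes later
    page += 1
```

Comparable pull or push mechanisms would be needed for every selected source; the point is that the raw data is stored without interpretation, so it can serve multiple analyses later on.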
2. Analyze Data

When raw data is available we may start extracting knowledge out of it (that is all that Data Science is about, isn’t it?).
The Analyze Data step is a process in itself, consisting of multiple sub-steps:
1. Define Objective
Analysis starts by formulating the business challenges that need to be tackled by the Data Analysis.
Do we need just business insights? Do we want to detect anomalies? Do we need to classify our customers so it is easier to decide on the policy to apply to a new client? Do we aim to analyze their sentiments? Or do we maybe even want to predict their purchasing behavior for the next month? And once the challenges are clear: which kinds of measures and KPIs do we need to be able to tackle them? Incidents? Revenue? Transactions? Margins? Churn?
This is a job typically performed by a Business Analyst, together with stakeholders with specific business domain knowledge.
Based on the results a plan ought to be formulated. This could be anything between a simple pragmatic list of steps and a complete project plan, depending on strategic importance, expected effort and risk.
2. Gather Data
Once we roughly know which data we need, we may extract it from our stored raw data into datasets suitable for use with our analytics tool(s). For instance, we may extract data from Hadoop HDFS using MapReduce into a CSV file that can be read by SAS. Besides internal data we could even extract data directly from external sources, like social media.
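A minimal sketch of such an extraction, here with Spark rather than hand-written MapReduce and with assumed HDFS paths and field names, could look as follows:

```python
# Minimal gather sketch: read raw JSON from HDFS, keep the relevant fields
# and write a CSV that a tool like SAS or R can read.
# The paths and column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gather-customer-events").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/events/*.json")   # collected raw data

dataset = raw.select("customer_id", "event_type", "amount", "event_date")

# One CSV file (with header) for the analytics tool to pick up
dataset.coalesce(1).write.option("header", True).csv("hdfs:///data/analysis/customer_events")

spark.stop()
```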
Usually this “gather” activity requires cooperation of the data store administrator(s) and the Data Scientist.
3. Explore Data
Next we need to explore the data using the Analytics tools. We need to understand the semantics of each field and its relevance for the analysis. Further, we ought to determine other characteristics of each field, like the type, ranges, distributions and the quality of the data.
This exploration is usually done by the Data Scientist.
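A first exploration could be as simple as the sketch below, which assumes the gathered dataset was delivered as a CSV file with the columns used earlier; any analytics tool (R, SAS, SPSS) offers equivalent functions.

```python
# Minimal exploration sketch with pandas: types, ranges, distributions
# and data quality per field. File and column names are assumptions.
import pandas as pd

df = pd.read_csv("customer_events.csv")

print(df.dtypes)                         # type of each field
print(df.describe(include="all"))        # ranges and basic distributions
print(df.isna().sum())                   # missing values per field (data quality)
print(df["event_type"].value_counts())   # frequency of a categorical field
```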
4. Prepare Data
Depending on our findings in the data exploration phase, we may need to prepare the data in such a way that it can be used for the actual modelling phase. We may have to clean the data, generate new data fields by calculations or aggregations, transform data into convenient formats, and integrate and combine multiple separate datasets.
Once the data is ready to be used, the analyst may take samples for “training” (modelling) purposes and for testing. The data preparation is usually done by the Data Scientist, and it usually takes considerable time compared to the other steps in the analysis (a share of 50% is not uncommon).
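A condensed sketch of such a preparation, with assumed column names, cleaning rules, a hypothetical churn-label file to combine with, and a 70/30 split into training and test samples, might look like this:

```python
# Minimal preparation sketch: clean the data, derive new fields by
# aggregation, combine datasets and take training/test samples.
# All names and rules are assumptions for illustration.
import pandas as pd

df = pd.read_csv("customer_events.csv")

# Clean: drop records without a customer, fill missing amounts with 0
df = df.dropna(subset=["customer_id"])
df["amount"] = df["amount"].fillna(0)

# Derive new fields by aggregation: one row per customer
customers = df.groupby("customer_id").agg(
    total_amount=("amount", "sum"),
    n_events=("event_type", "count"),
).reset_index()

# Combine with a separate dataset (here: a hypothetical churn label per customer)
labels = pd.read_csv("churn_labels.csv")          # columns: customer_id, churned
customers = customers.merge(labels, on="customer_id", how="inner")

# Samples for "training" (modelling) and for testing
train = customers.sample(frac=0.7, random_state=42)
test = customers.drop(train.index)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```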
5. Create Model
The Analytics tools usually offer a set of modelling techniques that the Data Scientist may use. Examples are decision trees, clusters or regression models. The analyst picks one or a few of them that seem suitable for the characteristics of the data and the business objectives. After applying and configuring them on the set of training data, he/she selects the one that leads to the best results.
The deliverable is usually a pattern or formula that may be used for predictions or classifications in future cases.
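As a rough sketch of this step, assuming the prepared training and test samples from above and an assumed “churned” target column, two candidate techniques could be compared like this (the final check against the held-out test sample anticipates the evaluation described below):

```python
# Minimal modelling sketch: try two techniques on the training sample,
# select the best one by cross-validation, then check it once against the
# held-out test sample. Column and file names are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
features = ["total_amount", "n_events"]

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Pick the technique that fits the training data best
scores = {name: cross_val_score(model, train[features], train["churned"], cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(train[features], train["churned"])

# Evaluate the selected model against data it has not seen before
test_accuracy = accuracy_score(test["churned"], best_model.predict(test[features]))
print(f"{best_name}: accuracy {test_accuracy:.2f} on the test sample")
```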
The findings in the Model step need to be evaluated with a different set of data than the one used to find the pattern, to see if it is generally applicable.
For that purpose, usually one or more of the test samples created in the “Prepare Data” step are used. The findings can also be tested against new real-life cases.

3. Visualize Results
The results from the Analysis phase should be communicated to decision makers and other stakeholders in a comprehensible way.
This may be through a simple set of charts in a presentation or document, or even through fancy graphics or animations (often delivered by specialized graphical designers) in case the idea needs more “persuasion power” or should be distributed to a broader audience. The key is that the results are presented in a clear way.
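As a trivial example, a single chart generated from the model's output may already be enough for a management presentation; the figures below are illustrative placeholders, not real results:

```python
# Minimal visualization sketch: one bar chart summarizing the model outcome.
# The segment names and churn rates are illustrative placeholders only.
import matplotlib.pyplot as plt

segments = ["New", "Regular", "Loyal"]
predicted_churn = [0.32, 0.18, 0.07]    # assumed example output of the model

plt.bar(segments, predicted_churn, color="steelblue")
plt.ylabel("Predicted churn rate")
plt.title("Predicted churn per customer segment")
plt.savefig("churn_per_segment.png", dpi=150)
```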
4. Evaluate and Act
The last step is to evaluate the results in real-life situations.
When the models and patterns are sufficiently reliable, architects, process owners or other decision makers may decide to alter related business processes, rules or even strategy. These will be separate projects in the BPM profession.
The last part of this step is to evaluate the Analytics process itself, usually through a retrospective on the project. This may result in improvements to the Analytics Architecture or Process.
Conclusion
Big Data and Analytics form a complex field of technology and skills. A sound Architecture and Process (both introduced above) will help to get the best out of it!
We wish you good mining!