Getting started with Big Data and Advanced Analytics
What’s the hype about?
First of all, Big Data is not really new. Large companies have been analysing their data for centuries, and governments have done it for millennia. What has changed are the tools at hand and the associated price tag. With current technologies you don’t need a mainframe for computing or specialized storage for keeping your data. It runs on standard commodity hardware, which in quite a few cases is not only superior to existing systems but also a lot cheaper.
The same is true for analytics. The underlying statistical mechanisms are not really new either. Markov Chain Monte Carlo methods already existed in the 1940s, but using open-source machine-learning software to teach a computer to estimate a person’s age from a picture is new and requires the cheap computational power that only became available in the last decade.
What’s the difference between “Big Data” and “Advanced Analytics”?
“Big Data” and “Advanced Analytics” are not the same. They are not synonymous and they are not always related, although they often appear in the same context.
Big Data traditionally happens when your data exceeds the existing limits of your systems in terms of volume, velocity or variety (the three V’s).
Volume is the most prominent: you have last week’s log files, machine data or transactions, but older data is only stored as daily or weekly averages. Condensed data like this leads to condensed insights. If you have access to traffic information and want to know how the profile of the morning rush hour on your daily route has changed over the last five years, it doesn’t help to know the number of cars per minute for yesterday if you can only compare it to the average number of cars per day for a specific date last year or per week over the last five years.
Velocity is an issue if you need the information at short notice but your traditional systems’ computational power is not sufficient to provide it instantly. A call center agent needs the details about a customer or product during the call, not half an hour later. Think of the recommendations you get when buying from large web shops like Amazon.
Variety is information that is available but cannot be used in combination, because the systems are not connected or a clear common denominator to identify connected datasets is missing. It’s good to have a CRM system and it’s also good to analyse weblogs, but it’s much better if data from both systems can be combined, as sketched below.
There are more variants, like dark data (data known to exist but that you can’t see or use, in analogy to dark matter) or veracity in terms of a single truth across systems, etc. For the sake of simplicity, let’s stick to the three popular V’s mentioned above.
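To make the variety point concrete, here is a minimal sketch of combining two sources with pandas. The CRM and weblog extracts and their column names are made up for illustration; the only real requirement is the shared customer ID acting as the common denominator.

```python
import pandas as pd

# Hypothetical CRM extract: one row per customer
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "segment": ["premium", "standard", "standard"],
})

# Hypothetical weblog extract: one row per page view
weblog = pd.DataFrame({
    "customer_id": [101, 101, 103, 104],
    "page": ["/pricing", "/support", "/pricing", "/landing"],
})

# The join only works because both sources share a common identifier
combined = weblog.merge(crm, on="customer_id", how="left")
print(combined)  # visitor 104 has no CRM match and shows up with a missing segment
```

Without that shared key, the two systems stay valuable on their own but the combined view is out of reach.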
Advanced Analytics, on the other hand, leverages today’s processing power. A computer as such has no intelligence of its own, but it is far superior when it comes to calculations, and advanced analytics makes use of that. One application is clustering: grouping data points that are not 100% equal but belong to the same group. An example would be organising pictures by faces without names, where the human user then assigns names to the groups, as iPhoto on the Mac does. The opposite is rule extraction, where the data already comes sorted into different bins and the machine tries to figure out the common denominator that describes each label.
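A minimal sketch of the two directions, using scikit-learn on a toy dataset (the data points, labels and parameters are assumptions chosen purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two numeric features per data point
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]])

# Clustering: no labels are given, the machine forms the groups itself
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster per point:", clusters)

# Rule extraction: the bins ("small"/"large") are given,
# the machine derives the rule that describes them
y = ["small", "small", "small", "large", "large", "large"]
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["feature_1", "feature_2"]))
```

The first half groups unlabelled points, like sorting faces without names; the second half starts from labelled bins and prints the rule it found.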
How to conduct a Big Data/Analytics Project?
Even if all of the above is known and understood, putting it to good use is a different story.
Getting started properly with Big Data and Analytics is not easy and a lot of companies face different challenges while testing the waters:
· The topic is fuzzy and not clearly defined (-> lack of understanding)
· Traditional IT lacks the technical skills and systems, and ramping those up will result in a long turnaround time (-> lack of resources)
· The complexity of the tool landscape makes the selection difficult and requires handling various solution providers at the same time (-> lack of overview)
· Without a defined topic and a calculated ROI it’s hard to get a budget (-> lack of use case)
· Big Data is often seen as just one bullet point on the digital transformation agenda, which itself has not yet been shaped (-> lack of vision)
Where to start?
You don’t have to wait until you have drawn a detailed roadmap of your digital transformation agenda before getting started. In fact, this will probably take so long that by the time the roadmap is finished, parts of it will be outdated, especially the technical details.
The general credo is: “Think big, start small, fail fast, scale quickly, move fast, break things.”
Even the longest journey starts with one step, but you’ll be there much earlier if you start NOW with a good idea about the right direction.
Such projects should be owned by the business and not by IT, because they require subject matter know-how and are about solving problems, not about picking a specific technology.
So instead of choosing a technology or building a platform, pick a good use case.
What are the criteria for “good” use cases?
A good use case starts with a problem that needs to be solved, where the result can be quickly quantified in hard cash and which has enough potential to be further mined. If you have 7 billion dollars of maintenance costs for turbines and set 1% savings as an achievable goal, that equals 70 million dollars. So General Electric built a data lake for preventive maintenance. Much better than “let’s see if we can find something interesting in our machine data.”
There are several use case catalogues per industry to choose from if you don’t have a specific one at hand, but mileage may also vary with the market. A bank with 120,000 customers and an annual churn rate of 1% is losing 100 customers a month. Taking into account that not all of them are recoverable, the expected return on investment probably does not justify spending more than 100k on a proof of concept plus the costs of a production platform. So the scope needs to be set properly and the feasibility needs to be checked.
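A back-of-the-envelope calculation along these lines can be the whole feasibility check. The customer value and recovery rate below are illustrative assumptions, not figures from the example:

```python
# Rough churn ROI estimate; all business figures are illustrative assumptions
customers = 120_000
annual_churn_rate = 0.01
lost_per_year = customers * annual_churn_rate      # 1,200 customers
lost_per_month = lost_per_year / 12                # ~100 customers

# Assumed figures, to be replaced with your own
annual_value_per_customer = 500    # revenue per customer and year
recoverable_share = 0.3            # share of churners a model could help retain

value_at_stake = lost_per_year * recoverable_share * annual_value_per_customer
print(f"Value at stake per year: {value_at_stake:,.0f}")   # 180,000

# Compare this against >100k for a proof of concept plus a production platform
```

If the value at stake barely covers the proof of concept, the scope is wrong, not the idea.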
Next on the list is data. The required data should be available, or at least the potential source identified, and it needs to be well understood. Quite a few projects fail because the data preparation (cleaning, combining with other sources, etc.) takes too much effort, or because the data itself is not well understood. A good way to start is to explore the data manually by just visualizing it. If that works, the data can be assumed to be under control.
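As a first manual exploration step, something as simple as the following already reveals gaps, outliers and misunderstood fields. The file name and column names are placeholders for whatever your source actually contains:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names; point this at your own source
df = pd.read_csv("machine_data.csv", parse_dates=["timestamp"])

print(df.describe())      # quick sanity check of value ranges
print(df.isna().mean())   # share of missing values per column

# Plot one measurement over time; gaps and outliers become visible immediately
df.set_index("timestamp")["temperature"].plot(figsize=(10, 4))
plt.title("Raw sensor readings over time")
plt.show()
```

If even this simple pass raises more questions than it answers, the data is not yet under control.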
What skills are needed?
Skills clearly depend on what needs to be built. If complex analytics are involved, this calls for a data scientist. If the focus is on analysing large data sets, a system architect who can design a performant system is more important. But none of this works without specific know-how from the business side: the most successful teams balance technical and statistical know-how with subject matter expertise.
Which technology building blocks are needed?
Overall there are four main pillars. First, you need to on-board your data: Extract-Transform-Load (ETL). That includes interfacing with source systems, filtering, cleaning and setting the proper format.
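A deliberately simple sketch of what such an on-boarding step can look like; the source file, field names and target format are assumptions for illustration, and real pipelines would add error handling and scheduling:

```python
import csv
import json

# Extract: read a raw export from a source system (file name is a placeholder)
with open("crm_export.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# Transform: filter out incomplete records, clean and normalize the fields
cleaned = [
    {
        "customer_id": int(row["customer_id"]),
        "email": row["email"].strip().lower(),
        "created": row["created"][:10],   # keep the date part only
    }
    for row in rows
    if row.get("customer_id") and row.get("email")
]

# Load: write to the format the platform expects (here: JSON lines)
with open("crm_cleaned.jsonl", "w") as dst:
    for record in cleaned:
        dst.write(json.dumps(record) + "\n")
```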
The second pillar is a platform for storing and processing the data. This is the “engine”. It can be a “Hadoop-ish” platform, but this is not a must.
Pillar three is the analytics tower. There is no magic involved and it doesn’t turn your Business Intelligence into Superman with superpowers, but a technically well-armed Batman is also a good choice. Tool-wise this is the most complex part, as the choice is huge. Think of it as a toolbox where every tool is better or worse suited to specific problems. The last pillar is visualization. This is fairly easy, as it can even be done standalone with existing products. If you want to build a platform, start here, as it will help with manual data exploration.
How to run the project from an operational point of view?
Agility is key. First run a proof of concept, preferably on a case where you are quite sure that there is some meaningful output. You need to prepare by carving out the use case(s) and preparing the data, but expect that your learning over time will be a deep look down the rabbit hole, that your organization will evolve and that targets might shift. Once you start to harvest the first valuable output, it will lead to more questions.
Solve them one by one in order of ascending complexity. Allow the users to build their data experiments and prototypes supported by data scientists, and don’t restrict the environment by prescribing a specific technology. After the first successful proof of concept it’s still possible to rebuild the cases with production-grade tools managed by IT. If you have the luxury of choosing between different cases, pick an easy one that doesn’t involve personal data. Your data protection officer will appreciate that.
Is there a methodology that can be used?
The Cross-Industry Standard Process for Data Mining, commonly known by its acronym CRISP-DM,[1] is a data mining process model that describes commonly used approaches that data mining experts use to tackle problems. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology used by industry data miners who chose to respond to the survey.[2][3][4][5] The only other data mining approach named in these polls was SEMMA, although the SAS Institute clearly states that SEMMA is not a data mining methodology but rather a “logical organization of the functional tool set of SAS Enterprise Miner.” A 2009 review and critique of data mining process models called CRISP-DM the “de facto standard for developing data mining and knowledge discovery projects.”[6] Other reviews of CRISP-DM and data mining process models include Kurgan and Musilek’s 2006 review[7] and Azevedo and Santos’ 2008 comparison of CRISP-DM and SEMMA.[8] Efforts to update the methodology started in 2006 but, as of June 2015, have not led to a new version, and the responsible “Special Interest Group” (SIG), along with its website, has long since disappeared.
How to align a small project with an overall strategy?
Make the environment and the learnings available to other departments. It should become embedded into daily routines and not be hidden in a specific silo.
One of the best implementations I’ve seen was at a large company which offered data science in a “cooking course” manner. Every department could “book” an accompanied course in the “data kitchen”, provided they brought their own problems and data. The payment was not cash but sharing the results and making the department’s data accessible to others in a data lake. That led to a growing data lake, with the big difference that each little part of this environment was well understood. If you do the opposite and simply dump all existing data in one place without understanding it and without having worked with it, the risk of ending up with a data swamp instead is quite high.
So in a nutshell: Start now, start from the business side, pick an easy use case which is relevant and has potential. Nobody learned to ride a bike by reading a book about it.