Self-service Big Data Preparation - Michele Stecca

Data Driven Innovation - Rome

Self-Service Data Preparation

Dr. Michele Stecca

24 Feb., 2017

• IoT systems generate massive amounts of data

SELF-SERVICE DATA PREPARATION

Who knows how to do

this?

So far, so good

BUT

• We store this huge amount of information in big data platforms

• Then we extract value from it

• A "citizen data scientist" is a person who creates or generatesmodels that leverage predictive or prescriptive analytics but whoseprimary job function is outside of the field of statistics andanalytics1

• The person is not typically a member of an analytics team. Citizendata scientists are typically in a line of business, outside of IT andoutside of a BI team


¹ Gartner 2015 Research – “Smart Data Discovery Will Enable a New Class of Citizen Data Scientist”

• Big data discovery will help expand the use of big data analytics because exploration of big data sources will occur more often, much faster and at a lower cost per analysis, delivered by a broader range of users with more rudimentary technical skills

• The global trend is to enable lesser skilled (i.e., citizen data scientists) users with the ability to solve more complex problems or access more insights using easier and quicker methods

• Through 2017, the number of citizen data scientists will grow five times faster than the number of highly skilled data scientists

• The blending in a single tool or tightly coupled portfolio, the ease of use, interactivity and agility of data discovery, with the richness of analysis and scale, diversity or immediacy of big data, will be the inception of big data discovery


• Gartner has developed the concept of smart big data discovery

• Preparing data, finding patterns in large, complex data and sharing findings with other users from data remains largely manual

• Smart self-service data preparation is a smart data discovery capability, where algorithms are used to find relationships in data and to profile and recommend to users the best approaches to minimize modeling time and improve quality


• doolytic simplifies access to big data with a modern BI user experience and functionality

• doolytic enables smart data discovery on both structured and unstructured data

• doolytic offers sophisticated advanced query capabilities required by power users/citizen data scientists

• doolytic leverages supervised and unsupervised machine learning features for further investigation



• Native Datalake Dictionary

• Join Recommender

• Not based on field name conventions like traditional BI tools

• Search links between fields and draw graphs with confidence from Datalake Dictionary


How can doolytic help to discover unknown

correlations?

• The algorithm suggests the user the potential correlations by associating a degree of confidence

• The user can accept/reject recommendations

• Graphical visualization for usability

• The algorithm is scalable



• The network planning department needs to optimize the bandwidth allocation by user and traffic type• Citizen data scientists are limited by the existing technology stack to high aggregation levels and small

fractions of data while performing statistical analyses • Citizen data scientists must manually correlate data coming from different data sources (including

network probes)

solution

challenge

benefits

• Business users keep track of frequently used queries with responsive interactive dashboards and visualizations

• Citizen data scientists drill data on the-the-fly at maximum granularity – at user, device and traffic package level - and discover new paths and rules for network optimization through the Relation-Action model

• Relationships among datasets are automatically recommended by a specific component

• More accurate and effective network optimizations algorithms are enabled with a wider and deeper set of inputs

• Citizen data scientist are free to do big data discovery on their own

• Lower TCO than legacy tools

• IT department redirected from support to custom data inquiries

• ROI realized through smaller required investment in optimized network equipment

• Moving from manual data preparation to smart data preparation is an important trend for IoT/big data applications

• This is particularly true when dealing with heterogeneous data such as sensor data, structured/unstructured data, etc.

• doolytic supports the citizen data scientist by providing advanced tools for data preparation on large datasets with the Join Recommender


@steccami


• Senior Big Data Analyst, doolytic

• Ph.D. Computer Engineering, Univ. of Genoa, Italy

• Visiting Researcher, ICSI - UC Berkeley, USA

• Principal Investigator, FP6 & FP7 projects co-funded by EU

• Author 30+ scientific papers in Computer Science

• Main interests: Big data (Hadoop, Spark, etc.), IoT

Data & Analytics

Self-service Big Data Preparation - Michele Stecca