21
Big Data Management Innovation potential analysis for the new technologies for managing and analyzing large amounts of data Executive Summary Commissioned by Prof. Volker Markl Prof. Thomas Hoeren Prof. Helmut Krcmar

Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Big Data Management Innovation potential analysis for the new technologies for

managing and analyzing large amounts of data

Executive Summary

Commissioned by

Prof. Volker Markl Prof. Thomas Hoeren Prof. Helmut Krcmar

Page 2: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Abstract

The increasing degree of networking in the Internet of Things and Services and the use ofadvanced sensor technology and simulation models in Industry 4.0, in service companies,in research and in the private sector are resulting in ever greater data availability. Theanalysis of this data will revolutionise commercial, scientific and social processes by pro-viding rapid and comprehensive data-driven support for decision-making. Companies inparticular can expect to gain significant competitive advantages. This trend is currentlydescribed by the terms “big data” or “data science”. Here, big data signifies that bothdata and analyses based on this data have attained a new quality and complexity inrecent years.If we are to ensure the lasting competitiveness of Germany as a business location, and

to continue to generate innovations in the modern information society, we will need acoherent interplay of four fields:

technology: provision of effective methods and tools to analyse large quantities of het-erogeneous data at a high data rate,

commercial exploitation: creation of applications to develop new markets or tostrengthen existing markets,

legal framework: ownership of data, data protection law, copyright and contractualand liability problems,

and the training of skilled workers.

The development of ICT for big data management is a logical priority of the FederalMinistry of Economics and Technology, and one which particularly merits support.This executive summary is an abridged version of a detailed study which evaluates

this interplay in order to move, via management and analysis, from big data to smartdata. The complete text of the study (in German) is available at http://www.dima.

tu-berlin.de/menue/research/big_data_management_report/.

Page 3: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

1Introduction: What are the New Big Data Qualities?

Big data is characterised by a new levelof complexity in terms of the data and theanalysis conducted on the data. This newtype of data complexity is characterised byrequirements in terms of volume, velocity,variety, and veracity, which cannot be metby conventional database systems. For ex-ample, the analysis of big data requires:

• the storage and processing of enormousquantities of data,

• whilst the window for decision-makingwithin which the results of the analysishave to be provided becomes smallerand smaller,

• a large number of different data sources(e.g., time series, tables, text, images,audio and video streams) to be in-cluded in the data analysis and

• due to the fuzziness of some datasources (e.g., sensors with a fixed pre-cision level) or of information extrac-tion procedures and integration proce-dures, systems and analysts need tohandle probability-based models andconfidences.

Also, new declarative languages areneeded for specifications and the automaticoptimisation and parallelisation of complexdata analysis programmes (including newstatistical and mathematical algorithms) inorder to cope with the volume, the process-ing speed, the different data formats andthe reliability of the data. Furthermore, bigdata involves a new complexity of analysis,as reflected in the fact that models are gen-erated from the data in order to supportdecision-making. This necessitates the useof advanced data analysis algorithms drawnfrom statistics, machine learning, linear al-gebra and optimisation, signal processing,data mining, text mining, graph mining,video mining, and visual analysis.These requirements will result in a

paradigm shift in data analysis languages,data analysis systems, and data analysis al-gorithms, making entirely new types of ap-plications possible.

1

Page 4: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

2Big Data Management: an Opportunity for Innovation

in Europe

The new challenges brought about bybig data represent a great opportunity forGerman and European companies, both interms of technology and in terms of appli-cations for this field. Existing products inthe commercial database market, which islargely dominated by U.S. firms, are basedon technologies, which cannot cope with bigdata because of a lack of scalability, lack oferror tolerance, or restricted programmingmodels. This means that there is a new in-ternational situation in the field of scalabledata analysis systems. Germany is well po-sitioned here. After the U.S., Germany hasthe second-strongest research community inthe field of scalable data management. Thiscommunity is already engaged in a largenumber of activities in basic research cen-tered on big data (e.g., the StratosphereSystem at Technische Universitat Berlin,Hyper at Technische Universitat Munchen,the research on Hadoop++ and HAIL atSaarbrucken University). In addition to anopen source movement, which is also activein Germany, many companies, and partic-ularly start-ups are challenging establishedproviders like IBM, Oracle and Microsoft.

Among the challengers pursuing opportu-nities and seeking market share in the emer-gent big data market there are a host of Ger-man technology companies as well as largefirms (e.g., SAP and their HANA product,Software AG and their Terracotta prod-uct) and many high-tech start-ups (e.g.,ParStream and Exasol). If the product andmarketing strategies of these companies areto be supported, it is important to createa climate of technology transfer and inno-vation, which enables German firms andparticularly SMEs, university spin-offs andstartups in the field of scalable data process-ing and data analysis to compete as equalswith the startups particularly found in Sil-icon Valley, in the UK, and in China.In this way, policy-makers can make an

important contribution towards the com-petitiveness of German firms in the futuredevelopment and commercialisation of keyenabling technologies for big data, and pavethe way for Germany to enter the billion-dollar big data market via scientific achieve-ments and innovations.

2

Page 5: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

3Training at Higher Education Institutions

In order to optimally prepare Germancommerce, science and society for thisglobal trend, there is a need for highly co-ordinated research, teaching and technologytransfer activities in the field of data analy-sis and scalable database management. Bigdata is no longer just a challenge for a spe-cific sector; rather, it affects all areas ofthe economy, all organisations and all usersof digital technologies. The novel job de-scription of the “data scientist” combinesknowledge of data analysis procedures (e.g.,statistics and machine learning, optimisa-tion, linear algebra, signal processing, lan-guage processing, data mining, text mining,video mining, and image processing) withtechnical skills in the field of scalable datamanagement (e.g., database systems, data-warehousing, information integration, dis-tributed systems, computer networks, andcomputer architectures) and practical sys-tems implementation skills.This sort of training should be backed

by practical application-based projects toteach skills in certain application areas.

3

Page 6: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

4Legal Aspects of Big Data

The technical developments in recentyears have substantially increased the avail-able quantities of data. These newly intro-duced technologies and their applicationsnot only enable data storage for a virtu-ally unlimited amount of time, but also toanalyse the data in greater detail, for ex-ample, in terms of user behavior patterns.However, the volume of data also raises nu-merous legal questions, mostly related tothe fields of data protection-, copyright- andcontract law. Furthermore, there is an on-going legal debate, which when resolved willhave serious consequences for the whole bigdata industry. The question whether thereis a right of data ownership right, and ifso who that would be.

Data Ownership

A property Data rights for data might seema dispensable theoretical construct unneces-sary given the already existing ownership,copyright (e.g., attributed to databases)and privacy protection. But data itself hasbecome an economic factor in its own right.It represents merchandise which has an in-dependent tangible value. With this back-drop it seems appropriate to develop an ab-solute protection regime.

Dogmatically, the concept of data own-ership creates challenges. Proprietary dataassumes that data can be clearly attributedto a legal subject and thus grants the sub-ject full property rights in return. A clas-

sification according to Section 90 ff. of theCivil Code (BGB) seems impossible. Thedatum itself is not a tangible asset accord-ing to Section 90 of the Civil Code (BGB)rather it is physically dependent. To in-clude a datum as major component of thedata medium according to Section 93 ofthe Civil Code (BGB) would conflict withthe view that in this case “ownership” andproperty rights are inseparable. But thisneeds to upheld to appropriately compen-sate for damages, for example, attributedto a data base.

This is exemplified in referring to a con-crete situation in which the economic dam-age did not occur to the owner of the datamedium, but rather to a third party usingthe data for economic benefits. While it waspossible 25 years ago to recourse to prop-erty rights when solving a legal case involv-ing data loss on a personal data medium,this construct is no longer applicable dueto modern technological advances in stor-ing data in the “cloud”. The formerly ar-gued “right to personal data stock” in thesecircumstances is “another right” accordingto Section 823 Civil Code (BGB) that ispotentially viable, but is met with dogmat-ically objections. To qualify as an exclu-sive right, it would need to offer an exclu-sion function similar to the rights defined insection 823 Abs 1 Civil Code (BGB). It isunclear who this exclusion right should beattributed to, thus this solution could onlyserve as an auxiliary construct that creates

4

Page 7: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

more problems than it solves.

Assigning data ownership rights to legalsubjects could also be achieved by meansof data protection rights. This only cre-ates a legal responsibility for data and shallnot be understood in such a way as datasubjects holding exclusive rights on indi-vidual data records in the legal sense of aproperty. The same holds true for a po-tential solution of using the exclusive sui-generis right on databases. This demon-strates an investment protection that pro-tects from economic exploitation by oth-ers, but does not extend to prevent linkingdata and people. The linkage problem how-ever can be solved by appealing to crim-inal law, as Section 303a Criminal Code(StGB) explicitly protects data. Linkingthe subject of protection to a legal entity isachieved through the scripture act (Skrip-turakt), thus by means of the technical cre-ation process itself. Transferring this con-cept to civil law enables a definite attribu-tion and thus establishes the possibility for“data ownership”.

It can be concluded that the derivation ofan ownership on data is dogmatically pos-sible. But it is unforeseeable whether theinstitution of data ownership will prevail asthe discussions about it are just starting.Nevertheless attention should be devoted tothis topic especially from the point of viewof big data companies. Depending on theoutcome of the debate the “data owner” as-sertions could affect the data record han-dling, independently of any data protec-tion or copyright claims. Furthermore, aclarification to the question of data owner-ship would also have dramatic impacts onbankruptcy law related issues as well.

Data Protection

Currently ongoing scientific discussions ve-hemently point out the conflicts that ex-ists between big data and data protection.Looking at the data protection principlessuch as appropriation, transparency, direct

survey, data reduction and data economy,and the principle of prohibition with thereservation of permission it becomes clearthat big data can collide with the conven-tion for the protection of individuals withregards to automatic processing of personaldata. To prevent data protection breachesbig data companies must therefore considera manifold of problem areas during decisionmaking.

One problem is already the question ofapplicability of the German privacy law.The generally valid principle of territorial-ity already causes difficulties, if data is dis-tributed worldwide onto different sites, thedata is ephemeral and can also very rapidlychange its determined location. Big datacompanies must therefore consider a multi-tude of different legal systems when dealingwith data.

Furthermore the permissibility of dealingwith data from a copyright point of view de-pends on an approval by the data subject orthe determination of legitimacy by statuteattributed to the processing agency. Thisprinciple of prohibition with the reserva-tion of permission is increasingly criticisedas being unsuitable since with the ubiquityof data processing in, for example, smart-phones almost everyone could potentiallybecome a data processor. The protection ofinformational self-determination is placedabove all at the beginning, to subsequentlydefine many in parts very extensive statu-tory permissions. It is suggested to rethinkthe concept of prohibition with the reserva-tion of permission in the scope of modern-izing the data protection law. In this waybig data companies could be relieved fromthe complex task to determine the “appro-priate” legitimate elements of a rule.

In this context the general question ofthe effectiveness of the institution of “con-sent as legitimate factor” arises. On theone hand there are existing concerns to-wards the optional nature of the consent,if it becomes the de facto good in returnfor a “free” service and thus develops into

5

Page 8: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

merchandise itself. On the other hand itneeds to be checked how consent can be ob-tained legally and whether its effective du-ration could be limited.

Altogether the representative scenariospresented and a multitude of other prob-lems show that data protection laws are anobstacle for big data companies. On the onehand this is the desired effect to successfullyprotect personal data. On the other hand itcan be seen that the principles of data pro-tection originated from an era that didn’tcater to the big data phenomena. For thatreason modernisation proposals are neededwhich balance the opposing interests in anadequate manner.

Copyright

Dealing with big data creates a lot of copy-right related questions as well. A univer-sally valid assessment of copyright relatedquestions which holds true for any big datasolution is impossible to achieve as concreteproblems always arise as consequences ofthe individual implementation of the pro-cesses. Protection of copyright can exerteffects along two different dimensions. Onthe one hand it is possible that the dataprocessing company itself could claim copy-right, in particular the sui-generis databaseprotection. On the other hand dealing withdata itself infringes upon copyrights or re-lated protected privileges of third parties.Especially the latter needs to be consideredby big data companies, when making con-crete data processing decisions.

This context, similar to the data protec-tion discussion, yields the question of appli-cability of the German statue. The princi-ple of territoriality in accordance with theGerman law is generally applicable to bigdata solutions that are used in Germany.Internet related cases which cannot be at-tributed to a concrete territory lead to con-flict of laws questions that have yet to be an-swered completely. Big data companies inturn are faced with an uncertain legal envi-

ronment since multiple different legal stat-ues could be applicable.

Copyright protection can only be guar-anteed if a sufficient level of originality ispresent in the work. Individual data recordsregularly do not present the necessary indi-vidual imprint, and as such copyright in-fringement are not to be expected. Butsomething contrary could hold if user gener-ated content, for example derived from a so-cial network is to be analyzed. Such recordscould be individually protected as photo-graphic images (Section 72 UrhG) photo-graphic works (Section 2 Abs. 1 Nr. 5UrhG), or literary works (Section 2 Abs. 1Nr. 1 UrhG). An analysis would thereforetouch upon the right of reproduction andthe right to make available to the public.

The permissibility of data handling fromthe point of view of copyright law depen-dends on whether existing statutory excep-tions apply in favor of the data processingcompany or the rights holder is acknowledg-ing the handling. Statutory exceptions willgenerally not authorize the handling of largeamounts of data. This finding serves as abasis for discussion as to whether the copy-right act (UrhG) should be expanded withfurther exceptions to accommodate the newdimensions of data flows. Without “suit-able” exceptions big data companies canonly resort to seeking approval from the re-spective rights holders. This however bearsconsiderable practical problems. New tech-nologies in the digital age lead to the factthat almost inevitably many foreign exclu-sive rights are touched upon so that a greatnumber of contractual agreements would berequired. To avoid this problem the figureof factual consent is introduced. Whethersuch a construct would in fact be viablefor authorizing the processing is still unde-cided. It can be concluded that there is aclearly visible trend that in the case of exer-cised copyrights concerning big data is onlypossible under unfavorable conditions.

This statement could be further empha-sized by the fact that another copyright pro-

6

Page 9: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

tection act could impede big data process-ing as well. If the data sets are sourcesfrom a third party database, the sui-generisdatabase protection according to Section87a ff. UrhG could conflict with the actionsof the big data company. Sui-generis pro-tection is independent of the level of orig-inality. It rather protects a database if asubstantial investment has gone into thecreation. When exactly this condition ismet is determined on a case by case ba-sis. This results in turn to uncertaintiesthat need to be addressed before managerialdecisions are undertaken. If the conditionis met this implies that without consent ofthe database owner only insignificant partsof the dataset can be used for the data pro-cessing. Databases that fall below the regu-lated threshold in turn can be used withoutexceptions.

Overall these sketched problems provethat copyright with all its unanswered le-gal issues as well as the regulations thatare partly in need of modernization couldpresent a barrier for the big data sector.In any event the focus on data privacy lawfrom the point of view of a business, shouldnot blur the view on corresponding copy-right related regulations.

Contractual and Liability Issues

Further uncertainties for big data compa-nies arise in regards to the contract termsas well as liability issues. Contract design iscomplicated due to the many technical as-pects involved as well as the existing touchpoints with data protection and copyrightlaw. It is hard to make generally applica-ble contract law statements. Neverthelessit can be asserted that the regulations interms and conditions, which exclude or re-strict liability for data loss or data damageare in danger of being ceased during a globaljuridical review of the terms and conditions.Furthermore there are general issues with li-ability of big data companies. This presup-poses misconduct on part of the respective

company. This could be owned to the cir-cumstances of transmitting faulty informa-tion. It is still not sufficiently defined whichcriteria need to be present to declare such acase.

Conclusion

The development of big data in Germany issignificantly influenced by the establishedlaw. Of great importance will be the ongo-ing debate about a potential “data owner-ship”. In addition there are some existingbarriers that impede especially data privacyas well as copyright regulations. Part of adebate on a modernisation of these fieldsof law should include whether these regula-tions are still up to date and where appro-priate alterations are possible which wouldappeal to economic interests of data pro-cessing companies without neglecting thelegitimate interests deserving protection ofdata subjects. As long as this process isongoing, big data companies must deal ona case by case basis with the fact as towhether their big data solution operateswithin the legal bounds as defined by thedata privacy as well as data copyright law.

7

Page 10: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

5The Innovative Potential of Big Data

Our study investigated the innovative po-tential of big data both in select verti-cal market sectors and on across horizon-tal markets. This chapter summarises ourfindings.

We only relied on big data studies whichcontain information about Germany, havequantitative data and contain comparablestatements, so that they can be linked uptogether. A total of nine studies were eval-uated. The sample size of the studies con-sidered here (i.e., BARC, BITKOM, Com-puting Research (two studies), ExpertonGroup, Fraunhofer IAIS, Interxion, TATAConsultancy Services, TNSinfratest) waslarge at n > 4492.

The core messages of the studies can besummarised as follows:

1. Big data has thus far chiefly been atopic for IT experts, but has been rel-atively unknown beyond that.

2. Big data will alter company manage-ment, business processes and informa-tion logistics.

3. The main need for specific action atpresent is perceived in the field of in-formation logistics.

4. Companies are only just starting to de-velop strategies for the use of big datareaching beyond the analysis of struc-tured data.

5. There is a lack of skilled workers, or-ganisational structures, and processesin companies to use big data. Thesewill be in even greater demand in fu-ture.

6. As companies take a greater interest,the big data market will grow consid-erably.

Results of the Empirical Study. Inaddition to the above-mentioned analysisoffered existing studies on the innovativepotential of big data, an independent empir-ical study (with 185 participants) was car-ried out. The aim of this study was to val-idate and supplement the existing findings.The investigation was carried out in the

2nd and 3rd Quarter of 2013. The totalamount of participants was 185.

Background of the Participants. Fig-ure 5.1 shows the professional backgroundof the participants of the study, 16% of theparticipants were decision makers. The ma-jority of the participants had a backgroundin computer science or related fields (Fig-ure 5.2).Most participants were between 20 and

54 years of age. (representing 164 responsesand 16 participants were older than 54) andattributed themselves mostly to the infor-mation technology sector.Out of the 90 participants who identi-

fied as part of the private sector, 40 were

8

Page 11: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Figure 5.1: The field of work of the partici-pants in the study

Figure 5.2: The education of the partici-pants in the study (Answers=183)

part of small and medium sized enterprises(SMEs). Exactly 47 participants came fromorganizations with more than 1000 employ-ees, 23 from organizations with 2500-1000employees and 39 from 250 or less employ-ees (109 responses in total). Precisely, 38companies had an annual turnover of morethan 250 million euros, 66 had a lower an-nual turnover, out of which 24 had an an-nual turnover between 5 and 50 million eu-ros. The next sections discuss the results ofthe study in detail.

Big Data Projects are Still in a VeryEarly Stage. About 40 % of the decision-makers state that they are still in an in-formation phase regarding big data tech-nologies (see Figure 5.3) Only 8 % ofdecision-makers have already consideredimplementing those technologies. A quar-ter of decision-makers stated, that they arealready planning and/or testing Strategies,Roadmaps and Measures, including cost-benefit analyses. On the side of practition-ers, things have already progressed further.15 % have already implemented a big datastrategy. Furthermore, more companies arein the phase of implementation, testing andplanning.

Figure 5.3: State of Big Data Projects

Big Data Will Be Important in 5Years. Decision-makers and practitionershave been asked to assess the importanceof big data in practice as part of this study.While 44 % of the practitioners (n=50) ratethe topic of big data as very important orrather important, only 29 % (n=24) of deci-sion makers did so. For the year 2014, 61 %of practitioners and 43 % of decision makershad this view. However the biggest impor-tance of the topic of big data was assertedfor the next five years (78 % of practitionersand 60 % of decision makers).

Big Data Will Have Established Itselfin Less Than 10 Years. A large ma-jority of the participants expects that bigdata will prevail everywhere in their indus-try in less than 10 years. This assessmentis shared by providers and decision makers,as is shown in Figure 5.4.

Big Data Has High Value Cre-ation Potential. About 50 % of decision-makers and 72 % of practitioners assess thevalue creation potential of big data as highor very high. Only 25 % of decision-makersand 8 % of practitioners assess the value

9

Page 12: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Figure 5.4: When is Big Data starting toprovide a competitive advantage?

creation potential of big data as low. It isnoteworthy that practitioners assess the po-tential of big data significantly more posi-tive than decision-makers.

Measurable Value Attributed to BigData? About 16 % of decision makersand 19 % of practitioners have stated thatthey already managed to create measurablevalue with big data. But 72 % of decisionmakers and 54 % of the practitioners statedthat they could not yet estimate the actualvalue of the value contribution. These re-sults indicate the discrepancy between thevalue creation potential and the actuallymeasurable contribution.

Besides Relational Data, TransactionData, Text Analyses and Web Analy-ses Dominate. The providers of big datatechnologies are highly diversified and canprovide analysis tools for many differentdata types. However predominantly toolsfor relational data and highly structureddata such as transaction data are requested.Video, image, and audio data are rated asrather unimportant (see Figure 5.5).

From a practitioners’ point of view, anal-ysis of transaction and web data are themajor concern regarding already carried outanalyses. It is particularly interesting, that

only few practitioners plan the future appli-cation but rather do not yet know whethersuch analyses should actually be carried outat all.

High Quality Data , Lack of SkilledPersonnel and Lack of Economic Fea-sibility are the Biggest Challenges.Decision makers identified the lack of incor-porated data analyses into actual decisionsas the largest obstacle to the adoption ofbig data. Decision makers also voiced ma-jor concerns regarding the legal frameworkof privacy. The shortage of skilled employ-ees for data analyses is also named as a con-cern. Practitioners also largely see privacyissues as the largest concern. They sharethe view that incorporating data analysesinto the decision making process is a ma-jor challenge. Providers on the other handlargely see privacy concerns as the majorchallenge. Providers share the challenge todemonstrate how data analyses can be in-corporated into the decision making pro-cess. Furthermore providers struggle withthe complexity of the structure of data inenterprises. In summary, the largest chal-lenges of big data lie in a value adding in-tegration of big data into decision makingprocesses, the lack of skilled data analystsand concerns regarding privacy.

The results of the survey can be sum-marised as follows:

1. Big data is being viewed in the lightof more effective corporate decisions.However, providers and users are fac-ing the challenge of developing convinc-ing combinations of data analysis, com-mercial decisions, and value. There isan opportunity here in in-house pro-cesses, since a large amount of data hasnot been used so far.

2. German providers of big data technol-ogy are capable of meeting the needsof German users. However, a key chal-lenge is the elucidation and explana-

10

Page 13: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Figure 5.5: Service provider perspective: Existence of Big Data tools for supportingvarious data types

tion of data protection rules and thedevelopment of viable commercial so-lutions.

3. Also, big data is believed to offer po-tential in the field of new businessmodels, products and services. Usercompanies are still unclear about whatdata analyses are of relevance and valuefor their respective business processes.Here, providers and users could workintensively together in order to developcorresponding business models, prod-ucts and services.

4. One important challenge lies in the re-sponsible use of personal data. All ofthe stakeholders can see major chal-lenges in this field. Here, it is necessarynot only to bring the law into line withthe state of the art, but in particular toeducate and to manage expectations.

5. Another key challenge lies in the avail-ability of appropriately trained staff.There is a need here for suitable train-ing courses in order to provide compa-nies with the necessary expertise.

6. Targeting the survey at small andmedium-sized firms reveals that thereare hardly any differences in the per-ception of the innovative potential.This can be interpreted as an opportu-nity for small and medium-sized firms,since there are clearly no additionalhurdles for this business sector. Fur-ther to this, the studies considered andthe authors’ own study discuss gen-eral horizontal potential as sustainablyclassified individual sectors.

Cross Sector and Sector Specific In-novation Potential. For the analyses ofstrengths, weaknesses, opportunities andthreats (SWOT), literature research wascarried out in order to assess the individ-ual factors. Different strengths, and weak-nesses as well as opportunities and threatscould be derived. The results are shown inFigure 5.6.Based on the reviewed studies as well as

the own study sectors with an outstand-ing innovation potential could be identified,these are:

11

Page 14: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

SWOT-AnalysisInternal Analysis

Strengths Weaknesses

Extern

alAnalysis

Opportunities

• Fast and accurate analysis

• Better strategic decisions

• Improved control for operationalprocesses

• Better analysis with Business In-telligence

• Cost reduction

• More flexible services, targetedmarketing campaigns

• Better information for internal riskmanagement

• Better understanding of the mar-ket

• Improved customer service

• Monitoring leads to an in-crease of internal control

• A reduction of the dataset led to unanalyzed datapoints

• Big Data creates more trust-worthy data

• Customer specific solutions(transparent customer)

Threats

• Distribution Data

• Fraud Detection

• Unstructured/informal communi-cation

• Lack of technical and domainknowledge

• Privacy

• Data integrity and security

• Social acceptance

• Ethics

• A lack of plausible and con-vincing application scenarios

• Technical problems

• Costs

Figure 5.6: SWOT Analysis for the Assessment of the Potential of Big Data

12

Page 15: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

• public sector,

• Industry 4.0,

• health sector/life sciences,

• market research, (social) media and en-tertainment,

• mobility services,

• the energy sector, plus

• risk management and the insurance in-dustry.

The studies and the selected sectors showthat the data volume is manageable via var-ious technologies, such as Hadoop, Strato-sphere, in some cases also via SAP HANA,Par-Stream, or other in-memory databases.Data protection and data security offer con-siderable potential for big data technologyin Germany. Due to the fact that Ger-many has a well-developed understanding oftrusted data, Germany can become a mar-ket leader in this field. To this end, there isfurther need for research in the field of dataprotection and data security, particularly interms of the integration of data protectionfunctionality into existing or emerging dataanalysis systems or algorithms. The protec-tion of data is of considerable significance,particularly in order to uphold commercialinterests.

13

Page 16: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

6Analysis of Big Data and Big Data Technologies

On the basis of the requirements of indus-try, as sketched out in the preceding chap-ter, one can derive four core requirementsfor the management of big data.

1. Handling of large, heterogeneous quan-tities of data

2. Complex data analysis algorithms

3. Interactive, in many cases visually sup-ported data analysis

4. Comprehensible data analysis.

Existing data management systems can-not cope with these challenges. In orderto master them, there is a need to developscalable, easy-to-use data analysis systemsand new algorithms or paradigms for dataanalysis which address the various aspectsand requirements at the same time.

Analysis of Big Data

A new level of data complexity re-flects itself in novel requirements regard-ing volume, velocity and veracity, which arenot covered by existing database solutions.The analysis of big data requires the storageand processing of massive data sets in theterra- or even petabyte range. At the sametime the timeframe for decisions, in whichanalysis results have to be generated, be-comes smaller and smaller. Data analysissystems have to provide analyses with low

latency even when facing high data rateswith which new data sources have to be in-tegrated into the database. Furthermore, alot of different data sources each containingdata of different structure (such as time se-ries, spreadsheets, text, images, audio andvideo streams) are tapped for analyses.

A new level of complexity regardingthe analyses of big data manifests itselfin the fact that models for decision supporthave to be generated from the data. Thisrequires the application of advanced dataanalysis algorithms, in particular methodsof statistical machine learning, linear alge-bra, optimization, signal processing, datamining, text mining, graph mining, videomining as well as visual analyses. There-fore the data analysis system has to pro-cess complex algorithms from linear alge-bra, statistics, or optimization theory ina timely manner. These algorithms arecharacterized by a combination of user de-fined functions (UDFs), iterative, status al-gorithms and the common operations of re-lational algebra. This combination of it-erative algorithms and user defined func-tions is not provided by either traditionalSQL-based database systems nor existingbig data solutions (e.g., Hadoop, Pig, Hive,Storm, Lambda Architecture, etc.).

14

Page 17: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Thus the development and commercialisa-tion of modern data analysis systems thatcombine relational data processing withalgorithms of statistical machine learningprovide a high innovation and market po-tential.

Technical Challenges of Big DataTechnologies

The interactive and iterative data analysisprocess necessitates the mastering of threetechnical challenges:

1. People must be able to describe the de-sired result of the inquires in a high-level language,

2. the technology must process iterativedata flows and

3. the technology must also be able toprocess unknown programmes of otherproviders, known as user-defined func-tions (UDFs), sufficiently quickly.

Programming Models for the Rankand File Analysts. Besides to Hadoopa multitude of interesting research work onmassively parallel processing of data driven,iterative algorithms exist. However, noneaddresses the development of a declarativespecification and automatic optimization ofiterative algorithms. Therefore, the analy-sis of big data requires skills in distributedsystems programming, knowledge in the do-main of analysis as well as a solid under-standing of machine learning methods. Peo-ple with such a combination skills are quiterare. Overcoming this shortage will be thecritical success factor not just for new bigdata technologies but also for a broad ap-plication and uptake of big data analytics.

Iterative Data Processing. Iterativedata analysis methods compute the resultof an analysis in a lot of individual steps.Each step generates an intermediary resultor intermediary state. Since, such compu-tations have to be carried out in paralleldue to the high data volume, the repeat-edly updated state has to be efficiently dis-tributed, stored and managed across manymachines. For efficiency reasons, the statehas to be held in main memory. Many algo-rithms also require a lot of iterations untilthey converge to the final solution. It is thusimperative to compute iterations with lowlatency to minimize the total query time.In some cases the computational effort de-creases significantly after the first few it-erations. Systems such as MapReduce andSpark carry out all computations in each it-eration, even if part of the result has alreadyconverged and no longer needs to be up-dated. Contrary to this approach, real iter-ative data flow systems like Stratosphere orspecialized graph processing systems suchas GraphLab or Google’s Pregel can exploitthis property and reduce the computationaleffort for each iteration.

Processing of User Defined Functionswith Low Latency

User defined functions are not inherentlypart of the system, they are a long knownand supported concept in relational database systems. However interfaces are oftentoo restrictive to allow the implementationof complex algorithms. Google’s MapRe-duce, SCOPE, Stratosphere and Spark arethe only systems that offer more expressiveuser defined functions that can be executedin parallel. The degree of parallelism is usu-ally determined by the semantics of the pro-gramming interface (e.g., second order func-tions like map and reduce in Hadoop or fur-ther functions like match, cogroup or crossin Stratosphere).

15

Page 18: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

User defined functions do not belong thefunctional scope of the system but are of-ten executed externally. For that reason,their usage generally leads to higher exe-cution costs. These costs can be lowered,when the UDFs are deeply embedded insidethe system. Another challenge is the treat-ment of UDFs during the optimization ofdata analysis programs. The semantics of aUDF generally are not known to the system,data analysis programs with UDFs can onlybe optimized if the optimizer has additionalinformation. First approaches attempt toobtain this information by manual annota-tions or static code analysis.

State of the Art

The new challenges posed by big dataare not covered by established systems.Thus, there is a chance to break the quasimonopoly of US based database vendorswith innovative technologies.

Big Data as an Opportunity for Ger-man Technology Providers. A multi-tude of German vendors, research institu-tions and universities are well positioned. Inaddition to SAP’s HANA in-memory tech-nology and Software AG’s Terra Cotta sys-tem, small and medium sized enterprisessuch as Exasol or ParSteam as well asinnovative technologies from the univer-sity research community such as Strato-sphere, Hyper, HAIL/Hadoop++ whichcould be commercialized have to be men-tioned. These companies that stem fromuniversity cooperations, have already wonvarious international awards and beat theUS-American competitors in crucial bench-marks like the TPV Benchmark.

Innovative Research Prototypes atGerman Universities. German univer-sities additional innovative systems andprototypes such as Stratosphere (TUBerlin, HU Berlin, Hasso-Plattner Insti-tute), Hyper (TU Munich) Hadapt/HAIL(University Saarland) with new, disruptivetechnologies in the field of efficient speci-fication and scalable execution of machinelearning methods and mixes workloads havebeen created.

Data Marketplaces

The high technical, organizational and per-sonnel efforts necessary for the preparationand execution of reliable and comprehen-sible big data analyses are still an obsta-cle for many companies. Data marketplacesprovide extracted and integrated data in acentralized manner and can thus lower thecost for individual companies, which can ob-tain the cleansed and integrated data, sig-nificantly. The central entry point of a datamarket place eases the access to such ser-vices and data. Furthermore the data mar-ketplace serves as a data integration plat-form across customers and merchants, inparticular for the collective storage, analy-ses and re-use of data. Information mar-ketplaces enable in particular small andmedium sized enterprises to analyse suchdata and to apply the results to even outtheir competitive disadvantage

16

Page 19: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

7Recommendations for Decision-makers

Big data is a general concept embrac-ing technologies to collect, process andpresent large, heterogeneous quantities ofdata which accrue in very short periods andcan be used for very rapid decisions. Bigdata can thus foster disruptive changes onmarkets and in companies. These disrup-tive changes can result in substantial op-portunities and competitive advantages forbusinesses in Germany. At the same time,big data entails risks. In a globalised andclosely integrated economy, it is necessaryto shape a policy environment which per-mits German firms to utilise the opportu-nities of big data while effectively control-ling the related risks. The following sectionpresents three premises and six recommen-dations for action for the development ofan effective environment for big data. Thepremises stress central aspects of promotingbig data in Germany which serve as a basisfor all the recommended actions. The rec-ommended actions represent important ori-entations for developments which make itpossible to utilise the opportunities of bigdata and to control the related risks.

Premise 1: Education and Manage-ment of Expectations. The com-petitive advantages offered by big datanecessitate an objective discussion ofthe opportunities and risks of big datatechnologies and their use, with broadparticipation from commerce, govern-ment, society and academia.

Premise 2: Responsible Use of Data.The competitive advantages offered bybig data require clear rules precondi-tions and limitations to the responsibleuse of personal data.

Premise 3: Small and Medium SizedFirms are an Important TargetGroup. The competitive advantagesof big data require the targeted supportof small and medium-sized providersand users of big data technology.

Recommended Action 1: Make Useof Currently Unused Data toOptimise Operative BusinessProcesses. The compilation andpreparation of data is resource-intensive and cost-intensive. Compa-nies are therefore reluctant to investin big data. It is therefore necessaryto support pilot projects which helpcompanies to better assess the costand benefit of big data. Data archivesfrom manufacturing, development oroperations are particularly suited tothis. These do not generally containmuch personal data, so that the legalbarriers are lower. Furthermore, theycan be allocated to specific operativeprocess optimisations, and this makesit easier to understand the benefitsof using the technology. If sensiblecommercial use can be made of thepotential of big data, it will be possibleto derive robust arguments in favourof using big data.

17

Page 20: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Recommended Action 2: Buildingand Strengthening Ecosystemsfor Data Services. Big data estab-lishes the technical framework for dataservices, i.e. data and data analysesbecome commercial assets. Supportshould therefore be given to measureswhich provide large quantities of datafor analysis by third parties as well.In particular, companies should beenabled to trade, exchange or disclosedata and data analyses. These mea-sures should make it possible to makerobust statements about the structure,development and sale of such dataservices. In view of the novel natureand volatility of such a market, theemergence of complementary dataproviders for Germany’s core sectorslike industry, healthcare and mobilityshould be supported for importantpublic data.

Recommended Action 3:Strengthening German Tech-nology Providers for Big Data(Technology Push). In the fieldof big data technologies, Germany isvery well positioned thanks to researchat universities and developments incompanies. For this reason, measuresshould be supported which aim tocommercialise these technologies inGermany and internationally. Thiscan particularly take place via closeco-operation between technologyproviders and potential users. Thisgives potential users the possibility tointervene in their own interest in the fi-nal phases of technology development;the provider can orient its productor service better to the specific needsof the user. The aim is to boost theprospects of success of the German bigdata technology providers entering themarket (technology push).

Recommended Action 4:Establishment and Strength-ening of Sector Specific and CrossSector Innovation Networks forBig Data (Market Pull). Onemajor challenge for the broad use ofthe potential of big data is to identifyor establish the demand amongstpotential users (market pull). Supportshould be given to measures whichenable potential users and technologyproviders of big data to join forceswith innovation networks and todevelop data-driven innovations. Here,the form of the innovation networksensures the sustainability of the inno-vation beyond the individual companyand creates the possibility for newforms of co-operation on the use ofdata. The main emphasis here is oncommercial aspects and the realisationof big data potential in new products,services and business models. If itproves possible to establish specificand robust requirements for big datatechnology here, there will be op-portunities for companies to developcorresponding products and services.

Recommended Action 5: Boostingthe Legal Certainty in the Useof Big Data and OvercomingExisting Barriers. It is alreadypossible to use big data in compliancewith the law. Despite this, the cur-rent legal situation makes companiesreluctant to make effective use ofthe full commercial potential of bigdata applications. Adapting the legalframework to the current state of theart, particularly in the field of dataprotection law and copyright law,could make a decisive contributiontowards overcoming barriers andincreasing legal certainty.

18

Page 21: Big Data Management - TU Berlin · 2014-04-02 · Big Data Management ... tered on big data (e.g., the Stratosphere System at Technische Universita¨t Berlin, ... duced technologies

Recommended Action 6: Expansionof Training Courses for DataScience as a Key Skill. There is anurgent need for training courses for thequantitative and qualitative analysisof large heterogeneous data quantitieswith low latency. Here, support shouldbe given to products and serviceswhich integrate the systems view ofbig data with the analytical view andwith a responsible, legally secure useof big data. Similarly the economicaspects of big data must be taken intoaccount.

Colophon

EditorsProf. Volker Markl, TU-BerlinProf. Thomas Hoeren, WWU-MunsterProf. Helmut Krcmar, TU-Munchen

AuthorsThis executive summary is a trans-lated excerpt of a detailed report com-missioned by the Federal Ministry forEconomic Affairs and Energy. TheGerman title of the detailed reportis ”Innovationspotenzialanalyse fur dieneuen Technologien fur das Verwaltenund Analysieren von großen Datenmen-gen (Big Data Management)” The au-thors of the detailed report are: VolkerMarkl, Alexander Loser, Thomas Ho-eren, Helmut Krcmar, Holmer Hemsen,Michael Schermann, Matthias Gottlieb,Christoph Buchmuller, Philip Uecker,Till Bitter

Photos/IllustrationsPhoto (S. 3) Holmer Hemsen/TU-BerlinIllustrations (S. 9-11) Own work basedon the outcome of a questionnaire basedstudy made for the big data managementreport

Editoral departmentTU-Berlin/FG DIMA

TranslationChristoph Boden, Johannes Kirschnick,Juan Soto (TU-Berlin), kindly supportedby the translation service of the FederalMinistry for Economic Affairs and En-ergy

Publication dateMarch, 2014; Version 1.0

19