Big Data and Its Analytics for CA

ICAI's Book on Big Data for Chartered Accountants

  • Big Data and its Analytics: A Challenge or Boon for Governance

    Ravikumar Ramachandran

    COBIT 5 (F), ISO 27001:2013 Lead Auditor

    More than 22 years Industry experience

    Last 12 years as CRO, CISO

    Research and Review Committee ISACA

    e-journal editor of Mumbai Chapter & CGEIT Coordinator

    Presently in Hewlett-Packard

  • Disclaimer & Authors Note

    The views expressed belong to the author and not to the employer or any of the professional associations

    This Presentation is meant for the members of the Institute of Chartered Accountants of India

    The Author is sharing his own independent views and whenever references have been made to other works, due credit is given to the respective authors

  • Seizing the future..

    "As for the future, your task is not to foresee it, but to enable it" - French aviator and author Antoine de Saint-Exupéry

  • What is Big Data

    Extremely large data sets

    Unmanageable by database software tools

    Relative and not an absolute figure

    Increase with technology advances

    Varies with Sector

    "Every two days now we create as much information as we did from the dawn of civilization up until 2003. That's something like five exabytes of data" - Former Google CEO Eric Schmidt

    1000 Bytes = 1 Kilobyte

    1000 Kilobytes = 1 Megabyte

    1000 Megabytes = 1 Gigabyte

    1000 Gigabytes = 1 Terabyte

    1000 Terabytes = 1 Petabyte

    1000 Petabytes = 1 Exabyte

    1000 Exabytes = 1 Zettabyte
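
    The unit ladder can be checked with a few lines of Python. This is a minimal sketch, assuming the decimal convention in which each unit is 1000 times the previous one, starting from the byte; the function names are illustrative and the yottabyte is added as the next standard step after the zettabyte.

```python
# Decimal storage units, each 1000 times the previous one (assumed convention).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def to_bytes(value, unit):
    """Convert a value expressed in `unit` to bytes."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value between any two units in the ladder."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)


print(convert(5, "exabyte", "terabyte"))     # five exabytes expressed in terabytes
print(convert(2.5, "petabyte", "gigabyte"))  # 2.5 petabytes expressed in gigabytes
```

    Because every step is a factor of 1000, moving two rungs up or down the ladder always changes the figure by a factor of one million.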


  • Human Brain (Scientific American)

    Storage capacity: about 2.5 petabytes (roughly 2.5 million gigabytes)

    Capacity to hold 3 million hours of TV shows

    TV to run for more than 300 years!!

  • Internet: World's largest library

    Estimated at yottabytes as of today

    Downloading it all would take 11 trillion years, even on the fastest internet connection

    Estimated at 5 lakh TB in 2003

    In 10 years, it expanded 20 lakh times!!

  • Internet: World's largest library

    "The Internet emphasizes the depth of our ignorance because our knowledge can only be finite, while our ignorance must necessarily be infinite" - Sir Karl Popper, Conjectures and Refutations: The Growth of Scientific Knowledge (2002)

  • IDC's Digital Universe Study

    Between 2009 and 2020, digital data will grow 44-fold to 35 zettabytes per year

  • IDC's Prediction

    Volume of Digital Content:

    2012 - 2.7 billion terabytes (48% more than 2011)

    2015 - 8 billion terabytes

    Digital content doubles every 18 months

  • Economist

    Humans created 150 exabytes of information in the year 2005

    In 2011: more than 1,200 exabytes!!

  • Gartner's prediction

    More than 90% of universal data have been created in the last two years

    About 80% of enterprise data will be in the form of unstructured data

  • The arrival of Analytics

    Big Data-Big Opportunity

    NASA, National Oceanic and Atmospheric Administration

    Pharmaceutical companies, energy companies

    Big Data & Today's business

  • Dimensions of Big Data

    Volume : Whole and sample size

    Variety : Structured and unstructured

    Structured : Any data capable of being entered in a data field.

    Unstructured : Audio, Video, image, geospatial, click streams and log files

  • Dimensions of Big Data

    Velocity : The speed at which the data is created, accumulated, ingested and processed

    Real-time decision making

  • Big Data Synergies

    Traditional Business Intelligence

    Data Mining

    Statistical applications

    Predictive analysis

    Data Modeling

  • Getting the Big of Big Data

    Transformation Capabilities

    Big Data is too big an opportunity

    Best Integration

    Storage Technologies

  • Open Source

    Hadoop-its suitability

    Limitations-Pre-requisites, hardware requirements

  • Business Takeaway

    Businesses cannot wait for complete, structured data before taking decisions

    They need to take decisions on unstructured data

    However, not all unstructured data is useful

    Business houses that ignore unstructured data are doomed

  • Factors enabling Big Data

    Internet and digitization of opinions & behaviour

    Mobile computing

    Social Networking

    Moore's Law & Cloud

  • Key factors driving Big Data-1

    Increasing data volumes being captured and stored

    2011 IDC Digital Universe Study: in 2011, the amount of information created and replicated will surpass 1.8 zettabytes, growing by a factor of 9 in just 5 years

    The scale of this growth surpasses traditional technologies and configuration setups

  • Key factors driving Big Data-2

    Rapid acceleration of data growth

    2012 IDC Digital Universe Study: from 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes

    From now, until 2020, the digital universe will double about every two years
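
    The two figures on this slide can be cross-checked against each other. The sketch below is only an illustration, assuming steady exponential growth between 2005 and 2020; the function names are hypothetical and the inputs are the 130 and 40,000 exabyte figures quoted above.

```python
import math


def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def doubling_time(start, end, years):
    """Years per doubling, assuming steady exponential growth."""
    doublings = math.log2(end / start)
    return years / doublings


factor = growth_factor(130, 40_000)
period = doubling_time(130, 40_000, 2020 - 2005)
print(f"growth factor: {factor:.0f}x, doubling every {period:.1f} years")
```

    Under that assumption the growth factor comes out a little above 300 and the doubling time a little under two years, so the "factor of 300" and "doubles about every two years" statements are consistent.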

  • Key factors driving Big Data-3

    Increased data volumes pushed into the network

    According to Cisco's annual Visual Networking Index Forecast, by 2016 annual global IP traffic is forecast to be 1.3 zettabytes

    Due to increasing number of smartphones, tablets and other internet devices

    Increased bandwidth and proliferation of Wi-Fi availability

  • Key factors driving Big Data-4

    Growing variation in types of data assets for analysis

    Data scientists take advantage of unstructured datasets as against structured datasets

    Acquired from a wide variety of sources

    Format can be that of text, images, audio and video content

    Existing structured data management needs to be enhanced to accommodate the above

  • Key factors driving Big Data-5

    Alternate and unsynchronized methods for facilitating data delivery

    Structured environment gives clear methods of data delivery and exchange

    File transfers through tape and disk storage systems

    Unstructured data coming from Twitter, government websites

    Pressure for rapid acquisition, absorption and analysis

  • Key factors driving Big Data-6

    Rising demand for real-time integration of analytical results

    Increasing number of consumers for analytical results

    Businesses require real-time results on consumer behaviour

  • Data Explosion

    Data doubles every two years

  • Malthusian Theory of Population

    Thomas Malthus, author of "An Essay on the Principle of Population" (1798)

    Food production increases in arithmetic progression (A.P.) every 25 years

    Population increases in geometric progression (G.P.) every 25 years

    Restraint on reproduction
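
    The gap between the two progressions is easiest to see with numbers. This is a minimal sketch with purely illustrative values: both series start at 1 unit, food gains 1 unit per 25-year period, and population doubles per period. These starting values and the function names are assumptions for illustration, not figures from the slide.

```python
def arithmetic_progression(start, step, periods):
    """Values that grow by a fixed amount each period."""
    return [start + step * n for n in range(periods + 1)]


def geometric_progression(start, ratio, periods):
    """Values that are multiplied by a fixed ratio each period."""
    return [start * ratio ** n for n in range(periods + 1)]


# Four 25-year periods, i.e. a century, with illustrative unit values.
food = arithmetic_progression(1, 1, 4)
population = geometric_progression(1, 2, 4)

for period, (f, p) in enumerate(zip(food, population)):
    print(f"after {period * 25:3d} years: food {f}, population {p}")
```

    After four periods the arithmetic series has reached 5 while the geometric one has reached 16, which is the widening gap the theory points to.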

  • Malthusian Theory of Data Explosion (Imaginary)

    Population increases in G.P. (every 25 years)

    Data explodes every 2 years (approx. 1,024 times)

    Do not use mobile devices

    Restraint on internet

    Do not go to social sites

    Reproduction is allowed

    But no DATA Reproduction!!

    All economists to become Data Scientists

  • Evolution of Big Data

    Farnam Jahanian, Assistant Director for Computer and Information Science and Engineering at the National Science Foundation (NSF), defines data as "a transformative new currency for science, engineering, education and commerce"

  • Evolution of Big Data

    Big Data is characterized not only by the enormous volume of data but also by the diversity and heterogeneity of the data and the velocity of its generation

  • Implications of Big Data-Farnam

    Creation of new products and services

    Accelerate the pace of discovery in every science and engineering discipline

    Solve the nation's challenges, from medicine to cyber security

  • Data Explosion & Knowledge Management

    Data multiplies every two years

    Proprietary knowledge gets diluted

  • Going Forward

    Chief Innovation Officer (CIO)!!

    Chief Discovery Officer (CDO)!!

  • Balance Sheet

    Financial Management

    Management Accounting

    Strategic Financial Management

    Financial Risk on

    exciting new disciplines follows.

  • Hadoop

    Open-source software framework for processing huge datasets on a distributed system

    Development was inspired by Google's MapReduce and Google File System

    Allows you to query both structured and unstructured data

  • Hadoop

    Store any kind of data in its native format

    Stores petabytes of data inexpensively

    Assurance of availability

    Runs on a cluster of servers each having its own CPU and disk storage

  • Components of Hadoop

    Hadoop Distributed File System (HDFS): storage system for the Hadoop cluster

    HDFS breaks the data into pieces

    Distributes among the servers in the cluster

    Each server stores a small segment of the data set

    Each piece of data is replicated on more than one server
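
    The idea of breaking data into pieces and keeping each piece on more than one server can be sketched in a few lines. This is a toy illustration only: the function names, the tiny block size and the round-robin placement are assumptions made for the example, and real HDFS uses its own placement policy, which this sketch does not reproduce.

```python
def split_into_blocks(data, block_size):
    """Break a byte string into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replication=2):
    """Assign each block to `replication` distinct servers, round-robin."""
    if replication > len(servers):
        raise ValueError("not enough servers for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [servers[(block + r) % len(servers)]
                            for r in range(replication)]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4"])
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

    Because every block lives on two different servers in this sketch, losing any single server still leaves one copy of each block available.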

  • Components of Hadoop

    MapReduce

    Each server does its part of the analytical job

    Reports the results for collation into a comprehensive answer

    MapReduce is the agent that distributes the work and collects the results
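
    The pattern of each server doing its part and the results being collated can be imitated on one machine. The word count below is a minimal single-process sketch of the map, shuffle and reduce steps; it is not the Hadoop API, the function names are illustrative, and each string in `chunks` merely stands in for the data held by one server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Each 'server' turns its chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key so each word is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collate the grouped values into one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map on every chunk, then shuffle and reduce the combined output."""
    all_pairs = []
    for chunk in chunks:
        all_pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(all_pairs))


chunks = ["big data big opportunity", "data doubles", "big analytics"]
print(word_count(chunks))
```

    In a real cluster the map calls would run in parallel on the servers that already hold each chunk, which is why only the small intermediate pairs need to travel over the network.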

  • Hadoop

    HDFS continually monitors the data stored in the cluster

    In case of hardware or software failure, it takes the data from a known good replica

    MapReduce monitors the progress of each server

    If a server slows down or fails to return an answer...

  • Hadoop

    ...MapReduce automatically starts another instance of the task on a server holding a copy of the data

    HDFS and MapReduce together do a super-fast and reliable job
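
    The failover behaviour described on these slides can be sketched as a simple retry over the servers that hold a copy of the data. This is a hypothetical illustration: `ServerDown`, `run_on` and `run_with_failover` are invented names, failure is simulated with a set of healthy servers, and the sketch covers only failed servers, not the handling of slow ones.

```python
class ServerDown(Exception):
    """Raised by the simulated task runner when a server does not answer."""


def run_on(server, block_id, healthy):
    """Pretend to run the task for one block on one server."""
    if server not in healthy:
        raise ServerDown(server)
    return f"result for block {block_id} from {server}"


def run_with_failover(block_id, replica_servers, healthy):
    """Try each server holding a replica until one returns an answer."""
    for server in replica_servers:
        try:
            return run_on(server, block_id, healthy)
        except ServerDown:
            continue  # move on to the next server that holds a copy
    raise RuntimeError(f"no healthy replica for block {block_id}")


# Server s1 is down, so the task is restarted on s2, which holds a copy.
print(run_with_failover(7, ["s1", "s2", "s3"], healthy={"s2", "s3"}))
```

    The retry only works because the data was replicated beforehand: the second server can start the same task without any data being moved.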

  • Hadoop Users

    As of early 2013, Facebook was recognized as having the largest Hadoop cluster in the world

    Other prominent users




  • New Approach of Data processing

    Data needs to be stored in a system in which hardware is infinitely scalable

    Storage and network cannot be a bottleneck

    Data must be processed into BI where it is

    Move the code to the data and not the other way round

    Data sits in one place; never move it around

  • Challenges in Protection of Big Data

    Big Data: risk of permanent loss

    Data from monitoring devices

    Surveillance cameras

    At high frequency and in real time

    Uniqueness: no deduplication

    Large files: huge CPU processing power

    No good backup solution available

  • Challenges in Protection of Big Data

    Not handled well by RDBMS

    NoSQL: new DBMS evolution

    HIPAA & PCI compliance challenge

    Very risky in medical industry


    Relational Database (RDBMS)

    Predefined schema

    Standard definition and interface language

    Tight consistency

    NoSQL Database

    No predefined schema

    Per-product definition and interface language

    Getting an answer quickly is more important than getting a correct answer
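
    The contrast between a predefined schema and no predefined schema can be shown with Python's standard library. This is a sketch under stated assumptions: an in-memory SQLite table stands in for a relational database, a plain list of dictionaries stands in for a document-style NoSQL store, and the table, field names and sample values are invented for illustration.

```python
import json
import sqlite3

# Relational side: the table's columns are fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (1, "Customer A"))


def insert_with_unknown_column():
    """Try to store a field that the predefined schema does not know about."""
    try:
        conn.execute(
            "INSERT INTO customers (id, name, twitter) VALUES (?, ?, ?)",
            (2, "Customer B", "@customer_b"),
        )
        return "accepted"
    except sqlite3.OperationalError as exc:
        return f"rejected: {exc}"


# Schema-less side: each record simply carries whatever fields it has.
documents = [
    {"id": 1, "name": "Customer A"},
    {"id": 2, "name": "Customer B", "twitter": "@customer_b"},
]

print(insert_with_unknown_column())
print(json.dumps(documents[1]))
```

    The relational table refuses the unexpected field until the schema is changed, while the document list accepts it immediately; the price is that nothing guarantees every record has the same shape.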

  • Challenges in Protection of Big Data

    CIA Triad: focus on access control

    Balance with performance

    High levels of encryption

    Complex security technology

  • Way forward.

    Destroy data if not legally required (logs)

    Classify data

  • Protection measures

    Control access on a need-to-know basis

    Secure the data at rest

    Keep the cryptographic keys on a separate hardened server

    Ensure that security does not impede performance

    Pick the right encryption scheme

    Flexible security solution with changing requirements

  • Big Data & IP

    Inventions, literary and artistic works

    Symbols, images, designs

    What to protect

    Prioritize protection

    Labeling and locking

    Security awareness

    Holistic approach

  • Governance Measures

    Strategic Alignment

    Identify Business priorities

    Define problems to be solved

    Time frame

    Measurable and achievable outcomes

  • Strategic alignment

    Demonstration of Value: Whether these technologies add value to real business problems

    Operationalization : How to migrate the big data projects into the production environment in a controlled and managed way

  • Governance Measures

    Management Sponsorship

    Management support for fact-based decision making

    Identify champions for consumption of analytics

    Ensure benefits realization from various reports and statistical models

  • Integration of Big Data Analytics

    Standard processes for soliciting input from business users

    Clear evaluation criteria for acceptability and adoption

    Massive data scalability

    Data reuse

    Oversight and Governance

    Mainstreaming accepted technologies

  • Governance Measures

    Analytical Human Capital

    Mobilize resources for analytics

    Hire the right talent and retain them

    Increasing demand for analysts skilled in mathematics, business and technology

  • Key Governance Role

    Ensure business effectively uses analytics to make better decisions

    Ensure investment is made in right type of analytics

    Ensure investment happens in right type of people, process & technology

  • Data Governance

    Alert : Identify data issues that might have negative business impact

    Triage : Prioritize those issues in relation to corresponding business value drivers

    Remediate : Data owners to take proper actions when alerted to the existence of those issues

  • McKinsey study

    Approximately 1,40,000 to 1,90,000 unfilled positions for data analytics experts in the U.S. by 2018

    Shortage of 1.5 million managers and analysts who have the ability to understand and make decisions using Big Data

  • Rise of Data Scientist

    New designation

    The Data Scientist

  • Yesterday's skills

    Business + Mathematics = Consulting profession

    Usage of heuristics and persuasive arguments in the boardroom

  • Yesterday's skills

    Business + Technology = IT Profession

    Automate algorithmic tasks, improving productivity and efficiency

  • Yesterday's skills

    Mathematics + Technology = Software Development

    Address a wide range of business problems

  • Tomorrow's Skills

    Business + Mathematics + Technology +Behavioral Science = Decision Science

  • Tomorrow's Skills (Big Data, Big Analytics, Michael Minelli et al.)

  • Privacy Landscape-Businesses

    Increased need to leverage privacy information for competitive advantage

    Huge investment in data sources and data analytics

  • Privacy Landscape-Criminals

    Rise in Identity theft

    Sophisticated technology to exploit data security vulnerabilities

  • Privacy Landscape-Consumers

    Increased awareness and concern about



    Disclosure of personal information

  • Privacy Landscape-Legislators

    Responding to consumer concern by restricting use of PI

    Significant impact and restriction for business

  • Seven Global Privacy Principles

    Notice : Inform individuals the purpose for which information is collected

    Choice : Offer individuals the opportunity to choose or opt-out

    Consent : Only disclose information to third parties consistent with the above principles

    Security : Take responsibility for CIA of PI

  • Seven Global Privacy Principles (contd.)

    Data Integrity : Assure the reliability of PI

    Access : Provide access to individuals to PI about them

    Accountability : A firm must be accountable for following the principles, through a compliance mechanism

  • Other Regulations




  • Different approach

    Privacy may be wrong focus

    "Data privacy is the thing you do to keep from getting sued, data ethics is the thing you do to make your relationship with your customers positive" - James Stogdill, O'Reilly Radar

  • James Powell, CTO, Thomson Reuters, 2011, O'Reilly Strata Data Conference

  • Conclusion

    Availability of Big Data

    Low Cost Hardware

    New Information Management and Analytic software

    Enormous opportunity

    Efficiency, productivity, profitability

  • Concluding Remarks

    "There are known knowns, there are known unknowns, but there are also unknown unknowns" - Former U.S. Secretary of Defense, Donald Rumsfeld

  • Concluding Remarks..

    "I love that quote... When I think about these three things in our daily life, they fall into these three outcomes for me... The known unknowns more fall into the category of analysis throwing... the thing I love is the last part, if you could figure this thing out, we could have saved Afghanistan from big problems" - Google's Avinash Kaushik in his presentation at Strata 2012, "A Big Data Imperative, Driving Big Action"

  • Thanks for your precious time!
