Week 12 Student(1)


  • 8/17/2019 Week 12 Student(1)

    1/111

    WEEK 12

    Dr. A. Brennan


    NOTE:

    Efficient Physical DB design produces technical

    specifications to be used during the DB

    implementation phase


    For efficient physical DB design, certain

    info. needs to be gathered:

Normalised relations with estimates of table volume (number of rows in each table)

    Attribute (field) definitions and possible max. length

Descriptions of data usage (when and where data are entered, deleted, retrieved, updated etc.)

    Response time expectations

    Data security needs

    Backup/recovery needs

    Integrity expectations

What DBMS technology will be used to implement the database

    What DB architecture to use


    Once this info. is gathered, the designer

    has to decide on a range of issues:

Suitable storage format (i.e. data types) for each attribute (in order to minimise storage space and maximise data integrity)

Grouping attributes from the logical model into physical records (denormalisation)

File organisation (arranging similarly structured records in secondary memory for the purpose of storage, fast and efficient retrieval and update, protection of data and its recovery after errors are found)

    Query optimisation


    What is it?

Translate the logical description of data into technical specifications for storing and retrieving data

    Why?

Good performance, database integrity, security and

    recoverability.

    Physical Design


    Input and Output for Physical Design

    • Normalised relations with

    estimates of table volume (number

of rows in each table)

• Attribute definitions (and possible
maximum length)

    • Descriptions of data usage (when

    and where data are entered,

    retrieved, deleted, updated)

    • Response time expectations

    • Data security needs

    • Backup/recovery needs

    • Integrity expectations

    • Description of DBMS technology

• Suitable storage format (data
type) for each attribute in the
logical data model in order to
minimise storage space and
maximise data integrity

    • Grouping attributes from the

    logical model into physical records

    • File organisation

    • Selection of indexes and database

    architectures for storing and

    connecting files to efficiently

    retrieve related data

    • Query optimisation


    Data Types

    • CHAR – fixed-length character

    • VARCHAR2 – variable-length character (memo)

• LONG – variable-length character data for large text

    • NUMBER – positive/negative number

    • DATE – actual date

    • BLOB – binary large object (good for graphics,

    sound clips, etc.)
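These are Oracle-style type names. As a hedged sketch, SQLite (via Python's sqlite3) is permissive about type names, so declarations like the following run as-is, even though a real Oracle schema would enforce them more strictly; the table and column names are invented for illustration:

```python
import sqlite3

# Illustrative only: Oracle-style type names, accepted loosely by SQLite
# through its "type affinity" rules.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        EmployeeID  CHAR(9),       -- fixed-length character
        Notes       VARCHAR2(200), -- variable-length character
        Salary      NUMBER,        -- positive/negative number
        HireDate    DATE,          -- actual date
        Photo       BLOB           -- binary large object (graphics, sound)
    )
""")
conn.execute(
    "INSERT INTO EMPLOYEE VALUES ('123456789', 'new hire', 45000, '2019-08-17', NULL)")
row = conn.execute("SELECT EmployeeID, Salary FROM EMPLOYEE").fetchone()
print(row)  # ('123456789', 45000)
```
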


    Goals

Data type

Goals = minimise storage space, represent all possible
values, improve data integrity, support all data
manipulations

Data integrity controls

Default value, range control (constraints/validation
rules), null value control (e.g. a PK cannot be null),
referential integrity (an FK must match an existing PK value)


Integrity Controls

Default value – assumed value if no explicit value is entered for an instance of the field (this reduces data entry time and helps prevent entry errors for the most common value);

Range control – imposes allowable value limitations (constraints or validation rules). This may be a numeric lower-to-upper bound, or a set of specific values. This approach should be used with caution, since the limits may change over time;

Null value control – allows or prohibits empty fields (e.g. each primary key must have an integrity control that prohibits a null value);

Referential integrity – a form of range control (and null value allowance) for foreign-key to primary-key match-ups. It guarantees that only an existing cross-referencing value is used
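In SQL these controls map onto column constraints. A minimal sketch using Python's sqlite3 (table and column names invented for illustration):

```python
import sqlite3

# One column constraint per integrity control from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs FK checking switched on
conn.executescript("""
    CREATE TABLE DEPARTMENT (DeptID INTEGER PRIMARY KEY);
    CREATE TABLE EMPLOYEE (
        EmpID  INTEGER PRIMARY KEY,                      -- null value control: PK cannot be null
        Name   TEXT NOT NULL,                            -- null value control
        Grade  INTEGER DEFAULT 1,                        -- default value
        Salary REAL CHECK (Salary BETWEEN 0 AND 500000), -- range control
        DeptID INTEGER REFERENCES DEPARTMENT(DeptID)     -- referential integrity
    );
    INSERT INTO DEPARTMENT VALUES (10);
""")
conn.execute(
    "INSERT INTO EMPLOYEE (EmpID, Name, Salary, DeptID) VALUES (1, 'Ann', 40000, 10)")
grade = conn.execute("SELECT Grade FROM EMPLOYEE WHERE EmpID = 1").fetchone()[0]
print(grade)  # 1 -- the default was applied

# Violations are rejected by the DBMS itself, not by application code:
try:
    conn.execute(
        "INSERT INTO EMPLOYEE (EmpID, Name, Salary, DeptID) VALUES (2, 'Bob', -5, 10)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # range control (CHECK) fired
```
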


    Physical Records

A Physical Record is a group of fields that are stored in adjacent secondary memory
locations and are retrieved and written together as a unit by a particular DBMS

Scope:

Efficient use of secondary storage (influenced by both the size of the record and the structure of the secondary storage)

• Data processing speed.

    Computer operating systems read data from secondary memory in units called pages.

    A page is the amount of data read or written by an operating system in one operation.

    Blocking Factor is the number of physical records per page.
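The blocking factor is simple arithmetic; a quick sketch with invented page and record sizes:

```python
import math

# Sizes here are invented for illustration.
PAGE_SIZE = 4096      # bytes read/written by the OS in one operation
RECORD_SIZE = 300     # bytes per physical record (sum of its field lengths)

blocking_factor = PAGE_SIZE // RECORD_SIZE   # whole records per page
print(blocking_factor)  # 13

# Reading 1,000 records stored contiguously then costs about this many page reads:
pages_needed = math.ceil(1000 / blocking_factor)
print(pages_needed)  # 77
```
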


    Normalization

Normalization produces a logical database design that is
structurally consistent and has minimal redundancy.

    Normalization forces us to understand completely

    each attribute that has to be represented in the

    database. This may be the most important factor

    that contributes to the overall success of the system.


    What is Denormalization?

Denormalization is a process of transforming normalised
relations into unnormalised physical record
specifications

Denormalization can also refer to a process
in which we combine two relations into one new
relation, where the new relation is still normalized but
contains more nulls than the original relations


    Denormalization

    In addition, the following factors have to be

    considered:

    Application specific;

    Denormalization may speed up retrievals but it

    slows down updates

    Size of tables

    Coding


Answer

    Efficient data processing (second goal of physical record

    design after efficient use of storage space) in most cases,

    dominates the design process.

    The speed of data processing depends on how close

    together the related data are.


Benefits and Possible Problems

    Benefits:

    • Can improve performance (speed)

    Due to data duplication

    Problems:

    • Wasted storage space

    • Data integrity/consistency threats


    Denormalisation – How?

    • Option one: Combine attributes from several logical

    relations together into one physical record in order to avoid

doing joins (one-to-one, many-to-many, one-to-many)

• Option two: Partition a logical relation into several
physical records (multiple tables);

    • Option three: Data replication; or a combination of the

    two options above.


    Denormalisation – Option 1

    1. Two entities with a one-to-one relationship


    Mapping

    Logical Model: Normalised Relations


SELECT * FROM EMPLOYEE, PARKING
WHERE
EMPLOYEE.Employee-ID = PARKING.Employee-ID
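A sketch of the trade-off using Python's sqlite3 (columns beyond the employee ID are invented, and the slide's hyphenated name is written with an underscore to keep the SQL valid):

```python
import sqlite3

# Sketch: the 1:1 EMPLOYEE/PARKING join vs. one denormalised physical record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (Employee_ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE PARKING  (Employee_ID INTEGER PRIMARY KEY, Space TEXT);
    INSERT INTO EMPLOYEE VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO PARKING  VALUES (1, 'A-12'), (2, 'B-07');
""")
# Normalised design: every lookup pays for a join.
joined = conn.execute("""
    SELECT E.Employee_ID, E.Name, P.Space
    FROM EMPLOYEE E, PARKING P
    WHERE E.Employee_ID = P.Employee_ID
""").fetchall()

# Denormalised design: one physical record per employee, no join needed.
conn.execute("""
    CREATE TABLE EMPLOYEE_PARKING AS
    SELECT E.Employee_ID, E.Name, P.Space
    FROM EMPLOYEE E, PARKING P
    WHERE E.Employee_ID = P.Employee_ID
""")
flat = conn.execute("SELECT * FROM EMPLOYEE_PARKING").fetchall()
print(sorted(flat) == sorted(joined))  # True -- same data, but now in one table
```
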


    Denormalisation – Option 1

    1. Two entities with a one-to-one relationship


    Try this!

EMPLOYEE (EmployeePPS, Name, Address) is linked by a Manages
relationship to MANAGER (ManagerID, Expertise)


    EMPLOYEE(EmployeePPS, Name, Address,

    ManagerID)

    MANAGER(ManagerID, Expertise)

SELECT * FROM EMPLOYEE, MANAGER
WHERE EMPLOYEE.ManagerID = MANAGER.ManagerID

    ManagerID Expertise EmployeePPS Name Address


    Denormalisation – Option 1

2. Many-to-many relationship (associative entity) with non-key attributes


    Denormalisation – Option 1

    Physical Model: Denormalised Relation


    Denormalisation – Option 1

3. One-to-many relationship


    Logical Model: Normalised Relations Resulting from One-to-Many (1:M) Relationship


    Physical Model: Denormalised Relation


    Denormalisation – Option 2

    Horizontal partitioning - places different rows of a table into several physical files, based

    on common column values.

    Vertical partitioning – distributing the columns of a table into several separate files,

    repeating the primary key in each one of them

    Option 2 : Partitioning of logical relation into multiple tables


    CUSTOMER

    CustID

    FirstName

    MiddleName

    LastName

    Address1

    Address2

City

County

    Country

    Phone

    CreditLimit

    SalesTaxRate

    Fax

    Email

    CUSTOMERA

    CustID

    FirstName

    MiddleName

    LastName

    Address1

Address2

City

    County

    Country

    Phone

    Fax

    Email

CUSTOMERB

    CustID

    CreditLimit

    SalesTaxRate

    Vertical partitioning


    CUSTOMER

    CustID

    FirstName

    MiddleName

    LastName

    Address1

    Address2

City

County

    Country

    Phone

    CreditLimit

    SalesTaxRate

    Fax

    Email

    CUSTOMERA-M

    CustID

    FirstName

    MiddleName

    LastName

    Address1

    Address2

City

County

    Country

    Phone

    CreditLimit

    SalesTaxRate

    Fax

    Email

    CUSTOMERN-Z

    CustID

    FirstName

    MiddleName

    LastName

    Address1

    Address2

City

County

    Country

    Phone

    CreditLimit

    SalesTaxRate

    Fax

    Email

    Horizontal partitioning
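Both partitioning styles can be sketched in a few lines of Python, with dicts standing in for rows (sample data and the A-M/N-Z split are invented to mirror the tables above):

```python
# Sketch of the two partitioning styles on the CUSTOMER relation.
customers = [
    {"CustID": 1, "LastName": "Adams", "Phone": "555-0101", "CreditLimit": 5000},
    {"CustID": 2, "LastName": "Nolan", "Phone": "555-0102", "CreditLimit": 9000},
]

# Vertical partitioning: split the COLUMNS, repeating the primary key in each file.
customer_a = [{"CustID": c["CustID"], "LastName": c["LastName"], "Phone": c["Phone"]}
              for c in customers]
customer_b = [{"CustID": c["CustID"], "CreditLimit": c["CreditLimit"]}
              for c in customers]

# Horizontal partitioning: split the ROWS on a common column value
# (surnames A-M in one file, N-Z in another).
customer_am = [c for c in customers if c["LastName"][0] <= "M"]
customer_nz = [c for c in customers if c["LastName"][0] >= "N"]

print(len(customer_am), len(customer_nz))  # 1 1
```
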


Advantages and disadvantages of partitioning

Advantages:

Efficiency

Local optimisation

Recovery

Disadvantages:

Slow retrieval (when data must be recombined across partitions)

Complexity

Extra space and time for updates


Denormalisation – Option 3

Data Replication - the same data is purposely stored in multiple locations of the database.

    Data replication improves performance by allowing multiple users to

    access the same data at the same time with minimum contention.

Option 3 : Data replication; or a combination of the other two options


    Denormalisation Disadvantages

    The potential for loss of integrity is considerable.

    Additional time that is required to maintain consistency

    automatically every time a record is inserted, updated,

    or deleted

    Increase in storage space resulting from the duplication


Whose responsibility?

    DBMS

    Database Designer


File Organisation

    1. Sequential File Organisation

    2. Indexed File Organisation

    3. Hashed File Organisation


    Sequential File Organisation

The records are stored in sequence according to a
primary key value. To locate a particular record, a
program must scan the file from its beginning until
the desired record is located


    https://www.youtube.com/watch?v=zDzu6vka0rQ



Therefore indexes are most useful for…

    Larger tables

    Attributes which are referenced in ORDER BY or

    GROUP BY clauses
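A sketch with Python's sqlite3 showing the DBMS choosing an index over a sequential scan on a larger table (table and index names invented):

```python
import sqlite3

# Build a table with many rows, then index the column used in the WHERE clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ORDERS (OrderID INTEGER, CustID INTEGER)")
conn.executemany("INSERT INTO ORDERS VALUES (?, ?)",
                 [(i, i % 100) for i in range(10_000)])
conn.execute("CREATE INDEX idx_orders_cust ON ORDERS (CustID)")

# The query plan confirms the index is used instead of a full sequential scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ORDERS WHERE CustID = 42").fetchall()
print(plan)  # plan mentions idx_orders_cust rather than a table scan
```
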


    https://www.youtube.com/watch?v=h2d9b_nEzoA



Hashed File Organisation

The address of each record is determined using a
hashing algorithm.

A hashing algorithm is a routine that converts a PK
value into a record address.

A hash index table uses hashing to map a key into a
location in an index, where there is a pointer to the
data record matching the hash key
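A minimal sketch of the classic division-remainder hashing algorithm (bucket count assumed):

```python
# Division-remainder hashing: convert a primary-key value into a storage address.
N_BUCKETS = 100               # number of storage addresses (assumed)

def hash_address(pk: int) -> int:
    """Map a primary-key value to a record address."""
    return pk % N_BUCKETS

print(hash_address(4567))     # 67
print(hash_address(10067))    # 67 -- a collision: two keys map to the same bucket
```

Collisions like the one above are why real hashed files also need an overflow-handling scheme.
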


    DB Architecture


Note

    De-normalisation should only take place after a

    satisfactory level of normalisation has taken place


    Goal of Physical DB Design

    The goal of physical DB design is to create technical

    specifications from the logical descriptions of data

    that will provide adequate data storage and

performance and will ensure database integrity,
security and recoverability


    DATA AND DATABASE

    ADMINISTRATION


    Data within the organisation

    Data are a resource to be translated into

    information

    Data is constantly being produced and analysed to

create even more data


    Database use in the organisation

    Top management

    strategic decision making, planning and policy

    Middle management

    tactical decisions and planning

    Operational management

    support company operations

    MIS

    DSS

    TPS

    Database


    Management data

    Two recognised roles


    Data/database administration

Data administration

a planning and analysis function responsible for setting data
policy and standards

promoting the company's data as a competitive resource

providing liaison support to systems analysts during
application development

Database administration

operationally oriented

responsible for day-to-day monitoring and management of
the active database

liaison and support during application development


    Data administrator

Data coordination
keep track of updates, responsibilities and interchange

Data standards
e.g. naming standards

Liaison with systems analysts and programmers, including design

Training managers, users, developers

Arbitration of disputes and usage authorisation

Documentation and internal publicity

Promotion of data's competitive advantage


    Database administrator

Responsible for the day-to-day administration of the database

Monitors performance to maximize efficiency

Provides central point for troubleshooting

Monitors security and usage (audit log)

Responsible for operational aspects of the data dictionary

Carries out data and software maintenance

Involved in database design


    Database Administrator in DB Design

Define conceptual schema
what data to be held; what entities; what attributes

Define internal schema
decide physical database design

Liaise with users
ensure the data they need is available

Define security needs

Define backup and recovery

Monitor performance
respond to changing requirements


    A Summary of DBA Activities

planning, organising, testing, monitoring and delivering of
db services and db activity, covering:

data distribution and use

data backup and recovery

data security, privacy and integrity

policies, procedures and standards

end-user support


    Tools for Database Administration

Information is kept about all corporate resources,
including data

This "data about data" is termed metadata

The database which holds this metadata is the data dictionary

Two types of data dictionary:
stand-alone or passive
integrated or active


    Metadata in Access


    Data Dictionary

    Passive data dictionary

    self-contained database

    all data about entities are entered into the dictionary

requests for metadata information are run as reports and queries as necessary

    Active data dictionary


    Data dictionary: relationships

    Table construction which attributes appear in which tables

    Security which people have access to which databases or tables

    Impact of change which programs might be affected by changes to which tables

    Physical residence which tables or files are on which disks

    Program data requirements which programs use which tables or files

    Responsibilitywho is responsible for updating which databases or tables


    Introducing a Database: Considerations

Three important aspects

technological: DBMS software and hardware

managerial: administrative functions

cultural: corporate resistance to change


Social impact of databases

    Data collection is extensive

    both voluntary and involuntary

    Data is a commodity


    DATABASE SECURITY


Security – types of threat

    Loss or corruption to data due to sabotage

    external

    internal

    • Loss or corruption to data due to error

    • Disclosure of sensitive data

    • Fraudulent manipulation of data


    Threats to data security


    Controlling unauthorised access

    Physical access to building

    Access to hardware

    Monitor any unusual activity


    Controlling unauthorised access

Developing user profiles
care over decisions on what data and resources can be
accessed (and type of access) for each end user
user training and education

Firewalls

Encryption

Plugging known security holes
using patches available for known problems


    Developing user profiles

Every user is given an identifier for authentication

Users are given privileges to access data
dependent on what is essential for their work
insert
update
delete

Most DBMSs provide an approach called
Discretionary Access Control (DAC)

The SQL standard supports DAC through the GRANT
and REVOKE commands
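SQLite offers no GRANT/REVOKE, so here is a toy Python sketch of the privilege bookkeeping those commands perform in a multi-user DBMS (user, table, and privilege names invented; the SQL forms are e.g. `GRANT SELECT ON EMPLOYEE TO clerk`):

```python
# Toy model of Discretionary Access Control: a privilege table that
# GRANT and REVOKE would maintain inside a real DBMS.
privileges = {}  # (user, table) -> set of allowed actions

def grant(user, table, *actions):
    privileges.setdefault((user, table), set()).update(actions)

def revoke(user, table, *actions):
    privileges.get((user, table), set()).difference_update(actions)

def allowed(user, table, action):
    return action in privileges.get((user, table), set())

grant("clerk", "EMPLOYEE", "SELECT", "UPDATE")
revoke("clerk", "EMPLOYEE", "UPDATE")
print(allowed("clerk", "EMPLOYEE", "SELECT"))  # True
print(allowed("clerk", "EMPLOYEE", "DELETE"))  # False
```
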


    DAC and MAC

DAC has certain weaknesses in that an unauthorized
user can trick an authorized user into disclosing sensitive
data

An additional approach is required, called Mandatory
Access Control (MAC)

MAC is based on system-wide policies that cannot be
changed by individual users
each database object is assigned a security class (e.g. secret)
each user is assigned a clearance for a security class
rules are imposed on reading and writing of database
objects by users

The SQL standard does not include support for MAC


    Firewalls

    Firewall controls network

    traffic


    Encryption

Encryption: encoding or scrambling data to make it unintelligible
to those without the key
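A toy illustration only (XOR with a repeating key is not a real cipher; production systems use algorithms such as AES):

```python
from itertools import cycle

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same routine encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"s3cret"
plaintext = b"customer credit limit: 5000"
ciphertext = xor_crypt(plaintext, key)

print(ciphertext != plaintext)        # True -- unintelligible without the key
print(xor_crypt(ciphertext, key))     # b'customer credit limit: 5000'
```
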


Controlling loss of DP facilities

    Redundancy

    Virus protection

    Disaster protection

    Minimise error

    Alert network managers to problems

    Minor disruptions require on-going monitoring


    Protect against error

Educate all employees

Reminders to save

Should you overwrite existing files?
incorporate safety nets on deletion

Include integrity checks on data

    validation

    cross checking

    range checks

    hash totals

    check digits

    batch totals
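One of the checks above, a check digit, sketched with an invented modulus-10 scheme (real systems use stronger schemes such as Luhn, which also catch transpositions):

```python
# Append a modulus-10 check digit to an account number, then verify it on entry.
def check_digit(number: str) -> str:
    return str(sum(int(d) for d in number) % 10)

def valid(number_with_digit: str) -> bool:
    body, digit = number_with_digit[:-1], number_with_digit[-1]
    return check_digit(body) == digit

acct = "273845"
print(acct + check_digit(acct))   # 2738459  (2+7+3+8+4+5 = 29, so the digit is 9)
print(valid("2738459"))           # True
print(valid("2758459"))           # False -- a mistyped digit is caught
```
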


    Software Invasion

    Cruise virus

    • attacks for profit

    • Exploits the network’s weakest link -you

    • attacks through the public domain

    • waits to reach its target

    • reports successful penetration

    • delivers payload

Stealth virus
• encrypts and hides its tracks

Worm
• makes copies of itself

    • transmits copies to other machines

    • difficult to access to disable

    Trojan horse

    • looks like something else

    • once launched, too late!

    Trapdoor

    • simulates regular entry

• or bypasses normal
security procedures

• difficult to detect that it
has been run

    Logic bomb

    • event driven


    Protecting against virus attacks

Prepare a company policy on viruses

Educate on the destructive power of viruses

Control the source of software purchasing

Ensure new or upgraded software is installed by the system
administrator on a quarantined machine

Control use of bulletin boards

Install anti-virus software where necessary

Make regular back-ups of data and programs separately; store
back-up copies off-site once software is opened

Be aware of software holes in systems software


    How security can be compromised

    Poor security management

    Poor connections to the outside world

    Shoddy system control

    Human folly

    Lack of security ethic

    And the answer is: 

    Education


    DISTRIBUTED DATABASE MANAGEMENT SYSTEMS


    Distributed databases

Distributed database
a logically interrelated collection of shared data (and
description of this data) physically distributed over a
computer network

Distributed DBMS (DDBMS)
the software system that permits the management of the
distributed database and makes the distribution transparent
to users
must perform all the functions of a centralized DBMS
must handle all necessary functions imposed by the
distribution of data and processing


    Distributed processing/database

    Distributed Processing

    Shares data processing chores over

    sites using communications network

    Database resides at one site only

    Distributed Database

    Each site has a data fragment 

    which might be replicated at

    other sites

    Requires distributed processing


    DDBMS

Advantages

Reflects organisational structure

Faster data access and processing

Improved communications in org.

Reduced operating costs

Improved share-ability and local autonomy

Less danger of single-point failure

Modular growth easier

Disadvantages

Complexity of management and control

Security

Integrity control more difficult

Lack of standard comms. protocols for dbs

Increased training costs

Database design more complex


    Characteristics DDBMS

A collection of logically related shared data

The data is split into a number of fragments

Fragments may be replicated

Fragments/replicas are allocated to sites

Sites linked by a communications network

Data at each site is under control of a DBMS

DBMS at each site can handle local applications
autonomously

Each DBMS participates in at least one global
application


    DDBMS features

Application interface to interact with the end user or
application programs, and with other DBMSs

Validation to analyse data requests

Transformation to determine which data requests
are distributed and which are local

Query optimization to find the best access strategy

Mapping to determine the location of fragments

I/O interface

Formatting to prepare data for presentation


    Distributed database design

    Data fragmentation (divide) need to decide how to split into fragments

    OR

    Data replication (copy)

    a copy of a fragment (or all) may be held at several sites

    THEN

    Data allocation:

need to decide where to locate those fragments and replicas: each fragment is stored at the site with "optimal distribution"


    Data fragmentation

Users work with views, so appropriate to work with
subsets of data

Cheaper to store data closest to where it is used

May give reduced performance for global
applications

Integrity control may be difficult if data and
functional dependencies are at different sites

    Data fragmentation must be done carefully


    Data fragmentation

Breaks a single object into two or more segments or
fragments

Each fragment can be stored at any site over a
computer network

Information about data fragmentation is stored in
the distributed data catalog (DDC), from which it is
accessed by the transaction processor


    Strategies for fragmentation

For successful fragmentation, must ensure:

completeness: each data item must appear in at least
one fragment

reconstruction: should be able to define a relational
operation that will reconstruct the relation from fragments

disjointness: a data item appearing in one fragment
should not appear in another


    Strategies for data fragmentation

    Horizontal fragmentation division of a relation into subsets (fragments) based on

    tuples (rows)

    Vertical fragmentation

    division of a relation into attribute (column) subsets

    Mixed fragmentation

    combination
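The three strategies, and the three correctness rules for fragmentation, can be sketched in Python with dicts standing in for tuples (the STAFF relation and its branch sites are invented):

```python
# Horizontal and vertical fragmentation of one relation, checking the
# correctness rules: completeness, reconstruction, disjointness.
staff = [
    {"StaffID": 1, "Name": "Ann", "Branch": "Dublin"},
    {"StaffID": 2, "Name": "Bob", "Branch": "Cork"},
]

# Horizontal fragmentation: subsets of tuples (rows), split by Branch.
frag_dublin = [r for r in staff if r["Branch"] == "Dublin"]
frag_cork = [r for r in staff if r["Branch"] == "Cork"]

# completeness: every tuple appears in some fragment
assert all(r in frag_dublin + frag_cork for r in staff)
# reconstruction: the union rebuilds the original relation
assert sorted(frag_dublin + frag_cork, key=lambda r: r["StaffID"]) == staff
# disjointness: no tuple appears in two fragments
assert not any(r in frag_cork for r in frag_dublin)

# Vertical fragmentation: subsets of attributes (columns), PK repeated in each.
frag_names = [{"StaffID": r["StaffID"], "Name": r["Name"]} for r in staff]
frag_branches = [{"StaffID": r["StaffID"], "Branch": r["Branch"]} for r in staff]
# reconstruction here is a join on the repeated primary key
rebuilt = [{**a, **b} for a in frag_names for b in frag_branches
           if a["StaffID"] == b["StaffID"]]
print(rebuilt == staff)  # True
```
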


    Data replication

    Storage of data copies at multiple sites served by a computer

    network

    Fragment copies can be stored at several sites to serve specific

    information requirements

    can enhance data availability and response time

    can help to reduce communication and total query costs


    Data replication

    Fully replicated database: stores multiple copies of each database fragment at

    multiple sites

    can be impractical due to amount of overhead

    Partially replicated database: stores multiple copies of some database fragments at

    multiple sites

most DDBMSs are able to handle the partially replicated database well

    Unreplicated database: stores each database fragment at a single site

    no duplicate database fragments


    Data allocation

Data allocation is closely related to the way the database is fragmented: it leads to decisions on which data is stored where

    Centralized

    entire database is stored at one site

Partitioned/fragmented
database divided into several fragments and stored at
several sites

Replicated
copies of one or more database fragments (selective
replication) are stored at several sites


    Strategies for data allocation


    BIG DATA, SMALL DATA


    Big Data

Big data is the term for data sets so large and complex that it
becomes difficult to process them using on-hand database
management tools or traditional data processing applications

    We are collecting more data than ever

    electronics enables us to do so (RFID)

    storage is cheap

    We have streamlined our processes through normal channels

    computing has enabled us to improve what we do and …

    … businesses are looking for new ways to have a competitive edge

    By looking at patterns in this data we can find out useful things

    From a McKinsey report …


$600 to buy a disk which can store all of the world's music

    Internet of Things


    Ubiquitous Broadband

Reduction in connectivity costs

RFID enables unique addressability

    Increasingly, we are including sensors in everyday objects

    These often have communicative capacity and link to source

    through the internet



    Use of Big Data

We can gain additional information derivable from analysis of
a single large set of related data (rather than a large number
of small sets)

    Correlations can be found which "spot business trends,

    determine quality of research, prevent diseases, link legalcitations, combat crime, and determine real-time roadway

    traffic conditions"


The business case (McKinsey)

    1. Big data can unlock significant value by making information

    transparent and usable at much higher frequency.

    2. Organizations can collect more accurate and detailed

    performance information on everything from product

    inventories to sick days, and therefore expose variability and

    boost performance. Leading companies are using data

    collection and analysis to conduct controlled experiments to

make better management decisions; others are moving from
basic low-frequency forecasting to high-frequency nowcasting
to adjust their business levers just in time.


The business case (McKinsey)

    3. Big data allows ever-narrower segmentation of customers and
    therefore much more precisely tailored products or services.

    4. Sophisticated analytics can substantially improve decision-making.

    5. Big data can be used to improve the development of the next
    generation of products and services. For instance, manufacturers
    are using data obtained from sensors embedded in products to
    create innovative after-sales service offerings such as proactive
    maintenance (preventive measures that take place before a failure
    occurs or is even noticed).

    http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
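    The proactive-maintenance idea in point 5 can be sketched as a simple
    rule over streamed sensor readings. The window size, threshold and
    vibration values below are all invented for illustration; real
    predictive-maintenance systems use far richer models:

    ```python
    from collections import deque

    def drift_alert(readings, window=5, limit=10.0):
        """Return the index at which a reading drifts more than `limit`
        above the rolling mean of the previous `window` readings."""
        recent = deque(maxlen=window)
        for i, value in enumerate(readings):
            if len(recent) == window:
                baseline = sum(recent) / window
                if value - baseline > limit:
                    return i  # schedule maintenance before failure occurs
            recent.append(value)
        return None

    # Hypothetical vibration readings from a sensor embedded in a product
    vibration = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 19.7, 20.3]
    print(drift_alert(vibration))  # index of the first anomalous reading
    ```

    The point of the sketch is the shift in business model: the sensor
    stream lets the manufacturer act on the anomaly before the customer
    even notices a fault.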

    Big Data in Health

    Big data is enabling a new understanding of the molecular biology
    of cancer. The focus has changed over the last 20 years from the
    location of the tumor in the body (e.g., breast, colon or blood) to
    the effect of the individual’s genetics, especially the genetics of
    that individual’s cancer cells, on her response to treatment and
    sensitivity to side effects. For example, researchers have to date
    identified four distinct cell genotypes of breast cancer;
    identifying the cancer genotype allows the oncologist to prescribe
    the most effective available drug first.

    http://strata.oreilly.com/2013/08/cancer-and-clinical-trials-the-role-of-big-data-in-personalizing-the-health-experience.html

    Big Data in banking

    IBM’s Watson can do analysis with “unstructured data” such as those
    found in e-mails, news reports, books and websites. Citigroup has
    hired Watson to help it decide what new products and services to
    offer its customers, to try to cut down on fraud, and to look for
    signs of customers becoming less creditworthy. In most financial
    institutions the immediate use of big data is in containing fraud
    and complying with rules on money-laundering and sanctions.

    Big credit card companies are getting better at recognising patterns

    Solutions are getting cheaper – even for smaller banks

    Banks also use the data to sell products (e.g. insurance) by looking
    at the type of transactions customers make

    http://www.economist.com/node/21554743
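    The pattern-recognition idea above can be illustrated with a toy
    z-score check on card transaction amounts. The spending history and
    threshold are entirely hypothetical, and real fraud scoring is far
    more sophisticated; this only shows the statistical shape of the idea:

    ```python
    import statistics

    def flag_outliers(amounts, z_limit=3.0):
        """Return transaction amounts lying more than z_limit standard
        deviations from the customer's mean spend."""
        mean = statistics.fmean(amounts)
        stdev = statistics.pstdev(amounts)
        if stdev == 0:
            return []  # no variation, nothing to flag
        return [a for a in amounts if abs(a - mean) / stdev > z_limit]

    # Hypothetical spending history for one cardholder (euro)
    history = [12.50, 9.80, 15.20, 11.00, 13.40, 10.90, 950.00]
    print(flag_outliers(history, z_limit=2.0))  # the unusual transaction
    ```

    A flagged transaction would not be blocked automatically; it would
    typically trigger a secondary check, which is where the "containing
    fraud" use case above sits.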

    Some geospatial uses

    The Climate Corporation, an insurance company, combines modern
    Big Data techniques, climatology and agronomics to analyse the
    weather’s complex and multi-layered behaviour to help the world’s
    farmers adapt to climate change.

    McLaren’s Formula One racing team uses Big Data to identify issues
    with its racing cars using predictive analytics, and takes
    corrective actions pro-actively. They spend 5% of their budget on
    telemetry. An F1 car is fitted with about 130 sensors. In addition
    to the engine sensors, video and GPS are used to work out the best
    line to take through each bend. The sensor data helps with traffic
    smoothing, energy-optimising analysis and determining the driver’s
    direction. E.g. new Pirelli tyres this year meant teams had to
    watch for tyre wear, grip and temperature under different weather
    conditions and tracks, relating all that to driver acceleration,
    braking and steering.

    Vestas Wind Systems is implementing a big data solution that is
    significantly reducing data processing time and helping to predict
    weather patterns at potential sites faster and more accurately, to
    increase turbine energy production. They currently store 2.8
    petabytes in a wind library covering over 178 parameters, such as
    temperature, barometric pressure, humidity, precipitation, wind
    direction and wind velocity from the ground level up to 300 feet.

    Nokia needed a technology solution to support the collection,
    storage and analysis of virtually unlimited data types and volumes.
    They leverage data processing and complex analyses in order to
    build maps with predictive traffic and layered elevation models, to
    source information about points of interest around the world, to
    understand the quality of phones and more. www.geospatialworld.net

    More geospatial uses

    US Xpress, a transportation solutions provider, collects about a
    thousand data elements ranging from fuel usage to tyre condition to
    truck engine operations to GPS information, and uses this for
    optimal fleet management and to drive productivity, saving millions
    of dollars in operating costs. When an order is dispatched, it is
    tracked using an in-cab system installed on a DriverTech tablet
    with speech recognition capability. US Xpress constantly connects
    to the devices to monitor the progress of the lorry. The video
    camera on the device could be used to check if the driver is
    nodding off. All the data collected is analysed in real time using
    geospatial data, integrated with driver data and truck telematics.
    They can minimise delays and ensure trucks are not left waiting
    when they arrive at a depot for maintenance. www.geospatialworld.net

    Big data in the university

    Huddersfield University linked library data to identify learning
    styles; now also including lecture attendance records

    Purdue University, Indiana: when a student logs into a course
    website, they see a traffic light signal (and advice on how to move
    to green)

    University of Derby: VLE use, sports, car parking

    Loughborough University: analyses staff-student interaction

    www.theguardian.com/education/2013/aug/05

    Role of Cloud Computing

    Enables easier gathering, storage and processing of Big Data

    Cloud computing provides accessibility any time, any place

    Large scale data gathering is possible from multiple locations

    Sharing of data is easier

    Large scale storage

    Processing power is also available, with virtual machine provision
    to analyse data; can be utilised on an ad-hoc basis

    Analysing Big Data

    Data mining: a blend of applied statistics and artificial
    intelligence – neural networks, cluster analysis, genetic
    algorithms, decision trees, support vector machines

    Analytics

    Machine learning

    Visualisation: interactive rather than static graphs help to
    understand patterns

    Shift of skills to digital analysis and visualisation techniques
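    Of the techniques listed above, cluster analysis is perhaps the
    simplest to sketch. Below is a minimal k-means implementation in
    plain Python, run on invented two-dimensional toy data (in practice
    a library such as scikit-learn would be used, and the points would
    be customer or sensor records):

    ```python
    import random

    def kmeans(points, k, iterations=20, seed=0):
        """Cluster 2-D points into k groups with a plain k-means loop."""
        rng = random.Random(seed)
        centroids = rng.sample(points, k)
        groups = [[] for _ in range(k)]
        for _ in range(iterations):
            # Assignment step: each point joins its nearest centroid
            groups = [[] for _ in range(k)]
            for p in points:
                d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
                groups[d.index(min(d))].append(p)
            # Update step: move each centroid to the mean of its group
            for i, g in enumerate(groups):
                if g:
                    centroids[i] = (sum(x for x, _ in g) / len(g),
                                    sum(y for _, y in g) / len(g))
        return centroids, groups

    # Toy data: two obvious blobs, e.g. two customer segments
    data = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
    centroids, groups = kmeans(data, k=2)
    print(sorted(centroids))  # one centroid per blob
    ```

    The same assignment/update loop underlies the customer-segmentation
    use case mentioned earlier, just with many more dimensions and rows.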

    Who interprets?

    A new set of tools makes it easier to do a variety of data analysis
    tasks. Some require no programming, while other tools make it
    easier to combine code, visuals, and text in the same workflow.
    They enable users who aren’t statisticians or data geeks to do data
    analysis. While most of the focus is on enabling the application of
    analytics to data sets, some tools also help users with the often
    tricky task of interpreting results. In the process, users are able
    to discern patterns and evaluate the value of data sources by
    themselves, and only call upon expert data analysts when faced with
    non-routine problems.

    http://strata.oreilly.com/2013/08/data-analysis-tools-target-non-experts.html

    Issues

    Problems with algorithms can magnify misbehaviour (e.g. selection bias)

    Privacy and security

    Anonymity: profiling of individuals

    Over-reliance on technology

    Need for skilled workers with “deep analytics” skills

    www.internetofthings.eu

    House Keeping

    Groups and group names

    Project distribution

    Weighting (65% exam : 35% CA)

    35% CA = 28% project, 7% SQL CAs (approx.)

    FINI