8/17/2019 Week 12 Student(1)
1/111
WEEK 12
Dr. A. Brennan
NOTE:
Efficient Physical DB design produces technical
specifications to be used during the DB
implementation phase
For efficient physical DB design, certain information needs to be gathered:
• Normalised relations with estimates of table volume (number of rows in each table)
• Attribute (field) definitions and possible maximum length
• Descriptions of data usage (when and where data are entered, deleted, retrieved, updated, etc.)
• Response time expectations
• Data security needs
• Backup/recovery needs
• Integrity expectations
• What DBMS technology will be used to implement the database
• What DB architecture to use
Once this information is gathered, the designer has to decide on a range of issues:
• Suitable storage format (i.e. data types) for each attribute (in order to minimise storage space and maximise data integrity)
• Grouping attributes from the logical model into physical records (denormalisation)
• File organisation (arranging similarly structured records in secondary memory for the purpose of storage, fast and efficient retrieval and update, and protection of data and its recovery after errors are found)
• Query optimisation
Physical Design
What is it?
Translate the logical description of data into technical specifications for storing and retrieving data.
Why?
Good performance, database integrity, security and recoverability.
Input and Output for Physical Design

Inputs:
• Normalised relations with estimates of table volume (number of rows in each table)
• Attribute definitions (and possible maximum length)
• Descriptions of data usage (when and where data are entered, retrieved, deleted, updated)
• Response time expectations
• Data security needs
• Backup/recovery needs
• Integrity expectations
• Description of DBMS technology

Outputs (design decisions):
• Suitable storage format (data type) for each attribute in the logical data model, in order to minimise storage space and maximise data integrity
• Grouping attributes from the logical model into physical records
• File organisation
• Selection of indexes and database architectures for storing and connecting files to efficiently retrieve related data
• Query optimisation
Data Types
• CHAR – fixed-length character
• VARCHAR2 – variable-length character
• LONG – variable-length character data (up to 2 GB)
• NUMBER – positive/negative number
• DATE – actual date
• BLOB – binary large object (good for graphics, sound clips, etc.)
Goals
Data type
Goals = minimise storage space, represent all possible values, improve data integrity, support all data manipulations.
Data integrity controls
Default value, range control (constraints/validation rules), null value control (e.g. a PK cannot be null), referential integrity (an FK must match an existing PK value, or be null).
Integrity Controls
Default value – an assumed value if no explicit value is entered for an instance of the field (this reduces data entry time and helps prevent entry errors for the most common value).
Range control – imposes allowable value limitations (constraints or validation rules). This may be a numeric lower-to-upper bound, or a set of specific values. This approach should be used with caution, since the limits may change over time.
Null value control – allows or prohibits empty fields (e.g. each primary key must have an integrity control that prohibits a null value).
Referential integrity – a form of range control (and null value allowance) for foreign-key to primary-key match-ups. It guarantees that only an existing cross-referencing value is used.
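The four controls above can be sketched with SQLite from Python. The EMPLOYEE/DEPARTMENT tables, column names and salary bound here are illustrative assumptions, not from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("""
    CREATE TABLE DEPARTMENT (
        DeptID   INTEGER PRIMARY KEY,   -- null value control: a PK cannot be null
        DeptName TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        EmpID  INTEGER PRIMARY KEY,
        Salary REAL DEFAULT 0 CHECK (Salary BETWEEN 0 AND 500000),  -- default + range control
        DeptID INTEGER REFERENCES DEPARTMENT(DeptID)                -- referential integrity
    )""")
conn.execute("INSERT INTO DEPARTMENT VALUES (1, 'Sales')")
conn.execute("INSERT INTO EMPLOYEE (EmpID, DeptID) VALUES (10, 1)")  # Salary falls back to its default

# A row pointing at a non-existent DeptID is rejected by the FK constraint.
try:
    conn.execute("INSERT INTO EMPLOYEE (EmpID, DeptID) VALUES (11, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same CREATE TABLE clauses (DEFAULT, CHECK, NOT NULL/PRIMARY KEY, REFERENCES) exist in the major DBMSs.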
Physical Records
A physical record is a group of fields stored in adjacent secondary memory locations, retrieved and written together as a unit by a particular DBMS.
Scope:
• Efficient use of secondary storage (influenced by both the size of the record and the structure of the secondary storage)
• Data processing speed
Computer operating systems read data from secondary memory in units called pages. A page is the amount of data read or written by an operating system in one operation. The blocking factor is the number of physical records per page.
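The blocking-factor arithmetic can be shown directly; the 4 KB page size and 150-byte record length below are assumed figures for illustration:

```python
# Blocking factor sketch: how many physical records fit on one page.
page_size_bytes = 4096      # assumed OS page size
record_length_bytes = 150   # assumed fixed physical record length

blocking_factor = page_size_bytes // record_length_bytes  # records per page
wasted_bytes = page_size_bytes % record_length_bytes      # slack left over on each page
print(blocking_factor, wasted_bytes)
```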
Normalization
Normalization produces a logical database design that is structurally consistent and has minimal redundancy. Normalization forces us to understand completely each attribute that has to be represented in the database. This may be the most important factor contributing to the overall success of the system.
What is Denormalization?
Denormalization is a process of transforming normalised relations into unnormalised physical record specifications.
Denormalization can also refer to a process in which we combine two relations into one new relation, where the new relation is still normalized but contains more nulls than the original relations.
Denormalization
In addition, the following factors have to be
considered:
Application specific;
Denormalization may speed up retrievals but it
slows down updates
Size of tables
Coding
Answer
Efficient data processing (the second goal of physical record design, after efficient use of storage space) in most cases dominates the design process.
The speed of data processing depends on how close together the related data are.
Benefits and Possible Problems
Benefits:
• Can improve performance (speed), due to data duplication
Problems:
• Wasted storage space
• Data integrity/consistency threats
Denormalisation – How?
• Option one: Combine attributes from several logical relations together into one physical record in order to avoid doing joins (one-to-one, many-to-many, one-to-many)
• Option two: Partition a logical relation into several physical records (multiple tables)
• Option three: Data replication, or a combination of the two options above
Denormalisation – Option 1
1. Two entities with a one-to-one relationship
Mapping
Logical Model: Normalised Relations
SELECT * FROM EMPLOYEE, PARKING
WHERE EMPLOYEE.Employee-ID = PARKING.Employee-ID
Denormalisation – Option 1
1. Two entities with a one-to-one relationship
Try this!
[ER diagram: EMPLOYEE (EmployeePPS, Name, Address) — Manages — MANAGER (ManagerID, Expertise)]
EMPLOYEE(EmployeePPS, Name, Address, ManagerID)
MANAGER(ManagerID, Expertise)

SELECT * FROM EMPLOYEE, MANAGER
WHERE EMPLOYEE.ManagerID = MANAGER.ManagerID

Resulting record: ManagerID | Expertise | EmployeePPS | Name | Address
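A minimal sketch of this Option 1 denormalisation in Python with SQLite. The combined table name EMPLOYEE_MANAGER and the sample row are illustrative assumptions:

```python
import sqlite3

# Option 1 sketch: the two normalised relations are collapsed into one
# physical record, so the EMPLOYEE/MANAGER join disappears.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE_MANAGER (   -- hypothetical denormalised table
        EmployeePPS TEXT PRIMARY KEY,
        Name        TEXT,
        Address     TEXT,
        ManagerID   TEXT,
        Expertise   TEXT              -- copied from MANAGER; may be NULL
    )""")
conn.execute("INSERT INTO EMPLOYEE_MANAGER VALUES ('123A', 'Ann', 'Cork', 'M1', 'Databases')")

# One single-table lookup replaces the two-table join:
row = conn.execute(
    "SELECT Name, Expertise FROM EMPLOYEE_MANAGER WHERE EmployeePPS = '123A'").fetchone()
print(row)
```

The cost is the redundancy: each employee row now repeats the manager's Expertise.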
Denormalisation – Option 1
2. Many-to-many relationship (associative entity) with non-key attributes
Denormalisation – Option 1
Physical Model: Denormalised Relation
Denormalisation – Option 1
3. One-to-many relationship
Logical Model: Normalised Relations Resulting from One-to-Many (1:M) Relationship
Physical Model: Denormalised Relation
Denormalisation – Option 2
Option 2: Partitioning of a logical relation into multiple tables
Horizontal partitioning – places different rows of a table into several physical files, based on common column values.
Vertical partitioning – distributes the columns of a table into several separate files, repeating the primary key in each one of them.
Vertical partitioning
CUSTOMER (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, CreditLimit, SalesTaxRate, Fax)
is split by column, repeating the primary key, into:
CUSTOMERA (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, Fax)
CUSTOMERB (CustID, CreditLimit, SalesTaxRate)
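A sketch of the vertical split above with SQLite (only a few of the slide's columns, and made-up sample data). The repeated primary key is what lets a join reconstruct the full CUSTOMER row:

```python
import sqlite3

# Vertical partitioning sketch: CUSTOMER's columns are split into two
# tables, repeating the primary key CustID in each (names as on the slide).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERA (CustID INTEGER PRIMARY KEY, FirstName TEXT, Phone TEXT)")
conn.execute("CREATE TABLE CUSTOMERB (CustID INTEGER PRIMARY KEY, CreditLimit REAL, SalesTaxRate REAL)")
conn.execute("INSERT INTO CUSTOMERA VALUES (1, 'Ann', '555-0100')")
conn.execute("INSERT INTO CUSTOMERB VALUES (1, 5000.0, 0.23)")

# The full customer row is reconstructed by joining on the repeated key:
row = conn.execute("""
    SELECT a.FirstName, b.CreditLimit
    FROM CUSTOMERA a JOIN CUSTOMERB b ON a.CustID = b.CustID
""").fetchone()
print(row)
```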
Horizontal partitioning
CUSTOMER (CustID, FirstName, MiddleName, LastName, Address1, Address2, City, County, Country, Phone, CreditLimit, SalesTaxRate, Fax)
is split by row into two tables with the same attributes:
CUSTOMERA-M (customers with last names A–M)
CUSTOMERN-Z (customers with last names N–Z)
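The horizontal split can be sketched the same way: rows are routed to one of the two tables by last-name range, and the original table is their union. Table and sample names are illustrative:

```python
import sqlite3

# Horizontal partitioning sketch: rows go to CUSTOMERA_M or CUSTOMERN_Z
# by the first letter of LastName; the full table is their UNION ALL.
conn = sqlite3.connect(":memory:")
for t in ("CUSTOMERA_M", "CUSTOMERN_Z"):
    conn.execute(f"CREATE TABLE {t} (CustID INTEGER PRIMARY KEY, LastName TEXT)")

def insert_customer(cust_id, last_name):
    table = "CUSTOMERA_M" if last_name[0].upper() <= "M" else "CUSTOMERN_Z"
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (cust_id, last_name))

insert_customer(1, "Brennan")
insert_customer(2, "O'Connell")
full = conn.execute(
    "SELECT * FROM CUSTOMERA_M UNION ALL SELECT * FROM CUSTOMERN_Z").fetchall()
print(len(full))
```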
Advantages and disadvantages of partitioning
Advantages:
• Efficiency
• Local optimisation
• Recovery
Disadvantages:
• Slow retrieval
• Complexity
• Extra space and time for updates
Denormalisation – Option 3
Option 3: Data replication, or a combination of the other two options
Data replication – the same data is purposely stored in multiple locations of the database.
Data replication improves performance by allowing multiple users to access the same data at the same time with minimum contention.
Denormalisation Disadvantages
• The potential for loss of integrity is considerable
• Additional time is required to maintain consistency automatically every time a record is inserted, updated, or deleted
• Increase in storage space resulting from the duplication
Whose responsibility?
DBMS
Database Designer
File Organisation
1. Sequential File Organisation
2. Indexed File Organisation
3. Hashed File Organisation
Sequential File Organisation
The records are stored in sequence according to a primary key value. To locate a particular record, a program must scan the file from its beginning until the desired record is located.
https://www.youtube.com/watch?v=zDzu6vka0rQ
Therefore Indexes are most useful
for…
Larger tables
Attributes which are referenced in ORDER BY or
GROUP BY clauses
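SQLite's EXPLAIN QUERY PLAN gives a quick way to see an index being chosen over a full scan. The table, column and index names here are illustrative; the same idea applies to attributes used in WHERE, ORDER BY or GROUP BY clauses:

```python
import sqlite3

# Index sketch: an equality search on an indexed attribute is answered
# via the index rather than a scan of the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (EmpID INTEGER PRIMARY KEY, LastName TEXT)")
conn.execute("CREATE INDEX idx_lastname ON EMPLOYEE(LastName)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT EmpID FROM EMPLOYEE WHERE LastName = 'Brennan'"
).fetchall()
print(plan)  # the plan detail mentions idx_lastname
```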
https://www.youtube.com/watch?v=h2d9b_nEzoA
Hashed File Organisation
The address of each record is determined using a hashing algorithm.
A hashing algorithm is a routine that converts a PK value into a record address.
A hash index table uses hashing to map a key into a location in an index, where there is a pointer to the data record matching the hash key.
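A toy illustration of the idea, not any real DBMS's hashing routine: the primary key is converted straight to a bucket address, so no scan or index walk is needed, and collisions chain within a bucket:

```python
# Hashed file organisation sketch with assumed sizes and a trivial hash.
N_BUCKETS = 8                      # assumed file size in buckets
buckets = [[] for _ in range(N_BUCKETS)]

def address(pk: int) -> int:
    return pk % N_BUCKETS          # the "hashing algorithm": PK -> record address

def store(pk, record):
    buckets[address(pk)].append((pk, record))   # collisions chain in the bucket

def fetch(pk):
    return next(r for k, r in buckets[address(pk)] if k == pk)

store(42, "record for key 42")
print(fetch(42))
```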
DB Architecture
Note
De-normalisation should only take place after a satisfactory level of normalisation has taken place.
Goal of Physical DB Design
The goal of physical DB design is to create technical specifications from the logical descriptions of data that will provide adequate data storage and performance and will ensure database integrity, security and recoverability.
DATA AND DATABASE
ADMINISTRATION
Data within the organisation
Data are a resource to be translated into
information
Data is constantly being produced and analysed to create even more data
Database use in the organisation
• Top management – strategic decision making, planning and policy
• Middle management – tactical decisions and planning
• Operational management – support company operations
[Diagram: MIS, DSS and TPS drawing on the shared database]
Management data
Two recognised roles
Data/database administration
Data administration:
• a planning and analysis function responsible for setting data policy and standards
• promoting the company's data as a competitive resource
• providing liaison support to systems analysts during application development
Database administration:
• operationally oriented
• responsible for day-to-day monitoring and management of the active database
• liaison and support during application development
Data administrator
• Data coordination – keep track of updates, responsibilities and interchange
• Data standards – e.g. naming standards
• Liaison with systems analysts and programmers, including design
• Training managers, users, developers
• Arbitration of disputes and usage authorization
• Documentation and internal publicity
• Promotion of data's competitive advantage
Database administrator
• Responsible for the day-to-day administration of the database
• Monitors performance to maximize efficiency
• Provides a central point for troubleshooting
• Monitors security and usage (audit log)
• Responsible for operational aspects of the data dictionary
• Carries out data and software maintenance
• Involved in database design
Database Administrator in DB Design
• Define conceptual schema – what data is to be held; what entities; what attributes
• Define internal schema – decide physical database design
• Liaise with users – ensure the data they need is available
• Define security needs
• Define backup and recovery
• Monitor performance – respond to changing requirements
A Summary of DBA Activities
• planning – end-user support
• organising – policies, procedures and standards
• testing – data security, privacy and integrity
• monitoring – data backup and recovery
• delivering – data distribution and use
(activities applied to the db service and db activity)
Tools for Database Administration
Information is kept about all corporate resources, including data. This "data about data" is termed metadata. The database which holds this metadata is the data dictionary.
Two types of data dictionary:
• stand-alone or passive
• integrated or active
Metadata in Access
Data Dictionary
Passive data dictionary:
• self-contained database
• all data about entities are entered into the dictionary
• requests for metadata information are run as reports and queries as necessary
Active data dictionary
Data dictionary: relationships
• Table construction – which attributes appear in which tables
• Security – which people have access to which databases or tables
• Impact of change – which programs might be affected by changes to which tables
• Physical residence – which tables or files are on which disks
• Program data requirements – which programs use which tables or files
• Responsibility – who is responsible for updating which databases or tables
Introducing a Database: Considerations
Three important aspects:
• technological: DBMS software and hardware
• managerial: administrative functions
• cultural: corporate resistance to change
Social impact of databases
Data collection is extensive
both voluntary and involuntary
Data is a commodity
DATABASE SECURITY
Security – types of threat
• Loss or corruption of data due to sabotage (external or internal)
• Loss or corruption of data due to error
• Disclosure of sensitive data
• Fraudulent manipulation of data
Threats to data security
Controlling unauthorised access
Physical access to building
Access to hardware
Monitor any unusual activity
Controlling unauthorised access
• Developing user profiles – care over decisions on what data and resources can be accessed (and type of access) for each end user
• User training and education
• Firewalls
• Encryption
• Plugging known security holes – using patches available for known problems
Developing user profiles
• Every user is given an identifier for authentication
• Users are given privileges to access data dependent on what is essential for their work (insert, update, delete)
• Most DBMSs provide an approach called Discretionary Access Control (DAC)
• The SQL standard supports DAC through the GRANT and REVOKE commands
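GRANT/REVOKE semantics can be sketched outside a DBMS with a simple privilege set; the user and object names below are made up, and a real DBMS does this internally via the SQL statements named above:

```python
# DAC sketch: privileges as (user, object, privilege) triples.
privileges = set()

def grant(user, obj, priv):
    privileges.add((user, obj, priv))       # mirrors SQL GRANT priv ON obj TO user

def revoke(user, obj, priv):
    privileges.discard((user, obj, priv))   # mirrors SQL REVOKE priv ON obj FROM user

def allowed(user, obj, priv):
    return (user, obj, priv) in privileges  # the access check at query time

grant("ann", "EMPLOYEE", "SELECT")
print(allowed("ann", "EMPLOYEE", "SELECT"))   # True
revoke("ann", "EMPLOYEE", "SELECT")
print(allowed("ann", "EMPLOYEE", "SELECT"))   # False
```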
DAC and MAC
• DAC has certain weaknesses, in that an unauthorized user can trick an authorized user into disclosing sensitive data
• An additional approach is required, called Mandatory Access Control (MAC)
• MAC is based on system-wide policies that cannot be changed by individual users
• Each database object is assigned a security class (e.g. secret)
• Each user is assigned a clearance for a security class
• Rules are imposed on reading and writing of database objects by users
• The SQL standard does not include support for MAC
Firewalls
A firewall controls network traffic.
Encryption
Encryption: encoding or scrambling data to make it unintelligible to those without the key.
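A deliberately toy XOR sketch, purely to illustrate "unintelligible without the key"; real systems use vetted ciphers such as AES, never this:

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR each data byte with the repeating key; applying the same key
    # again undoes the scrambling (XOR is its own inverse).
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"secret"
plaintext = b"salary=50000"
ciphertext = xor_cipher(plaintext, key)
print(ciphertext != plaintext)            # the stored form is scrambled
print(xor_cipher(ciphertext, key))        # the same key recovers the data
```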
Controlling loss of DP facilities
Redundancy
Virus protection
Disaster protection
Minimise error
Alert network managers to problems
Minor disruptions require on-going monitoring
Protect against error
• Educate all employees
• Reminders to save
• Should you overwrite existing files? Incorporate safety nets on deletion
• Include integrity checks on data: validation, cross checking, range checks, hash totals, check digits, batch totals
Software Invasion
Cruise virus
• attacks for profit
• exploits the network's weakest link – you
• attacks through the public domain
• waits to reach its target
• reports successful penetration
• delivers payload
Stealth virus
• encrypts and hides its tracks
Worm
• makes copies of itself
• transmits copies to other machines
• difficult to access to disable
Trojan horse
• looks like something else
• once launched, too late!
Trapdoor
• simulates regular entry, or bypasses normal security procedures
• difficult to detect that it has been run
Logic bomb
• event driven
Protecting against virus attacks
• Prepare a company policy on viruses
• Educate on the destructive power of viruses
• Control the source of software purchasing
• Ensure new or upgraded software is installed by the system administrator on a quarantined machine
• Control use of bulletin boards
• Install anti-virus software where necessary
• Make regular back-ups once software is opened – data and programs separately; store back-up copies off-site
• Be aware of software holes in systems software
How security can be compromised
Poor security management
Poor connections to the outside world
Shoddy system control
Human folly
Lack of security ethic
And the answer is:
Education
DISTRIBUTED DATABASE MANAGEMENT SYSTEMS
Distributed databases
Distributed database – a logically interrelated collection of shared data (and a description of this data) physically distributed over a computer network.
Distributed DBMS (DDBMS) – the software system that permits the management of the distributed database and makes the distribution transparent to users:
• must perform all the functions of a centralized DBMS
• must handle all necessary functions imposed by the distribution of data and processing
Distributed processing/database
Distributed processing:
• shares data processing chores over sites using a communications network
• the database resides at one site only
Distributed database:
• each site has a data fragment, which might be replicated at other sites
• requires distributed processing
DDBMS
Advantages:
• Reflects organisational structure
• Faster data access and processing
• Improved communications in the organisation
• Reduced operating costs
• Improved share-ability and local autonomy
• Less danger of single-point failure
• Modular growth easier
Disadvantages:
• Complexity of management and control
• Security
• Integrity control more difficult
• Lack of standard comms. protocols for distributed databases
• Increased training costs
• Database design more complex
Characteristics of a DDBMS
• A collection of logically related shared data
• The data is split into a number of fragments
• Fragments may be replicated
• Fragments/replicas are allocated to sites
• Sites are linked by a communications network
• Data at each site is under the control of a DBMS
• The DBMS at each site can handle local applications autonomously
• Each DBMS participates in at least one global application
DDBMS features
• Application interface – to interact with end users or application programs and with other DBMSs
• Validation – to analyse data requests
• Transformation – to determine which data requests are distributed and which are local
• Query optimization – to find the best access strategy
• Mapping – to determine the location of fragments
• I/O interface
• Formatting – to prepare data for presentation
Distributed database design
Data fragmentation (divide):
• need to decide how to split the data into fragments
OR
Data replication (copy):
• a copy of a fragment (or all of it) may be held at several sites
THEN
Data allocation:
• need to decide where to locate those fragments and replicas; each fragment is stored at the site with "optimal distribution"
Data fragmentation
• Users work with views, so it is appropriate to work with subsets of data
• Cheaper to store data closest to where it is used
• May give reduced performance for global applications
• Integrity control may be difficult if data and functional dependencies are at different sites
• Data fragmentation must be done carefully
Data fragmentation
• Breaks a single object into two or more segments or fragments
• Each fragment can be stored at any site over a computer network
• Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the transaction processor
Strategies for fragmentation
For successful fragmentation, we must ensure:
• completeness: each data item must appear in at least one fragment
• reconstruction: it should be possible to define a relational operation that will reconstruct the relation from its fragments
• disjointness: a data item appearing in one fragment should not appear in another
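The three rules can be checked mechanically over toy horizontal fragments of a relation; the customer rows below are made-up illustrations:

```python
# Fragmentation rules sketch: a relation split into two row fragments.
relation = {1: "Ann", 2: "Bob", 3: "Cara"}
fragment_a = {1: "Ann", 2: "Bob"}
fragment_b = {3: "Cara"}
fragments = [fragment_a, fragment_b]

# completeness: every item of the relation appears in some fragment
complete = all(any(k in f for f in fragments) for k in relation)
# reconstruction: a relational operation (here, union) rebuilds the relation
reconstructed = {**fragment_a, **fragment_b}
# disjointness: no item appears in more than one fragment
disjoint = not (fragment_a.keys() & fragment_b.keys())
print(complete, reconstructed == relation, disjoint)
```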
Strategies for data fragmentation
• Horizontal fragmentation – division of a relation into subsets (fragments) based on tuples (rows)
• Vertical fragmentation – division of a relation into attribute (column) subsets
• Mixed fragmentation – a combination of both
Data replication
• Storage of data copies at multiple sites served by a computer network
• Fragment copies can be stored at several sites to serve specific information requirements
• Can enhance data availability and response time
• Can help to reduce communication and total query costs
Data replication
Fully replicated database:
• stores multiple copies of each database fragment at multiple sites
• can be impractical due to the amount of overhead
Partially replicated database:
• stores multiple copies of some database fragments at multiple sites
• most DDBMSs are able to handle the partially replicated database well
Unreplicated database:
• stores each database fragment at a single site
• no duplicate database fragments
Data allocation
Data allocation is closely related to the way the database is fragmented; it leads to decisions on which data is stored where.
• Centralized – the entire database is stored at one site
• Partitioned/fragmented – the database is divided into several fragments and stored at several sites
• Replicated – copies of one or more database fragments (selective replication) are stored at several sites
Strategies for data allocation
BIG DATA, SMALL DATA
Big Data
Big data is the term for data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
We are collecting more data than ever:
• electronics enables us to do so (RFID)
• storage is cheap
We have streamlined our processes through normal channels:
• computing has enabled us to improve what we do, and businesses are looking for new ways to have a competitive edge
By looking at patterns in this data we can find out useful things.
From a McKinsey report …
$600 to buy a disk which can store all of the world's music
Internet of Things
• Ubiquitous broadband
• Reduction in connectivity costs
• RFID enables unique addressability
Increasingly, we are including sensors in everyday objects. These often have communicative capacity and link to their source through the internet.
Use of Big Data
We can gain additional information derivable from analysis of a single large set of related data (rather than a large number of small sets).
Correlations can be found which "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions".
The business case (McKinsey)
1. Big data can unlock significant value by making information transparent and usable at much higher frequency.
2. Organizations can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time.
The business case (McKinsey)
3. Big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services.
4. Sophisticated analytics can substantially improve decision-making.
5. Big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).
http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
Big Data in Health
Big data is enabling a new understanding of the molecular biology of cancer. The focus has changed over the last 20 years from the location of the tumor in the body (e.g., breast, colon or blood) to the effect of the individual's genetics, especially the genetics of that individual's cancer cells, on her response to treatment and sensitivity to side effects. For example, researchers have to date identified four distinct cell genotypes of breast cancer; identifying the cancer genotype allows the oncologist to prescribe the most effective available drug first.
http://strata.oreilly.com/2013/08/cancer-and-clinical-trials-the-role-of-big-data-in-personalizing-the-health-experience.html
Big Data in banking
IBM's Watson can do analysis with "unstructured data" such as that found in e-mails, news reports, books and websites. Citigroup has hired Watson to help it decide what new products and services to offer its customers, and to try to cut down on fraud and look for signs of customers becoming less creditworthy. In most financial institutions the immediate use of big data is in containing fraud and complying with rules on money-laundering and sanctions.
• Big credit card companies are getting better at recognising patterns
• Solutions are getting cheaper – even for smaller banks
• Banks also use the data to sell products (e.g. insurance) by looking at the type of transactions customers make
http://www.economist.com/node/21554743
Some geospatial uses
The Climate Corporation, an insurance company, combines modern Big Data techniques, climatology and agronomics to analyse the weather's complex and multi-layered behaviour to help the world's farmers adapt to climate change.
McLaren's Formula One racing team uses Big Data to identify issues with its racing cars using predictive analytics, and takes corrective actions pro-actively. They spend 5% of their budget on telemetry. An F1 car is fitted with about 130 sensors. In addition to the engine sensors, video and GPS are used to work out the best line to take through each bend. The sensor data is helping in traffic smoothing, energy-optimising analysis and driver's direction determination. E.g. new Pirelli tyres this year meant teams had to watch for tyre wear, grip, and temperature under different weather conditions and tracks, relating all that to driver acceleration, braking and steering.
Some geospatial uses
Vestas Wind Systems is implementing a big data solution that is significantly reducing data processing time and helping to predict weather patterns at potential sites faster and more accurately, to increase turbine energy production. They currently store 2.8 petabytes in a wind library covering over 178 parameters, such as temperature, barometric pressure, humidity, precipitation, wind direction and wind velocity from the ground level up to 300 feet.
Nokia needed a technology solution to support the collection, storage and analysis of virtually unlimited data types and volumes. They leverage data processing and complex analyses in order to build maps with predictive traffic and layered elevation models, to source information about points of interest around the world, to understand the quality of phones and more. www.geospatialworld.net
More geospatial uses
US Xpress, a transportation solutions company, collects about a thousand data elements ranging from fuel usage to tyre condition to truck engine operations to GPS information, and uses this for optimal fleet management and to drive productivity, saving millions of dollars in operating costs. When an order is dispatched, it is tracked using an in-cab system installed on a DriverTech tablet with speech recognition capability. US Xpress constantly connects to the devices to monitor progress of the lorry. The video camera on the device could be used to check if the driver is nodding off. All the data collected is analysed in real time using geospatial data, integrated with driver data and truck telematics. They can minimise delays and ensure trucks are not left waiting when they arrive at a depot for maintenance. www.geospatialworld.net
Big data in the university
• Huddersfield University – linked library data to identify learning styles; now including lecture attendance records
• Purdue University, Indiana – when a student logs into a course website, they see a traffic light signal (and advice on how to move to green)
• University of Derby – VLE use, sports, car parking
• Loughborough University – analyses staff-student interaction
www.theguardian.com/education/2013/aug/05
Role of Cloud Computing
• Enables easier gathering, storage and processing of Big Data
• Cloud computing provides accessibility any time, any place
• Large-scale data gathering is possible from multiple locations
• Sharing of data is easier
• Large-scale storage
• Processing power is also available, with virtual machine provision to analyse data
• Can be utilised on an ad-hoc basis
Analysing Big Data
• Data mining – a blend of applied statistics and artificial intelligence: neural networks, cluster analysis, genetic algorithms, decision trees, support vector machines
• Analytics
• Machine learning
• Visualisation – interactive rather than static graphs help to understand patterns
• Shift of skills to digital analysis and visualisation techniques
Who interprets?
A new set of tools makes it easier to do a variety of data analysis tasks. Some require no programming, while other tools make it easier to combine code, visuals, and text in the same workflow. They enable users who aren't statisticians or data geeks to do data analysis. While most of the focus is on enabling the application of analytics to data sets, some tools also help users with the often tricky task of interpreting results. In the process users are able to discern patterns and evaluate the value of data sources by themselves, and only call upon expert data analysts when faced with non-routine problems.
http://strata.oreilly.com/2013/08/data-analysis-tools-target-non-experts.html
Issues
• Problems with algorithms can magnify misbehaviour (e.g. selection bias)
• Privacy and security – anonymity; profiling individuals
• Over-reliance on technology
• Need for skilled workers with "deep analytics" skills
www.internetofthings.eu
House Keeping
• Groups and group names
• Project distribution
• Weighting (65% exam : 35% CA)
• 35% CA = 28% project, 7% SQL CAs (approx.)
FINI