04/12/23 Data Warehousing 1
Data Warehousing
Lectures based on material from
Phil Trinder (HW)
Monica Farrow
email : [email protected]
04/12/23 06/30/08 Data Warehousing 1.2
Data Warehouse
Two definitions:
  “A data warehouse is a copy of transaction data specifically structured for querying and reporting.”
    Data Warehousing Information Center http://www.dwinfocenter.org/defined.html
  A data warehouse is a specialised database to support strategic decision making
Decision making involves:
  Analysing the problem, e.g.
    Why are my sales not meeting my targets?
    What products are not meeting their targets?
    What are the trends for the failing products?
  Generating alternative solutions, evaluating them, and choosing the best
Decision Support Systems
These are used by management to make strategic or policy decisions
They have existed for a long time
Characteristics
  Aimed at loosely specified problems
  Combine models and analytical approaches with data retrieval
  Good usability for non-specialist use
  Flexible: to support multiple decision-making approaches
A wine club example
100,000 members, 2000 wines, 150 suppliers, 750,000 orders per year
Systems: storage technology
  Member administration: indexed sequential files
  Stock control: relational database
  Order processing: relational database
  Despatch: proprietary database
Wine Club Operational Schema
[Figure: entity-relationship diagram. Entities: Member, MemberOrder, OrderItem, Stock, Wine, Supplier. Relationships: places, on, supplies, in, is for]
Wine Club Questions
Competitors have moved in. Is our market share falling?
What products are increasing/decreasing in popularity?
Which products are seasonal?
Which members place regular orders?
Are some products more popular in certain parts of the country?
Which members concentrate on particular products?
Strategic vs Operational Issues
Strategic*: planning and policy making, long term and broad brush, higher levels of management, e.g.
  When to launch a new product?
  What would be the effect of closing the Edinburgh branch?
Operational: day-to-day running of business. Detailed and immediate, lower levels of management
  Which items are out of stock?
  What is the status of order 34522?
*Here, ‘strategic’ is in the management context, not executive
Motivation for data warehousing
Operational data is not suitable to guide strategic decisions
  Some of the data is not relevant
  Data may be archived regularly once it is not regularly required
Need to examine trends
  What is happening over time?
  Queries over time may significantly affect the speed of operational processing
Solution: record sales on a regular basis, separate from the operational system, and analyse them
  This is the start of a warehouse
Data warehouse characteristics
Subject-oriented, e.g. sales
Non-volatile – no alteration to records once they are added
  Whereas in operational processing, records will frequently be updated (e.g. alteration to prices, quantity etc)
Integrated: data from multiple (operational) sources are accumulated in an integrated format
  E.g. wine club has >1 operational db
Time variant: data is recorded against time to allow trend analysis
Data warehouse characteristics continued
Records are extracted to make future querying easy. Therefore
  There is likely to be some data duplication, including storage of derived data (data obtained from calculations and aggregations)
  There will be fewer joins and more indexes than in a well-designed operational database.
The data warehouse will be larger than the corresponding operational database
Data in operational databases will be archived periodically, whereas a data warehouse keeps data for years to allow trend analysis.
Warehouse construction
[Figure: warehouse construction stages. Source 1 ... Source n, Extraction, Integration, DBMS, Aggregate Navigators, Presentation]
Now we have a look at each stage in warehouse construction:
Extraction
Retrieve data from all data sources: files, databases etc
The process to extract data will be an add-on to the existing operational system. For example,
  Day-end extraction run
  When a sale is recorded, this triggers extraction of the sale data
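The trigger-based option can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names order_item and sales_extract, their columns, and the sample rows are assumptions, not part of the wine club systems described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table and extraction target (names are assumptions).
    CREATE TABLE order_item (order_id INTEGER, winecode TEXT, qty INTEGER, sold_on TEXT);
    CREATE TABLE sales_extract (order_id INTEGER, winecode TEXT, qty INTEGER, sold_on TEXT);

    -- Recording a sale fires the trigger, which copies the sale data.
    CREATE TRIGGER extract_sale AFTER INSERT ON order_item
    BEGIN
        INSERT INTO sales_extract
        VALUES (NEW.order_id, NEW.winecode, NEW.qty, NEW.sold_on);
    END;
""")

# Two made-up sales recorded by the operational system.
conn.execute("INSERT INTO order_item VALUES (1, 'W01', 6, '2008-01-15')")
conn.execute("INSERT INTO order_item VALUES (2, 'W02', 12, '2008-01-15')")

extracted = conn.execute(
    "SELECT order_id, winecode, qty FROM sales_extract ORDER BY order_id"
).fetchall()
print(extracted)  # [(1, 'W01', 6), (2, 'W02', 12)]
```

A day-end run would instead copy all of the day's rows in one batch, which keeps the extra work away from peak operational hours.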
Integration
When data is extracted from different sources, integration may be required:
  Format integration, similar to type mismatch
    Examples: gender ‘male’, ‘female’; ‘M’, ‘F’; 0 and 1
  Semantic integration: does a word have the same meaning in all the data being integrated?
    Example – a ‘sale’ means:
      order processing: order received
      stock control: extracted items from physical warehouse
      despatch: goods shipped
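Format integration for the gender example might look like the sketch below. The canonical codes 'M' and 'F', the function name normalise_gender, and above all the meaning given to 0 and 1 are assumptions; in practice the meaning of each source's codes has to be confirmed with the owners of that system.

```python
# Assumed canonical codes: 'M' and 'F'.
# The mapping of 0 and 1 is a guess for illustration only.
GENDER_MAP = {
    "male": "M", "female": "F",
    "m": "M", "f": "F",
    "0": "M", "1": "F",
}

def normalise_gender(value):
    """Return the canonical gender code, or None if the value is not recognised."""
    key = str(value).strip().lower()
    return GENDER_MAP.get(key)

# Values as they might arrive from three different source systems.
records = ["male", "F", 1, "Female", 0, "unknown"]
normalised = [normalise_gender(v) for v in records]
print(normalised)  # ['M', 'F', 'F', 'F', 'M', None]
```

Unrecognised values are returned as None rather than guessed, so that they can be reported and corrected at the source. Semantic integration (what counts as a ‘sale’) cannot be solved by a lookup table like this; it needs an agreed definition.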
Data Warehouse design: dimensional analysis
Dimensional analysis is used to identify the requirements of the warehouse
What are the aspects of the data that are strategically important? e.g.
  Member
  Product (wine)
  Time (always)
We don’t know in advance exactly what the queries will be!
3 dimensions example
[Figure: data cube with three dimensions]
  PRODUCT: Macon, Chablis, Merlot, Chardonnay
  TIME: Q3 2007, Q4 2007, Q1 2008
  MEMBER: Smith, Jones, Bloggs
Star Schema
A star schema is one of the simplest designs for a data warehouse.
  A central fact table, containing all the main information, is the centre of the star
  Smaller dimension tables, containing look-up information for attributes in the fact table, at the points.
[Figure: star with the central fact table SALES at the centre and the dimension tables Wine, Member and Time at the points]
Star Schema Design for DB
SALES (central fact table): winecode, membercode, timecode, quantity, cost
Wine: winecode, winename, vintage, description, price
Member: membercode, membername, memberAddress
Time: timecode, date, periodno, quarterno, year
Warehouse Database
Centre of star schema becomes a relation: the fact table – numeric facts and foreign keys
  Sales(membercode, winecode, timecode, qty, itemcost)
Each dimension becomes a relation: a dimension table
  Member(membercode, membername, memberaddress)
  Wine(winecode, winename, vintage, description, price)
There is ALWAYS a time dimension table
  This includes period and quarter details, since they are frequently used in queries
  Time(timecode, date, periodno, quarterno, year)
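As a rough sketch, the four relations could be declared as follows, here using SQLite through Python's sqlite3 module. The column types and key constraints are assumptions, since the slides give only attribute names, and the wine name column is called winename to match the design diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per member, wine, or time point.
    CREATE TABLE member (membercode INTEGER PRIMARY KEY, membername TEXT,
                         memberaddress TEXT);
    CREATE TABLE wine (winecode INTEGER PRIMARY KEY, winename TEXT, vintage INTEGER,
                       description TEXT, price REAL);
    CREATE TABLE time (timecode INTEGER PRIMARY KEY, date TEXT, periodno INTEGER,
                       quarterno INTEGER, year INTEGER);

    -- Fact table: foreign keys to each dimension plus the numeric facts.
    CREATE TABLE sales (
        membercode INTEGER REFERENCES member(membercode),
        winecode   INTEGER REFERENCES wine(winecode),
        timecode   INTEGER REFERENCES time(timecode),
        qty        INTEGER,
        itemcost   REAL
    );
""")

tables = sorted(row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"))
fact_columns = [row[1] for row in conn.execute("PRAGMA table_info(sales)")]
print(tables)        # ['member', 'sales', 'time', 'wine']
print(fact_columns)  # ['membercode', 'winecode', 'timecode', 'qty', 'itemcost']
```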
Using the Warehouse
The strategic questions can now be investigated using data extracted by SQL queries
For example, to discover which wines have increasing and decreasing sales, we can retrieve a table giving the total sales for each wine against time:
  SELECT w.winename, t.periodno, SUM(s.qty)
  FROM sales s, wine w, time t
  WHERE s.winecode = w.winecode AND s.timecode = t.timecode
  GROUP BY w.winename, t.periodno
  ORDER BY w.winename, t.periodno
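To see what the trend query returns, it can be run against a tiny in-memory database. The rows below are invented purely for illustration, the tables are cut down to the columns the query needs, and the period column is taken to be periodno as in the Time relation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down versions of the warehouse tables with made-up sample rows.
    CREATE TABLE wine (winecode INTEGER PRIMARY KEY, winename TEXT);
    CREATE TABLE time (timecode INTEGER PRIMARY KEY, periodno INTEGER);
    CREATE TABLE sales (winecode INTEGER, timecode INTEGER, qty INTEGER);

    INSERT INTO wine VALUES (1, 'Chablis'), (2, 'Merlot');
    INSERT INTO time VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO sales VALUES (1, 10, 6), (1, 11, 6), (1, 20, 24), (2, 10, 12), (2, 20, 6);
""")

# Total quantity sold for each wine in each period.
trend = conn.execute("""
    SELECT w.winename, t.periodno, SUM(s.qty)
    FROM sales s, wine w, time t
    WHERE s.winecode = w.winecode AND s.timecode = t.timecode
    GROUP BY w.winename, t.periodno
    ORDER BY w.winename, t.periodno
""").fetchall()

for winename, periodno, total in trend:
    print(winename, periodno, total)
# Chablis 1 12
# Chablis 2 24
# Merlot 1 12
# Merlot 2 6
```

In this sample, Chablis sales rise from period 1 to period 2 while Merlot sales fall, which is the kind of trend the query is meant to expose.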
Indexes
Usually a lot of indexes will be created, to make queries more efficient
  An index helps speed up retrieval.
  A column that is frequently referred to in the WHERE clause is a potential candidate for indexing.
Diagrams of the 2 most commonly used indexes in data warehousing are shown on the next slides:
  Indexes may be based on the B-Tree
  Also bitmap indexes are widely used
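A small sketch of indexing a column that appears in WHERE clauses, using SQLite (whose ordinary indexes are B-trees). The index name idx_sales_winecode and the choice of winecode as the frequently filtered column are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (membercode INTEGER, winecode INTEGER, timecode INTEGER,
                        qty INTEGER, itemcost REAL);
    -- winecode is assumed to appear often in WHERE clauses, so it gets an index.
    CREATE INDEX idx_sales_winecode ON sales (winecode);
""")

index_names = [row[0] for row in
               conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]

# Ask SQLite how it would run a query that filters on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(qty) FROM sales WHERE winecode = 1"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(index_names)
print(plan_text)
```

The exact wording of the plan differs between SQLite versions, but it should name idx_sales_winecode, showing that the filter is answered through the index rather than by scanning the whole fact table.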
B-tree index
Bitmap indexes
An example on the next slide
For each value of a domain, there is a bitmap identifying the row ids of satisfying tuples
  1 if a match, 0 otherwise
Usually applied to attributes with a sparse domain (few distinct values)
  In Oracle, <100 distinct values
  E.g. bitmaps for all tuples with sex = male and for sex = female
Updating a bitmap takes a lot of time, so use for tables with hardly any updates, inserts, deletes
  Ideal for data warehousing
Bitmap indexes example
The first table is a table about Sailors
The second table shows a bitmap index for the rating attribute, assuming values are only from 1-3
  There is a row in the bitmap index for each row in the Sailor table
  Column headings in the index are the values in the rating column

SAILORS
Id  Rating  etc
22  1       Other data
23  2       Other data
31  3       Other data
35  1       Other data

Bitmap index
1  2  3
1  0  0
0  1  0
0  0  1
1  0  0
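The idea can be sketched in a few lines of Python using the Sailors rows above. The function name build_bitmap_index is hypothetical, and real systems store compressed bit vectors rather than Python lists. Note that this sketch stores one bitmap per rating value, i.e. one column of the index table per dictionary entry.

```python
# Rows of the Sailors table: (id, rating).
sailors = [(22, 1), (23, 2), (31, 3), (35, 1)]

def build_bitmap_index(rows, column):
    """Map each distinct value to a list of bits, one bit per row (1 = match)."""
    values = sorted({row[column] for row in rows})
    return {v: [1 if row[column] == v else 0 for row in rows] for v in values}

rating_index = build_bitmap_index(sailors, column=1)
print(rating_index)
# {1: [1, 0, 0, 1], 2: [0, 1, 0, 0], 3: [0, 0, 1, 0]}

# rating = 1 OR rating = 3: combine the two bitmaps with a bitwise OR.
mask = [a | b for a, b in zip(rating_index[1], rating_index[3])]
matching_ids = [row[0] for row, bit in zip(sailors, mask) if bit]
print(matching_ids)  # [22, 31, 35]
```

Conditions on several indexed attributes can be combined in the same way with bitwise AND and OR before any table rows are read, which is a large part of why bitmap indexes suit warehouse queries.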
Materialised views and Aggregation
Data warehouses grow continuously, and may become very large indeed
Problems: the time to compute a query and the size of the result can be very large indeed
Solution: materialised views and aggregation
A materialised view is a stored pre-computed table, used to prevent frequent use of time-consuming joins and calculations
Aggregates
Basic idea: sacrifice detail to reduce the size of the data
Store precomputed tables at a useful level of detail, consisting of commonly used sums, counts etc.
Must be carefully selected, e.g.
  Sales to each member of each wine summed for each quarter
  Sales of each wine summed for each quarter for each month
Levels of aggregation
  None (i.e. detail)
  Light (e.g. monthly)
  Highly (e.g. quarterly)
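A quarterly aggregate could be precomputed along the following lines. SQLite has no materialised views, so this sketch stores the result with CREATE TABLE ... AS SELECT instead; the table name sales_by_wine_quarter and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down warehouse tables with made-up sample rows.
    CREATE TABLE time (timecode INTEGER PRIMARY KEY, quarterno INTEGER, year INTEGER);
    CREATE TABLE sales (winecode INTEGER, timecode INTEGER, qty INTEGER);

    INSERT INTO time VALUES (1, 3, 2007), (2, 3, 2007), (3, 4, 2007);
    INSERT INTO sales VALUES (1, 1, 6), (1, 2, 12), (1, 3, 6), (2, 3, 24);

    -- Precompute once; later queries read this small table, not the fact table.
    CREATE TABLE sales_by_wine_quarter AS
        SELECT s.winecode, t.year, t.quarterno, SUM(s.qty) AS total_qty
        FROM sales s JOIN time t ON s.timecode = t.timecode
        GROUP BY s.winecode, t.year, t.quarterno;
""")

aggregate = conn.execute("""
    SELECT winecode, year, quarterno, total_qty
    FROM sales_by_wine_quarter
    ORDER BY winecode, year, quarterno
""").fetchall()
print(aggregate)  # [(1, 2007, 3, 18), (1, 2007, 4, 6), (2, 2007, 4, 24)]
```

Four detail rows collapse to three aggregate rows here; on a real fact table the reduction is far larger, at the cost of losing the individual sales and having to refresh the aggregate when new data is loaded.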
Aggregate navigator
An aggregate navigator uses information about available aggregates to automatically rewrite queries to use them
It also records aggregates usage, so that unused aggregates can be removed
It can suggest useful new aggregates
  E.g. a frequent query is based on the number of wines sold per month in a range of price bands. This is suggested as a new aggregate
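The core decision of an aggregate navigator can be illustrated with a toy sketch. It only chooses which table a query should read and counts usage; a real navigator also rewrites the SQL. The catalogue contents, row counts, and the name choose_table are all invented for illustration.

```python
from collections import Counter

# Hypothetical catalogue: table name -> (available columns, approximate row count).
AGGREGATES = {
    "sales": ({"winecode", "membercode", "timecode",
               "periodno", "quarterno", "year"}, 750_000),
    "sales_by_wine_month": ({"winecode", "periodno", "year"}, 24_000),
    "sales_by_wine_quarter": ({"winecode", "quarterno", "year"}, 8_000),
}
usage = Counter()

def choose_table(needed_columns):
    """Pick the smallest table whose columns cover the query, and log its use."""
    candidates = [(rows, name) for name, (cols, rows) in AGGREGATES.items()
                  if set(needed_columns) <= cols]
    rows, name = min(candidates)
    usage[name] += 1
    return name

first = choose_table({"winecode", "quarterno", "year"})
second = choose_table({"membercode", "year"})
print(first)   # sales_by_wine_quarter
print(second)  # sales

# Aggregates that were never chosen are candidates for removal.
unused = [name for name in AGGREGATES if usage[name] == 0]
print(unused)  # ['sales_by_wine_month']
```

The same usage log could drive suggestions: column sets that are requested often but can only be answered from the detail table point to a missing aggregate.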
Presentation requirements
Must be easy to use
Visualise the results of queries in many ways, e.g. charts, graphs, scatter diagrams etc
Make good use of colour and dimensions: 2D, 2.5D, 3D, animation
  [Figure: example of 2.5D graph]
Have analysis tools: statistical and curve fitting
For example the product sales trend table would be plotted as a graph
OLAP
Online Analytical Processing (OLAP) uses multidimensional analysis of the data
Allows users to get summaries and find answers to known questions
  What is the average profit month by month?
  If we increased sales by 10%, what would the effect be?
Data mining
Data mining is the extraction of hidden predictive information from large databases
  E.g. what’s likely to happen to sales next March and why?
The actual techniques for data mining are not covered in this course.
Data mining is usually based on the data in a data warehouse, and ideally data mining tools are integrated with the data warehouse.
Data Mining provides the Enterprise with intelligence and Data Warehousing provides the Enterprise with a memory.
Summary
A data warehouse is a specialised database to enable efficient and straightforward production of reports to support strategic decision making.
It contains a copy of the operational data, often integrated from >1 source. Records, once added, are not altered. The central fact table in a star schema design will be very large.
Discussion/Exercise
A company sells garden trees from several stores located around the country. People visit the store, and buy trees. The names of the customers are always recorded, and many customers place repeat orders.
The company would like to set up a data warehouse so that they can analyse details such as
  Frequency of sales per customer
  Which store has the best sales, ranked by month
  Top selling tree by month
  Etc
Create a suitable star schema, inventing appropriate attributes