Topic for Thursday? Miscellaneous Topics in Databases


PARALLEL DBMS

WHY PARALLEL ACCESS TO DATA?

• Scanning 1 Terabyte at 10 MB/s takes about 1.2 days.
• With 1,000 x parallelism, scanning the same 1 Terabyte takes about 1.5 minutes.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
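The scan-time figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a decimal terabyte (10^12 bytes), a decimal megabyte, and ideal scaling with no coordination overhead; the function name `scan_seconds` is illustrative only.

```python
# Assumptions: 1 TB = 10**12 bytes, 10 MB/s = 10**7 bytes/s, perfect scaling.
TERABYTE = 10**12
BANDWIDTH = 10 * 10**6

def scan_seconds(size_bytes, bandwidth, degree=1):
    """Ideal scan time when `degree` devices each read an equal share."""
    return size_bytes / (bandwidth * degree)

serial = scan_seconds(TERABYTE, BANDWIDTH)
parallel = scan_seconds(TERABYTE, BANDWIDTH, degree=1000)

print(f"serial scan:    {serial / 86400:.2f} days")
print(f"1,000-way scan: {parallel / 60:.2f} minutes")
```

Under these assumptions the serial scan takes 100,000 seconds and the parallel scan 100 seconds, which is in line with the rounded figures on the slide.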

PARALLEL DBMS: INTRO

• Parallelism is natural to DBMS processing
  • Pipeline parallelism: many machines each doing one step in a multi-step process.
  • Partition parallelism: many machines doing the same thing to different pieces of data.
  • Both are natural in DBMS!

[Figure: pipeline parallelism chains sequential programs; partition parallelism runs copies of a sequential program side by side, with outputs split N ways and inputs merged.]

SOME || TERMINOLOGY

• Speed-Up: more resources means proportionally less time for a given amount of data.
• Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
• Why Realistic <> Ideal?

[Figure: two plots against degree of ||-ism, one of Xact/sec. (throughput) and one of sec./Xact (response time), each with an Ideal and a Realistic curve.]
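The two terms can be expressed as simple ratios. The sketch below uses one common textbook formulation (elapsed time on the smaller system divided by elapsed time on the larger one); the function names and the timings are hypothetical, not measurements from the slides.

```python
def speed_up(time_small_system, time_large_system):
    """Same workload on both systems. Ideal value equals the resource ratio N."""
    return time_small_system / time_large_system

def scale_up(time_small_job_small_system, time_large_job_large_system):
    """Workload and resources both grow N-fold. Ideal value is 1.0."""
    return time_small_job_small_system / time_large_job_large_system

# Hypothetical timings, in seconds, for a system grown from 1 to 4 nodes.
ideal = speed_up(100.0, 25.0)        # 4 nodes, 4x faster: linear speed-up
realistic = speed_up(100.0, 40.0)    # overheads eat into the gain
constant = scale_up(100.0, 100.0)    # 4x data on 4x nodes, same time

print(ideal, realistic, constant)
```

A realistic curve falls below the ideal one because of costs such as start-up, interference between workers, and skew in how the work is divided.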

INTRODUCTION

• Parallel machines are becoming quite common and affordable
  • Prices of microprocessors, memory and disks have dropped sharply
  • Recent desktop computers feature multiple processors and this trend is projected to accelerate
• Databases are growing increasingly large
  • Large volumes of transaction data are collected and stored for later analysis
  • Multimedia objects like images are increasingly stored in databases
• Large-scale parallel database systems are increasingly used for:
  • Storing large volumes of data
  • Processing time-consuming decision-support queries
  • Providing high throughput for transaction processing


Google data centers around the world, as of 2008

PARALLELISM IN DATABASES

• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel
  • Data can be partitioned and each processor can work independently on its own partition
  • Results are merged when done
• Different queries can be run in parallel with each other. Concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

PARTITIONING

• Horizontal partitioning (shard)
  • Involves putting different rows into different tables
  • Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest
• Vertical partitioning
  • Involves creating tables with fewer columns and using additional tables to store the remaining columns
  • Partitions columns even when already normalized
  • Called "row splitting" (the row is split by its columns)
  • Ex: split (slow to find) dynamic data from (fast to find) static data in a table where the dynamic data is not used as often as the static
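The ZIP-code example of horizontal partitioning can be sketched as a routing function. The table names come from the slide; the customer rows and the helper name `partition_for` are made up for illustration.

```python
def partition_for(zip_code: str) -> str:
    """Route a row to a table using the ZIP-code rule from the example."""
    return "CustomersEast" if int(zip_code) < 50000 else "CustomersWest"

# Hypothetical customer rows: (name, ZIP code).
customers = [("Ann", "02139"), ("Bob", "50000"), ("Cy", "94105")]

shards = {"CustomersEast": [], "CustomersWest": []}
for name, zip_code in customers:
    shards[partition_for(zip_code)].append((name, zip_code))

print(shards)
```

Every row lands in exactly one table, so a query restricted to one ZIP range only has to touch one partition.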

COMPARISON OF PARTITIONING TECHNIQUES

Evaluate how well partitioning techniques support the following types of data access:

1. Scanning the entire relation.
2. Locating a tuple associatively (point queries). E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.


HANDLING SKEW USING HISTOGRAMS

Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion

Assume uniform distribution within each range of the histogram

Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
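One way to build a balanced partitioning vector from a histogram is sketched below. It assumes a non-empty histogram given as ascending bucket boundaries plus a tuple count per bucket, and it interpolates linearly inside a bucket, which is the uniform-distribution assumption stated above. The function name and the sample histogram are hypothetical.

```python
def partition_vector(boundaries, counts, n_partitions):
    """Return n_partitions - 1 cut points that split the tuples evenly.

    boundaries: ascending bucket edges, len(counts) + 1 values.
    counts: number of tuples in each bucket (at least one tuple overall).
    """
    total = sum(counts)
    cuts = []
    cumulative = 0   # tuples in buckets fully to the left of `bucket`
    bucket = 0
    for k in range(1, n_partitions):
        target = total * k / n_partitions
        # Advance to the bucket that contains the target-th tuple.
        while cumulative + counts[bucket] < target:
            cumulative += counts[bucket]
            bucket += 1
        lo, hi = boundaries[bucket], boundaries[bucket + 1]
        # Uniformity within the bucket lets us interpolate linearly.
        fraction = (target - cumulative) / counts[bucket]
        cuts.append(lo + fraction * (hi - lo))
    return cuts

# Hypothetical skewed histogram: half of all tuples fall in [0, 10).
cuts = partition_vector([0, 10, 20, 30, 40], [50, 30, 10, 10], 4)
print(cuts)
```

With this skewed input the first two partitions cover narrow value ranges and the last two cover wide ones, so each still receives about a quarter of the tuples.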

INTERQUERY PARALLELISM

• Queries/transactions execute in parallel with one another
  • Concurrent processing
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support

INTRAQUERY PARALLELISM

• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries
• Two complementary forms of intraquery parallelism:
  • Intraoperation parallelism: parallelize the execution of each individual operation in the query (each CPU runs on a subset of tuples)
  • Interoperation parallelism: execute the different operations in a query expression in parallel (each CPU runs a subset of operations on the data)

PARALLEL JOIN

• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.
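The split, local join, and collect steps can be sketched for an equi-join as follows. Hash partitioning on the join key is one common choice and is an assumption here, since the slide does not name a specific algorithm. The relations, function names, and worker count are hypothetical, and threads are used only to show the structure (CPython threads do not give true CPU parallelism).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Split rows into n parts by hashing the join key (first column)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts

def local_join(r_part, s_part):
    """Hash join of one pair of partitions on the first column."""
    index = defaultdict(list)
    for r in r_part:
        index[r[0]].append(r)
    return [r + s[1:] for s in s_part for r in index.get(s[0], [])]

def parallel_join(r_rows, s_rows, n_workers=4):
    r_parts = partition(r_rows, n_workers)
    s_parts = partition(s_rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(local_join, r_parts, s_parts))
    # Final step: collect the partial results from every worker.
    return [row for part in results for row in part]

# Hypothetical relations keyed on an integer id.
employees = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "pen"), (3, "ink"), (3, "pad"), (4, "cap")]
joined = sorted(parallel_join(employees, orders))
print(joined)
```

Because matching tuples always hash to the same partition number, no worker needs to see another worker's data.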

QUERY OPTIMIZATION

• Query optimization in parallel databases is more complex than in sequential databases
  • Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention
• When scheduling an execution tree in a parallel system, must decide:
  • How to parallelize each operation
  • How many processors to use for it
  • What operations to pipeline
  • What operations to execute independently in parallel
  • What operations to execute sequentially
• Determining the amount of resources to allocate for each operation is a problem
  • E.g., allocating more processors than optimal can result in high communication overhead


DEDUCTIVE DATABASES

OVERVIEW OF DEDUCTIVE DATABASES

• Declarative language
  • Language to specify rules
• Inference engine (deduction machine)
  • Can deduce new facts by interpreting the rules
• Related to logic programming
  • Prolog language (Prolog => Programming in logic)
  • Uses backward chaining to evaluate: top-down application of the rules
• Consists of:
  • Facts: similar to relation specification without the necessity of including attribute names
  • Rules: similar to relational views (virtual relations that are not stored)

PROLOG/DATALOG NOTATION

• Facts are provided as predicates
• A predicate has
  • a name
  • a fixed number of arguments
• Convention:
  • Constants are numeric or character strings
  • Variables start with upper case letters
• E.g., SUPERVISE(Supervisor, Supervisee)
  • States that Supervisor SUPERVISE(s) Supervisee

PROLOG/DATALOG NOTATION

• Rule
  • Is of the form head :- body, where :- is read as "if and only if"
  • E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  • E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

PROLOG/DATALOG NOTATION

• Query
  • Involves a predicate symbol followed by some variable arguments to answer the question
  • E.g., SUPERIOR(james,Y)?
  • E.g., SUBORDINATE(james,X)?
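To make the notation concrete, the sketch below derives SUPERIOR and SUBORDINATE from a small set of SUPERVISE facts and answers queries of the form shown above. The facts and helper names are hypothetical, and the evaluation is a simple bottom-up materialization in Python, not Prolog's backward chaining.

```python
# Hypothetical facts: SUPERVISE(Supervisor, Supervisee).
SUPERVISE = {
    ("james", "franklin"),
    ("james", "jennifer"),
    ("franklin", "john"),
}

def derive(supervise):
    """Apply the two rules once to every SUPERVISE fact."""
    superior = {(x, y) for (x, y) in supervise}      # SUPERIOR(X,Y) :- SUPERVISE(X,Y)
    subordinate = {(y, x) for (x, y) in supervise}   # SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
    return superior, subordinate

def query(relation, first):
    """Answer RELATION(first, V)? by listing every binding for V."""
    return sorted(v for (a, v) in relation if a == first)

SUPERIOR, SUBORDINATE = derive(SUPERVISE)
print(query(SUPERIOR, "james"))      # SUPERIOR(james, Y)?
print(query(SUBORDINATE, "james"))   # SUBORDINATE(james, X)?
print(query(SUBORDINATE, "john"))    # SUBORDINATE(john, X)?
```

With these facts james has two direct subordinates and no supervisor, so the second query returns no bindings.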

[Figure: supervisory tree and its Prolog notation]


PROVING A NEW FACT


DATA MINING

DEFINITION

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

DEFINITION (CONT.)

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

• Valid: The patterns hold in general.
• Novel: We did not know the pattern beforehand.
• Useful: We can devise actions from the patterns.
• Understandable: We can interpret and comprehend the patterns.

WHY USE DATA MINING TODAY?

• Human analysis skills are inadequate:
  • Volume and dimensionality of the data
  • High data growth rate
• Availability of:
  • Data
  • Storage
  • Computational power
  • Off-the-shelf software
  • Expertise

THE KNOWLEDGE DISCOVERY PROCESS

Steps:
• Identify business problem
• Data mining
• Action
• Evaluation and measurement
• Deployment and integration into business processes

PREPROCESSING AND MINING

[Figure: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]

DATA MINING TECHNIQUES

• Supervised learning
  • Classification and regression
• Unsupervised learning
  • Clustering
• Dependency modeling
  • Associations, summarization, causality
• Outlier and deviation detection
• Trend analysis and change detection

EXAMPLE APPLICATION: SKY SURVEY

• Input data: 3 TB of image data with 2 billion sky objects, took more than six years to complete
• Goal: Generate a catalog with all objects and their type
• Method: Use decision trees as data mining model
• Results:
  • 94% accuracy in predicting sky object classes
  • Increased number of faint objects classified by 300%
  • Helped team of astronomers to discover 16 new high red-shift quasars in one order of magnitude less observation time

CLASSIFICATION EXAMPLE

• Example training database
  • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck)
  • Age is ordered, Car-type is a categorical attribute
  • Class label indicates whether person bought product
  • Dependent attribute is categorical

Age  Car  Class
20   M    Yes
30   M    Yes
25   T    No
30   S    Yes
40   S    Yes
20   T    No
30   M    Yes
25   M    Yes
40   M    Yes
20   S    No

GOALS AND REQUIREMENTS

• Goals:
  • To produce an accurate classifier/regression function
  • To understand the structure of the problem
• Requirements on the model:
  • High accuracy
  • Understandable by humans, interpretable
  • Fast construction for very large training databases

WHAT ARE DECISION TREES?

[Figure: decision tree. Root test on Age: if Age < 30, test Car Type (Minivan: YES; Sports, Truck: NO); if Age >= 30: YES. A second panel shows the same tree as regions over Age (0, 30, 60) and Car Type (Minivan; Sports, Truck).]
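The tree can be written as a small function and checked against the training table from the classification example. This is a sketch: the branch layout is read from the figure labels (Age < 30 versus >= 30, then Car Type), and the function name `classify` is illustrative.

```python
# Training rows from the classification example: (Age, Car, Class).
# Car codes: M = Minivan, S = Sport, T = Truck.
TRAINING = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"), (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"), (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]

def classify(age, car):
    """Root split on Age; the younger branch splits again on Car Type."""
    if age < 30:
        return "Yes" if car == "M" else "No"
    return "Yes"

correct = sum(classify(age, car) == label for age, car, label in TRAINING)
accuracy = correct / len(TRAINING)
print(f"training accuracy: {accuracy:.0%}")
```

On this small table the two splits separate the classes completely, which also illustrates why such a model is easy for a human to interpret.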

DENSITY-BASED CLUSTERING

• A cluster is defined as a connected dense component.
• Density is defined in terms of number of neighbors of a point.
• We can find clusters of arbitrary shape
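A minimal sketch of the idea follows, in the style of the well-known DBSCAN algorithm (the slide does not name a specific algorithm, so that choice is an assumption). A point is dense if at least `min_neighbors` points, itself included, lie within distance `eps`; clusters grow by following dense points. The parameter names and sample points are hypothetical.

```python
from math import dist

def density_clusters(points, eps, min_neighbors):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_neighbors:
            labels[i] = -1              # not dense; may become a border point later
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_neighbors:
                queue.extend(reachable)  # j is dense, so keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = density_clusters(points, eps=1.5, min_neighbors=3)
print(labels)
```

Because clusters grow point by point through dense neighborhoods, their shape is not restricted to circles or boxes.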

MARKET BASKET ANALYSIS

• Consider shopping cart filled with several items
• Market basket analysis tries to answer the following questions:
  • Who makes purchases?
  • What do customers buy together?
  • In what order do customers purchase items?

MARKET BASKET ANALYSIS (CONTD.)

• Co-occurrences
  • 80% of all customers purchase items X, Y and Z together.
• Association rules
  • 60% of all customers who purchase X and Y also buy Z.
• Sequential patterns
  • 60% of customers who first buy X also purchase Y within three weeks.
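The percentages attached to co-occurrences and association rules are usually called support and confidence. The sketch below computes both for a made-up list of ten baskets; the data and function names are hypothetical and were chosen so that the rule "X and Y imply Z" comes out at 60%.

```python
# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y", "Z"},
    {"X", "Y"}, {"X", "Y"},
    {"X"}, {"Z"}, {"Y", "Z"}, {"W"}, {"X", "Z"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset (co-occurrence)."""
    return count(itemset, baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs, baskets) / count(lhs, baskets)

xyz_support = support({"X", "Y", "Z"}, baskets)
rule_confidence = confidence({"X", "Y"}, {"Z"}, baskets)
print(xyz_support, rule_confidence)
```

Support is measured against all customers, while confidence is measured only against those who bought the left-hand side of the rule.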


SPATIAL DATA

WHAT IS A SPATIAL DATABASE?

• Database that:
  • Stores spatial objects
  • Manipulates spatial objects just like other objects in the database

WHAT IS SPATIAL DATA?

• Data which describes either location or shape, e.g.
  • House or Fire Hydrant location
  • Roads, Rivers, Pipelines, Power lines
  • Forests, Parks, Municipalities, Lakes
• In the abstract, reductionist view of the computer, these entities are represented as Points, Lines, and Polygons.

• Roads are represented as Lines
• Mail Boxes are represented as Points

TOPIC THREE

Land Use Classifications are represented as Polygons


TOPIC THREE

Combination of all the previous data

SPATIAL RELATIONSHIPS

• Not just interested in location, also interested in “Relationships” between objects that are very hard to model outside the spatial domain.
• The most common relationships are
  • Proximity: distance
  • Adjacency: “touching” and “connectivity”
  • Containment: inside/overlapping


SPATIAL RELATIONSHIPS

Distance between a toxic waste dump and a piece of property you were considering buying.


SPATIAL RELATIONSHIPS

Distance to various pubs


SPATIAL RELATIONSHIPS

Adjacency: All the lots which share an edge


Connectivity: Tributary relationships in river networks

MOST ORGANIZATIONS HAVE SPATIAL DATA

• Geocodable addresses
• Customer location
• Store locations
• Transportation tracking
• Statistical/Demographic
• Cartography
• Epidemiology
• Crime patterns
• Weather Information
• Land holdings
• Natural resources
• City Planning
• Environmental planning
• Information Visualization
• Hazard detection

ADVANTAGES OF SPATIAL DATABASES

• Able to treat your spatial data like anything else in the DB
  • Transactions
  • Backups
  • Integrity checks
  • Less data redundancy
  • Fundamental organization and operations handled by the DB
  • Multi-user support
  • Security/access control
  • Locking

ADVANTAGES OF SPATIAL DATABASES

• Offload complicated tasks to the DB server
  • Organization and indexing done for you
  • Do not have to re-implement operators
  • Do not have to re-implement functions
• Significantly lowers the development time of client applications

ADVANTAGES OF SPATIAL DATABASES

• Spatial querying using SQL
  • Use simple SQL expressions to determine spatial relationships
    • distance
    • adjacency
    • containment
  • Use simple SQL expressions to perform spatial operations
    • area
    • length
    • intersection
    • union
    • buffer

[Figure: original polygons, their union, and their intersection]

[Figure: original river network and buffered rivers]

ADVANTAGES OF SPATIAL DATABASES

    … WHERE distance(<me>, pub_loc) < 1000
    SELECT distance(<me>, pub_loc) * $0.01 + beer_cost …
    ... WHERE touches(pub_loc, street)
    … WHERE inside(pub_loc, city_area) and city_name = ...

ADVANTAGES OF SPATIAL DATABASES

Simple value of the proposed lot:

    area(<my lot>) * <price per acre>
    + area(intersect(<my lot>, <forested area>)) * <wood value per acre>
    - distance(<my lot>, <power lines>) * <cost of power line laying>

New Electoral Districts

• Changes in areas between 1996 and 2001 election.
• Want to predict voting in 2001 by looking at voting in 1996.
• Intersect the 2001 district polygon with the voting area polygons.
  • Outside will have zero area
  • Inside will have 100% area
  • On the border will have partial area
• Multiply the % area by 1996 actual voting and sum
• Result is a simple prediction of 2001 voting
• More advanced: also use demographic data.
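Once the intersection areas are known, the area-weighting steps above reduce to a weighted sum. The sketch below assumes the spatial database has already returned, for each 1996 voting area, the fraction of its area that falls inside the 2001 district; the figures and the function name are hypothetical.

```python
def predict_votes(overlaps):
    """Sum of (fraction of area inside the new district) * (1996 votes)."""
    return sum(fraction * votes for fraction, votes in overlaps)

# Hypothetical voting areas: (fraction inside the 2001 district, 1996 votes).
overlaps = [
    (1.0, 1200),   # entirely inside: counts in full
    (0.25, 800),   # on the border: a quarter of its area is inside
    (0.0, 500),    # outside: contributes nothing
]

prediction = predict_votes(overlaps)
print(prediction)
```

The estimate treats voters as evenly spread across each voting area, which is why adding demographic data is listed as the more advanced option.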

DISADVANTAGES OF SPATIAL DATABASES

• Cost to implement can be high
• Some inflexibility
• Incompatibilities with some GIS software
• Slower than local, specialized data structures
• User/managerial inexperience and caution

PICTOGRAMS - SHAPES

Types: Basic Shapes, Multi-Shapes, Derived Shapes, Alternate Shapes, Any Possible Shape, User-Defined Shapes

[Figure: example pictograms for each shape type]

SPATIAL DATA ENTITY CREATION

Form an entity to hold county names, states, populations, and geographies:

    CREATE TABLE County(
        Name varchar(30),
        State varchar(30),
        Pop Integer,
        Shape Polygon);

Form an entity to hold river names, sources, lengths, and geographies:

    CREATE TABLE River(
        Name varchar(30),
        Source varchar(30),
        Distance Integer,
        Shape LineString);

EXTENDING THE ER DIAGRAM

[Figure: standard ER diagram and spatial ER diagram]
