Introduction to Geographic Information Science - colorado.edu · Introduction to Geographic Information Science Database Management ... • Exploring what Databasesand their elements

Introduction to Geographic Information Science

Database Management

Geography 4103 / 5103

Updates

Last Lecture

• We tried to explore the term “spatial model” by looking at definitions, taxonomies and examples

• An understanding of the methods we use (analysis tools), of appropriate data models, and of the problem we face (modeling) is central

• Deriving meaningful representations of events, occurrences or processes by making use of the power of spatial analysis

• Modelbuilder: How do you like it?

Today's Outline

• We will look into Database Management Systems (DBMS)

• Exploring what databases and their elements are and what DBMS means

• Types of DBMS

• How attribute data & feature info is managed and stored

• Operations on relational DBMS (Relational Operators) to manipulate and query/select data

• Spatial data

Learning Objectives

• Database Management Systems (DBMS)

• Databases and their elements

• Types of DBMS

• Relational Operators to manipulate and query/select data

• Spatial data

Databases and DBMS

It’s all about our data (attribute information & feature information…)

A database is a collection of data files that is structured (organized).

A database management system (DBMS) is a specialized computer program used to organize & manipulate (manage) the database (data storage, editing, and retrieval).

Examples: Oracle, Access, Postgres

DBMS & GIS

- Often huge tables to store data properties and their relationships

- Require maintenance (change, add, delete)

- Must serve different people/applications for queries

- Protection from corrupting/deleting & access restrictions

- Geodatabases as a more complex type

Logical vs. Physical Structures

• Logical structure = database design (schema)

– “Logical specification of attributes and relationships”

– Conceptual model of items, mappings, cardinality

– Entity-relationship diagram / notation (UML)

• Physical structure = database implementation

– Many possible implementations of any schema

– Depends on intended db use requirements

• Speed access, frequent update

• Flexible relationships

• Protect data security

Logical Structure (Schema)

• Bolstad’s (2005) forest trails database

– Entity sets hold attributes*

– Relationships hold mappings

• Use these for joining tables

– Cardinality (1-N, M-N) defines nature and direction of the relationship

• Review joins-and-relates

• Entity sets hold features, too…

[Figure: schema diagram for the forest trails database; legible labels: Recreation, Activity, Features, M]

Cardinality

[Figure: three example relationships, each labeled with its cardinality]

Attention: 1:M and M:N

– Use relate instead of join

– If join: only the first element in shp files

– gdb: relationships for all mappings

Spatial joins… M:1, e.g. landuse features (M) & descriptions (1)

Physical Structures/ Database Models

• … particular way of conceptually organizing multiple data files in a database (implementation)

• Flat File: text files

• Hierarchical: parent-child

• Network: nodes & links

• Relational: tables related via keys

• … Hybrid / Object-oriented

Hierarchical and network database models have generally been replaced by the relational data model.

Flat File

Data in a “text”-formatted file (row/column format). “Initial stage format”

Advantages: transparent, easily transportable

Disadvantages: little structure, few error safeguards, no ability to cross-reference or link among entries

Hierarchical DBMS

Root entity & tree (e.g. ArcCatalog, Windows Explorer) and parent-children relationships

Simple, but hard to capture complex relationships; slow searches

Redundancies exist (updates!) - duplicates

[Figure: tree of forests, trails and features; some features are marked redundant]

Network DBMS

[Figure: network linking forests, trails, features and activity]

Eliminate redundancy - permit multiple parents for each child

Advantages: fast search, flexible relationships, no duplicates

Disadvantages: difficult to implement, difficult to update, difficult to validate

Hierarchic and Network DBMS In Practice

Hierarchic: redundant items. Not an error (no way to avoid)

Network: no redundant nodes, but errors in relations (point 4 is not part of edge f; point 5 should be part)

Relational DBMS

• Introduced by E.F. Codd (1970)

– Mathematician at IBM, same time as Mandelbrot

• Most frequently encountered DBMS in GIS

– Flexible

– Wide range of data types

• Simple to implement, modify and understand

– Bernhardsen: simple table structure permitted development of SQL

• Sometimes retrieval is slow (so optimize tables)

– Use fewer columns, fewer joins

– Use relationship classes instead of joins

Terminology

• Table: data organized in rows and columns

• Record (row/tuple): each record represents one logical entity (e.g. road, lake, land use polygon)

• Field (column/item): the attribute (property) of the logical entity

• Index/key: attribute(s) used to identify, organize, or order records in a database (needed for relational algebra or joins; see below)

ID    AREA    Perim   Class   Code
27    39.2    55.4    a       11z
14    192.4   77.3    a       119f

– Each row is a record (or tuple); each column is a field (or attribute/item)

– Domains: integer (e.g. ID), real (float/double, e.g. AREA, Perim), alpha-numeric (a string, e.g. Code)

Relational Database

Original “Flat Files”

Minimal structure

Field types/domains specified (possible values)

Advantages: minimum structure, easy programming, flexible

Disadvantages: can be slow due to lack of structure

Relational “Keys”

• Any unique field can be a key.

• Keys can span multiple columns

– What is unique?

• Forests: forest-id or forest_name

• Recreational features: feature or description

Primary and Foreign “Keys”

• Primary key – index to a table

• Foreign key – index contained in the table that is possibly non-unique, but which serves as primary key in another table (can be used for Joins)

Primary Key

Foreign Key

Every table must have a primary key (what is primary key for Trails table?)

Primary Key

Why join these tables? What could it tell you?

Relational Database – Joining Tables

• Rule: each row holds a unique combination of values, before and after join

• Ideally, isolate key in as few fields as possible

– db is simpler, smaller disk volume, faster query

– This is partly why ArcGIS uses a # field – forces primary key into a single column, in absence of other keys
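A minimal sketch of a key-based join, using Python's built-in sqlite3 module. The table names, column names and rows are hypothetical illustrations and are not Bolstad's forest trails data; the point is only that the foreign key in one table matches the primary key of the other.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forests (
        forest_id   INTEGER PRIMARY KEY,   -- primary key of forests
        forest_name TEXT NOT NULL
    );
    CREATE TABLE trails (
        trail_id   INTEGER PRIMARY KEY,    -- primary key of trails
        trail_name TEXT NOT NULL,
        forest_id  INTEGER REFERENCES forests(forest_id)  -- foreign key
    );
    INSERT INTO forests VALUES (1, 'North Forest'), (2, 'South Forest');
    INSERT INTO trails VALUES
        (10, 'Ridge Trail', 1),
        (11, 'Creek Trail', 1),
        (12, 'Lake Loop', 2);
""")

# 1:n join: one forest row is matched to many trail rows.
joined = conn.execute("""
    SELECT t.trail_id, t.trail_name, f.forest_name
    FROM trails AS t
    JOIN forests AS f ON f.forest_id = t.forest_id
    ORDER BY t.trail_id
""").fetchall()

for row in joined:
    print(row)
```

In the joined result every row is still a unique combination of values, and trail_id still identifies each row, while forest_name repeats for trails in the same forest.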

Relational Database –Result of the Join

What is the primary key in the Joined Table? What foreign keys do you see in Joined Table?

Which fields can be primary keys?

Sometimes we need multiple fields to form a key, e.g. Parcel-ID and Own-ID

What is the primary key for this table?

Relational Join (1:n)

Resulting Joined table:

Sorting – ordering by attribute values

Simple sort – ascending AREA

Name               AREA          class   Type
Emily, Lake        52,222.6      1       Limnetic zone
Emily, Lake        58,662.2      1       Limnetic zone
(unnamed)          60,826.6      2       Shallow lakes
(unnamed)          64,588.5      2       Shallow lakes
(unnamed)          70,590.3      2       Shallow lakes
Long Lake          88,259.5      1       Limnetic zone
(unnamed)          143,285.3     2       Littoral zone
Sleepy Eye Lake    170,797.1     2       Littoral zone
Mud Lake           193,318.5     2       Shallow lakes
Goldsmith Lake     201,127.1     2       Littoral zone
Emily, Lake        336,343.2     2       Littoral zone
(unnamed)          349,528.7     1       Limnetic zone
(unnamed)          384,160.1     2       Littoral zone
Emily, Lake        420,798.4     1       Limnetic zone
Savidge Lake       479,709.7     2       Littoral zone
Emily, Lake        545,381.8     1       Limnetic zone
Dog Lake           635,537.0     2       Littoral zone
Duck Lake          1,126,331.9   1       Limnetic zone
Wita Lake          1,354,583.2   2       Littoral zone
(unnamed)          1,418,133.3   1       Limnetic zone
Ballantyne Lake    1,428,331.5   1       Limnetic zone
Washington, Lake   1,914,835.3   1       Limnetic zone
(unnamed)          1,937,698.6   1       Limnetic zone

Compound sort – ascending Type, then descending AREA within Type

Name               AREA          class   Type
(unnamed)          4,040,675.7   1       Limnetic zone
(unnamed)          1,937,698.6   1       Limnetic zone
Washington, Lake   1,914,835.3   1       Limnetic zone
Ballantyne Lake    1,428,331.5   1       Limnetic zone
(unnamed)          1,418,133.3   1       Limnetic zone
Duck Lake          1,126,331.9   1       Limnetic zone
Emily, Lake        545,381.8     1       Limnetic zone
Emily, Lake        420,798.4     1       Limnetic zone
(unnamed)          349,528.7     1       Limnetic zone
Long Lake          88,259.5      1       Limnetic zone
Emily, Lake        58,662.2      1       Limnetic zone
Emily, Lake        52,222.6      1       Limnetic zone
Wita Lake          1,354,583.2   2       Littoral zone
Dog Lake           635,537.0     2       Littoral zone
Savidge Lake       479,709.7     2       Littoral zone
(unnamed)          384,160.1     2       Littoral zone
Emily, Lake        336,343.2     2       Littoral zone
Goldsmith Lake     201,127.1     2       Littoral zone
Sleepy Eye Lake    170,797.1     2       Littoral zone
(unnamed)          143,285.3     2       Littoral zone
Mud Lake           193,318.5     2       Shallow lakes
(unnamed)          70,590.3      2       Shallow lakes
(unnamed)          64,588.5      2       Shallow lakes
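The two sorts can be sketched in a few lines of Python. This is only an illustration on five rows taken from the lake table; the tuple layout (name, area, type) is an assumption made for the example.

```python
# (name, area, type) rows; a small subset of the lake table.
lakes = [
    ("Long Lake", 88259.5, "Limnetic zone"),
    ("Mud Lake", 193318.5, "Shallow lakes"),
    ("Dog Lake", 635537.0, "Littoral zone"),
    ("Duck Lake", 1126331.9, "Limnetic zone"),
    ("Wita Lake", 1354583.2, "Littoral zone"),
]

# Simple sort: ascending AREA.
simple = sorted(lakes, key=lambda row: row[1])

# Compound sort: ascending Type, then descending AREA within each Type.
# Negating the numeric area reverses its order inside the tuple key.
compound = sorted(lakes, key=lambda row: (row[2], -row[1]))

print([row[0] for row in simple])
print([row[0] for row in compound])
```

The compound key is a tuple, so rows are compared on Type first and only fall back to AREA when the Type values are equal.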

Constraints on relational implementation

• Rules for implementing tables appear to be fast and loose.

• In fact, two kinds of constraints allow flexibility yet preserve logical consistency.

– Constraint 1 – limit the number of legal operations on relational tables (Relational Algebra)

– Constraint 2 – balance the amount of redundancy (Normal Forms)

Constraint 1 – limit # operations

• Codd’s relational algebra (only 8 operations)

– Combine or split tables

– Select rows or columns

– Expand tables (add rows or columns)

• Everything you do in Arc that geoprocesses tables is accomplished by these 8 operations.

The Eight Operators (after Bolstad, 2005)

Restrict: select (rows) by attribute

– e.g. select size >= big

– Simple or compound restricts (using logical operators AND, OR, NOT)

Project: select specific column(s) (vertical subsetting)

– Recall the comment that fewer columns make a simpler db: better speed, smaller disk volume

Show SQL
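Restrict and project can be sketched on a table held as a list of Python dicts. The helper names restrict and project, the column names and the rows are all hypothetical; this is not how ArcGIS or any particular DBMS implements them.

```python
# A tiny table: each dict is one record.
rows = [
    {"id": 1, "size": 1, "type": "m"},
    {"id": 2, "size": 2, "type": "n"},
    {"id": 3, "size": 3, "type": "r"},
    {"id": 4, "size": 3, "type": "m"},
]

def restrict(table, predicate):
    """Keep only the rows for which the predicate is true (horizontal subset)."""
    return [row for row in table if predicate(row)]

def project(table, columns):
    """Keep only the named columns and drop duplicate rows (vertical subset)."""
    seen, out = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

# Compound restrict: size >= 3 AND NOT type = 'r'.
big = restrict(rows, lambda r: r["size"] >= 3 and not r["type"] == "r")

# Project onto a single column; the repeated 'm' appears once.
types = project(rows, ["type"])
```

In SQL terms, restrict corresponds to the WHERE clause and project to the column list of a SELECT DISTINCT.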

The Eight Operators (after Bolstad, 2005)

Product: combine all possible unique rows in two tables

– Combines all unique rows of one table with all unique rows of another table (cross-tabulating)

Divide: often used in queries with “All” based on a condition

– “Find all types (m, n, r) associated with size = 1 and size = 2” … in 3rd table

– Returns list of types

– e.g. find structure types that are located in 2 different hazard zones (1 and 2) out of a table that summarizes all relationships
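A rough sketch of product and divide, with tables represented as Python sets of tuples. The function names and the sample type/size pairs are assumptions made for illustration only.

```python
from itertools import product as cartesian

def product(a, b):
    """Pair every row of a with every row of b."""
    return {row_a + row_b for row_a, row_b in cartesian(a, b)}

def divide(pairs, required):
    """Return each x that is paired with ALL values in `required`."""
    candidates = {x for x, _ in pairs}
    return {x for x in candidates if all((x, y) in pairs for y in required)}

# Product: 2 type rows x 2 size rows gives 4 combined rows.
types = {("m",), ("n",)}
sizes = {(1,), (2,)}
all_pairs = product(types, sizes)

# Divide: which types occur with size 1 AND size 2?
type_size = {("m", 1), ("m", 2), ("n", 1), ("r", 1), ("r", 2)}
both = divide(type_size, {1, 2})
```

Here type n is dropped by the divide because it is only associated with size 1, which matches the "All" wording of such queries.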

The Eight Operators (after Bolstad, 2005)

Union: combine tables to return records found in one or both

– No duplicates; same set of attributes in input tables

Intersect: combine tables to return records found in both

– Same set of attributes in input tables

The Eight Operators (after Bolstad, 2005)

Difference: return rows in first but not second table – order matters!

– Similar to Erase

Join: match candidate keys to expand attributes

– Sequential joins are possible

– What is the join field above?
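The set-style operators (union, intersect, difference) and the join can be sketched as follows. Tables with the same attributes are modeled as Python sets of tuples, and the join works on lists of dicts; all names and rows are hypothetical.

```python
# Two tables with the same set of attributes (id, species).
a = {(1, "oak"), (2, "pine")}
b = {(2, "pine"), (3, "fir")}

union = a | b         # records found in one or both, no duplicates
intersect = a & b     # records found in both
difference = a - b    # in a but not in b; b - a gives a different answer

def join(left, right, key):
    """Match rows on a shared key column and merge their attributes."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]

parcels = [{"parcel_id": 1, "zone": "A"}, {"parcel_id": 2, "zone": "B"}]
zones = [{"zone": "A", "hazard": "low"}, {"zone": "B", "hazard": "high"}]
joined = join(parcels, zones, "zone")
```

A sequential join would simply feed `joined` into another call to `join` with a third table and its key.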

Summary

• We tried to explore Database Management Systems (DBMS)

• Databases for organizing and manipulating data

• Relational databases are most common

• Attribute data is managed and stored, and tables can be linked based on keys

• Operations on relational DBMS (Relational Operators) are very important concepts for DBMS

• Spatial data are special cases and often special structures are required to manage them

Pitfalls of Relational Tables

Relational tables have many advantages, but if improperly structured, they may suffer from:

- Poor performance

- Inconsistency

- Redundancy

- Difficult maintenance

This occurs when concepts of Normal Forms in relational tables are violated.

Constraint 2 – limit redundancy (do this with indexing keys & dependencies)

• Dependencies are needed to make relational DBMS work. Dependency means that one column pre-determines another.

• Dependency ↔ Redundancy (they complement and balance each other)

– Too much → bulky database, slower performance

– Not enough → can’t find all the info in the table easily, and difficult to join when added information is needed

Simple (Functional) Dependency

• Definition: knowing the value of one field in a row determines what the value in another field will be.

Example: Student Database

– Knowing a Buff One number determines student name

– Knowing name determines major (even Undeclared)

– Knowing major determines College (A&S, ENG)

• Functional dependencies are good (they’re simple)

• Transitive dependencies are bad

– Transitive: a sequence of simple dependencies in one table.

– Bad because too much redundancy creates complex primary and foreign indexing keys (again, bulky, slow, and possibly contradictory)
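One way to make "one column determines another" concrete is to test it on the rows of a table. The sketch below is an illustration only: the helper name holds and the id values are hypothetical, and the names, majors and colleges are borrowed from the student example.

```python
def holds(rows, determinant, dependent):
    """True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        # setdefault stores the first dependent value seen for this key.
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True

students = [
    {"id": 1, "name": "Sally Jones", "major": "Interface Design", "college": "ENG"},
    {"id": 2, "name": "Hal Smith", "major": "GIS", "college": "A&S"},
    {"id": 3, "name": "Bob Willis", "major": "Policy", "college": "A&S"},
]

id_to_major = holds(students, "id", "major")             # id -> major
major_to_college = holds(students, "major", "college")   # major -> college
college_to_major = holds(students, "college", "major")   # fails: A&S has two majors
```

Because id determines major and major determines college within the same table, id determines college only transitively, which is the pattern the 3rd normal form removes.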

How to resolve Constraint #2?

• Normalization ensures indexing keys provide just the right amount of dependency in a single table.

– “Just the right amount” means that edits can be made in just one table and propagated through the rest of the DBMS using table relationships (and joins).

– And database edits cannot easily corrupt the data (goal is to free the database of modification anomalies).

How to resolve Constraint #2?

• Normalize in stages, called “Normal Forms”

– Each form inserts or eliminates dependencies

– Codd proposed the first three in a sequence – further forms were added later

– First three needed for GIS

– When all three are in place, relational database contains only simple dependencies

– A normalized database is suitable for general purpose queries, meaning special cases in the database should not require different query formulation than general cases.

Normal Forms

• 1st normal form: atomic columns and cell values

– every cell contains only one attribute value, and

– no repeat columns appear in any single table

• 2nd normal form: establish simple dependencies

– attributes that do not make up the primary key are functionally dependent only on the primary key

– Split tables to remove duplicate rows

• 3rd normal form: eliminate transitive dependencies

– Split tables to remove dependent rows and columns

Six additional normal forms can be established, but GIS uses only these three…

Establish 1st Normal Form

BuffOne     Student Name   Major                Dept         College
1234..789   Sally Jones    Interface Dsn        CSI          ENG
1357..246   Bob Willis     Policy, Human Geog   ENVS, GEOG   A&S
9876..321   Kathy Dunn     Phys Geog, Policy    GEOG, ENVS   A&S
5432..567   Hal Smith      GIS                  GEOG         A&S
5798..123   Carl Tomlin    Analysis, GIS        ENVS, GEOG   A&S

Problem? Cells should hold only one value, but some cells here hold two (thus queries need to recognize and isolate one major from a list of possibly multiple majors – queries become more difficult than they need to be)

Establish 1st Normal Form

BuffOne     Student Name   Major           Dept   College   Major        Dept   College
1234..789   Sally Jones    Interface Dsn   CSI    ENG       --           --     --
1357..246   Bob Willis     Policy          ENVS   A&S       Human Geog   GEOG   A&S
9876..321   Kathy Dunn     Phys Geog       GEOG   A&S       Policy       ENVS   A&S
5432..567   Hal Smith      GIS             GEOG   A&S       --           --     --
5798..123   Carl Tomlin    Analysis        ENVS   A&S       GIS          GEOG   A&S

1st Normal Form? Not yet! No cell contains more than one value, and now the primary key is BuffOne or Student Name; BUT repeated columns persist

Establish 1st Normal Form: Done!

Every cell has only one value, and no duplicate columns.

BuffOne     Student Name   Major           Dept   College
1234..789   Sally Jones    Interface Dsn   CSI    ENG
1357..246   Bob Willis     Policy          ENVS   A&S
9876..321   Kathy Dunn     Phys Geog       GEOG   A&S
5432..567   Hal Smith      GIS             GEOG   A&S
5798..123   Carl Tomlin    Analysis        ENVS   A&S
1357..246   Bob Willis     Human Geog      GEOG   A&S
9876..321   Kathy Dunn     Policy          ENVS   A&S
5798..123   Carl Tomlin    GIS             GEOG   A&S

But what is the primary key now? (Note that Bob Willis, Kathy Dunn and Carl Tomlin have double majors.)
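The step from multi-valued cells to one value per cell can be sketched in Python. The function name to_first_normal_form and the two-column layout are assumptions for the example; the rows reuse three students from the table.

```python
# (student name, majors) rows; some cells hold a comma-separated list.
raw = [
    ("Sally Jones", "Interface Dsn"),
    ("Bob Willis", "Policy, Human Geog"),
    ("Hal Smith", "GIS"),
]

def to_first_normal_form(rows):
    """Emit one row per (student, major) so that every cell is atomic."""
    out = []
    for name, majors in rows:
        for major in majors.split(","):
            out.append((name, major.strip()))
    return out

atomic = to_first_normal_form(raw)

# The student alone no longer identifies a row, but in this sample the
# (student, major) combination does, so it could serve as a composite key.
names_unique = len({name for name, _ in atomic}) == len(atomic)
pairs_unique = len(set(atomic)) == len(atomic)
```

This is one possible reading of the primary key question: once students repeat, a key has to span more than one column, or the table has to be split.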

Establish 2nd Normal Form

2nd Normal Form – Remove duplicate rows to establish primary keys for simple dependencies; every non-key field depends only on the primary key.

Students

BuffOne     Student Name   Major 1 ID   Major 2 ID
1234..789   Sally Jones    134.05       Null
1357..246   Bob Willis     378.01       260.02
9876..321   Kathy Dunn     260.01       378.01
5432..567   Hal Smith      260.04       Null
5798..123   Carl Tomlin    378.02       260.04

Majors

Major ID   Major              Dept ID   Dept   College
134.05     Interface Design   134       CSI    ENG
378.01     Policy             378       ENVS   A&S
378.02     Analysis           378       ENVS   A&S
260.01     Phys Geog          260       GEOG   A&S
260.02     Human Geog         260       GEOG   A&S
260.04     GIS                260       GEOG   A&S

Done!

Establish 3rd Normal Form

3rd Normal Form -- Eliminate transitive dependencies

Major ID → Dept ID and Dept ID → College

Students

BuffOne     Student Name   Major 1 ID   Major 2 ID
1234..789   Sally Jones    134.05       Null
1357..246   Bob Willis     378.01       260.02
9876..321   Kathy Dunn     260.01       378.01
5432..567   Hal Smith      260.04       Null
5798..123   Carl Tomlin    378.02       260.04

Majors

Major ID   Major              Dept ID
134.05     Interface Design   134
378.01     Policy             378
378.02     Analysis           378
260.01     Phys Geog          260
260.02     Human Geog         260
260.04     GIS                260

Departments

Dept ID   Dept   College
134       CSI    ENG
378       ENVS   A&S
260       GEOG   A&S

Done!

In 3rd Normal Form

• Tables have distinct sets of rows and columns

• Primary keys are unique identifiers within tables, and in each table, they span as few columns as possible.

• Items appearing in multiple tables keep same ID throughout

• All dependencies are simple; no transitive dependencies exist in any single table
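A small sketch of why the split into Majors and Departments helps, using Python's built-in sqlite3 module. The SQL column names are assumptions, only three of the majors are loaded, and the new college code 'CAS' is invented purely to show an edit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        dept    TEXT,
        college TEXT
    );
    CREATE TABLE majors (
        major_id TEXT PRIMARY KEY,
        major    TEXT,
        dept_id  INTEGER REFERENCES departments(dept_id)
    );
    INSERT INTO departments VALUES
        (134, 'CSI', 'ENG'), (378, 'ENVS', 'A&S'), (260, 'GEOG', 'A&S');
    INSERT INTO majors VALUES
        ('134.05', 'Interface Design', 134),
        ('378.01', 'Policy', 378),
        ('260.04', 'GIS', 260);
""")

# One edit in one table: college is stored once per department.
conn.execute("UPDATE departments SET college = 'CAS' WHERE dept_id = 260")

# The join propagates the edit to every major in that department.
rows = conn.execute("""
    SELECT m.major, d.college
    FROM majors AS m
    JOIN departments AS d ON d.dept_id = m.dept_id
    ORDER BY m.major_id
""").fetchall()
```

Had College stayed in the Majors table, the same change would have required touching every GEOG major row, with the risk of leaving contradictory values behind.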

Hybrid DBMS

• In GIS often hybrid database designs are used

• Coordinate data in specialized structures (fast retrieval)

– object-related grouping, indexing, listing as well as pointers to link between geographic features and attributes

• Topology explicitly stored in an indexing table (using lists and pointers to keep information about adjacency,…)

• Attribute data in relational databases

Integrated DBMS

• Attributes and features stored in same type DBMS

– Relational

– Arc Geodatabase

– Item b) is the Management Base; c) is Data Base

– Where is the Analytic Base?

Geodatabases stored in Integrated DBMS

Object-Oriented DBMS

• Items are objects, “encapsulate” data in “frames”

• Classes have “properties” = “behaviors” = “methods”

• Subclasses “inherit” properties

• Objects pass “messages”

• E.g. GIS SmallWorld, ArcObjects (learn a bit about this in GIS 2)

• Has evolved into agent-based modeling, dynamic modeling and mobility tracking