Introduction to Geographic Information Science
Database Management
Geography 4103 / 5103
Updates
Last Lecture
• We tried to explore the term “spatial model” by looking at definitions, taxonomies and examples
• An understanding of the methods we use (analysis tools), of appropriate data models and of the problem we face (modeling) is central
• Spatial analysis lets us derive meaningful representations of events, occurrences or processes
• Modelbuilder: How do you like it?
Today’s Outline
• We will look into Database Management Systems (DBMS)
• Exploring what databases and their elements are and what DBMS means
• Types of DBMS
• How attribute data & feature info is managed and stored
• Operations on relational DBMS (Relational Operators) to manipulate and query/select data
• Spatial data
Learning Objectives
• Database Management Systems (DBMS)
• Databases and their elements
• Types of DBMS
• Relational Operators to manipulate and query/select data
• Spatial data
Databases and DBMS
It’s all about our data (attribute information & feature information…)
A database is a collection of data files that is structured (organized).
A database management system (DBMS) is a specialized computer program used to organize & manipulate (manage) the database (data storage, editing, and retrieval).
Oracle, Access, Postgres
DBMS & GIS
- Often huge tables to store data properties and their relationships
- Require maintenance (change, add, delete)
- Must serve different people/applications for queries
- Protection from corrupting/deleting & access restrictions
- Geodatabases as a more complex type
Logical vs. Physical Structures
• Logical structure = database design (schema)
– “Logical specification of attributes and relationships”
– Conceptual model of items, mappings, cardinality
– Entity-relationship diagram / notation (UML)
• Physical structure = database implementation
– Many possible implementations of any schema
– Depends on intended db use requirements
• Speed access, frequent update
• Flexible relationships
• Protect data security
Logical Structure (Schema)
• Bolstad’s (2005) forest trails database
– Entity sets hold attributes*
– Relationships hold mappings
• Use these for joining tables
– Cardinality (1-N, M-N) defines nature and direction of the relationship
• Review joins-and-relates
• Entity sets hold features, too…
[Figure: entity-relationship diagram with entity sets such as Recreation Activity and Features; each relationship is labeled with its cardinality (e.g. M)]
Attention: 1:M and M:N
• Use relate instead of join
• If join: only the first matching element is kept in shp files
• gdb: relationships for all mappings
• Spatial joins… M:1: land use features (M) & descriptions (1)
Physical Structures/ Database Models
• … particular way of conceptually organizing multiple data files in a database (implementation)
• Flat File: text files
• Hierarchical: parent-child
• Network: nodes & links
• Relational: tables related via keys
• … Hybrid / Object-oriented
Hierarchical and network database models have generally been replaced by the relational data model.
Flat File
Data in a “text” formatted file (row/column format). “Initial stage format”
Advantages: transparent, easily transportable
Disadvantages: little structure, few error safeguards, no ability to cross-reference or link among entries
Hierarchical DBMS
• Root entity & tree (e.g. ArcCatalog, Windows Explorer) and parent-children relationships
• Simple, but hard to capture complex relationships; slow searches
• Redundancies exist (updates!): duplicates
[Figure: hierarchy of forests, trails and features, with redundant feature entries]
Network DBMS
[Figure: network linking forests, trails, features and activity]
• Eliminates redundancy: permits multiple parents for each child
Advantages: fast search, flexible relationships, no duplicates
Disadvantages: difficult to implement, difficult to update, difficult to validate
Hierarchic and Network DBMS In Practice
• Hierarchical: redundant items; not an error (no way to avoid)
• Network: no redundant nodes, but errors in relations
– point 4 not part of edge f
– point 5 should be part
Relational DBMS
• Introduced by E.F. Codd (1970)
– Mathematician at IBM, same time as Mandelbrot
• Most frequently encountered DBMS in GIS
– Flexible
– Wide range of data types
• Simple to implement, modify and understand
– Bernhardsen: simple table structure permitted development of SQL
• Sometimes retrieval is slow (so optimize tables)
– Use fewer columns, fewer joins
– Use relationship classes instead of joins
Terminology
• Table: data organized in rows and columns
• Record (row/tuple): each record represents one logical entity (e.g. road, lake, land use polygon)
• Field (column/item): an attribute (property) of the logical entity
• Index/key: attribute(s) used to identify, organize, or order records in a database (needed for relational algebra or joins; see below)

ID   AREA    Perim   Class   Code
27   39.2    55.4    a       11z
14   192.4   77.3    a       119f

– Integer domain (e.g. ID)
– Real domain, float/double (e.g. AREA, Perim)
– Alphanumeric domain, a string (e.g. Code)
– Each row is a record (or tuple); each column is a field (or attribute/item)
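The terminology above can be tried out in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the table name `polygons` and the column name `cls` (standing in for Class) are assumptions made for illustration, and the two records are the ones from the example table.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Table: rows and columns. Each field gets a domain (type).
conn.execute(
    """
    CREATE TABLE polygons (
        id    INTEGER PRIMARY KEY,  -- key, integer domain
        area  REAL,                 -- real domain
        perim REAL,                 -- real domain
        cls   TEXT,                 -- alphanumeric domain (string)
        code  TEXT                  -- alphanumeric domain (string)
    )
    """
)

# Records (tuples): one per logical entity.
conn.executemany(
    "INSERT INTO polygons VALUES (?, ?, ?, ?, ?)",
    [(27, 39.2, 55.4, "a", "11z"), (14, 192.4, 77.3, "a", "119f")],
)

# The key identifies exactly one record.
record = conn.execute("SELECT * FROM polygons WHERE id = ?", (14,)).fetchone()

# Field names, read back from the table definition.
fields = [row[1] for row in conn.execute("PRAGMA table_info(polygons)")]
```

Here `record` holds the single tuple whose key is 14, and `fields` lists the column names in definition order.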
Relational Database
• Original “Flat Files”
• Minimal structure
• Field types/domains specified (possible values)
Advantages: minimum structure, easy programming, flexible
Disadvantages: can be slow due to lack of structure
Relational “Keys”
• Any unique field can be a key.
• Keys can span multiple columns
– What is unique?
• Forests: forest-id or forest_name
• Recreational features: feature or description
Primary and Foreign “Keys”
• Primary key – index to a table
• Foreign key – index contained in the table that is possibly non-unique, but which serves as primary key in another table (can be used for joins)
Primary Key
Foreign Key
Every table must have a primary key (what is primary key for Trails table?)
Primary Key
Why join these tables? What could it tell you?
Relational Database – Joining Tables
• Rule: each row holds a unique combination of values, before and after join
• Ideally, isolate key in as few fields as possible
– db is simpler, smaller disk volume, faster query
– This is partly why ArcGIS uses a # field: it forces the primary key into a single column, in absence of other keys
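A minimal sketch of a primary key, a foreign key and a join, using Python's built-in sqlite3 module. The table layouts and the forest and trail names are hypothetical, chosen only to echo the forest trails example; they are not the tables shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE forests (
        forest_id   INTEGER PRIMARY KEY,   -- primary key of forests
        forest_name TEXT
    );
    CREATE TABLE trails (
        trail_id   INTEGER PRIMARY KEY,    -- primary key of trails
        trail_name TEXT,
        forest_id  INTEGER REFERENCES forests(forest_id)  -- foreign key
    );
    -- Hypothetical rows for illustration only.
    INSERT INTO forests VALUES (1, 'North Woods'), (2, 'Pine Ridge');
    INSERT INTO trails VALUES
        (10, 'Loop', 1), (11, 'Summit', 1), (12, 'Creek', 2);
    """
)

# Join: the foreign key in trails is matched to the primary key in forests.
# forest_id repeats in trails (non-unique) but is unique in forests.
joined = conn.execute(
    """
    SELECT t.trail_id, t.trail_name, f.forest_name
    FROM trails AS t
    JOIN forests AS f ON f.forest_id = t.forest_id
    ORDER BY t.trail_id
    """
).fetchall()
```

Each row of `joined` is still uniquely identified by `trail_id`, which illustrates the rule that rows stay unique before and after the join.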
Relational Database – Result of the Join
What is the primary key in the Joined Table? What foreign keys do you see in Joined Table?
Which fields can be primary keys?
Sometimes we need multiple fields to form a key, e.g. Parcel-ID and Own-ID
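A multi-field key can be sketched as follows with Python's built-in sqlite3 module. The table name `ownership`, the `share` column and all values are assumptions for illustration; only the idea of combining Parcel-ID and Own-ID comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ownership (
        parcel_id INTEGER,
        own_id    INTEGER,
        share     REAL,
        PRIMARY KEY (parcel_id, own_id)  -- key spans two columns
    )
    """
)
# Neither column is unique on its own; only the pair is.
conn.executemany(
    "INSERT INTO ownership VALUES (?, ?, ?)",
    [(100, 1, 0.5), (100, 2, 0.5), (101, 1, 1.0)],
)


def try_insert(row):
    """Return True if the row was accepted, False if it violates the key."""
    try:
        conn.execute("INSERT INTO ownership VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((100, 1, 0.25))  # same (parcel_id, own_id) pair
new_pair_ok = try_insert((101, 2, 0.0))    # new combination
```

The first insert is rejected because the pair (100, 1) already exists, while the second is accepted even though both 101 and 2 already appear individually.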
What is the primary key for this table?
Relational Join (1:n)
Resulting Joined table:
Sorting – ordering by attribute values

Simple sort – ascending AREA

Name               AREA          class  Type
Emily, Lake        52,222.6      1      Limnetic zone
Emily, Lake        58,662.2      1      Limnetic zone
(unnamed)          60,826.6      2      Shallow lakes
(unnamed)          64,588.5      2      Shallow lakes
(unnamed)          70,590.3      2      Shallow lakes
Long Lake          88,259.5      1      Limnetic zone
(unnamed)          143,285.3     2      Littoral zone
Sleepy Eye Lake    170,797.1     2      Littoral zone
Mud Lake           193,318.5     2      Shallow lakes
Goldsmith Lake     201,127.1     2      Littoral zone
Emily, Lake        336,343.2     2      Littoral zone
(unnamed)          349,528.7     1      Limnetic zone
(unnamed)          384,160.1     2      Littoral zone
Emily, Lake        420,798.4     1      Limnetic zone
Savidge Lake       479,709.7     2      Littoral zone
Emily, Lake        545,381.8     1      Limnetic zone
Dog Lake           635,537.0     2      Littoral zone
Duck Lake          1,126,331.9   1      Limnetic zone
Wita Lake          1,354,583.2   2      Littoral zone
(unnamed)          1,418,133.3   1      Limnetic zone
Ballantyne Lake    1,428,331.5   1      Limnetic zone
Washington, Lake   1,914,835.3   1      Limnetic zone
(unnamed)          1,937,698.6   1      Limnetic zone
…

Compound sort – ascending Type, then descending AREA within Type

Name               AREA          class  Type
(unnamed)          4,040,675.7   1      Limnetic zone
(unnamed)          1,937,698.6   1      Limnetic zone
Washington, Lake   1,914,835.3   1      Limnetic zone
Ballantyne Lake    1,428,331.5   1      Limnetic zone
(unnamed)          1,418,133.3   1      Limnetic zone
Duck Lake          1,126,331.9   1      Limnetic zone
Emily, Lake        545,381.8     1      Limnetic zone
Emily, Lake        420,798.4     1      Limnetic zone
(unnamed)          349,528.7     1      Limnetic zone
Long Lake          88,259.5      1      Limnetic zone
Emily, Lake        58,662.2      1      Limnetic zone
Emily, Lake        52,222.6      1      Limnetic zone
Wita Lake          1,354,583.2   2      Littoral zone
Dog Lake           635,537.0     2      Littoral zone
Savidge Lake       479,709.7     2      Littoral zone
(unnamed)          384,160.1     2      Littoral zone
Emily, Lake        336,343.2     2      Littoral zone
Goldsmith Lake     201,127.1     2      Littoral zone
Sleepy Eye Lake    170,797.1     2      Littoral zone
(unnamed)          143,285.3     2      Littoral zone
Mud Lake           193,318.5     2      Shallow lakes
(unnamed)          70,590.3      2      Shallow lakes
(unnamed)          64,588.5      2      Shallow lakes
…
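The two kinds of sort can be sketched in plain Python. The five rows below are a small subset of the lake table, and the tuple layout (name, area, type) is an assumption made to keep the example short.

```python
# A few rows from the lake table: (name, area, type).
lakes = [
    ("Long Lake", 88259.5, "Limnetic zone"),
    ("Mud Lake", 193318.5, "Shallow lakes"),
    ("Dog Lake", 635537.0, "Littoral zone"),
    ("Duck Lake", 1126331.9, "Limnetic zone"),
    ("Wita Lake", 1354583.2, "Littoral zone"),
]

# Simple sort: ascending AREA.
simple = sorted(lakes, key=lambda row: row[1])

# Compound sort: ascending Type, then descending AREA within each Type.
# Negating the area reverses its order while Type stays ascending.
compound = sorted(lakes, key=lambda row: (row[2], -row[1]))

simple_names = [row[0] for row in simple]
compound_names = [row[0] for row in compound]
```

In the compound result the rows are grouped by type first, and the largest lake of each type leads its group.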
Constraints on relational implementation
• Rules for implementing tables appear to be fast and loose.
• In fact, two kinds of constraints allow flexibility yet preserve logical consistency.
– Constraint 1 – limit the number of legal operations on relational tables (Relational Algebra)
– Constraint 2 – balance the amount of redundancy (Normal Forms)
Constraint 1 – limit # operations
• Codd’s relational algebra (only 8 operations)
– Combine or split tables
– Select rows or columns
– Expand tables (add rows or columns)
• Everything you do in Arc that geoprocesses tables is accomplished by these 8 operations.
The Eight Operators (after Bolstad, 2005)
• Restrict: select rows by attribute
– e.g. Select size >= big
– Simple or compound restricts (using logical operators AND, OR, NOT)
• Project: select specific column(s) (vertical subsetting)
– Recall the comment that fewer columns make a simpler db: better speed, smaller disk volume
Show SQL
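Selecting rows and selecting columns can be sketched in plain Python as two small functions. The function names `restrict` and `project`, the threshold `BIG`, and the sample rows are all assumptions for illustration; this is not how any particular DBMS implements them.

```python
# Hypothetical table: a list of records, each a dict of field -> value.
rows = [
    {"id": 1, "size": 1, "type": "m", "color": "red"},
    {"id": 2, "size": 2, "type": "n", "color": "blue"},
    {"id": 3, "size": 3, "type": "r", "color": "red"},
]

BIG = 2  # assumed threshold standing in for "big"


def restrict(table, predicate):
    """Select rows: keep the records for which the predicate is true."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Select columns: keep only the named fields, dropping duplicate rows."""
    seen, out = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out


# Compound restrict using AND and NOT.
big_not_green = restrict(
    rows, lambda r: r["size"] >= BIG and not r["color"] == "green"
)

# Projection onto one column; the repeated value "red" appears once.
colors = project(rows, ["color"])
```

Projection removes duplicates because a relational table holds each distinct row only once.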
• Product: combine all possible unique rows in two tables
– Combines all unique rows of one table with all unique rows of another table (cross-tabulating)
• Divide: often used in queries with “All” based on a condition
– “Find all types (m,n,r) associated with size = 1 and size = 2” … in 3rd table
– Returns list of types
– e.g. find structure types that are located in 2 different Hazard zones (1 and 2) out of a table that summarizes all relationships
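A minimal sketch of the cross-combination and the "all" query in plain Python, with tables held as sets of tuples. The function names `product` and `divide` and the sample (type, size) pairs are assumptions for illustration.

```python
from itertools import product as cartesian


def product(a, b):
    """Pair every row of table a with every row of table b."""
    return {x + y for x, y in cartesian(a, b)}


def divide(pairs, required):
    """Return the first-column values that are paired with ALL required values."""
    candidates = {x for x, _ in pairs}
    return {x for x in candidates if all((x, y) in pairs for y in required)}


# Product: 2 types x 2 sizes gives 4 combined rows.
types = {("m",), ("n",)}
sizes = {(1,), (2,)}
all_combos = product(types, sizes)

# Divide: which types occur with size 1 AND size 2?
# Hypothetical relationship table of (type, size) pairs.
type_size = {("m", 1), ("m", 2), ("n", 1), ("r", 1), ("r", 2)}
types_with_both = divide(type_size, {1, 2})
```

Type n is dropped from the result because it is paired with size 1 only.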
• Union: combine tables to return records found in one or both
– No duplicates; same set of attributes in input tables
• Intersect: combine tables to return records found in both
– Same set of attributes in input tables
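With tables held as Python sets of tuples, both combinations map directly onto set operators. The two sample tables and the helper `check_compatible` are assumptions for illustration; the helper only checks that every row has the same number of attributes.

```python
def check_compatible(a, b):
    """Raise if the two tables do not have the same number of attributes."""
    widths = {len(row) for row in a} | {len(row) for row in b}
    if len(widths) > 1:
        raise ValueError("tables must share the same set of attributes")


# Hypothetical tables with the same attributes: (id, type).
table_a = {(1, "m"), (2, "n")}
table_b = {(2, "n"), (3, "r")}

check_compatible(table_a, table_b)

# Records found in one or both tables; the shared row appears once.
in_one_or_both = table_a | table_b

# Records found in both tables.
in_both = table_a & table_b
```

Because sets cannot hold duplicates, the row (2, "n") appears only once in the combined result.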
• Difference: return rows in first but not second table – order matters!
– Similar to Erase
• Join: match candidate keys to expand attributes
– Sequential joins are possible
– What is the join field above?
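A minimal sketch in plain Python of the last two operators, including a sequential join. The function names, the parcel, zone and rate tables, and all values are hypothetical.

```python
def difference(a, b):
    """Rows in a that are not in b; swapping the arguments changes the result."""
    return [row for row in a if row not in b]


def join(left, right, key):
    """Match rows on a shared key field and merge their attributes."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Difference: order matters.
a_minus_b = difference([1, 2, 3], [3, 4])
b_minus_a = difference([3, 4], [1, 2, 3])

# Sequential joins: parcels -> zones (on "zone"), then -> rates (on "descr").
parcels = [
    {"parcel": 1, "zone": "R"},
    {"parcel": 2, "zone": "C"},
    {"parcel": 3, "zone": "R"},
]
zones = [
    {"zone": "R", "descr": "Residential"},
    {"zone": "C", "descr": "Commercial"},
]
rates = [
    {"descr": "Residential", "rate": 1.0},
    {"descr": "Commercial", "rate": 2.5},
]

first = join(parcels, zones, "zone")
second = join(first, rates, "descr")
```

After the second join each parcel record carries the attributes of all three tables.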
Summary
• We tried to explore Database Management Systems (DBMS)
• Databases for organizing and manipulating data
• Relational databases are most common
• Attribute data is managed and stored, and tables can be linked based on keys
• Operations on relational DBMS (Relational Operators) are very important concepts for DBMS
• Spatial data are special cases and often special structures are required to manage them
Pitfalls of Relational Tables
Relational tables have many advantages, but if improperly structured, they may suffer from:
- Poor performance
- Inconsistency
- Redundancy
- Difficult maintenance
This occurs when concepts of Normal Forms in relational tables are violated.
Constraint 2 – limit redundancy (do this with indexing keys & dependencies)
• Dependencies are needed to make relational DBMS work. Dependency means that one column pre-determines another.
• Dependency ↔ Redundancy (they complement and balance each other)
– Too much → bulky database, slower performance
– Not enough → can’t find all the info in the table easily, and difficult to join when added information is needed
Simple (Functional) Dependency
• Dfn: knowing one field in a row determines what the value in another field would be.
Example: Student Database
– Knowing a Buff One number determines student name
– Knowing name determines major (even Undeclared)
– Knowing major determines College (A&S, ENG)
• Functional dependencies are good (they’re simple)
• Transitive dependencies are bad
• Transitive: sequence of simple dependencies in one table.
• Bad because too much redundancy creates complex primary and foreign indexing keys (again, bulky, slow, and possibly contradictory)
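Whether one field determines another can be checked mechanically on sample data. This is a minimal sketch in plain Python; the function name `determines` is an assumption, and the rows are the major, department and college values used in the normal-form example later in the deck. A check on sample rows can only refute a dependency, it cannot prove one for all future data.

```python
def determines(rows, a, b):
    """True if, in these rows, each value of field a goes with one value of b."""
    seen = {}
    for row in rows:
        # setdefault stores the first b seen for this a, then returns it.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


majors = [
    {"major": "Interface Design", "dept": "CSI", "college": "ENG"},
    {"major": "Policy", "dept": "ENVS", "college": "A&S"},
    {"major": "Analysis", "dept": "ENVS", "college": "A&S"},
    {"major": "Phys Geog", "dept": "GEOG", "college": "A&S"},
    {"major": "Human Geog", "dept": "GEOG", "college": "A&S"},
    {"major": "GIS", "dept": "GEOG", "college": "A&S"},
]

major_to_dept = determines(majors, "major", "dept")
dept_to_college = determines(majors, "dept", "college")
dept_to_major = determines(majors, "dept", "major")

# A chain major -> dept -> college inside one table is a transitive dependency.
transitive_chain = major_to_dept and dept_to_college
```

The dependency runs one way only: a department does not determine a major, since ENVS and GEOG each offer several.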
How to resolve Constraint #2?
• Normalization ensures indexing keys provide just the right amount of dependency in a single table.
– “Just the right amount” means that edits can be made in just one table and propagated through the rest of the DBMS using table relationships (and joins).
– And database edits cannot easily corrupt the data (goal is to free the database of modification anomalies).
How to resolve Constraint #2?
• Normalize in stages, called “Normal Forms”
– Each form inserts or eliminates dependencies
– Codd proposed three in a sequence; further forms were added later
– First three needed for GIS
– When all three are in place, relational database contains only simple dependencies
– A normalized database is suitable for general purpose queries, meaning special cases in the database should not require different query formulation than general cases.
Normal Forms
• 1st normal form: atomic columns and cell values
– every cell contains only one attribute value, and
– no repeat columns appear in any single table
• 2nd normal form: establish simple dependencies
– attributes that do not make up the primary key are functionally dependent only on the primary key
– Split tables to remove duplicate rows
• 3rd normal form: eliminate transitive dependencies
– Split tables to remove dependent rows and columns
Additional normal forms can be established, but GIS uses only these three…
Establish 1st Normal Form
BuffOne     Student Name   Major                Dept         College
1234..789   Sally Jones    Interface Dsn        CSI          ENG
1357..246   Bob Willis     Policy, Human Geog   ENVS, GEOG   A&S
9876..321   Kathy Dunn     Phys Geog, Policy    GEOG, ENVS   A&S
5432..567   Hal Smith      GIS                  GEOG         A&S
5798..123   Carl Tomlin    Analysis, GIS        ENVS, GEOG   A&S

Problem? Cells can have only one value. Here, queries need to recognize and isolate one major from a list of possibly multiple majors, so queries become more difficult than they need to be.
Establish 1st Normal Form
BuffOne     Student Name   Major           Dept   College   Major        Dept   College
1234..789   Sally Jones    Interface Dsn   CSI    ENG       --           --     --
1357..246   Bob Willis     Policy          ENVS   A&S       Human Geog   GEOG   A&S
9876..321   Kathy Dunn     Phys Geog       GEOG   A&S       Policy       ENVS   A&S
5432..567   Hal Smith      GIS             GEOG   A&S       --           --     --
5798..123   Carl Tomlin    Analysis        ENVS   A&S       GIS          GEOG   A&S

1st Normal Form? Not yet! No cell contains more than one value, and now the primary key is BuffOne or Student Name; BUT repeated columns persist.
Establish 1st Normal Form
Done!
Every cell has only one value, and no duplicate columns.
BuffOne Student Name Major Dept College
1234..789 Sally Jones Interface Dsn CSI ENG
1357..246 Bob Willis Policy ENVS A&S
9876..321 Kathy Dunn Phys Geog GEOG A&S
5432..567 Hal Smith GIS GEOG A&S
5798..123 Carl Tomlin Analysis ENVS A&S
1357..246 Bob Willis Human Geog GEOG A&S
9876..321 Kathy Dunn Policy ENVS A&S
5798..123 Carl Tomlin GIS GEOG A&S
But what is the primary key now? (Note that Bob Willis, Kathy Dunn and Carl Tomlin have double majors.)
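The step from multi-valued cells to one value per cell can be sketched in plain Python. The function name `to_first_normal_form` is an assumption, the two input rows are taken from the first table above, and treating (BuffOne, Major) as a candidate key is one plausible answer to the question, not necessarily the one intended on the slide.

```python
# Two rows from the un-normalized table; some cells hold comma-separated lists.
raw = [
    ("1357..246", "Bob Willis", "Policy, Human Geog", "ENVS, GEOG", "A&S"),
    ("5432..567", "Hal Smith", "GIS", "GEOG", "A&S"),
]


def to_first_normal_form(rows):
    """Emit one row per (student, major) so every cell holds a single value."""
    out = []
    for buff_one, name, majors, depts, college in rows:
        # Majors and departments are listed in matching order, so pair them up.
        for major, dept in zip(majors.split(", "), depts.split(", ")):
            out.append((buff_one, name, major, dept, college))
    return out


flat = to_first_normal_form(raw)

# BuffOne alone now repeats for double majors, so it is no longer unique.
buff_one_unique = len({row[0] for row in flat}) == len(flat)

# The combination (BuffOne, Major) is unique in this sample.
pair_unique = len({(row[0], row[2]) for row in flat}) == len(flat)
```

The double major becomes two rows, which is why a single-column key no longer identifies a row.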
Establish 2nd Normal Form
2nd Normal Form – Remove duplicate rows to establish primary keys for simple dependencies; every non-key field depends only on the primary key.

Students
BuffOne     Student Name   Major1 ID   Major2 ID
1234..789   Sally Jones    134.05      Null
1357..246   Bob Willis     378.01      260.02
9876..321   Kathy Dunn     260.01      378.01
5432..567   Hal Smith      260.04      Null
5798..123   Carl Tomlin    378.02      260.04

Majors
Major ID   Major              Dept ID   Dept   College
134.05     Interface Design   134       CSI    ENG
378.01     Policy             378       ENVS   A&S
378.02     Analysis           378       ENVS   A&S
260.01     Phys Geog          260       GEOG   A&S
260.02     Human Geog         260       GEOG   A&S
260.04     GIS                260       GEOG   A&S

Done!
Establish 3rd Normal Form
3rd Normal Form – Eliminate transitive dependencies
Major ID → Dept ID and Dept ID → College

Students
BuffOne     Student Name   Major1 ID   Major2 ID
1234..789   Sally Jones    134.05      Null
1357..246   Bob Willis     378.01      260.02
9876..321   Kathy Dunn     260.01      378.01
5432..567   Hal Smith      260.04      Null
5798..123   Carl Tomlin    378.02      260.04

Majors
Major ID   Major              Dept ID
134.05     Interface Design   134
378.01     Policy             378
378.02     Analysis           378
260.01     Phys Geog          260
260.02     Human Geog         260
260.04     GIS                260

Departments
Dept ID   Dept   College
134       CSI    ENG
378       ENVS   A&S
260       GEOG   A&S

Done!
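The benefit of the split tables, that an edit is made in one place and reaches every related row through a join, can be sketched with Python's built-in sqlite3 module. The Majors and Departments rows are the ones above; the lower-case table and column names and the new college code 'ENVD' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        dept    TEXT,
        college TEXT
    );
    CREATE TABLE majors (
        major_id TEXT PRIMARY KEY,
        major    TEXT,
        dept_id  INTEGER REFERENCES departments(dept_id)
    );
    INSERT INTO departments VALUES
        (134, 'CSI', 'ENG'), (378, 'ENVS', 'A&S'), (260, 'GEOG', 'A&S');
    INSERT INTO majors VALUES
        ('134.05', 'Interface Design', 134),
        ('378.01', 'Policy', 378),
        ('378.02', 'Analysis', 378),
        ('260.01', 'Phys Geog', 260),
        ('260.02', 'Human Geog', 260),
        ('260.04', 'GIS', 260);
    """
)

# One edit in one table: move ENVS to a hypothetical college 'ENVD'.
conn.execute("UPDATE departments SET college = 'ENVD' WHERE dept_id = 378")

# The join picks up the change for every major in that department.
envs_majors = conn.execute(
    """
    SELECT m.major, d.dept, d.college
    FROM majors AS m
    JOIN departments AS d ON d.dept_id = m.dept_id
    WHERE d.dept_id = 378
    ORDER BY m.major_id
    """
).fetchall()
```

Had College stayed in the Majors table, the same change would have required editing every ENVS row, with the risk of leaving some of them contradictory.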
In 3rd Normal Form
• Tables have distinct sets of rows and columns
• Primary keys are unique identifiers within tables, and in each table, they span as few columns as possible.
• Items appearing in multiple tables keep same ID throughout
• All dependencies are simple; no transitive dependencies exist in any single table
Hybrid DBMS
• In GIS, hybrid database designs are often used
• Coordinate data in specialized structures (fast retrieval)
– object-related grouping
– indexing, listing
– pointers to link between geographic features and attributes
• Topology explicitly stored in an indexing table (using lists and pointers to keep information about adjacency,…)
• Attribute data in relational databases
Integrated DBMS
• Attributes and features stored in same type DBMS
– Relational
– Arc Geodatabase
– Item b) is the Management Base; c) is Data Base
– Where is the Analytic Base?
Geodatabases stored in Integrated DBMS
Object-Oriented DBMS
• Items are objects, “encapsulate” data in “frames”
• Classes have “properties” = “behaviors” = “methods”
• Subclasses “inherit” properties
• Objects pass “messages”
• E.g. GIS SmallWorld, ArcObjects (learn a bit about this in GIS 2)
• Has evolved into agent-based modeling, dynamic modeling and mobility tracking