
Data Quality Issues-Chapter 10


Page 1: Data Quality Issues-Chapter 10

GIGO: garbage in, garbage out
Quality Issues
– Terminology
– Sources, propagation, and management
What is Data Quality?
– Overall fitness or suitability of data for a specific purpose

Page 2: Data Quality Issues-Chapter 10

Errors, Accuracy, Precision, & Bias

Errors
– Difference between the real world and the GIS
– Could be one error, or the whole data set could be off
Accuracy
– Extent to which an estimated value approaches a true value
– Can never be 100% accurate
Precision
– Recorded level of detail

Page 3: Data Quality Issues-Chapter 10

Errors, Accuracy, Precision, & Bias

Bias
– Consistent error throughout the data set
– Human or equipment in origin
– Difficult to spot
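The distinction between error, accuracy, and bias can be made concrete with repeated measurements of a point whose true value is known. The following is a minimal sketch, not a standard procedure: the function name `error_summary` and the sample readings are invented for illustration. The readings are recorded to the millimetre (high precision in the sense of recorded detail), yet all sit about half a metre too high, which is the kind of consistent error the slide calls bias.

```python
import math
import statistics


def error_summary(measured, true_value):
    """Summarize repeated measurements of one quantity with a known true value."""
    errors = [m - true_value for m in measured]
    bias = statistics.mean(errors)        # consistent offset shared by all readings
    spread = statistics.pstdev(measured)  # scatter of the readings around their mean
    rmse = math.sqrt(statistics.mean([e * e for e in errors]))  # overall accuracy
    return {"bias": bias, "spread": spread, "rmse": rmse}


# Hypothetical elevations in metres, recorded to 0.001 m; true value is 100.000 m.
readings = [100.512, 100.498, 100.505, 100.493]
summary = error_summary(readings, 100.000)
print(summary)
```

The small spread shows why bias is difficult to spot: the readings agree with each other, and only comparison with an independent reference reveals the offset.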

Page 4: Data Quality Issues-Chapter 10

Resolution

Smallest feature or data that can be displayed
– Raster: cell size
– Vector: point size, line widths

Page 5: Data Quality Issues-Chapter 10

Generalization

Process of simplifying

Page 6: Data Quality Issues-Chapter 10

Completeness & Consistency

Completeness
– Are all instances of a feature that the GIS/map claims to include in fact there?
– Simply put, how much data is missing?
Logical Consistency
– Whether contradictory relationships are present in the database
    Some crimes recorded at place of occurrence, others at place where the report was taken
    Data for one country is for 2000, for another it's for 2001
    Annual data series not taken on the same day/month etc. (sometimes called lineage error)
    Data uses a different source or estimation technique for different years (again, lineage)
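Both checks on this slide can be sketched in a few lines. The example below is illustrative only: the record layout (dictionaries with `id` and `year` keys), the country codes, and the function names are assumptions, not part of any GIS package. Completeness is measured as the share of expected features that are actually present, and mixed reference years are flagged as a possible consistency problem.

```python
def completeness(expected_ids, present_ids):
    """Fraction of the features the data set claims to include that are present."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(present_ids)) / len(expected)


def reference_years(records, year_field="year"):
    """Distinct reference years; more than one suggests mixed-vintage data."""
    return sorted({r[year_field] for r in records if r.get(year_field) is not None})


# Hypothetical country-level records; "CH" is claimed but missing.
expected = ["AT", "BE", "CH", "DE"]
records = [
    {"id": "AT", "year": 2000},
    {"id": "BE", "year": 2001},
    {"id": "DE", "year": 2000},
]

share = completeness(expected, [r["id"] for r in records])
years = reference_years(records)
print(f"completeness: {share:.0%}, reference years: {years}")
```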

Page 7: Data Quality Issues-Chapter 10

Compatibility

– Overlay maps of different scales
    Cannot be combined
– Combining nominal and ratio
    Nominal scales distinguish one item from another, but they do not rank or quantify data.
    – Soil Name, City Name, Polygon Identification Number
    Ordinal scales identify the relative magnitudes, but they do not quantify exact differences between values.
    – Income = (low, medium, or high)
    – Slope = (A, B); where A = 0-4%, and B = 5-9%

[Figure labels: Slope, Crop]
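One way to reason about combining layers measured on different scales is to ask which operations the weakest scale still supports. The sketch below is an assumption-laden illustration: the `Scale` enum, the operation names, and the helper functions are invented here, and the interval level is added for completeness even though the slide does not mention it.

```python
from enum import IntEnum


class Scale(IntEnum):
    NOMINAL = 1   # names only: soil name, city name, polygon ID
    ORDINAL = 2   # ranked classes: income low/medium/high, slope A/B
    INTERVAL = 3  # not on the slide; included for completeness
    RATIO = 4     # true zero, so ratios are meaningful


# Weakest scale on which each operation is meaningful.
MIN_SCALE = {
    "equal": Scale.NOMINAL,
    "compare": Scale.ORDINAL,
    "subtract": Scale.INTERVAL,
    "divide": Scale.RATIO,
}


def can_apply(operation, scale):
    """True if the operation is meaningful for data on the given scale."""
    return scale >= MIN_SCALE[operation]


def common_operations(scale_a, scale_b):
    """Operations that stay meaningful when two layers are combined."""
    weakest = min(scale_a, scale_b)
    return [op for op, needed in MIN_SCALE.items() if weakest >= needed]


# Slope classes (ordinal) overlaid with crop type (nominal).
print(common_operations(Scale.ORDINAL, Scale.NOMINAL))
```

Under these assumptions, an overlay of slope class and crop type supports only equality tests (which slope class coincides with which crop), not ranking or arithmetic on the combined values.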

Page 8: Data Quality Issues-Chapter 10

Applicability

– Suitability of data for commands, operations or analysis
– E.g. using your GIS-collected data points for a parcel fabric

Page 9: Data Quality Issues-Chapter 10

Sources of Error in GIS

Survey Data
– Surveyor or instrument error
– Choice of spheroid and datum
– Data encoding and entry
    E.g. keying or digitizing errors
Remotely Sensed Data or Aerial Photography
– Mistakes in classification
– Change over time

Page 10: Data Quality Issues-Chapter 10

Manual Digitizing Errors

Cleaning and editing always required

Page 11: Data Quality Issues-Chapter 10

Vector to Raster or Raster to Vector

Page 12: Data Quality Issues-Chapter 10

Errors in Data Processing and Analysis

Is this data suitable for analysis?
Is it in a suitable format?
– Different datums?
Are the data sets compatible?
– Incompatible units?
– Widely different scales?
Will the output mean anything?
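The questions above can be turned into a simple pre-analysis check on layer metadata. This is a hedged sketch: the metadata keys (`datum`, `units`, `scale_denominator`), the sample layers, and the `max_scale_ratio` threshold of 4 are all assumptions chosen for illustration. What counts as "widely different" scales depends on the analysis.

```python
def compatibility_issues(a, b, max_scale_ratio=4.0):
    """List reasons why two layers should not be combined without conversion."""
    issues = []
    if a["datum"] != b["datum"]:
        issues.append(f"different datums: {a['datum']} vs {b['datum']}")
    if a["units"] != b["units"]:
        issues.append(f"incompatible units: {a['units']} vs {b['units']}")
    larger = max(a["scale_denominator"], b["scale_denominator"])
    smaller = min(a["scale_denominator"], b["scale_denominator"])
    if larger / smaller > max_scale_ratio:  # threshold is an assumption
        issues.append(f"widely different scales: 1:{smaller} vs 1:{larger}")
    return issues


# Hypothetical layer metadata.
parcels = {"datum": "NAD83", "units": "metre", "scale_denominator": 2400}
soils = {"datum": "NAD27", "units": "foot", "scale_denominator": 24000}

for issue in compatibility_issues(parcels, soils):
    print(issue)
```

An empty list does not mean the output will be meaningful; it only means none of these three checks failed.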

Page 13: Data Quality Issues-Chapter 10

Classification Errors

Page 14: Data Quality Issues-Chapter 10

EVALUATING CURRENT DATA

Most of the information captured in a GIS generally exists somewhere in the office that requires the application. Some additional data may be purchased or obtained by data sharing with other agencies.

The source, accuracy, reliability, condition and scale for each document or record must be evaluated.

Page 15: Data Quality Issues-Chapter 10

SOURCE

The data may be in paper or map form, or it may exist in computer files on another system.
– Where did that information come from?
– What is the source of the source?
– Do you know how the map was compiled?
– Do you know who compiled the map or record?
– Have you spoken with the author to learn as much as possible about the data?
– What are the strong & weak points about the data?

Page 16: Data Quality Issues-Chapter 10

Data Accuracy & Reliability

There are different types of accuracy.
– Absolute positional accuracy refers to the measurement of map location as it relates to a real-world location (for example, a GPS coordinate point).
– Relative positional accuracy is a measure of the relationships between the different features on the map. Relative accuracy compares the scaled distance between features measured from the map data with distances measured between the same features on the ground.
The other type of accuracy deals with the content of the information in the GIS database. Are there errors or missing data? A road may have positional accuracy but have the wrong road name associated with the feature. We think of this as reliability.
Another very important aspect of reliability is how current the data sources are. If the map or record has not been properly maintained, some method of bringing the document up to date must be instituted.
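The difference between absolute and relative positional accuracy can be illustrated with a map whose features are all shifted by the same amount. The sketch below assumes planar coordinates in metres and uses root-mean-square error as the summary measure. Both are choices made for this example, and the function names and coordinates are invented.

```python
import math
from itertools import combinations


def absolute_rmse(map_pts, ground_pts):
    """RMS displacement between each mapped point and its ground position."""
    sq = [math.dist(m, g) ** 2 for m, g in zip(map_pts, ground_pts)]
    return math.sqrt(sum(sq) / len(sq))


def relative_rmse(map_pts, ground_pts):
    """RMS difference between map and ground distances for every pair of points."""
    diffs = [
        math.dist(map_pts[i], map_pts[j]) - math.dist(ground_pts[i], ground_pts[j])
        for i, j in combinations(range(len(map_pts)), 2)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical ground control points and a map shifted 3 m east, 4 m north.
ground = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in ground]

print(absolute_rmse(shifted, ground))  # every point is 5 m from its true position
print(relative_rmse(shifted, ground))  # distances between features are unchanged
```

A uniformly shifted map therefore has poor absolute accuracy but perfect relative accuracy, which is why the two must be evaluated separately.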

Page 17: Data Quality Issues-Chapter 10

Data Accuracy & Reliability

Page 18: Data Quality Issues-Chapter 10

MAINTENANCE OF DATA

Many of the answers needed to ensure proper data maintenance are fleshed out in a preliminary needs and data analysis.
Specifically, maintaining data involves knowing:
– Frequency of change
– Quantity of change
– Sources of change

It must be reiterated: if data is not going to be maintained, DO NOT PUT IT IN YOUR GIS.

Page 19: Data Quality Issues-Chapter 10

Condition

The condition of the source documents, especially maps, will determine how difficult the conversion will be.

Clear mylar and ink drawings will be easier to digitize (no matter what the method) than maps of poor legibility.

Page 20: Data Quality Issues-Chapter 10