Upload
silvia-reed
View
235
Download
2
Tags:
Embed Size (px)
Citation preview
DataAcquisition
Best Practices for Creating Data Data Entry Options Data Manipulation Options Gathering Existing Data
Module Topics
DataAcquisition
After completing this lesson, the participant will be able to: ◦ Describe the characteristics of an easily understood and
manipulated dataset◦ Identify data entry tools and validation measures that can be
done as data is entered◦ Define what a relational database is and why it is useful◦ Describe documentation associated with using existing data
Learning Objectives
DataAcquisition
Collect
Assure
Describe
Deposit
Preserve
Discover
Integrate
Analyze
The Data Life Cycle
DataAcquisition
Whether recording data on paper forms or entering data digitally, accuracy and usability are important
Have a plan for storing your data before you collect it
The quality of your data records should be defensible
Following proper procedures when creating data results in digital datasets that are:◦ Valid◦Organized◦ Easy to understand◦ Easy to subset
Goals of Data Entry and Recording
DataAcquisition
Example: Poor Data Organization
• Inconsistency between data collection events (& more than one activity type)
– Date in title or as a data element– Column names vary– Order of columns differs
• Other Issues– Inconsistent Date formats– Different Site spellings– Typos in Site spellings– Poor column names (ex: Acult)– Inappropriate column use (‘got
away’, J, N in Acult column)– Mean1 Label is in Species column– Text and numbers in same column
DataAcquisition
Same Data, Using Best Practices
• Descriptive File Name• Column data types are consistent:
– only numbers, dates, or text• Consistent Names, Codes, Formats (date) used in each column• Similar data (separate samples) are all stored in one table with a single format –
computing is simpler using a single table than multiple, unrelated small tables that require a lot of individual human manipulation and synchronization
DataAcquisition
Create simple, consistently formatted file names. Consider using a date suffix in the name to help with version control. MyFieldData_20110910a.xls
Avoid using -,+,*,^, /, and other higher ASCII characters in column names. Software may interpret them as operators or reserved characters.
Create readable, descriptive column names without spaces or special characters◦ Soil T30 Soil_Temp_30cm◦ Species-Code Species_Code
Best Practice: Use Naming Conventions
DataAcquisition
Date Time NO3_N_Conc NO3_N_Conc_Flag20081011 1300 0.01320081011 1330 0.01620081011 1400 M
20081011 1430 0.018
20081011 1500 0.001 Est
Best Practice: Handling Missing Data◦Whenever possible, leave the value empty (NULL = no value)◦Use a Data_Flag column to qualify missing values (and other
issues) Ex: NA, Est, Null (no qualification)◦ For numeric fields, you might need to provide a distinct unlikely
value such as 9999 to indicate a missing value◦ For text fields, “NA” may be appropriate ( for “Not Applicable”
or “Not Available”)
M = missing; no sample collected
Est = estimated from grab sample
DataAcquisition
Enter complete lines of data, don’t assume a person or program will know to ‘fill in below’ for cells in a spreadsheet
Best Practice: Create Complete Records
Sorting can remove the best of orderly intentions!
Also avoid creating columns that are rarely used
DataAcquisition
Document the (authoritative) data source Obtain and keep all metadata about the format, content,
quality, and limitations of the data Document any data-sharing agreements or data-use
constraints Before use, evaluate the data for its fitness to your
needs Track the provenance of the data in derived data
products Consider informing the data source of the use of their
data
When Using Existing Data…
DataAcquisition
Excel: Pick-list validation
Control data entry by using standardized pick lists
Also look into using Excel’s automatic ‘data form’ feature
DataAcquisition
Excel: Range Validation
20
Avoid entry errors by trapping values - by range, type, etc.
Let the spreadsheet do the work of warning you!
DataAcquisition
Great for Charts, Graphs, Calculations
Poor choice for large or complex datasets
Difficult to subset Flexible about cell content
type (no type enforcement) Lack record integrity (can
independently sort columns relative to other columns)
Easy to use – but harder to maintain as complexity and size of data grows (lots of repeats in records)
Spreadsheet vs. Database Not a math or graphics tool,
but can provide data to those tools
Work well with lots of data Easy to query and subset data Data fields are typed – only
integers in integer fields Columns cannot be sorted
independently of each other (a good thing!!)
Slight learning curve compared to a spreadsheet
Normalization reduces complexity and entry effort
DataAcquisition
Why should you use a relational database?
*siteIDsite_name
latitudelongitude
description
Site
• Each data value is stored in only one place• Tables contain attributes describing a single thing (Site)• Primary Keys (marked with *) establish row identities, and are shared
between tables to link them• Based on set theory, the query language provides a concise way of
grouping, sorting, filtering, & summarizing
*speciesIDspecies_name
common_namefamilyorder
Species
*sampleIDsiteID
sample_datespeciesID
heightflowering
flagcomments
SpeciesSample
• sets of well-defined tables
• formal relationships• a query language for
manipulating data
DataAcquisition
Database Features: Explicit control over data types
Date Site Species Height Flowering<dates only> <text only> <from code
list>< real numbers only>
< ‘y’ and ‘n’ only>
Advantages• quality control• performance• organization
DataAcquisition
Date Site Species Flowering?
2/13/2010 A BOGR2 y
2/13/2010 B HODR y
4/15/2010 B BOER4 y
4/15/2010 C PLJA n
Database Features: Relationships are defined between tables
Site Latitude Longitude
A 34.1 -109.3
B 35.2 -108.6
C 32.6 -107.5
Date Site Species Flowering? Latitude Longitude2/13/2010 A BOGR2 y 34.1 -109.3
2/13/2010 B HODR y 35.2 -108.6
4/15/2010 B BOER4 y 35.2 -108.6
4/15/2010 C PLJA n 32.6 -107.5
Join tables and their data using queries
DataAcquisition
Date Plot Treatment SensorDepth SoilTemperature
2010-02-01 C R 30 12.8
2010-02-01 B C 10 13.2
2010-02-02 A N 0 15.1
Database Features: Powerful Query Language
OR Select * from SoilTemp where Treatment = ‘N’ and SensorDepth = ‘0’
OR Select Max(Soil_Temperature) from SoilTemp
Select Date, Plot, Treatment, SensorDepth, Soil Temperature from SoilTemp where Date > ‘2010-01-01’
DataAcquisition
SAS, R, SPSS Good for calculations, data analysis, subsetting data Also can be used for quality assurance
Statistical Software
DataAcquisition
Be aware of Best Practices when designing data file structures
Make sure you adequately document data that you obtain from others, and track the use of those data in your derived data products
Choose a data entry method that allows some validation of data as it is entered
Consider investing time in learning how to use a database if datasets are large or complex
Conclusion
DataAcquisition
Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Les A. Hook, Suresh K. Santhana Vannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf
References
DataAcquisition
1. Which of the following datasets is entered according to best practices?
date site sp wght1/2/2010 1 1 0.22/5/2010 1 2 lost3/2/2010 1 3 0.5
date site species weight_gm1/2/2010 1 1 0.22/5/2010 1 2 Missing – eaten?3/2/2010 1 3 0.5
date site species weight_gm flag comment1/2/2010 1 Acacia constricta 0.2 2/5/2010 1 Acacia greggii Missing from
plotEaten?
3/2/2010 1 Acacia redolens 0.5
date site sp wt flag comment1/2/2010 1 Ac 0.2 2/5/2010 1 Ag Missing from plot Eaten?3/2/2010 1 Ar 0.5
DataAcquisition
consistent
2. It is considered best practice if you indicate any _______ data.
missing
different
special characters in the
DataAcquisition
“value missing”
3. In numeric fields use a value such as_______ for missing data.
*
-
9999
DataAcquisition
valid or void
4. It is considered best practice to insert into text fields with missing data __________or_____________ .
null or NA
missing or void
missing or NA
DataAcquisition
periods
5. In a separate column use_____ to indicate empty cells.
data flags
dashes
brackets
DataAcquisition
numbers or special characters
6. A best practice in data entry is to create descriptive column names without _____ or ________.
spaces or special characters
dashes or lengthy descriptions
errors or spaces
DataAcquisition
Bird population monitoring in a river valley that will go on for several years
7. In which case should you use a database rather than spreadsheets for your data?
Greenhouse plant growth study with a 3x3 factorial design and a one time destructive sampling of biomass
Nitrogen concentrations in leaves from three different tree species in spring of 2010
DataAcquisition
We want to hear from you! CLICK the arrow to take our short survey.
Before you go . . .