SCD Research Data
For UCAR Data Management Working Group
January 10, 2001
Steven Worley, Scientific Computing Division, Data Support Section
Slide 2
Four Categories of Data Service
User Profile
Data Content
Data Access
Slide 3
Four Categories of Data Service
Archives directly from the MSS
- Accessible to all with NCAR computing accounts
Web-accessible online data server
- Information interface for all data
Individual requests
- Customized on a per-request basis
Data preparation for large projects
- E.g. reanalyses at ECMWF and NCEP
Slide 4
User Profile, MSS User Groups
Slide 5
User profile, online data server
Users based on network address domain, data for 1995-1998
~20K unique addresses per year

Domain            % of total
.com              24
.edu              18
.gov              4
International     17
.mil              1
.net              16
No domain (IP)    24
.org              1
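
A breakdown like the one above could be tallied from a list of client addresses roughly as follows. This is a minimal sketch, not the procedure actually used for the slide: the function names are hypothetical, and it assumes that any address ending in a number is an unresolved IP and that any alphabetic suffix outside the six generic domains counts as "International".

```python
from collections import Counter

# Generic top-level domains reported on the slide. Treating every other
# alphabetic suffix as "International" is an assumption of this sketch.
GENERIC_TLDS = {"com", "edu", "gov", "mil", "net", "org"}


def classify(address: str) -> str:
    """Return the reporting category for one client address."""
    last = address.strip().lower().rstrip(".").rsplit(".", 1)[-1]
    if not last or last.isdigit():
        return "No domain (IP)"      # unresolved numeric address
    if last in GENERIC_TLDS:
        return "." + last
    return "International"           # e.g. country-code domains


def domain_shares(addresses):
    """Percent of unique addresses falling in each category."""
    unique = {a.strip().lower() for a in addresses}
    counts = Counter(classify(a) for a in unique)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}
```

Because each share is rounded independently, the reported percentages need not sum to exactly 100.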
Slide 6
User profile, individual request
Requests excluding CD-ROMs, based on 1998-1999 data
28% U.S. Univ. (179 of 638)
11% Foreign Univ. (69)
27% Foreign Non-Univ. (171)
34% U.S. Gov. and Commercial (219)
(Remarkably, some foreign and government sources find it desirable to acquire their own data from SCD/DSS.)
Slide 7
User profile, all users All users by year, excluding online
category
Slide 8
User profile, finding the data
Peer and colleague recommendations
Acknowledgements in publications
WWW searches and perusing
Slide 9
Quick look at DSS Information Interface
Website, dss.ucar.edu
Top level information and dataset groupings
Oceanographic datasets by Category
Slide 10
Important improvements for the Information Interface
More top level documents to guide users to the best datasets
For improved searches
- Carefully worded .html..
- Pages with introductory text that clearly defines the dataset, with keywords that promote discovery.
- .html..; note, not all search engines boost ranking based on these.
Slide 11
User profile, compliments
Fast service, requests receive prompt action.
Staff with scientific knowledge to offer assistance and guidance.
Flexible system can adapt to meet users' requirements.
Slide 12
What makes this system work
The data records and files remain in simple structures
- This way the archive should always be accessible to programs written in low-level languages.
- The data can survive evolutions in operating systems and software; 50 years is not too much.
- Programs can be written that allow fast and efficient manipulation of large collections.
- Internal checksum keys can be strategically placed to ensure data integrity at any level.
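
To illustrate the idea of simple record structures with an internal checksum, here is a minimal sketch. The record layout (date fields, position, one value) and the choice of CRC-32 are assumptions made for illustration; the slides do not describe the archive's actual formats or checksum scheme.

```python
import struct
import zlib

# Hypothetical fixed-length record: year, month, day, hour as big-endian
# 16-bit integers, then latitude, longitude, value as 32-bit floats.
PAYLOAD = struct.Struct(">4h3f")
# A CRC-32 of the payload is stored after it (assumed checksum scheme).
CHECK = struct.Struct(">I")
RECORD_SIZE = PAYLOAD.size + CHECK.size


def pack_record(year, month, day, hour, lat, lon, value) -> bytes:
    """Serialize one record and append its checksum."""
    payload = PAYLOAD.pack(year, month, day, hour, lat, lon, value)
    return payload + CHECK.pack(zlib.crc32(payload))


def unpack_record(record: bytes):
    """Verify the checksum, then return the decoded fields as a tuple."""
    if len(record) != RECORD_SIZE:
        raise ValueError("truncated record")
    payload = record[:PAYLOAD.size]
    (stored,) = CHECK.unpack(record[PAYLOAD.size:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return PAYLOAD.unpack(payload)


def read_records(data: bytes):
    """Yield every record in a buffer of back-to-back records."""
    for offset in range(0, len(data), RECORD_SIZE):
        yield unpack_record(data[offset:offset + RECORD_SIZE])
```

Because every record has the same length, a reader can seek directly to record n at byte offset n * RECORD_SIZE, and the same layout can be read from FORTRAN or C with a few lines of code.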
Slide 13
User profile, complaints
Not all the data is online, even though putting it all online is quite impractical (12+ TB)
Not all the data is in their favorite format: IDL, HDF, netCDF, GrIB, ASCII, GIS, Binary, .xls, Matlab, etc.
Can I just get the piece I need?
Do you mean I need to know some FORTRAN or C Language?
Slide 14
User Profile, skill set
Best skill set for our users includes knowing some FORTRAN and/or C.
Trend: more and more people are requesting data in formats specific to an application environment.
Will the next generation of scientists know a basic computing language?
Slide 15
Data Content, size and characteristic
Veritable smorgasbord of data.
Overall size, 12+ TB
500+ distinct datasets
Many historical observations from the atmosphere and ocean
Many operational analyses and reanalyses
Dataset sizes, < 1 MB to several TB
Many original formats. GrIB is dominant in our analysis and reanalysis datasets
Slide 16
Data Content, metadata management
Primarily, metadata is managed on our online information server.
Each dataset has a WWW page.
All dataset WWW pages are automatically formed.
Corrections, additions, and changes are made to text files maintained under a Unix change control system.
Advantage: a history is kept of all changes and of the data files associated with each dataset, and the WWW pages are always current.
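
As a rough illustration of forming a dataset page automatically from a text file, consider the sketch below. The "key: value" file layout, the field names (title, summary, keywords), and the HTML layout are all assumptions for the example; the slides do not describe the actual file format or page generator.

```python
from html import escape


def parse_metadata(text: str) -> dict:
    """Parse 'key: value' lines; blank lines and '#' comments are skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields


def render_page(fields: dict) -> str:
    """Build a small HTML page from the parsed metadata fields."""
    title = escape(fields.get("title", "Untitled dataset"))
    summary = escape(fields.get("summary", ""))
    keywords = escape(fields.get("keywords", ""))
    return (
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        f'<meta name="keywords" content="{keywords}">\n'
        "</head>\n<body>\n"
        f"<h1>{title}</h1>\n<p>{summary}</p>\n"
        "</body>\n</html>\n"
    )
```

Since the text file is the only thing edited by hand, regenerating the page after each committed change keeps the web page in step with the version history.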
Slide 17
Data Content, metadata management
We have considerable amounts of hard copy references and metadata.
- We are making scanned images of these now.
Slide 18
Data Content, long term archive and security
Small datasets and irreplaceable observations and analyses have two copies on the MSS
- Although we cannot guarantee they reside on separate cartridges
Files are write-password protected, which prevents accidental overwrites.
We have been fortunate to have a very reliable MSS, and our success will continue to rely on it in the future.
Slide 19
Data Content, long term archive and security
Areas of concern
We don't have adequate offsite backups
- At least critical observations should be protected from catastrophe at the Mesa Lab
In the event of loss of single-copy large datasets we rely on other centers for replacement
- This needs to be discussed more nationally
- Redistribution may have restrictions or be costly
Slide 20
Data Content, long term archive and security
Areas of concern, continued
Must always remain on guard so important data are not lost due to short-sighted policy decisions.
Must participate in national and international projects so that the archive content is continually refreshed with the most scientifically important data, at low cost.
Slide 21
Data Access, annual summary
Slide 22
Data Access, aids to access
Maintain FORTRAN code to read all data files
- Sometimes for many platforms (Unix, PC)
The MSS file location is defined for all datasets, and is available online.
Staff specialists are assigned and identified for each dataset
Slide 23
Data Access, most frequent
NCEP/NCAR Global Atmospheric Reanalysis, 2.6 TB
How?
- MSS
- WWW (monthly means)
- CD-ROMs
- FTP
- Various tape media (large capacity)
Slide 24
Data Access, largest barrier
Discovering what is available
Gaining access to the MSS collection (when they don't have a computing account)
Not having experience with low-level languages, e.g. FORTRAN and C/C++
Slide 25
Data Access, product development
Yes we do, and we feel it is very important!
Why?
- Can QC the data and identify problems early
- Can reorganize into logical collections, or create popular subsets
- Reduce the volume of large collections to a manageable size for users
- Saves many users extra work
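
One common way to reduce volume is to derive monthly means from higher-frequency data. The sketch below shows the idea; the input layout (year, month, day, value tuples with None for missing values) and the function name are hypothetical and do not describe any DSS product.

```python
from collections import defaultdict


def monthly_means(daily):
    """Reduce (year, month, day, value) tuples to {(year, month): mean}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, month, _day, value in daily:
        if value is None:            # skip missing observations
            continue
        sums[(year, month)] += value
        counts[(year, month)] += 1
    # Months with no valid observations do not appear in the result.
    return {key: sums[key] / counts[key] for key in sorted(sums)}
```

A month of daily values collapses to a single number per variable and grid point, roughly a thirtyfold reduction, which is the kind of saving that makes a derived product practical to serve online.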
Slide 26
Data Access, improvements for scientific advancement
Minimize the barriers that inhibit discovery (the metadata problem).
Supply the data in the user's favorite format, or provide tools that can convert the data where it is practical and efficient.
Place more data, and valuable higher level data products, online