Automated Metadata Creation: Possibilities and Pitfalls
Presented by Wilhelmina Randtke
June 10, 2012
Nashville, Tennessee
At the annual meeting of the North American Serials Interest Group.
Materials posted at www.randtke.com/presentations/NASIG.html
Background: What is “metadata”?
Metadata = any indexing information
Examples:
MARC records
color, size, etc. to allow clothes shopping on a website
writing on the spine of a book
food labels
What we’ll cover
• Automated indexing: human vs. machine indexing
• Range of tools for automated metadata creation: techy and less techy
• Sample projects
• A little background on relational databases
• Database design for a looseleaf (a resource that changes state over time)
• Sample project: the Florida Administrative Code, 1970-1983
Automated Indexing: What’s easy for computers?
Computers like black and white decisions.
Computers are bad with discretion.
Word search vs. Subject headings
One trillion (1,000,000,000,000) webpages indexed in Google
… 4 years ago …
Nevertheless…
… Human indexing is alive and well
How to fund indexing?
http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress
Who made the metadata: human or machine?
How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html
Not automated indexing, but a related concept….
Always try to think about
how to reuse existing metadata.
High Tech automated metadata creation
The high end: Assigning subject headings with computer code
Some technologies:
• UIMA (Unstructured Information Management Architecture)
• GATE (General Architecture for Text Engineering)
• KEA (Keyphrase Extraction Algorithm)
Computer Program for Automated Indexing
Inputs: an ontology or thesaurus, plus the item to be indexed.
Person’s role:
• Select an appropriate ontology.
• Configure the program so that it’s looking at outside sources.
• Review the results and make sure the assigned subject headings are good.
Program’s role:
• Take the ontology or thesaurus and apply it to each item to give subject headings.
Output: subject headings for the item.
http://www.nzdl.org/Kea/examples1.html
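The division of labor above can be sketched in a few lines of code. This is a toy illustration of thesaurus-based indexing, not how UIMA, GATE, or KEA actually work; the thesaurus and sample item here are invented for the example.

```python
import re

def assign_subjects(text, thesaurus):
    """The program's role: apply a thesaurus to one item, returning
    each heading whose lead-in terms appear in the item's text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [heading for heading, terms in thesaurus.items()
            if words & terms]

# Hypothetical thesaurus, chosen by the person: heading -> lead-in terms.
thesaurus = {
    "Serial publications": {"serial", "serials", "periodical", "periodicals"},
    "Cataloging": {"cataloging", "catalog", "metadata"},
    "Aquaculture": {"aquaculture", "fish"},
}

item = "Automated metadata creation for serials, presented at NASIG."
print(assign_subjects(item, thesaurus))
# ['Serial publications', 'Cataloging'] -- a person still reviews these
```

The person’s review step stays in the loop: the program only proposes headings, it does not judge whether they are good.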
The lower end: Deterministic fields
There’s an app for that
Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot
Batch OCR
Many tools exist to extract text from PDFs to Excel
Walkthrough – examining the extracted spreadsheets
http://fsulawrc.com/excelVBAfiles/index.html
How to plan the program
• Look for patterns.
• Write step-by-step instructions about how to process the Excel file.
• Remember: NO DISCRETION. Computers do not take well to discretion.
• Good steps:
• Go to the last line of the worksheet.
• Look for the letter a or A.
• Copy starting from the first number in the cell, up to and including the last number in the cell.
• Bad steps:
• Find the author’s name. (This step needs to be broken into small “stupid” steps.)
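The “good steps” above are concrete enough to translate directly into code. Here is a sketch in Python (the actual project used Excel VBA); the sample cell text is hypothetical.

```python
import re

def has_letter_a(cell):
    """Step: look for the letter a or A in the cell."""
    return "a" in cell.lower()

def copy_number_span(cell):
    """Step: copy starting from the first number in the cell,
    up to and including the last number in the cell."""
    # Greedy match: first digit through last digit; lone digit as fallback.
    match = re.search(r"\d.*\d|\d", cell)
    return match.group(0) if match else ""

cell = "Chapter 6A-1.001  General Provisions"
print(has_letter_a(cell))      # True
print(copy_number_span(cell))  # 6A-1.001
```

Each function does one discretion-free step; chaining several of them is what turns a pile of extracted text into metadata fields.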
Writing the program
• Identify appropriate advisors.
• Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.
• If IT staff tell you they do not know how to do something, that honesty is valuable: go back to those people for advice on all future projects.
• Try to find entry-level material on coding. (Sadly, most computer programming instruction already assumes you know some programming.)
• If outsourcing or collaborating, remember: the index is the ultimate goal. An understanding of the index needs to be in the picture, and you probably have to bring it in.
Finding Advisors: Most campus IT is about carrying heavy objects
Perfection?
How close to perfection can you get?
Let’s run some code:
A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls
Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx
The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com
How much metadata was missing?
(27,992 fields total per field, after preliminary removal of blank pages)
• Chapter no. before dash: 183 empty (99.3% filled)
• Chapter no. after dash: 2,179 empty (92.2% filled)
• Page no.: 1,766 empty (93.6% filled)
• Supplement no. (i.e., date the page went into the looseleaf): 3,242 empty (88.4% filled)
• Replacing supplement (i.e., date the page was removed from the looseleaf): all empty (0% filled; however, 105 fields were entered manually in order to demonstrate the interface and get funding for manual metadata creation)
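The percent-filled figures follow directly from the empty counts. A quick arithmetic check (counts from the table above; the page-number row works out to 93.7%, a hair above the slide’s 93.6%, likely truncation):

```python
# Percent filled = 100 * (1 - empty / total); 27,992 fields per column.
TOTAL = 27992
empty_counts = {
    "Chapter no. before dash": 183,   # -> 99.3% filled
    "Chapter no. after dash": 2179,   # -> 92.2% filled
    "Page no.": 1766,                 # -> 93.7% filled
    "Supplement no.": 3242,           # -> 88.4% filled
}
for field, empty in empty_counts.items():
    print(f"{field}: {100 * (1 - empty / TOTAL):.1f}% filled")
```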
Cheap and fast, and incomplete
This is a search engine built on an index of the automated metadata only:
http://fsulawrc.com/automatedindex.php
It’s better than a shuffled pile of 30,000 pages.
It’s not very good.
If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.
Filling in the gaps
Code helps speed the workflow, but it is still time consuming.
http://fsulawrc.com/phptest/chaptbeforedashfill.php
This is editing a copy of the automated metadata database. You can enter as much as you like, and not break anything.
Last step: Auditing for missing pages, by comparing instruction sheets that went out with supplements
www.fsulawrc.com/supplementinstructionsheets.pdf
Task / Hours spent / Category of work
• Inspecting looseleaf and planning a database: 20 hours (high skill, high training). Database work.
• Digitization with sheetfed scanner: 35 hours (low skill, low training). Digitization.
• Planning the code for automated indexing: 20 hours (high skill, high training). Database work.
• Coding for the automated indexing: 35 hours (would be faster for someone with a programming background). Automated metadata.
• Running the script, and cleaning up metadata: 35 hours (skilled staff). Automated metadata.
• Loading database and metadata on a server: 10 hours (would be about twice as fast for someone with more database design experience). Database work.
• Coding online forms to speed data entry: 15 hours (skilled staff). Manual metadata.
• Training on documents and database design: 15 hours (unskilled staff, but done before the student assistant got set up with computer forms and permissions). Manual metadata.
• Metadata entry for fields the computer didn’t get: 98.25 hours (unskilled staff). Manual metadata.
• Auditing the database against instruction sheets which went out with supplements: 342.75 hours (skilled staff; includes training time for student assistant). Auditing.
Where did the time go?
[Chart: tasks and hours by category: Database Work, Digitization, Auditing, Manual Metadata Creation, Automated Metadata Creation]
Error rates
Automated metadata for Supplement Number: 2.4%
Human metadata for Supplement Number: 0.8%
Automated metadata for Page Number
with systematic error: 1.0%
with the systematic error removed: 0.3%
Human metadata for Page Number: 3.1%
Error rates for the thesis indexer on GitHub: 5% - 6%
Do error rates matter?
• For the automated rates, the errors may really be measuring OCR quality.
• Most metadata will be words, not numbers.
• Words are easier for a computer to pull out. Misspellings are obvious when reviewing output.
• Words are easier for a person to pull out. Less fatigue.
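One reason numeric fields fare worse: OCR tends to confuse certain letters and digits (l/1, O/0, S/5), and the resulting values still look plausible as numbers, so the errors hide from a reviewer. A hypothetical cleanup pass, safe only on fields known to contain nothing but digits and punctuation:

```python
# Letters commonly misread for digits by OCR; illustrative, not exhaustive.
OCR_DIGIT_FIXES = str.maketrans({"l": "1", "I": "1", "O": "0",
                                 "o": "0", "S": "5", "B": "8"})

def clean_numeric_field(raw):
    """Map letters commonly misread for digits back to digits.
    Only apply to fields that should be purely numeric."""
    return raw.translate(OCR_DIGIT_FIXES)

print(clean_numeric_field("1O3"))   # 103
print(clean_numeric_field("l2-S"))  # 12-5
```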
Recommendations
• For practitioners:
• Consider automating a process. Is it possible to index this without human involvement?
• Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it.
• For administrators:
• Allow work time for this type of experimentation.
Good resources to get started
• A-PDF to Excel Extractor
• A program that takes text from PDFs and puts it in Excel.
• www.a-pdf.com/to-excel/download.htm
• This is an easy start to get source material into a format you can work with.
• Excel Visual Basic (VBA) tutorials by Pan Pantziarka
• Almost all training material on coding assumes you already know how to code. These tutorials are good because they assume you do not already know something.
• www.techbookreport.com/tutorials/excel_vba1.html
• For more advanced instructions, use a search engine to read message boards.
• eHow instructions for turning on the Developer ribbon in Excel 2007
• http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html (use these same instructions for Excel 2010; older versions of Excel have the Developer ribbon turned on by default)
• How to get to the tab where you can do simple coding.
• How to Build a Search Engine
• http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012
• Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)
• This one is good because it doesn’t assume that you already know how to code.
Good resources to get started
• Wikipedia section on string processing algorithms.
• http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms
• These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)
• Use the terminology from here to know what term of art to put into a search engine so that you can find instructions on how to do that in whatever code you choose.
• Wikipedia page on relational databases
• http://en.wikipedia.org/wiki/Relational_database
• It will be useful for you to understand primary keys, foreign keys, and tables referencing each other.
Good resources to get started
Special thanks to:
Jason Cronk
Anna Annino