NATIONAL LIBRARY OF MEDICINE PubMed Central Martha Fishel National Library of Medicine CENDI Meeting September 15, 2004

NATIONAL LIBRARY OF MEDICINE

PubMed Central

Martha Fishel

National Library of Medicine

CENDI Meeting

September 15, 2004

What is PubMed Central?

• Digital archive of life sciences journals
  • includes health policy, bioinformatics and other fields

• Participation is open to journals:
  • covered by a major abstracting/indexing service
  • or, that have 3 editorial board members with current grants from major non-profit funding agencies

• Free access to full-text articles and supporting data

• Integrated with PubMed and other bibliographic and factual databases in NCBI’s Entrez network

PMC Basic Policy

• Journal deposits an authoritative electronic copy that meets PMC data quality standards
  • full-text XML
  • original high-resolution graphics
  • PDF
  • supplementary data

• Journal may delay free access (up to 2 years)
  • research articles usually free in one year or less

• Copyright is retained by publisher or author

• Deposits – and free access permissions – are permanent
  • journal may stop depositing new material but may not withdraw material already deposited
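
The delayed-access rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not PMC's actual implementation: the function name `free_access_date`, the month-based delay parameter, and the 24-month cap (one reading of "up to 2 years") are all hypothetical.

```python
import calendar
from datetime import date

# Assumption: "up to 2 years" is modeled as a cap of 24 calendar months.
MAX_DELAY_MONTHS = 24


def free_access_date(published: date, delay_months: int) -> date:
    """Return the date an article becomes free, given the journal's chosen delay."""
    if not 0 <= delay_months <= MAX_DELAY_MONTHS:
        raise ValueError("delay must be between 0 and 24 months")
    # Shift the month, carrying overflow into the year.
    month_index = published.month - 1 + delay_months
    year = published.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp the day for shorter months (e.g. Feb 29 -> Feb 28).
    day = min(published.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


if __name__ == "__main__":
    # A research article published on the day of this talk, with a one-year delay.
    print(free_access_date(date(2004, 9, 15), 12))
```

Because deposits are permanent, a real system would presumably store only the release date and never a removal date; that detail is outside this sketch.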

Back Issue Digitization

• Objective: Create a complete digital archive of PMC journals back to volume 1

• Cover-to-cover digital copy of everything up to where journal began producing electronic copy
  • includes articles, covers, TOCs, advertisements and administrative matter

• Publisher gets free, unencumbered copy

Back Issue Digitization

• First set of scanned journals covered 62 titles (and title variations), approximately 2.5 million pages

• As of September 2004, 193,436 scanned articles are included in PMC

• Starting September 2004, a new cooperative agreement with the Wellcome Trust and JISC in the UK will cover an additional 1.7 million pages or more

Titles Scanned Back to Volume 1

• Antimicrobial Agents and Chemotherapy, v. 1, 1972
• BMLA, v. 1, 1911
• Clinical Microbiology Reviews, v. 1, 1988
• J of Bacteriology, v. 1, 1916
• J Clinical Investigation, v. 1, 1924
• J Clinical Microbiology, v. 1, 1975
• J Virology, v. 1, 1967
• Molecular and Cellular Biology, v. 1, 1981
• Nucleic Acids Research, v. 1, 1974
• Texas Heart Institute Journal, v. 1, 1974

Scanning Specifications

• 1-bit B&W 600 dpi G4 TIFFs

• 8-bit 300 dpi grayscale TIFF

• 24-bit 300 dpi color TIFF for illustrations

• Unedited prime OCR (5-pass engine)

• PDF with hidden text (searchable OCR “hidden behind” page images)
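
To give a feel for what these specifications mean in storage terms, the sketch below estimates the uncompressed size of one page image at each bit depth and resolution. The 8.5 x 11 inch page size is an assumption (journal trim sizes vary), and the names `ScanSpec` and `uncompressed_bytes` are hypothetical. G4 (CCITT Group 4) compression would make the stored 1-bit files considerably smaller than the raw figure computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    name: str
    bits_per_pixel: int
    dpi: int


# Bit depths and resolutions taken from the slide.
SPECS = [
    ScanSpec("1-bit B&W", 1, 600),
    ScanSpec("8-bit grayscale", 8, 300),
    ScanSpec("24-bit color", 24, 300),
]


def uncompressed_bytes(spec: ScanSpec, width_in: float = 8.5, height_in: float = 11.0) -> int:
    """Raw pixel data size for one page; the page dimensions are an assumption."""
    width_px = round(width_in * spec.dpi)
    height_px = round(height_in * spec.dpi)
    # Each row is padded up to a whole number of bytes.
    bytes_per_row = (width_px * spec.bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    for spec in SPECS:
        size_mb = uncompressed_bytes(spec) / 1_000_000
        print(f"{spec.name:16s} {spec.dpi} dpi: {size_mb:.1f} MB uncompressed")
```

Under these assumptions a single color page is roughly six times the size of the 1-bit page before compression, which is one plausible reason to reserve color scanning for illustrations.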

Digitized Samples

Back Issue Scanning Tasks

• Secure permission to digitize

• Acquire disposable content
  • Donor sources include publishers, associations, individuals

• Create issue-level inventory

• Prepare content for digitization – create journal style sheets

• Pack and ship materials

QA Tasks

• Receive deliverables (DVDs) at NLM (NCBI)

• Run automated QA programs

• Mark random issues from each title for manual QA

• Perform QA (NLM contractor compares digitized image to original volumes pulled from NLM shelves)

• Accept or Reject a batch based on rigid criteria
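
The step of marking random issues from each title for manual QA could look something like the sketch below. It is illustrative only: the function name, the `per_title` sample size, and the issue identifiers are hypothetical, since the slides do not say how many issues are drawn or how they are labeled.

```python
import random


def pick_issues_for_manual_qa(issues_by_title, per_title=2, seed=None):
    """Randomly select up to `per_title` issues from each title in a batch.

    `issues_by_title` maps a journal title to a list of issue identifiers.
    The sample size of 2 is an assumption, not a figure from the slides.
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible for audit
    selected = {}
    for title, issues in issues_by_title.items():
        k = min(per_title, len(issues))
        selected[title] = sorted(rng.sample(issues, k))
    return selected


if __name__ == "__main__":
    # Hypothetical batch contents.
    batch = {
        "J Virology": ["v1n1", "v1n2", "v1n3", "v1n4"],
        "Nucleic Acids Research": ["v1n1", "v1n2"],
    }
    for title, issues in pick_issues_for_manual_qa(batch, seed=2004).items():
        print(title, issues)
```

Sampling per title, rather than across the whole batch, guarantees that every title gets at least some manual inspection.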

QA Criteria

• XML character and tag accuracy – 99.95%

• Inventory Error – 100%

• Image quality – 100%
  • Distortion
  • Color
  • Visible pixelation

• OCR quality – 100% for completeness and zoning (unedited)

• PDF sequence and source accuracy – 100%

Sample Batch Status Report

Check                             # of samples   # of samples failed   % failed   Status
XML
  Tag content accuracy (99.95%)   5135 tags      12 tags               0.23%      Failed
PDF
  Visible skew (99%)              675 pages      1 page                0.15%      OK
OCR file
Full Page Image
Illustration Image
Other

Recommended action: Reject
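
The arithmetic behind the two filled-in rows of this report can be reproduced with a short sketch. The class and function names are hypothetical, and the rule that a single failed check leads to a "Reject" recommendation is an assumption that happens to match this sample report; the slides do not spell out the exact decision logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    required_accuracy: float  # percent, e.g. 99.95
    samples: int
    failed: int

    @property
    def percent_failed(self) -> float:
        return 100.0 * self.failed / self.samples

    @property
    def status(self) -> str:
        accuracy = 100.0 * (self.samples - self.failed) / self.samples
        return "OK" if accuracy >= self.required_accuracy else "Failed"


def recommended_action(checks) -> str:
    # Assumption: any failed check rejects the whole batch.
    return "Reject" if any(c.status == "Failed" for c in checks) else "Accept"


if __name__ == "__main__":
    # Figures taken from the sample report above.
    checks = [
        Check("Tag content accuracy", 99.95, samples=5135, failed=12),
        Check("Visible skew", 99.0, samples=675, failed=1),
    ]
    for c in checks:
        print(f"{c.name}: {c.percent_failed:.2f}% failed, {c.status}")
    print("Recommended action:", recommended_action(checks))
```

The 12 failed tags give an accuracy of about 99.77%, below the 99.95% requirement, while one skewed page in 675 stays well above the 99% requirement.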

Final Data Preparation

• Update Inventory database indicating issues returned

• Format TIFFs and PDFs, organize journal parts

• Load to Preview Site for publisher review

• Load to Live PMC site

• Retain indefinitely!

Progress to Date

25,000: Issues received

1.8 million: pages scanned

156,000: XML Citations created

Challenges To Date

• Locating old, rare copies in good condition

• Scanning and delivering fill-in pages at NLM

• Feeding the pipeline

• Maintaining even workflow at NLM

• Quality Assurance (understanding requirements)

Find Out More

PubMed Central home: http://www.pubmedcentral.gov/

NLM Journal XML DTDs: http://dtd.nlm.nih.gov/