Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format
Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler,
Umadevi Thanneeru
Portico & JSTOR: Committed to Preserving the Scholarly Record
JATS-CON 2010
ITHAKA
Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.
Digitization for Preservation & Access
Digital Preservation
“Dark Archive” “Light Archive”
Portico Archive
• Portico’s objective is to help libraries make a secure and reliable transition from print to a reliance on e-content.
• Maintains archiving agreements with publishers to collect and preserve content.
• Receives content directly from publishers.
• Preserves:
  – Current journals (born digital)
  – Back file journals (reborn digital)
  – E-books
  – Digitized historical collections
An “Insurance Policy” for e-Content
• Provide libraries with access to archived content when it becomes lost, orphaned or abandoned (regardless of a library’s past or current subscription):
  1. Publisher ceases operation
  2. Publisher discontinues title
  3. Publisher drops back file
• Provide libraries with post-cancellation access, if the publisher specifically names Portico.
• About 90% of titles in the Archive are covered by Portico post-cancellation access rights.
• Libraries are asked to pay an annual Archive support payment to defray the cost of preservation (in effect, the “insurance premium”).
Portico Archive as of July 19, 2010
Category                              Files         %
Images                           84,215,731    47.93%
Publisher Supplied Text          47,393,731    26.98%
Portico Created Archival Text    43,689,083    24.87%
Application Specific Files          232,732     0.13%
Multi-file Packages                 140,333     0.08%
Videos                               20,604     0.01%
Audio                                   570    <0.00%
Executable                                6    <0.00%
Total                           175,692,826      100%
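The percentage column can be re-derived from the file counts. The following is a minimal sketch, assuming each percentage is simply the category count divided by the total stated in the table; the names COUNTS, TOTAL and share are illustrative and do not come from the slides.

```python
# File counts copied from the table; TOTAL is the total as stated there.
COUNTS = {
    "Images": 84_215_731,
    "Publisher Supplied Text": 47_393_731,
    "Portico Created Archival Text": 43_689_083,
    "Application Specific Files": 232_732,
    "Multi-file Packages": 140_333,
    "Videos": 20_604,
    "Audio": 570,
    "Executable": 6,
}
TOTAL = 175_692_826


def share(category: str) -> float:
    """Return the category's share of the stated total, as a percentage."""
    return 100.0 * COUNTS[category] / TOTAL


if __name__ == "__main__":
    # Print one line per category, rounded to two decimals like the table.
    for name in COUNTS:
        print(f"{name:32s}{share(name):7.2f}%")
```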
• 114 publisher participants
• 11,788 committed journal titles
• 43,253 committed e-books
• 13 committed digitized collections
• >14 million articles ingested
• 688 library participants
  – (48% outside US)
• 4 Trigger events
• 15 Post-cancellation Access Claims
Portico Preservation Infrastructure
• Publisher supplies XML Source file (including the text, images) and PDF page rendition.
• Best approach for preserving the intellectual content of the article or book.
• Authenticate: verify that preserved content is what it purports to be.
• Verify format: ensure the file meets syntactic and semantic rules of format specification.
• Repair
• Normalize (XML)
• Create preservation metadata
• Assess archival robustness of file format.
• Migrate files to ensure future usability of content.
• Replicate objects and metadata to protect against bit rot and media deterioration
• Render articles to meet viewing requirements of delivery platform.
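Two of the steps above, authentication and format verification, can be illustrated in a few lines. This is a minimal sketch and not Portico's actual tooling: it assumes authentication is done by comparing a SHA-256 digest against one supplied with the file, and it checks only XML well-formedness, which is weaker than validating against a DTD. The function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the bytes as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(data: bytes, expected_digest: str) -> bool:
    """Assumed authentication step: the content matches the supplied digest."""
    return sha256_hex(data) == expected_digest.lower()


def is_well_formed(data: bytes) -> bool:
    """Syntactic check only: the bytes parse as XML. No DTD validation is done."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    sample = b"<article><front/></article>"
    print(is_authentic(sample, sha256_hex(sample)), is_well_formed(sample))
```

Semantic checks (for example, that a date element holds a plausible date) would need format-specific rules on top of this.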
Key Challenges for an Archival DTD
In December 2001, Inera’s “E-Journal Archive DTD Feasibility Study” highlighted these Key Challenges for an Archival DTD:
• Use of generated and boilerplate text, especially in
  – Label text for figure captions
  – Citation text
  – Author name and affiliation
  – Dates
• Expression of links between author and affiliation
• Reference elements
• Expression of non-article and other content
• Abbreviations and definitions
Key Challenges for an Archival DTD
• Keywords
• Sections, including handling of sections without headers
• Placement of floating objects, such as figures, tables, graphs
• Tables, including cell formatting issues (cells with figures, content alignment, etc.)
• Math
• Intra-, inter- and extra-article linking
• Publisher-specific elements
When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have
been the main focus of discussion.
Some Design Constraints
• IMPLIED, not REQUIRED attributes
• CDATA instead of controlled list
• Optional Elements, or relaxed order of elements
• Surprising location of Elements
• No Domain Specific Elements
Publisher/Domain Specific Elements
• Custom-Meta
  – Business Data
  – Allowed in journal-meta, article-meta, front-stub
  – Name/Value pair (may contain 38 different Elements)
• Named-Content
  – Semantic Significance
  – Allowed in 112 Elements
  – May contain 59 different Elements
Challenges posed by source DTDs
Extended Semantics for Named-Content
• Price in Citation
  – Becomes <named-content content-type="price">
<citation reference="1" id="R1" type="serial">
  <author order="1">
    <name><first>S. P.</first><last>Morgan</last></name>
  </author>
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <URI type="ISSN">0030-3941</URI>
    <price>$01.00</price>
    <volume>29</volume>
    <pages><first>1358</first><last>1368</last></pages>
    <pubdate>1958</pubdate>
  </journal>
  <title>General solution of the Luneburg lens problem</title>
</citation>
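The price mapping can be sketched as an in-place rename. This is a minimal Python illustration using the standard library, not Portico's conversion code; the function name to_named_content is hypothetical, the sample is abridged from the citation above, and the other source elements (journal, sertitle and so on) would need their own mappings, which are not shown.

```python
import xml.etree.ElementTree as ET

# Abridged from the source citation shown on the slide.
SOURCE = """<citation reference="1" id="R1" type="serial">
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <price>$01.00</price>
    <volume>29</volume>
  </journal>
</citation>"""


def to_named_content(root, source_tag, content_type):
    """Rename each <source_tag> under root to <named-content content-type=...>.

    The element is renamed in place, so its text and position are preserved.
    Returns the number of elements converted.
    """
    converted = 0
    for elem in list(root.iter(source_tag)):
        elem.tag = "named-content"
        elem.set("content-type", content_type)
        converted += 1
    return converted


if __name__ == "__main__":
    root = ET.fromstring(SOURCE)
    to_named_content(root, "price", "price")
    print(ET.tostring(root, encoding="unicode"))
```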
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Affiliation in Footnotes/P
  – Becomes <named-content content-type="aff" id="AFF2">
<FOOTNOTE ID="N101" TYPE="AFF"><P ALPHABET="LATIN" TYPE="INDENT">
<AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences, McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF>
</P></FOOTNOTE>
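Here the source identifier has to survive the rename, because other parts of the article may point at it. A minimal sketch, assuming the source ID attribute maps directly to a lowercase id and that all other source attributes are discarded; the function name is hypothetical and the address text is shortened from the example above.

```python
import xml.etree.ElementTree as ET

# Shortened from the source footnote shown on the slide.
SOURCE = (
    '<FOOTNOTE ID="N101" TYPE="AFF"><P>'
    '<AFF ID="AFF2"><IT>Corresponding author address:</IT> '
    'Nicholas M. J. Hall, McGill University</AFF>'
    '</P></FOOTNOTE>'
)


def aff_to_named_content(root):
    """Rename each <AFF> to <named-content content-type="aff">, keeping its ID as id."""
    for aff in list(root.iter("AFF")):
        source_id = aff.attrib.pop("ID", None)
        aff.attrib.clear()  # assumption: no other source attribute is carried over
        aff.tag = "named-content"
        aff.set("content-type", "aff")
        if source_id is not None:
            aff.set("id", source_id)
    return root


if __name__ == "__main__":
    print(ET.tostring(aff_to_named_content(ET.fromstring(SOURCE)), encoding="unicode"))
```

Because the element is renamed rather than rebuilt, the mixed content (the italic run followed by plain text) is left untouched.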
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Funding in Acknowledgments/P
  – Becomes <named-content content-type="funding">
<ack><sectitle>ACKNOWLEDGMENTS</sectitle><p>Q.W.’s research is partially supported by AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding> and NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>, No. <funding source="NSF"><contract>DMS-0605029</contract></funding>, and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>. P.Z. is partially supported by the special funds for major State Research Projects <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding> and National Science Foundation of China for Distinguished Young Scholars <funding source="NSFC"><contract>10225103</contract></funding>. H.Z.’s work is supported in part by the Naval Postgraduate School Research Initiation Program.</p></ack>
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Organization Division in Affiliation
  – Becomes <named-content content-type="division">
<Affiliation ID="Aff12">
  <OrgDivision>Optisches Institut</OrgDivision>
  <OrgName>Technische Universität Berlin</OrgName>
  <OrgAddress>
    <City>Berlin</City>
    <Country>Germany</Country>
  </OrgAddress>
</Affiliation>
Challenges posed by source DTDs
More Extended Semantics for Named-Content
• Generic Element (addinfo)
  – Becomes <named-content content-type="addinfo">
<ref-conf id="CIT0045"><ref-conf-text><author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>, <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>, <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>, <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>, <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>, <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text> <presentationtitle>Electronic taxonomy: assigning strains to bacterial species via the internet</presentationtitle>. <collectworktitle>BMC Biology</collectworktitle> <publicationfield-text><year>2009</year>; <year>7</year></publicationfield-text>: <firstpage>3</firstpage>. <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.</ref-conf-text> </ref-conf>
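Taken together, the mappings on these slides amount to a lookup table from source element name to content-type value. The sketch below is an illustration under that assumption, not the production conversion: MAPPING restates the five slide examples, the <doc> wrapper and the function names are hypothetical, and source attributes are simply dropped. The census function shows why a free-text (CDATA) content-type is worth auditing, since nothing in the DTD constrains its values.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Source element name -> content-type value, as given on the slides.
MAPPING = {
    "price": "price",
    "AFF": "aff",
    "funding": "funding",
    "OrgDivision": "division",
    "addinfo": "addinfo",
}

# Hypothetical wrapper around two fragments taken from the slide examples.
SAMPLE = """<doc>
  <Affiliation ID="Aff12"><OrgDivision>Optisches Institut</OrgDivision><OrgName>Technische Universität Berlin</OrgName></Affiliation>
  <ref><addinfo>doi:10.1186/1741-7007-7-3</addinfo></ref>
</doc>"""


def migrate(root):
    """Rename every mapped element to named-content, in place."""
    for elem in list(root.iter()):
        content_type = MAPPING.get(elem.tag)
        if content_type is None:
            continue
        elem.tag = "named-content"
        # Source attributes (for example a funding source) are dropped in this
        # sketch; a real migration would have to decide where they go.
        elem.attrib.clear()
        elem.set("content-type", content_type)
    return root


def content_type_census(root):
    """Count how often each content-type value occurs after migration."""
    return Counter(e.get("content-type") for e in root.iter("named-content"))


if __name__ == "__main__":
    print(content_type_census(migrate(ET.fromstring(SAMPLE))))
```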
Challenges posed by source DTDs
Target DTD Structural Constraints that force the use of Named-Content
• Table in Table
  – TD contains named-content, which contains a table
  <td><named-content content-type="table"><table-wrap>
• Figure in Table
  – TD contains named-content, which contains a fig
  <td><named-content content-type="figure"><fig>
• Display-Formula in Title
  – Title contains named-content, which contains a disp-formula
  <title><named-content content-type="display-formula"><disp-formula>
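Unlike the renames on the earlier slides, these cases wrap an existing target element in a new parent. A minimal sketch of that wrapping step, assuming the nested object is a direct child of the cell; the function name and the sample cell are hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_in_named_content(parent, child_tag, content_type):
    """Wrap each direct <child_tag> child of parent in <named-content>.

    The wrapper takes over the child's position and trailing text, so the
    surrounding mixed content reads the same as before. Returns the number
    of children wrapped.
    """
    wrapped = 0
    for index, child in enumerate(list(parent)):
        if child.tag != child_tag:
            continue
        wrapper = ET.Element("named-content", {"content-type": content_type})
        wrapper.tail = child.tail
        child.tail = None
        parent.remove(child)
        parent.insert(index, wrapper)
        wrapper.append(child)
        wrapped += 1
    return wrapped


if __name__ == "__main__":
    # Hypothetical table cell holding a nested table.
    cell = ET.fromstring(
        '<td>See <table-wrap id="T2"><table/></table-wrap> for details</td>'
    )
    wrap_in_named_content(cell, "table-wrap", "table")
    print(ET.tostring(cell, encoding="unicode"))
```

The same function covers the figure case by passing "fig" and "figure".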
Challenges posed by source DTDs
• Question/Answer
  – Generic and Structural
  – Is saying <list list-content="question"> enough?
<Question-Answer>
  <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
  <A><P><L>A</L>. reduction in surgical time</P></A>
  <A><P><L>B</L>. preservation of conjunctiva</P></A>
  <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
  <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
</Question-Answer>
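To make the question concrete, here is one possible list-based mapping. It is a sketch of a hypothetical conversion, not the one Portico uses: the question becomes a list-item whose nested list holds the answers, the "answer" value for list-content is invented for illustration, and the sample is abridged to two answers. Note that the result no longer says, in element names, which item is the question and which are the choices; only the attribute values do.

```python
import xml.etree.ElementTree as ET

# Abridged from the source example on the slide.
SOURCE = """<Question-Answer>
<Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
<A><P><L>A</L>. reduction in surgical time</P></A>
<A><P><L>B</L>. preservation of conjunctiva</P></A>
</Question-Answer>"""


def _item(source_elem):
    """Build a <list-item> from a source <Q> or <A>: <L> becomes <label>, the rest <p>."""
    marker = source_elem.find("P/L")
    item = ET.Element("list-item")
    ET.SubElement(item, "label").text = marker.text
    # The source separates label and text with ". "; strip that separator.
    ET.SubElement(item, "p").text = (marker.tail or "").lstrip(". ")
    return item


def qa_to_list(qa):
    """Convert one <Question-Answer> into a nested <list> structure."""
    outer = ET.Element("list", {"list-content": "question"})
    question = _item(qa.find("Q"))
    answers = ET.SubElement(question, "list", {"list-content": "answer"})
    for answer in qa.findall("A"):
        answers.append(_item(answer))
    outer.append(question)
    return outer


if __name__ == "__main__":
    print(ET.tostring(qa_to_list(ET.fromstring(SOURCE)), encoding="unicode"))
```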
Challenges posed by source DTDs
• Synonymy
  – Domain and Semantic
  – Is saying <list list-content="synonymy"> enough?
  – Or <named-content content-type="synonymy"> because of the semantic meaning?
<SYNONYMY>
<HEAD>ECHINOSTELIALES</HEAD>
<ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
<ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
<ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
<ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
</SYNONYMY>
Synonyms are different scientific names that pertain to the same taxon
Challenges posed by source DTDs
• Decision Tree (Taxonomic Key)
  – Domain, Semantic, Structural, and Presentation
<KEY>
  <COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR> <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR> <RESP>2</RESP></COUPLET>
  <COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR> <RESP>3</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR> <RESP>4</RESP></COUPLET>
  <COUPLET><DESCR><NO>3.</NO>Mandibles bidentate</DESCR> <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR> <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP></COUPLET>
  <COUPLET><DESCR><NO>4.</NO>Mandibles bidentate</DESCR> <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP></COUPLET>
  <COUPLET><DESCR><NO></NO>--Mandibles unidentate</DESCR> <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP></COUPLET>
</KEY>
A decision tree is a tree-like model of decisions and their possible outcomes.
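The structure that a generic container would flatten can be made explicit by reading the key into a small data structure. This is a minimal sketch under assumptions drawn from the sample only: a couplet with an empty <NO> is taken to belong to the most recent numbered step, and a purely numeric <RESP> is read as a jump to another step. The function name is hypothetical and the sample is abridged to the first three couplets.

```python
import xml.etree.ElementTree as ET

# First three couplets of the source key shown on the slide.
SOURCE = """<KEY>
<COUPLET><DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR> <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP></COUPLET>
<COUPLET><DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR> <RESP>2</RESP></COUPLET>
<COUPLET><DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR> <RESP>3</RESP></COUPLET>
</KEY>"""


def parse_key(key):
    """Return {step number: [(lead text, outcome), ...]} for a <KEY> element.

    An outcome is ("goto", n) when the response is a step number, otherwise
    ("taxon", text) with the response's full text content.
    """
    steps = {}
    current = None
    for couplet in key.findall("COUPLET"):
        number = couplet.find("DESCR/NO")
        label = (number.text or "").strip().rstrip(".")
        if label:
            current = int(label)
        lead = (number.tail or "").strip().lstrip("-")
        response = "".join(couplet.find("RESP").itertext()).strip()
        if response.isdigit():
            outcome = ("goto", int(response))
        else:
            outcome = ("taxon", response)
        steps.setdefault(current, []).append((lead, outcome))
    return steps


if __name__ == "__main__":
    for step, leads in parse_key(ET.fromstring(SOURCE)).items():
        print(step, leads)
```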
Concluding Question
How to support Publisher/Domain Specific constructs in the Archival DTD?
• Continue use of Named-Content
• New Miscellaneous Element
• Support for adding namespaced elements
• Other
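As an illustration of the namespaced-elements option only: if publisher-specific elements lived in their own namespace, a consumer could separate them from the core vocabulary mechanically. The namespace URI, the pub prefix and the sample citation below are all hypothetical, and nothing here reflects a decision by the Working Group.

```python
import xml.etree.ElementTree as ET

PUB_NS = "http://example.org/publisher-extension"  # hypothetical namespace URI

# Hypothetical citation in which the price is a namespaced extension element.
SAMPLE = f"""<mixed-citation xmlns:pub="{PUB_NS}">
  <source>J. Appl. Phys.</source>
  <pub:price>$01.00</pub:price>
  <volume>29</volume>
</mixed-citation>"""


def split_by_namespace(root):
    """Return (core element names, extension element names) in document order."""
    core, extension = [], []
    prefix = "{" + PUB_NS + "}"
    for elem in root.iter():
        if elem.tag.startswith(prefix):
            extension.append(elem.tag[len(prefix):])
        else:
            core.append(elem.tag)
    return core, extension


if __name__ == "__main__":
    print(split_by_namespace(ET.fromstring(SAMPLE)))
```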
Questions/Answers?
Thank you
John Meyer
Director of Data Technologies
100 Campus Drive, Suite 100
Princeton, NJ 08540
609 986-2220