Web Archiving Challenges and Opportunities: Presentation for Web Archiving Engineering Position


Ahmed AlSum, PhD Candidate

Old Dominion University

Outline

• Engineer
– What I did

• Web Archive
– What I know
– What I did
– What I can do for SUL

CCSP Project

• It is an internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations.

• Technologies: The project depends on IBM technologies (WebSphere Portal, DB2) and is deployed on zLinux machines.

• Responsibilities:
– Software engineer.
– Administrator on production and staging.
– Customer support team lead.
– Software engineer team leader.

• Developing enterprise applications with J2EE platform technologies for the front end (Servlets, JSP, Portlet APIs), with back-end support based on EJB.
• Lotus Sametime development for both plugins and bots.
• Developing front-end components based on Web 2.0 technologies (AJAX with Dojo 1.0 and JavaScript).
• Developing and deploying portal solutions on WebSphere Portal.
• WebSphere Portal administration for standalone and clustered environments.
• Linux and Windows OS administration.
• DB2 server administration for single and multiple instances with HADR support.
• Leading customer support activities.
• Supporting project quality activities.
• Code review and static analysis.

• Certifications:
– IBM Certified System Administrator, IBM WebSphere Portal V6.0 (May 2008)
– IBM Certified Solution Developer, XML and Related Technologies (since Mar. 2008)
– IBM Certified Solution Developer, IBM WebSphere Portal V6.0 (since Feb. 2008)
– Sun Certified Web Component Developer for the Java 2 Platform, Enterprise Edition 1.4 (since Jan. 2008)
– Sun Certified Programmer for the Java 2 Platform, SE 5.0 (since Mar. 2007)
– IBM Rational Software Certified, RAD 6.0 Associate Developer (since Apr. 2006)
– Microsoft Certified Professional in Designing and Implementing Desktop Applications with Microsoft® Visual C++® 6.0 (since Sep. 2002)

Memento

• Memento is an extension to HTTP that lets the user browse the past web the same way as the current web.

I. Jacobs and N. Walsh. Architecture of the world wide web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/.

[Figure: timeline with points labeled Now, T1, T2, and T3]
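To make the idea concrete, here is a minimal sketch of the request side of Memento datetime negotiation. It assumes the `Accept-Datetime` request header as defined in the Memento specification (RFC 7089); the function name and the example date are illustrative only, and no TimeGate is actually contacted.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_request_headers(when: datetime) -> dict:
    """Build request headers asking for a resource as it was at `when`.

    `when` must be timezone-aware. The name of this helper is hypothetical.
    """
    # Accept-Datetime carries an HTTP-date, which is always expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Illustrative date only: ask for the state of a page on 13 June 2011.
headers = memento_request_headers(
    datetime(2011, 6, 13, 12, 0, tzinfo=timezone.utc)
)
print(headers["Accept-Datetime"])
```

A client would send these headers to a TimeGate for the original URI, which then redirects to the archived copy closest to the requested time.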

Memento

• Memento Aggregator
– Developer and administrator

Memento

• Memento Client
– MementoFox: Firefox add-on
– mcurl: command-line client in Perl

• Both of them have been implemented based on Memento internet draft 8.0.

WAT Extraction

• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls.

• Technologies: Hadoop, Pig Latin, Java.
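As a rough illustration of what structured crawl metadata looks like in practice, the sketch below reads outgoing links from one WAT-style JSON record. The nested field names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) are an assumption modelled on the commonly seen WAT layout and should be checked against the specification; the sample record is hypothetical. It is plain Python rather than the Hadoop and Pig Latin pipeline named above.

```python
import json

# Hypothetical WAT-style record; the field names are assumptions (see lead-in).
sample_record = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/about"},
                        {"path": "IMG@/src", "url": "http://example.org/logo.png"},
                    ]
                }
            }
        },
    }
})


def extract_links(wat_json: str) -> list:
    """Return the outgoing link URLs recorded for one captured page."""
    envelope = json.loads(wat_json).get("Envelope", {})
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    # Skip link entries that carry no URL instead of failing on them.
    return [link["url"] for link in html_meta.get("Links", []) if "url" in link]


links = extract_links(sample_record)
print(links)
```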

WEB ARCHIVING: Challenges and Opportunities

Web Archive Life Cycle

Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.

Selection

• Decide what to capture
• We studied what is already captured.

How Much of the Web is archived?

• Tell me what your URI source is!

Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
35%                   16%
88%                   19%

[One row per URI source; the row labels were not preserved.]

S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How much of the Web is Archived?” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada. 2011.
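The two columns above can be read as the same coverage calculation run with and without the search engine (SE) cache counted as an archive. A minimal sketch of that calculation follows. The sample holdings are hypothetical and are not the study's data; the function name and the archive codes used as keys are illustrative.

```python
def percent_archived(holdings: dict, exclude: frozenset = frozenset()) -> float:
    """Percentage of sampled URIs with at least one copy outside `exclude`.

    `holdings` maps each sampled URI to the set of archives holding a copy.
    """
    if not holdings:
        return 0.0
    archived = sum(1 for sources in holdings.values() if set(sources) - exclude)
    return 100.0 * archived / len(holdings)


# Hypothetical sample of four URIs (not taken from the paper).
sample = {
    "http://example.org/a": {"IA"},
    "http://example.org/b": {"SE"},         # only in a search engine cache
    "http://example.org/c": set(),          # not archived anywhere
    "http://example.org/d": {"SE", "LoC"},
}

including_se = percent_archived(sample)
excluding_se = percent_archived(sample, exclude=frozenset({"SE"}))
print(including_se, excluding_se)
```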

Where is it archived?

[Figures not preserved; the charts shared the following archive legend]

IA   Internet Archive
LoC  Library of Congress
IC   Icelandic Web Archive
CAN  Library and Archives Canada
BL   British Library
UK   UK National Library
PO   Portuguese Web Archive
CAT  Web Archive of Catalonia
CR   Croatian Web Archive
CZ   Archive of the Czech Web
TW   National Taiwan University
AIT  Archive-It

What is missing?

[Figure not preserved; it used the same archive legend as "Where is it archived?"]

Selection

• Curator
• Twitter crowdsourcing:
– UK Web Archive: Twittervane.
– Internet Memory: collects URIs from the Twitter APIs.
– VA Tech: CTRNET project.
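The common step in these crowdsourcing approaches is turning tweet text into candidate seed URIs. The sketch below shows one simple way to do that with a regular expression. It is not the method used by any of the projects listed; the tweets are made up and the function name is hypothetical.

```python
import re

# Deliberately simple pattern: anything from http(s):// up to whitespace.
URI_PATTERN = re.compile(r"https?://\S+")


def seed_uris(tweets: list) -> list:
    """Collect distinct URIs from tweet texts, in order of first appearance."""
    seen = set()
    seeds = []
    for text in tweets:
        for uri in URI_PATTERN.findall(text):
            # Drop sentence punctuation that often trails a pasted link.
            uri = uri.rstrip(".,;:!?)")
            if uri not in seen:
                seen.add(uri)
                seeds.append(uri)
    return seeds


# Made-up tweet texts for illustration.
tweets = [
    "Flood update http://example.org/a.",
    "RT http://example.org/a and https://example.com/b",
]
seeds = seed_uris(tweets)
print(seeds)
```

In practice, shortened links would also need to be resolved to their final targets before being handed to a crawler.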

Harvesting

• Services
– Archive-It
– WAS @ CDLib

• Dedicated server
– Heritrix

Harvesting

• Challenges
– Ajax and Web 2.0/3.0
– Streaming media
– URI challenges (e.g. Twitter hash-bang URIs)
– Mobile
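The hash-bang problem comes from how URIs work: everything after `#` is a fragment, which the client keeps to itself and never sends to the server. The sketch below uses Python's standard `urllib.parse` to show that two different hash-bang URIs reduce to the same request, so a crawler that does not execute JavaScript fetches the same page for both. The account names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def uri_sent_to_server(uri: str) -> str:
    """Return the URI without its fragment, i.e. what an HTTP request targets."""
    parts = urlsplit(uri)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


# Hypothetical account names in hash-bang style URIs.
first = "https://twitter.com/#!/example_user_one"
second = "https://twitter.com/#!/example_user_two"

print(urlsplit(first).fragment)
print(uri_sent_to_server(first))
print(uri_sent_to_server(first) == uri_sent_to_server(second))
```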

Harvesting

• SiteStory: a transactional web archive

Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013.

Storage

• Flat files:
– WARC files (ISO standard)

• NoSQL database:
– Internet Memory
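For a sense of what the flat-file option stores, here is a minimal sketch that serializes one WARC/1.0 `resource` record by hand: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It is illustrative only. The function name is hypothetical, and a production archive would rely on a crawler such as Heritrix or an established WARC library instead.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record (sketch, not a full writer)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Content-Length counts the bytes of the content block only.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"  # blank line separates the header from the content block
    # Every record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = build_warc_record("http://example.org/", b"<html></html>")
print(record.decode("utf-8"))
```

Because records are simply concatenated into large files, locating one capture later depends on a separate index, which is part of why the storage choice matters.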

Storage

• The wrong solution could be a disaster

Access
