Web Archiving Challenges and Opportunities Presentation for Web archiving Engineering position Ahmed AlSum PhD Candidate Old Dominion University

Page 1: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Web Archiving Challenges and Opportunities

Presentation for Web archiving Engineering position

Ahmed AlSum, PhD Candidate

Old Dominion University

Page 2: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Outline

• Engineer
  – What I did

• Web Archive
  – What I know
  – What I did
  – What I can do for SUL

Page 3: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

CCSP Project

• It is an internal IBM support portal that provides client-facing audiences a by-client, holistic view of client situations.

• Technologies: The project depends on IBM technologies (WebSphere Portal and DB2) and is deployed on zLinux machines.

Page 4: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

• Responsibilities:
  – Software engineer
  – Administrator on production and staging
  – Customer support team lead
  – Software engineer team leader

Page 5: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

• Developing enterprise applications with J2EE platform technologies for the front end (Servlets, JSP, Portlet APIs), and supporting back-end tasks based on EJB.
• Lotus Sametime development for both plug-ins and bots.
• Developing front-end components based on Web 2.0 technologies (AJAX based on Dojo 1.0, and JavaScript).
• Developing and deploying portal solutions on WebSphere Portal.
• WebSphere Portal administration for standalone and clustered environments.
• Administration of Linux and Windows operating systems.
• DB2 server administration for single and multiple instances with HADR support.
• Leading the customer support activities.
• Supporting some project quality activities.
• Code review and static analysis activities.

Page 6: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

• Certifications:
  – IBM Certified System Administrator, IBM WebSphere Portal V6.0 (May 2008)
  – IBM Certified Solution Developer, XML and Related Technologies (since Mar. 2008)
  – IBM Certified Solution Developer, IBM WebSphere Portal V6.0 (since Feb. 2008)
  – Sun Certified Web Component Developer for the Java 2 Platform, Enterprise Edition 1.4 (since Jan. 2008)
  – Sun Certified Programmer for the Java 2 Platform, SE 5.0 (since Mar. 2007)
  – IBM Rational Software Certified, RAD 6.0 Associate Developer (since Apr. 2006)
  – Microsoft Certified Professional in Designing and Implementing Desktop Applications with Microsoft® Visual C++® 6.0 (since Sep. 2002)

Page 7: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Memento

• Memento is an extension to HTTP that lets the user browse the past web the same way as the current web.

I. Jacobs and N. Walsh. Architecture of the world wide web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/.

[Figure: timeline marking T1, T2, T3, and Now]
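As a rough illustration of how "browsing the past web" maps onto HTTP, the sketch below builds the `Accept-Datetime` request header that a Memento client sends to ask for a past version of a resource. The header name comes from the Memento specification; the function name `accept_datetime_headers` is made up for this example, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Return request headers asking for the resource as it was at `when`.

    `when` must be a timezone-aware datetime. It is converted to GMT
    because Accept-Datetime carries an HTTP-date expressed in GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return {"Accept-Datetime": http_date}


# Ask for the state of a page at the start of 2010.
headers = accept_datetime_headers(datetime(2010, 1, 1, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

A client would attach these headers to an ordinary GET against a TimeGate, which then redirects to the archived copy closest to the requested datetime.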

Page 8: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Memento

• Memento Aggregator
  – Developer and administrator

Page 9: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Memento

• Memento Client
  – MementoFox: Firefox add-on
  – mcurl: command-line client in Perl

• Both were implemented against Memento Internet Draft 8.0.
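One task any Memento client has to handle is reading the `Link` response header, where the server lists the original resource, the TimeMap, and individual mementos. The following is a minimal Python sketch of that step, not the code of MementoFox or mcurl (mcurl is written in Perl). The function names and the example.org / example.net URIs are placeholders, and the parser only handles quoted parameter values.

```python
import re

# One link-value: <uri> followed by zero or more ; name="value" parameters.
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_link_header(value):
    """Split a Link header into dicts with 'uri', 'rel' (a list), and other params."""
    links = []
    for uri, params in LINK_RE.findall(value):
        attrs = dict(PARAM_RE.findall(params))
        attrs["uri"] = uri
        # rel may hold several space-separated relation types, e.g. "first memento".
        attrs["rel"] = attrs.get("rel", "").split()
        links.append(attrs)
    return links


def memento_uris(links):
    """Return the URIs of all links whose relation types include 'memento'."""
    return [link["uri"] for link in links if "memento" in link["rel"]]


header = (
    '<http://a.example.org/>; rel="original", '
    '<http://arxiv.example.net/timemap/http://a.example.org/>; rel="timemap", '
    '<http://arxiv.example.net/web/20010911203610/http://a.example.org/>; '
    'rel="first memento"; datetime="Tue, 11 Sep 2001 20:36:10 GMT"'
)
print(memento_uris(parse_link_header(header)))
```

A regular expression is used instead of splitting on commas because the `datetime` parameter itself contains a comma.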

Page 10: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

WAT Extraction

• Web Archive Transformation (WAT) is a specification for structuring metadata generated by Web crawls.

• Technologies: Hadoop, Pig Latin, Java.
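To give a feel for what WAT metadata looks like, here is a small single-machine Python sketch that pulls the target URI and outgoing links from one WAT JSON payload. It is not the Hadoop/Pig pipeline mentioned above. The field names (`Envelope`, `WARC-Header-Metadata`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) reflect the WAT layout as assumed here and should be checked against the specification; the sample record is invented.

```python
import json


def wat_links(wat_json):
    """Return (target_uri, [link urls]) from one WAT record's JSON payload.

    Missing sections are tolerated: absent keys yield None or an empty list.
    """
    envelope = json.loads(wat_json).get("Envelope", {})
    target = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links = [link["url"] for link in html_meta.get("Links", []) if "url" in link]
    return target, links


# Invented sample payload following the assumed layout.
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://example.com/about"},
                        {"path": "IMG@/src", "url": "/logo.png"},
                    ]
                }
            }
        },
    }
})
print(wat_links(sample))
```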

Page 11: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

WEB ARCHIVING: Challenges and Opportunities

Page 12: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Web Archive Life Cycle

Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.

Page 13: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Selection

• Decide what to capture
• We studied what is already captured.

Page 14: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

How Much of the Web is archived?

• The answer depends on your URI source.

Including SE cache    Excluding SE cache
90%                   79%
97%                   68%
35%                   16%
88%                   19%

S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. “How much of the Web is Archived?” In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, Ottawa, Canada. 2011.

Page 15: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Where is it archived?

IA   Internet Archive
LoC  Library of Congress
IC   Icelandic Web Archive
CAN  Library and Archives Canada
BL   British Library
UK   UK National Library
PO   Portuguese Web Archive
CAT  Web Archive of Catalonia
CR   Croatian Web Archive
CZ   Archive of the Czech Web
TW   National Taiwan University
AIT  Archive-It

Page 16: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Page 17: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Page 18: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

What is missing?

Page 19: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Page 20: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Selection

• Curator
• Twitter crowdsourcing:
  – UK Web Archive: Twittervane.
  – Internet Memory: collects URIs from the Twitter APIs.
  – VA Tech: CTRnet project.
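The idea behind Twitter-based selection can be sketched as counting which URIs people share and proposing the frequently shared ones as crawl seeds. The Python below is only an illustration of that idea, not the method of any of the projects listed. It works on plain tweet text rather than calling the Twitter APIs, and `candidate_seeds` and `min_mentions` are names invented for this example.

```python
import re
from collections import Counter

URI_RE = re.compile(r"https?://[^\s<>\"']+")


def candidate_seeds(tweets, min_mentions=2):
    """Return URIs shared in at least `min_mentions` tweets, most shared first."""
    counts = Counter()
    for text in tweets:
        # Count each URI once per tweet and drop trailing sentence punctuation.
        uris = {match.rstrip(".,;:!?)") for match in URI_RE.findall(text)}
        counts.update(uris)
    return [uri for uri, n in counts.most_common() if n >= min_mentions]


tweets = [
    "Flooding update: http://example.org/flood.",
    "RT http://example.org/flood and http://example.com/other",
    "see http://example.org/flood http://example.org/flood",
]
print(candidate_seeds(tweets))
```

A curator would still review the resulting list; the count only ranks candidates.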

Page 21: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Web Archive Life Cycle

Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.

Page 22: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Harvesting

• Services
  – Archive-It
  – WAS @ CDLib

• Dedicated server
  – Heritrix

Page 23: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Harvesting

• Challenges
  – Ajax and Web 2.0/3.0
  – Streaming media
  – URI challenges (e.g., Twitter hash-bang URIs)
  – Mobile
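The hash-bang problem comes from the fact that everything after `#` is a fragment, which the browser never sends to the server, so a crawler fetching such a URI only receives the page shell. One workaround from that era was Google's AJAX crawling scheme (since deprecated), which maps `#!state` to a `_escaped_fragment_` query parameter. The Python sketch below shows that mapping; it is brought in here for illustration and is not something the slides prescribe, and the function name is invented.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Printable ASCII left unescaped; space, #, %, & and + must be percent-encoded.
_SAFE = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "#%&+")


def hashbang_to_crawlable(uri):
    """Rewrite http://host/path#!state to http://host/path?_escaped_fragment_=state."""
    parts = urlsplit(uri)
    if not parts.fragment.startswith("!"):
        return uri  # not a hash-bang URI; leave it untouched
    param = "_escaped_fragment_=" + quote(parts.fragment[1:], safe=_SAFE)
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(hashbang_to_crawlable("https://twitter.com/#!/example_user"))
```

The rewritten URI only helps when the site supports the scheme; otherwise the crawler needs to execute the page's JavaScript.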

Page 24: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Harvesting

• SiteStory: a transactional web archive

Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Proceedings of TPDL 2013.
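A transactional archive records responses at the moment the server sends them, instead of crawling the site afterwards. The Python WSGI middleware below is a conceptual sketch of that idea only; it is not how SiteStory is implemented, and the class name `TransactionRecorder` and the in-memory `store` list are assumptions made for the example.

```python
from datetime import datetime, timezone


class TransactionRecorder:
    """WSGI middleware that keeps a copy of every response it passes through."""

    def __init__(self, app, store):
        self.app = app
        self.store = store  # list-like sink; a real archive would persist records

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember status and headers, then hand them on unchanged.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        chunks = self.app(environ, recording_start_response)
        try:
            body = b"".join(chunks)
        finally:
            if hasattr(chunks, "close"):
                chunks.close()

        self.store.append({
            "uri": environ.get("PATH_INFO", ""),
            "served_at": datetime.now(timezone.utc),
            "status": captured.get("status"),
            "headers": captured.get("headers"),
            "body": body,
        })
        return [body]


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello ", b"world"]
```

Buffering the whole body keeps the sketch short, at the cost of disabling streaming responses.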

Page 25: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Web Archive Life Cycle

Hockx-Yu, H., 2011. The Past Issue of the Web. In Proceedings of 3rd International Conference on Web Science. pp. 1–8.

Page 26: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Storage

• Flat files:
  – WARC files (ISO standard)

• NoSQL database:
  – Internet Memory
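For the flat-file option, a WARC file is a sequence of records, each made of a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The Python sketch below serializes one `resource` record under those rules. It is a minimal illustration, not a replacement for a WARC library; the function name is invented, and production files are usually gzip-compressed per record and carry further fields such as digests.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, payload, content_type, when=None):
    """Serialize `payload` (bytes) as a single WARC/1.0 'resource' record."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record ends with two CRLFs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


record = warc_resource_record(
    "http://example.com/",
    b"<html></html>",
    "text/html",
    datetime(2013, 1, 1, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

Because records are self-delimiting through `Content-Length`, they can be appended to one file and indexed by byte offset.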

Page 27: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Storage

• The wrong solution could be a disaster

Page 28: Web Archiving  Challenges  and Opportunities Presentation for Web archiving Engineering position

Access