XML and KM Powering Information and Retrieval

for the Semantic Web

Frank Cervone

Assistant University Librarian for Information Technology, Northwestern University

f-cervone@northwestern.edu

Darlene Fichter

Data Library Coordinator, University of Saskatchewan Library

darlene.fichter@usask.ca

Introductions

• Who are you?

• Where do you work?

• What is your experience with KM?

• What is your interest in XML?

Outline

• Semantic Web and KM

• What is XML?

• SGML & HTML - where do they fit?

• XML - Structure and Elements

• XML Applications

– Integration of disparate content

• News

– Expertise profiling

– Enterprise solutions

Semantic Web

“The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”

Tim Berners-Lee and others

One Goal

Support elaborate, precise searches by integrating and utilizing all relevant sources of information and relationships.

Illustration from Scientific American May 1, 2001

Is XML a magical fix?

• Not likely.

• It does not magically integrate redundant data versions

• We’re unlikely to replace existing systems with a single, common, shared version of integrated data just for this reason

• But, if used correctly, XML can help

Harness the Power of Semantics

• If we wish to harness this power, then we need to:

– Understand and resolve the different words and meanings we use to refer to the same things

– Consider ways and means of defining standard terminology and establishing agreed-upon meaning, usually through standard metadata

– Be able to use XML messaging and transformations between applications

Pieces

XML – Codification of Knowledge

Knowledge Representation

In order for the “idea” to become a reality, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.
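To make "structured collections of information and sets of inference rules" concrete, here is a minimal sketch in Python. It is not taken from the talk: the facts, the predicate names (`member_of`, `part_of`) and the single rule are all hypothetical, chosen only to show how a program can derive statements that nobody entered by hand.

```python
# Facts are (subject, predicate, object) triples. All names are hypothetical.
FACTS = {
    ("alice", "member_of", "systems_team"),
    ("systems_team", "part_of", "library"),
    ("library", "part_of", "university"),
}


def membership_rule(known):
    """Rule: if X member_of Y and Y part_of Z, then X member_of Z."""
    inferred = set()
    for who, pred, group in known:
        if pred != "member_of":
            continue
        for child, pred2, parent in known:
            if pred2 == "part_of" and child == group:
                inferred.add((who, "member_of", parent))
    return inferred


def forward_chain(facts, rule):
    """Apply the rule repeatedly until it produces no new facts."""
    known = set(facts)
    while True:
        new = rule(known) - known
        if not new:
            return known
        known |= new


closure = forward_chain(FACTS, membership_rule)
```

After the loop settles, `closure` also contains the derived statements that alice is a member of the library and of the university. Real Semantic Web tooling expresses facts and rules in shared vocabularies rather than ad hoc tuples, but the reasoning step has the same shape.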

Why talk about the semantic web?

• Many of the “information intensive” processes of KM are facing the same challenge

– Capture – formalize existing knowledge

– Select and assess relevance, value ..

– Store – in repository with schema

– Share – distribute based on interest and work

– Apply – retrieve, use in daily work

– Create new knowledge

Beckman, T. Eight stage process of KM

XML & KM – What’s the connection?

• Many KM activities have nothing to do with technology

• For some KM activities, technology is a key enabler or component

– in these cases XML is often under the hood

– Knowing about XML means we can exploit the opportunities and see the limitations

XML Overview

• Structured data interchange

– A common syntax for expressing structure in data

• Designed to account for “unstructured” data

– documents

• Inherently conveys meaning/structure

• Content and structure separate from display

• Delivered via standard text files

XML in 7 bullets

• New, but not that new

• Structured data in a text file via markup

• Self-describing information

• Looks like HTML but isn't

• Verbose text, isn't meant to be read

• License-free, platform-independent and well-supported

• A family of technologies

(parts adapted from Bert Bos, http://www.si.uniovi.es/mirror/www.w3.org/XML/1999/XML-in-10-points)

Driving Forces for XML Adoption

• Internationalized media-independent electronic publishing

• Definition of platform-independent protocols for the exchange of data

– electronic commerce

– knowledge harvesting

• Information delivery to user agents

– automatic processing after receipt

Benefits of Adoption

• Easier to develop software

– handle specialized information distributed over the Web

• Processing information using lighter-weight software

• Allows greater end-user control of information display

– style sheets

• Metadata for resource discovery

The *ML family

• SGML

• HTML

• XML

From World Wide Web Consortium note W3C Data Formats, by Tim Berners-Lee.

SGML

• Designed for documents

• Very powerful

• Very complicated

• “Well defined” = strict rules

• Rigid - not very extensible

• Inappropriate for wide-spread use

HTML

• Simple, general-purpose document markup language

• Simple hyperlinking

• Designed for collaborative authoring

• Combined authoring and viewing roles

HTML Evolution

• Started with simple document description

– Few tags designed for structuring documents

• Quickly evolved

– forms

– images

– tables

– frames

– fonts

HTML shortcomings

• Not easily extensible

– HTML standards change too slowly

– Browser-specific tags ("extensions")

– Totally geared toward document display

• Limited data formatting

– mathematics

• Can't mark up data in any structurally meaningful way

Why can’t HTML be used for information exchange?

• HTML markup provides no inherent method of knowing what the information is about

• Browser paradigm is too constraining

• Metadata schemes are deficient

– Search engines return far too many hits

• Can't relate information items (pages) to one another

• One-way linking is somewhat limited

How HTML confuses content and presentation

• <h1>…<h6>

• <br>

• <p></p>

• <center>

• <table>

Example - content and presentation mixture in HTML

<HTML>

<BODY BGCOLOR=#FFFFFF>

<H1>005.72 M849et2001</H1>

<I>Enterprise application integration with XML and Java

</I>

<BR>

Upper Saddle River, NJ : Prentice Hall PTR, 2001

</BODY>

</HTML>

But what does it mean?

XML represents structure, not presentation

<marc>
  <field tag="245" indicator_1="1" indicator_2="0">
    <subfield code="a">Enterprise application integration with XML and Java</subfield>
    <subfield code="c">J.P. Morgenthal, with Bill la Forge</subfield>
  </field>
  <field tag="260">
    <subfield code="a">Upper Saddle River, NJ</subfield>
    <subfield code="b">Prentice Hall PTR</subfield>
    <subfield code="c">2001</subfield>
  </field>
</marc>
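Because the markup names what each piece of data is, a program can ask for "the title" or "the publisher" directly instead of guessing from fonts and line breaks. The sketch below uses Python's standard `xml.etree.ElementTree` on a well-formed variant of the record. The attribute names `tag` and `code` and the helper `subfield()` are assumptions made for this illustration, not part of the original slide.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the record; attribute names are assumed.
RECORD = """<marc>
  <field tag="245" indicator_1="1" indicator_2="0">
    <subfield code="a">Enterprise application integration with XML and Java</subfield>
  </field>
  <field tag="260">
    <subfield code="a">Upper Saddle River, NJ</subfield>
    <subfield code="b">Prentice Hall PTR</subfield>
    <subfield code="c">2001</subfield>
  </field>
</marc>"""

root = ET.fromstring(RECORD)


def subfield(record, tag, code):
    """Return the text of one subfield, or None if it is absent."""
    path = f"./field[@tag='{tag}']/subfield[@code='{code}']"
    node = record.find(path)
    return node.text if node is not None else None


title = subfield(root, "245", "a")
publisher = subfield(root, "260", "b")
year = subfield(root, "260", "c")
```

The same lookup against the HTML version would have to rely on the position of `<I>` and `<BR>` tags, which carry no meaning about the data.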

XML is hierarchical

MARC
  245 title
    a: Enterprise Application Integration with XML and Java
    c: J.P. Morgenthal, with Bill la Forge
  260 publisher
    a: Upper Saddle River, NJ
    b: Prentice Hall PTR
    c: 2001

Nesting

<bigdoll>
  <mediumdoll>
    <littledoll>
      rosette theme <littlestdoll/>
    </littledoll>
  </mediumdoll>
</bigdoll>

Elements, Attributes, and Content

<field tag="245" indicator_1="1" indicator_2="0">
  <subfield code="a">Enterprise application integration with XML and Java</subfield>
  <subfield code="c">J.P. Morgenthal, with Bill la Forge</subfield>
</field>

DOM – Document Object Model

• DOM – a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents

• Built into web browsers and servers

– Used by web browsers for dynamic display capabilities
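As a rough illustration of "access and update", the sketch below uses `xml.dom.minidom`, the small DOM implementation in Python's standard library. The sample document and the new values are made up for the example. Browsers expose the same kind of calls to scripts.

```python
from xml.dom.minidom import parseString

# Parse a tiny document into a DOM tree.
doc = parseString("<book><title>Discourse Analysis</title></book>")

# Access: find the element and read its text node.
title = doc.getElementsByTagName("title")[0]
original = title.firstChild.data

# Update content: change the text in place.
title.firstChild.data = "Discourse Analysis (revised)"

# Update structure: add a new child element and an attribute.
subtitle = doc.createElement("subtitle")
subtitle.appendChild(doc.createTextNode("An introduction"))
doc.documentElement.appendChild(subtitle)
doc.documentElement.setAttribute("lang", "en")

# Serialize the modified tree back to markup.
serialized = doc.documentElement.toxml()
```

Every step goes through the tree of nodes rather than through string manipulation, which is what lets the same interface work across languages and platforms.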

Document Type Definition (DTD)

• A set of syntax rules for creating tags

• Defines

– What tags can be used

– The order they should appear in

– Which tags can be nested

– Which tags have attributes

• Can be part of an XML document

– Typically defined externally

DTD and Elements

<!DOCTYPE BOOK [
<!ELEMENT BOOK (AUTHOR?, TITLE, PUBLISHER+, SUBJECT*)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT SUBJECT (#PCDATA)>
]>

Attributes

<!ELEMENT PERSON EMPTY>
<!ATTLIST PERSON person_id ID #REQUIRED>
<!ATTLIST PERSON sex (M | F) #IMPLIED>
<!ATTLIST PERSON status (employee | trainee) "employee">
<!ATTLIST PERSON company CDATA #FIXED "XYZ">

Schemas

• Introduces a mechanism for strong typing

– Allows a schema to be directly imported into a database to create a table

• Standardized NULL representation

• Key representation

Well-formed and valid

• Well-formed

– Conforms to the general rules of XML syntax, which are very rigorous

– Example – a tag must always be ended

• <title>Discourse Analysis</title>

• <subtitle/>

• Valid

– Documents that conform to the specific DTD in use
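Well-formedness is easy to test mechanically, since any conforming XML parser must reject a document that breaks the syntax rules. The sketch below wraps Python's standard `xml.etree.ElementTree` parser in a hypothetical helper, `is_well_formed()`. Note the limit of the example: this parser does not validate against a DTD, so it answers only the first of the two questions on the slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Every tag is ended, including the empty element.
good = "<book><title>Discourse Analysis</title><subtitle/></book>"

# The title element is never closed.
unclosed = "<book><title>Discourse Analysis</book>"

# Two top-level elements: a document needs a single root.
two_roots = "<title>Discourse Analysis</title><subtitle/>"

results = [is_well_formed(doc) for doc in (good, unclosed, two_roots)]
```

Checking validity as well would need a validating parser and the DTD itself, which is a separate step from the syntax check shown here.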

XML-Link and XML Pointer

• Open set of linking elements

• Non-directional

– arbitrary

– non-hierarchical

• XML Pointer

– Enables addressing any part of a text

• A more powerful HTML “anchor” tag

• XML-Link

– Enables attaching a behavior to a link

– Extended links, similar to a web ring

XML-Link Example

<related-URL-group>search
  <related-URL HREF="altavista.xml"/>
  <related-URL HREF="webbrain.xml"/>
  <related-URL HREF="yahoo.xml"/>
</related-URL-group>

<!ELEMENT related-URL-group (#PCDATA | related-URL)*>
<!ATTLIST related-URL-group
  XML-Link     CDATA #FIXED "EXTENDED"
  INLINE       CDATA #FIXED "TRUE"
  CONTENT-ROLE CDATA #FIXED "RT"
>

Displaying XML information in the browser

• XML parser built in

– Relates data stream to DTD and style sheet

• Style Sheets

– Only method for formatting XML data for display

• Similar to HTML CSS

– More powerful

• XSLT

– Processing language that allows for transformation of data presentation

XHTML

• “Next generation” HTML

• HTML that conforms to XML standards

• Will eventually support integration with other XML applications

• Device independent web-access

XHTML Example

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"><head>

<title>Bare bones example</title></head><body> <p>

<a href="http://validator.w3.org/check/referer"> validate </a>

</p></body></html>

HTML 4 - XHTML Major Differences

• All related to “well formedness”

– Tag/attributes must be in lower-case

– Elements must nest, no overlap

– All non-empty elements must be closed

– All empty elements must be terminated

– Attribute values must be quoted

– Attributes cannot be minimized

– Scripts should be downloaded from server

XML Life Cycle

• Authoring

• Presentation

• Search and Retrieval

• Integration

The Big Picture

Just “Add Water & Stir”

XML (document or database)

XSLT style sheet

XSLT Processor (XML Parser)

Browser (XML Parser)
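The pipeline can be approximated in a few lines of code. Python's standard library has no XSLT processor, so the sketch below is a hand-written stand-in: the function `to_html()` plays the role that a style sheet plus an XSLT processor would play, turning structured XML into HTML for display. The element names in the source document are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Source data: structure only, no presentation. Element names are assumed.
SOURCE = """<books>
  <book>
    <title>Enterprise application integration with XML and Java</title>
    <publisher>Prentice Hall PTR</publisher>
    <year>2001</year>
  </book>
</books>"""


def to_html(xml_text):
    """Build an HTML tree from the XML data (the 'style sheet' step)."""
    root = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    for book in root.findall("book"):
        # Presentation decisions live here, not in the data.
        ET.SubElement(body, "h1").text = book.findtext("title")
        imprint = f"{book.findtext('publisher')}, {book.findtext('year')}"
        ET.SubElement(body, "p").text = imprint
    return ET.tostring(html, encoding="unicode")


page = to_html(SOURCE)
```

Swapping in a different function, or a different style sheet in a real XSLT setup, changes the display without touching the source document.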

Authoring Tools

• Editors (getting the content in)

– XML and XSLT Editors

• XML Spy

• XML Notepad

• XMetal

• Xeena

– Word processors

• WordPerfect

– Content Management Systems

XML Spy

• Structured/document editor

– XML

– DTD

– schemas (DCD, XDR, BizTalk, XSD)

– XSLT

• Views for:

– Structured editing (grid view, table view)

– Document editing (WYSIWYG)

• Full Unicode support

– MSXML3 is used by default, but can be changed

XML Notepad

• Quick and dirty editor for Windows

• Doesn't use DTD to guide editing

– if present, however, validates the document against it on loading

XMetal

• Professional, full-featured XML/SGML editing tool

– word processor-like view

– source view

– tag view

• SGML or XML DTDs

– context-sensitive lists of allowed elements and attributes

– supports CALS tables, DOM, CSS, and HTML

• Integrated browser preview for XML documents.

Xeena

• Loads DTD and provides tree-view syntax directed editing

• Aware of the DTD grammar

– Makes only authorized elements' icons sensitive

– Ensures that all documents generated are valid according to the given DTD

WordPerfect

• Word processor with advanced support for authoring XML and SGML documents in a WYSIWYG environment

• Includes

– Wizards

– Automatic element insertion

– Automatic generation of documents.

• The DTD, layout information, and mapping files are incorporated into a single WordPerfect template.

Content Management Systems

• Many CM system repositories use XML under the hood for tagging and storing information

• Or can “speak” XML – export as XML to allow integration with other applications

• Open any trade magazine and see the standard vendor names proclaim their support for XML

• To the document creator, XML is “invisible”

XML Conversion Tools

Examples:

• Logictran RTF Converter

• HTML Tidy

– Free Windows program

– Converts HTML to XHTML or XML

Logictran RTF Converter

• Converts Word and RTF documents to HTML, XML, SGML

• The converter allows you to create output for any DTD.

• You can generate HTML, XHTML, OEB and Docbook.

XSLT Processors

• Means of converting files between XML dialects and other formats

– MSXML built into Internet Explorer

• http://msdn.microsoft.com/xml

– Xalan

• http://xml.apache.org/xalan-j/index.html

XML Parsers

Examples

• Expat

– Written in C (ported to other languages), used by LIBWWW, Apache, …

• XML4J

– from alphaWorks, in Java, based on Apache Xerces, supports DOM and SAX

• Many other parsers

Servers

• Apache XML

– xml.apache.org

– built-in Xerces XML parser, Xalan XSLT processor

Browsers

• Internet Explorer 6

– XML support is fairly extensive

– Namespaces are supported

– Supports style sheets in CSS as well as XSLT 1.0

– Parser is still an issue

• Netscape 6.1

– supports HTML 4.0, XML, CSS, DOM, namespaces, simple XLink

– Does NOT support XSLT

• Opera

– supports XML

XML Standards & Applications

• Many activities where XML has a role

• OASIS has an extensive list of applications

– RSS (news headlines)

– MathML

– SMIL

– DocBook

XML Standards – Multiplying Like Rabbits

• Software applications (transactions, interchange)

• Publishing

Software Applications

• Office tools and groupware

• Decision support systems

• Functional/transactional systems for HR, CRM ..

• Intelligent systems (ES, IPSS)

• User support

Publishing

• Digital rights (EBX,…)

• DocBook, e-book, TEI

• News (RSS, ICE, NITF, NewsML)

• Special subject area formats (MathML, ChemML, CellML, GeneXML)

Publishing: News

• Web site news

• Syndicated news

• Headlines

• Full text

KM applications

• Integrating internal and external news, auto-categorizing news, adding items to the news based on new additions to the repository, user profiling

ICE

RSS

NewsML

NITF

RSS (Rich Site Summary)

CRM News www.moreover.com

• Web news format

• Simple application

• Take a look at the bits and pieces

RSS – Why?

• The Need

– Quick, easy, and consistent announcements pushed out to other sites

– Incorporate news and other information feeds on a site

How it works

Before RSS

• No standard

• Everyone put up what was new and described it differently

• Special one-off parsers and screen scrapers had to be written

The Result

• > 1700 sites sharing news

• Many sites re-posting the headlines

• Examples:

• myuserland.com

• www.moreover.com

• xmlTree - directory of content

RSS Syntax

• An RSS file has two major placeholders for data: channel and items.

Channel Element

• The channel element must contain the following:

• title or name of the channel,

• short description of the channel,

• link to the web site of the channel, and

• the language in which the web site is written.

• Also, numerous optional elements can be included with the channel, such as copyright, webmaster, publication date and so on.

Item Element

• An RSS file can have up to 15 item elements. Item elements are used to store the headlines and are the meat of the document. Item elements have the following elements:

• title

• link

• description

RSS Code

• First line contains an XML declaration:

<?xml version="1.0"?>

• The next item is the DTD identifier:

<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

RSS Statement

• Next, the rss element

– must specify the version attribute.

– may contain an encoding attribute

• the default is UTF-8

<rss version="0.91" encoding="ISO_8859-1">

Channel Definition

• Contains a single channel element.

– Title, description, link to channel’s web site, language, one or more item elements, lots of optional elements

<channel>
  <title>moreover... US politics news</title>
  <link>http://www.moreover.com</link>
  <description>US politics news - news headlines from around the web, refreshed every 15 minutes</description>
  <language>en-us</language>

Item Elements

• Up to 15 item elements

<item>
  <title>'Author Unknown' by Don Foster</title>
  <link>http://www.salon.com/books/feature/2000/10/30/pbacks/index.html</link>
  <description>Salon Nov 2 2000 6:51AM</description>
</item>
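Putting the pieces together, a site that wants to re-post headlines only needs a generic XML parser and a few lookups. The sketch below assembles the fragments from the preceding slides into one RSS 0.91 document and reads it with Python's standard library. The function name `read_feed()` and the shape of the returned dictionary are choices made for this illustration; the required channel elements and the 15-item limit follow the description above.

```python
import xml.etree.ElementTree as ET

FEED = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>moreover... US politics news</title>
    <link>http://www.moreover.com</link>
    <description>US politics news - news headlines from around the web, refreshed every 15 minutes</description>
    <language>en-us</language>
    <item>
      <title>'Author Unknown' by Don Foster</title>
      <link>http://www.salon.com/books/feature/2000/10/30/pbacks/index.html</link>
      <description>Salon Nov 2 2000 6:51AM</description>
    </item>
  </channel>
</rss>"""

REQUIRED = ("title", "link", "description", "language")
MAX_ITEMS = 15


def read_feed(text):
    """Parse an RSS 0.91 document into a plain dictionary."""
    rss = ET.fromstring(text)
    channel = rss.find("channel")
    if channel is None:
        raise ValueError("no channel element")
    missing = [name for name in REQUIRED if channel.find(name) is None]
    if missing:
        raise ValueError("channel is missing: " + ", ".join(missing))
    items = [
        {
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        }
        for item in channel.findall("item")
    ]
    if len(items) > MAX_ITEMS:
        raise ValueError("more than 15 item elements")
    return {
        "version": rss.get("version"),
        "title": channel.findtext("title"),
        "items": items,
    }


feed = read_feed(FEED)
headlines = [item["title"] for item in feed["items"]]
```

Compare this with the "Before RSS" situation: the same code works for any site that publishes in the format, with no screen scraping.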

From Simple Documents to Complex

• Hierarchical

• Many objects and elements

• Many “namespaces”

Namespaces

• A single XML document may contain elements and attributes that are defined for and used by two or more XML-based languages without conflict or ambiguity

Example

<record xmlns:book="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Working Knowledge</dc:title>
  <dc:description>Overview and case studies of knowledge management</dc:description>
  <book:chapter>5. Knowledge Transfer … </book:chapter>
</record>
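To a parser, a prefix such as `dc:` is only shorthand; what identifies an element is the namespace URI the prefix is bound to. The sketch below shows this with Python's standard `xml.etree.ElementTree`. The wrapper element `record` is an assumption added so that the namespace declarations have an element to live on, and the chapter text is shortened for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<record xmlns:book="http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Working Knowledge</dc:title>
  <dc:description>Overview and case studies of knowledge management</dc:description>
  <book:chapter>5. Knowledge Transfer</book:chapter>
</record>"""

# Prefix-to-URI map used for lookups. The prefixes here are local to this
# script and need not match the ones used inside the document.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "book": "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd",
}

root = ET.fromstring(DOC)

title = root.findtext("dc:title", namespaces=NS)
chapter = root.findtext("book:chapter", namespaces=NS)

# Internally the parser stores the full name as {namespace-uri}localname.
expanded_name = root[0].tag
```

Because names are qualified by URI, a Dublin Core `title` and a DocBook `title` could sit in the same document without any ambiguity.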

OEB - Open E-Book

• In September 1999, the group published the Open E-Book 1.0 Publication Structure

• The Open E-book standard is essentially XHTML—that is, a clean version of HTML 4.0 along with support for CSS.

• www.openebook.org

RDF - Resource Description Framework

• Framework for metadata

• Interoperability of information exchange between applications

• Applications:– Resource discovery

– Knowledge sharing and exchange

– Content rating

– Intellectual property rights

RDF Example

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/">
  <rdf:Description rdf:about="http://your.url"
                   dc:creator="Frank Cervone"
                   dc:title="My RDF document"
                   dc:description="Exciting RDF Stuff."
                   dc:date="2000-11-10" />
</rdf:RDF>
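Each attribute on `rdf:Description` is one statement about the resource named in `rdf:about`: a (resource, property, value) triple. The sketch below pulls those triples out with Python's standard XML parser. This is only an illustration for the abbreviated attribute form used in the example; it is not a general RDF parser, and the variable names are made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/">
  <rdf:Description rdf:about="http://your.url"
                   dc:creator="Frank Cervone"
                   dc:title="My RDF document"
                   dc:description="Exciting RDF Stuff."
                   dc:date="2000-11-10" />
</rdf:RDF>"""

root = ET.fromstring(DOC)
description = root.find(f"{{{RDF_NS}}}Description")

# The resource every statement is about.
about = description.get(f"{{{RDF_NS}}}about")

# One triple per Dublin Core attribute: (resource, property, value).
dc_prefix = f"{{{DC_NS}}}"
triples = {
    (about, name[len(dc_prefix):], value)
    for name, value in description.attrib.items()
    if name.startswith(dc_prefix)
}
```

A harvester that understands the Dublin Core vocabulary can now treat `creator` and `date` as known properties, whatever site the description came from.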

Emerging Standards For KM

• XTM

• OPML

• RFML

• FLBC

• ebXML

XTM: Topic Maps

• Used to organize information into knowledge bases

• Topic maps are a new ISO standard for describing knowledge structures and associating them with information resources

• “GPS” for information

• http://www.topicmaps.org/xtm/index.html

“A book without an index is like a country without a map”

OPML

• Outline Processor Markup Language

– Outline-structured information

• Used for data that is easily browsed and edited

– Specifications

– Legal briefs

– Product plans

– Presentations

– Screenplays

– Directories
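An outline maps naturally onto nested XML elements. As a rough sketch, the code below builds a small OPML document with Python's standard library, using the basic OPML layout of a `head` with a title and a `body` of nested `outline` elements that carry a `text` attribute. The helper `build_opml()` and the sample outline, taken from headings in this presentation, are illustrative only.

```python
import xml.etree.ElementTree as ET


def build_opml(title, outline):
    """Build an OPML tree from (text, children) pairs, nested to any depth."""
    opml = ET.Element("opml", version="1.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")

    def add(parent, nodes):
        for text, children in nodes:
            node = ET.SubElement(parent, "outline", text=text)
            add(node, children)

    add(body, outline)
    return opml


# A fragment of this presentation, expressed as an outline.
OUTLINE = [
    ("XML Overview", []),
    ("The *ML family", [("SGML", []), ("HTML", []), ("XML", [])]),
]

opml = build_opml("XML and KM", OUTLINE)
markup = ET.tostring(opml, encoding="unicode")
```

Because the nesting is explicit, an outliner can collapse, expand or reorder sections without understanding the text inside them.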

RFML

• Relational-functional markup language

• Used to define relationships and functions among data elements

– Tables within relational databases

– Relational views

FLBC

• Formal Language for Business Communication

– Automated communication

– Conversation management

– Dialog management

– Based on speech act theory

• Formally defined message types

• Broad range of message types

• Defined in terms of intentions

• Clear delineation between message type and content

XML in Use

• Portals

• Content management & syndication

• Content management: industry sector

• Integration

• Analytical/decision making

• Search and retrieval

• Visualization

Applications: Portals

• Portals are an obvious place for XML to be used. Most are integrating diverse data sources.

• Examples:

– Hummingbird’s Enterprise Portal Suite

• allows XML-based third party application integration for a variety of scripting languages

• Basically “write with your own tools/platform” and exchange data with XML

– DataChannel, Sybase Enterprise Portal, Citrix XPS

Content Production & Syndication

• Interwoven

– Intranet/extranet content management and authoring based on intelligent business rules, profiling etc.

– Newest component of Interwoven’s suite of tools focuses on content distribution and uses XML.

– OpenSyndicate uses an XML repository which allows content to be stored as objects and reused for multiple projects.

Open Syndicate

Content: Industry Specific Solutions

• Ringtail Solutions

– Suite of litigation support and KM modules for legal practitioners

Integration

• InfoShark

– Used to integrate data from a host of services and programs, from 100’s to 1000’s of transactions each day

– Automates data exchange between Oracle, IBM DB2 and Microsoft SQL for use over Internet, intranets, and extranets

– Being used by Montgomery County for eGov services of all types

Analytical/Decision Making

• Spotfire

– DecisionSite 6.2 is powered by an XML-based application manager for tools, guides, and resources for genomics, chemistry, and manufacturing

Visualization

• Antarcti.ca

– visual mapping technology provides enterprises with data search and discovery

Not a Silver Bullet

“XML is not the answer to all the world’s problems—it creates new problems, that are awfully damn interesting to solve.”

Simon St. Laurent,

author of XML: A Primer,

on the xml-dev mailing list

Thank you!

• Frank Cervone

Assistant University Librarian for Information Technology, Northwestern University

f-cervone@northwestern.edu

• Darlene Fichter

Data Library Coordinator, University of Saskatchewan

Darlene.Fichter@usask.ca
