INFOLUTION

Solving information complexity

whitepaper version 2

By Henk Alles, CEO Infolution


Copyright and license information

Infolution software binaries are subject to licensing. Delivery is electronic over the Internet or by mail on CD-ROM. Every copy of the software is accompanied by a license agreement. Opening the physical CD case means acceptance of the license. A written copy of the shrink-wrap license agreement is included in every CD case. Electronic delivery is performed as a secure (encrypted) archive. Opening the archive requires the licensee to read the shrink-wrap license and click the accept button. The license grants the licensee a 28 (twenty-eight) day activation period. After installation an activation token is generated based on a hardware inventory of the licensee's computer. This token is to be transmitted to Infolution. Infolution will generate an activation key based on: customer and user name, license type (personal, professional etc.) and the user's hardware ID. The activation key will be sent to the licensee within 24 hours of receipt of the information above.

Copyright © 2001–2003 INFOLUTION B.V. All rights reserved. No part of these documents may be reproduced, transmitted, transcribed, stored in a retrieval system or translated into any language, in any form or by any means, without the prior written permission of INFOLUTION B.V., Schouwschuit 37, 1613 CJ, Grootebroek, Netherlands. Infolution™, Tarchon™ and InfoRaptor™ are registered trademarks

or trademarks of INFOLUTION B.V. in the Netherlands and/or other countries. Other company and product names mentioned herein may be the trademarks of their respective owners. Disclaimer(s): the information contained in this document represents the current state hereof and may be revised without the obligation of INFOLUTION B.V. to notify any person(s) or organizations of such revisions or changes. All information contained herein is proprietary to INFOLUTION. All information contained in this document is provided "as is" without warranty of any kind. Because of the possibility of human and mechanical errors as well as other factors, INFOLUTION is not responsible for any errors or omissions in the information. INFOLUTION makes no representations and disclaims all express, implied and statutory warranties of any kind to the user and/or any third party, including any warranties of accuracy, timeliness, completeness, merchantability and fitness for a particular purpose. Neither INFOLUTION nor its affiliates shall be liable to the user or to any other entity or individual for any loss of profits, revenues, trades or data, or for any direct, indirect, special, punitive, consequential or incidental loss or damage of any nature arising from any cause whatsoever, even if INFOLUTION has been advised of the possibility of such damage. INFOLUTION and its affiliates shall have no liability in tort, contract or otherwise to the user and/or any third party.


Table of contents

Introduction

Search economics

Humanizing find

Making the system understand you

Core engine: InfoRaptor

Harvesting

Analysis

Processing

Language identification

Scrubbing

Lexical analysis

Part of sentence tagging

Vaut analysis

Phrase analysis

Full text indexing

Context analysis

Clustered content tagging

Stock knowledge

Document classification

Semantic cluster analysis

Taxonomy builder

Semantic analysis: from stock knowledge to flow knowledge

Semantic storage

Document synthesizer

InfoRaptor components

Web services interface

Application Programmers Interface

Syntax SQL statements

Query samples

Database schema tree

Spider

Visualization

System Requirements


Introduction

Having the right information at the right time is absolutely essential to one's business. "Information is power!", or so the saying goes. But while information is a valuable resource, information overload can pose a serious problem.

Current studies show that presently available knowledge management technology finds correct answers only about three times out of ten. Worse still, it retrieves countless irrelevant documents, costing time and increasing user frustration. Because search technology has not progressed at the same rate as data has grown in volume, and because those search technologies are keyword or concept based, the current generation of knowledge management applications has failed to satisfy expectations.

In addition, the largest source of information, unstructured sources, so far lacks technology to search, retrieve and convert information into knowledge. The current generation of knowledge management applications will never create a useful user experience that adds value to business. Infolution addresses the fundamental issue: retrieval based on semantics (context and relationships) using techniques involving uncertainty and ranking of effective retrieval instead of Boolean or Bayesian logic. This new methodology is a significant change from the standard techniques of indexing and formal querying.

Textual information comes in many different forms such as memos, reports, telexes, faxes, e-mail, books, manuals, and newsletters. The advent of computers has greatly accelerated the production of such information, more than any human can handle. Ploughing through the masses of text for information can be time-consuming and mind-boggling. Where do we even begin to collate, let alone analyze, all this text? Infolution has come up with new technology that allows us to sift through text extremely fast and get to meaningful information of exactly the right size (length) needed in decision-making.


Search economics

Catalogues or classification schemes fail because of the sheer volume dynamics involved in the document domain. New ways and new technology are needed to provide economically viable access to all this data. Since the chances of a document ever being retrieved are becoming smaller and smaller over time, we can no longer justify putting human effort into pre-structuring the data before storing it. Infolution has set out to provide tools that can structure and classify data autonomously and in near real time during retrieval.

Infolution products and tools support the new users that are becoming on-line and computer literate every day. And there are millions of these new users coming on-line every year. These new users do not have the same background as the more traditional IT-aware users. The old metaphor of educating and training new users is no longer economically viable.

The log files of an on-line search service, for example, show that retrieval professionals, like librarians, query the system with an average of fifteen concepts, as opposed to new users with 1.3 concepts on average. Last-generation software therefore presents these users with an overwhelming number of documents, resulting in low user satisfaction. The precision of the system's answer is very low.

But even the old hands at information retrieval are not satisfied with their systems. Due to the preciseness of their queries they get a limited set of documents that describes exactly what they asked for, but which is mostly incomplete and certainly does not comprise what was intended. The recall of the system's answer is low.

Data needs structure to be information. Information needs context to provide knowledge and insight. The Encyclopedia Britannica provides data. A medical textbook provides information. A doctor doing an interactive diagnosis provides knowledge.


Humanizing find

Infolution is an intelligent database and text visualization system that simplifies, personalizes and enhances the search for knowledge. The technology is based on research concerning how human beings think and the context of their understanding. So far the best interface between a person and information has been another person, someone who has already studied the topic at hand. Infolution has been designed on the basic assumption that successful use of the Internet requires an intermediary who provides context and structure. If you don't have a context from which to construct questions, how can you ask the right questions? Infolution provides the user with context as he seeks knowledge. It provides responses that are relevant to what the user wants to discover. It understands and provides the context in a user's quest for knowledge.

In order to gain insight and knowledge it is necessary to blur the distinction between searching and browsing. These functions have no meaning to the user. Users do not want to browse through data but want to find information. Information seeking is a complex human activity, not easily expressed in algorithms or calculations. It is both cumulative and iterative. It needs to take into account the user's background and previous experiences. Infolution therefore closes the loop by providing users with the means to give instant feedback during the process. As with a human intermediary, interaction is the basis for a successful find.

Taking context into account on the written word helps us, and machines, to understand text. When we see a single word, for example 'duke', what does our mind think? It tries to assign meaning to this word, but cannot, since 'duke' can mean different things. When we are presented with a sentence like 'the last movie the duke starred in', we can eliminate most of the options and use the knowledge we have gained over the years to disambiguate the word 'duke'. By analyzing the context we start hypothesizing about what is meant by 'duke'. In this case we quickly see that 'duke' has something to do with a movie, and we know from past experience that Duke Ellington featured in some movies and that the actor John Wayne was known as the Duke. Further analysis takes the word 'starred' into account, which hints towards John Wayne, since Duke Ellington never had a lead role in a movie. When we read on in the document we try to find proof that John Wayne is meant. What we actually do is read beyond the words.


Making the system understand you

Understanding is the connection between the document domain on one side and the user on the other. For true understanding, knowledge is required: knowledge about the available content in the domain, and knowledge about the user's questions, requirements and intent.

Where does such knowledge come from? Traditionally it has been a product of true human craftsmanship, required for producing things such as encyclopedias, dictionaries, libraries etc. Furthermore, people learn concepts in relation to other concepts. Infolution builds an understanding of a digital document domain in a similar way. In the digital age it is possible to engage machines in this knowledge harvesting process. The machine's involvement serves two main purposes:

• In the first place, to automatically extract documented knowledge in high volume from machine-readable texts produced by humans.

• Second, to monitor the behavior of individual humans when they interact with machines, and to extract knowledge about a particular user and her/his behavior.

The availability of knowledge technology is imperative in being able to enhance the user's interaction with a machine: it allows the machine to understand the content of all available documents and the user's request. Subsequently it enables the machine to appropriately match the most relevant content with the user's request, and to order results using the knowledge obtained about the user: her/his requests, expectations, and priorities.


Core engine: InfoRaptor

It's quick, agile, ferocious and smart; it operates stealthily, flies, works in teams and will eat just about everything that has text in it.

Infolution's automated knowledge extraction is based on linguistic technologies that enable the machine to 'understand' written text and store knowledge in the form of semantic networks. This is combined with more classical technologies, generally based on statistical data about word forms and their occurrence; the hybrid solution offers a revolutionary improvement. The core engine, affectionately called InfoRaptor by its developers, consists of four independent modules that harvest, analyze, store/retrieve, synthesize and visualize.

Harvesting

To create a single point of access to all information, a requirement at the top of the wish list of 98% of users, the harvesting process must be wide, across all sources, and deep, into each of these sources. The harvesting module has two means of getting to the corporate data: walking and crawling.

• File and system walkers are able to connect to the most common sources found in today's enterprises, like local hard drives, network file servers, mail systems, databases and other information systems like Lotus Notes. Data is gathered on a per-document basis and can reside on Microsoft Windows, Apple Macintosh and UNIX platforms.

• Crawlers actively spider data from web-based services, both inside the corporation and on the Internet. Infolution crawlers can be context sensitive, whereby the crawler sees links (URLs) prioritized based on semantic analysis of the text surrounding the link.


Each harvested document is placed in a processing queue where meta data is added at each stage. The first stage for all non-XML data is conversion to XML. Common formats like Microsoft Office Word, Excel, PowerPoint and Access plus Adobe Acrobat, HTML, text and various e-mail formats are supported. The format converters always add the XML representation to the harvested file, never replacing it. The system manager can determine what parts of the XML are actually held in cache and stored. This allows organizations to tune the system for speed and storage size.

In the next step the XML is further annotated with meta information from date/time stamps, any meta tags present, such as author, and general statistics like the number of words, sentences and paragraphs and the text size.

The harvesting part of the system then transfers the XML file to the analyzer. The core engine allows many harvesting modules to work in parallel or to be located at separate locations within an enterprise. Harvesting therefore occurs at the source and lowers bandwidth requirements on the network.

Analysis

The Infolution analyzer is the first technology to automate the extraction and synthesis of knowledge from both structured and unstructured sources. It creates knowledge and allows it to be embedded within networks, thereby unlocking it from within applications and making it a ubiquitous asset within an organization.

The Infolution Knowledge Network departs from conventional methodology by employing the first ever scalable and commercially viable semantic processor. This engine can progressively comprehend, relate and retrieve concepts in the same way as the human brain does. Most importantly, it learns as it goes. It remembers what it has seen and becomes more efficient every time it retrieves knowledge. In other words, it knows the meaning or context of a user query and correlates that to a document whose context is also understood. With this, Infolution represents a step change in technology. The greater the volume of information available, the more useful it will be. Infolution enables users to find the information they are really looking for quickly and easily, without having to sift through irrelevant search results.

Processing

As in the harvester, the core of the processing is steered by an XML-based document pipeline. At each stage the XML data is amended with information.

This data has a steering and prioritizing effect on the following stages, as it is used to set the appropriate parameters for follow-up parts of the analysis.


Language identification

As some of the analyzers are language dependent and the user should be able to select or limit retrieval to a certain language, the first step is to accurately identify the language the document is written in. The granularity of this process can be set anywhere from the whole document to parts or paragraphs.

The identification process uses a statistical approach on word parts (n-grams) that are matched against a predefined dictionary that is part of the engine. The language identifier is able to discriminate between 77 languages. The supported languages are listed in the table below.
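As a rough illustration of such n-gram matching, the Python sketch below scores a text against per-language trigram profiles; the two tiny profiles and the overlap scoring are illustrative assumptions, not Infolution's built-in dictionary of 77 languages.

from collections import Counter

def ngrams(text, n=3):
    # Count character n-grams (word parts) in lowercased text.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny illustrative profiles; a real identifier ships trained profiles
# for every supported language.
PROFILES = {
    "english": ngrams("the quick brown fox jumps over the lazy dog"),
    "dutch": ngrams("de snelle bruine vos springt over de luie hond"),
}

def identify(text):
    # Return the language whose profile best overlaps the text's n-grams.
    counts = ngrams(text)
    def score(profile):
        return sum(min(c, profile[g]) for g, c in counts.items())
    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))

print(identify("over de luie hond"))  # -> dutch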


Scrubbing

Many documents combine unwanted noise with content. HTML pages in particular contain large portions of text that are not relevant to the user during retrieval. For instance, menus and other click regions along the top and left side of the page contain many words that could lead to false hits later. The scrubber filters out these unwanted pieces of text. Depending on the source and/or type of document, the system manager is able to set parameters for the scrubbing or cleansing process. The scrubbed items are tagged as 'do not analyze' but are not removed from the file.

Lexical analysis

To enhance retrieval, words need to be in their root form. Depending on the language, lexical and algorithmic so-called stemmers are used to bring both verbs and nouns back to the root (stemmed) form. For example, the verb walking is reduced to walk, cars to car, etc. The stemmed word is added to the XML document in the pipeline and used for full text indexing. The semantic and context analyzers use the non-inflected words for processing; the annotated words are however used to map unknown inflections to known words. As the meaning of more words becomes known, contextual and semantic analysis performance increases. Removing inflection helps reduce the number of misses in the dictionaries and therefore leads to better understanding of the text.
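The sketch below shows the idea of suffix-stripping stemming in Python; the two rules are illustrative assumptions, whereas the shipped stemmers are language-specific and partly dictionary-backed.

SUFFIX_RULES = [("ing", ""), ("s", "")]  # walking -> walk, cars -> car

def stem(word):
    # Strip the first matching suffix, keeping a minimum stem length.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

# The pipeline would annotate each token with its stem so that the full
# text index and the semantic analyzers share one vocabulary.
print([(t, stem(t)) for t in ["walking", "cars", "walk"]])
# [('walking', 'walk'), ('cars', 'car'), ('walk', 'walk')]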


Part of sentence tagging

For clustered text synthesis the structure of each sentence needs to be tagged in advance, in order to present the most relevant parts of a sentence first. Take the sentence 'The man, who just had his birthday, went to work early today'. The part 'who just …' is tagged with a lower relevance than the rest of the sentence. Further contextual analysis is used for the ranking of sentences and paragraphs. The rated parts are stored as annotations by the part of sentence tagging. These tags are used both by the semantic analyzer and, during retrieval, by the document constructor. While constructing virtual documents these tags help the document synthesizer discriminate between important and unimportant paragraphs, sentences and parts of sentences. The tagged items' properties are amended later in the process, during the semantic analysis.

Vaut analysis

Many documents written by humans contain subtle spelling errors; documents resulting from optical character recognition are usually riddled with 'strange' characters that are not easy to spot. The difference between an O and a 0 (the first being a capital letter o, the second the number zero) has no effect on humans trying to understand a text. Machines traditionally cannot overcome this barrier, and retrieval recall for OCR'ed domains is usually poor. Some systems try to overcome this by pattern matching the query against a list of all known words in a domain. This is however a very CPU-intensive process that is not scalable for large domains. Infolution has devised and patented an algorithm that analyzes these errors beforehand and stores the results as links between words in the full text index. The analysis uses a statistical method that performs a two-dimensional windowed analysis of all available text. The number of alternative spellings of words therefore does not affect retrieval performance. The list below shows a few examples of the output of the vaut analyzer.


The vaut module is not only used during analysis of new documents but also provides alternative spelling suggestions on user query input. Mistakes made while entering questions are therefore automatically corrected by the system. The user will just get back the right answer! During retrieval, alternatives for entered words are typically shown to the users as 'Do you mean: [list of alternatives]'.
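The Python sketch below mimics the outcome of this stage: words that collide after mapping commonly confused OCR characters are linked as spelling variants. The confusion table and grouping are illustrative assumptions, not the patented two-dimensional windowed analysis.

CONFUSABLE = str.maketrans({"0": "o", "1": "l", "5": "s", "8": "b"})

def normalize(word):
    return word.lower().translate(CONFUSABLE)

def link_variants(vocabulary):
    # Group index words by their normalized form; groups with more than
    # one member yield variant links stored alongside the full text index.
    groups = {}
    for word in vocabulary:
        groups.setdefault(normalize(word), set()).add(word)
    return {key: words for key, words in groups.items() if len(words) > 1}

print(link_variants(["Oslo", "0slo", "bank", "8ank", "duke"]))
# e.g. {'oslo': {'Oslo', '0slo'}, 'bank': {'bank', '8ank'}}

At query time the same normalization can drive the 'Do you mean' suggestions.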


Phrase analysis

A more sophisticated technique to discover word patterns, i.e. combinations of words (not necessarily in sequence), is required to increase the interactivity of the system. The phrase analysis module uses basic linguistic techniques combined with belief networks to extract word combinations. A user asking for Amsterdam will be presented with a response 'Do you mean: University of Amsterdam, City of Amsterdam or Tourism in Amsterdam'. This provides a helpful tool for further interaction and allows the user to express his intent. Another use for the phrase analyzer is to identify and extract names of people who are sometimes referred to with both first and last name, but in other instances within the same document with only their first name or surname.
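A minimal sketch of the counting that underlies such phrase discovery, assuming plain windowed co-occurrence; the real module layers linguistic rules and belief networks on top of statistics like these.

from collections import Counter

def phrase_candidates(sentences, min_count=2, window=4):
    # Count word pairs that co-occur within a small window; pairs seen
    # often enough become candidate phrases.
    pairs = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for other in words[i + 1:i + window]:
                pairs[(w, other)] += 1
    return [p for p, c in pairs.items() if c >= min_count]

docs = ["University of Amsterdam announces",
        "the University of Amsterdam library"]
print(phrase_candidates(docs))
# [('university', 'of'), ('university', 'amsterdam'), ('of', 'amsterdam')]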


Full text indexing

The full text indexing engine creates an inverted file of all word occurrences in the text pipeline. The engine has a number of features that allow it to have 100% uptime. Even during indexing the engine allows simultaneous retrieval by using single-write and multiple-read techniques. Real-time filtering applications that provide dynamic content processing of crawler results and business processes have been implemented by Infolution partners at several locations.

The real-time aspects of the engine let users search simultaneously for existing data and the most up-to-date information in 24/7 environments.

It has been specifically designed for large Internet and information-centric enterprise environments with high-throughput feeds. The incremental features of the engine cater for applications that depend on dynamic domains whereby documents are added and deleted constantly. No re-indexing of the entire domain after document removal is necessary. Scalability up to millions of documents is guaranteed. During retrieval its advanced relevance ranking capabilities allow users to enter queries without worrying about formal AND/OR type syntax, since the engine will automatically execute quorum searches whereby the most relevant documents are returned first.

The ranking behavior curves of the engine can be set at retrieval time, which allows for fine-tuning the system towards specific dynamic domains. The ranking engine can be tuned dynamically depending on the size and quality of the result set.
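The sketch below shows the two core ideas in Python: an inverted file and quorum ranking, where documents matching more query terms come first. It is illustrative only; the production engine adds incremental updates, single-write/multiple-read concurrency and tunable ranking curves.

from collections import defaultdict

def build_index(docs):
    # Inverted file: word -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def quorum_search(index, query):
    # Rank documents by how many query terms they contain.
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "duke stars in western movie",
        2: "duke ellington jazz",
        3: "western movie review"}
print(quorum_search(build_index(docs), "duke western movie"))  # [1, 3, 2]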


Context analysis

Automatic knowledge creation requires a number of steps whereby each step aggregates further concepts. The outcome of the analysis is stored as the relationship between words and a concept. Relations created by previous phases become more and more specific as more semantically oriented analyzers go over the source text.

The first step in this process is context analysis, whereby graphs are created on a per-document basis that contain pairs of related words (first order analysis), followed by second and third order analysis to link up the individual graphs. The resulting graphs are undirected, with relations of the type 'is related'. For each document a graph is created and stored that can be individually manipulated later if the source document is changed or purged. The patented algorithm does not require access to the domain if individual documents are changed. The process is therefore incremental and not CPU intensive.
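A first order pass can be pictured as the Python sketch below, which builds one undirected 'is related' graph per document from windowed word pairs; the window size and count-based weighting are assumptions, and the patented incremental algorithm is not reproduced here.

from collections import Counter

def document_graph(text, window=5):
    # Undirected edges between words that co-occur within the window;
    # edge counts serve as relation weights.
    words = text.lower().split()
    edges = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                edges[tuple(sorted((words[i], words[j])))] += 1
    return edges

g = document_graph("java island coffee java programming language")
print(g.most_common(3))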


Clustered content tagging

Users mention retrieval performance among their top three requirements; therefore the engine uses a number of asymmetrical processing stages to shift the CPU burden towards analysis time. Clustered content tagging uses a statistical technique to determine the most relevant topics for each document in the pipeline. Topic lists are stored on a separate plane in the graph database to enhance retrieval-clustering performance. The topic list for each document contains about two dozen topics, depending on document size. These tags make the semantic clustering process at retrieval time almost linear and size independent.
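As a stand-in for the undisclosed statistical technique, the sketch below tags a document with the terms that are frequent in it but rare in the domain, a plain tf-idf weighting; the cutoff of two dozen topics follows the text above.

import math
from collections import Counter

def topic_tags(doc_words, domain_doc_freq, total_docs, k=24):
    # Weight each term by frequency in the document times rarity in the
    # domain and keep the top k as the document's topic list.
    tf = Counter(doc_words)
    def weight(word):
        return tf[word] * math.log(total_docs / domain_doc_freq.get(word, 1))
    return sorted(tf, key=weight, reverse=True)[:k]

doc = "java java island coffee the the the".split()
dfs = {"the": 1000, "java": 40, "island": 200, "coffee": 150}
print(topic_tags(doc, dfs, total_docs=1000, k=3))
# ['java', 'coffee', 'island']  ('the' occurs everywhere and scores 0)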

Stock knowledge

Available (corporate) taxonomies can enhance the analysis and retrieval performance of the core engine. Usually taxonomies encompass a narrow domain in great detail. During import the hierarchical structure of a taxonomy is automatically converted into graphs, which are appended to the machine-generated graphs. Knowledge in taxonomies usually represents 'is_a' type relationships combined with 'synonym', 'narrower term' and 'broader term' relationships. An example of such a taxonomy is shown in the figure below.

These types of knowledge bases can be imported into the semantic network engine from a number of physical formats. RDF is the preferred format. The import process can translate CSV, ASCII and some database formats stored in SQL-addressable systems to RDF before importing. During import the structure of the file is also translated to a graph, as shown below.
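The sketch below illustrates the import step in Python, turning rows of (source, relation, destination) into links on a taxonomy plane; the plane name and the CSV-to-triples shortcut (skipping the intermediate RDF step) are illustrative assumptions.

import csv, io

taxonomy_csv = '''savings account,narrower term,account
account,broader term,savings account
account,synonym,bank account
'''

def import_taxonomy(text, plane="/tarchon/net/taxonomy"):
    # Each CSV row becomes one graph link on the given plane.
    links = []
    for source, relation, destination in csv.reader(io.StringIO(text)):
        links.append((plane, source.strip(), relation.strip(), destination.strip()))
    return links

for link in import_taxonomy(taxonomy_csv):
    print(link)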


The resulting structure complies with the native structure of the semantic database and is placed on a separate plane. The database engine will thereafter automatically connect all concepts of the taxonomy with matching concepts in other planes. The analyzer uses a logical view through all planes during processing, and all knowledge from the taxonomy is treated just like any other knowledge item.

Included with the engine is a network editor to aid in importing and managing the knowledge network.

Document classification

Classification of documents used to be a tedious and labor-intensive task for specialized librarians. With the Infolution classifier the librarian's expertise can be leveraged and extended dynamically. The classifier trainer application allows specialists to narrowly define categories with training data from lists and sample documents. The InfoRaptor engine will use the training data to classify every document at index time.

The resulting categories are, however, dynamically added to the semantic database. Category changes therefore require no additional reclassification effort. Category mutations are propagated into document categories automatically after the semantic database is updated.


Semantic cluster analysis

The graphs resulting from the previous analysis phases are placed in a second processing pipeline, which merges all graphs into one graph that is thereafter periodically added to the master knowledge graph. Adding can be continuous, for near real-time applications, or scheduled for times when demand on server CPU resources is relaxed. During the addition phase all resulting concepts are analyzed for promiscuity to determine centroids in the network. During this first order analysis the relative weights of the concepts and relations are set against the domain and document size. The results aid the retrieval clustering process by steering the user towards common themes. Cluster analysis for a domain may, for instance, tag the word 'Java' as a central theme. From this concept there are three other major themes: coffee, island and programming. During retrieval these connected concepts are presented to the user as alternatives.

Marker passing techniques are used to create dynamic clusters at retrieval time, which are created and sized dynamically to provide query feedback on both high and low yield subdomains.


Taxonomy builder

An automatic taxonomy builder periodically analyzes the network that was built during the previous phases. An advanced statistical process determines statistical variances between the numbers of forward and backward occurrences of concepts.

The process tags relations beyond the 'synonym' type. As more documents are added to the domain, the knowledge network is also rearranged, whereby more abstract concepts trickle towards the top and more specific terms towards the bottom.

The analyzer can for instance tag the concept 'mortgage' as something more general than 'ABN', 'HSBC' and 'Bank of Scotland', since statistically the chances of a bank mentioning the term 'mortgage' are higher than the chances of finding a document that relates the term 'mortgage' to all three banks.


Semantic analysis: from stock knowledge to flow knowledge

People learn concepts in relation to other concepts. Seen this way, there is no inherent concept hierarchy. A human-like ontology should therefore contain no such hierarchy. Dictionary entries are not forbidden to contain circular definitions.

The semantic analysis uses a number of basic language rules that have been pre-trained by Infolution to enhance the precision of the relations between concepts, whereas the statistical analysis components deliver relations of the type 'broader term', 'narrower term' and 'related term'. The semantic analysis phase amends the relations with things like 'walked', 'sells', 'bought', 'did' etc.


Semantic storage

The Nebula component is an SQL-compatible database that handles graph data structures as its native data type. It can operate in multi-user and distributed environments for high-performance or large-domain applications. Semantic storage uses an atomized model for easy processing in the digital domain, but thanks to its flexibility and expandability it adapts easily to real-world problems. A graph database consists of three native storage types:

• Concepts
• Links
• Descriptions (properties of concepts and relations)

The interface supports filtered queries, and the full SQL command set is expanded with operators specific to manipulating graphs and planes.

The native graph data type can be queried traditionally and with marker passers. Marker passing enables powerful natural associative queries that go beyond traditional binary decisions. For example, the question "Does Olaf know anybody in New York?" is very hard to resolve using traditional techniques if the exact answer is not available in one document. Marker passing overcomes this by chaining concepts through link traveling.
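The sketch below renders marker passing as a bounded breadth-first traversal in Python: markers spread from the query concept along links until the target concept is reached, yielding the chain that answers the question. The toy graph and hop limit are illustrative assumptions; the engine works on its semantic planes.

from collections import deque

GRAPH = {
    "Olaf": {"Infolution"},
    "Infolution": {"Olaf", "partner company"},
    "partner company": {"Infolution", "New York"},
    "New York": {"partner company"},
}

def chain(start, goal, max_hops=4):
    # Breadth-first marker spreading with a hop budget.
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) > max_hops:
            continue
        for neighbor in GRAPH.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [neighbor]))
    return None

print(chain("Olaf", "New York"))
# ['Olaf', 'Infolution', 'partner company', 'New York']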


Document synthesizer

Imagine asking an expert a question and getting the answer 'Oh yes, I know that. Here's a list with the answers: have a look at the article xyz from issue 45 published in the year 2000, page 45, and when you have read that, study chapter 23 in book abc and pay particular attention to paragraph 15', and so on. You would surely think that this 'expert' is in need of physiotherapy! It is however exactly this kind of behavior that humans have had to accept from machines thus far.

One of the most spectacular elements of the Infolution technology is the Information Canopy document synthesizer. This component can take any number of documents and present a summary as a single virtual document. The source documents need not reside within a structured database, and the summary can easily be shortened or lengthened without loss of coherence. The document can even be rewritten, placing emphasis upon different topics as required by the user or his/her profile.

Traditional hit list output


The document synthesizer module is able to provide the user with a targeted and coherent answer about any topic, of any length; the same kind of behavior you have come to expect from humans. The domain the answer originates from can be part of a document, a whole document or a group (cluster) of documents.

Synthesized document output

The document synthesizer is the last step, it is the retrieval process, and for the user the most visible part of the system. Since all of the necessary information has been made available at analysis (index) time, query processing happens in near real time. During query processing (a sketch of the final assembly step follows this list):

• Results are semantically clustered into groups of similar documents.

• Concepts are sorted by relevance to the user query.

• The best candidates for each cluster are extracted.

• The corresponding raw texts of the paragraphs in which these concepts occur are fetched from the database.

• Part of sentence tagging information is evaluated and sorted according to relevance, sentence construction (sentences that are written more factually are boosted in relevance) and the user's preference for the length of the answer.

• Working down this list, the answer is constructed by extracting snippets from the most relevant parts of the source text(s) until the length specified by the user is reached.

• The result is presented to the user.
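A minimal sketch of that assembly step in Python, assuming the earlier stages have already produced relevance-ranked snippets; stopping at a word budget stands in for the user's length preference.

def synthesize(snippets, max_words):
    # Concatenate the highest-ranked snippets until the length budget
    # requested by the user is reached.
    answer, used = [], 0
    for relevance, text in sorted(snippets, reverse=True):
        words = len(text.split())
        if used + words > max_words:
            break
        answer.append(text)
        used += words
    return " ".join(answer)

snippets = [(0.9, "John Wayne, known as the Duke, starred in many westerns."),
            (0.6, "His last movie was The Shootist."),
            (0.2, "Duke Ellington led a famous jazz orchestra.")]
print(synthesize(snippets, max_words=20))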


The document synthesizer is by no means an output-only module. Words that occur in the query are clearly labeled in the output. Words and concepts that suggest alternate but related answers are presented as hyperlinks. A click on a linked (hot) word boosts the relevance of that concept and recompiles a new virtual document that puts more emphasis on the selected topic.

The output windows of the document synthesizer also have two buttons to lengthen and shorten the virtual document interactively and instantly.

InfoRaptor components

A fine-grained architecture with an internal programming language and pipelined processing allows the engine to be deployed in a wide variety of situations: customers have used the engine for everything from very large information domains, where storage requirements are lowered in favor of execution speed, to real-time environments with 24/7 uptime requirements.

The engine has dual interfaces. The first is a traditional interface allowing for optimal high-bandwidth connections, typically used in product integration projects. The second interface exposes the InfoRaptor engine as a web service, which makes integration almost seamless on a wide variety of platforms.


Web services interface

The web service interface makes integration into existing infrastructures extremely easy. OEM-type embedding into web-service-enabled environments takes the hassle out of cumbersome API learning, as the web service is added to the development environment as a native type.

The web service exposes a limited number of objects that give the developer complete control over all aspects of the engine. With a single call the developer can query the semantic database, request a document or cluster abstract, generate a document preview or control the analyzing aspects of the engine.

Application Programmers Interface

A development kit with sample code, documentation and a development version of the InfoRaptor semantic database is available for developers. The remainder of this section provides a synopsis of the exposed interfaces and the internal data model. The semantic database API is modeled on Microsoft's ADO API. The web service allows developers to work with all aspects of the database through just six methods and functions.
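As a hedged illustration of driving the engine from code, the Python sketch below submits one of the documented SQL statements through a client wrapper; InfoRaptorConnection and execute() are hypothetical names, not the shipped six-method API, and only the SQL string follows the syntax documented below.

class InfoRaptorConnection:
    # Hypothetical client: a real one would ship the statement to the
    # engine via the API or the web service instead of echoing it.
    def __init__(self, host):
        self.host = host

    def execute(self, sql):
        print("[%s] %s" % (self.host, sql))
        return []

conn = InfoRaptorConnection("localhost")
rows = conn.execute('select * from scope("/data/") where FreeText("test") limit 100')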

The InfoRaptor semantic database supports the following SQL syntax.

• Select statement: SELECT select_list FROM source [ WHERE search_condition ] [ ORDER BY order_expression ] [ LIMIT [offset,] rows ]

• Create statement: CREATE DATABASE database_name [ ON [ filename ] ]

• Update statement: [ MAXTIME time(ms) ] UPDATE DATABASE


Syntax SQL statements

The InfoRaptor engine supports the following SQL clauses: Select, From, Where, Order by, Create, Update, Insert, Limit and Maxtime.

Select

• Select *

From

• from scope("/data/")
• from scope("/data/folder/")
• from scope("/tarchon/net/cooc/")
• from scope("/tarchon/net/phrases/")
• from scope("/tarchon/net/typos/")
• from scope("/tarchon/net/stem/")
• from scope("/tarchon/fuzzy/")
• from scope("/tarchon/fuzzyclassic/")
• from scope("/tarchon/net/geo")
• from scope("/tarchon/net/meta")
• from scope("/tarchon/index/dictionary")

Where

• where FreeText("test")
• where FreeText("+test +testa")
• where FreeText("test and testa")
• where FreeText("(test and testa) or testc")
• where Contains("test")
• where FreeText(contents, "test")
• where FreeText(title, "test")
• where FreeText(summary, "test")
• where Contains(contents, "test")
• where Contains(contents, "'New York'") {phrase search}
• where Pipe("test") {default pipe}
• where Pipe(def, "test") {default pipe}
• where Pipe(none, "test") {no pipe}
• where source="Amsterdam"
• where destination="Amsterdam"
• where concept="Amsterdam"
• where FreeText("test") and filter="biotech"
• where FreeText("test") and modified="2000"
• where FreeText("test") and modified<"2000"

Order by

• order by <field>, e.g. "title"

Create

• Create database test on "c:test"
• Create scope("/tarchon/net/myplane")
• Create scope("/data/myfolder")
• Create scope("/tarchon/filter/myfilter")

Update

• Update database (recreate indexes and state of database)
• update data (incremental update of indexes)
• update scope("/tarchon/net/meta/data") set language = "/tarchon/language/dutch"
• update scope("/data") set language = "/tarchon/language/dutch"
• update scope("/data/myfolder") set gp_use_goldpan = "/tarchon/filter"
• update scope("/tarchon/filter/myfilter") set filterexample=URIEncode("datamydir24.html")

Insert

• insert into scope("/tarchon/net/phrases/") (source,relation,destination) values ("pietje bel","contains","pietje") {adding links}
• insert into scope("/tarchon/filter/myfilter") (filterexample) values (URIEncode("datamydir24.html"))

Limit

• select * from scope("/data/") limit 100
• select * from scope("/data/") limit 90,10

Maxtime

• update database maxtime 5000 (5000 ms)


Query samples

select * from scope("/tarchon/fuzzy/")

select * from scope("/tarchon/net/cooc/")

select * from scope("/tarchon/net/typos/")

select * from scope("/tarchon/net/stem/")

select * from scope("/data/")

select * from scope("/data/news/")

select * from scope("/data/") where freetext(default, "hallo");

select * from scope("/data/") where freetext(/tarchon/pipe/default, "hallo");

select * from scope("/data/") where freetext(/user/pipe/mypipe, "hallo");

select * from scope("/tarchon/index/")

select * from scope("/tarchon/index/system/")

select * from scope("/tarchon/index/dictionary/")

select * from /tarchon/language/dutch/stopwords

select * from /tarchon/language/dutch/dictionary

create scope("/tarchon/net/myplane")

insert into scope("/tarchon/net/phrases/") (source,relation,destination) values ("pietje bel","contains","pietje")

select * from scope("/tarchon/net/phrases/") where concept="pietje bel"

update scope("/tarchon/net/meta/data") set language = "/tarchon/language/dutch"

select * from scope("/tarchon/net/meta/concepts") where name="/data"

select * from scope("/tarchon/net/phrases/concepts/")

select * from scope("/tarchon/net/phrases")


Database schema tree

The Tarchon database contains the following folders:

• /data/

• /data/filter

• /data/filter/myfilter/example

• /meta/

• /tarchon/

• /tarchon/net/

• /tarchon/index/

• /tarchon/index/dictionary

• /tarchon/language/

• /tarchon/language/dutch/

• /tarchon/language/english/

• /tarchon/language/german/

• /tarchon/fuzzy/

• /tarchon/pipe/

• /tarchon/analyzer/

• /tarchon/analyzer/cooc

• /user/

Spider

The default spider is activated for URLs in the /data/ area.

The default spider settings are:

• spider_depth = 0

• spider_width = 0

• spider_fetchers = 25

• spider_useragent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"

• spider_accept = "text/html"

To change these settings, create a new spider with create scope("/tarchon/spider/myspider") and set the properties of the new spider with:

update scope("/tarchon/spider/myspider") set spider_depth = 1, spider_width = 0, spider_fetchers = 25, spider_useragent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)", spider_accept = "text/html"

Connect the new spider to /data/ with a web_spider relation:

insert into scope("/data/") ("web_spider") values ("/tarchon/spider/myspider")


Visualization

The dynamic visualization tools of the Infolution user interface are truly innovative. If a user searches for "economics", the engine finds the conceptual context of this search and gives the user a number of intelligent alternatives and related queries.

The context of the search results is shown in an intuitive three-dimensional concept space, in which more important concepts appear to be closer to the user than less important ones. Using visual cues like mouse-over effects and animation, the system makes clear exactly how and why certain concepts appear. Since users prefer to click rather than type to refine search criteria this visualization method appeals to many types of users, both experts and non-experts.

One of the great, unsolved business problems of today's wired world is the inability to gain access to information in an easy, quick and interactive manner. Infolution has been rethinking the traditional methods we have used to explore information, using viewing techniques that allow both breadth and depth of insight.

Infolution provides a hands-on way to visualize networks of interrelated information. Networks are rendered as interactive graphs, which lend themselves to a variety of transformations. By engaging their visual image, a user is able to navigate through large networks and to explore different ways of arranging the network's components on screen. Every word on the screen is clickable, placing the user in full control to steer answers in any direction or change the focus (storyline) of the answer.

The Infolution engine delivers contextual information in a special XML standard known as topic maps, or XTM. The portal-based solution uses XSLT style sheet transformations to present a comprehensive information landscape to the user. The engine is compatible with all topic-map-based visualization systems. Infolution provides three user-selectable visualization methods with its solutions.


Text based: topic map concepts are displayed in one or more named relationship lines. Concepts provide hyperlinks to the query engine.

Graphical visualization is provided with two client-based visualization systems. These are provided as plug-ins: ActiveX controls, Macromedia Flash or Java applets. Both interactive graphical viewers allow high-bandwidth user interaction.

The associative nature of a network makes remembering its structure surprisingly easy, but it is the experience of seeing a series of recurring stable visual images that really gives a boost to the user‘s memory. The ability to create and navigate these stable images is what makes both the Information Canopy and TouchGraph special, and it is also the key to empowering both the designer and the user.

Visually navigating through a network is inherently a dynamic process, and steps are taken to keep the user feeling oriented and in control. The Information Canopy and TouchGraph achieve this by keeping the graph looking as static as possible, and more importantly, by making sure that dynamic changes are predictable, repeatable, and un-doable.

The Information Canopy and TouchGraph are information visualization systems for large information spaces, using computer graphics and 3D interactive animation. They stimulate the recognition of patterns and structure by exploiting the human perceptual system. These dynamic visualization tools show appropriate contexts for the results and the synthesized documents.


The Information Canopy applet was developed by Q42 to provide a lightweight solution suitable for narrowband communication lines. It presents concepts in a free-flowing space. The Information Canopy uses biased random placement algorithms and non-distorting fish-eye view methods to make browsing information domains intuitive.

TouchGraph provides a hands-on way to visualize networks of interrelated information. TouchGraph is based on the original spring applet by Sun and uses a more structured visualization algorithm, with lines drawn between concepts. TouchGraph is being developed in an open source environment in order to expedite the maturity of this technology and to allow it to be used as broadly as possible.

Visualization provides the user of Infolution software with considerable advantages:

• Presenting results in an intuitive way, combining visual concept clustering with synthesized abstracts of those concepts.

• Keeping an eye on the task: increased interaction bandwidth through information visualization narrows the gap between searching and browsing and diminishes the attention-span loss of (knowledge) workers.

• Embedding searching in a larger task: extracting and structuring knowledge fluidly.

Features:

• Visual display of term relationships improves browsing and searching of full-text information.

• Exploitation of the human perceptual system by showing context, relations and relevance. Information visualization increases the bandwidth of access to the brain and leverages human capabilities.


System Requirements

• Personal computer with at least a Pentium 233 MHz processor. A Pentium III or higher is recommended.

• Microsoft Windows 2000 with Service Pack 3 installed or Windows XP Pro.

• 64 MB RAM (256 MB RAM is recommended) for the operating system and an additional 16 MB RAM for each application that runs simultaneously

• 100 MB hard disk space plus 25 MB on the boot disk

• Client components require 1 MB for a typical installation and 7 MB for a maximum installation.

• Server components require 18 MB for a typical installation and 100 MB for a maximum installation.

• Internet Explorer (version 5.5 or higher) requires an additional 43 MB of hard disk space for a typical installation and 59 MB for a maximum installation.

• Internet Information Services (IIS) installed. These services are part of Windows 2000 and Windows XP Professional.

• CD-ROM drive

• Super VGA (800x600) or higher resolution monitor with at least 256 colors.

• Microsoft mouse or compatible pointing device.


Infolution bv
Schouwschuit 37
1613 CJ Grootebroek
The Netherlands

Phone: +31 (0) 228 52 40 70 Fax: +31 (0) 228 52 40 74

www.infolution.com