Enterprise Search Myths and Reality - Attivio White Paper

Enterprise Search Myths and Realities

Myths in Enterprise Search

Many people think that since enterprise search has been around for years, all its issues should have

been solved and its complexity dramatically reduced. In other words, it ought to be a commodity by

now. Although search technologies continue to advance, the origin of the complexity is the nature of

language itself – full of ambiguity, contradiction, multiple meanings, and many contexts. The digital

search for what you want to know involves a balance between precision (the measure of the usefulness

of a result) and recall (the measure of the completeness of the result). Increasing precision without

sacrificing recall is a complicated balance to strike, and the commoditization of this balance is one of the

myths not currently in line with reality.
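
As a minimal illustration of the two measures, the sketch below computes precision and recall for a single query. The document IDs and the set of relevant documents are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole corpus.
retrieved = ["d1", "d2", "d3", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved, relevant)
```

Returning more documents would tend to raise recall while lowering precision, which is the balance described above.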

COMMON SEARCH MYTHS

In this document we uncover and correct some of the other myths, assumptions, and misconceptions

about enterprise search that prevail in the market today.

Myth

Web search and Enterprise search are more or less the same.

Reality

This is the assumption behind the statement, “Why can’t I get a Google interface for my company?” The

quick answer is, “because you don’t actually want it.” The more appropriate question is, “Why can I find

what I’m looking for on Google more easily than I can on my own corporate website?” That is indeed a

problem, and it happens, ironically, when people try to solve it from a Web search perspective.

To Google’s credit, they have been successful in forcing the enterprise search market to realize that ease

of use for the end user (and ease of use in general) should be given the serious attention it deserves.

Agreed. But the demands of the enterprise user are different and the complexities of source content are

greater than in a public search of the Web.

Users expect a better answer to their question because they are more intimate with their content. They

are not looking for the most popular answer – the general logic behind ranking for Web search – they

want the right answer. Consequently, the ranking model for the enterprise is much more complex than

for the Web. It must consider several parameters in its balance of demand between precision and recall

(term frequency, source freshness, spatial proximity, authority, etc.), a balance that changes from

application to application.
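
One simple way to picture a ranking model whose balance changes per application is a weighted sum of signals. The sketch below is illustrative only: the signal names come from the list above, while the documents, signal values, applications, and weights are all assumptions.

```python
# Hypothetical, normalized signal values (0.0 to 1.0) for two documents.
# The signal names come from the text; the numbers are invented.
documents = {
    "policy_memo": {"term_frequency": 0.9, "freshness": 0.2, "proximity": 0.8, "authority": 0.9},
    "news_flash": {"term_frequency": 0.6, "freshness": 1.0, "proximity": 0.5, "authority": 0.3},
}

# Two assumed applications that weight the same signals differently.
weights_by_application = {
    "legal_research": {"term_frequency": 0.3, "freshness": 0.1, "proximity": 0.2, "authority": 0.4},
    "news_monitoring": {"term_frequency": 0.2, "freshness": 0.6, "proximity": 0.1, "authority": 0.1},
}


def score(signals, weights):
    """Weighted sum of the signals: one simple way to combine them."""
    return sum(weights[name] * signals[name] for name in weights)


def rank(application):
    """Return document names ordered from highest to lowest score."""
    weights = weights_by_application[application]
    return sorted(documents, key=lambda name: score(documents[name], weights), reverse=True)
```

With these assumed weights, rank("legal_research") puts the authoritative memo first, while rank("news_monitoring") puts the fresher item first, even though the documents and signals are identical.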

Now add to this greater complexity. The reasons for searching are much more varied. Data types are

much more varied. Data freshness (how soon does a new document appear in my search?) is more

important. Security is an issue. A typical enterprise supports more content types, formats, and security

layers than the entire Web.

Finally, search in the enterprise is often in context of a specific solution, for example uncovering legal

risk, assessing product campaigns, or buying goods online. While it is popular to start a request for

information with a search query, unlike the Web, there is an expectation that further investigation will

rely on refinement techniques that involve navigating through supplemental information such as facets

and concepts. This more exploratory model, visualized through navigators, tag clouds, heat maps, and

the like, is commonplace in enterprise search but not on the Web.

Myth

The Web has more content than any enterprise, so Web search companies are the real experts on

search.

Reality

As of summer 2008, Google’s index was just under 21 billion web pages and growing. Yahoo’s was

actually higher at around 55 billion [1]. If you take an average page size of 200 bytes [2], then Google’s index

was about 4.2TB and Yahoo’s about 11TB. While this is large, and perhaps larger than many

enterprise search implementations, there are many, many enterprise search implementations in the

tens and hundreds of terabytes, and a few now in the petabyte range. The assumption might be that the

corpus is larger because the average document size is larger. That is true for some implementations, but others

contain just email.
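
The size estimates above follow from simple multiplication. The sketch below reproduces them from the page counts and the 200-byte average page size quoted in the text, assuming a decimal terabyte (10^12 bytes); the function name is hypothetical.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabyte


def index_size_tb(pages, avg_page_bytes):
    """Estimated index size in terabytes: page count times average page size."""
    return pages * avg_page_bytes / BYTES_PER_TB


# Figures quoted in the text for summer 2008.
google_tb = index_size_tb(21 * 10**9, 200)
yahoo_tb = index_size_tb(55 * 10**9, 200)
```

The estimate scales linearly with the assumed page size, so a different average would change both totals by the same factor.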

In any case, when it comes to scale, it is clear who has the most incentive for efficiency. The Web

search companies purchase and host their own hardware. Enterprise search vendors must convince

their customers to buy the hardware themselves [3].

Myth

My relevancy model is better than your relevancy model.

Reality

Search technology, in some form or another, has been around since Lexis-Nexis first commercialized it in

the 1970s. In the 1990s, Google became the catalyst for carving out the web search business as a

separate entity, introducing ranking algorithms based on website popularity. It also legitimized the

importance of search as a strategic asset in the enterprise.

Until then there was little distinction between Web and enterprise search. All the models were more or

less the same, based on matching words typed into the search box against words in documents residing

in the index. But now that there was real money to be had, enterprise search vendors turned up the

volume on competitive differentiation by touting the superiority of their relevancy models.

In hindsight, this was a surprising tactic because the math behind relevancy is quite complex and hardly

the stuff of debate for the search customer: keyword search, conceptual search, semantic search, scope

search, TF/IDF, Bayesian probability, Boolean filtering, query expansion, NLP, text mining, and so on.

One vendor proclaims that enterprise search is not rocket science (or brain surgery – it’s

interchangeable); enterprise search is harder. Another vendor simply tells its customers it’s too

complex, so “leave it to the experts who know how to do this.”

Yet, today, as in Time Before Google, the majority of customers are still not satisfied with the results

they get from their search engines. In their minds the quality has not really changed. You still hear, “Why

can’t I find information in my company as easily as I can find it on Google?”

The suggestion that one model is better than another is not so much wrong as moot. Every

vendor’s model is good for some content, just not for all content. What makes a good enterprise search

vendor is its ability to adapt to the context and character of the content and application requirements

by deploying an optimal combination of all these different approaches.

Finally, perhaps we’re all arguing about the wrong thing. Relevancy is important, but what of the user’s

experience? Is the search technology touching the complete information landscape or just part of it?

How are navigation and exploration accomplished? Does the technology act on the results, i.e., connect in

to business operations and trigger action? There is more to information access than search, and there is

more to search than relevancy.

Myth

Manual facet management is easy and quick and offers a good user interface.

Reality

Facets, or dimensions, of results of a search can be used to help navigate to related information. The

conventional approach to facet management requires defining the facets before indexing as part of the

search platform’s configuration. Some vendors provide a well-designed user interface to make the

process as easy as possible. A typical example might be the organization of facets on an electronic retail

outlet’s ecommerce website. For productType = 'computer', I can declare my facets in this

order: price, make, CPUs, memory, storage, slots, monitor. For laptops, I would add a piece of logic that

says if portable='yes' then display the weight facet.
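
The manual declaration described above might look like the following sketch. The facet order and the portable rule come from the text; the dictionary layout and the function name are assumptions, not any vendor's configuration format.

```python
# Facets declared by hand for one product type, in display order
# (the order given in the text).
FACETS_BY_PRODUCT_TYPE = {
    "computer": ["price", "make", "CPUs", "memory", "storage", "slots", "monitor"],
}


def facets_for(product_type, portable=False):
    """Return the facets to display for a product type.

    Adds the weight facet for portable products, mirroring the
    "if portable='yes'" rule described above.
    """
    facets = list(FACETS_BY_PRODUCT_TYPE.get(product_type, []))
    if portable:
        facets.append("weight")
    return facets
```

Every additional product type needs another hand-written entry and its own special cases, which is where the maintenance effort comes from.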

The manual approach is not a problem if your objects have a fairly uniform structure (e.g. books). You

can have millions of them, but the key is they are all described the same way. But imagine a national

retail outlet whose product catalog contains several hundred thousand different products. And further,

imagine the catalog changes fairly constantly. The manual process is now a half-year project and the

change management a major ongoing commitment.

A system that recommends the facets to you automatically and on the fly for each query would remove

all this work. The logic behind the ranking is not unlike the ranking for search results, involving a number

of calculations to arrive at a composite score (e.g. sparse matrix analysis, clustering, facet distribution,

etc.). The algorithms should be smart enough to avoid situations where no facets appear because none

are relevant enough to display (e.g. sparse-matrix analysis alone). This can happen with content that has

minimal facet intersection.
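
To make the idea concrete, here is a deliberately simplified sketch of per-query facet recommendation. It is not the algorithm of any particular product: it uses only a distribution-style score (coverage times normalized entropy of the facet's values) and adds a fallback so that at least one facet is shown when none clears the threshold. All names, data, and the threshold value are assumptions.

```python
import math
from collections import Counter


def facet_score(results, facet):
    """Score a facet by coverage times the normalized entropy of its values.

    Coverage: fraction of results that carry the facet.
    Entropy: how evenly the values split the results (0 when all equal).
    """
    values = [r[facet] for r in results if facet in r]
    if not values:
        return 0.0
    counts = Counter(values)
    if len(counts) == 1:
        return 0.0  # a single value cannot narrow the results
    coverage = len(values) / len(results)
    total = len(values)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return coverage * entropy / math.log2(len(counts))


def recommend_facets(results, top_k=3, threshold=0.5):
    """Return facets above the threshold, or the best one if none qualify."""
    facets = {f for r in results for f in r}
    ranked = sorted(facets, key=lambda f: (-facet_score(results, f), f))
    chosen = [f for f in ranked if facet_score(results, f) >= threshold][:top_k]
    return chosen or ranked[:1]


# Hypothetical result set for one query.
results = [
    {"make": "Acme", "memory": "8GB", "color": "black"},
    {"make": "Acme", "memory": "16GB", "color": "black"},
    {"make": "Bolt", "memory": "8GB", "color": "black"},
    {"make": "Bolt", "memory": "16GB"},
]
recommended = recommend_facets(results)
```

Here the color facet is dropped because every result shares the same value, while the fallback in recommend_facets covers the case where no facet scores well enough to display.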

Myth

A simple database query is all a search engine needs to extract content from a database. Only the

relational database can truly support ad hoc structured querying.

Reality

We challenge this assumption by comparing both relational and search engine technologies, with the

goal of proposing a hybrid solution that reflects the advantages of both. Let’s take a look at the

relational model first.

The relational model, you may recall, was originally designed for managing the transactional integrity of

inputting information and for its efficient retrieval through predefined, repetitive reporting. It works

because the database schema is designed specifically for the structure of the data and the shape of the

reports.

But the market began demanding a more ad hoc approach to querying its data for what-if analysis and

general exploration. In this situation, the query is no longer repetitive or known in advance, and

therefore cannot be planned for in the database engine.

The ad hoc query does not sit well with the basic relational model. Any relationship created a priori (all

relationships in a database schema) will bias for some queries and against others. Since you do not know

the query being asked, you do not know which side it will fall on. It is quite possible to create a “killer

query” that brings the database engine to a screeching halt.

Attempts to solve this problem have resulted in a continuous evolution of the relational model, twisting

it in various ways to provide better performance and greater flexibility (more ad hoc). Technologies

include data marts and star schemas, software and hardware data warehouses (e.g., Teradata, Netezza),

cubes, and vertical indexing technologies (e.g., Vertica, Sybase IQ).

The underlying problem is still there, however. All these technologies still view the problem from a

traditional table-column-relationship point of view, and this is inherently limiting. It does not mean we

abandon SQL or the need for the relational model to manage transactional data entry and fixed

reporting, but it does suggest we should rethink how the basic engine works for the optimization of

rapid, high volume, ad hoc information retrieval.

What might this new engine look like? Search indexes provide an interesting approach. They are

certainly designed for this type of problem. Google, for example, responds to millions of queries a day,

searching through billions of documents, each query taking less than a few seconds to respond. No

database technology comes close to this type of performance.

But then Google does not have to deal with cardinal relationships. It does not have to support the SQL

JOIN statement. The JOIN statement is the cornerstone of both reporting and ad hoc querying. For

example, we may have a hundred invoices for a customer. In a relational database, that amounts to

101 rows in two tables: one row for the customer and a hundred for the invoices. The customer data is stored once but

referenced a hundred times. If we want to return all the invoices for a particular customer, or all the

customers with invoices greater than a certain amount, the JOIN statement is used to exploit the

relationship between the customer and invoice tables.
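
The customer and invoice example can be sketched with Python's built-in sqlite3 module. The table names, column names, and amounts below are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
""")

# One customer, stored once, referenced by a hundred invoices.
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.executemany(
    "INSERT INTO invoice (customer_id, amount) VALUES (?, ?)",
    [(1, 100.0 * n) for n in range(1, 101)],
)

# All invoices for a particular customer.
invoices = conn.execute(
    "SELECT i.id, i.amount FROM invoice i "
    "JOIN customer c ON c.id = i.customer_id WHERE c.name = ?",
    ("Acme Corp",),
).fetchall()

# All customers with at least one invoice above a given amount.
big_spenders = conn.execute(
    "SELECT DISTINCT c.name FROM customer c "
    "JOIN invoice i ON i.customer_id = c.id WHERE i.amount > ?",
    (5000,),
).fetchall()
```

Both queries rely on the same customer_id relationship; neither had to be anticipated when the tables were designed.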

The approach conventional search vendors use to extract content from a relational database is to

execute a SQL query against the database, returning a result set of uniform shape that is then indexed. If

different data or a different result set is requested, a new query is defined, the search index is

reconfigured, and the content is re-indexed. It works this way because search technologies simply do not

understand the relational concept.

There are many problems with this model. First, the data is “flattened”, meaning all cardinality is

removed by repeating content in each result set row. In our example, a flattened result set would

include the customer’s properties in every invoice. An updated customer record would require an

update to each one of its invoices in the search index.
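
The cost of flattening can be shown in a few lines. In this sketch the field names and values are hypothetical, and a plain list of dictionaries stands in for the search index.

```python
customer = {"customer_id": 1, "customer_name": "Acme Corp", "city": "Newton"}
invoices = [{"invoice_id": n, "amount": 100.0 * n} for n in range(1, 101)]

# Flattening: one index document per invoice, each carrying a full
# copy of the customer's properties.
flattened = [{**customer, **invoice} for invoice in invoices]


def update_customer_city(index_docs, customer_id, new_city):
    """Rewrite every flattened document for the customer; return the count."""
    touched = 0
    for doc in index_docs:
        if doc["customer_id"] == customer_id:
            doc["city"] = new_city
            touched += 1
    return touched


# A single change to the customer record touches every invoice document.
touched = update_customer_city(flattened, 1, "Boston")
```

One logical update to the customer becomes a hundred index updates, and the number grows with every invoice added.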

Second, there are no real ad hoc capabilities here. You must know beforehand how your users will

explore the database content because you have to predetermine the shape of the results. But often you

don’t know what your next question will be until you see the answer to the first.

Finally, it is now impossible to JOIN content as a database engine does. A JOIN is not like a search; it is a

true Cartesian product of results between two sets of content that share a common property value.

This does not need to be so. The rapid, high volume, pure ad hoc querying capability of the search index

is still valid, but the architecture needs to be enhanced to retain the integrity of the cardinal relationship

from the database source. If the search engine were augmented to ingest each table’s rows individually

for all the tables in the database, then it would be possible (with some clever work on the vendor’s part)

to support a JOIN statement executed on the fly at query time.
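
A rough sketch of what a query-time JOIN over individually ingested rows could look like follows. It is an assumption-laden illustration, not a description of any product: a list of dictionaries stands in for the index, and a basic hash join pairs the documents.

```python
from collections import defaultdict

# Each table row is ingested as its own index document, tagged with its table.
index = [
    {"table": "customer", "id": 1, "name": "Acme Corp"},
    {"table": "customer", "id": 2, "name": "Bolt Ltd"},
    {"table": "invoice", "id": 10, "customer_id": 1, "amount": 250.0},
    {"table": "invoice", "id": 11, "customer_id": 1, "amount": 9000.0},
    {"table": "invoice", "id": 12, "customer_id": 2, "amount": 40.0},
]


def search(table, predicate=lambda doc: True):
    """Stand-in for an index query: filter documents of one table."""
    return [doc for doc in index if doc["table"] == table and predicate(doc)]


def join(left_docs, right_docs, left_key, right_key):
    """Hash join at query time: pair documents whose key values match."""
    by_key = defaultdict(list)
    for doc in right_docs:
        by_key[doc[right_key]].append(doc)
    return [
        (left, right)
        for left in left_docs
        for right in by_key.get(left[left_key], [])
    ]


# Customers with an invoice greater than 1000, decided at query time.
pairs = join(
    search("customer"),
    search("invoice", lambda doc: doc["amount"] > 1000),
    left_key="id",
    right_key="customer_id",
)
names = sorted({customer["name"] for customer, _ in pairs})
```

Because the customer row is stored once, updating it touches a single document, and the shape of the result is chosen by the query rather than at indexing time.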

By the way, because the index contains both structured and unstructured content, the JOIN could be

between a table, email, and a set of documents. Further, since this is a search environment, “fuzzy

JOINs” are possible that capitalize on standard capabilities such as spell correction and synonym

expansion.
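
A "fuzzy JOIN" could be approximated by normalizing the join key before matching. The sketch below uses a small hand-made synonym table and Python's difflib for spelling tolerance; the data, the synonym entries, and the 0.8 similarity cutoff are all assumptions.

```python
import difflib

# Assumed synonym table mapping variants to a canonical key.
SYNONYMS = {
    "intl business machines": "ibm",
    "international business machines": "ibm",
}


def normalize(value, vocabulary):
    """Map a raw key to a canonical one via synonyms, then close spelling."""
    key = value.strip().lower()
    key = SYNONYMS.get(key, key)
    if key in vocabulary:
        return key
    close = difflib.get_close_matches(key, vocabulary, n=1, cutoff=0.8)
    return close[0] if close else key


# A structured table and a set of emails, joined on company name.
accounts = [{"company": "ibm", "owner": "Dana"}, {"company": "attivio", "owner": "Lee"}]
emails = [
    {"subject": "Renewal", "company": "International Business Machines"},
    {"subject": "Demo follow-up", "company": "Ativio"},  # misspelled on purpose
    {"subject": "Cold lead", "company": "Unknown Co"},
]

vocabulary = [row["company"] for row in accounts]
owner_by_company = {row["company"]: row["owner"] for row in accounts}

matches = []
for email in emails:
    key = normalize(email["company"], vocabulary)
    if key in owner_by_company:
        matches.append((email["subject"], owner_by_company[key]))
```

The first email joins through a synonym and the second through spelling correction; the third has no plausible match and is left out.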

NEW DEVELOPMENTS IN SEARCH: INFORMATION ACCESS

While development in enterprise search and web search continues, a new category called unified

information architecture (UIA) is beginning to gain market traction. Unified information architecture

extends enterprise search capabilities across all types of documents, data, and media. This expanded

scope replaces legacy enterprise search, offering all its functionality and combining simple access to

data and media. The advantages include being able to assemble all relevant information with one query;

connecting content and related data; and searching data with a simple search query instead of a

structured query language and formal reports. UIA can co-exist with search or replace it outright. For

more information about UIA, please visit www.attivio.com.

ABOUT ATTIVIO

Attivio’s Active Intelligence Engine® (AIE) redefines the business impact of our customers’ information

assets, so they can quickly seize opportunities, solve critical challenges and fulfill their strategic vision.

Attivio correlates disparate silos of structured data and

unstructured content in ways never before possible.

Offering both intuitive search capabilities and the power

of SQL, AIE seamlessly integrates with existing BI and big

data tools to reveal insight that matters, through the

access method that best suits each user’s technical skills

and priorities. Please visit us at www.attivio.com.

Attivio, Inc. • 275 Grove Street • Newton, MA 02466 USA

o +1.857.226.5040 • f +1.857.226.5072 • [email protected]• www.attivio.com

© 2013 Attivio, Inc. All rights reserved. Attivio, Active Intelligence Engine, and all other related logos and product names are registered trademarks of Attivio. All other company, product, and service names are the property of their respective holders.

FOOTNOTES

1. Source: http://www.worldwidewebsize.com/.

2. Source: http://www.websiteoptimization.com/speed/tweak/average-web-page/.

3. Although adopting a SaaS model gets around this, the cost is still there. It’s just buried in the monthly fee.
