12 Things the Semantic Web Should Know about Content Analytics

Seth Grimes, Alta Plana Corporation June 2011 | Sponsored by OpenText

Abstract

Content analytics is sense-making technology. It semanticizes online, social, and enterprise content. It facilitates semantic data integration, search, and information management and is an underappreciated foundational technology for building the Semantic Web. Technologists and business leaders alike will benefit from understanding the role content analytics plays in semantic computing, starting with 12 essential points.

Contents

Introduction
The Semantic Web and Content Analytics
1. Entity extraction is a form of content analytics
2. There are more entities than are dreamt of in DBpedia, Freebase, Reuters.com, and the like
3. Content analytics discovers, annotates, and extracts the broad range of information in content, far beyond entities
4. Content analytics handles subjectivity: Sentiment, opinion, and emotion
5. Content covers more than just text managed in a content management system and published to the web
6. Content analytics is part of a collection of complementary and overlapping analytical technologies
7. Content analytics generates semantic and structural metadata
8. Content analytics facilitates semantic search and semantic data integration
9. Content analytics scales from individual messages to wide data spaces and large corpora
10. Content analytics can operate in real time for a wide variety of business goals and business domains
11. Content analytics is delivered installed, on the cloud, and as-a-service: Your choice
12. Content analytics can be customized, extended, and configured via inclusion of controlled vocabularies, taxonomies, and ontologies
Conclusion

Introduction

Semantic computing exploits machine-represented meaning to enhance search, data integration, knowledge management, and information-centered business processes. The ultimate goal is to enable automated knowledge discovery and business-process execution across a linked data web. However, this goal will not be reachable in any meaningful sense unless and until a broad set of information-rich endpoints is available for major business and personal purposes. These Semantic Web endpoints – triple stores that capture entities and relationships, supporting distributed query and inference – and other forms of semantically annotated content aren’t instantiated and populated by some magical process. They must be created.

The creation of meaning – the generation of structured information from “unstructured” sources – is the province of content analytics. Content analytics, together with modern applications that couple content production with annotation and with efforts to map databases into linked-data repositories, is among the foundational technologies that facilitate semantic computing and populate the Semantic Web.

So long as the Semantic Web lacks a critical mass of usable data from online, social, and enterprise sources, it will have form but not function. The core Semantic Web technologies, a stack of standards and protocols, are not enough on their own. The Semantic Web and broader semantic computing need data, yet almost no historical information, and very little of the information being produced today, is in semantic formats. Content analytics can extract semantics from that mass of “unstructured” information to provide semantic structure. By semanticizing the range of existing content, content analytics can and will fuel the realization of the Semantic Web.

The Semantic Web and Content Analytics

Despite its very important (and as yet mostly potential) Semantic Web role, and despite the business value content analytics delivers today, the technology, its solutions, and its broader applications are not sufficiently well understood; hence this paper, 12 Things the Semantic Web (and Semantic Computing Practitioners) Should Know about Content Analytics. Let us start with a fundamental point:

1. Entity extraction is a form of content analytics

Entities are concrete things, often named in some form of lexicon; for example, people (Thor, Barack Obama), companies (IBM, General Motors), places (Paris, Canada), events (the World Series), enzymes (hexokinase), and even research papers (“The Unreasonable Effectiveness of Data”). Entity extraction is a process that starts by finding entities in source materials, whether web pages, email, audio streams, images, or some other material of interest. Once discerned, the entity is disambiguated (Is “Ford” a car, an industrial company, an actor [which?], a theater, or a place to cross a river?), then typed (Person, Organization, etc.), and (perhaps) mapped into a canonical form according to a controlled vocabulary. It may be designated with a uniform resource identifier (URI) that facilitates associating diverse information with the source material.
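By way of a minimal sketch (an open-source illustration, not any particular vendor's implementation), the spaCy library can perform the find-and-type portion of this pipeline; the example assumes spaCy and its small English model are installed. Disambiguation, canonical mapping, and URI assignment would require additional, context-aware steps.

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a statistical named-entity recognizer

text = ("Ford and General Motors briefed analysts in Paris ahead of "
        "the World Series broadcast negotiations with IBM.")

doc = nlp(text)
for ent in doc.ents:
    # ent.text is the surface string; ent.label_ is the predicted type (PERSON, ORG, GPE, EVENT, ...)
    print(f"{ent.text:20s} -> {ent.label_}")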

Entity extraction is a form of content analysis. It involves reaching into the content, whatever its form, and understanding the inherent structure that is apparent to any educated human reader: the “chunks” into which text and other content are separated, the word morphology, grammar, and larger-scale structure that humans grasp without conscious reflection. The parsing steps may seem simple, but tasks such as disambiguation, which entails consideration of context and usage, decidedly are not. Vikings in a sports article are different from Vikings in a history text; beyond document type, the word sequence “the Vikings lost their fourth straight game” tells us which sense of Vikings is in play. Yet –

2. There are more entities than are dreamt of in DBpedia, Freebase, Reuters.com, and the like

Common entity sources do not cover all business, scientific, news, or cultural domains. An entity annotation service designed foremost for financial news sources won’t help you much with laboratory science or with understanding Iraqi Arabic blog chatter.

Content analytics tools support a variety of techniques that allow you to go beyond the common sources. Tools may allow you to import and apply your own lexicons and taxonomies, and they may infer new entities via syntactic analysis and machine learning (techniques that decode grammar and apply pattern analyses to build or expand a list of features of interest). Further, content analytics may resolve anaphora (pronouns) and other forms of co-reference, accepting different ways of referring to a single thing. The application of natural-language processing helps us understand that in the text –

“Sarkozy's desire to become the new President's main international partner – and, indeed, personal friend – was palpable. Consequently, the famously passionate and emotive Frenchman responded to Obama's reserved personality…”

– “the new President” is Obama and “the famously passionate and emotive Frenchman” is Sarkozy. But entities are not all that content analytics can find.
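Returning to the custom-lexicon capability described above, here is a hedged sketch using spaCy's rule-based EntityRuler to declare domain entities (a hypothetical enzyme vocabulary) that a general-purpose, news-trained model would not recognize:

import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule-based entity recognizer ahead of the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "ENZYME", "pattern": [{"LOWER": "hexokinase"}]},                  # entries from a custom lexicon
    {"label": "ENZYME", "pattern": [{"LOWER": "atp"}, {"LOWER": "synthase"}]},
])

doc = nlp("Hexokinase phosphorylates glucose in the first step of glycolysis.")
print([(ent.text, ent.label_) for ent in doc.ents])   # Hexokinase now surfaces as an ENZYME entity

Full co-reference resolution, as in the Sarkozy and Obama example, calls for dedicated components beyond this sketch.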

3. Content analytics discovers, annotates, and extracts the broad range of information in content, far beyond entities

RDF schemas capture relationships among entities: FriendOf, EmployedBy, OwnerOf, and so on; the lists are long, varying by data space. Entity relationships may be engineered in a top-down, prescriptive manner, or they may be mapped from sources such as relational databases that capture relationships. Wherever they originate, relationships are the key to knowledge and raw material for inference.

If your approach is to extract entities and restrict yourself to relationships expressed in ontologies or other knowledge repositories, you may be leaving vast amounts of valuable information unanalyzed. Source materials capture and express relationships. After all, a blog posting, a tweet, an article, an e-mail message, a video: every form of content was created to communicate. It would be silly to parse a news article and report that country X, person Y, and company Z were mentioned without also extracting the entity relationships present in the text.

Content may contain conventional data, and not just in marked-up data tables. Consider a sentence from a datelined article,

“The Dow Jones Industrial Average finished the trading day at 12,605.32, up 45.14 points (0.36 percent). The S&P 500 closed at 1,343.6, up 2.92 points (0.22 percent).”

Content analytics can extract this data, to RDF or to a database table, along with metadata such as the names of the article author and publication, the publication date, and the article’s URL, as well as other available information from HTML meta tags and page-embedded FOAF, RDFa, or other microformats. Content analytics can infer from the text –

“Among actively traded Colorado stocks, Accelr8 Technology Corp. (AXK)...”

– that the (possibly) named entities Accelr8 Technology Corp., AXK, and Colorado are related; sophisticated content analytics will ascribe the ticker symbol AXK to Accelr8 and capture that Accelr8 is located in the geographic area Colorado. Beyond these facts and relationships, strong content analytics will associate the conceptual class “stock market index” with the DJIA and S&P 500 and will associate topics such as “financial markets reporting” and themes such as “the economy” with the source article.
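As a rough, assumption-laden sketch of the “extract this data, to RDF or to a database table” idea, the snippet below pulls index names and closing values out of the market sentence quoted earlier with a regular expression and serializes them as triples using the rdflib library; the example.org namespace and property names are invented for illustration.

import re
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

sentence = ("The Dow Jones Industrial Average finished the trading day at 12,605.32, "
            "up 45.14 points (0.36 percent). The S&P 500 closed at 1,343.6, "
            "up 2.92 points (0.22 percent).")

# A deliberately simple pattern: an index name, a closing verb phrase, and a numeric value
pattern = re.compile(r"(The [\w&.\s]+?) (?:finished the trading day|closed) at ([\d,]+\.\d+)")

EX = Namespace("http://example.org/markets/")   # illustrative namespace, not a real vocabulary
g = Graph()
for name, close in pattern.findall(sentence):
    slug = name.replace("The ", "").replace("&", "and").replace(" ", "_")
    index_uri = EX[slug]
    g.add((index_uri, RDF.type, EX.StockMarketIndex))
    g.add((index_uri, EX.closingValue, Literal(close.replace(",", ""), datatype=XSD.decimal)))

print(g.serialize(format="turtle"))

A production pipeline would attach provenance as well: the author, publication, date, and URL metadata described above.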

How far beyond entities?

4. Content analytics handles subjectivity: Sentiment, opinion, and emotion

We can classify information as factual or as subjective. Attitudinal information – sentiment, opinions, emotions – is very important to business applications that include customer service and support, marketing, product and service quality, contextual advertising placement, and policy and politics. A business that is listening will pick up on tweets such as –

@robwolfeusa Wow, at #Hilton in Long Island. Exec floor room guaranteed not available and no rooms clean and available at 4:30PM.

– that indicate problems. Content analytics, in this instance, will understand what hotel property is being referred to, what the issue was, and who was posting (the potential often exists to match a social handle to a name or other identifying information and, from there, to actual business transactions); this facilitates processing and quick responses. This example looks at and matches individual records; content analytics is also applied to aggregate sentiment, classified by familiar categories such as location, age, and sex as well as by company-specific dimensions such as product and location.

This class of subjectivity analysis looks for the voice of the customer (or prospect, influencer, voter, patient, or market) as expressed online in blogs, forum postings, reviews, email, surveys, contact-center conversations, and a range of other feedback sources. It is sensitive to the identity of the person who is posting, to the needs of the person who may be consuming the information, to context, and to plans or intent captured in text. While subjective information often cannot be matched to particular persons, the benefits of knowing who is posting are prompting entity-analytics R&D into identity resolution based on clues found in text.
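By way of a small, hedged illustration (one freely available lexicon-based scorer, not the only or best approach), NLTK's VADER analyzer is designed for exactly this kind of short, noisy social text; the sketch assumes NLTK is installed and the vader_lexicon resource has been downloaded.

# pip install nltk ; then run once: import nltk; nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

tweet = ("Wow, at #Hilton in Long Island. Exec floor room guaranteed not available "
         "and no rooms clean and available at 4:30PM.")

scores = sia.polarity_scores(tweet)   # dict with 'neg', 'neu', 'pos', and a 'compound' score in [-1, 1]
print(scores)

# A listening application might route strongly negative posts, together with the extracted
# entities (brand, property, poster handle), to a customer-service queue.
if scores["compound"] < -0.3:
    print("Flag for customer-service follow-up")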

Our next point should be obvious by now:

5. Content covers more than just text managed in a content management system and published to the web

We have user-generated content online in the form of articles, blogs and comments, status updates, profiles, and forum postings. And certainly, we have content in the conventional sense, material that is created and published via formal, managed processes. But the content label also extends to email, corporate documents, and reports; to SMS/IM text, contact-center notes, and transcripts; and also, as mentioned, to audio streams, images, and video. This includes the above in original, as-created form and in derived (duplicated, quoted, sampled, distorted, and otherwise reworked) forms.

Consider rich-media content in particular. Content analytics solutions are already in use to search, analyze, and mine audio streams for contact-center applications, capable of searching not only on speech transcribed to text but on phonemes, the fragments from which speech is composed, with advanced abilities to distinguish among speakers in a conversation and to detect emotion. A consumer-grade camera’s ability to identify people within the photo frame and to detect whether a subject is smiling or blinking is content analytics; automated image recognition capabilities, and not just via externally applied tags, are advancing rapidly, as is the ability to decode image changes in a video stream.

Content analytics, whether coupled with (other) Semantic Web technology or operating independently, can be applied to the spectrum of information types across organizational barriers. Analytics, broadly drawn, provides the key.

6. Content analytics is part of a collection of complementary and overlapping analytical technologies

Analytics is the search for business insight in online, social, and enterprise data. Analytics comes in many forms, under a variety of names. Common to them all is that analytics transforms source data to derive business information that is stored in databases and communicated in the form of numbers, tables, charts, and visualizations.

Data mining discerns patterns in structured data, typically in databases, to produce predictive models suitable for classification, forecasting, and other functions. BI typically applies dimensional models to data and supports reporting and interactive data analysis, but it may also include predictive-model deployment and, in some instances, will subsume the data-mining process. Web analytics is not typically grouped under the BI umbrella, but it is BI, drawing from web server log files to mine behavior patterns from click-stream data, presented in familiar BI dashboards, reports, and charts and feeding data-mining processes that seek to model quantities such as website conversion (a fancy name for sales) and shopping-cart or session abandonment. Social-network analysis looks at the dynamic graph of connections and message propagation across social and enterprise platforms. Lastly, location intelligence is a special sort of BI with data types, structures, analysis, and presentation methods tailored for geospatial data.

These analytics variants operate on numerical, quantified data. Content analytics complements them, in some cases by extracting data (e.g., geographic locations and numbers from data tables) from textual sources and in other cases by using their capabilities for exploratory analysis of text-sourced information; for instance, when results are classified by geographic source or topic and rendered in a map, when they are presented in BI dashboards and charts, and when they are incorporated in predictive securities-trading models.
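For instance, here is a minimal sketch (with made-up records standing in for content analytics output) of how text-sourced results feed ordinary BI-style aggregation once they have been reduced to structured rows, using pandas:

import pandas as pd

# Hypothetical output of a content analytics step: one record per analyzed post
records = [
    {"region": "Northeast", "topic": "check-in", "sentiment": -0.62},
    {"region": "Northeast", "topic": "housekeeping", "sentiment": -0.35},
    {"region": "West", "topic": "check-in", "sentiment": 0.41},
    {"region": "West", "topic": "loyalty program", "sentiment": 0.10},
]

df = pd.DataFrame(records)

# Average sentiment by region and topic: the kind of table a BI dashboard or map layer would consume
summary = df.groupby(["region", "topic"])["sentiment"].mean().reset_index()
print(summary)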

But content analytics can do more than just quantify free-form sources, as shown in our next two points.

7. Content analytics generates semantic and structural metadata

Metadata is descriptive information. If we compare content to a letter, the writing and postmark on the envelope are metadata. Consider electronic examples: the values of the To, From, CC, Subject, and routing header fields of an email message; the author, file name, file type, last-saved date, title, language, and tags applied to a document; values annotated with web page META tags; and so on. Some of this metadata is structural, some of it is semantic.

The Dublin Core Metadata Initiative is perhaps the most prominent metadata-standards proponent, providing for natural-language and formal semantic shared vocabularies that facilitate interoperability (see http://dublincore.org/metadata-basics/). The natural-language processing (NLP) components of content analytics solutions can and do discern and extract metadata from free-form and semi-structured source materials, with the possibility of Dublin Core conformance and of meeting particular, situational needs by extracting advanced metadata such as topics and themes.

Content analytics tools will, depending on the provider and on the user’s needs, create and store an XML-/RDF-/FOAF-annotated version of source materials, extract information of interest to a file or database, or, when invoked as-a-service, return XML-, JSON-, or otherwise marked-up results.
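As a minimal sketch of the annotate-and-return-RDF pattern (vendor output formats differ; the document URI and field values here are invented), rdflib can express extracted metadata in the Dublin Core vocabulary:

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS

doc = URIRef("http://example.org/docs/market-report")   # illustrative document URI

# In practice these values would come from NLP-driven metadata extraction, not be hand-coded
g = Graph()
g.add((doc, DCTERMS.title, Literal("Markets close higher")))
g.add((doc, DCTERMS.creator, Literal("Staff reporter")))
g.add((doc, DCTERMS.date, Literal("2011-06-01")))
g.add((doc, DCTERMS.subject, Literal("financial markets reporting")))

print(g.serialize(format="turtle"))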

Here’s where we come to search and linking.

8. Content analytics facilitates semantic search and semantic data integration

Web pages annotated with concepts, topics, synonyms, and the like, and with key information content micro-formatted – this is Search Engine Optimization (SEO) – will be more directly accessible as search evolves into information access. For both web search and local enterprise search, that extracted information can be indexed as the basis for concept and faceted search (which are two varieties of semantic search), and for faceted navigation, where users and site visitors see results classified into high-level categories known as facets (facets may be predetermined or they may have been discovered in source materials via NLP and clustering).

Content analytics also enables similarity search, where we can search for documents, messages, or objects that are statistically or semantically similar to the one we’re viewing, and for similar searches, which are search queries similar to the one we have issued. Similarity measurement is useful beyond interactive search; for instance, for tracking the diffusion of content – messages, press releases, quotations, and so on – across news, social, and interpersonal channels, whether for media measurement, copyright enforcement, or research. Given content’s complexity, content analytics’ ability to “fingerprint” content and measure similarity is an asset in tracking efforts.
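One simple, hedged way to picture similarity search (a basic statistical technique, far short of a full content analytics fingerprint): represent documents as TF-IDF vectors and rank them by cosine similarity to the one being viewed, here with scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The Dow Jones Industrial Average finished the trading day higher.",
    "Hotel guests complained that executive-floor rooms were not available.",
    "The S&P 500 closed up as financial markets rallied.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

# Similarity of every document to the first one; higher means more alike
scores = cosine_similarity(matrix[0], matrix).ravel()
for text, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {text}")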

Lastly, while annotation is great for SEO and semantic search, it also facilitates data integration, also known as data fusion and record linkage. For Semantic Web applications, annotations would include URIs; for other applications, integration could be accomplished via other content-extracted key information.

Automatic summarization and abstracting also fall under the content analytics umbrella.

9. Content analytics scales from individual messages to wide data spaces and large corpora

Content analytics scales through the use of high-throughput technologies such as Hadoop and deployment on grid-based, scalable hardware. Further –

10. Content analytics can operate in real time for a wide variety of business goals and business domains

The choice of particular techniques and tools, where scalability, the need for speed, and other capabilities are concerned, will depend on the information sources, business goals, the types of insight sought, and the skills of the users. If the business need is for real-time news and social monitoring for brand and reputation management, security, or military intelligence, one class of solution will be in order, very different in application from a solution chosen to provide semantic search and navigation for an online commerce site.

Focusing on real-time capabilities and on the ability to handle noisy social text (replete with slang, idiom, misspellings, abbreviations, sarcasm, and the like), we see that content analytics’ capabilities are a neat complement to the structured Semantic Web, which would be hard-pressed to keep up with today’s flood of raw, chaotic information. The pairing of structured sources and ad-hoc analyses can be especially powerful.

11. Content analytics is delivered installed, on the cloud, and as-a-service: Your choice

Most members of the semantics community are familiar with a few as-a-service annotation services, accessible via web services APIs. They represent only the visible tip of a much larger, metaphorical content analytics iceberg. First, there are many more annotation services in the content analytics world, with capabilities that extend far beyond English-language entity analytics to encompass deep information extraction. The only barrier to their use in the semantics world and on the Semantic Web is lack of awareness. Further, content analytics is available on the cloud, in hosted form, or may be installed on your own hardware.
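The as-a-service pattern generally amounts to a single HTTPS call returning annotations as JSON. The sketch below uses the requests library against an entirely hypothetical endpoint; the URL, authentication scheme, request parameters, and response fields are invented, so consult a real provider's API documentation.

import requests

# Hypothetical endpoint and API key; real services differ in URL, auth, and payload shape
API_URL = "https://api.example.com/v1/annotate"
API_KEY = "YOUR_API_KEY"

text = "Among actively traded Colorado stocks, Accelr8 Technology Corp. (AXK) rose slightly."

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": text, "features": ["entities", "relations", "sentiment"]},
    timeout=30,
)
response.raise_for_status()

for entity in response.json().get("entities", []):   # field names are illustrative
    print(entity)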

12. Content analytics can be customized, extended, and configured via inclusion of controlled vocabularies, taxonomies, and ontologies

Analytics means flexibility: the ability to square formal methods and structures with ad-hoc, situational needs, and to rely both on shared, standardized resources and protocols and on proprietary assets and materials not yet brought into compliance with modern forms or into the Semantic Web.
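As a hedged sketch of plugging in your own vocabulary, the snippet below loads a SKOS thesaurus from a hypothetical local file with rdflib and collects its preferred labels, the sort of term list a content analytics engine can consume as a gazetteer or taxonomy.

from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("my_domain_thesaurus.ttl", format="turtle")   # hypothetical local SKOS file

# Collect preferred labels; an extraction pipeline could match these terms (and their
# skos:altLabel synonyms) in incoming text and tag matches with the concept URI
terms = {str(label) for _, _, label in g.triples((None, SKOS.prefLabel, None))}
print(sorted(terms)[:20])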

Conclusion

We have examined 12 Things the Semantic Web (and Semantic Computing Practitioners) Should Know about Content Analytics. But really, they reduce to a single paragraph:

Content analytics makes sense of the mess of content – of online, social, and enterprise text, and moving forward, of rich media including images, audio, and video – for purposes that extend to semantic data integration, search, and information management. Content analytics, by helping semanticize existing data, is a foundation technology for the Semantic Web and semantic computing. Content analytics is delivering business value today, complementing BI, web analytics, location intelligence, and predictive analytics. Prospective users can look to a variety of technologies and tools to find or craft a solution that best meets particular needs, whether for individual, embedded, or enterprise use. Given that hosted and as-a-service (as well as installed) options are available, getting started is not difficult; given the breadth of capabilities, standards adherence, and customizability, there are few adoption barriers. Semantics practitioners will readily see the value of the technology and will find it well worth trying.

Visit online.opentext.com for more information about OpenText solutions. OpenText is a publicly traded company on both NASDAQ (OTEX) and the TSX (OTC). Copyright © 2010 by OpenText Corporation. Trademarks or registered trademarks of OpenText Corporation. This list is not exhaustive. All other trademarks or registered trademarks are the property of their respective owners. All rights reserved. 11PROD0234EN

Seth Grimes is an analytics strategist with Washington, DC-based Alta Plana Corporation, founding chair of the Text Analytics Summit and the Sentiment Analysis Symposium, and a contributing editor at TechWeb’s InformationWeek. He consults, writes, and speaks on business intelligence, data management and analysis systems, text mining, visualization, and related topics. Follow him on Twitter at http://twitter.com/sethgrimes.

About OpenText

OpenText is the world’s largest independent provider of Enterprise Content Management (ECM) software. The Company's solutions manage information for all types of business, compliance and industry requirements in the world's largest companies, government agencies and professional service firms. OpenText supports approximately 46,000 customers and millions of users in 114 countries and 12 languages. For more information about OpenText, visit www.opentext.com.