Text Analytics Online Knowledge Base / Database


Contents

FreeBase API

WordReference API

DBPedia

Yahoo APIs

YAGO

True Knowledge API

Comparison of FreeBase and DBPedia

Comparison of YAGO and DBPedia

Conclusion


FreeBase API

It is the API provided by Google:

Freebase contains, at the time of writing, more than 20 million topics, more than 3,000 types, and more than 30,000 properties. This is not a small database by any measure. If you were to think of it in terms of relational databases, it is probably the database with the largest number of relational tables (3,000+ types) and the largest number of table columns (30,000+ properties).

Furthermore, Freebase is designed to store the amorphous kind of data that you find in everyday life. To store data about the prolific Bob Dylan -- who composed songs, sang and performed, wrote books, acted in movies -- which relational table should we use? The "song composer" table, or the "singer" table, or the "book author" table, or the "film actor" table? The answer is that we need to store data about that same person in all those different tables. This complexity is not limited to prolific people; a building could start out as a church, be turned into a hospital during a war, and later become a tourist destination. The apple is a fruit, but also an ingredient in numerous recipes, the logo of a company, and a literary device in the story of Snow White.

Those millions of topics are very intricately connected. A certain politician might have run a campaign funded by a pharmaceutical company, whose board consists of some people who used to study at some particular Ivy League schools. Topics in different domains (politics, business, education, etc.) are linked together, spanning across virtually any combination of tables. Real life is intricately interconnected, and so is Freebase data.

Considering the sheer size and the data modeling complexity of Freebase, we can proudly say: this isn't your father's kind of database. It's a whole new kind of database, and one that was specifically designed to play well as a citizen of the web. Freebase is not only a web site that people can use directly with their browsers; it is also a collection of web services that your own web applications can use to achieve things that wouldn't be possible without additional data, and a hosting platform where you can develop and run your web applications securely on Freebase's own server infrastructure.

Ways to use Freebase:

• Use Freebase's Ids to uniquely identify entities anywhere on the web
• Query Freebase's data using MQL
• Build applications using our API or Acre, our hosted development platform


ABOUT :

Freebase extracts structured data from Wikipedia and makes it available as RDF.

Freebase (the open global structured knowledge base) is a high-profile public instantiation of the Metaweb technology.

We use the Metaweb Query Language (MQL) for programmatic queries.

For example, suppose we want to find an object in the database whose type is "/music/artist" and whose name is "The Police", and then return its set of albums.

The query would be:

https://api.freebase.com/api/service/mqlread?query={%22query%22:{%22type%22:%22/music/artist%22,%22name%22:%22The%20Police%22,%22album%22:[]}}

and the result we will get is :

{

"code": "/api/status/ok",

"result": {

"album": [ "Outlandosd'Amour",

"Reggatta de Blanc",

"Zenyatt\u00e0 Mondatta",

"Ghost in the Machine",

"Synchronicity",

"Every Breath You Take: The Singles",

"Greatest Hits",

"Message in a Box: The Complete Recordings",

"Live!",

"Every Breath You Take: The Classics",

"Their Greatest Hits",

"Can't Stand Losing You",

"Roxanne '97 (Puff Daddy remix)",

"Roxanne '97",

"The Police",

"Greatest Hits",

"The Very Best of Sting & The Police",

"Brimstone and Treacle",

"Can't Stand Losing You",

"De Do DoDo, De Da DaDa",

"Certifiable: Live in Buenos Aires",

"Roxanne",

"2007-09-16: Geneva",

"Live in Boston",

"The 50 Greatest Songs",


"King of Pain",

"Invisible Sun",

"Message in a Bottle",

"Spirits in the Material World",

"Don't Stand So Close to Me '86",

"The Police Live!",

"Synchronocity",

"The Very Best of Sting & The Police",

"When the World Is Running Down (You Can't Go Wrong)"

],

"name": "The Police",

"type": "/music/artist"

},

"status": "200 OK",

"transaction_id": "cache;cache02.p01.sjc1:8101;2012-12-26T10:15:37Z;0031"

}

For making queries, you need a thorough knowledge of the Metaweb Query Language (MQL) and its notation.
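As a rough illustration of issuing such an MQL read from a script rather than the browser, the sketch below uses only the Python standard library and assumes the legacy api.freebase.com mqlread service shown above is still reachable:

# Minimal sketch: send the MQL query from the example above to the mqlread
# service and print the album list from the JSON envelope that comes back.
import json
import urllib.parse
import urllib.request

mql = {"query": {"type": "/music/artist", "name": "The Police", "album": []}}
url = ("https://api.freebase.com/api/service/mqlread?query="
       + urllib.parse.quote(json.dumps(mql)))

with urllib.request.urlopen(url) as response:
    envelope = json.load(response)

if envelope.get("code") == "/api/status/ok":
    for album in envelope["result"]["album"]:
        print(album)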

You can also write the query in an easier way when accessing the API via googleapis.com, as follows. The concept of writing the query, which is to be hit at the browser, is explained in detail below. Here are the parameters we use for writing the query:

Parameters:

query (required, string): The text you want to match against Freebase entities.
callback (optional, string): JS method name for JSONP callbacks.
domain (optional, string, multiple allowed): A comma-separated list of domain IDs. Search results must include these domains.
exact (optional, boolean, default false): Matches only the name and keys 'exactly'. No normalization of any kind is done at indexing and query time. The text is only broken up on space characters.
filter (optional, string, multiple allowed): A filter s-expression.
format (optional, string): The keyword "classic" to return the same information the original search API would have.
encode (optional, boolean, default false): Whether or not to HTML-escape entities' names.
indent (optional, boolean, default false): Whether to indent the JSON.
limit (optional, integer ≥ 1, default 20): Return up to this number of results.
mql_output (optional, string): An MQL query that extracts entity information.
prefixed (optional, boolean, default false): Whether or not to match by name prefix (used for autosuggest).
start (optional, integer ≥ 0, default 0): Allows paging through results.
type (optional, string, multiple allowed): A comma-separated list of type IDs. Search results must include these types.
lang (optional, string, multiple allowed): The language you are searching in. Can pass multiple languages.

Here, query is the text you want to search for.

Now comes how to write the query; for that we need the query string.

Example: here the query string is Washington, so the request is:

https://www.googleapis.com/freebase/v1/search?query=Washington&indent=true&limit=222&prefixed=true&lang=en

The output is in JSON format:

Washington Free Base.txt
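As a rough sketch of calling the same search endpoint from code, the snippet below builds the request with the parameters described above and prints the name, id and relevance score of each hit; it assumes the legacy googleapis.com/freebase/v1/search endpoint is still answering and uses only the Python standard library:

# Minimal sketch: query the Freebase search API for "Washington" and list the hits.
import json
import urllib.parse
import urllib.request

params = {
    "query": "Washington",   # the query string
    "limit": 10,             # return up to 10 results
    "prefixed": "true",      # also match by name prefix
    "lang": "en",
}
url = "https://www.googleapis.com/freebase/v1/search?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as response:
    data = json.load(response)

for hit in data.get("result", []):
    # each hit carries at least a name, an id and a relevance score
    print(hit.get("name"), hit.get("id"), hit.get("score"))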

We can also use filters, as there are a number of filters defined:


Filter: e.g. (any type:/people/person). For more details on what filters can do, refer to http://wiki.freebase.com/wiki/Search_Cookbook. The filter param allows you to create more complex rules and constraints to apply to your query. The filter value is a simple language that supports the following symbols: the all, any, should and not operators; the type, domain, name, alias, with and without operands; and the ( and ) parentheses for grouping and precedence.
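For example, a hypothetical request combining the query and filter parameters (spaces in the filter value would normally be URL-encoded as %20) could look like:

https://www.googleapis.com/freebase/v1/search?query=Washington&filter=(any%20type:/people/person)&limit=10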

About the size and use of FreeBase:

The size of the FreeBase data source by category, at the time of writing, is:

Category Size of Topics

MUSIC 11M

BOOKS 6M

People 2M

TV 1M

Location 1M

Film 877 K

Business 704K

Government 139 K

Here are some topics about FreeBase:


Topic Solution

What is FreeBase Freebase is an open, Creative Commons licensed collection of structured data, and a platform for accessing and manipulating that data via the Freebase API.

Size of FreeBase Freebase contains about 20 million topics (aka entities) .

Is Freebase a wiki? No, though it shares some similarities with open wiki projects:
• Freebase is a free source of information
• Freebase is a collaborative project, and Freebase data may be edited by anyone
• Most of the data in Freebase is openly licensed under Creative Commons
However:
• Freebase does not run on wiki software, but on a graph database that represents structured data
• Most wikis arrange information primarily in the form of text-based articles, while Freebase houses information in a structured, machine-readable database format

Is Freebase a Semantic Web project?

Yes, Freebase is part of the Semantic Web. We emit Linked Open Data (via RDF) for all our entities, and are involved in various SemWeb projects/communities/etc.

Where does the information in Freebase come from?

Initially, Freebase was seeded by pulling in information from a large number of high-quality open data sources, such as Wikipedia, MusicBrainz, and others. The Freebase community along with the internal Freebase team continue to drive the growth of the graph – focusing on bulk, algorithmic data imports, data extraction from free text, ongoing synchronization of data feeds, and rigorous quality management.

What are the limits on use of API?

You may use Freebase's API for almost any use, including commercial uses, up to a limit of 100,000 API calls per day. If you are interested in using the Freebase API beyond 100,000 API calls per day, please contact Metaweb


What are the rules for using data in Freebase?

It depends on what type of content it is. Data is available for use under the Creative Commons Attribution Only (or CC-BY) license. This means you are free to use it on your site, as long as you credit the Freebase community appropriately. The Freebase attribution policy has all the details. Many of the images in Freebase are also CC-BY, although some images are hosted under different license terms, like GFDL (which is similar to CC-BY), public domain, or Fair Use, and you can use the Freebase API to filter your results by license type. Finally, long descriptions that they have pulled in from Wikipedia are licensed under the GFDL.

What is the relationship between Freebase and Metaweb?

Metaweb is the commercial entity that sponsored and developed the Freebase platform. Metaweb was acquired by Google in July, 2010.

Will the licensing of information in Freebase ever change?

No. The data in Freebase has already been licensed under CC-BY, which means it will always be available under that license; adding a new license would not impact the current corpus of data. Furthermore, all of the data in Freebase is available for download, and people are allowed to store it locally.

References :

http://wiki.freebase.com/wiki/FAQ#Is_Freebase_a_wiki.3F

http://wiki.freebase.com/wiki/DBPedia

http://blog.dbpedia.org/2008/11/15/dbpedia-is-now-interlinked-with-freebase-links-to-opencyc-updated/


Word Reference API

The API comes in two varieties: a JSON format and a regular-HTML/web format.

The URL for the HTML API is http://api.wordreference.com/{api_version}/{API_key}/{dictionary}/{term} and for the JSON API http://api.wordreference.com/{api_version}/{API_key}/json/{dictionary}/{term} where {term} is the term being searched for, {dictionary} is the dictionary you want to search, and {api_version} is the desired version of the API. If {api_version} is omitted, the API will redirect to the latest version automatically. Version upgrades will be posted here; the current version is 0.8.

For translation purposes we use the following:

Examples: api.wordreference.com/0.8/1/enfr/grin api.wordreference.com/0.8/1/json/enfr/grin

For our testing we will be using "/thesaurus/" in place of {dictionary}.

This API is useful for translating a word from any supported language to English and vice versa, but for word meanings its data source is very limited.

We can use it like this:

http://api.wordreference.com/3cd08/json/thesaurus/washington

This will give a result for Washington, but the data source is limited. So this is not the best option for the word disambiguation task, but it is a good choice if we are using it for translating between different languages.
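As a minimal sketch of calling the JSON API from code (the API key below is a placeholder, and the shape of the returned JSON varies by dictionary, so it is simply printed for inspection):

# Minimal sketch: fetch a thesaurus entry from the WordReference JSON API.
import json
import urllib.request

api_key = "YOUR_API_KEY"   # placeholder; obtained from the API registration page
term = "washington"
url = "http://api.wordreference.com/0.8/{key}/json/thesaurus/{term}".format(
    key=api_key, term=term)

with urllib.request.urlopen(url) as response:
    entry = json.load(response)

print(json.dumps(entry, indent=2))   # inspect whatever structure comes back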

Getting an API key

To get an API key:

1. Go to the API Registration page and fill out the necessary information.
2. After registering, you will receive your API key. Write this down and record it somewhere. You must use it to access the API!
3. If you lose your key, you can retrieve it here.
4. If you registered for a key and it is not working, you can check the status of your key here.

Whenever you make a request to the API, make sure your URL includes this unique ID --- otherwise the API will not work, and you will be prompted to include this.

Terms of Service

• You must include the copyright line: © WordReference.com
• You must link back to the term's entry on WordReference's website with the translation or equivalent of: 'term' at WordReference.com
• No derivative works (without permission).
• API data can only be stored and cached for 24 hours (without permission).
• You are limited to 600 requests to the API per hour by default.
• Cannot be used in: browser toolbars.
• Cannot be used in an application or webpage whose primary function is as a dictionary or translator (without permission).


DBPedia API

DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. DBpedia has been described by Tim Berners-Lee as one of the more famous parts of the Linked Data project.

Developer(s): University of Leipzig, Freie Universität Berlin, OpenLink Software
Initial release: 23 January 2007
Stable release: DBpedia 3.8 / 06 August 2012
Written in: Scala, Java, VSP
Operating system: Virtuoso Universal Server
Type: Semantic Web, Linked Data
License: GNU General Public License
Website: dbpedia.org

As of September 2011, the DBpedia dataset describes more than 3.64 million things, out of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases. The DBpedia data set features labels and abstracts for these 3.64 million things in up to 97 different languages; 2,724,000 links to images and 6,300,000 links to external web pages; 6,200,000 external links into other RDF datasets, 740,000 Wikipedia categories, and 2,900,000 YAGO2 categories. From this dataset, information spread across multiple pages can be extracted; for example, book authorship can be put together from pages about the work or the author.

The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information. As of September 2011, the DBpedia dataset consists of over 1 billion pieces of information (RDF triples) out of which 385 million were extracted from the English edition of Wikipedia and 665 million were extracted from other language editions.

One of the challenges in extracting information from Wikipedia is that the same concepts can be expressed using different properties in templates, such as birthplace and placeofbirth. Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results. As a result, the DBpedia Mapping Language has been developed to help in mapping these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.

DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across many different Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.

About SPARQL

SPARQL (pronounced "sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a query language for databases, able to retrieve and manipulate data stored in Resource Description Framework format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is considered one of the key technologies of the semantic web. On 15 January 2008, SPARQL 1.0 became an official W3C Recommendation.

SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns.

SPARQL allows users to write unambiguous queries.

Query forms

The SPARQL language specifies four different query variations for different purposes.

SELECT query
Used to extract raw values from a SPARQL endpoint; the results are returned in a table format.

CONSTRUCT query
Used to extract information from the SPARQL endpoint and transform the results into valid RDF.

ASK query
Used to provide a simple True/False result for a query on a SPARQL endpoint.

DESCRIBE query
Used to extract an RDF graph from the SPARQL endpoint, the contents of which is left to the endpoint to decide based on what the maintainer deems as useful information.

Each of these query forms takes a WHERE block to restrict the query, although in the case of the DESCRIBE query the WHERE is optional.

A Simple Example is: “Write a query to find capitals of all the countries in Asia”

PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Asia .
}

Note: SPARUL, or SPARQL/Update, is an extension to the SPARQL query language that provides

the ability to add, update, and delete RDF data held within a triple store.
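As a sketch of how such a query can be run programmatically, the snippet below sends a SPARQL SELECT to the public DBpedia endpoint at http://dbpedia.org/sparql and prints the bindings. It uses only the Python standard library; the dbo: property names come from the DBpedia ontology rather than the abc: example ontology above.

# Minimal sketch: list ten countries and their capitals from DBpedia via SPARQL.
import json
import urllib.parse
import urllib.request

sparql = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?capital WHERE {
  ?country a dbo:Country ;
           dbo:capital ?capital .
} LIMIT 10
"""

params = urllib.parse.urlencode({
    "query": sparql,
    "format": "application/sparql-results+json",   # ask the endpoint for JSON results
})
with urllib.request.urlopen("http://dbpedia.org/sparql?" + params) as response:
    results = json.load(response)

for row in results["results"]["bindings"]:
    print(row["country"]["value"], row["capital"]["value"])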

The DBpedia knowledge base is served as Linked Data on the Web. As DBpedia

defines Linked Data URIs for millions of concepts, various data providers have

started to set RDF links from their data sets to DBpedia, making DBpedia

one of the central interlinking-hubs of the emerging Web of Data.

Querying DBpedia:

If we are accessing the lookup service of the DBpedia API, we can use it as follows (it returns an XML page).


The DBpedia Lookup Service can be used to look up DBpedia URIs by related keywords. Related means that either the label of a resource matches, or an anchor text that was frequently used in Wikipedia to refer to a specific resource matches (for example the resource http://dbpedia.org/resource/Washington can be looked up by the string “Washington”). The results are ranked by the number of inlinks pointing from other Wikipedia pages at a result page.

Two APIs are offered: Keyword Search and Prefix Search. The URL has the form http://lookup.dbpedia.org/api/search.asmx/<API>?<parameters>

Keyword Search

The Keyword Search API can be used to find related DBpedia resources for a given string. The string may consist of a single or multiple words.

Example: Places that have the related keyword “berlin” http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryClass=place&QueryString=berlin

Prefix Search (i.e. Autocomplete)

The Prefix Search API can be used to implement autocomplete input boxes. For a given partial

keyword like berl the API returns URIs of related DBpedia resources like http://dbpedia.org/resource/Berlin.

Example: Top five resources for which a keyword starts with “berl” http://lookup.dbpedia.org/api/search.asmx/PrefixSearch?QueryClass=&MaxHits=5&QueryString=berl

For example, we search for Washington with Keyword Search (here QueryClass is left empty). We give the query as:

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryClass=&QueryString=Washington&MaxHits=30

the result we get is :

WashingtonSearch.txt
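As a rough sketch of issuing this lookup from code, the snippet below calls the KeywordSearch endpoint and prints the label and URI of each hit. The parsing is kept deliberately loose because the exact XML layout (an ArrayOfResult element containing Result elements) is assumed here rather than taken from the text above.

# Minimal sketch: call the DBpedia Lookup KeywordSearch service and list the hits.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

params = urllib.parse.urlencode({
    "QueryClass": "",            # empty = owl#Thing / untyped resources
    "QueryString": "Washington",
    "MaxHits": 10,
})
url = "http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?" + params

with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())

def local(tag):
    # strip any XML namespace prefix from a tag name
    return tag.rsplit("}", 1)[-1]

for result in root.iter():
    if local(result.tag) == "Result":
        label = next((c.text for c in result if local(c.tag) == "Label"), None)
        uri = next((c.text for c in result if local(c.tag) == "URI"), None)
        print(label, uri)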


The three parameters are:

QueryString: a string for which a DBpedia URI should be found.
QueryClass: a DBpedia class from the ontology that the results should have (for owl#Thing and untyped resources, leave this parameter empty). CAUTION: specifying any value that does not represent a DBpedia class will lead to no results (contrary to the previous behaviour of the service).
MaxHits: the maximum number of returned results (default: 5).

Note: it is not able to find Francisco D'Souza when we give Francisco as the QueryString, even when we allow 1000 hits.

Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).

If a thing exists in multiple language versions of Wikipedia, then short and long abstracts within these languages and links to the different language Wikipedia pages are added to the description. The DBpedia data set contains the following numbers of abstracts per language (July 2012):

Language Number of Abstracts

English 3,770,000

German 1,244,000

French 1,197,000

Dutch 993,000

Italian 882,000

Spanish 879,000

Polish 848,000

Japanese 781,000

Portuguese 699,000

Swedish 457,000

Chinese 445,000


LICENSE :

DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself. As Wikipedia has moved to dual-licensing, we also dual-license DBpedia starting with release 3.4.

DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License. All DBpedia

releases up to and including release 3.3 are licensed under the terms of the GNU Free Documentation License only.


Yahoo APIs

Yahoo does not have a proper API which can be used for the purpose of word disambiguation.

It has APIs such as:

Yahoo! Answers API

Content Analysis API

But these cannot be used for this specific purpose.

Content Analysis API:

The Content Analysis Web Service detects entities/concepts, categories, and relationships within unstructured content. It ranks those detected entities/concepts by their overall relevance, resolves them if possible into Wikipedia pages, and annotates tags with relevant metadata. Please give the content analysis service a try to enrich your content.

RATE LIMITS:

The Content Analysis service is limited to 5,000 queries per IP address per day and to noncommercial use.

Reference: http://developer.yahoo.com/search/content/V2/contentAnalysis.html

Yahoo! Answers API:

Yahoo! Answers is a place where people ask and answer questions on any topic. The Answers API lets you tap into the collective knowledge of millions of Yahoo! users. Search for expert advice on any topic, watch for new questions in the Answers categories of your choice, and keep track of fresh content from your favorite Answers experts.


The Answers API provides the following methods:

questionSearch
Find questions that match your query.

getByCategory
List questions from one of our hundreds of categories, filtered by type. You'll need the category name or ID, which you can get from questionSearch.

getQuestion
Found an interesting question? getQuestion lists all the details for every answer to the question ID you specify, including the best answer, if it's been chosen. Get that question ID from questionSearch or getByCategory.

getByUser
List questions from specific users on Yahoo! Answers. You'll need the user id, which you can get from any of the other services listed above.

RATE LIMITS

Yahoo! Web Search web services are limited to 5,000 queries per IP per day per API. See information on rate limiting and the Usage Policy to learn about acceptable uses and how to request additional queries.


YAGO (Yet Another Great Ontology)

YAGO2s is a huge semantic knowledge base, derived from:

• Wikipedia

• WordNet

• GeoNames

Currently, YAGO2s has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.

YAGO is special in several ways:

1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value.

2. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spacial dimension to many of its facts and entities.

3. In addition to a taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.

YAGO2s is part of the YAGO-NAGA project at the Max Planck Institute for Informatics in

Saarbrücken/Germany. It is maintained jointly by the Databases and Information Systems

Group and the Ontologies Group.

The YAGO-NAGA project started in 2006 with the goal of building a conveniently searchable,

large-scale, highly accurate knowledge base of common facts in a machine-processible

representation.

They have already harvested knowledge about millions of entities and facts about their

relationships, from Wikipedia and WordNet with careful integration of these two sources. The

resulting knowledge base, coined YAGO, has very high precision and is freely available. The

facts are represented as RDF triples, and they have developed methods and prototype systems

for querying, ranking, and exploring knowledge. The search engine NAGA provides ranked

answers to queries based on statistical models.


where NAGA (Not Another Google Answer) is a new semantic search engine which provides ranked answers to queries based on statistical models.

What it contains:

• It contains all the entities and facts from GeoNames - (from a dump of August 2010).

• It also contains textual and structural data from Wikipedia.

• All links+anchor texts between the YAGO entities.

• All Wikipedia category names.

• The titles of references.

YAGO is particularly suited for disambiguation purposes, as it contains a large number of names

for entities. It also knows the gender of people.

YAGO is the resulting knowledge base, the facts are represented as RDF triples (Resource

Description Framework).

Why YAGO-NAGA :

• Three major research areas:

– Semantic-Web-style knowledge repositories.

• Such as SUMO, OpenCyc, and WordNet.

– Large-scale information extraction.

– Social tagging and Web 2.0 communities that constitute the social Web.

• Wikipedia is another example of the Social Web paradigm.


• The challenge is how to extract the important facts from the Web and organize them

into an explicit knowledge base that captures entities and semantic relationships among

them.

How YAGO-NAGA Works?

• YAGO adopts concepts from the standardized SPARQL Protocol and RDF Query

Language for RDF data but extends them through more expressive pattern matching

and ranking.

• The prototype system that implements these features is NAGA.

Growing the Knowledge Base :


YAGO Knowledge Base :

• Combine knowledge from WordNet & Wikipedia.

• Additional Gazetteers (geonames.org).

Like following diagram explains for a particular entity:

We can check YAGO through

• Browse through the YAGO knowledge base.

– https://d5gate.ag5.mpi-sb.mpg.de/webyagospotlx/Browser

• Ask queries on YAGO using SPOTLX patterns. View the results on a map and timeline.

– https://d5gate.ag5.mpi-sb.mpg.de/webyagospotlx/WebInterface


There are more than 13 sub-projects of YAGO-NAGA.

AIDA is a method, implemented in an online tool, for disambiguating mentions of named entities that occur in natural-language text or Web tables.

– https://d5gate.ag5.mpi-sb.mpg.de/webaida/

To use this, you should have knowledge of ontology and RDF principles.

Some FAQ in brief about YAGO :

Question Answer

What is YAGO?

YAGO is an ontology, i.e., a database with knowledge about the real world. YAGO contains both entities (such as movies, people, cities, countries, etc.) and facts about these entities (who played in which movie, which city is located in which country, etc.). All in all, YAGO contains 10 million entities and 120 million facts.

What is so special about YAGO?

YAGO is special in several ways: The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%. Every relation is annotated with its confidence value. YAGO is an ontology that is anchored in time and space. YAGO attaches a temporal dimension and a spacial dimension to many of its facts and entities. In addition to a taxonomy, YAGO has thematic domains such as "music" or "science".

What is new in YAGO2s?

While preserving the quality and accuracy of its predecessor YAGO2, YAGO2s improves over it in several ways: YAGO2s is stored natively in Turtle, making it completely RDF/OWL compliant while still maintaining the fact identifiers that are unique to YAGO. The new YAGO2s architecture enables cooperation of several contributors, facilitates debugging and maintenance. The data is divided into themes, so that users can download only particular pieces of YAGO ("YAGO a la carte"). YAGO2s contains thematic domains such as "music" or "science", which gives a topic structure to YAGO.


How is the taxonomy of YAGO structured?

YAGO classifies each entity into a taxonomy of classes. Every entity is an instance of one or multiple classes. Every class (except the root class) is a subclass of one or multiple classes. This yields a hierarchy of classes: the taxonomy. The YAGO taxonomy is the backbone of the ontology, and is designed with much care and attention to correctness. For those interested in the details of that taxonomy, we provide here a more in-depth explanation of the classes. The taxonomy consists of 4 layers:

• The root node of the taxonomy is rdfs:Resource. It includes entities, but also properties, literals, etc. rdfs:Resource has a subclass owl:Thing, which is the class of things (entities).

• Under owl:Thing, there is the class taxonomy from WordNet. Each class name is of the form <wordnet_XXX_YYY>, where XXX is the name of the concept (e.g., singer), and YYY is the WordNet 3.0 synset id of the concept (e.g., 110599806). For example, the class of singers is <wordnet_singer_110599806>. Each class is connected to its more general class by the rdfs:subclassOf relationship.

• The middle layer of the taxonomy consists of classes that have been derived from Wikipedia categories. For example, one class is <wikicategory_American_rock_singers>, derived from the Wikipedia category American rock singers. Each of these classes is connected to one class of the WordNet layer by an rdfs:subclassOf relationship. In the example, <wikicategory_American_rock_singers> rdfs:subclassOf <wordnet_singer_110599806>. Not all Wikipedia categories become classes in YAGO.

• The lowest layer of the taxonomy is the layer of instances. Instances comprise individual entities such as rivers, people, or movies. For example, this layer contains <Elvis_Presley>. Each instance is connected to one or multiple classes of the higher layers by the relationship rdf:type. In the example: <Elvis_Presley> rdf:type <wikicategory_American_rock_singers>.

This way, you can walk from the instance up to its class by rdf:type, and then further up by rdfs:subclassOf.

Does YAGO have thematic domains?

YAGO provides a class hierarchy in the sense of RDF: Every subclass represents a set of instances that is a subset of the set of instances of the super class. For example, Elvis Presley is in the class of singers (because Elvis is a singer). This class is a subclass of the class of persons, because every singer is a person. This is different from a thematic domain hierarchy! A thematic domain hierarchy contains items such as "Football", "Sports", "Music" etc. In such a hierarchy, Elvis is in the domain "Music". The new YAGO2s now contains a theme with WordNet Domains, which gives such a thematic domain structure to YAGO.


What is the data format of YAGO2s?

Turtle: The YAGO knowledge base is a set of independent modular full-text files. These files are in the N3 Turtle format, ending in *.ttl. See http://www.w3.org/TeamSubmission/turtle/ for details on this format.

N4: YAGO extends the Turtle format to the "N4 format". In this format, every triple can have an identifier, the fact identifier. The fact identifier is specified as a comment in the line before the triple. As a result, all N4 files are fully backwards compatible with standard Turtle and N3. The fact identifier can appear as a subject in other triples. This is used to annotate YAGO facts with time and space.

Identifiers: All identifiers in YAGO are standard Turtle identifiers. There are a number of prefixes predefined, such as rdf, rdfs, owl, etc. The base is set to the namespace of YAGO, http://yago-knowledge.org/resource/. YAGO defines its own datatypes, which extend the standard datatypes. Here are examples of identifiers: entities are written in angle brackets, e.g. <Elvis_Presley>; strings are written in double quotes, with optional language tags, e.g. "Elvis", "Elvis"@en; literals are written in double quotes with a datatype, e.g. "1977-08-16"^^xsd:date, "70"^^<m> (<m> is the YAGO literal datatype "meter", which is a subclass of "quantity").

How do labels work in YAGO?

In line with RDF, YAGO distinguishes between the entity (Elvis_Presley) and names for that entity ("Elvis", "The King", "Mr. Presley", etc.). The reason for this distinction is that one entity can have multiple names. Also, one name can mean multiple entities. Consider, e.g., the name "The King", which is highly ambiguous. YAGO links an entity to its name by the relationship rdfs:label. For example, YAGO contains the fact <Elvis_Presley> rdfs:label "Elvis". In addition, YAGO knows, for each entity, its preferred name. This name is designated by the relationship skos:prefLabel. For example, <Elvis_Presley> skos:prefLabel "Elvis Presley". Even if Elvis has multiple names, his standard name is "Elvis Presley". In addition, YAGO contains for each name its preferred meaning. This meaning is designated by <isPreferredMeaningOf>. In the example, <Elvis_Presley> <isPreferredMeaningOf> "Elvis". Even if the word "Elvis" can refer to multiple entities, its default meaning is Elvis Presley.


How do meta facts work?

YAGO gives a fact identifier to each fact. For example, the fact <Elvis_Presley> rdf:type <person> could have the fact identifier <id_42>. In the native N4/TTL version of YAGO, the fact identifiers are given in a comment line before the actual fact. In the TSV version, they are simply an additional column. YAGO contains facts about these fact identifiers. For example, YAGO contains:

<id_42> <occursSince> "1935-01-08"
<id_42> <occursUntil> "1977-08-16"
<id_42> <extractionSource> <http://en.wikipedia.org/Elvis_Presley>

These facts mean that Elvis was a person from the year 1935 to the year 1977, and that this fact was found in Wikipedia.

What is the difference between YAGO and DBpedia?

DBpedia is a community effort to extract structured information from Wikipedia. In this sense, both YAGO and DBpedia share the same goal of generating a structured ontology. The projects differ in their foci. In YAGO, the focus is on precision, the taxonomic structure, and the spatial and temporal dimension. For a detailed comparison of the projects, see Chapter 10.3 of our AI journal paper "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia".

How can I access YAGO?

There are several ways to access YAGO:
1. Online, in person, on the YAGO Web Interface.
2. Online, through the SPARQL interface provided by OpenLink.
3. Offline, by downloading the TTL version of YAGO and loading it into an RDF triple store (e.g., Jena).
4. Offline, by downloading the TSV version of YAGO, loading it into a database with the script provided at the bottom of the page, and using SQL.

YAGO is freely available at http://yago-knowledge.org.
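As a minimal sketch of offline access (option 3 above), the snippet below loads one downloaded YAGO theme file into an rdflib graph and prints the labels of an entity. It assumes the third-party rdflib package is installed; the file name is a placeholder for whichever theme you downloaded.

# Minimal sketch: load a YAGO Turtle theme and look up the labels of one entity.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("yagoLabels.ttl", format="turtle")   # placeholder theme file

elvis = URIRef("http://yago-knowledge.org/resource/Elvis_Presley")
for label in g.objects(elvis, RDFS.label):
    print(label)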

References:

http://www.mpi-inf.mpg.de/yago-naga/yago/

http://www.mpi-inf.mpg.de/yago-naga/yago/publications/aij.pdf

http://www.mpi-inf.mpg.de/yago-naga/javatools/doc/index.html

http://www.mpi-inf.mpg.de/yago-naga/

https://d5gate.ag5.mpi-sb.mpg.de/webyagospotlx/Browser

http://www.mpi-inf.mpg.de/yago-naga/yago/downloads.html

http://www.mpi-inf.mpg.de/~mtb/pub/yago-qa.ppt

http://faculty.ist.unomaha.edu/ylierler/teaching/material/YAGO-NAGA.pptx

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.8206&rep=rep1&type=pdf

Demo paper: http://www.mpi-inf.mpg.de/yago-naga/yago/publications/btw2013d.pdf and https://d5gate.ag5.mpi-sb.mpg.de/webyagospo/FlightPlanner

Paper: "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia" (AI journal), http://www.mpi-inf.mpg.de/yago-naga/yago/publications/aij.pdf

True Knowledge API

True Knowledge is a new class of Internet search technology aimed at improving the experience

of finding known facts on the Web. The True Knowledge Answer Engine gives consumers

instant answers to complex questions. Request information on any topic and get back results in

a processable form. Early areas of strength include geographic knowledge, local time, and

geolocation. Natural language questions can also be processed. View demos at their website.

It is now called Evi.

In January 2012, True Knowledge launched a major new product, Evi (pronounced "ee-vee"), an artificial intelligence program which can be communicated with using natural language. The company changed its name from True Knowledge to Evi in June 2012.


The True Knowledge Answer Engine attempts to comprehend posed questions by

disambiguating from all possible meanings of the words in the question to find the most likely

meaning of the question being asked. It does this by drawing upon its database of knowledge of

discrete facts. As these facts are stored in a form that the computer can understand, the

answer engine attempts to produce an answer to what it comprehends to be the question by

logically deducing from them.[5] For example, if one were to type in “What is the birth date of

George W. Bush?”, True Knowledge would reason from the facts “George W. Bush is a

president”, “George W. Bush is a human being”, “A president is a subclass of human being”,

“Date of creation is a more general form for birth date”, and “the 6th of July is the date of

creation for George W. Bush”, to produce the simple answer, “the 6th of July”. True Knowledge

differs from competitors like Freebase and DBpedia in that it offers natural language access.

Unlike the others however, users who post information to True Knowledge granted the

company a "non-exclusive, irrevocable, perpetual licence to use such information to operate

this website and for any other purposes.".

Evi gathers information for its database in two ways: importing it from "credible" external databases (which for them includes Wikipedia) and from user submission following a consistent format and detailed process for input. True Knowledge strives to monitor this user-submitted knowledge in multiple ways. One method involves a system of checks and balances in some ways similar to Wikipedia's, allowing users to modify or “agree”/“disagree” with information presented by True Knowledge. The system itself also assesses submitted information, due to the fact that the information is submitted as discrete facts that computers can understand. The system is able to reject any facts that are semantically incompatible with other approved knowledge. On November 21, 2008, True Knowledge announced on its official blog that over 100,000 facts had been added by beta users, and as of August 18, 2010 the True Knowledge database overall contained 283,511,156 facts about 9,237,091 things.

Note :In November 2010, True Knowledge used some 300 million facts to calculate that Sunday April 11, 1954 was the most boring day since 1900.

The True Knowledge API enables developers to utilize True Knowledge’s functionality in third party applications. True Knowledge provides the following API services: the Direct Answer API and the Query API. The Direct Answer API exposes the natural language question answering feature of True Knowledge while the Query API allows users to bypass our natural language translation system and directly query the knowledge base using a simple query language.


API services are comprised of HTTP requests and XML responses.

With a free API account, users of our API services must credit True Knowledge and place a prominent link back to our site: http://www.trueknowledge.com/. With the direct answer service the question URL returned in the tk_question_url tag should be used.

You can see the details about the use in the following link http://images.trueknowledge.com/blog/wp-content/uploads/2011/02/tk_api_docs.pdf

References:

http://www.apihub.com/api/true-knowledge-api

http://en.wikipedia.org/wiki/Evi_%28software%29

http://www.evi.com/q/francisco

http://www.evi.com/


Comparison of FreeBase and DBPedia

Freebase is an open-license database which provides data about millions of things from various domains. Freebase has released a Linked Data interface to its content (see the release note). As there is a big overlap between DBpedia and Freebase, 2.4 million RDF links have been added to DBpedia pointing at the corresponding things in Freebase. These links can be used to smush and fuse data about a thing from DBpedia and Freebase. For instance, you can use the Marbles Linked Data browser to view data about the Lord of the Rings from Freebase and DBpedia smushed together.

The RDF links to OpenCyc have also been updated, which allows you to use DBpedia instance data together with the conceptual knowledge of OpenCyc.

Major differences between FreeBase and DBPedia:

FreeBase: Freebase imports data from a wide variety of sources, not just Wikipedia.
DBPedia: DBPedia focuses on just Wikipedia data.

FreeBase: Freebase is owned and funded by Google, an incorporated company.
DBPedia: DBPedia is funded by grants/sponsorships from various organisations.

FreeBase: Freebase is part of the Semantic Web; it emits Linked Open Data (via RDF) for all its entities and is involved in various SemWeb projects and communities. FreeBase is now also interlinked with DBPedia.
DBPedia: DBPedia depends on RDF.

FreeBase: Freebase is user-editable and contributions can be made through a public interface.
DBPedia: DBPedia requires that you edit Wikipedia for the change to appear in DBPedia.

Other important points are :-

DBpedia stores its data as RDF triples in a 3rd-party triple store. Freebase stores its data as n-tuples in a proprietary tuple store.

Both communities make their data available as RDF.

Freebase provides complete data dumps. DBpedia provides complete data dumps


DBpedia schema mappings can be edited by the community. Freebase schema & data can be edited by the community.

DBpedia data is automatically generated from Wikipedia several times a year. Wikipedia data is automatically imported into Freebase every two weeks.

DBpedia lets you query its data via a SPARQL endpoint. Freebase lets you query its data via an MQL API.

DBpedia has strong connections to the Semantic Web research community. Freebase has strong connections to the open data / startup community.

DBpedia tools are predominantly developed by 3rd parties and the open-source community. Freebase tools are predominantly developed by Metaweb and the Freebase community.


Comparison of YAGO and DBPedia

Closest to YAGO in spirit is the DBpedia project, which also extracts an ontological knowledge base from Wikipedia.

DBPedia: The DBpedia project has manually developed its own taxonomy.
YAGO: YAGO re-uses WordNet and enriches it with the leaf categories from Wikipedia.

DBPedia: DBpedia's taxonomy has merely 272 classes.
YAGO: YAGO2 contains about 350,000 classes. YAGO's compatibility with WordNet allows easy linkage and integration with other resources such as Universal WordNet, which has been exploited for YAGO2.

DBPedia: DBpedia outsourced the task of pattern definition to its community and uses a much larger number of more diverse extraction patterns, but ends up with redundancies and even inconsistencies. Overall, DBpedia contains about 1,100 relations.
YAGO: For extracting relational facts from infoboxes, YAGO2 uses carefully handcrafted patterns, and reconciles duplicate infobox attributes (such as birthdate and dateofbirth), mapping them to the same canonical relation. YAGO2 has about 100 relations.

The following key differences explain this big quantitative gap, and put the comparison in the perspective of data quality.

Many relations in DBpedia are very special. As an example, take aircraftHelicopterAttack, which links a military unit to a means of transportation. Half of DBpedia's relations have less than 500 facts.

YAGO2's relations have more coarse-grained type signatures than DBpedia's. For example, DBpedia knows the relations Writer, Composer, and Singer, while YAGO2 expresses all of them by hasCreated. On the other hand, it is easy for YAGO2 to infer the exact relationship (Writer vs. Composer) from the types of the entities (Book vs. Song). So the same information is present.


YAGO2 does not contain inverse relationships. A relationship between two entities is stored only once, in one direction. DBpedia, in contrast, has several relations that are the inverses of other relations (e.g., hasChild/hasParent). This increases the number of relation names without adding information.

YAGO2 has a sophisticated time and space model, which represents time and space as facts about facts. DBpedia closely follows the infobox attributes in Wikipedia. This leads to relations such as populationAsOf, which contain the validity year for another fact. A similar observation holds for geospatial facts, with relations such as distanceToCardiff.

Overall, DBpedia and YAGO share the same goal and use many similar ideas. At the same time, both projects have also developed complementary techniques and foci. Therefore, the two projects generally inspire, enrich, and help each other. For example, while DBpedia uses YAGO's taxonomy (for its yago:type triples), YAGO relies on DBpedia as an entry point to the Web of Linked Data


Conclusion

We discussed the APIs above; here is a brief summary of each, covering its license, the size of its data source, its features, and its ease of use.

FreeBase API
License: Creative Commons Attribution Only (CC-BY) license, with a limit of 100,000 API calls per day.
Size: 20 million topics, more than 3,000 types, and more than 30,000 properties.
Features: Freebase extracts structured data from Wikipedia and makes RDF available. Freebase is part of the Semantic Web.
Ease of use: It is quite easy to use, as explained in detail above, just by hitting the URL in the browser.

DBPedia
License: Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.
Size: Abstracts per language: English 3,770,000; German 1,244,000; French 1,197,000; Dutch 993,000; Italian 882,000; Spanish 879,000; Polish 848,000; Japanese 781,000; Portuguese 699,000; Swedish 457,000; Chinese 445,000.
Features: The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information. DBpedia extracts factual information from Wikipedia pages, allowing users to find answers to questions where the information is spread across many different Wikipedia articles. Data is accessed using an SQL-like query language for RDF called SPARQL.
Ease of use: To use this you should have an idea of SPARQL; complex queries have to be written in SPARQL. There is also the option of simply giving some parameters in the URL query and hitting it in the browser.

YAGO
License: The YAGO ontology is licensed under a Creative Commons Attribution 3.0 License. YAGO is freely available at http://yago-knowledge.org and can be accessed in different ways: 1. online on the Web Interface; 2. online through the SPARQL interface provided by OpenLink; 3. offline by downloading the TTL version of YAGO and loading it into an RDF triple store (e.g., Jena); 4. offline by downloading the TSV version of YAGO, loading it into a database with the provided script, and using SQL.
Size: 10 million entities.
Features: YAGO2s is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. 1. The accuracy of YAGO has been manually evaluated, proving a confirmed accuracy of 95%; every relation is annotated with its confidence value. 2. YAGO is an ontology that is anchored in time and space; YAGO attaches a temporal dimension and a spatial dimension to many of its facts and entities. 3. In addition to a taxonomy, YAGO has thematic domains such as "music" or "science" from WordNet Domains.
Ease of use: To use this you should have knowledge of RDF and ontology principles; it returns RDF tuples for its queries.

WordReference
License: You must include the copyright line: © WordReference.com. You are limited to 600 requests to the API per hour by default.
Size: It has a large data source of dictionaries for different languages, and its dictionary keeps growing.
Features: WordReference offers dictionary translations of words and phrases, not machine translation of sentences. It has recently released a "/thesaurus/" dictionary for English.
Ease of use: It is easy to use: get an API key, use it to create the URL query, and hit the request.

True Knowledge API
License: They provide access to portions of their service through an API, enabling people to build applications on top of their platform. The basic service is free and they offer paid upgrades for additional features and services. Free API accounts have a daily limit (currently 2,000 "tokens") for each person or organisation. To discuss an upgrade, contact [email protected].
Size: 283,511,156 facts about 9,237,091 things.
Features: True Knowledge provides the following API services: the Direct Answer API and the Query API. The Direct Answer API exposes the natural language question answering feature of True Knowledge, while the Query API allows users to bypass the natural language translation system and directly query the knowledge base using a simple query language. API services are comprised of HTTP requests and XML responses.
Ease of use: The usage is documented at http://images.trueknowledge.com/blog/wp-content/uploads/2011/02/tk_api_docs.pdf

About Yahoo APIs:

We discussed the two APIs which Yahoo provides, i.e.

Yahoo! Answers API

Content Analysis API


These APIs are for answering questions and analysing content, but they do not have any specific feature that provides word disambiguation. There are many other data sources that can be used for specific purposes, such as:

KnowItAll

Omega

WolframAlpha

Cyc