How structured data (Linked Data) help in Big Data Analysis
--- Expand Patent Data with Linked Data Cloud
Lishan Zhang
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2013-96
http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-96.html
May 17, 2013
Copyright © 2013, by the author(s). All rights reserved.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
How structured data (Linked Data) help in Big Data Analysis
--- Expand Patent Data with Linked Data Cloud
M.Eng Program
Lishan Zhang
24106243
Outline
Abstract
Introduction
Literature Review
    Unveiling the underlying information in Big Data
    Previous solutions
    Approaches
    Conclusion
Methodology
    SPARQL: query language for RDF data
    SPARQL Endpoint query
    HTTP request
    User Interface design
Discussion
    Results
        Explanation of Results
        What is different
        Limitations of this approach
    Evaluation
        User Study
        Heuristic Evaluation
    Future Work
Conclusions or Impact Statement
Bibliography
Appendix
Abstract
Big Data is currently one of the most discussed topics in computing. It is a commonly used term to describe data that exceeds the processing capacity of on-hand database management tools. Its characteristics are often summarized as the 4Vs: Volume, Variety, Velocity and Value. Big Data can be structured or unstructured and has potential value hidden behind it. It is of vital importance to extract and analyze the valuable information in Big Data.
On the other hand, Linked Data is a new concept to most people. Linked Data refers to collections of interrelated datasets that can be published and shared on the web. Unlike most Big Data, Linked Data is highly structured. It is used to build the Semantic Web, in which huge amounts of data on the web are available in standard formats. These technologies enable people to answer more advanced analytical questions by querying the data and drawing inferences using vocabularies.
In our project, we explore the potential use of Linked Data in analyzing Big Data. We build a search engine that combines information from Linked Data with patent data to see whether we can dig out more information about each patent. There is already a huge Linked Data cloud containing a large amount of published open data, and we see the potential to connect these public data with patent data to answer advanced questions. When a user searches for an inventor name or a certain patent in the search interface, we query the Linked Data Cloud and the patent database separately and return the combined result. In this way, we can pair the patent itself with inventor information from DBpedia.
Introduction
Nowadays, we are generating much more data than at any point in history. The explosion of data is driven by two particular sources: social networks sharing information about our activities, and a variety of sensors collecting information about our environment. [1]
Needless to say, there could be priceless value hidden in this booming data. If we make good use of it, we may uncover valuable information and patterns inside the data. However, it will become a threat if we cannot handle this ever-increasing amount of data.
Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. [2] Big Data is often identified by four main attributes: Volume, Velocity, Variety and Value. It can be structured or unstructured data with potential value hidden behind it. The McKinsey Global Institute describes Big Data as "the next frontier for innovation, competition and productivity." [3] But processing these big raw datasets poses challenges in both data management and algorithms. It is of vital importance to extract and analyze the valuable information in Big Data.
The major difficulties in processing Big Data include capture, storage, search, sharing, analysis and visualization. [4] There are already several approaches to analyzing Big Data. For example, MapReduce is a programming model and an implementation for processing and generating large data sets. It runs on a large cluster of machines and is highly scalable. In addition, NoSQL employs non-relational data storage systems to process unstructured and semi-structured Big Data. Some institutes and companies have also developed their own mathematical models and algorithms to dig useful information out of Big Data.
We will mainly focus on the variety of data in this thesis. Variety means that Big Data comes in different types and various degrees of structure that do not fit into neat relational schemas. It is a mix of structured, semi-structured and unstructured data such as text, sensor data, video, log files and more. Such data cannot be integrated into an application directly. [2]
The current approaches to Big Data, like MapReduce and NoSQL, emphasize the ability to deal with volume and velocity. In this paper, we work from a different angle: we are concerned with the variety of Big Data. Since most data is unstructured, it is hard to interlink different datasets and create valuable context around them. We see potential value in linking different datasets and expanding the value of any single dataset with the help of Linked Data.
Linked Data is used to organize and publish highly structured data with globally unique identifiers, which makes it easy to combine various datasets. Richard Cyganiak and Anja Jentzsch created the Linking Open Data cloud diagram, which shows how many datasets have been published on the web. [5] As the Linked Data cloud grows constantly, data integration is becoming more important in this field.
Fig 1: The Linking Open Data cloud diagram
In this paper, we explore the potential use of Linked Data in Big Data analysis by building a prototype of our concepts. We use a U.S. utility patent dataset and link it with the public Linked Data cloud. We build a search engine for patent graph search that queries endpoints from the Linked Data Cloud, such as DBpedia and Freebase, simultaneously queries the SQL data from the patent dataset, and shows the combined results in the interface. The diagram below illustrates the querying process:
Fig 2: The querying process of Patent Search Engine
In this way we can add more related information about a patent and even provide recommendations for patent search. We can see that many potential values are created by this interconnection, and Linked Data could well become a valuable tool in Big Data analysis.
Literature Review
Unveiling the underlying information in Big Data
Big Data has become one of the hottest topics in the industry. In this data-booming world, some traditional technologies can no longer serve the need to analyze such large volumes of data. New approaches must be introduced to keep up with the pace of Big Data. The Linked Data concept is a useful way to unveil hidden information, especially in data on the Internet.
Big Data is a commonly used term to describe data that exceeds the processing capacity of conventional database systems. We are generating much more data than before with the boom of social networks and media, mobile devices, Internet transactions, and networked devices and sensors.
Big Data is too big, too fast and does not fit conventional database architectures. Due to its unique nature, the first question we need to answer is whether we can find an alternative way to process the data. More importantly, can we dig the useful information out of Big Data?
Big Data requires exceptional technologies to efficiently process large quantities of data. There are huge amounts of valuable patterns and information hidden in Big Data, waiting to be extracted. Usually, four problems come up with Big Data: Volume, Velocity, Variety and Value (the 4Vs) [6].
Volume and Velocity
In this data-booming world, the speed of data growth is exponential. In particular, with the increasing popularity of social media, user-generated content has started to dominate. For example, roughly 60 hours of video are uploaded to YouTube every minute [7]. It is also astonishing that over 340 million tweets were generated daily as of May 2012 [8]. To put this in perspective, the amount of information in the world doubles every five years [9]. There is more information in a daily edition of The New York Times than an individual in the 16th century had to process in their whole life.
Huge amounts of data require tremendous storage space and extremely fast processing speed. It has always been challenging for any company, government or individual to deal with this issue.
Variety and Value
Big Data relates not just to new information sources: it is equally applicable to gaining new insights from data that was previously inaccessible and to accelerating and easing existing analytical processes [10]. In fact, most Big Data is of low value until rolled up and analyzed, at which point it becomes valuable.
This is challenging due to Big Data's variety. Big Data has different structures and shapes, making it very difficult to analyze with traditional technologies such as MySQL or Oracle. Integrating these data sources is a very expensive operation [11]. Moreover, correlating different pieces of data and reconnecting them to make them more valuable, readable and accessible has always been an interesting problem.
Previous solutions
There have been several ways to process and analyze Big Data. Usually, they utilize advanced hardware and parallel processing techniques to break the speed bottleneck. Others have employed non-relational data storage systems to deal with unstructured and semi-structured Big Data. Meanwhile, many companies have been trying to apply unique mathematical models, advanced analytics and data visualization technology to dig insights out of Big Data.
Approaches
MapReduce
MapReduce is a breakthrough concept introduced by Google. It is a programming model and an implementation for processing and generating large data sets [12]. It is able to run on a large cluster of machines and is highly scalable.
MapReduce is not only successful at Google, but has also been open-sourced to the public under the name of Hadoop, a highly scalable compute and storage platform [13]. Hadoop breaks a huge chunk of data into pieces and processes/analyzes them in parallel.
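To make the programming model concrete, here is a minimal word-count sketch in JavaScript. The map() and reduce() functions follow the model described in [12]; the grouping step below simulates in a few lines what a real runtime such as Hadoop does across a whole cluster.

// Word count in the MapReduce style: map() emits (word, 1) pairs,
// reduce() sums the counts for one word. The "shuffle" that groups
// pairs by key is simulated here; a real framework distributes it.
function map(docName, text) {
    return text.toLowerCase().split(/\W+/).filter(Boolean)
               .map(function(w) { return [w, 1]; });
}

function reduce(word, counts) {
    return counts.reduce(function(a, b) { return a + b; }, 0);
}

var docs = { d1: "big data is big", d2: "linked data" };
var grouped = {};
Object.keys(docs).forEach(function(name) {
    map(name, docs[name]).forEach(function(pair) {
        (grouped[pair[0]] = grouped[pair[0]] || []).push(pair[1]);
    });
});
Object.keys(grouped).forEach(function(w) {
    console.log(w, reduce(w, grouped[w]));   // big 2, data 2, is 1, linked 1
});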
NoSQL
NoSQL was originally a database that did not expose the standard SQL interface; the name was first used by Carlo Strozzi [14]. NoSQL databases work in conjunction with Hadoop to serve up discrete data stored among large volumes of multi-structured data to end-users and automated Big Data applications [15].
Digging useful information
Various companies have taken action to dig useful information out of the varied data on the web. For example, Splunk is a small company that has been in the business for less than five years. Splunk's mission is to make ambiguous big data more readable, useful and valuable to everyone. One of its partners, Amazon, has asked Splunk to find out the habits of its customers.
Another company, Jive, is a software company in the social business software industry. It is also trying to help its customers consolidate the big data they are dealing with. One example is pricing data for merchandise: what price should be set in order to be the best price.
Downsides
However, none of these approaches is perfect. For example, Hadoop is a very young technology that is still developing. The Hadoop system is hard to manage, and it does not support real-time data processing and analysis.
NoSQL, on the other hand, trades ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability in most implementations. It also suffers from its youth: there are no mature management and monitoring tools.
Conclusion
Key results
Big Data holds tremendous value, and it is beneficial to understand what it really means. Many new technologies, such as MapReduce and NoSQL, have been applied to this problem. However, it is never safe to say that we already have the perfect tools for the job. As data continues to boom exponentially, a new technology such as Linked Data may well be a key part of the next-generation analytics platform and data management system.
Shortcomings
Linked Data applications usually follow different architectures and patterns. For instance, one pattern requires the data to be replicated, so applications may work with stale data. Another pattern, the On-The-Fly Dereferencing Pattern, is very slow when dealing with complex operations.
Additional Work
Because Linked Data is a relatively new concept that is still undergoing a lot of improvement, some of these disadvantages will likely be addressed soon. However, the fact that we are in a data-exploding era cannot be reversed. More and more data are coming at us, and the technology must keep evolving in order to keep up with the pace.
Artificial intelligence can also be applied when dealing with Big Data. A 'databot' that can crawl Linked Data, infer relationships, and figure out what information can be extracted would certainly be useful.
Methodology
For this thesis, we are building a use case to explore the potential use of Linked Data in patent data. More specifically, we built a search engine named "Patent Graph". When people type a certain patent number or an inventor name, we show them relevant information such as the inventor's picture, workplace, alma mater, doctoral advisor and biography. This information is obtained from DBpedia, which provides structured data extracted from Wikipedia and makes it available on the web so that people can easily link to it. Users can also start a new search from the result simply by clicking related information on the page. For example, if we are interested in a co-worker or the advisor named in the patent we searched for, we can just click the name and a new search will run for that person and his or her patents. In addition, we can provide recommendations based on the search results. If time allows, we would also like to convert the patent data into RDF format and publish it on the web so that more people can benefit from it. In this way, Linked Data helps us analyze the patent data by expanding our patent datasets with related data and finding more useful information.
The patent data we use is the Patent Inventor Database from the Fung Institute. The database disambiguates all inventor names from the U.S. utility patent database from 1979 to 2010. The Linked Data source we use is DBpedia. The DBpedia dataset extracts structured content from the information created in Wikipedia, and it can be accessed online through a SPARQL query endpoint.
Since we are building a search engine that extracts information from both the Linked Data Cloud and a relational database, we built a web service around it using a Model-View-Controller (MVC) software architecture.
My part of the work includes implementing the search interface and querying the Linked Data Cloud. The techniques involved are SPARQL endpoint queries, HTTP requests and User Interface design.
SPARQL: query language for RDF data
The Resource Description Framework (RDF) is a directed, labeled graph data format for describing resources on the web. It is designed to be read and understood by computers rather than people. Most RDF documents are written in XML, which can easily be exchanged between different computers and platforms. RDF is also part of the Semantic Web. The Semantic Web is a set of standards and best practices for sharing data, and the semantics of that data, over the web for use by applications. [16] Rather than just putting data on the web, the Semantic Web is about making links so that a person or machine can explore the web of data. [17]
An RDF statement is a triple of the form (Subject, Predicate, Object), and RDF uses uniform resource identifiers (URIs) to name data objects. For example, to express "Tom is a man", we would write Tom (Subject), sex (Predicate), man (Object). The data stored in the Linked Data Cloud is RDF data.
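Purely as an illustration (this is our own ad-hoc encoding, not a standard RDF serialization such as RDF/XML or Turtle), triples can be pictured as simple tuples:

// Ad-hoc illustration: RDF triples as (subject, predicate, object)
// tuples. Real RDF uses URIs for globally unique names, as in the
// second triple below.
var triples = [
    ["Tom", "sex", "man"],
    ["http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://dbpedia.org/ontology/birthPlace",
     "http://dbpedia.org/resource/London"]
];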
SPARQL stands for SPARQL Protocol and RDF Query Language. SPARQL is a standard query language designed for querying RDF databases. There are four different query forms in SPARQL: SELECT, ASK, DESCRIBE and CONSTRUCT; we use the SELECT form most of the time. [18] The main idea of SPARQL is pattern matching, so it can easily traverse relationships by querying collections of triples. The syntax of SPARQL is quite similar to SQL. A simple SPARQL query example is as follows:
PREFIX dbont: <http://dbpedia.org/ontology/>
SELECT ?musician ?place
WHERE {
?musician dbont:birthPlace ?place .
}
First we need to declare a namespace, in this case http://dbpedia.org/ontology/. The query then finds all musicians and returns their birth places. A partial result is shown below; typing the query into the DBpedia endpoint returns the full list.
musician place
http://dbpedia.org/resource/Federico_Garc%C3%ADa_Lorca http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Trinidad_Jim%C3%A9nez http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ibn_Tufail http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Fran_Perea http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Ver%C3%B3nica_S%C3%A1nchez http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Berni_Rodr%C3%ADguez http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Jos%C3%A9_Celestino_Mutis http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Pepe_Marchena http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Antonio_de_Olivares http://dbpedia.org/resource/Andalusia
http://dbpedia.org/resource/Tanya_Anne_Crosby http://dbpedia.org/resource/Andalusia
SPARQL Endpoint query
Endpoint is an association between a fully specified Interface Binding and a network
address, specified by a URI. It is used to communicate with an instance of a Web
Service. An endpoint indicates a specific location for accessing a Web Service using a
15
specific protocol and data format. [19] A SPARQL endpoint enables users to query a
knowledge base via the SPARQL language. Results are typically returned in one or
more machine-‐processable formats like HTML. For simplicity, we can say that a
SPARQL endpoint is the place you send your SPARQL query and receive the result.
The commonly used SPARQL Endpoints are lists below (SparqlEndpoints, 2013):
Data Source Endpoint Address
DBpedia http://dbpedia.org/sparql
U.S. Census http://www.rdfabout.com/sparql
FactForge http://factforge.net/sparql
data.gov.uk http://data.gov.uk/sparql
In our project, we need to query the bio information of the patent inventor from DBpedia through a SPARQL endpoint query. The information about a certain person is the same as we often see in Wikipedia, but in a different format. For example, for our professor David A. Patterson, the Wikipedia page and the DBpedia page are shown below. They have quite different representations of the same information. In DBpedia, the data is machine-readable: each value is attached to the property on the left side. We just need to select the properties we need in the SPARQL query to get the corresponding values conveniently.
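As a sketch of the kind of query we send (the resource URI and the exact ontology property names here are assumptions that may vary between DBpedia releases), fetching the bio information for David A. Patterson could look like this:

PREFIX dbont: <http://dbpedia.org/ontology/>

SELECT ?abstract ?almaMater ?advisor ?thumbnail
WHERE {
    <http://dbpedia.org/resource/David_Patterson_(computer_scientist)>
        dbont:abstract ?abstract ;
        dbont:almaMater ?almaMater ;
        dbont:doctoralAdvisor ?advisor ;
        dbont:thumbnail ?thumbnail .
}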
Fig 3: Screenshot of an example of Wikipedia
Fig 4: Screenshot of an example of DBpedia
HTTP request
The Hypertext Transfer Protocol (HTTP) works as a request-response protocol between a client and a server. An HTTP request consists of a request method, a request URL, header fields and a body. The request methods are GET, HEAD, POST, PUT, DELETE, OPTIONS and TRACE. [20] The two commonly used methods are GET and POST. While they serve similar purposes, GET requests data from a specified resource, whereas POST submits data to be processed to a specified resource. We use the POST method here to avoid caching.
In our case, the client is the search interface, which submits an HTTP request in JavaScript to the server endpoint with the SPARQL query. The server then returns a response to the client containing status and content information. Since JavaScript is not good at dealing with raw RDF data, we set the return format to JSON.
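Concretely, a minimal sketch of such a request (the query string here is only a placeholder; the real queries are built from the user's search input):

// POST the SPARQL query as a URL-encoded form parameter and ask for
// the JSON results format via the Accept header.
var xhr = new XMLHttpRequest();
xhr.open('POST', 'http://dbpedia.org/sparql', true);
xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
xhr.setRequestHeader('Accept', 'application/sparql-results+json');
xhr.onreadystatechange = function() {
    if (xhr.readyState == 4 && xhr.status == 200) {
        // The W3C SPARQL JSON format puts one object per result row
        // in results.bindings.
        var rows = JSON.parse(xhr.responseText).results.bindings;
        console.log(rows);
    }
};
xhr.send('query=' + encodeURI(
    'SELECT ?p ?o WHERE { <http://dbpedia.org/resource/DBpedia> ?p ?o } LIMIT 5'));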
User Interface design
The User Interface (UI) design for our prototype is simple and clean; it looks like a simplified Wikipedia page. We query both the patent data and the Linked Data Cloud and display the output in the interface. The interface is structured as the patent information surrounded by information about the inventor of the patent. A screenshot is shown below:
Fig 5: Screenshot of the Patent Search Interface
The left side contains basic information, including the inventor's profile picture, workplace, alma mater and doctoral advisor. The upper right side is a biography of the inventor, followed by his patent information obtained from the relational database. Clicking a link on the left side leads to the corresponding Wikipedia page for more information. The UI design emphasizes the patent itself while arranging the relevant information around it.
The procedure
The procedure works as follows:
On the client side, when a user searches for a keyword, an HTTP request message is sent to the DBpedia web server. We wrote a wrapper class, "SPARQLWrapper.js", in JavaScript that is similar to the SPARQL Endpoint interface for Python. [21]
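A short usage sketch of the wrapper (its full source is listed in the Appendix):

// Build the query, then receive the JSON result text in a callback.
var wrapper = new SPARQLWrapper("http://dbpedia.org/sparql");
wrapper.setQuery(
    "PREFIX dbont: <http://dbpedia.org/ontology/> " +
    "SELECT ?place WHERE { ?musician dbont:birthPlace ?place . } LIMIT 10");
wrapper.query("json", function(responseText) {
    var rows = JSON.parse(responseText).results.bindings;
    // e.g. rows[0].place.value holds the first birth place URI
});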
The SPARQL endpoint is http://dbpedia.org/sparql. We send the request with the searched title and properties such as the abstract and workplace to the server endpoint. By default the endpoint returns an HTML page, which is not what we need, so we set the Accept field in the request header to specify the return data type; here we need the JSON format. We send the SPARQL query using the POST method.
The web server then provides the resources and returns a response message to the client. The response message is read by JavaScript, written into HTML and displayed in the User Interface.
For the patent data part, we have two main approaches. One approach is to keep the patent data in a relational database and query it locally. The other is to convert it to RDF format and store it in a triple store, or even publish it on the web. The first approach is efficient because we just need to obtain the patent information for the search keyword, which is quite convenient with a relational database. The remaining question is where to store the data: the whole dataset could be saved locally or uploaded to Google Datastore.
The second approach is more complex because we need to pre-process the whole dataset and convert it to RDF format. Since the patent data is quite large, many existing tools, such as Google Refine, cannot handle such a large amount of data. The advantage of the second approach is that the patent data could interlink with other Linked Data, making it more widely available.
Since the sheer amount of data is always a problem, we will begin with a small subset and go from there. For example, we can start with the patent data of Berkeley professors.
Discussion
In this section, I will mainly discuss the use case in which we bring Linked Data into patent search. I will also talk about how Linked Data helped in patent search, what its limitations are, and how Linked Data can be used in a broader context. I also evaluate the User Interface of the search engine and test it with real users.
Results
Explanation of Results
For our Capstone Project, we wanted to explore the potential use of Linked Data to help with Big Data analysis, and so we built a patent search engine based on these two concepts. Linked Data has many advantages: it is highly structured, machine-readable and interlinked across different data sources. We take advantage of its structured format to expand the patent search results and add more value to them. Our prototype supports the hypothesis that Linked Data works in this situation, and it suggests many other applications.
Here is our User Interface after searching for a certain patent:
Fig 6: Screenshot of a Patent Search Result
From the screenshot we can see that association information has been added to the patent search result. Here we add some wiki information for the inventor. In this way, a user can easily identify the exact inventor by looking at the biography or related information such as workplace, alma mater and doctoral advisor. This helps disambiguate patents, since many people share the same name but work in different areas and hold totally different patents.
Users can also search for the patents of co-workers by clicking their names on the page. And if users are interested in the workplace or alma mater, they can click the link to be taken to the Wikipedia page for that item.
With the help of Open Linked Data, we have a new kind of patent association search that disambiguates patent search results and provides a broader context of patent-related information.
What is different
We made some changes to our initial ideas during implementation. First, for the patent data, we kept its relational format and queried it with SQL rather than converting it into RDF. We did build a small prototype to convert the data using Google Refine, but the conversion becomes really complex with a large amount of data, and it is not necessary for our use case. So we decided to query the relational database directly and combine the result with the inventor information from Linked Data.
We also decided to keep the patent data locally and use PHP to query the relational database and send the result back to the client side in JSON format. We found this to be the most efficient approach at this stage. If time permits, we would put the data on a cloud server so that the search engine can run remotely.
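As a hypothetical sketch of this flow (the patents.php path, the field names and the render helpers are illustrative only, not the exact names in our code), the client fires both queries and merges the answers into the page:

// Hypothetical sketch: run the relational query (via a local PHP page
// returning JSON) and the DBpedia query in parallel, then render both.
function searchInventor(name) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', 'patents.php', true);        // illustrative path
    xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4 && xhr.status == 200) {
            renderPatents(JSON.parse(xhr.responseText));  // patent rows from SQL
        }
    };
    xhr.send('inventor=' + encodeURIComponent(name));

    var sparql = new SPARQLWrapper('http://dbpedia.org/sparql');
    sparql.setQuery(
        'PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> ' +
        'SELECT ?abstract WHERE { ?person rdfs:label "' + name + '"@en ; ' +
        '<http://dbpedia.org/ontology/abstract> ?abstract . }');
    sparql.query('json', function(text) {
        renderBio(JSON.parse(text).results.bindings);     // bio rows from DBpedia
    });
}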
Limitations of this approach
There are also some limitations to our patent search.
Firstly, we assume that the inventor has a Wikipedia page so that we can find the corresponding information in DBpedia. However, this will not always be the case. Although more and more people have their own Wikipedia page, not every patent holder does. In such cases, we cannot find their information in the Linked Data Cloud, which causes a problem.
Secondly, the user needs to type the full name of the inventor in order to match both the name in DBpedia and the inventor name in the patent database. Compared with Google Patent Search this is limiting, because Google can find a lot of information through relevance ranking even if we do not type the full name.
Thirdly, we use the patent data in its original format and run two queries, one against DBpedia and one against the relational database. This does not make the best use of Linked Data, because the advantage of Linked Data over other formats is that everything shares the same format and different datasets can be interlinked. It would be better to eventually convert the patent data into RDF format and even publish it into the Open Linked Data Cloud. The patent data would then be interlinked with all the other data sources in the cloud, making better use of the Linked Data concept.
Evaluation
In the evaluation part, I will mainly discuss the User Interface we built for patent search and the effectiveness and convenience of the search experience for real users.
User Study
We asked people from different areas to take part in usability experiments with the search engine and made some changes based on their feedback.
Most of them think the patent association search result is better than the traditional approach. They often run into the problem of whether they have found the right patent when searching. With our prototype they can easily see the inventor's information and therefore get a correct and comprehensive understanding of what they retrieve.
They think our patent search gives clear output with the associated information and that it supports follow-up searches well. But they also pointed out a limitation of the approach: we only have basic information about the patent itself. If users would like to know details of a patent, we cannot provide them, because that information is not in the patent database.
Heuristic Evaluation
We examined our User Interface against the well-known 10 Usability Heuristics introduced by Jakob Nielsen, a usability engineering method for finding the usability problems in a user interface design. [22] A small set of evaluators examined the interface against the recognized usability principles, scoring each from one to ten, and we combined the results of the evaluation.
We asked our users to go through a set of tasks we designed for the search interface, gave the evaluators the goals of the system, and allowed them to carry out their own tasks. Afterwards, they filled out the Heuristic Evaluation sheet.
The Heuristic Evaluation sheet is designed as follows:

Heuristic Evaluation principle                                Points (1-10)    Comments
Visibility of system status
Match between system and the real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation
We analyzed the results the real users provided and interpreted the evaluation. A principle is rated Good if its average score is above 6 out of 10; otherwise it is rated Needs Improvement.
(1). Visibility of system status: Good (8.7)
Our interface has a clear layout, and different components do not run into each other when displayed. Users can easily see whether they have obtained a search result and what the information looks like.
(2). Match between system and the real world: Good (8.2)
The interface resembles the Wikipedia format for showing bio information, and the patent is placed at the front of the page, so it is easy to understand.
(3). User control and freedom: Good (7.1)
Users can search for a new patent using the textbox in the upper left corner or simply by clicking information on the page.
(4). Consistency and standards: Needs improvement (5.8)
The search textbox only accepts existing patent numbers and some inventor information, so users may at first be confused about what to enter.
(5). Error prevention: Needs improvement (5.0)
We did not build auto-completion or auto-correction, so users need to type correctly in order to get a result.
(6). Recognition rather than recall: Good (7.5)
We minimized the user's memory load by making objects and actions visible. Users do not have to remember information; they can simply click items in the previous result.
(7). Flexibility and efficiency of use: Good (7.2)
The difference between novice and expert users is small because no complicated actions are needed for the search feature.
(8). Aesthetic and minimalist design: Good (6.8)
The interface contains the most relevant and needed information and de-emphasizes extra information with low visibility.
(9). Help users recognize, diagnose, and recover from errors: Needs improvement (5.7)
If users type a name that does not exist in Wikipedia, or make a typo, there are no error messages to indicate the problem precisely.
(10). Help and documentation: Needs improvement (5.5)
We did not implement documentation to help users understand the functionality of the search engine. Normally people will understand it anyway, because the interface looks like other search engines.
Future Work
Enriching the functionality of the Patent Search
So far we have focused only on combining Linked Data and relational data to make patent search more convenient, so we use only limited information collected from a single source in the Open Linked Data Cloud. In fact, there is much more we could do to enrich the functionality of the Patent Search. For example, we could take the geographic information in the patent data and build visualizations using the GeoNames data from the Linked Data Cloud. Or we could visualize a patent search graph to show the relationships between different inventors and their patents more explicitly.
Querying a Collection of Datasets in Linked Data
We query data only from DBpedia in this project. But since Linked Data is interlinked, we may be able to query a collection of datasets using an existing SPARQL endpoint with access to a set of copies of the relevant datasets. For example, OpenLink SW hosts a majority of the datasets from the LOD cloud behind a SPARQL endpoint. [23]
Applying the concept to other topics
Currently we combine the patent data with DBpedia from the Linked Data Cloud. There are many other sources in the Linked Data Cloud we could use, such as GeoNames data, IMDB data, BBC music and so on. We could make use of these sources and find other applications. For example, we could search for a certain singer and get the relevant biographical information along with their albums and songs from different data sources.
Conclusions or Impact Statement
Our capstone project is a research project exploring the potential use of Linked Data in Big Data. We did some research on Big Data, learning the existing approaches to analyzing it and their strengths and weaknesses. We concluded that highly structured Linked Data might be a potential solution for unstructured Big Data analytics, helping to dig out more of the value behind Big Data.
Based on that, we built a search engine demonstrating how Linked Data helps in Big Data analysis by expanding the patent data with the Open Linked Data Cloud. In this way, we can find patent association information through the Linked Data Cloud and combine it with the patent search to give a comprehensive answer.
Although we have learned a lot about the mechanics of Linked Data and used it in our prototype, something remains to be learned. For example, we currently query a single source from the Linked Data Cloud; we could explore multiple queries across different sources, or directly convert the patent data into RDF format and publish it in the Linked Data cloud.
The strength of Linked Data is its structured, uniform format, which allows information to be shared among different datasets and read automatically by computers. Yet we still need to address the drawbacks, such as complicated pre-processing procedures and how to protect the data made available on the web.
Our prototype has shown that Linked Data has many advantages and can be used in data analysis in different situations. We can see a bright future for making better use of Linked Data and the Semantic Web to help in Big Data analysis.
Bibliography

1. Ian Mitchell, Mark Wilson. Linked Data: Connecting and exploiting big data. London: Fujitsu UK, 2012.
2. Dumbill, Edd. What is big data? An introduction to the big data landscape. [Online] January 11, 2012. http://strata.oreilly.com/2012/01/what-is-big-data.html.
3. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
4. Roebuck, Kevin. Big Data: High-impact Strategies – What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. Lightning Source Incorporated, 2011.
5. Richard Cyganiak, Anja Jentzsch. Linking Open Data cloud diagram. [Online] 2011. http://lod-cloud.net/.
6. Hopkins, Brian and Evelson, Boris. Expand Your Digital Horizon with Big Data. Forrester, 2011.
7. Oreskovic, Alexei. YouTube, Google Inc's video website, is streaming 4 billion online videos every day, a 25 percent increase in the past eight months, according to the company. [Online] Jan. 23, 2012. [Cited: Nov. 30, 2012.] http://www.reuters.com/article/2012/01/23/us-google-youtube-idUSTRE80M0TS20120123.
8. Twitter Engineering. The Engineering Behind Twitter's New Search Experience. [Online] May 31, 2011. [Cited: Nov. 30, 2012.] http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.
9. O'Brien, Kevin. Why Media Literacy? A Catholic Reflection. [Online] [Cited: Nov. 30, 2012.] http://www.medialit.org/reading-room/why-media-literacy-catholic-reflection.
10. Woodward, Alys, et al. IDC European Software Predictions. IDC, 2012.
11. Woo, Benjamin, et al. IDC Worldwide Big Data Taxonomy. IDC, 2011.
12. Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004, p. 13.
13. Jablonski, Joey. Introduction to Hadoop. Fremont: Dell Inc., 2011.
14. Adam Lith, Jakob Mattsson. Investigating storage solutions for large data. Göteborg: Chalmers University of Technology, 2010.
15. Kelly, Jeff. Big Data: Hadoop, Business Analytics and Beyond. Nov. 8, 2012.
16. DuCharme, Bob. Learning SPARQL. O'Reilly, 2011.
17. Berners-Lee, Tim. Linked Data Design Issues. [Online] June 18, 2009. http://www.w3.org/DesignIssues/LinkedData.html.
18. Matthews, Andrew. Understanding SPARQL. [Online] 2008. http://www.ibm.com/developerworks/xml/tutorials/x-sparql/section3.html.
19. SPARQL endpoint. [Online] 2011. http://semanticweb.org/wiki/SPARQL_endpoint.
20. HTTP Requests. [Online] http://docs.oracle.com/javaee/1.4/tutorial/doc/HTTP2.html.
21. Ivan Herman, Sergio Fernandez, Carlos Tejo. SPARQL Endpoint interface to Python. [Online] 2008. http://sparql-wrapper.sourceforge.net/.
22. Nielsen, Jakob. 10 Usability Heuristics for User Interface Design. [Online] 1995. http://www.nngroup.com/articles/ten-usability-heuristics/.
23. Hartig, Olaf. Querying Linked Data with SPARQL. [Online] 2009. http://www.slideshare.net/olafhartig/querying-linked-data-with-sparql.
24. Public Data Sets on AWS. [Online] http://aws.amazon.com/publicdatasets.
25. SparqlEndpoints. [Online] 2013. http://esw.w3.org/topic/SparqlEndpoints.
Appendix

Here I will list the code snippet described in the Methodology section.

SPARQLWrapper.js

(function(root, factory) {
    if (typeof define === "function") {
        define("SPARQLWrapper", factory);   // AMD || CMD
    } else {
        root.SPARQLWrapper = factory();     // <script>
    }
}(this, function() {
    'use strict'

    // Wrap a SPARQL endpoint URL; queries are sent to it over HTTP POST.
    function SPARQLWrapper(endpoint) {
        this.endpoint = endpoint;
        this.queryPart = "";
        this.type = "json";
    }

    SPARQLWrapper.prototype = {
        constructor: SPARQLWrapper,

        // Store the query as a URL-encoded form parameter.
        setQuery: function(query) {
            this.queryPart = "query=" + encodeURI(query);
        },

        setType: function(type) {
            this.type = type.toLowerCase();
        },

        // Can be called as query(callback) or query(type, callback).
        query: function(type, callback) {
            callback = callback === undefined ? type : this.setType(type) || callback;
            var xhr = new XMLHttpRequest();
            xhr.open('POST', this.endpoint, true);
            xhr.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
            // Map the requested result type onto the matching Accept header.
            switch (this.type) {
                case "json":
                    type = "application/sparql-results+json";
                    break;
                case "xml":
                    type = "application/sparql-results+xml";
                    break;
                case "html":
                    type = "text/html";
                    break;
                default:
                    type = "application/sparql-results+json";
                    break;
            }
            xhr.setRequestHeader("Accept", type);
            xhr.onreadystatechange = function() {
                if (xhr.readyState == 4) {
                    var sta = xhr.status;
                    if (sta == 200 || sta == 304) {
                        callback(xhr.responseText);
                    } else {
                        console && console.error("Sparql query error: " + xhr.status + " " + xhr.responseText);
                    }
                    // Drop the handler and the reference so the XHR can be collected.
                    window.setTimeout(function() {
                        xhr.onreadystatechange = new Function();
                        xhr = null;
                    }, 0);
                }
            }
            xhr.send(this.queryPart);
        }
    }

    return SPARQLWrapper;
}));