Semantic Based Multimedia Analysis and Retrieval

26th IEEEP Students Seminar 2011

Pakistan Navy Engineering College

National University of Sciences & Technology

Sana Aslam, Mahwash Makhdoom, Madeeha Khan, Amna Basharat

FAST National University of Computer & Emerging Sciences

Department of Computer Sciences

AK Brohi Road, H-11/4 Islamabad, Pakistan

[email protected], [email protected], [email protected], [email protected]

Abstract: The volume of multimedia files is increasing day by day. With the growth of religious e-learning and multimedia knowledge resources in particular, there is a strong demand for effective search and retrieval methods in this area. Many renowned scholars from all over the world deliver lectures in which they discuss diverse topics, and these lectures are usually hours long; navigating manually to a particular segment within such a file is therefore difficult and time consuming. Efficient search methods are needed that let the user select the topic of interest within a multimedia file with a single click. Moreover, users should be able to query in natural language instead of relying on keyword based search, so that retrieval improves and results reflect content and context. In this paper we propose a method that enables users to navigate to a particular segment within a multimedia file. The architecture of Semantic based Multimedia Analysis and Retrieval (SMART) allows content indexing and time-stamped alignment of transcriptions with the multimedia file. It incorporates natural language processing techniques for query handling and modern semantic web technologies for search. Search will be handled by modeling the knowledge base with ontologies. The advantage of ontologies is that they allow machine interoperability, which in turn helps to retrieve relevant results. The architecture envisions combining natural language processing techniques with modern semantic web technologies and applying them in a domain that opens new ways of knowledge sharing and information retrieval.

Keywords: Natural Language Processing, Transcription Alignment, Islamic Scholarly Lectures, Multimedia Segment Navigation, Ontology

I. INTRODUCTION

As users' needs change, new ways of storing information have been introduced and developed. Advances in e-learning have made multimedia resources a very valuable source of knowledge and information. Current search engines do not let users search for a particular segment within a media file; search is instead performed on the text associated with the media file. The major issue with this process is that the search results have high recall but low precision [1], whereas contemporary users demand efficient and precise information. Multimedia data is usually detailed and lengthy, and a single file often covers several topics, so manually searching a file for a particular topic or for the answer to a particular question is tedious.

As the field has advanced, users increasingly expect the most relevant, accurate and precise data in response to their queries. Keyword based search provides hundreds or thousands of hits, and it becomes frustrating and tiresome for users to find the relevant content among them. To overcome this problem, search techniques are needed that let users retrieve the most accurate results by querying in natural language and that also provide them with the exact content matching their search criteria. A number of search engines are currently working to incorporate semantics in their architecture, e.g. Hakia [2] is a semantic based image retrieval search engine, and True Knowledge [3] lets users query in natural language and returns a precise answer. Still, significant effort is needed to incorporate semantics for efficient retrieval of multimedia resources, especially in the domain of Islamic scholarly lectures.

The architecture of SMART uses the ELAN tool to align transcriptions with the multimedia file and then uses modern semantic technologies to annotate that data so that relevant data is retrieved [4]. Studies have shown that the use of semantics can strengthen e-learning [7, 8] and can support knowledge virtualization.

In SMART we have transcribed the media file and then time-aligned the media file with the corresponding text. Metadata containing information about the file is attached to the media file. A knowledge base is attached to the system. The tags are generated with the help of the transcribed file and the knowledge base. The search is performed using the tags, and the media file is returned to the user, positioned at the segment the user asked for. To test our approach, we have chosen Islamic scholarly lectures as our target domain; the domain concepts represented in the ontologies therefore concern the views of different Islamic scholars. The main aim of SMART is to open new ways of incorporating technology in the Islamic world and to bridge the gap between the two, helping people understand the concepts within the religion with ease.

Structure of this paper: Section 2 describes the motivation behind the project and the challenges associated with working in this domain. Section 3 describes the design goals and the detailed architecture of SMART. Section 4 discusses the implementation details. Conclusions and future work are discussed in Section 5.

II. MOTIVATION BEHIND WORKING IN THIS DOMAIN AND CHALLENGES

A. Motivation

The motivating factors behind carrying out this project are:

To enable users to retrieve relevant information from multimedia files. We would achieve this by pruning the irrelevant results. The search results provided would be fewer but more precise and relevant.

To enable users to get the views of different scholars on diverse topics in less time by reducing the query time, and to let them query in natural language.

Previously, keyword based search has been performed on text resources and multimedia lectures, but semantic based search techniques have not been applied to multimedia files until now. We hope that SMART will open new ways of efficient and meaningful search in this area.

B. Challenges Associated

The challenges associated with this project are as follows.

One of the basic tasks is to convert the multimedia file into text. One way to achieve this is through speech recognition, but when tested, its results were not up to the mark because the domain contains words that are not frequently spoken in English, and the videos also contained many Arabic words. The accuracy achieved through speech recognition was between 35% and 45%, which was too low to work with, since the search process depends on the text associated with the media file.

Another challenge is that no previous work has been done in this area. Moreover, there is complexity resulting from the many diverse views of different scholars on the same topic, so linking them and providing the user with sound results is a challenging task in this domain.

III. SMART DESIGN GOALS AND SMART ARCHITECTURE

A. Design Goals

Existing search engines retrieve multimedia content based on the tags associated with a file, such as its title and the name of the speaker. Letting the user navigate efficiently to a particular segment of interest is a challenging and much-needed capability. Prior studies indicate that little work has been done in this respect. This research proposes an architecture that lets the user navigate to a particular segment within the media file by querying in natural language. The purpose of natural language querying is to target and return more precisely the content the user is looking for and to prune results that are of no use to the user. The goals on which the architecture of SMART is based are elaborated below.

1) Processing of Textual Content of Multimedia Resources:

SMART should be able to process the textual content, i.e. the transcriptions associated with a multimedia file, and align it with the file efficiently so that timestamps are acquired for the text. This time-stamped information links the text of the file to the multimedia content.

2) Automatic Tagging of New Multimedia Files Added in Repository:

The knowledge base comprises the most commonly occurring terms in this domain. Whenever a new file is added to the repository, SMART should automatically tag the segments of the file that contain those domain terms and save their timestamps.

3) Natural Language Processing:

This research aims to let the user query in natural language, i.e. as sentences in English. SMART should be able to process the query and apply natural language processing techniques [9] to its structure so that the meaning of the query is extracted and precisely associated with the text of the media.

4) Ontological Knowledge Model:

To enable efficient search, data must be modeled as knowledge so that an ontological model can be designed for the particular domain of Islamic scholarly lectures. This ontological model would represent information such as speaker, topic, etc.

5) Intelligent Retrieval of Information:

SMART should retrieve meaningful results by pruning the irrelevant hits and providing the user with only the most precise results.

B. High Level System Architecture of SMART

The high level system architecture of SMART is shown in Figure 1. It comprises five major activities: Transcription Alignment, Query Processing, Knowledge Extraction, Knowledge Modeling, and Query-Result Accuracy Analysis. In the first phase, the multimedia resources are aligned with their transcriptions; the aligned resources form the repository of SMART. The Query Processing unit applies natural language processing techniques to the user query and extracts its meaning so that it can be mapped accurately onto the transcribed data. The Knowledge Extraction unit comprises metadata generation and tag generation. In metadata generation, the time-aligned transcription is parsed so that descriptive information about the data, i.e. its metadata, is attached to it. In tag generation, tags are generated on the parsed data to enable the navigation process. Knowledge Modeling builds the ontology models for religious scholarly texts using the ontology schemas, and the ontology repository stores the ontologies. The ontology repository and the metadata repository together form the knowledge base of the system, which is used for search and retrieval. Semantic web technologies are incorporated to speed up the search process and reduce the response time of the overall process. In the final phase, Query-Result Accuracy Analysis, the compatibility of the query with the extracted result and the accuracy of that result are determined using natural language techniques and by analyzing the thematic coherence between the query and the results. Finally, the results are returned to the user and displayed through the user interface of the SMART application. The results comprise a list of speakers and their associated files, together with the timestamps at which the answer to the user query occurs within the multimedia stream.

Figure 1: High level System Architecture for Multimedia Segment Navigation

The detailed architecture of SMART, on which the design and implementation details are based, is discussed in the following subsections:

1) Transcription Alignment Unit:

As discussed above, a challenge associated with SMART is converting the audio into text. The complexity lies in the fact that the scholarly lectures, although in English, contain Arabic terms and some Urdu terms, so speech recognition cannot achieve adequate accuracy, and the risk of errors cannot be neglected in a domain as sensitive as religious lectures. For this reason, manual transcriptions are generated for each multimedia file. The transcriptions are then aligned with the multimedia stream using ELAN, an open source tool for aligning transcriptions with multimedia. The importance of this unit lies in the fact that the more accurate the alignment, the more efficient the search process. ELAN aligns transcriptions along with the timestamps that are required for segment navigation [5].

2) Knowledge Extraction Unit:

This unit consists of two components: Metadata Generation and Tag Generation. A brief description of both components follows:

3) Metadata Generation:

This component is responsible for generating the metadata of the multimedia files. The metadata holds information about a file, such as its name, its topic and its location [6].

4) Tag Generation:

The tag generation unit is responsible for generating tags in the multimedia file by identifying useful tags with the help of the knowledge base.

5) Query Processing Unit:

The user query would be passed to the Query Processing Engine, which would extract useful keywords from it. Here, natural language processing techniques would be applied to the user query to understand its syntax and semantics.
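The keyword extraction step can be illustrated with a minimal sketch. It assumes a plain stopword filter, which is only a stand-in for the natural language processing techniques the paper envisions; the stopword list and the function name `extract_keywords` are hypothetical and not part of SMART.

```python
import re

# Hypothetical, deliberately small stopword list (an assumption for this
# sketch); a real system would use a fuller list or an NLP toolkit.
STOPWORDS = {
    "what", "is", "the", "of", "in", "a", "an", "on", "about",
    "does", "do", "say", "to", "and", "how", "are", "for",
}


def extract_keywords(query: str) -> list:
    """Return the distinct non-stopword tokens of a query, in order."""
    tokens = re.findall(r"[a-z']+", query.lower())
    seen = set()
    keywords = []
    for token in tokens:
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords


if __name__ == "__main__":
    print(extract_keywords("What is the importance of Zakat in Islam?"))
```

A filter of this kind only recovers candidate keywords; it does not analyze the syntax or semantics of the query, which is the part the Query Processing Unit is meant to add.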

6) Knowledge Base:

The knowledge base would contain the most frequently used words in the Islamic domain. With the help of these words, the tags for a particular video would be generated.

7) Knowledge Modeling Unit:

In this unit, the data will be transformed into knowledge models with the help of the generated metadata and the use of ontologies. The ontological model of the data will be stored in this unit; it comprises schemas that incorporate semantics in the data for efficient search and retrieval. The user query from the Query Processing Unit will be mapped onto the data to find the exact location holding the answer to that query. Because the data has already been shaped into ontologies, the overall process of knowledge extraction and acquisition is faster. The use of ontologies facilitates machine interoperability and conceptualizes the domain in a format that is understood by the machine [10].
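As a rough illustration of how an ontological model of speakers and topics can be queried, the sketch below stores facts as subject-predicate-object triples in plain Python. This is an assumption made for illustration: the paper does not specify its ontology language or schema, and the predicate names (`hasSpeaker`, `hasTopic`, `subTopicOf`) and sample facts are hypothetical. A real system would more likely use RDF or OWL tooling.

```python
# Hypothetical facts; identifiers and predicates are illustrative only.
TRIPLES = {
    ("lecture_a", "hasSpeaker", "Speaker A"),
    ("lecture_a", "hasTopic", "Zakat"),
    ("lecture_b", "hasSpeaker", "Speaker B"),
    ("lecture_b", "hasTopic", "Zakat"),
    ("Zakat", "subTopicOf", "Charity"),
}


def match(triples, s=None, p=None, o=None):
    """Return all triples that agree with the given (possibly empty) pattern."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def speakers_on(triples, topic):
    """Find every speaker who has a lecture on the given topic."""
    lectures = [subj for subj, _, _ in match(triples, p="hasTopic", o=topic)]
    return sorted(
        {obj for lec in lectures for _, _, obj in match(triples, s=lec, p="hasSpeaker")}
    )


if __name__ == "__main__":
    print(speakers_on(TRIPLES, "Zakat"))
```

Representing the domain as explicit relations is what lets a query such as "which scholars spoke on this topic" be answered by following links rather than by matching strings.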

8) Query-Result Compatibility Analysis Unit:

In this unit, the query and its corresponding mapping obtained in the data would be verified. This is necessary because the domain under consideration is very sensitive, and there is a risk in returning results to the user without proper validation and verification. From this unit, the verified results would be returned to the user interface of the SMART application.
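One simple way to picture such a verification step is a coverage check: a candidate segment is kept only if it mentions enough of the query keywords. The sketch below is a stand-in under that assumption and is not the verification method of SMART, which the paper leaves unspecified; the function names and the 0.5 threshold are hypothetical.

```python
def coverage(query_keywords, segment_text):
    """Fraction of query keywords that occur in the segment text."""
    wanted = {k.lower() for k in query_keywords}
    if not wanted:
        return 0.0
    present = set(segment_text.lower().split())
    return len(wanted & present) / len(wanted)


def verify(query_keywords, candidate_segments, threshold=0.5):
    """Keep only the candidates whose keyword coverage meets the threshold."""
    return [
        segment for segment in candidate_segments
        if coverage(query_keywords, segment) >= threshold
    ]


if __name__ == "__main__":
    candidates = [
        "the importance of zakat in daily life",
        "fasting in ramadan",
    ]
    print(verify(["zakat", "importance"], candidates))
```

Word overlap is a weak proxy for thematic coherence; it is shown here only to make the position of the check in the pipeline concrete.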

IV. IMPLEMENTATION

The implementation of the SMART architecture is divided into two modules. The first is the navigation strategy; the second is the use of NLP techniques together with the incorporation of semantics. So far we have implemented the first module, i.e. keyword based segment navigation within the multimedia stream. The subsections of this module are elaborated as follows:

A. Navigation Strategy

It comprises transcription generation, alignment with the multimedia stream, and keyword based navigation of the multimedia stream. The three subsections are discussed in detail as follows:

1) Transcription Generation:

Before discussing transcription generation in detail, it is important to understand why we use transcriptions when many speech recognition engines are available. The domain we are targeting contains Arabic concepts and some Urdu terms even when the whole lecture is in English. This raises the challenge of recognizing a multilingual stream, which to date cannot be done accurately and efficiently. Another issue is that the speech recognition engines available today rely on machine learning and must be trained on a speaker's voice; this approach works well only if there is a single speaker. Moreover, because of the diversity in dialects and pronunciation across speakers, it is very difficult to recognize the spoken words accurately [11]. Training on a single speaker would also restrict the domain to that speaker, which would not benefit users who want to know the views of different scholars on the same topic; in a domain this sensitive, users need different views to understand the concepts within the religion. To deal with these issues, we have transcribed the multimedia files.

2) Alignment with Multimedia Stream:

Navigation within the stream is possible only if we have time-stamped information for the topics discussed in the multimedia file. This requires aligning the transcriptions with the multimedia stream. We have used the ELAN tool for this purpose. ELAN (the EUDICO Linguistic Annotator) is a program for aligning transcriptions with, and adding annotations to, a video file. ELAN aligns the transcriptions with the media file and returns the time-stamped data, i.e. the words spoken in the video along with the times at which they were spoken [5].
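To make the time-stamped output concrete, the following sketch reads aligned segments from ELAN's XML annotation format (EAF), in which `TIME_SLOT` elements carry millisecond values and each `ALIGNABLE_ANNOTATION` refers to a start and an end slot. The paper does not say which export format was used, so reading EAF directly is an assumption, and the sample document and function name `read_aligned_segments` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAF document used only for this sketch.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="4200"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="9000"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcript" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Today we discuss Zakat</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>Fasting in Ramadan</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_segments(eaf_xml):
    """Return (start_ms, end_ms, text) tuples sorted by start time."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot id to its millisecond value.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    segments = []
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        start = slots.get(ann.get("TIME_SLOT_REF1"))
        end = slots.get(ann.get("TIME_SLOT_REF2"))
        text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
        if start is not None and end is not None:
            segments.append((start, end, text))
    return sorted(segments)


if __name__ == "__main__":
    for start, end, text in read_aligned_segments(SAMPLE_EAF):
        print(start, end, text)
```

Each tuple links a stretch of transcript to a position in the media, which is the information segment navigation needs.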

3) Keyword Based Navigation:

In this unit of the architecture, keyword based search is implemented. Here we discuss in detail the working of the Knowledge Extraction unit of the SMART architecture. The knowledge extraction unit consists of two subunits: the tag generation unit, and a unit that attaches the metadata to the corresponding media file. The tag generation unit takes the time-aligned data file as input. The tags are generated with the help of the knowledge base, which consists of the most commonly used words in the Islamic domain, so that tags relevant to the multimedia files can be added. A tagged repository consisting of metadata is maintained. The metadata associated with a transcription records the keywords from the knowledge base that occur in it, their corresponding timestamps, and the path of the multimedia file containing those keywords. Figure 2 shows the detailed working of tag generation.

Figure 2: Detailed Architecture of the Tag Generation Unit
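The tag generation step described above can be sketched as a lookup of each time-aligned word in the knowledge base. The sketch assumes that the aligned data is available as (word, start time in milliseconds) pairs and that matching is a case-insensitive exact comparison; the `Tag` record, the function name and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One metadata entry: a keyword, when it is spoken, and in which file."""
    keyword: str
    start_ms: int
    media_path: str


def generate_tags(aligned_words, knowledge_base, media_path):
    """Tag every aligned word that is a known domain term."""
    known = {term.lower() for term in knowledge_base}
    return [
        Tag(word.lower(), start_ms, media_path)
        for word, start_ms in aligned_words
        if word.lower() in known
    ]


if __name__ == "__main__":
    aligned = [("Today", 0), ("we", 400), ("discuss", 650),
               ("Zakat", 1200), ("and", 1800), ("fasting", 2100)]
    domain_terms = {"zakat", "fasting", "hajj"}
    for tag in generate_tags(aligned, domain_terms, "lectures/example.mp4"):
        print(tag)
```

Storing the file path next to every keyword and timestamp is what later allows a hit to be turned directly into a playback position.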

With the tagged information acquired, it is now possible to navigate to a particular segment within the multimedia file. The detailed algorithm of the implemented search process is given below.

ALGORITHM 1: NAVIGATION WITHIN MULTIMEDIA STREAM

1. Initialize UserQuery to U
2. Initialize SelectedSpeaker to S
   Input the user query
3. if the user selects the speaker
4.     Store speaker name in Temp
5.     Retrieve the names of multimedia files of the corresponding speaker from the meta-data file
6.     Search the database for the USERQUERY WHERE multimedia file name belongs to retrieved list
7.     Retrieve the results, their corresponding timestamps and multimedia file names
8. else
9.     Search the database for the USERQUERY
10.    Retrieve the results, their corresponding timestamps and multimedia file names
11. Display the retrieved results to the users
12. User selects the multimedia file and plays it

The workflow of the components on the basis of the algorithm is shown in Figure 3.

Figure 3: Detailed Architecture of the Search Engine
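The search in Algorithm 1 can be sketched with an in-memory SQLite database. The table layout, column names and sample rows are assumptions made for this sketch, since the paper does not describe its storage schema; the speaker branch is also collapsed into a single join instead of first retrieving the list of file names.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE metadata (media_file TEXT PRIMARY KEY, speaker TEXT, topic TEXT);
        CREATE TABLE tags (keyword TEXT, start_ms INTEGER, media_file TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [("lecture_a.mp4", "Speaker A", "charity"),
         ("lecture_b.mp4", "Speaker B", "charity")],
    )
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?, ?)",
        [("zakat", 95000, "lecture_a.mp4"),
         ("zakat", 12000, "lecture_b.mp4"),
         ("fasting", 300000, "lecture_b.mp4")],
    )
    return conn


def navigate(conn, user_query, selected_speaker=None):
    """Return (media_file, start_ms) hits for a keyword query."""
    keyword = user_query.strip().lower()
    if selected_speaker is not None:
        # Steps 3-7: restrict the search to files of the chosen speaker.
        sql = (
            "SELECT t.media_file, t.start_ms FROM tags t "
            "JOIN metadata m ON m.media_file = t.media_file "
            "WHERE t.keyword = ? AND m.speaker = ? "
            "ORDER BY t.media_file, t.start_ms"
        )
        params = (keyword, selected_speaker)
    else:
        # Steps 8-10: search all files.
        sql = (
            "SELECT media_file, start_ms FROM tags WHERE keyword = ? "
            "ORDER BY media_file, start_ms"
        )
        params = (keyword,)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    # Step 11: display the hits; a player would then seek to start_ms.
    for media_file, start_ms in navigate(db, "Zakat", "Speaker B"):
        print(media_file, start_ms)
```

Each returned timestamp is the offset at which a media player would start playback once the user selects a result.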

B. Incorporation of Semantics

The second module of implementation of SMART includes

incorporation of semantics in the architecture for efficient

search and using NLP techniques for query processing.

V. CONCLUSIONS & FUTURE WORK

In this paper we have proposed a potentially powerful and novel approach for the retrieval of multimedia information. The crux of our innovation is a procedure through which a particular segment can be retrieved from a multimedia file. We have used the domain of Islamic scholarly lectures for demonstration, but the approach can be generalized and applied to other domains as well. Moreover, speech recognition does not prove to be a viable approach in this domain because of the variety of words spoken in different languages within a single lecture, which is why working from transcriptions is necessary for effective search. Combined with modern semantic technologies, we are hopeful that SMART will prove to be an efficient and effective search engine for multimedia files in comparison with other semantic based search engines.

Although we are confident that the conceptual framework of this project is sound and that its implementation is feasible from a technical standpoint, some important aspects remain to be covered in future work. These include adding semantics to achieve efficiency and effectiveness when retrieving results.

Moreover, until now we have been working with videos in English. Later we would like to incorporate videos in Urdu as well, since the domain has a vast amount of data in Urdu that could be used as a valuable resource of knowledge and information. In future we will also work on user studies and evaluation.

REFERENCES

[1] Latifur, Dennis, "Effective Retrieval of Audio Information from Annotated Text Using Ontologies," ACM SIGKDD Workshop on Multimedia Data Mining, Boston, MA, August 2000.
[2] http://www.hakia.com [Accessed 15 September 2010]
[3] http://www.trueknowledge.com/ [Accessed 28 October 2010]
[4] Y. Xiao, M. Xiao, and F. Zhang, "Agents-Based Intelligent Retrieval Framework for the Semantic Web," in Proc. WiCom, 2007, pp. 5357-5360.
[5] http://www.lat-mpi.eu [Accessed 15 November 2010]
[6] R. Guenther and J. Radebaugh, "Understanding Metadata," Bethesda: NISO, 2004.
[7] Y. Li and M. Dong, "Towards a Knowledge Portal for E-Learning Based on Semantic Web," in Proc. 8th IEEE Int. Conf. on Advanced Learning Technologies, ICALT'08, 2008, pp. 910-912.
[8] N. Henze, P. Dolog, and W. Nejdl, "Reasoning and Ontologies for Personalized E-Learning in the Semantic Web," Educational Technology & Society, pp. 82-97.
[9] O. Kucuktunc, U. Gudukbay, and O. Ulusoy, "A natural language-based interface for querying a video database," IEEE MultiMedia, 14(1):83-89, 2007.
[10] H. Alani, S. Kim, D.E. Millard, M.J. Weal, W. Hall, P.H. Lewis and N.R. Shadbolt, "Automatic Ontology-Based Knowledge Extraction from Web Documents," Proc. IEEE Intelligent Systems, 2003, pp. 14-21.
[11] M. Forsberg, "Why is Speech Recognition Difficult," Chalmers University of Technology, 2003.