
Audio-based Queries for Video Retrieval over Java Enabled Mobile Devices

Iftikhar Ahmad (a), Faouzi Alaya Cheikh (b), Serkan Kiranyaz (b) and Moncef Gabbouj (b)

(a) Nokia Corporation, P. O. Box 88, FIN-33721, Tampere, Finland ([email protected])
(b) Institute of Signal Processing, Tampere University of Technology, P. O. Box 553, FIN-33101 Tampere, Finland; email: [email protected], [email protected], [email protected]

ABSTRACT

In this paper we propose a generic framework for efficient retrieval of audiovisual media based on its audio content. The framework is implemented in a client-server architecture: the client application is developed in Java to be platform independent, whereas the server application is implemented for the PC platform. The client application adapts to the characteristics of the mobile device on which it runs, such as its screen size and command mapping. The entire framework is designed to take advantage of high-level segmentation and classification of audio content to improve the speed and accuracy of audio-based media retrieval. The primary objective of this framework is therefore to provide an adaptive basis for performing efficient video retrieval operations based on the audio content and its types (i.e. speech, music, fuzzy and silence). Experimental results confirm that such an audio-based video retrieval scheme can be used from mobile devices to search and retrieve video clips efficiently over wireless networks.

Keywords: Audio, Content, Framework, Java, Mobile, Retrieval, Video, Wireless.

1. Introduction

The amount of personal digital information is increasing at a fast rate. New-generation mobile devices, which support various multimedia functionalities with integrated microphones and cameras, facilitate the creation of audiovisual content. These mobile devices are no longer used only for voice communication; nowadays, they are more frequently used to capture and manipulate different media types and to run different applications. Additionally, when combined with Wireless Local Area Network (WLAN) [44] and 3G [35] network technologies, they can provide high-speed access to the wealth of multimedia items on the Internet. Therefore, accessing multimedia items from these mobile devices is no longer a problem; however, retrieving a specific media item (i.e. an item of interest) from a multimedia database using a mobile device is still a challenging research area. In this context, one particular user scenario might be the following: using a multimedia mobile device, a user records an audio/video clip and runs an application to perform a content-based query-by-example (QBE) operation virtually from anywhere. However, Content-Based Multimedia Retrieval (CBMR) from mobile devices adds new challenges besides those encountered in typical content-based multimedia retrieval operations such as [3], [11], [23], [25], [26], [40]. For instance, mobile devices come in different designs and with different capabilities. Moreover, they have different operating systems and input/output limitations, and they support different media file formats. Recently, the capabilities of mobile devices have improved significantly (faster input/output, larger memory capacity, more processing and battery power), but they still lag far behind computers. As a result, it is hard to provide a generic solution that suits all mobile devices, and special care must therefore be taken when developing applications for them.

Multimedia retrieval on mobile devices is an emerging research area [1], [7], [10], [13], [18], [32]. With current mobile operating systems such as Symbian OS [43], MS Windows Mobile [38], Linux [37], etc., it has become possible to develop applications that run on mobile devices and perform sophisticated media manipulation tasks. These efforts paved the way for the development of CBMR on mobile devices. In this context, some of the most relevant efforts are the following. Guldogan et al. [13] proposed an on-device content-based image indexing and retrieval framework for Symbian Series 60 devices. This approach presents several limitations: for example, mobile devices have proprietary Application Programming Interfaces (APIs) for handling (i.e. accessing, processing, editing, streaming, etc.) multimedia items, which limits such applications to a certain set of devices or platforms (operating systems). Another limitation is the large consumption of power and system resources. Even though a standalone CBMR system can be implemented on mobile devices, with an increasing number of multimedia items it might take an unreasonable amount of time to perform a content-based retrieval operation; furthermore, such a system would eventually reduce the device's talk and standby times. Kim et al. [18] proposed VISCORS, a visual-content recommender for the mobile Web, which combines collaborative filtering with content-based image retrieval. Gandhi et al. [10] proposed intelligent multimedia content management on mobile devices with the help of metadata. Vlachos and Vrechopoulos [32] presented a study on personalized on-demand music for mobile music services. Davis and Sarvas [7] developed a system for image retrieval over mobile devices using metadata associated with the images; in their system, the XHTML pages contain a large amount of redundant data, which eventually increases the retrieval time. Pham and Wong [30] discussed different features and requirements with which a mobile multimedia application needs to comply. Due to the aforementioned limitations and drawbacks of mobile devices, most of the systems proposed in the literature [1], [7], [10], [18] use a client-server architecture, where the client runs on a mobile device and the server runs on a computer. Furthermore, they are in general based on metadata and textual annotations.

Content-based video retrieval from mobile devices is an emerging area. From a content-based video retrieval point of view, the audio information can be even more important than the visual part, since it is mostly unique and significantly stable within the entire duration of the content. However, audio-based studies lag far behind their visual counterpart, and the development of robust and generic systems for audio content management is still in its infancy. Recent promising content-based audio retrieval techniques can be grouped into two major categories. The first, the "Query by Humming" (QBH) approach, is used for music retrieval [2], [3], [6], [11], [23], [25], [26], [27]. This approach has the disadvantage of being feasible only when the audio is music stored in a symbolic format or polyphonic transcription (i.e. MIDI). Moreover, it is not suitable for various music genres such as Trance, Hard-Rock, Techno and several others. Such a limited approach obviously cannot be a generic solution for the audio retrieval problem. The second is the well-known "Query by Example" (QBE) technique, which is also common for visual retrieval of multimedia items. One of the most popular audio-based indexing and retrieval frameworks is MuscleFish [39]. Wold et al. [33] proposed a fundamental approach to retrieve sound clips based on their acoustic features such as pitch, brightness, harmonicity, loudness, bandwidth, etc. The main drawback of this approach is that it is a supervised algorithm that is feasible only for a limited sub-set of an audio collection and hence cannot provide an adequate and global approach to general audio indexing. In [9], Foote proposed an approach for the representation of audio clips using templates, which characterize the content. Khokhar and Li [19] proposed a wavelet-based approach for short sound file retrieval and applied it to the MuscleFish database; they achieved around a 70% recall rate for diverse audio classes. Spevak and Favreau presented the SoundSpotter [31] prototype system for content-based retrieval of audio sections within an audio file. In their work, the user selects a specific passage (section) within an audio clip and also sets the number of retrievals; the system then retrieves similar passages within the same audio file by performing pattern matching of the feature vectors followed by a ranking operation.

All the aforementioned systems and techniques achieve a certain performance; however, they present significant limitations and drawbacks. First, the limited set of features extracted from the audio data often fails to capture its perceptual content. Second, the similarity estimation in the query process is based on computing the (dis-)similarity distance between the query and each item in the database, followed by a ranking operation; especially for large databases, this may turn out to be a costly operation, and the retrieval time may become unreasonably long for a particular search engine or application. Third, all of the aforementioned techniques are designed to work with pre-fixed audio parameters (i.e. a fixed format, sampling rate, bits per sample, etc.). Obviously, large-scale multimedia databases may contain digital audio in different formats (compressed or uncompressed), encoding schemes (MPEG Layer-2 [14], [28], MP3 [4], [14], [16], [28], AAC [4], [15], AMR [41], ADPCM, etc.), other capturing, encoding and acoustic parameters (i.e. sampling frequency, bits per sample, sound volume level, bit-rate, etc.) and durations. The perceived audio content is totally independent of such parameters, yet feature extraction techniques that are not designed accordingly are often affected drastically by them, and the efficiency and accuracy of the indexing and retrieval operations both suffer as a result. Finally, these techniques are mostly designed either for short sound files bearing a unique content or for manually selected (short) sections. However, in a multimedia database, each clip can contain multiple content types, which are temporally (and also spatially) mixed with indefinite durations. Even the same audio content type (i.e. speech or music) may be produced by different sources (people, instruments, etc.) and should therefore be analyzed accordingly. In order to overcome the aforementioned problems and shortcomings in audio-based video retrieval, in this paper we propose a framework that uses a generic audio classification and segmentation scheme [20], especially suitable for audio-based multimedia indexing and retrieval systems. This scheme is automatic and uses no information from the visual content of the video. In the proposed framework, an efficient architecture for audio-based queries to retrieve video clips over Java [36] enabled mobile devices (mobile phones, Personal Digital Assistants (PDAs), communicators, etc.) is presented. Since Java is device agnostic, an application developed in Java can be supported by the vast majority of mobile devices. This framework, called Mobile MUVIS (M-MUVIS), is based on MUVIS [40] and has a client-server architecture. The M-MUVIS server comprises two Java servlets (web applications) [5], [24] running inside a Tomcat web server [5]: the MUVIS Query Servlet (MQS), which uses native libraries for efficient audio query operations, and the MUVIS Media Retrieval Servlet (MMRS), which handles media retrieval. In order to take advantage of the flexibility and portability of Java on mobile devices, the M-MUVIS client has been developed using the Java 2 Platform Micro Edition (J2ME) [17].
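To make the division of labor concrete, the following is a minimal sketch of how two such servlets could share a session; the class, parameter and attribute names (including the NativeAudioQuery wrapper) are hypothetical, as the paper does not show the actual M-MUVIS code.

```java
// Sketch of the two-servlet layout: MQS runs the query and stores the
// result in the session; MMRS later reads the same session to serve media.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class MuvisQueryServlet extends HttpServlet {          // "MQS"
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(true);           // create or reuse session
        String queryItem = req.getParameter("queryItem");     // hypothetical parameter
        // Hypothetical JNI wrapper around the native audio query libraries.
        int[] rankedIds = NativeAudioQuery.run(queryItem);
        session.setAttribute("queryResults", rankedIds);      // shared with MMRS
        resp.setContentType("text/plain");
        resp.getWriter().println("OK " + session.getId());
    }
}

class MuvisMediaRetrievalServlet extends HttpServlet {        // "MMRS"
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(false);          // must already exist
        int[] rankedIds = (int[]) session.getAttribute("queryResults");
        // ... build the Query Resultant Image (QRI) from rankedIds and stream it ...
    }
}
```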

The rest of the paper is organized as follows: the M-MUVIS framework is described in Section 2. Section 3 presents audio-based multimedia retrieval over mobile platforms. In Section 4, experimental results are presented. Conclusions are drawn in Section 5.

2. M-MUVIS Framework

As mentioned earlier, M-MUVIS is designed as a client-server framework, as shown in Figure 1. The client application initiates a content-based query operation, such as query by example (QBE), and sends a query request to the server, which performs the query operation and sends the query results back to the client. As shown in the figure, there are two servlets on the M-MUVIS server side: the MQS performs the content-based query operation, while the MMRS handles the media retrieval operation. A session is used to share information between the client, the MQS and the MMRS.

[Figure 1 diagram: off-line, audio features are extracted from media items (from the client or other sources, e.g. a PC) and stored alongside the multimedia database; on-line, the client sends an audiovisual query to the server, the MQS fetches the query item's features by name, performs the similarity measurement and stores the best matches in the session, and the MMRS returns the results to the client for display.]

Figure 1: M-MUVIS framework.

The client and the server use the Hypertext Transfer Protocol (HTTP) [34] for communication. Since HTTP is a stateless protocol, a session [24] is created on the server side to store the configuration parameters sent with the query request by the M-MUVIS client. The session is shared between the MQS and the MMRS. Each session has its own unique session identifier, which the client uses for each transaction with the server. Additionally, the query results are stored in the session, so the M-MUVIS client retrieves the query results from the session. In this way, the URLs of the media items are never transmitted to the client explicitly, which maintains a high level of security in this framework. More information about the M-MUVIS framework can be found in [1].
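As a rough illustration of the client side of this exchange, the sketch below shows how a MIDP client could carry the Tomcat session identifier across stateless HTTP transactions using the Generic Connection Framework. The cookie handling and class name are simplifying assumptions, not code from the paper.

```java
// Sketch: reuse the server-side session over stateless HTTP from a MIDP client.
import java.io.IOException;
import java.io.InputStream;
import javax.microedition.io.Connector;
import javax.microedition.io.HttpConnection;

public class QueryRequest {
    private String sessionCookie;                 // e.g. "JSESSIONID=ABC123"

    public InputStream send(String url) throws IOException {
        HttpConnection c = (HttpConnection) Connector.open(url);
        c.setRequestMethod(HttpConnection.GET);
        if (sessionCookie != null) {
            // Re-attach the session identifier on every transaction.
            c.setRequestProperty("Cookie", sessionCookie);
        }
        // Remember the session identifier issued by the Tomcat server.
        String setCookie = c.getHeaderField("Set-Cookie");
        if (setCookie != null) {
            int semi = setCookie.indexOf(';');
            sessionCookie = (semi > 0) ? setCookie.substring(0, semi) : setCookie;
        }
        return c.openInputStream();
    }
}
```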

2.1. M-MUVIS Database Formation

MUVIS and M-MUVIS share a common database structure. Therefore, M-MUVIS databases are created and managed offline with the MUVIS DbsEditor application [40], via exclusive tasks such as dynamic multimedia item insertion and removal, extraction of new audiovisual features, similarity indexing, etc. DbsEditor is also used to convert audio/video clips or images into M-MUVIS databases. More detailed information about multimedia database management in M-MUVIS, and particularly the DbsEditor application, can be found in [40].

Generally speaking, mobile devices support only a limited set of media formats. Therefore, in the M-MUVIS framework some media items need to be converted to supported formats before being streamed to a client. This is performed on the server side using a dedicated media converter. The media formats supported by M-MUVIS for mobile device streaming are H.263+ [35] for video and AMR/WAMR [35], [41] for audio.
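The conversion decision could be organized as in the sketch below. This is purely illustrative, with hypothetical names: the actual converter is a dedicated server-side component whose interface the paper does not specify.

```java
// Sketch of the server-side "convert before streaming" decision.
public interface MediaConverter {
    /** Transcode a clip to the mobile streaming formats
     *  (H.263+ video, AMR/WAMR audio). Hypothetical interface. */
    byte[] toMobileFormat(byte[] clip, String sourceFormat);
}

class StreamingService {
    private static final String[] SUPPORTED = { "h263+", "amr", "wamr" };

    byte[] prepareForClient(byte[] clip, String format, MediaConverter conv) {
        for (int i = 0; i < SUPPORTED.length; i++) {
            if (SUPPORTED[i].equals(format)) {
                return clip;                       // already streamable as-is
            }
        }
        return conv.toMobileFormat(clip, format);  // convert before streaming
    }
}
```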

2.2. Query Input and Output

The query parameters of an M-MUVIS client are presented in Figure 2. These parameters can be set by the user prior to a query operation, e.g. the query type "Audio Query", as seen in Figure 2-D. Figure 2-A shows the main menu, whereas B, C, D and E show the different settings for the query parameters.

Figure 2: Graphical User Interface (GUI) for query configuration (screens A-E).

The media retrieval operation performed by the MMRS creates the "Query Resultant Image" (QRI) and sends it to the client side. The QRI presented to the user is made of twelve key frames of the video clips most similar to the query, represented as thumbnail images. To retrieve one of these video clips, the user selects its thumbnail image in the QRI and sends a streaming request to the server, which then sends the requested video clip to the client. Figure 3 presents the main Graphical User Interface (GUI) of an M-MUVIS client and a sample QRI. Further information about the QRI formation can be found in [1].

Figure 3: Screenshots (A-J) from a query operation in M-MUVIS.

The phases of an ongoing query operation in an M-MUVIS client can be seen in Figure 3. In A, the user views the first key frames of twelve randomly selected media items from the active database. In B, a snapshot of the menu presenting several options for a query operation is shown, whereas C and D show the query operation and QRI wait dialogs. Screenshots E to J show the first key frames of the twelve video clips from the selected database most similar to the query, with their corresponding similarity scores.

2.3. M-MUVIS Retrieval Scheme: Progressive Query

To perform a content-based query operation, the usual approach is to map the database items, such as images, video and audio clips, into some high-dimensional vector space called the feature domain. The feature domain may consist of several types of features extracted from the visual and audio content, and careful selection of the feature set for a particular application is a key success factor in a content-based multimedia retrieval (CBMR) system. Assuming that these features capture the semantic content of the media items, the perceived similarity between two items can be estimated by the (dis-)similarity distance between their feature vectors. The similarity-based retrieval problem with respect to a given query item can thus be transformed into the problem of finding database items whose feature vectors are close to the query feature vector. This is called query-by-example (QBE), one of the most common retrieval schemes in CBMR systems. The exhaustive-search QBE operation is called Normal Query (NQ) and works as follows: using the available features of the query item and all the database items, similarity distances are calculated and then fused to obtain a unique similarity distance per database item; ranking the items according to their similarity distances to the query item over the entire database yields the query results. NQ is computationally costly, and its retrieval time is proportional to the database size. Therefore, it cannot provide a feasible solution for content-based retrieval, especially on mobile platforms.
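The following is a minimal sketch of such an exhaustive NQ pass, assuming plain Euclidean distances over already-fused feature vectors (the real system fuses several per-feature distances before ranking):

```java
// Sketch: exhaustive Normal Query — one distance per database item, then a full ranking.
import java.util.Arrays;
import java.util.Comparator;

public class NormalQuery {
    /** Euclidean distance between two feature vectors. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns database indices ranked by similarity to the query. */
    static Integer[] rank(double[] query, double[][] database) {
        final double[] dist = new double[database.length];
        Integer[] order = new Integer[database.length];
        for (int i = 0; i < database.length; i++) {
            dist[i] = distance(query, database[i]);   // O(N) over the whole database
            order[i] = Integer.valueOf(i);
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer x, Integer y) {
                return Double.compare(dist[x.intValue()], dist[y.intValue()]);
            }
        });
        return order;
    }
}
```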

Progressive Query (PQ) [22] was designed to address these limitations and drawbacks. PQ is composed of a periodic series of Progressive Sub-Queries (PSQs): it partitions the database items into sub-sets within which individual sub-queries can be performed. A sub-query is thus a fractional query process performed over a sub-set of the database items. Once a sub-query is completed over a particular sub-set, its results are merged with the last (overall) retrieval result to obtain a new (overall) retrieval result. This is a continuous operation that proceeds incrementally, sub-set by sub-set, to cover all the items within the database, and each time a new sub-query operation is completed, PQ saves the retrieval results into the client session. More information about PQ can be found in [22].
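A compact sketch of this incremental fusion, reusing the distance helper from the NQ sketch above, is given below. The fusion here is a simple merge-and-re-rank, which is an assumption for illustration rather than the exact PQ algorithm of [22]; saving each fused result into the client session is omitted.

```java
// Sketch: Progressive Query — sub-set by sub-set, each partial result is
// fused with the overall result, which a client could read at any time.
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ProgressiveQuery {
    static class Result {
        final int itemId; final double dist;
        Result(int id, double d) { itemId = id; dist = d; }
    }

    /** One fractional sub-query over a database sub-set. */
    static List<Result> subQuery(double[] q, double[][] subSet, int idOffset) {
        List<Result> out = new ArrayList<Result>();
        for (int i = 0; i < subSet.length; i++)
            out.add(new Result(idOffset + i, NormalQuery.distance(q, subSet[i])));
        return out;
    }

    /** Fuse a new sub-query result with the overall result and re-rank. */
    static List<Result> fuse(List<Result> overall, List<Result> partial) {
        List<Result> merged = new ArrayList<Result>(overall);
        merged.addAll(partial);
        Collections.sort(merged, new Comparator<Result>() {
            public int compare(Result a, Result b) {
                return Double.compare(a.dist, b.dist);
            }
        });
        return merged;
    }
}
```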


Mobile device users usually cannot afford to wait a long time for query results. Therefore, PQ settings such as the PSQ result index and the PQ time period (tp) allow mobile device users to determine how fast they can access the sub-query results in the earlier stages. Furthermore, when an efficient indexing structure is available in an M-MUVIS database, PQ can conveniently be used to retrieve the relevant items in the earliest possible time. For this purpose, the Hierarchical Cellular Tree (HCT) [21] indexing technique is used to perform similarity indexing over the M-MUVIS databases in order to further reduce the retrieval times. An illustration of the PQ operation in the M-MUVIS framework is shown in Figure 4.

[Figure 4 diagram: the client sends the query information over HTTP; the MQS runs periodic sub-queries (with period t = tp) over database sub-sets 1..N, fusing each periodic sub-query result (1, 1+2, 1+2+3, ...) into the progressive sub-query result, which is saved into the session served by the MMRS.]

Figure 4: Progressive Query in M-MUVIS.

2.4. M-MUVIS Client Architecture

The M-MUVIS client application adapts to different device User Interfaces (UIs), so the same application (executable) can be used on different Java enabled mobile devices. The client application is developed with the Mobile Information Device Profile (MIDP) [17], a Java profile for resource-limited mobile devices, which allows mapping the application commands [17] to the device buttons in order to preserve the look and feel of the device's native applications.
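The sketch below illustrates this command-mapping mechanism with a minimal MIDlet: the MIDP implementation places each Command on a soft key or menu according to its type and priority, which is what gives the native look and feel. The command labels and form contents are illustrative, not the actual M-MUVIS client source.

```java
// Sketch: MIDP command mapping — the device decides which button/menu hosts each Command.
import javax.microedition.lcdui.*;
import javax.microedition.midlet.MIDlet;

public class ClientMidlet extends MIDlet implements CommandListener {
    private final Command queryCmd = new Command("Query", Command.OK, 1);
    private final Command exitCmd  = new Command("Exit", Command.EXIT, 2);
    private final Form mainForm    = new Form("M-MUVIS");

    public void startApp() {
        mainForm.addCommand(queryCmd);
        mainForm.addCommand(exitCmd);
        mainForm.setCommandListener(this);
        Display.getDisplay(this).setCurrent(mainForm);
    }

    public void commandAction(Command c, Displayable d) {
        if (c == exitCmd) notifyDestroyed();
        // else launch the query operation ...
    }

    public void pauseApp() {}
    public void destroyApp(boolean unconditional) {}
}
```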


Figure 5: Audio-based video retrieval operation from M-MUVIS client application on Nokia 9500.

An example of an audio-based video retrieval operation in a video database, using an M-MUVIS client running on a Nokia 9500 communicator [41], is shown in Figure 5. The main GUI of the M-MUVIS client is shown in A. In B, the client requests a set of video clips selected randomly from the active database, which are shown in C. The video clip "Clip_789" is used to initiate the query operation. In E and F, the first key frames of the 12 highest-ranked clips are shown, with their similarity scores displayed on top of them.

Mobile devices have limited Random Access Memory (RAM), and the Java virtual machine consumes a significant part of it, so only a limited amount of memory is usually left for the M-MUVIS client application. For efficient memory management, the M-MUVIS client does not cache media items on the mobile device. To further reduce memory usage, the QRI is JPEG encoded, and the M-MUVIS server uses a high compression rate (i.e. a low quality factor) for the JPEG encoding.
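Such server-side encoding can be done with the standard Java ImageIO API, as in the sketch below; the 0.3 quality factor is an assumed value for illustration.

```java
// Sketch: encode the QRI as JPEG with a low quality factor to shrink the payload.
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;

public class QriEncoder {
    static byte[] encode(BufferedImage qri) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(0.3f);        // low quality factor, small payload
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writer.setOutput(ImageIO.createImageOutputStream(out));
        writer.write(null, new IIOImage(qri, null, null), param);
        writer.dispose();
        return out.toByteArray();
    }
}
```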

3. Audio-based Indexing and Retrieval for Video Databases

Audio information often plays an important role in understanding the content of digital media; in certain cases, audio might even be the only source of information, e.g. in audio-only clips. Hence, audio information has recently been used for content-based multimedia indexing and retrieval [8]. Audio may also provide significant advantages over its visual counterpart, especially if the features extracted from the content are close to those used by the human auditory system. This, in turn, requires efficient and generic audio content analysis that yields robust semantic classification and segmentation.

[Figure 6 diagram, panel A (indexing): (1) the audio stream is classified and segmented per granule/frame into silence, speech, music and fuzzy; (2) audio framing is performed within the valid classes, with uncertain boundary frames discarded; (3) the AFeX operation is applied per frame; (4) key-frame (KF) feature vectors are extracted via MST clustering and indexed per class. Panel B (retrieval): query frames are matched against indexed frames of the same class.]

Figure 6: Audio indexing and retrieval in M-MUVIS.

Audio-based video indexing and retrieval are shown in Figures 6-A and 6-B, respectively. As shown in Figure 6-A, audio indexing is applied to the audio track of a video clip in an M-MUVIS database. Classification and segmentation of the audio stream is the first step: the entire audio clip is segmented into four class types, of which three (speech, music and fuzzy) are used for indexing. Silent frames are simply discarded, since they carry no audio content information. Frame conversion is applied in step 2 because of the possible difference between the frame durations used in classification and segmentation and those used in the subsequent Audio Feature eXtraction (AFeX) operations. Boundary frames, which contain more than one class type, are labeled as uncertain and likewise discarded from indexing, since their content is mixed rather than pure and hence does not provide clean content information. The remaining speech, music and fuzzy frames (within their corresponding segments) are subjected to feature extraction using the available AFeX modules, and their feature vectors are indexed into separate descriptor files after a clustering (key-framing) operation via Minimum Spanning Tree (MST) clustering [12].

The aforementioned indexing scheme uses the per-segment audio classification information to improve efficiency: during an audio-based query, only audio frames with matching class types are compared with each other via a similarity measure. Figure 6-B illustrates the class matching and minimum distance search mechanisms used during the similarity distance calculations per sub-feature. More details about audio indexing and retrieval can be found in [8], [20]. In the M-MUVIS framework, the server performs the audio-based indexing and retrieval.
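A simplified sketch of this class-matched, minimum-distance comparison is given below, again reusing the Euclidean helper from the NQ sketch; it greatly simplifies the per-sub-feature scheme of [8], [20].

```java
// Sketch: compare query frames only against database frames of the same class,
// taking the minimum distance per query frame and accumulating the total.
public class ClassMatchedSimilarity {
    public static final int SPEECH = 0, MUSIC = 1, FUZZY = 2;

    /** Accumulated minimum-distance score between a query and a database clip. */
    static double distance(double[][] qFrames, int[] qClasses,
                           double[][] dbFrames, int[] dbClasses) {
        double total = 0.0;
        for (int i = 0; i < qFrames.length; i++) {
            double best = Double.MAX_VALUE;
            for (int j = 0; j < dbFrames.length; j++) {
                if (qClasses[i] != dbClasses[j]) continue;  // class matching
                best = Math.min(best, NormalQuery.distance(qFrames[i], dbFrames[j]));
            }
            if (best < Double.MAX_VALUE) total += best;     // min-distance search
        }
        return total;
    }
}
```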

4. Experimental Results

The "Open Video" database used in these experiments was created from video clips of "The Open Video Project" [42]. It contains 1229 video clips, each approximately one minute long, for a total duration of approximately 20 hours. The clips are from the 1960s but contain color video with audio. In these experiments, the M-MUVIS server runs on a PC equipped with a 2.99 GHz Pentium 4 and 1.9 GB of RAM. The features used in the audio-based and visual queries performed in this section are as follows: Mel Frequency Cepstral Coefficients (MFCC) [20] as the audio feature, and YUV, HSV and RGB color histograms and the Gray Level Co-occurrence Matrix (GLCM) [29] as visual features. The performance appraisal is carried out subjectively (via visual inspection): the experimental results presented in this section are based on subjective evaluation using a ground-truth methodology (i.e. the retrieval results are evaluated by a group of people) and are meant to be evaluator-independent. We first assess the retrieval speed (i.e. the query time) of the M-MUVIS client; afterwards, using a query video, the aural retrieval performance is compared subjectively with its visual counterpart.

Table 1: Audio-based query time statistics

| Network | PQ over HCT: CQT (ms) µ (σ) | PQ over HCT: SQT (ms) µ (σ) | NQ: CQT (ms) µ (σ) | NQ: SQT (ms) µ (σ) |
|---|---|---|---|---|
| 6630 [41] (3G) | 11,777 (949) | 5,578 (33) | 307,385 (3,955) | 290,465 (441) |
| 9500 [41] (WLAN) | 9,609 (104) | 5,005 (10) | 301,281 (352) | 289,654 (306) |

The retrieval performance evaluation is based on speed (timing) measurements using the PQ over HCT and NQ query methods. We performed 50 QBE retrieval operations on a large database, with 50 different query video clips. We used 2 seconds as the PQ period (tp) and measured the total query time needed to retrieve relevant video clips (with a maximum of one miss allowed) among the first (highest ranked) 12 results. Table 1 presents the retrieval time statistics (mean (µ) and standard deviation (σ)) of the 50 query operations over the test database. The Server Query Time (SQT) is the time spent performing a query operation on the server side, whereas the Client Query Time (CQT) is the entire time elapsed from sending a query request, through the query operation on the server side, until the retrieval of the QRI. As expected, PQ over HCT achieves a significant retrieval speed, e.g. only 9-12 seconds on average to perform a CBMR operation in the sample database. The traditional query mechanism, NQ, on the other hand, yields retrieval times that are not feasible for mobile platforms and their users. The PQ statistics presented in Table 1 are computed from the query time needed to retrieve at least 91% (i.e. one miss allowed) of the relevant media items (using the ground-truth methodology). A higher σ in CQT is observed over 3G than over WLAN due to the dynamic latencies of 3G networks.

Figure 7: Visual (A) and audio (B) QRIs retrieved by an M-MUVIS client running on a Nokia 9500 from the Open Video database. Top-left is the query clip.

Figure 7 shows two sample retrievals via visual (A) and audio (B) queries from the Open Video database. The query clip (represented by its first key frame), shown on the top left of Figures 7-A and 7-B, is a documentary about a helicopter. The audio track of this clip is mostly speech, accompanied by occasional music and environmental (helicopter engine) noise. Among the 12 highest-ranked retrievals, the visual query (A) retrieves only two relevant clips, whereas the audio query retrieves all the relevant ones. This example shows that an audio-based query can outperform its visual counterpart, especially when there is significant variation in the visual scenery, lighting conditions, background or object motion, camera effects, etc.



5. Conclusions

An audio-based video retrieval framework over Java enabled mobile devices has been proposed as part of the M-MUVIS system. In this framework, the user may perform audio-based query operations to retrieve video clips similar to a query video from a multimedia database. In order to achieve high efficiency in terms of speed and accuracy for audio queries, along with a user-friendly architecture, the M-MUVIS framework has been significantly improved by integrating the following techniques and features:

• A generic framework, which supports a wide range of multimedia items, where each item can have different capturing and encoding parameters, indefinite duration, file format, etc.
• Any M-MUVIS client running on a mobile device can now perform audio-based queries within video databases located on a remote server.
• Audio-based queries can be a better alternative to their visual counterpart.
• The M-MUVIS client application adopts the native look and feel of the device where it runs whilst providing a uniform UI across a range of mobile devices.
• Retrieval speed is an important feature for mobile platform users; using PQ over HCT, M-MUVIS achieves a significant retrieval speedup with respect to the traditional approach.
• Whenever needed, the MMRS performs audio and video conversions according to the request from an M-MUVIS client.

We foresee that CQT can be further reduced by improving the session management on the server side, which will reduce the amount of data exchanged between the M-MUVIS client and server.

REFERENCES

[1] I. Ahmad, S. Kiranyaz and M. Gabbouj, "An Efficient Image Retrieval Scheme on Java Enabled Mobile Devices", MMSP 05, International Workshop on Multimedia Signal Processing, Shanghai, China, November 2005.
[2] D. Bainbridge, "Extensible Optical Music Recognition", PhD thesis, Department of Computer Science, University of Canterbury, New Zealand, 1997.
[3] S. Blackburn and D. DeRoure, "A Tool for Content Based Navigation of Music", in Proc. ACM Multimedia 98, 1998.
[4] K. H. Brandenburg, "MP3 and AAC Explained", AES 17th International Conference, Florence, Italy, September 1999.
[5] V. Chopra, A. Bakore, J. Eaves, B. Galbraith, S. Li and C. Wiggers, "Professional Apache Tomcat 5", Wrox, ISBN 0764559028, May 2004.
[6] T. C. Chou, A. L. P. Chen and C. C. Liu, "Music Databases: Indexing Techniques and Implementation", in Proc. of the 1996 International Workshop on Multi-Media Database Management Systems (IW-MMDBMS '96), p. 46, August 14-16, 1996.
[7] M. Davis and R. Sarvas, "Mobile Media Metadata for Mobile Imaging", ICME 2004 Special Session on Mobile Imaging, Taipei, Taiwan, IEEE Computer Society Press, 2004.
[8] M. Gabbouj, S. Kiranyaz, K. Caglar, E. Guldogan, O. Guldogan and F. A. Qureshi, "Audio-based Multimedia Indexing and Retrieval Scheme in MUVIS Framework", in Proc. of the 2003 IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS 2003), invited plenary talk, Awaji Island, Japan, December 7-10, 2003.
[9] J. T. Foote, "Content-Based Retrieval of Music and Audio", in Proc. SPIE, vol. 3229, pp. 138-147, 1997.
[10] B. Gandhi, A. Martinez and F. Bentley, "Intelligent Multimedia Content Management on Mobile Devices", in Proc. of the 2004 IEEE International Conference on Multimedia and Expo (ICME '04), vol. 3, pp. 1703-1706, Taipei, Taiwan, 2004.
[11] A. Ghias, J. Logan, D. Chamberlin and B. C. Smith, "Query By Humming", in Proc. ACM Multimedia 95, pp. 231-236, 1995.
[12] R. L. Graham and P. Hell, "On the History of the Minimum Spanning Tree Problem", Annals of the History of Computing, vol. 7, pp. 43-57, 1985.
[13] O. Guldogan and M. Gabbouj, "Content-based Image Indexing and Retrieval Framework on Symbian Based Mobile Platform", European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, September 2005.
[14] ISO/IEC 11172-3, Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s, Part 3: Audio, 1992.
[15] ISO/IEC CD 14496-3 Subpart 4: 1998, Coding of Audiovisual Objects, Part 3: Audio, 1998.
[16] ISO/IEC 13818-3:1997, Information Technology -- Generic Coding of Moving Pictures and Associated Audio Information, Part 3: Audio, 1997.
[17] J. Keogh, "The Complete Reference J2ME", Osborne/McGraw-Hill, February 2003.
[18] C. Y. Kim, J. K. Lee, Y. H. Cho and D. Kim, "VISCORS: A Visual-Content Recommender for the Mobile Web", IEEE Intelligent Systems, 19(6), pp. 32-39, 2004.
[19] A. Khokhar and G. Li, "Content-based Indexing and Retrieval of Audio Data using Wavelets", ICME 2000.
[20] S. Kiranyaz, "Advanced Techniques for Content-Based Management of Multimedia Databases", PhD Thesis, Publication 541, Tampere University of Technology, Tampere, Finland, June 2005.
[21] S. Kiranyaz and M. Gabbouj, "A Dynamic Content-Based Indexing Method for Multimedia Databases: Hierarchical Cellular Tree", in Proc. of the IEEE International Conference on Image Processing (ICIP 2005), Genoa, Italy, pp. 533-536, September 11-14, 2005.
[22] S. Kiranyaz and M. Gabbouj, "A Novel Multimedia Retrieval Technique: Progressive Query (WHY WAIT?)", in Proc. of the 5th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2004), Instituto Superior Técnico, Lisbon, Portugal, April 21-23, 2004.
[23] L. Kjell and L. Pauli, "Musical Information Retrieval using Musical Parameters", International Computer Music Conference, Ann Arbor, 1998.
[24] S. Li, P. Houle, M. Wilcox, R. Phillips, P. Mohseni, S. Zeiger, H. Bergsten, M. Ferris and D. Ayers, "Professional Java Server Programming", Peer Information Inc., ISBN 1861002777, August 1999.
[25] L. Lu, H. You and H. J. Zhang, "A New Approach to Query by Humming in Music Retrieval", in Proc. of ICME 2001, Tokyo, August 2001.
[26] R. J. McNab, L. A. Smith, I. H. Witten, C. L. Henderson and S. J. Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", in Proc. of ACM Digital Libraries '96, pp. 11-18, 1996.
[27] R. J. McNab, L. A. Smith, D. Bainbridge and I. H. Witten, "The New Zealand Digital Library MELody inDEX", http://www.dlib.org/dlib/may97/meldex/05written.html, May 1997.
[28] D. Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, pp. 60-74, 1995.
[29] M. Partio, B. Cramariuc, M. Gabbouj and A. Visa, "Rock Texture Retrieval Using Gray Level Co-occurrence Matrix", in Proc. of the 5th Nordic Signal Processing Symposium, October 2002.
[30] B. Pham and O. Wong, "Handheld Devices for Applications Using Dynamic Multimedia Data", in Proc. of the 2nd International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, Singapore, pp. 123-130, 2004.
[31] C. Spevak and E. Favreau, "Soundspotter - A Prototype System for Content-based Audio Retrieval", in Proc. of the COST G-6 Conf. on Digital Audio Effects (DAFX-02), Hamburg, Germany, September 2002.
[32] P. Vlachos and A. Vrechopoulos, "Key Success Factors in the Emerging Landscape of Mobile Music Services", in Proc. of the 3rd International Conference on Web Delivering of Music (WEDELMUSIC 2003), 2003.
[33] E. Wold, T. Blum, D. Keislar and J. Wheaton, "Content-based Classification, Search, and Retrieval of Audio", IEEE Multimedia, pp. 27-36, 1996.
[34] C. Wong, "HTTP Pocket Reference", 1st edition, O'Reilly Media, Inc., ISBN 1565928628, June 2000.
[35] "3G", http://www.3gpp.org/About/about.htm
[36] "Java", http://java.sun.com/
[37] "Linux Devices", http://www.linuxdevices.com/
[38] "MS Windows Mobile", http://www.microsoft.com/windowsmobile/
[39] "Muscle Fish LLC", http://www.musclefish.com/
[40] "MUVIS", http://muvis.cs.tut.fi/
[41] "Nokia", http://www.nokia.com/
[42] "Open Video Project", http://www.open-video.org/
[43] "Symbian OS", http://www.symbian.com/
[44] "WLAN", http://grouper.ieee.org/groups/802/11/