A Keyword-based Video Summarization Learning Platform with Multimodal Surrogates

Wen-Hsuan Chang, Graduate Institute of Network Learning Technology, National Central University, Jhongli City, Taiwan, [email protected]

Jie-Chi Yang, Graduate Institute of Network Learning Technology, National Central University, Jhongli City, Taiwan, [email protected]

Yu-Chieh Wu, Dept. of Communications and Management, Ming Chuan University, Taipei City, Taiwan, [email protected]

Abstract—In general, video-based learning contains rich media information, but displaying an entire video linearly is time-consuming. As an alternative, video summarization techniques extract important content and provide short but informative fragments. In this paper, a video learning platform (KVSUM: Keyword-based Video Summarization) is presented that integrates image processing, text summarization, and keyword extraction techniques. Without human annotators, the platform can process an input video and transform it into online learning materials automatically. Video frames are first split from a given video, while the transcription is used to generate the text summary and keywords. In the current study, the video surrogates are composed of extracted keywords, text and video summaries, and video frames. In other words, KVSUM is able to provide both visual and verbal surrogates. To validate the effect of the surrogates in KVSUM, a comparison was made with another video surrogate, fast forward (FF), to evaluate learners' comprehension of video content. Sixty undergraduate students took part in examining the two different video surrogates. The experimental results show that KVSUM had a more positive effect than FF on comprehension of videos. In terms of system usage and satisfaction, KVSUM was significantly more attractive to learners than FF.

Keywords- video summarization; video surrogate; keyword cloud; multimedia learning; video-based learning

I. INTRODUCTION

With the rapid development of the Internet, multimedia learning has become one of the most popular delivery methods in education. Multimedia instruction refers to the presentation of material that combines verbal and visual representations [1]. According to dual-coding theory, combining verbal and visual information is better than learning from text alone or visuals alone. Multimedia instruction is therefore considered an effective complementary resource for promoting learning. Video is one of the most typical types of multimedia learning material, containing various media types, for example, text, images, motion, objects, speech, and sound [2]. As large video resources are continually produced, good video summarization techniques are imperative to help users search for and view suitable videos.

Over the past few years, many video searching and summarization techniques have been developed, such as [2]. These methods substantially increase the ease of searching and browsing a video database. However, different video surrogate types might produce different effects on learning, especially for learners who prefer not to view video in a linear format.

The literature [3, 4] defines a video surrogate as a condensed video representation that contains the major features of the original video content. In other words, a video surrogate is considered a kind of video summarization representation [5, 6]. Good surrogates may help users quickly grasp the video content [4, 7] and also allow them to browse more detailed information and obtain incidental learning [8]. Thus, surrogates are essential for friendly user interfaces [4]. Although most video surrogates are text-based, other types such as still images (storyboard/keyframe), moving images (video skim/fast forward), and audio (spoken keyword/speech) have received increasing emphasis recently.

In the past decade, researchers have verified that surrogates combining visual and verbal metadata have better effects on users' comprehension and are preferred over visual-only or text-only surrogates [8]. Visual and verbal surrogates each make unique contributions to understanding, and the two may complement each other.

Another important type of surrogate is audio, which is considered effective for capturing key content. This surrogate type uses spoken keywords [9] or spoken descriptions [4, 7] together with videos as the representation. However, fusing spoken surrogates with visual sources is usually very time-consuming; for example, selected spoken keywords must be located at their time stamps of appearance and synchronized with the video content.

More recent studies have argued that video fast forward (FF), which allows the video to be watched at a faster speed but with no audio, offers great time savings [7, 9]. In fact, FF may be considered a kind of video summarization, as it samples video frames at equal intervals, and it has proven to be a substantial support for learners in efficiently making sense of videos [10]. However, the production of FF is time-consuming, especially when other complementary metadata, such as audio surrogates, are added.

Of the several surrogate types available to learners, only those that are useful for increasing comprehension, especially when accommodated within multimedia instruction, should be selected. Therefore, in the current study, computer-generated multimodal video surrogates are used in the video learning platform with the purpose of decreasing the time consumed in producing related instructional materials, while retaining the essence, or the most informative parts, of the original videos through textual analysis. Moreover, keywords are an important textual surrogate resource that can facilitate searching for specific subjects and lead to better comprehension [8, 9].

The keyword-based video summarization learning platform (KVSUM) provides a keyword cloud as a textual surrogate to help learners organize video information before viewing. In other words, KVSUM produces not only video summaries but also an additional textual video surrogate that is expected to enhance effective comprehension.

In this study, two different surrogate representation types are compared. The two surrogate types have the following summarization mechanisms: the FF is created by uniform sampling in post-production, whereas KVSUM is created by a summarization module built on image processing, text summarization, and keyword extraction techniques. Moreover, KVSUM presents text and graphics as separate information and uses the keyword cloud as an index generated by a keyword weighting scheme. This study examines the effects of both types of video surrogates on learner performance and preferences in the video learning platform.

The remainder of this paper is organized as follows. Section 2 describes the two different types of video surrogates, the automatic video content summarization behind them, and the usage of their user interfaces (UI). The experimental design of the study is described in Section 3. Section 4 presents the results and discussion. The conclusion and future work are summarized in Section 5.

II. SYSTEM OVERVIEW

Figure 1 depicts an overview of the proposed KVSUM system, while Figure 2 shows the FF system architecture. As shown in Figure 1, there are five key modules in KVSUM, namely, keyword extraction, subtitle detection, summarization, thumbnail generation, and keyword cloud generation. The input of KVSUM is simply the raw video, while the output is a composed set of a keyword cloud, video fragments with transcriptions, and thumbnails.

The KVSUM system first splits a raw video into three parts: video frames, subtitles, and time stamps. The video frame set is used by the thumbnail generation component to produce thumbnails. Both the keyword extraction and summarization modules make use of the split subtitles to discover keywords and generate the summary. Furthermore, the summarization module generates video summaries of varying lengths according to a configurable time constraint.

After keywords are generated, the keyword cloud generation module renders the keywords as a classic tag cloud according to the proposed keyword weighting scheme. In the last stage, KVSUM integrates the keyword cloud, text summaries, thumbnails, and video fragments for learners. Learners can use the keyword cloud to retrieve video summaries for watching. Unlike conventional video summarization systems [2], KVSUM provides not only summary display but also keyword links to each time stamp in the video.
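To make the keyword-to-time-stamp linking concrete, the following minimal, self-contained sketch (toy data and names of our own, not the authors' code) builds an inverted index from extracted keywords to the start times of the summary fragments whose transcriptions contain them:

```python
# Sketch: link each keyword to the time stamps of the summary fragments
# whose transcriptions mention it. Fragments and keywords are toy values.
from collections import defaultdict

fragments = [                       # (start time in seconds, transcription)
    (0,  "coral reefs host thousands of fish species"),
    (19, "rising sea temperature bleaches coral"),
    (38, "fish populations decline as reefs degrade"),
]
keywords = {"coral", "reefs", "fish"}

index = defaultdict(list)
for start, text in fragments:
    for word in text.split():
        if word in keywords:
            index[word].append(start)

# Clicking "coral" in the cloud would list the fragments starting at 0 and 19.
print(dict(index))
```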

Figure 1. System architecture of KVSUM

The FF is a speed-based video compression technique. Given a time constraint, for example, half of the video time, the FF displays the input video at twice the normal speed. The tighter the time constraint, the higher the playback speed. However, to ensure that learners can read all the caption information, a human annotator is employed to post-process the captions and insert them at free time stamps in the video. Clearly, this post-processing is time-consuming.
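As a simple illustration of this rule (the function and the numbers are ours, not from the paper), the required playback speed is just the ratio of the original duration to the time constraint:

```python
def ff_speed(original_sec, constraint_sec):
    """Playback speed factor needed to fit the video into the time constraint."""
    return original_sec / constraint_sec

print(ff_speed(3000, 1500))  # half the video time -> 2.0x, as in the text
print(ff_speed(3000, 300))   # a 50-minute video in 5 minutes -> 10.0x
```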

Figure 2. System architecture of the FF

A. Keyword weighting scheme

It is noteworthy that there is no explicit space symbol in Mandarin Chinese. To extract Chinese words, the accessor variety (AV) score [11] is adopted to measure the discriminative power of a character string. When the AV score of a string is above a predefined threshold, the string is treated as a word.
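Here is a minimal sketch of the AV measure under the common definition from the segmentation literature (an assumption, since the paper does not spell it out): the AV of a string is the smaller of the number of distinct characters appearing immediately to its left and to its right in the corpus. The corpus and candidate below are toy values.

```python
# Minimal sketch of the accessor variety (AV) measure; assumes the usual
# definition AV(s) = min(#distinct left-context chars, #distinct right-context
# chars). Real input would be the Chinese transcription of the video.

def accessor_variety(corpus, s):
    left, right = set(), set()
    start = corpus.find(s)
    while start != -1:
        if start > 0:
            left.add(corpus[start - 1])          # character before s
        if start + len(s) < len(corpus):
            right.add(corpus[start + len(s)])    # character after s
        start = corpus.find(s, start + 1)
    return min(len(left), len(right))

corpus = "珊瑚礁魚類珊瑚白化魚類減少"
print(accessor_variety(corpus, "魚類"))  # 2 distinct neighbours on each side
# A candidate is kept as a word when its AV exceeds the predefined threshold.
```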

Each keyword is given a weighted score according to Equation (1). These scores determine the presentation of keywords in the keyword cloud.

Weight(a_{ij}) = \frac{\log(1.0 + tf_{ij}) \times \log(N/PF_i)}{\sqrt{\sum_{j} \left[ \log(1.0 + tf_{ij}) \times \log(N/PF_i) \right]^{2}}}    (1)

The weighting scheme is basically a TFIDF-based measurement [12]. However, there is no inverse document frequency (IDF) indicator available within a single testing video. Instead, the transcription of the whole video is segmented into passages and the passage frequency is used. Here a_{ij} denotes term i in passage j; tf_{ij} is the frequency of term i in passage j; PF_i is the number of passages that contain term i; and N is the total number of passages. The denominator of the above equation is the normalization factor.
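A minimal sketch of this weighting on toy numbers (assuming base-10 logarithms; any base preserves the ranking):

```python
# Sketch of the Eq. (1) weight: a log-TF x log-(N/PF) score for term i in
# each passage j, normalized over all passages. Toy values, not study data.
import math

def passage_weights(tf_i, pf_i, n_passages):
    """tf_i: term i's frequency in each passage j; pf_i: passages containing i."""
    raw = [math.log10(1.0 + tf) * math.log10(n_passages / pf_i) for tf in tf_i]
    norm = math.sqrt(sum(r * r for r in raw))   # the normalization factor
    return [r / norm if norm else 0.0 for r in raw]

# Term i occurs 3, 0, and 1 times across three passages (PF_i = 2, N = 3):
print(passage_weights([3, 0, 1], 2, 3))
```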

B. Surrogate Presentation

The presentations of KVSUM and FF both belong mainly to the moving-image surrogate type; therefore, both provide video cover information and playback interfaces that allow learners intuitive operation. The FF is shown at high speed but displays the full video content; consequently, it provides no extra metadata such as transcriptions or keywords. By contrast, KVSUM contains several video summarization fragments, and the keyword cloud serves as an index that lists these fragments. Furthermore, extra metadata, including transcriptions and thumbnails, are provided to help learners organize scattered information.

1) Keyword Cloud: A keyword/tag cloud is a popular navigation tool that presents a group of the keywords/tags most often used in the user interface; it is used to highlight content when searching a structured database [13, 14], so that learners may quickly find what they want to know. The significance of a keyword is indicated by its font size relative to other keywords. This study used the keywords extracted by KVSUM; presenting them in the UI according to the following criterion in the keyword cloud generation module may support learners in building a knowledge structure of the specific subject in the videos. The actual font size of a keyword is estimated as follows.

f(a_i) = fs_{\min} + \frac{W(a_i) - \min_i(W(a_i))}{\max_i(W(a_i)) - \min_i(W(a_i))} \times (fs_{\max} - fs_{\min})    (2)

W(a_i) = \sum_{j} weight(a_{ij})

where f(a_i) is the font size of keyword a_i and W(a_i) is the total weight of keyword a_i.

Here fs_{\min} and fs_{\max} are the predefined minimum and maximum display font-size thresholds, and W(a_i) is derived from Equation (1). Equation (2) thus normalizes each keyword's font size into the predefined range.

The above schemes were used to generate the keyword cloud.
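The following minimal sketch (toy weights; the fs_min and fs_max values are chosen arbitrarily) applies the Equation (2) min-max normalization to map summed keyword weights onto display font sizes:

```python
# Sketch of Eq. (2): each keyword's summed weight W(a_i) is min-max
# normalized into the [fs_min, fs_max] display range.

def font_sizes(weights, fs_min=10, fs_max=36):
    w_lo, w_hi = min(weights.values()), max(weights.values())
    span = (w_hi - w_lo) or 1.0  # avoid division by zero if all weights equal
    return {kw: fs_min + (w - w_lo) / span * (fs_max - fs_min)
            for kw, w in weights.items()}

# Toy W(a_i) values summed over passages per Eq. (2):
print(font_sizes({"coral": 2.4, "reef": 1.1, "fish": 0.3}))
```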

2) The UI Usage: People usually get video information from the simple description on the cover page before viewing; if they want more details about a video, they need to make the effort to watch and annotate it, which takes time. Using either KVSUM or FF, learners view videos summarized down to 10% of the original length. When learners operate the FF, a semi-automated system, a single playback window plays the video directly at a high speed as a serial visual presentation; they may also still comprehend the video from the original description on the cover page. In KVSUM, an automated process presents the most important parts of the video information, composed of a video cover, the keyword cloud, video summaries with summary transcriptions (with marked keywords), key frames, and thumbnails. The keyword cloud lets learners jump between lists of summaries aggregated by keyword and perceive the order of significance from the keywords' font sizes. In addition, each video summary provides not only three or four thumbnails but also its key frame and transcription. The thumbnails may be clicked to enlarge the view, and clicking on the key frame pops up the playback window.

III. METHODOLOGY

To investigate learner performance and preferences, measures of the two surrogates are compared.

A. Participants

Sixty subjects (twenty-four male and thirty-six female), undergraduate students between 18 and 22 years of age with experience of viewing online videos, participated in this experiment.

B. Experiment Material

The purpose of this study is to examine how two different types of surrogates affect learners' comprehension of video content. Two Discovery™ videos, each originally 50 minutes long, were selected. For these two videos, each fragment was set to a 19-second interval in the summarization module of KVSUM, and the time constraint of the final summaries was set to 5 minutes. The FF likewise produced approximately 5 minutes of video summary by speeding up the original video content, with post-processing of the captions.

C. Quiz

This study administered quizzes to subjects before and after they viewed the video surrogates. The order of questions and options was randomized in both the pre-test and the post-test so that participants could rely less on recollection during the post-test. Both quizzes had ten multiple-choice questions based on the content of the two assigned videos.

D. Questionnaire

A satisfaction questionnaire with a five-point Likert scale was used to examine how learners perceived both video surrogates in the video learning platform. The questionnaire covered three main factors: usefulness (UF), ease of use (UE), and satisfaction (SA).

E. Reliability and Validity Analysis

The reliability averages (standardized Cronbach's α coefficients) of the usefulness, ease of use, and satisfaction measures were 0.848, 0.801, and 0.892 respectively, and the α coefficient for the total scale was 0.922. All were above the acceptable standard of 0.5. For the validity analysis, the composite reliability (CR) was 60.07%.
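For reference, a minimal sketch of how such a reliability coefficient can be computed (this is the raw-score form of Cronbach's α on toy data; the paper reports the standardized coefficient, and this is not the authors' analysis code):

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals).
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = respondents, columns = questionnaire items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy 5-point Likert responses: 4 respondents x 3 items.
print(cronbach_alpha([[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]))  # 0.9
```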

F. Procedure

The experiment consisted of two phases. In the first phase, participants completed a background survey, followed by a 10-question multiple-choice quiz for each video to evaluate their prior knowledge of the video's topic area. To counterbalance memory effects, the second phase was conducted a week after the first. The second phase consisted of two sessions. In each ten-minute session, participants viewed a video through one of the surrogates, either KVSUM or FF. While watching the summaries, participants were able to pause, jump backward and forward, or replay. After ten minutes of viewing, participants were instructed to stop and take a post-test quiz without reviewing the video surrogates. Finally, participants filled out the questionnaires for both sessions. The whole procedure took a total of 40 minutes.

IV. RESULTS AND DISCUSSION

A. Quiz Scores and the Satisfaction Questionnaire

A significant difference exists in the quiz scores of participants using the different types of video surrogates; the results are presented as follows.

TABLE I. RESULTS OF THE QUIZ SCORES

         Mean   S.D.   t      p
KVSUM    4.02   1.98   7.31   .000
FF       1.47   1.91   -      -

A paired-samples t-test on the total percentage of correct quiz answers was performed to evaluate differences in learner performance between KVSUM and FF. As shown in Table I, learners scored higher with KVSUM (M = 4.02, SD = 1.98) than with FF (M = 1.47, SD = 1.91). The difference between the two types of video surrogates with regard to learner performance was statistically significant at p < .001.
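A minimal sketch of such a paired-samples t-test with SciPy, on illustrative (not the study's) per-participant scores:

```python
# Paired t-test: each participant contributes one score per condition.
from scipy import stats

kvsum_scores = [5, 4, 6, 3, 4, 5]  # per-participant quiz scores (toy values)
ff_scores    = [2, 1, 3, 1, 2, 2]

t, p = stats.ttest_rel(kvsum_scores, ff_scores)
print(f"t = {t:.2f}, p = {p:.4f}")
```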

TABLE II. RESULTS OF THE SATISFACTION QUESTIONNAIRE

              Mean   S.D.   t       p
UE  KVSUM     3.52   .69    11.69   .000
    FF        2.10   .60    -       -
UF  KVSUM     3.63   .83    6.33    .000
    FF        2.60   .88    -       -
SA  KVSUM     3.43   .82    10.23   .000
    FF        2.09   .69    -       -

Note. UE = ease of use, UF = usefulness, SA = satisfaction.

Paired-samples t-tests were also performed on learners' perceived usefulness, ease of use, and satisfaction. On all three factors, KVSUM received higher scores than the FF, and the differences between the two types of video surrogates were significant (p < .001), as shown in Table II.

B. Discussion

Several participants noted that while using KVSUM they discovered details of some subjects in the video, and thus it was easy to recognize the main points to be comprehended without viewing the entire video.

The automatically generated keyword cloud and the summarized video transcription in KVSUM were effective verbal surrogates. The verbal surrogates reinforce the video fragments and thumbnails: while learners gain the overall concept of the video from verbal information, the visual surrogates support identifying and clarifying specific objects. In other words, the verbal surrogates compensate for the limited comprehension that results from viewing the visual surrogates alone. Accordingly, participants considered the keyword cloud, video transcriptions, and thumbnails useful for organizing the video contents and connecting the concepts behind the keywords.

For example, one noted: “I could quickly understand the contents of video by checking keywords and thumbnails.” Another participant expressed that “…the summarized transcription figured out the main idea from the video, and therefore, it [was] easier for me to view the video and more quickly get the key point.” Furthermore, some participants explicitly expressed that the keyword cloud enabled them to quickly review some blurred parts of the videos. However, other participants thought the order of keywords in the keyword cloud was a little messy, so that they had to concentrate more on it.

In contrast, most participants noted that although the FF displayed the videos directly and saved time by speeding them up, they had to replay or pause several times in order to comprehend the video contents. Some participants said they paid close attention to the video at first, but the speed made them lose the patience to keep focusing on it.

In sum, the video surrogates of KVSUM not only enabled learners to make sense of the content but also motivated them to look for more details of the video. However, a few participants indicated that some video fragments were too short for them to find the parts they were interested in. This situation was mainly caused by the time constraint of the video summarization and is inevitable in comparison with viewing full videos. Nevertheless, to be less sensitive to time, the user can easily adjust the length of the summaries in the KVSUM system by setting the time constraint.

V. CONCLUSION

This paper presents a keyword-based video summarization learning platform (KVSUM) which integrates image processing, text summarization, and keyword extraction techniques for learning. As multimedia environments provide ever richer media sources, learners demand a variety of verbal and visual information to assist their learning in a timely manner. To address this issue, KVSUM generates informative video summaries accompanied by both visual and verbal surrogates, with the aim of satisfying learners' needs. The effectiveness of KVSUM is validated by an empirical comparison with the established fast forward (FF). The results indicate that KVSUM is more attractive to learners than the FF. One reason is that the high speed of the FF may make learners uncomfortable while viewing videos, forcing them to make a greater effort or lose important information. In contrast, learners can follow their own pace in viewing the various surrogates provided by KVSUM. Furthermore, according to [10], a video surrogate should offer various facets and cues under user control. Consequently, future work may consider offering speed-rate options in the FF for comparison, as well as rethinking the mix of facets of video surrogates in KVSUM.

ACKNOWLEDGMENTS

The authors would like to thank all the subjects who participated in the study. This study was partially supported by grant NSC 97-2628-S-008-001-MY3 from the National Science Council of Taiwan.

REFERENCES

[1] R. E. Mayer, Multimedia Learning, 2nd ed. New York: Cambridge University Press, 2009, pp. 1-15.

[2] J. C. Yang, Y. T. Huang, C. C. Tsai, C. I. Chung, and Y. C. Wu, "An automatic multimedia content summarization system for video recommendation," Educational Technology & Society, 12(1), 2009, pp. 49-61.

[3] M. Yang, B. M. Wildemuth, G. Marchionini, T. Wilkens, G. Geisler, A. Hughes, and C. Webster, "Measuring user performance during interactions with digital video collections," Proc. of the American Society for Information Science and Technology, 40(1), 2003, pp. 3-12.

[4] Y. Song and G. Marchionini, "Effects of audio and visual surrogates for making sense of digital video," Proc. of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA, Apr. 2007, pp. 867-876.

[5] B. L. Yeo and M. M. Yeung, "Retrieving and visualizing video," Commun. ACM, 40(12), 1997, pp. 43-52.

[6] R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Video abstracting," Commun. ACM, 40(12), 1997, pp. 54-62.

[7] G. Marchionini, Y. Song, and R. Farrell, "Multimedia surrogates for video gisting: Toward combining spoken words and imagery," Inf. Process. Manage., 45(6), 2009, pp. 615-630.

[8] W. Ding, G. Marchionini, and D. Soergel, "Multimodal surrogates for video browsing," Proc. of the Fourth ACM Conference on Digital Libraries, Berkeley, California, USA, 1999, pp. 85-93.

[9] B. M. Wildemuth, G. Marchionini, T. Wilkens, et al., Alternative Surrogates for Video Objects in a Digital Library: Users' Perspectives on Their Relative Usability, Vol. 2458. Berlin: Springer, 2002.

[10] B. M. Wildemuth, G. Marchionini, M. Yang, G. Geisler, T. Wilkens, A. Hughes, and R. Gruss, "How fast is too fast?: Evaluating fast forward surrogates for digital video," Proc. of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, Texas, May 2003, pp. 221-230.

[11] H. Zhao and C. Kit, "Incorporating global information into supervised learning for Chinese word segmentation," Proc. of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), 2007, pp. 66-74.

[12] H. Yang, L. Chaisorn, Y. Zhao, S. Y. Neo, and T. S. Chua, "VideoQA: Question answering on news video," Proc. of the 11th ACM International Conference on Multimedia, 2003, pp. 632-641.

[13] G. Koutrika, Z. M. Zadeh, and H. Garcia-Molina, "Data clouds: Summarizing keyword search results over structured data," Proc. of the 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, 2009, pp. 391-402.

[14] S. Lohmann, J. Ziegler, and L. Tetzlaff, "Comparison of tag cloud layouts: Task-related performance and visual exploration," Proc. of INTERACT 2009, pp. 392-404.
