
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE

STOCKHOLM, SWEDEN 2017

Predicting movie success using machine learning techniques

POJAN SHAHRIVAR

CARL JERNBÄCKER

KTH, SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Predicting movie success using machine learning techniques

POJAN SHAHRIVAR, CARL JERNBÄCKER

Computer Engineering, Master of Science
Date: June 7, 2017
Supervisor: Iolanda Leite
Examiner: Örjan Ekeberg
Swedish title: Förutsägelse av filmers framgång med maskininlärning
School of Computer Science and Communication


Abstract

The use of machine learning to create predictive models has grown in recent years. The movie market is still large, with hundreds of new movies created every year. The purpose of this report is to investigate whether it is possible to classify movie rating and box office revenue with metadata available before release. This was done by building a classification model with metadata obtained from the internet, such as budget and which actors are involved. With the most successful technique, this study correctly predicted the rating of a movie about 82% of the time. When a model failed to predict the correct rating, it was usually off by one rating group, corresponding to a deviation of approximately 17%. Predictions of gross sales gave a positive result about 15% of the time. The results of this report are to a certain extent consistent with previous studies with a similar focus on rating prediction. The precision of the predictions could be increased further with a larger data set with more features.


Sammanfattning (Swedish summary)

The field of creating predictive models with machine learning has grown in recent years. The market for movies is large, with hundreds of new movies created every year. The purpose of this report is to investigate whether it is possible to classify movie rating and gross ticket sales with metadata that is available before release. This was done by building a classification model with metadata obtained from the internet, such as budget and which actors are involved. With the most successful model, this study correctly predicted the rating a movie receives in about 82% of cases. In the cases where the model failed to predict the correct rating, it was usually off by one rating class, which corresponded to a deviation of approximately 17%. When a prediction of gross sales was made, it gave a positive result in 15% of cases. The results of this report are to some extent consistent with previous studies with similar methodology. The precision of the predictions can be increased with an expanded data set with more attributes.


Contents

List of Figures
List of Tables

1 Introduction
    1.1 Problem statement
    1.2 Scope
    1.3 Purpose
    1.4 Definitions and Acronyms

2 Background
    2.1 Data
    2.2 Decision tree
    2.3 Support Vector Machine
    2.4 KNN
    2.5 Previous work

3 Method
    3.1 Data extraction
    3.2 Preprocessing of data
        3.2.1 Title length
        3.2.2 Year/Month/Day/Weekday
        3.2.3 Genre
        3.2.4 Actor/Director/Writer
        3.2.5 Rating value
        3.2.6 Box office revenue
    3.3 Feature selection
    3.4 Classification

4 Results
    4.1 Gross box office revenue
        4.1.1 Decision tree
        4.1.2 SVM
        4.1.3 KNN
    4.2 Rating

5 Discussion
    5.1 Features
    5.2 Data set
    5.3 Method

6 Conclusion

Bibliography

A Features in previous works
B Confusion matrices
C Genres


List of Figures

1.1 Visual representation of 5 fold cross validation
2.1 Average ratings and rating distribution (Source: IMDb, User ratings for Beauty and the Beast)
2.2 Simplified example of how a decision tree would be evaluated
2.3 Simplified example of what an SVM could look like
2.4 Simplified example of what KNN could look like
3.1 Visual representation of the steps involved in the study
3.2 The number of movies with each rating (from 0.0 to 10.0)
3.3 Scatter of average rating of director, writer and actor 1-3
3.4 Number of ratings corresponds to the final rating
3.5 Importance of attributes (predictors) using ReliefF algorithm. Attributes in descending order of prediction power.
4.1 Box office results presented in a diagram
4.2 ROC curve for linear SVM

Appendix figures (the rating scale used is given in parentheses):

1 Decision tree, 4 classes (3.6)
2 Decision tree, 4 classes (3.5)
3 Decision tree, 5 classes (3.7)
4 Decision tree, 8 classes (3.4)
5 KNN, 6 classes (3.2)
6 Decision tree, 6 classes (3.3)
7 Coarse KNN, 17 classes (3.2.5)


List of Tables

2.1 Features used in Predicting the Future With Social Media
2.2 Summary of results by previous work
3.1 Extracted data for each movie
3.2 Rating categories with boundaries structured so that the distribution of movies is symmetrical in each category around the median of the data set. See figure (3.2) for the distributions of all the ratings.
3.3 Rating categories divided into 6 subgroups of equal length over the range of the data set
3.4 Rating categories divided into 8 subgroups, each value is rounded down to the nearest whole number, which resulted in these 8 groups
3.5 Rating categories divided into 4 subgroups, the head (0-3.5) of the data set is cut off as well as the tail (9.5-10.0). The body is then divided into groups of equal length
3.6 Rating categories divided into 4 groups of equal length
3.7 Rating categories divided into 5 groups of equal length
3.8 Rating categories divided into 17 groups, each value is rounded to the nearest half integer
3.9 Final features that were used
4.1 Classification rate (%) with different response classes (see table 3.2-3.8)

Appendix tables:

1 Summary of previous works features


Chapter 1

Introduction

The worldwide box office revenue in 2016 was 38.3 billion USD [1], with hundreds of new movies made each year. A computer that could predict how successful a movie will be before it is released would be a powerful tool. An aspiring Hollywood director or a movie studio with some technical skills could predict whether their movie idea is going to be a safe investment. With the vast amount of data published on the Internet and the increasing power of modern computers, it may be possible to take advantage of these resources to make such predictions. This led to the following questions:

• Is it possible to use machine learning algorithms to predict if a movie will receive a high or low rating?

• Is the available data online viable for use in predicting movie success?

• Is there a correlation between attributes and success of a movie?

The study can be used as a proof of concept for applications in other areas, and should highlight some of the challenges one needs to overcome to successfully create a prediction model. This idea could in theory be extended to predict credit ratings, the stock market or the housing market. The only requirement is a vast and reliable data source.

Combining the questions above, and defining success in terms of a movie's rating and sales, produced the following problem statement.

1.1 Problem statement

This thesis will investigate: Is it possible to classify the rating and box office revenue of a movie using metadata freely available on the web?

1.2 Scope

The focus of this thesis is to formulate a method for preprocessing the data set and evaluating which attributes are the most useful, by evaluating the correlation between the attributes and the success rate of the machine learning. The method is then used to determine whether a viable success rate can be achieved when predicting the rating and box office revenue.


The data set will be obtained from IMDb using a web scraper built by the authors. The data set will be limited to 10,000 unique movies to keep the workflow manageable when processing the data. The machine learning algorithms used are those available in the Classification Learner module in MATLAB, mainly SVM, decision tree and KNN [2].

1.3 Purpose

Previous studies show that it is possible to predict the success of a movie using attributes such as budget, rating, actors and director. The goal of this thesis is therefore to examine this further, using a larger data set with features not previously used with machine learning.

1.4 Definitions and Acronyms

Box office revenue

Movies generate income from many revenue streams: movie rentals and purchases, theatre ticket (box office) sales, merchandising and home video are a few. To be able to compare movies, one revenue stream was selected: the income generated by theatre ticket sales, or box office sales. The figures used in this study are taken from Box Office Mojo, the leading online box office reporting service, owned and operated by IMDb.

Confusion matrix

The confusion matrix visualises the performance of the classification model. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
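As a minimal sketch of this layout (rows are actual classes, columns are predicted classes), the following Python snippet counts label pairs. The function name, the class labels and the six example movies are invented for illustration and are not taken from the study.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows correspond to actual classes, columns to predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical rating classes for six movies (illustrative values only).
actual = ["low", "low", "high", "high", "high", "low"]
predicted = ["low", "high", "high", "high", "low", "low"]

matrix = confusion_matrix(actual, predicted, ["low", "high"])
print(matrix)  # [[2, 1], [1, 2]]
```

The diagonal holds the correctly classified movies, so the classification rate in this toy case is (2 + 2) / 6.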

IMDb

Internet Movie Database, abbreviated IMDb, is an online database with movies, cast, directors, writers, fictional characters, production crew, plot summaries, trivia and more. IMDb is the most comprehensive online database of its kind and has over 70 million registered users.

MATLAB

MATLAB is a platform made for solving scientific and engineering problems. The platform has a library of prebuilt toolboxes that can be used to solve different problems. In this study, the Statistics and Machine Learning Toolbox is used.

Web scraper

A web scraper is an automated program that extracts data from websites to a local database or spreadsheet.


ROC curve

A receiver operating characteristic (ROC) curve is a plot that illustrates the performance of a classification model by plotting the true positive rate against the false positive rate [3]. A ROC curve is a common way of representing the results of a binary classifier. The area under the curve (AUC) summarises how good a classifier is: a classifier with an AUC value of 1 is a perfect classifier, while a classifier with a value of 0.5 performs no better than random guessing.
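The relationship between the ROC curve and the AUC can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the scores are invented, labels are assumed to be 1 (positive) or 0 (negative), and tied scores are assumed not to occur.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the decision threshold step by step, from the highest score down.
    for _, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A classifier that ranks every positive above every negative is perfect.
perfect = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0]))
# One positive ranked below a negative lowers the area.
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
print(perfect, mixed)
```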

Cross validation

With cross validation, the original data set is partitioned into random subsets of equal size [4]. One subset is used as the validation data and the remaining subsets are used as training data. This procedure is then repeated as many times as desired, and the results are averaged and presented as the validated result. See figure (1.1) for a visual representation. This validation technique was used with 5 folds for all the results presented below.

Figure 1.1: Visual representation of 5 fold cross validation
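A minimal Python sketch of the 5 fold procedure is shown here. It is not the MATLAB implementation used in the study; the function names, the fixed random seed and the `evaluate` callback are assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by evaluate(train, validation) over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
print(len(folds), [len(val) for _, val in folds])
```

Each sample ends up in the validation set exactly once, so every movie contributes to the averaged result.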


Chapter 2

Background

This chapter gives a deeper understanding of the different parts of the study. It covers the data set used and the different methods of evaluating the data. Some previous work on predictive models for movie success is also mentioned because of the methods used.

2.1 Data

All the data used is retrieved from IMDb. IMDb is an online movie database that was launched 26 years ago, on October 17, 1990, and is now owned by Amazon. The data on IMDb is submitted by professionals or registered users and is moderated before going live.

IMDb lets registered users cast their vote on a scale of 1.0-10.0. IMDb displays weighted vote averages rather than raw averages. Various filters are applied to the raw votes to reduce the effect of ballot-stuffing by organisations and users who wish to change the rating of a movie. See figure (2.1) for how IMDb presents the rating for a movie (in this case Beauty and the Beast); the value used in this report is the weighted mean value of the IMDb users. Using a weighted mean makes the rating more robust as an estimator for distributions that are not normal. This is to ensure that the final rating is representative of the general voting population and not influenced by individuals or organisations who want to sway the rating in a specific direction.

Figure 2.1: Average ratings and rating distribution (Source: IMDb, User ratings for Beauty and the Beast)

The exact method of calculating the weighted mean rating is not disclosed [5]. However, the method is used across the entire database without exception, which means that there is no bias in which movies are affected.

The rating used in this thesis is the rating that is displayed, not the raw average rating.

2.2 Decision tree

Decision trees are essentially a series of questions with binary answers. Each answer leads to a new yes-or-no question until a prediction can finally be made. See figure 2.2 for an example.

Figure 2.2: Simplified example of how a decision tree would be evaluated
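To make the chain of yes-or-no questions concrete, here is a tiny hand-written tree in Python. The features, thresholds and class names are invented for illustration; a trained tree would learn its questions from the data, and this sketch is not the tree shown in the figure.

```python
def predict_rating_class(movie):
    """Walk a tiny hand-written decision tree (all thresholds are hypothetical)."""
    if movie["metascore"] >= 60:            # question 1
        if movie["director_rating"] >= 7:   # question 2, asked after a "yes"
            return "high"
        return "medium"
    if movie["budget_musd"] >= 100:         # question 2, asked after a "no"
        return "medium"
    return "low"

example = {"metascore": 72, "director_rating": 7.4, "budget_musd": 40}
print(predict_rating_class(example))  # high
```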


2.3 Support Vector Machine

Support vector machine (SVM) is a supervised machine learning model that is used for classification. SVMs work by maximising the margin around the separating hyperplane. In a linear SVM with two features, the classes can be separated by a line; see figure (2.3) for an example of what the model could look like. For example, the red values could be answer A and the blue values answer B. If a new value were introduced to the system and positioned on the red side, the model would predict the new value to be answer A. If more answers are possible, hyperplanes are created to split all the answers into different areas.

Figure 2.3: Simplified example of what an SVM could look like
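The prediction step of a linear SVM can be sketched as checking which side of the hyperplane w·x + b = 0 a point falls on. The sketch below assumes the weights `w` and offset `b` have already been found by training; the values used are hypothetical and training itself is not shown.

```python
import math

def linear_svm_predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "A" if score >= 0 else "B"

def margin_width(w):
    """Distance between the two supporting hyperplanes: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical trained parameters for two features.
w, b = [1.0, -1.0], 0.0
print(linear_svm_predict(w, b, [2.0, 1.0]))  # A
print(linear_svm_predict(w, b, [1.0, 2.0]))  # B
```

Maximising the margin corresponds to making the norm of `w` as small as possible while still separating the training points.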

2.4 KNN

K nearest neighbours (KNN) is a simple and easy to use machine learning algorithm. The results are easy to interpret, and it is computationally light, so the calculation time is a fraction of that of other algorithms. This comes at the cost of predictive power. KNN works by counting, among the k nearest neighbours of a new value, how many belong to each type; the type with the most neighbours is the prediction. See figure (2.4) for an example. If the blue diamond is a new value to be predicted, the algorithm would find that most of its nearest neighbours are red squares, and the model would therefore predict that the diamond is a square.

Figure 2.4: Simplified example of what KNN could look like
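The majority vote among the nearest neighbours can be sketched as follows. Euclidean distance and k = 3 are assumptions made for this example, and the points are invented to mirror the squares in the figure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label held by most of the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5)]
labels = ["square", "square", "square", "circle", "circle"]
print(knn_predict(points, labels, (2, 2)))  # square
```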


2.5 Previous work

Previous work has been done on the prediction of movie revenue. Nikhil Apte, Mats Forssell and Anahita Sidhwa [6] used several machine learning techniques on a data set retrieved from IMDb. They set up some restrictions: the box office revenue had to be greater than $1,000,000 after taking inflation into account, and the movie had to be made after January 1, 1990. The final data set contained 2510 movies, and they achieved a median error of around 35 percent.

In August 2016 a paper was published on the same subjects as this report, machine learning and rating prediction. Muhammad Hassan Latif and Hammad Afza [7] wrote the paper for IJCNS; it is about using machine learning to predict the rating of a movie. During preprocessing, they categorised each movie's budget on a scale from 1 to 9 to make it easier to manage. They also categorised the rating of a movie into 4 different categories: Terrible, Poor, Average and Excellent. Using 2000 data points, this method achieved a success rate of around 80% when predicting the rating of a movie.

Another method of making predictions measures and analyses social media interaction. In this case the volume of tweets and their content are used as the data set. Predicting the Future With Social Media, a study conducted by Sitaram Asur and Bernardo A. Huberman [8], shows that it is possible to use social media to predict the box office revenue of a movie. This study used Twitter as the source of the data. The data set, with 2.89 million tweets from 1.2 million users, was used to create a linear regression model which obtained an accuracy of 98%. See table (2.1) for the features used. They also used metadata from the tweets, such as the timestamp, to create new features such as tweet rate.

Features    Value
Urls        percentages of urls during critical period
Retweets    percentages of retweets during critical period

Table 2.1: Features used in Predicting the Future With Social Media

The study by Karl Persson [9] at the University of Skövde aimed to compare the predictive performance of random forest with that of support vector machines. He achieved a success rate of 84% when using random forest, and a success rate of 86% when using support vector machines. To validate his results he used 10-fold cross validation. See appendix A for a summary of all features used in previous studies.

Study                                                                                        Validation                  Prediction          Success rate
Predicting Movie Box Office Gross                                                            20% withhold from data set  Movie revenue       65%
Prediction of Movies popularity Using Machine Learning Techniques                            10 fold cross validation    Movie rating        80%
Predicting movie ratings: A comparative study on random forests and support vector machines  10 fold cross validation    Movie rating        83%
Predicting the Future With Social Media                                                      Cross validation            Box office revenue  98%

Table 2.2: Summary of results by previous work


Chapter 3

Method

Figure 3.1: Visual representation of the steps involved in the study

The methodology is divided into 5 steps, the first being data selection and extraction, see section (3.1). The data is then preprocessed, as described in section (3.2). The preprocessed data set is then used with different algorithms to determine and select which features (3.3) to use in the final data set for the classification model (3.4).

3.1 Data extraction

IMDb is the source of the data used in this study. IMDb does not have an open API, so a web scraper [10] was made to extract the relevant data. The scraper is written in Node.js and sequentially downloads the information for each movie into a database. The movies are sorted in descending order by number of ratings, and only movies that originate from the US are included, as movies from other countries have their budget reported in their native currency. Multiple currencies would require further preprocessing to normalise the data, which was avoided in this case.


Each entry in the data set contains: title, runtime, release date, genre, IMDb rating, number of ratings, metascore, number of Oscars, budget (USD), gross (USD), director (average rating of movies and number of Oscars won), writers (same as director) and top billed cast (same as director). The currencies are not adjusted for inflation on IMDb. Average ratings are based on the latest 10 movies that the director, writer or actor has been credited in.

Value                Type                 From   To
Title                Title
Run time             Int                  68     271
Year                 Int                  1918   2017
Month                Int                  1      12
Day                  Int                  1      31
Weekday              Int                  1      7
Genre                Genres (appendix C)  1      324
Rating value         Float                1.6    9.6
Rating count         Int                  9999   1793960
Meta score           Int                  1      100
Oscar                Int                  0      11
Budget               Int                  0      384
Gross revenue        Int                  1      1301
Director             Name                 1      1445
Director Oscars      Int                  0      6
Director Rating      Float                5.0    8.0
Writer               Name
Writer Oscars        Int                  0      6
Writer Rating        Float                4.0    8.0
First Actor          Name
First Actor Oscars   Int                  0      4
First Actor Rating   Float                5.0    8.0
Second Actor         Name
Second Actor Oscars  Int                  0      4
Second Actor Rating  Float                5.0    8.0
Third Actor          Name
Third Actor Oscars   Int                  0      4
Third Actor Rating   Float                4.0    8.0

Table 3.1: Extracted data for each movie

3.2 Preprocessing of data

Duplicate values were removed. Cells without a value were interpreted as NaN instead of removing the entire row. Movies with fewer than 10,000 ratings were taken out of the data set, as their ratings tended to be quite volatile and not representative of the majority of viewers. Movies with very few ratings also tend to have inflated ratings. Movies that made less than 1 million USD in gross box office revenue after taking inflation into account were also removed from the data set.
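The filtering rules above could be sketched in Python as follows. The field names, the use of title and year as the duplicate key, and the example rows are all assumptions made for illustration; the sketch also assumes the rating count and the inflation-adjusted gross are present for every row.

```python
import math

def preprocess(movies, min_ratings=10_000, min_gross_usd=1_000_000):
    """Drop duplicates and unreliable rows; keep rows with gaps, storing gaps as NaN."""
    seen = set()
    cleaned = []
    for movie in movies:
        key = (movie["title"], movie["year"])  # assumed duplicate key
        if key in seen:
            continue
        seen.add(key)
        if movie["rating_count"] < min_ratings:
            continue  # too few votes, rating considered volatile
        if movie["gross_adjusted_usd"] < min_gross_usd:
            continue  # inflation-adjusted gross below 1 million USD
        cleaned.append({k: (math.nan if v is None else v) for k, v in movie.items()})
    return cleaned

# Hypothetical rows (illustrative values only).
movies = [
    {"title": "A", "year": 2000, "rating_count": 20_000, "gross_adjusted_usd": 5_000_000, "metascore": None},
    {"title": "A", "year": 2000, "rating_count": 20_000, "gross_adjusted_usd": 5_000_000, "metascore": None},
    {"title": "B", "year": 2001, "rating_count": 500, "gross_adjusted_usd": 9_000_000, "metascore": 70},
    {"title": "C", "year": 2002, "rating_count": 15_000, "gross_adjusted_usd": 400_000, "metascore": 55},
]
cleaned = preprocess(movies)
print(len(cleaned))  # 1
```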


Figure 3.2: The number of movies with each rating (from 0.0 to 10.0). Min: 3.475, Max: 9.469, Mean: 6.472, Range: 5.994

Although the rating scale is 0.0-10.0, the ratings tend to be concentrated around the mean. The range of the ratings indicates that the rating scale can be trimmed to a much smaller scale that is more representative of the actual movie quality.

Figure 3.3: Scatter of average rating of director, writer and actor 1-3

In figure (3.3) we can see a scatter plot of the average rating of the director, writer and star actors. The figure illustrates the correspondence between cast rating and movie rating.


Figure 3.4: Number of ratings corresponds to the final rating

3.2.1 Title length

Title length is the length of the English title of each movie. The shortest title was 1 character long, and the longest was 66 characters long.

3.2.2 Year/Month/Day/Weekday

Year, month and day give the date of the first screening of a movie. Weekday is the day of the week of the first screening, Monday through Sunday.
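A small Python sketch of how one release date can be split into these four features is given here. Numbering the weekdays 1 (Monday) to 7 (Sunday) is an assumption that matches the 1-7 range in table 3.1, and the example date is arbitrary.

```python
from datetime import date

def date_features(release):
    """Split a release date into year, month, day and weekday (Monday=1 .. Sunday=7)."""
    return {
        "year": release.year,
        "month": release.month,
        "day": release.day,
        "weekday": release.isoweekday(),
    }

features = date_features(date(2017, 6, 7))  # arbitrary example date, a Wednesday
print(features)
```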

3.2.3 Genre

Genre is the combination of genres given to a movie; for example, a Comedy is different from an Action|Comedy. A complete list of all the genre combinations can be found in the appendix under Genres.

3.2.4 Actor/Director/Writer

Each actor who played in a movie received a unique id. For the primary actor of a movie the ids ranged from 1 to 1180, for the secondary actor from 1 to 3157 and for the third actor from 1 to 4890. The same procedure was used for directors (1 to 1445) and writers (1 to 3411).
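Turning names or genre combinations into integer ids can be sketched as below. Assigning ids in order of first appearance is an assumption; the study does not state how the ids were ordered.

```python
def assign_ids(values):
    """Map each distinct value to an integer id, starting at 1 in order of first appearance."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids) + 1
        encoded.append(ids[value])
    return encoded, ids

genres = ["Comedy", "Action|Comedy", "Comedy", "Drama"]
encoded, ids = assign_ids(genres)
print(encoded)  # [1, 2, 1, 3]
```

Because Action|Comedy receives its own id, a genre combination is treated as a separate category rather than as a mix of its parts.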

3.2.5 Rating value

Ratings were categorised using 7 different rating scales, see table 3.2 to table 3.8, so the number of prediction outcomes was reduced from 101 to a value between 4 and 17. While this reduces the resolution of the prediction, it decreases the misclassification error significantly. This also reflects the range of the data, see figure (3.2).


Category  From  To
1         0     3.6
2         3.7   5.0
3         5.1   6.5
4         6.6   7.8
5         7.9   8.4
6         8.5   10.0

Table 3.2: Rating categories with boundaries structured so that the distribution of movies is symmetrical in each category around the median of the data set. See figure (3.2) for the distributions of all the ratings.

Category  From  To
1         0     4.4
2         4.4   5.4
3         5.4   6.4
4         6.4   7.4
5         7.4   8.4
6         8.4   10.0

Table 3.3: Rating categories divided into 6 subgroups of equal length over the range of the data set

Category  From  To
1         1     2
2         2     3
3         3     4
4         4     5
5         5     6
6         6     7
7         7     8
8         8     9

Table 3.4: Rating categories divided into 8 subgroups, each value is rounded down to the nearest whole number, which resulted in these 8 groups

Category  From  To
1         3.5   5.0
2         5.0   6.5
3         6.5   8.0
4         8.0   9.5

Table 3.5: Rating categories divided into 4 subgroups, the head (0-3.5) of the data set is cut off as well as the tail (9.5-10.0). The body is then divided into groups of equal length

Category  From  To
1         0     2.5
2         2.5   5.0
3         5.0   7.5
4         7.5   10.0

Table 3.6: Rating categories divided into 4 groups of equal length

Category  From  To
1         0     2
2         2     4
3         4     6
4         6     8
5         8     10

Table 3.7: Rating categories divided into 5 groups of equal length

Category  From  To
1         1.5   2.0
2         2.0   2.5
3         2.5   3.0
4         3.0   3.5
5         3.5   4.0
6         4.0   4.5
7         4.5   5.0
8         5.0   5.5
9         5.5   6.0
10        6.0   6.5
11        6.5   7.0
12        7.0   7.5
13        7.5   8.0
14        8.0   8.5
15        8.5   9.0
16        9.0   9.5
17        9.5   10.0

Table 3.8: Rating categories divided into 17 groups, each value is rounded to the nearest half integer
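Two of these categorisations can be sketched in Python: groups of equal length (as in tables 3.6 and 3.7) and rounding to the nearest half (in the spirit of table 3.8). The tables do not say which group a rating exactly on a boundary belongs to, so the sketch assumes the lower bound is inclusive and places 10.0 in the last group.

```python
import math

def equal_width_category(rating, n_groups=5, low=0.0, high=10.0):
    """Category 1..n_groups for groups of equal length (lower bound inclusive, assumed)."""
    width = (high - low) / n_groups
    index = int((rating - low) // width) + 1
    return min(index, n_groups)  # the top boundary (10.0) goes in the last group

def nearest_half(rating):
    """Round a rating to the nearest half step, e.g. 7.3 -> 7.5."""
    return math.floor(rating * 2 + 0.5) / 2

print(equal_width_category(6.5))              # 4, the 6-8 group with 5 groups
print(equal_width_category(6.5, n_groups=4))  # 3, the 5.0-7.5 group with 4 groups
print(nearest_half(7.3))                      # 7.5
```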


3.2.6 Box office revenue

When predicting the gross box office revenue the same method was used, the scrapedvalue for the gross box office revenue see table 3.1 . The value was rounded of to thenearest 10 million after taking inflation to account[6]. This was made to reduce the num-ber of possible outcomes from millions to a Little more than hundred. The formula toused to take inflation to account was:

$$X_{\text{pres}} = X_{\text{then}} \cdot (1 + \text{AvgInf})^{\Delta y} \tag{3.1}$$

Here X_then is the revenue at the time of release and X_pres is its present-day value. AvgInf (the average inflation) was assumed to be 2.5%, and Δy is the number of years from release until the present day.
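As a worked illustration of equation (3.1) and the subsequent rounding, consider the Python sketch below. The function names are hypothetical and the code is not taken from the study. In particular, the rule that exact ties round upwards is an assumption, as the text only says that values were rounded to the nearest 10 million.

```python
import math

AVG_INFLATION = 0.025  # 2.5% per year, the average inflation assumed in the text


def adjust_for_inflation(revenue_then, release_year, present_year,
                         avg_inflation=AVG_INFLATION):
    """Equation (3.1): X_pres = X_then * (1 + AvgInf) ** delta_y."""
    delta_y = present_year - release_year
    return revenue_then * (1 + avg_inflation) ** delta_y


def round_to_nearest_10_million(amount):
    """Round to the nearest multiple of 10 million (assumption: ties round up)."""
    step = 10_000_000
    return math.floor(amount / step + 0.5) * step


if __name__ == "__main__":
    # A movie that grossed 100 million in 2007, expressed in 2017 terms.
    adjusted = adjust_for_inflation(100_000_000, 2007, 2017)
    print(round_to_nearest_10_million(adjusted))  # 130000000
```

With ten years of 2.5% inflation the factor is roughly 1.28, so 100 million becomes about 128 million and falls into the 130 million group.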

3.3 Feature selection

The data set includes categorical features such as genre, for which numerical transformations are inappropriate; feature selection is therefore the main technique for reducing the number of dimensions. Feature selection reduces the number of predictor variables by selecting a subset of them to create a prediction model. This can be done manually or with a search algorithm. In this study a search algorithm was used to reduce the workload. The algorithm in question is sequential feature selection, which has two parts:

• A criterion which the algorithm seeks to minimise over the feature subspace; in our case this is the misclassification rate.

• A search algorithm which explores the feature subspace in search of the optimal feature selection. The search algorithm has two variants: sequential forward selection (SFS), which adds features to an empty candidate set, and sequential backward selection (SBS), which removes features from a full candidate set.

The Statistics and Machine Learning Toolbox in MATLAB has a built-in function for this, which was used with the SFS option.

The data set contained 29 features to begin with, and the SFS algorithm selected 10 features for the candidate set: runtime, rating count, genre, metascore, budget, the ratings of actors 1-3, director rating and writer rating. All the different combinations of genre are listed in the appendix under section Genres.
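The greedy search behind sequential forward selection can be sketched in a few lines of Python. This is an illustration of the general technique, not the MATLAB function used in the study. The criterion `toy_misclassification`, its numbers and the feature name "title length" are invented for the example; in practice the criterion would be a cross-validated misclassification rate.

```python
def sequential_forward_selection(features, criterion):
    """Greedy SFS: start from an empty set and repeatedly add the feature
    that lowers `criterion` the most; stop when no feature improves it.

    `criterion(subset)` returns the value to minimise for a list of features.
    """
    selected = []
    remaining = list(features)
    best_score = float("inf")
    while remaining:
        # Score every candidate set formed by adding one more feature.
        score, candidate = min(
            ((criterion(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if score >= best_score:
            break  # no remaining feature improves the criterion
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score


# Hypothetical criterion: misclassification in percent. Useful features
# lower it, any other feature adds one point of noise.
GAIN = {"budget": 20, "metascore": 15, "runtime": 5}


def toy_misclassification(subset):
    gain = sum(GAIN.get(f, 0) for f in subset)
    noise = sum(1 for f in subset if f not in GAIN)
    return 60 - gain + noise


if __name__ == "__main__":
    names = ["runtime", "title length", "budget", "metascore"]
    print(sequential_forward_selection(names, toy_misclassification))
```

Sequential backward selection follows the same pattern but starts from the full set and removes one feature per step.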

Feature | Value | Range
Runtime | Positive integer | 68-271
Genre | Positive integer | 1-324
Rating count | Positive integer | 9999-1793960
Metascore | Positive integer | 1-100
Budget | Positive integer | 1-38
Actor 1 rating | Float | 5.0-8.0
Actor 2 rating | Float | 5.0-8.0
Actor 3 rating | Float | 4.0-8.0
Director rating | Float | 5.0-8.0
Writer rating | Float | 4.0-8.0

Table 3.9: Final features that were used


Figure 3.5: Importance of attributes (predictors) using the ReliefF algorithm. Attributes are shown in descending order of prediction power.

3.4 Classification

The three algorithms used in this study have many variants, all of which are implemented in MATLAB:

• Decision trees: Deep tree, medium tree, and shallow tree

• Support vector machines: Linear SVM, fine Gaussian SVM, medium Gaussian SVM, coarse Gaussian SVM, quadratic SVM, and cubic SVM

• Nearest neighbor classifiers: Fine KNN, medium KNN, coarse KNN, cosine KNN, cubic KNN, and weighted KNN

Decision trees differ in complexity based on the number of splits, that is, the depth of the tree. A deep tree has ≤ 100 splits, a medium tree ≤ 20 splits and a shallow tree < 4 splits. The simple and fast classifiers (shallow tree, linear SVM and coarse KNN) are used for feature selection, while the accurate but slowly trained classifiers (fine KNN, cubic SVM and deep tree) are used in the end to build the final classification models.


Chapter 4

Results

This chapter presents all results of the study. All values are calculated using decision tree, SVM and KNN classifiers and validated using 5-fold cross-validation.
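For readers unfamiliar with the validation scheme, the Python sketch below shows how a 5-fold cross-validated success rate can be computed. It is a generic illustration, not the MATLAB procedure used in the study. The names `k_fold_indices`, `cross_validate` and `majority_class` are hypothetical, and the majority-class predictor merely stands in for a real classifier.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_predict, k=5, seed=0):
    """Return the fraction of samples predicted correctly when each fold
    is held out once and the model is trained on the other k - 1 folds."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(samples)


def majority_class(train_x, train_y, test_x):
    """Stand-in classifier: always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)


if __name__ == "__main__":
    samples = list(range(10))
    labels = [3] * 8 + [2] * 2  # most entries fall in one class
    print(cross_validate(samples, labels, majority_class))  # 0.8
```

Because the folds depend on the random shuffle, a different `seed` can give a slightly different success rate for a real classifier, which is why fixing the seed makes runs comparable.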

4.1 Gross box office revenue

Figure 4.1: Box office results presented in a diagram

Two different models were created for this section. The first tried to predict the gross box office revenue to the nearest ten million. The second tried to predict whether the movie would make money: not a specific amount, just whether the revenue from ticket sales exceeded the budget.

4.1.1 Decision tree

Using a medium decision tree, a success rate of 15.0% was achieved. All features listed in section 3.3 were used to predict the grouped gross box office revenue. When trying to predict whether a movie would make money, the success rate increased to 67.4%.


4.1.2 SVM

A linear SVM was used, which produced a success rate of 11.0%. For the second model, which tried to predict whether a movie would make money, the linear SVM had the highest success rate of the three classifiers, 68.8%. See figure (4.2) for the ROC curve.

Figure 4.2: ROC curve for linear SVM

4.1.3 KNN

Using a coarse KNN algorithm, a success rate of 12.2% was achieved. The same features that were used in the decision tree were also used in the KNN. The second model, which checked whether a movie would make money, had a success rate of 66.8% when using a coarse KNN algorithm.

4.2 Rating

Response classes | Medium Tree | Linear SVM | Coarse KNN
4 classes (3.5) | 68.3 | 60.2 | 68.3
4 classes (3.6) | 83.0 | 68.3 | 82.0
5 classes (3.7) | 76.0 | 64.8 | 75.5
6 classes (3.2) | 53.7 | 52.4 | 54.5
6 classes (3.3) | 54.0 | 52.1 | 54.0
8 classes (3.4) | 53.1 | 47.5 | 52.6
17 classes (3.8) | 32.2 | 31.8 | 32.6

Table 4.1: Classification rate (%) per algorithm with different response classes (see tables 3.2-3.8)


In almost all cases, a medium decision tree (20 splits) provided the best results. Classification accuracy decreases as the number of classes increases. The highest accuracy is achieved with 4 response classes and a medium tree. This is due to the high concentration of entries in the third class. See appendix B for a visual representation of the performance of the best performing classifier for each response class.


Chapter 5

Discussion

In this chapter a discussion is presented with respect to the features, the data set and the method.

5.1 Features

The results of this study are arguably on par with the other studies mentioned when predicting the rating. Since the aim of this study was to take the techniques used by previous studies and improve them by adding more features and a larger amount of data, it is hard to claim that it was a success, in part because our achieved rate of successfully predicting the rating was equal to or lower than that of all the studies used as the theoretical basis for this study. Compared with the paper by Muhammad Hassan Latif and Hammad Afzal [7], who achieved a success rate of around 80% when predicting the rating of a movie, there were some slight differences. For instance, we used several different groupings of various sizes, for example one whose groups had varying spans and were split symmetrically around the median of the data set, whereas Latif and Afzal split the rating into 4 outcomes of equal span. They also split the budget into 9 different categories, where we instead rounded it to the nearest 10 million. In this case no apparent improvement was made in the predictability of the model. By splitting the ratings into groups like Latif and Afzal, we managed to increase the success rate slightly using our method, from their 80% to 83% (see table 3.6). If one compares the groups presented in table (3.2) and table (3.3), there is a slight increase in predictability, which agrees with our preliminary test of splitting the rating into groups of varying span around the median. This showed an increase in predictability on our data set, and one can therefore argue that further improvement to the method can be achieved.

This project could only obtain some of the features presented on IMDb. Instead of summing up all the awards for a movie, the scraper used could only get the number of Oscars. This left out many different awards, ranging from Golden Globes to small film festivals. The decision was made to get the number of Oscars because it is the most prestigious award. This ended up being a problem, because only a handful of movies have ever received an Oscar, so most of the data regarding the number of Oscars a movie had accumulated ended up being useless. A similar problem with plot keywords was also discovered too late. The problem was the difficulty of getting the scraper to store all the plot keywords in a correct manner. Therefore this feature was also removed from the final data set. If these two features could be added to the data set, they would probably increase the predictability even further, on the basis of previous work done in the area. Future work might benefit from looking beyond IMDb as the data provider and creating an even bigger data set. Judging by [8], Twitter looks to be a useful provider of metadata for making accurate predictions.

5.2 Data set

For a number of reasons, the final data set was not as large as intended. The primary reason was that the data set contained a large number of duplicate movies. This was because the scraper did not obtain the entire data set in one session; during the time between sessions the data changed, and the software interpreted a changed entry as a new movie. The scraper was told to scrape from the highest rated movies to the lowest rated movies on IMDb and to stop upon reaching 10 000 movies. The duplicates were not discovered until it was too late for changes to be made. As a result, the data set was not as large or as diverse as planned from the start. The ratings in the data set were also concentrated in the middle of the scale (3.4), so there were very few movies with a rating below 5.0 or above 8, which was reflected in a high misclassification rate.

A better method of extracting raw data was discovered late in the project. IMDb releases its database as raw files that can be processed into a MySQL database. It should be mentioned that the database contains more than 30 tables, and writing queries that retrieve all the features used in this study is no simple task.

5.3 Method

There was a small uncertainty in the validation method: because the cross-validation randomised the partitioning of the data, there was a small difference in the output value between runs (around 1 to 2 percent). Though this was a small ambiguity, it created by itself an uncertainty in the result.

When looking at the confusion matrices for predicting the rating, one can see that the main confusion is around the middle, where the majority of the data is aggregated. Our model had a tendency to overpredict the rating of the movie; with further analysis of why this is the case, the success rate might be greatly increased.
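The tendency to overpredict can be read directly from a confusion matrix by comparing the counts above and below the diagonal. The Python sketch below shows the idea on invented labels; the function names and the example data are hypothetical and do not reproduce the matrices in appendix B.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes (1-based labels)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t - 1][p - 1] += 1
    return matrix


def over_and_under_predictions(matrix):
    """Count predictions above (over) and below (under) the true class."""
    n = len(matrix)
    over = sum(matrix[t][p] for t in range(n) for p in range(n) if p > t)
    under = sum(matrix[t][p] for t in range(n) for p in range(n) if p < t)
    return over, under


if __name__ == "__main__":
    # Invented example with 4 rating classes, most entries in class 3.
    true = [2, 3, 3, 3, 4]
    pred = [3, 3, 4, 3, 4]
    m = confusion_matrix(true, pred, n_classes=4)
    print(m)
    print(over_and_under_predictions(m))  # (2, 0)
```

A markedly larger first count than second indicates that the model places movies in too high a rating class more often than in too low a one.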

Preprocessing of genres could have been done differently. In our study, each combination of genres was considered a unique genre, which resulted in a very large spread (see Genres in the appendix for the complete list). This made genre a weaker predictor than it was expected to be.
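The spread can be illustrated with two entries from the genre list in the appendix, "Action|Comedy|Sci-Fi" (22) and "Comedy|Action|Sci-Fi" (80), which contain the same genres in a different order but count as separate values. The Python sketch below mimics that encoding and contrasts it with comparing the sets of individual genres. The function names are hypothetical, and the set-based comparison is only one possible alternative, not something the study evaluated.

```python
def combination_ids(genre_strings):
    """Give every distinct combination string its own 1-based integer ID,
    in order of first appearance (mimics treating each combination as a
    unique genre)."""
    ids = {}
    for combination in genre_strings:
        ids.setdefault(combination, len(ids) + 1)
    return ids


def genre_set(genre_string):
    """Split a combination such as 'Action|Comedy' into its individual genres."""
    return frozenset(genre_string.split("|"))


if __name__ == "__main__":
    movies = ["Action|Comedy|Sci-Fi", "Comedy|Action|Sci-Fi", "Action|Comedy|Sci-Fi"]
    print(combination_ids(movies))
    # The two spellings get different IDs although the genres are identical.
    print(genre_set(movies[0]) == genre_set(movies[1]))  # True
```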


Chapter 6

Conclusion

The intent of this study, to create a way of predicting the success of movies using historical data, was not a direct success; it shows that other features are necessary to make an accurate prediction. When trying to predict the rating of a movie, a highest success rate of around 65% was achieved when using 6 groups and 83% when using 4 groups. This could arguably be a good enough result and shows promise, but when compared to previous studies with similar methodologies the result is only equal in predictability.

When trying to predict the box office revenue of a movie, a highest success rate of 15% was achieved. This result shows that the methods used in this study are difficult to use for an accurate prediction; when compared to previous studies, it shows that our method of preprocessing the data is not the best way to do it. When giving the model a binary decision of "will the revenue be greater than the budget", a success rate in the high 60s in percent was achieved. Therefore one can draw the following conclusion: using historical data to create a model to predict the rating of a movie shows promise and deserves to be properly evaluated. Using the same historical data to predict the box office revenue, however, shows no promise with our method. Previous studies have shown that it is possible to make accurate predictions, but the methods used in this study did not improve upon them.


Bibliography

[1] Statista Inc. Global box office revenue from 2016 to 2020, 2016. URL https://www.statista.com/statistics/259987/global-box-office-revenue/.

[2] MathWorks. Classification learner, 2017. URL https://se.mathworks.com/help/stats/classificationlearner-app.html.

[3] ROC curves and area under the curve explained, 2017. URL http://www.dataschool.io/roc-curves-and-auc-explained/.

[4] Jeff Schneider. Cross validation, 1997. URL http://www.cs.cmu.edu/~schneide/tut5/node42.html.

[5] IMDb weighted mean rating, 2017. URL http://www.imdb.com/help/show_leaf?votes.

[6] Nikhil Apte, Mats Forssell, and Anahita Sidhwa. Predicting movie revenue. CS229, Stanford University, 2011.

[7] Muhammad Hassan Latif and Hammad Afzal. Prediction of movies popularity using machine learning techniques, 2016. URL http://paper.ijcsns.org/07_book/201608/20160820.pdf.

[8] Sitaram Asur and Bernardo A Huberman. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492–499. IEEE, 2010.

[9] Karl Persson. Predicting movie ratings: A comparative study on random forests and support vector machines, 2015.

[10] Pojan Shahrivar. Web scraper source code, June 7, 2017. URL https://gits-15.sys.kth.se/pojans/kth.kex.scraper.


A Features in previous works

Feature / Study | Predicting Movie Box Office Gross | Prediction of Movies popularity Using Machine Learning Techniques | Predicting movie ratings: A comparative study on random forests and support vector machines
Box Office | Into buckets of a logarithmic scale | NaN | NaN
Actor | By Name | NaN | NaN
Director | By Name | NaN | NaN
Release date | Date | NaN | NaN
Length of movie | Int | NaN | NaN
MPAA Rating | G/PG/PG13/R/NR | G/PG/PG13/R/NR | G/PG/PG13/R/NR
Genre | NaN | 20 different numbers, one for each genre | 20 different genres
Rating | NaN | 4 Groups 0.0-2.5-5.0-7.7-10.0 | 0.0-10.0
Budget | NaN | 9 Different groups | Positive integer
Awards | NaN | 4 values: Oscar Won, Oscar Nominee, Golden Globe won, Golden globe nominee | NaN
Screens | NaN | Number of screens on opening weekend | NaN
Opening weekend | NaN | 9 Different groups | NaN
Metascore | NaN | A score from 0 to 100 | NaN
Number of votes | NaN | Positive integer | NaN
Actor Rating | NaN | NaN | 0.0-10.0
Director Rating | NaN | NaN | 0.0-10.0
Producer Rating | NaN | NaN | 0.0-10.0
Number of movies with actor | NaN | NaN | Positive integer
Number of movies made by director | NaN | NaN | Positive integer
Number of movies produced by producer | NaN | NaN | Positive integer

Table 1: Summary of the features used in previous works


B Confusion matrices

Figure 1: Decision tree, 4 classes (3.6)

Figure 2: Decision tree, 4 classes (3.5)


Figure 3: Decision tree, 5 classes (3.7)

Figure 4: Decision tree, 8 classes (3.4)


Figure 5: KNN, 6 classes (3.2)

Figure 6: Decision tree, 6 classes (3.3)


Figure 7: Coarse KNN, 17 classes (3.2.5)


C Genres

1. Action|Adventure|Fantasy

2. Animation|Adventure|Comedy

3. Action|Adventure

4. Action|Thriller

5. Action|Adventure|Sci-Fi

6. Adventure|Fantasy

7. Action|Adventure|Comedy

8. Action|Adventure|Family

9. Action|Sci-Fi

10. Action|Adventure|Western

11. Adventure|Family|Fantasy

12. Drama|Romance

13. Action|Adventure|Drama

14. Animation|Drama|Family

15. Action|Adventure|Horror

16. Action|Crime|Thriller

17. Action|Crime|Drama

18. Animation|Adventure|Family

19. Adventure

20. Animation|Action|Adventure

21. Comedy|Family|Fantasy

22. Action|Comedy|Sci-Fi

23. Adventure|Drama|Sci-Fi

24. Action|Sci-Fi|Thriller

25. Action|Adventure|History

26. Family|Fantasy|Musical

27. Action|Adventure|Crime

28. Drama|Horror|Sci-Fi

29. Animation|Comedy|Family

30. Drama|Fantasy|Romance

31. Adventure|Comedy|Family

32. Action|Adventure|Thriller

33. Adventure|Drama|Family

34. Mystery|Thriller

35. Comedy|Fantasy|Horror

36. Action|Drama|Thriller

37. Drama|Fantasy|Horror

38. Action|Comedy|Fantasy

39. Action|Drama|History

40. Action|Comedy|Crime

41. Adventure|Drama|Thriller

42. Adventure|Sci-Fi|Thriller

43. Action|Adventure|Mystery

44. Adventure|Mystery|Sci-Fi

45. Action|Drama|Mystery

46. Animation|Action|Comedy

47. Action|Adventure|Romance

48. Adventure|Drama|Fantasy

49. Action|Fantasy|Sci-Fi

50. Action|Mystery|Sci-Fi

51. Action|Family|Sci-Fi

52. Comedy|Drama|Romance

53. Action|Comedy|Romance

54. Action|Drama|Sci-Fi

55. Action|Comedy|Thriller

56. Action|Drama|War

57. Action|Mystery|Thriller


58. Crime|Thriller

59. Action|Drama|Family

60. Action|Crime|Mystery

61. Action|Drama|Fantasy

62. Action|Fantasy|Thriller

63. Adventure|Drama|Romance

64. Biography|Drama|Sport

65. Drama|History|War

66. Comedy|Crime

67. Drama|Western

68. Biography|Comedy|Crime

69. Biography|Crime|Drama

70. Crime|Drama

71. Fantasy|Horror|Mystery

72. Action|Crime|Sci-Fi

73. Action|Romance|Thriller

74. Action|Fantasy|Horror

75. Comedy|Drama|Sci-Fi

76. Action|Crime|Fantasy

77. Comedy|Romance

78. Action|Adventure|Biography

79. Adventure|Comedy|Sci-Fi

80. Comedy|Action|Sci-Fi

81. Drama|Family|Fantasy

82. Comedy

83. Action|Horror|Sci-Fi

84. Crime|Drama|Mystery

85. Action|Drama|Sport

86. Action|Comedy

87. Drama|History|Romance

88. Crime|Drama|Thriller

89. Adventure|Comedy|Drama

90. Drama|Mystery|Sci-Fi

91. Drama|History|Thriller

92. Biography|Comedy|Drama

93. Comedy|Sci-Fi|Thriller

94. Horror|Sci-Fi|Thriller

95. Drama|History|Sport

96. Action|Crime|Romance

97. Comedy|Fantasy

98. Comedy|Fantasy|Romance

99. Comedy|Drama|Family

100. Action|Comedy|Family

101. Comedy|Romance|Sci-Fi

102. Comedy|Family|Romance

103. Action|Fantasy

104. Comedy|Crime|Sport

105. Comedy|Drama|Fantasy

106. Comedy|Mystery

107. Action

108. Drama|Sci-Fi|Thriller

109. Sci-Fi|Adventure|Action

110. Comedy|Sci-Fi

111. Drama|Horror|Mystery

112. Comedy|Family|Sci-Fi

113. Sci-Fi|Thriller

114. Drama|Fantasy|Sport

115. Comedy|Drama

116. Comedy|Crime|Drama

117. Drama|Thriller


118. Drama|Romance|Sport

119. Adventure|Drama|War

120. Crime|Mystery|Thriller

121. Action|Drama|Romance

122. Adventure|Comedy|Crime

123. Comedy|Drama|Musical

124. Drama

125. Action|Horror|Thriller

126. Animation|Comedy|Fantasy

127. Action|Comedy|Sport

128. Biography|Drama|History

129. Drama|War

130. Adventure|Biography|Drama

131. Animation|Adventure|Drama

132. Drama|Fantasy|Mystery

133. Drama|Music|Musical

134. Drama|Horror|Romance

135. Action|Sci-Fi|Sport

136. Fantasy|Mystery|Romance

137. Drama|Sci-Fi

138. Biography|Drama|Thriller

139. Crime|Drama|History

140. Drama|Fantasy|Thriller

141. Drama|Sport

142. Crime|Drama|Horror

143. Drama|Mystery|Romance

144. Adventure|Biography|Crime

145. Drama|Musical|Romance

146. Comedy|Sport

147. Crime|Drama|Fantasy

148. Drama|Mystery|Thriller

149. Animation|Family|Fantasy

150. Biography|Drama

151. Comedy|Crime|Romance

152. Adventure|Comedy

153. Comedy|Family

154. Drama|Romance|Western

155. Adventure|Thriller|Western

156. Action|Biography|Drama

157. Crime|Drama|Music

158. Drama|Music

159. Action|Crime

160. Adventure|Crime|Drama

161. Comedy|Crime|Music

162. Adventure|Drama|History

163. Comedy|Family|Musical

164. Romance|Sci-Fi|Thriller

165. Mystery|Sci-Fi|Thriller

166. Drama|Horror|Musical

167. Action|Thriller|War

168. Drama|Fantasy

169. Horror|Mystery

170. Adventure|Comedy|Fantasy

171. Horror|Mystery|Thriller

172. Action|Comedy|Mystery

173. Crime|Romance|Thriller

174. Crime|Drama|Romance

175. Comedy|Drama|Thriller

176. Action|Horror

177. Comedy|Crime|Musical


178. Drama|Fantasy|Musical

179. Comedy|Drama|Music

180. Comedy|Musical

181. Drama|Music|Romance

182. Adventure|Drama|Western

183. Horror

184. Drama|Romance|Thriller

185. Fantasy|Horror

186. Comedy|Romance|Western

187. Biography|Drama|Music

188. Horror|Thriller

189. Adventure|Drama|Mystery

190. Comedy|Music|Romance

191. Biography|Drama|Romance

192. Comedy|Music

193. Comedy|Drama|Sport

194. Comedy|Crime|Thriller

195. Mystery|Romance|Thriller

196. Comedy|Western

197. Horror|Mystery|Sci-Fi

198. Comedy|Family|Music

199. Adventure|Drama

200. Drama|Family

201. Crime|Mystery|Comedy

202. Drama|History

203. Comedy|Horror|Romance

204. Action|Comedy|Horror

205. Drama|Thriller|War

206. Comedy|Romance|Sport

207. Biography|Drama|Family

208. Comedy|Romance|Drama

209. Action|Western

210. Drama|Romance|War

211. Animation|Comedy|Musical

212. Comedy|Family|Sport

213. Adventure|Comedy|Romance

214. Comedy|Crime|Mystery

215. Mystery|Romance|Sci-Fi

216. Comedy|Crime|Family

217. Drama|Family|Sport

218. Drama|Mystery

219. Fantasy|Romance

220. Drama|Horror

221. Comedy|Horror

222. Action|Biography|Crime

223. Drama|Horror|Thriller

224. Horror|Sci-Fi

225. Crime|Drama|Sport

226. Adventure|Horror|Thriller

227. Animation|Adventure|Crime

228. Crime|Horror|Mystery

229. Animation|Comedy|Drama

230. Adventure|Biography|History

231. Action|Horror|Romance

232. Adventure|Family|Romance

233. Comedy|Drama|History

234. Comedy|War

235. Action|Drama|Music

236. Fantasy|Horror|Thriller

237. Action|Drama


238. Comedy|Romance|Thriller

239. Action|Drama|Western

240. Adventure|Horror|Mystery

241. Adventure|Animation|Family

242. Adventure|Comedy|Musical

243. Action|Comedy|Music

244. Adventure|Fantasy|Romance

245. Comedy|Horror|Thriller

246. Adventure|Fantasy|Horror

247. Adventure|Comedy|Horror

248. Drama|Romance|Sci-Fi

249. Crime|Drama|War

250. Romance|Sci-Fi

251. Drama|Family|Music

252. Adventure|Crime|Mystery

253. Adventure|Drama|Horror

254. Documentary|Action|Comedy

255. Drama|Sport|Thriller

256. Comedy|Musical|Romance

257. Adventure|Family

258. Action|Crime|Horror

259. Adventure|Comedy|Mystery

260. Comedy|Action

261. Adventure|Fantasy|Mystery

262. Thriller

263. Adventure|Family|Sci-Fi

264. Sci-Fi

265. Drama|Family|Musical

266. Comedy|Thriller

267. Crime|Horror|Thriller

268. Drama|Family|Romance

269. Action|Comedy|Drama

270. Family|Music|Romance

271. Drama|Romance|Family

272. Biography|Drama|Fantasy

273. Drama|Family|History

274. Western

275. Comedy|Drama|War

276. Biography|Drama|War

277. Adventure|Mystery|Thriller

278. Horror|Mystery|Romance

279. Comedy|Mystery|Romance

280. Comedy|Mystery|Sci-Fi

281. Comedy|Crime|Horror

282. Comedy|Fantasy|Musical

283. Comedy|Drama|Mystery

284. Action|Horror|Mystery

285. Drama|Fantasy|Music

286. Family|Sci-Fi

287. Adventure|Comedy|Music

288. Action|Comedy|War

289. Fantasy|Horror|Sci-Fi

290. Drama|History|Mystery

291. Drama|Fantasy|Western

292. Documentary|Drama

293. Animation|Crime|Drama

294. Horror|Musical|Sci-Fi

295. Adventure|Biography|Comedy

296. Comedy|Mystery|Crime

297. Crime|Horror


298. Comedy|Musical|Adventure

299. Comedy|Horror|Sci-Fi

300. Musical|Romance

301. Documentary|Drama|War

302. Crime|Drama|Musical

303. Drama|Musical

304. Fantasy|Horror|Romance

305. Adventure|Horror

306. Comedy|Horror|Mystery

307. Adventure|Sci-Fi

308. Documentary|Music

309. Documentary|Crime|Drama

310. Animation|Action|Crime

311. Documentary

312. Comedy|Drama|Horror

313. Crime|Drama|Film-Noir

314. Documentary|Crime

315. Action|Sport|Thriller

316. Documentary|Comedy

317. Crime|Film-Noir|Thriller

318. Documentary|Crime|War

319. Comedy|Fantasy|Thriller

320. Documentary|Drama|Sport

321. Crime|Fantasy|Horror

322. Crime|Film-Noir|Mystery

323. Drama|Fantasy|Sci-Fi

324. Documentary|Comedy|Drama


www.kth.se