DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE
STOCKHOLM, SWEDEN 2017

Predicting movie success using machine learning techniques

POJAN SHAHRIVAR
CARL JERNBÄCKER

KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Predicting movie success using machine learning techniques
POJAN SHAHRIVAR, CARL JERNBÄCKER
Computer Engineering, Master of Science
Date: June 7, 2017
Supervisor: Iolanda Leite
Examiner: Örjan Ekeberg
Swedish title: Förutsägelse av filmers framgång med maskininlärning
School of Computer Science and Communication
Abstract
The use of machine learning to create predictive models has grown considerably in recent years. The market for movies is still big, with hundreds of new movies created every year. The purpose of this report is to investigate whether it is possible to classify movie rating and box office revenue using metadata available before release. This was done by building a classification model with metadata obtained from the internet, such as the budget and which actors are involved. Using the technique with the highest success rate, this study managed to correctly predict a movie's rating about 82% of the time. When a model failed to predict the correct rating, it was usually off by one rating group, corresponding to a deviation of approximately 17%. Predictions of gross box office revenue were correct about 15% of the time. With respect to rating prediction, the results of this report are to a certain extent consistent with previous studies of similar focus. The precision of the predictions could be further increased with a larger data set containing more features.
Sammanfattning
Området att skapa prediktiva modeller med maskininlärning har vuxit de senaste åren. Marknaden för filmer är stor, med hundratals nya filmer skapade varje år. Syftet med denna rapport är att undersöka om det är möjligt att klassificera filmbetyg och brutto biljettförsäljning med metadata som är tillgängliga före utgåvan. Detta gjordes genom att bygga en klassificeringsmodell med metadata som erhållits från internet, såsom budget och vilka skådespelare som är involverade. Denna studie lyckades korrekt förutsäga vilket betyg en film erhåller i omkring 82% av fallen med den mest framgångsrika modellen. I de utfall där modellen misslyckades med att förutsäga rätt betyg var det vanligtvis med en betygsklass, vilket motsvarar en avvikelse på ungefär 17%. När en förutsägelse av bruttoförsäljningen gjordes gav den rätt resultat i 15% av fallen. Resultaten i denna rapport är i viss utsträckning förenliga med tidigare studier med liknande metodik. Precisionen på förutsägelserna kan ökas med ett utökat dataset med fler attribut.
Contents

Contents
List of Figures
List of Tables
1 Introduction
  1.1 Problem statement
  1.2 Scope
  1.3 Purpose
  1.4 Definitions and Acronyms
2 Background
  2.1 Data
  2.2 Decision tree
  2.3 Support Vector Machine
  2.4 KNN
  2.5 Previous work
3 Method
  3.1 Data extraction
  3.2 Preprocessing of data
    3.2.1 Title length
    3.2.2 Year/Month/Day/Weekday
    3.2.3 Genre
    3.2.4 Actor/Director/Writer
    3.2.5 Rating value
    3.2.6 Box office revenue
  3.3 Feature selection
  3.4 Classification
4 Results
  4.1 Gross box office revenue
    4.1.1 Decision tree
    4.1.2 SVM
    4.1.3 KNN
  4.2 Rating
5 Discussion
  5.1 Features
  5.2 Data set
  5.3 Method
6 Conclusion
Bibliography
A Features in previous works
B Confusion matrices
C Genres
List of Figures

1.1 Visual representation of 5-fold cross validation
2.1 Average ratings and rating distribution (Source: IMDb, user ratings for Beauty and the Beast)
2.2 Simplified example of how a decision tree would be evaluated
2.3 Simplified example of what an SVM would look like
2.4 Simplified example of what a KNN would look like
3.1 Visual representation of the steps involved in the study
3.2 The number of movies with each rating (from 0.0 to 10.0)
3.3 Scatter of average rating of director, writer and actors 1-3
3.4 Number of ratings versus the final rating
3.5 Importance of attributes (predictors) using the ReliefF algorithm, in descending order of prediction power
4.1 Box office results presented in a diagram
4.2 ROC curve for linear SVM
1 Decision tree, 4 classes (3.6)
2 Decision tree, 4 classes (3.5)
3 Decision tree, 5 classes (3.7)
4 Decision tree, 8 classes (3.4)
5 KNN, 6 classes (3.2)
6 Decision tree, 6 classes (3.3)
7 Coarse KNN, 17 classes (3.2.5)
List of Tables

2.1 Features used in Predicting the Future With Social Media
2.2 Summary of results by previous work
3.1 Extracted data for each movie
3.2 Rating categories with boundaries structured so that the distribution of movies is symmetrical around the median of the data set
3.3 Rating categories divided into 6 subgroups of equal length over the range of the data set
3.4 Rating categories divided into 8 subgroups by rounding each value down to the nearest whole number
3.5 Rating categories divided into 4 subgroups after cutting off the head (0-3.5) and the tail (9.5-10.0) of the data set
3.6 Rating categories divided into 4 groups of equal length
3.7 Rating categories divided into 5 groups of equal length
3.8 Rating categories divided into 17 groups by rounding each value to the nearest half integer
3.9 Final features that were used
4.1 Classification rate (%) with different response classes (see tables 3.2-3.8)
1 Summary of previous works' features
Chapter 1
Introduction
The worldwide box office revenue in 2016 was 38.3 billion USD [1], with hundreds of new movies made each year. A tool that could predict how successful a movie will be, even before it is released, would be powerful. An aspiring Hollywood director, or a movie studio with some technical skills, could predict whether a movie idea is going to be a safe investment. With the vast amount of data published on the Internet and the increasing power of the modern computer, it may be possible to take advantage of these resources to make such predictions. This led to the following questions:
• Is it possible to use machine learning algorithms to predict whether a movie will receive a high or low rating?
• Is the available data online viable for use in predicting movie success?
• Is there a correlation between attributes and success of a movie?
The study can be used as a proof of concept for applications in other areas, and should highlight some of the challenges one needs to overcome to successfully create a prediction model. The idea could in theory be extended to predict credit ratings, the stock market or the housing market; the only requirement is a vast and reliable data source.
Combining the questions mentioned above, and using rating and sales as the measure of how good a movie is, the following problem statement was produced.
1.1 Problem statement
This thesis will investigate: Is it possible to classify the rating and box office revenue of a movie using metadata freely available on the web?
1.2 Scope
The focus of this thesis is to formulate a method for preprocessing the data set and evaluating which attributes are the most useful, by evaluating the correlation between the attributes and the success rate of the machine learning algorithms. The method is then judged by whether it achieves a viable success rate when predicting the rating and box office revenue.
The data set is obtained from IMDb using a web scraper of our own creation. The data set is limited to 10,000 unique movies to keep the workflow manageable when processing the data. The machine learning algorithms used are those available in the Classification Learner module in MATLAB, mainly SVM, decision tree and KNN [2].
1.3 Purpose
Previous studies show that it is possible to predict the success of a movie using attributes such as budget, rating, actors and director. The goal of this thesis is therefore to further examine this possibility, using a larger data set with features not previously used with machine learning.
1.4 Definitions and Acronyms
Box office revenue
Movies generate income from many revenue streams; movie rentals and purchases, theatre ticket/box office sales, merchandising and home video are a few. To be able to compare movies, one revenue stream was selected: the income generated by theatre ticket sales, or box office sales. The figures used in this study are taken from Box Office Mojo, the leading online box-office reporting service, owned and operated by IMDb.
Confusion matrix
The confusion matrix visualises the performance of a classification model. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
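As an illustration (a Python sketch, not the MATLAB tooling used in the study), a confusion matrix can be built by counting (actual, predicted) label pairs:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

actual    = ["low", "low", "high", "high", "high"]
predicted = ["low", "high", "high", "high", "low"]
print(confusion_matrix(actual, predicted, ["low", "high"]))
# [[1, 1], [1, 2]]
```

A perfect classifier would put all counts on the diagonal.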
IMDb
Internet Movie Database, abbreviated IMDb, is an online database of movies, cast, directors, writers, fictional characters, production crews, plot summaries, trivia and more. IMDb is the most comprehensive online database of its kind and has over 70 million registered users.
MATLAB
MATLAB is a platform made for solving scientific and engineering problems. The platform has a library of prebuilt toolboxes that can be used to solve different problems; in this study, the Statistics and Machine Learning Toolbox is used.
Web scraper
A web scraper is an automated program that extracts data from websites into a local database or spreadsheet.
ROC curve
A receiver operating characteristic (ROC) curve is a plot that illustrates the performance of a classification model by plotting the true positive rate against the false positive rate [3]. A ROC curve is a common way of representing the results of a binary classifier. The area under the curve (AUC) summarises how good a classifier is: a classifier with an AUC of 1 is a perfect classifier, while a classifier with an AUC of 0.5 is no better than random guessing.
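Given sampled ROC points, the AUC can be approximated with the trapezoidal rule; a minimal sketch (the point values below are invented for illustration):

```python
def auc(roc_points):
    """roc_points: (fpr, tpr) pairs sorted by increasing fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area

# A perfect classifier's ROC goes straight up, then across: AUC = 1.0
print(auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
# The diagonal (random guessing) gives AUC = 0.5
print(auc([(0.0, 0.0), (1.0, 1.0)]))  # 0.5
```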
Cross validation
In cross validation, the original data set is partitioned into random subsets of equal size [4]. One subset is used as validation data and the remaining subsets are used as training data. This procedure is repeated as many times as desired, and the results are averaged and presented as the validated result. See figure (1.1) for a visual representation. This validation technique was applied, with 5 folds, to all results mentioned below.
Figure 1.1: Visual representation of 5 fold cross validation
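The partitioning step can be sketched as follows (a toy illustration, not the MATLAB implementation used in the study):

```python
import random

def k_fold_splits(indices, k, seed=0):
    """Shuffle the indices and yield (train, validation) index lists, one per fold."""
    indices = list(indices)
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# 10 samples, 5 folds: each sample is used for validation exactly once
splits = list(k_fold_splits(range(10), k=5))
print(len(splits))  # 5
print(sorted(i for _, val in splits for i in val))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```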
Chapter 2
Background
This chapter is meant to give a deeper understanding of the different parts of the study. It covers the data set used and the different methods of evaluating the data. Some previous work in the area of predictive models for movie success is also mentioned, because of the methods used there.
2.1 Data
All the data used is retrieved from IMDb. IMDb is an online movie database that was launched on October 17, 1990 and is now under the ownership of Amazon. The data that resides on IMDb is submitted by professionals or registered users and is moderated before going live.
IMDb lets registered users cast their vote on a scale from 1.0 to 10.0. IMDb displays weighted vote averages rather than raw averages. Various filters are applied to the raw voting data to counter ballot-stuffing by organisations and users who wish to change the rating of a movie. See figure (2.1) for how IMDb presents the rating of a movie (in this case Beauty and the Beast); the value used in this report is the weighted mean value of the IMDb users' votes. Using a weighted mean makes the rating more robust as an estimator for distributions that are not normal. This ensures that the final rating is representative of the general voting population and not influenced by individuals or organisations who want to sway the rating in a specific direction.

Figure 2.1: Average ratings and rating distribution (Source: IMDb, user ratings for Beauty and the Beast)
The exact method of calculating the weighted mean rating is not disclosed [5]; however, the method is used across the entire database without exception, which means there is no bias toward particular movies.
The rating used in this thesis is the displayed rating, not the raw average rating.
2.2 Decision tree
Decision trees are basically a series of questions with binary answers. Each answer leads to a new yes-or-no question, until finally a prediction can be made. See figure 2.2 for an example.
Figure 2.2: Simplified example of how a decision tree would be evaluated
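As a toy sketch of the idea, a tiny decision tree can be written as nested yes/no questions; the features and thresholds here are invented for illustration, not taken from the trained models in this study:

```python
def predict_rating_class(movie):
    """A hand-written two-level decision tree: each branch is a binary question."""
    if movie["metascore"] >= 70:             # question 1
        if movie["director_rating"] >= 7.0:  # question 2 (yes branch)
            return "high"
        return "medium"
    else:
        if movie["budget_musd"] >= 100:      # question 2 (no branch)
            return "medium"
        return "low"

print(predict_rating_class({"metascore": 81, "director_rating": 7.4, "budget_musd": 50}))  # high
print(predict_rating_class({"metascore": 40, "director_rating": 6.0, "budget_musd": 20}))  # low
```

A learned tree has the same shape; the training algorithm chooses which feature and threshold to split on at each node.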
2.3 Support Vector Machine
A support vector machine (SVM) is a supervised machine learning model used for classification. SVMs work by maximising the margin around the separating hyperplane. In a linear SVM in two dimensions, the classes can be split by a line; see figure (2.3) for an example of what the model could look like. For example, the red values could be answer A and the blue values answer B. If a new value is introduced to the system and positioned on the red side, the model predicts the new value to be answer A. If more answers are possible, hyperplanes are created to split all the answers into different regions.
Figure 2.3: Simplified example of what an SVM would look like
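Once trained, classifying a new point with a linear model reduces to checking which side of the hyperplane it falls on; a minimal sketch with an invented weight vector and bias:

```python
def classify(point, weights, bias):
    """Return 'A' if the point lies on the non-negative side of the hyperplane w·x + b = 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "A" if score >= 0 else "B"

w, b = [1.0, -1.0], 0.0  # hyperplane x - y = 0, invented for illustration
print(classify([3.0, 1.0], w, b))  # A  (3 - 1 = 2 >= 0)
print(classify([1.0, 3.0], w, b))  # B  (1 - 3 = -2 < 0)
```

Training an SVM amounts to choosing `w` and `b` so that the margin between the two classes is as wide as possible.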
2.4 KNN
K nearest neighbours (KNN) is a simple and easy-to-use machine learning algorithm. The results are easy to interpret, and it is computationally light, so the calculation time is a fraction of that of other algorithms. This comes at the cost of predictive power. KNN works by finding the k nearest neighbours of a given point; the class with the most neighbours among them is the prediction. See figure (2.4) for an example. If the blue diamond is a new value to be predicted, the algorithm finds that red squares make up most of the nearest neighbours, and therefore the model predicts that the diamond is a square.
Figure 2.4: Simplified example of what a KNN would look like
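A minimal KNN classifier, written from scratch for illustration (the study itself used MATLAB's prebuilt KNN variants):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label). Predict by majority vote among the k closest points."""
    by_distance = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "square"), ((1, 0), "square"), ((0, 1), "square"),
         ((5, 5), "circle"), ((6, 5), "circle")]
print(knn_predict(train, (0.5, 0.5), k=3))  # square
```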
2.5 Previous work
Previous work has been done in the area of movie revenue prediction. Nikhil Apte, Mats Forssell and Anahita Sidhwa [6] used a couple of different machine learning techniques on a data set which they retrieved from IMDb. They set up some restrictions, such as that the box office revenue had to be greater than $1,000,000 after taking inflation into account, and that the movie was made after January 1, 1990. The final data set contained 2510 movies, and they achieved a median error of around 35 percent.
In August 2016 a paper was published on the same subjects mentioned in this report, machine learning and rating prediction. Muhammad Hassan Latif and Hammad Afzal [7] wrote the paper for IJCNS; it is about using machine learning to predict the rating of a movie. During their preprocessing of the data, they categorised the movies' budgets into a more easily managed value on a scale from 1 to 9. They also categorised the rating of a movie into 4 different categories: Terrible, Poor, Average and Excellent. With 2000 data points, this method achieved around an 80% success rate in predicting the rating of a movie.
There is another method of making predictions, where social media interaction is measured and analysed; in this case the volume of tweets and their content is used as the data set. Predicting the Future With Social Media, a study conducted by Sitaram Asur and Bernardo A. Huberman [8], shows that it is possible to use social media to predict the box office revenue of a movie. That study used Twitter as the data source. A data set of 2.89 million tweets from 1.2 million users was used to create a linear regression model, which obtained an accuracy of 98%. See table (2.1) for the features used. They also used metadata from the tweets, such as timestamps, to create new features like tweet rate.
Features   Value
Urls       percentage of urls during the critical period
Retweets   percentage of retweets during the critical period
Table 2.1: Features used in Predicting the Future With Social Media
Karl Persson's [9] study at the University of Skövde aimed to compare the predictive performance of random forests with that of support vector machines. He managed a success rate of 84% when using random forests, and a success rate of 86% when using support vector machines. To validate his results he used 10-fold cross validation. See appendix A for a summary of all features used in previous studies.
Study | Validation | Prediction | Success rate
Predicting Movie Box Office Gross | 20% withheld from data set | Movie revenue | 65%
Prediction of Movies popularity Using Machine Learning Techniques | 10-fold cross validation | Movie rating | 80%
Predicting movie ratings: A comparative study on random forests and support vector machines | 10-fold cross validation | Movie rating | 83%
Predicting the Future With Social Media | Cross validation | Box office revenue | 98%
Table 2.2: Summary of results by previous work
Chapter 3
Method
Figure 3.1: Visual representation of the steps involved in the study
The methodology is divided into 5 steps, the first being data selection and extraction, see section (3.1). The data is then preprocessed, as described in section (3.2). The preprocessed data set is used with different algorithms to determine and select which features (3.3) to use in the final data set for the classification model (3.4).
3.1 Data extraction
IMDb is used as the source of the data in this study. IMDb does not have an open API, so to be able to extract the relevant data, a web scraper [10] was made. The scraper is coded in Node.js and sequentially downloads the information for each movie into a database. The movies are sorted in descending order by number of ratings, and the data set only contains movies that originate from the US, as movies from other countries have their budgets represented in their native currencies. Multiple currencies would require further preprocessing to normalise the data, which was avoided in this way.
Each entry in the data set contains: title, runtime, release date, genre, IMDb rating, number of ratings, metascore, number of Oscars, budget (USD), gross (USD), director (average rating of movies and number of Oscars won), writers (same as director), and top billed cast (same as director). The currencies are not adjusted for inflation on IMDb. Average ratings are based on the latest 10 movies that the director, writer or actor has been credited in.
Value                Type                 From   To
Title                Title
Run time             Int                  68     271
Year                 Int                  1918   2017
Month                Int                  1      12
Day                  Int                  1      31
Weekday              Int                  1      7
Genre                Genres (appendix B)  1      324
Rating value         Float                1.6    9.6
Rating count         Int                  9999   1793960
Meta score           Int                  1      100
Oscar                Int                  0      11
Budget               Int                  0      384
Gross revenue        Int                  1      1301
Director             Name
Director Oscars      Int                  0      6
Director Rating      Float                5.0    8.0
Writer               Name
Writer Oscars        Int                  0      6
Writer Rating        Float                4.0    8.0
First Actor          Name
First Actor Oscars   Int                  0      4
First Actor Rating   Float                5.0    8.0
Second Actor         Name
Second Actor Oscars  Int                  0      4
Second Actor Rating  Float                5.0    8.0
Third Actor          Name
Third Actor Oscars   Int                  0      4
Third Actor Rating   Float                4.0    8.0
Table 3.1: Extracted data for each movie
3.2 Preprocessing of data
Duplicate values are removed, and cells without a value are interpreted as NaN instead of removing the entire row. Movies with fewer than 10000 ratings were taken out of the data set, as these movies tended to be quite volatile and not representative of the majority of viewers; movies with very few ratings also tend to have inflated ratings. Movies that made less than 1 million USD in gross box office revenue, after taking inflation into account, were also removed from the data set.
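These filtering rules can be sketched as a simple pass over the rows (the field names are invented for illustration):

```python
def keep(movie, min_ratings=10_000, min_gross_usd=1_000_000):
    """Apply the preprocessing filters: enough ratings and enough inflation-adjusted gross."""
    return (movie["rating_count"] >= min_ratings
            and movie["gross_adjusted_usd"] >= min_gross_usd)

movies = [
    {"title": "A", "rating_count": 25_000, "gross_adjusted_usd": 40_000_000},
    {"title": "B", "rating_count": 3_000,  "gross_adjusted_usd": 40_000_000},  # too few ratings
    {"title": "C", "rating_count": 25_000, "gross_adjusted_usd": 500_000},     # too little gross
]
print([m["title"] for m in movies if keep(m)])  # ['A']
```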
Figure 3.2: The number of movies with each rating (from 0.0 to 10.0).
Min: 3.475, Max: 9.469, Mean: 6.472, Range: 5.994
Although the rating scale is 0.0-10.0, the ratings tend to be concentrated around the mean. The range of the ratings indicates that the rating scale can be trimmed to a much smaller scale that is more representative of the actual movie quality.
Figure 3.3: Scatter of average rating of director, writer and actor 1-3
In figure (3.3) we can see a scatter plot of the average rating of the director, writer and star actors. The figure illustrates the correspondence between cast rating and movie rating.
Figure 3.4: Number of ratings versus the final rating
3.2.1 Title length
Title length is the length of the English title of a given movie; the shortest title was 1 character long, and the longest was 66 characters long.
3.2.2 Year/Month/Day/Weekday
Year, month and day give the date of the first screening of a movie; weekday is the day of the week of the first screening, Monday through Sunday.
3.2.3 Genre
Genre is the given combination of genres of a movie; for example, Comedy is different from Action|Comedy. A complete list of all genre combinations can be found in the appendix under Genres.
3.2.4 Actor/Director/Writer
Each actor who played in a movie received a unique id. For the primary actor of a movie the unique ids ranged from 1 to 1180, for the secondary actor from 1 to 3157 and for the third actor from 1 to 4890. The same procedure was used for directors (1 to 1445) and writers (1 to 3411).
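Assigning such ids amounts to building a lookup table in order of first appearance; a sketch of the idea, not the scraper's actual code:

```python
def assign_ids(names):
    """Map each distinct name to a unique integer id, starting at 1."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids

print(assign_ids(["Tom", "Anna", "Tom", "Lee"]))  # {'Tom': 1, 'Anna': 2, 'Lee': 3}
```

The resulting integers carry no ordering information; they simply turn categorical names into values the classifiers can consume.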
3.2.5 Rating value
Ratings were categorised into 7 different rating scales, see tables 3.2 to 3.8; the number of prediction outcomes was thus reduced from 101 to a value between 4 and 17. While this reduces the granularity of the prediction, it decreases the misclassification error significantly. It also reflects the range of the data, see figure (3.2).
Category  From  To
1         0     3.6
2         3.7   5.0
3         5.1   6.5
4         6.6   7.8
5         7.9   8.4
6         8.5   10.0

Table 3.2: Rating categories with boundaries structured so that the distribution of movies is symmetrical in each category around the median of the data set. See figure (3.2) for the distribution of all the ratings.
Category  From  To
1         0     4.4
2         4.4   5.4
3         5.4   6.4
4         6.4   7.4
5         7.4   8.4
6         8.4   10.0

Table 3.3: Rating categories divided into 6 subgroups of equal length over the range of the data set
Category  From  To
1         1     2
2         2     3
3         3     4
4         4     5
5         5     6
6         6     7
7         7     8
8         8     9

Table 3.4: Rating categories divided into 8 subgroups; each value is rounded down to the nearest whole number, which resulted in these 8 groups
Category  From  To
1         3.5   5.0
2         5.0   6.5
3         6.5   8.0
4         8.0   9.5

Table 3.5: Rating categories divided into 4 subgroups; the head (0-3.5) of the data set is cut off as well as the tail (9.5-10.0). The body is then divided into groups of equal length
Category  From  To
1         0     2.5
2         2.5   5.0
3         5.0   7.5
4         7.5   10.0

Table 3.6: Rating categories divided into 4 groups of equal length
Category  From  To
1         0     2
2         2     4
3         4     6
4         6     8
5         8     10

Table 3.7: Rating categories divided into 5 groups of equal length
Category  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17
From      1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5
To        2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5  10.0

Table 3.8: Rating categories divided into 17 groups; each value is rounded to the nearest half integer
3.2.6 Box office revenue
When predicting the gross box office revenue, the same method was used on the scraped value for the gross box office revenue, see table (3.1). The value was rounded off to the nearest 10 million after taking inflation into account [6]. This was done to reduce the number of possible outcomes from millions to a little more than a hundred. The formula used to account for inflation was:
X_pres = X_then · (1 + AvgInf)^Δy    (3.1)
The AvgInf (average inflation) was assumed to be 2.5%, and Δy is the number of years from release until the present day.
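Formula (3.1), together with the rounding step, can be sketched as:

```python
def inflation_adjust(gross_then, years_since_release, avg_inflation=0.025):
    """Adjust a nominal gross to present value (formula 3.1), then round to the nearest 10 million."""
    present = gross_then * (1 + avg_inflation) ** years_since_release
    return round(present / 10_000_000) * 10_000_000

# $100M earned 20 years ago: 100e6 * 1.025**20 ≈ $163.9M, rounded to $160M
print(inflation_adjust(100_000_000, 20))  # 160000000
```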
3.3 Feature selection
The data set includes features that are categorical, such as genre, which means that numerical transformations are inappropriate; thus feature selection is the main technique for reducing the number of dimensions. Feature selection reduces the number of predictor variables by selecting a subset of them to create a prediction model. This can be done manually or with a search algorithm. In this study a search algorithm was used to reduce the workload. The algorithm in question is sequential feature selection, which has two parts:
• A criterion which the algorithm seeks to minimise over the feature subspace; in our case this is the misclassification rate.
• A search algorithm which explores the feature subspace in search of the optimal feature selection. The search algorithm has two variants: sequential forward selection (SFS), which adds features to an empty candidate set, and sequential backward selection (SBS), which removes features from a full candidate set.
The Statistics and Machine Learning Toolbox in MATLAB has a built-in function for this, which was used with the SFS option. The data set contained 29 features to begin with, and the SFS algorithm selected 10 features for the candidate set: runtime, rating count, genre, metascore, budget, actor 1-3 ratings, and director and writer ratings. All the different combinations of genre are listed in the appendix under section Genres.
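A greedy SFS loop can be sketched in pure Python with a generic scoring callback; this is illustrative only (the study used MATLAB's built-in function, and the toy criterion below is made up):

```python
def sfs(features, misclassification_rate):
    """Greedily add the feature that lowers the criterion most, until no feature helps."""
    selected, best = [], misclassification_rate([])
    while True:
        candidates = [(misclassification_rate(selected + [f]), f)
                      for f in features if f not in selected]
        if not candidates:
            break
        score, feature = min(candidates)
        if score >= best:
            break  # no remaining feature improves the criterion
        selected, best = selected + [feature], score
    return selected

# Toy criterion: pretend only 'budget' and 'genre' reduce the error
useful = {"budget": 0.3, "genre": 0.2}
rate = lambda subset: 1.0 - sum(useful.get(f, 0.0) for f in subset)
print(sfs(["title", "budget", "genre"], rate))  # ['budget', 'genre']
```

In practice the criterion is the cross-validated misclassification rate of a fast classifier trained on the candidate subset.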
Feature          Value             Range
Runtime          Positive integer  68-271
Genre            Positive integer  1-324
Rating count     Positive integer  9999-1793960
Metascore        Positive integer  1-100
Budget           Positive integer  1-38
Actor 1 rating   Float             5.0-8.0
Actor 2 rating   Float             5.0-8.0
Actor 3 rating   Float             4.0-8.0
Director rating  Float             5.0-8.0
Writer rating    Float             4.0-8.0

Table 3.9: Final features that were used
Figure 3.5: Importance of attributes (predictors) using the ReliefF algorithm. Attributes in descending order of prediction power.
3.4 Classification
The three algorithms used in this study have many variants, all implemented in MATLAB:
• Decision trees: Deep tree, medium tree, and shallow tree
• Support vector machines: Linear SVM, fine Gaussian SVM, medium Gaussian SVM, coarse Gaussian SVM, quadratic SVM, and cubic SVM

• Nearest neighbor classifiers: Fine KNN, medium KNN, coarse KNN, cosine KNN, cubic KNN, and weighted KNN
Decision trees differ in complexity based on the number of splits (depth) of the tree. A deep tree has ≤ 100 splits, a medium tree ≤ 20 splits and a shallow tree < 4 splits. The simple and fast classifiers (shallow tree, linear SVM and coarse KNN) are used for feature selection, while the more accurate but slower-to-train classifiers (fine KNN, cubic SVM and deep tree) are used at the end to build the final classification models.
Chapter 4
Results
In the following chapter all results of the study are presented. All values are calculated using decision tree, SVM and KNN models and validated using 5-fold cross validation.
4.1 Gross box office revenue
Figure 4.1: Box office results presented in a diagram
Two different models were created for this section. The first one tried to predict the gross box office revenue to the nearest ten million. The second one tried to predict whether the movie would make money, not by a specific amount, just whether the budget was smaller than the revenue from ticket sales.
4.1.1 Decision tree
Using a medium decision tree, a success rate of 15.0% was achieved. All features listed in section 3.3 were used to predict the grouped gross box office revenue. When trying to predict whether a movie would make money, the success rate instead increased to 67.4%.
4.1.2 SVM
A linear SVM was used; it produced a success rate of 11.0%. For the second model, the linear SVM had the highest success rate, 68.8%, when trying to predict whether a movie would make money. See figure (4.2) for the ROC curve.
Figure 4.2: ROC curve for linear SVM
4.1.3 KNN
Using a coarse KNN algorithm, a success rate of 12.2% was achieved. The same features used in the decision tree were also used in the KNN. The second model, which predicted whether a movie would make money, had a success rate of 66.8% with a coarse KNN.
4.2 Rating
Response classes  Medium Tree  Linear SVM  Coarse KNN
4 classes (3.5)   68.3         60.2        68.3
4 classes (3.6)   83.0         68.3        82.0
5 classes (3.7)   76.0         64.8        75.5
6 classes (3.2)   53.7         52.4        54.5
6 classes (3.3)   54.0         52.1        54.0
8 classes (3.4)   53.1         47.5        52.6
17 classes (3.8)  32.2         31.8        32.6
Table 4.1: Classification rate (%) with different response classes (see table 3.2-3.8)
In almost all cases, a medium decision tree (20 splits) provided the best results. Classification accuracy decreases as the number of classes increases. The highest accuracy is achieved with 4 response classes and a medium tree. This is due to the high concentration of entries in the third class. See appendix B for a visual representation of the performance of the best performing classifier for each response class.
Chapter 5
Discussion
In this chapter a discussion is presented with respect to the features, the data set and the method.
5.1 Features
The results of this study are arguably on par with the other studies mentioned when predicting the rating. Since the aim of this study was to take the techniques used by previous studies and improve them by adding more features and a larger amount of data, it is hard to claim that this was a success, in part because our rate of successfully predicting the rating was equal to or lower than that of all the studies used as the theoretical basis for this study. Compared to the paper by Muhammad Hassan Latif and Hammad Afzal [7], who achieved a success rate of around 80% when predicting the rating of a movie, there were some slight differences. For instance, we had several different groupings of various sizes, for example a group with outcomes of varying spans split symmetrically around the median of the data set, whereas Latif and Afzal split the rating into 4 outcomes of equal spans and split the budget into 9 different categories, where we instead rounded it to the nearest 10 million. In this case no apparent improvement was made in the predictability of the model. By splitting the ratings into groups like Latif and Afzal, we managed to increase the success rate slightly, from their 80% to 83% (see table 3.6), using our method. If one compares the groups presented in table 3.2 and table 3.3, there is a slight increase in predictability, which agrees with our preliminary test of splitting the rating into groups of varying span around the median. This showed an increase in predictability on our data set, so one can argue that further improvement to the method is achievable.
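The two binning strategies discussed above can be sketched concretely. Below, synthetic ratings are binned both into fixed equal-span intervals (in the style of Latif and Afzal; the exact cut points are an illustration) and into varying-span intervals centred on the median:

```python
import numpy as np
import pandas as pd

# Synthetic ratings clustered around the middle, like the IMDb data.
rng = np.random.default_rng(0)
ratings = pd.Series(np.clip(rng.normal(6.5, 1.0, 1000), 0, 10))

# Equal-span bins: four fixed intervals over 0-10.
equal_span = pd.cut(ratings, bins=[0, 2.5, 5.0, 7.5, 10.0], include_lowest=True)

# Median-centred bins of varying span: narrow classes around the median,
# wide classes at the sparse tails.
m = ratings.median()
varying = pd.cut(ratings, bins=[0, m - 1, m, m + 1, 10.0], include_lowest=True)

print(equal_span.value_counts().sort_index())
print(varying.value_counts().sort_index())
```

The median-centred scheme spreads the dense middle of the distribution over more classes instead of piling most entries into one interval.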
This project could only obtain some of the features presented on IMDb. Instead of summing up all the awards for a movie, the scraper used could only get the number of Oscars. This left out many other awards, ranging from Golden Globes to small film festivals. The decision was made to use the number of Oscars on account of it being the most prestigious award. This ended up being a problem instead, because only a handful of movies have ever received an Oscar, so most of the data regarding the number of Oscars a movie had accumulated ended up being useless. A similar problem with plot keywords was discovered too late: it was difficult to get the scraper to store all the plot keywords in a correct manner, so this feature was also removed from the final data set. If these two features could be added to the data set, they would probably increase the predictability even further, on the basis of previous work done in the area. Future work might benefit from looking beyond IMDb as the data provider and creating an even bigger data set. Compared to [8], Twitter looks to be a useful provider of metadata for making accurate predictions.
5.2 Data set
For a number of reasons, the final data set was not as large as intended. The primary reason was that the data set contained a large number of duplicate movies. This was because the scraper did not obtain the entire data set in one session, and during the time between sessions the data changed, so the software interpreted the same title as a new movie. The scraper was told to scrape from the highest rated movies to the lowest rated movies on IMDb and to stop upon reaching 10 000 movies. The duplicates were not discovered until it was too late for changes to be made, so the data set ended up neither as large nor as diverse as planned from the start. Because scraping started from the top of the rating list, the data set was dominated by highly rated movies (3.4), and there were very few movies with a rating below 5.0 or above 8.0, which was reflected in a high misclassification rate.
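The duplicate problem described above has a straightforward fix once the identifying fields are known. A sketch with hypothetical scrape output (the column names are assumptions, not the scraper's actual schema):

```python
import pandas as pd

# Hypothetical scrape output containing a re-scraped duplicate title.
scraped = pd.DataFrame({
    "title": ["Inception", "Heat", "Inception"],
    "year": [2010, 1995, 2010],
    "rating": [8.8, 8.3, 8.8],
})

# Deduplicate on the fields that identify a movie, keeping the first scrape.
clean = scraped.drop_duplicates(subset=["title", "year"], keep="first")
print(len(scraped), "->", len(clean))
```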
A better method of extracting raw data was discovered late in the project. IMDb releases its database as raw files that can be processed into a MySQL database. It should be mentioned that the database contains more than 30 tables, and writing queries that cover all the features used in this study is no simple task.
5.3 Method
There was a small uncertainty in the validation method: because the cross-validation randomised the partitioning, there was a small difference in the output values between runs (around 1 to 2 percentage points). Though this was a small ambiguity, it by itself introduced some uncertainty into the results.
When looking at the confusion matrices for predicting the rating, one can see that the main confusion is around the middle, where the majority of the data is aggregated. Our model had a tendency to over-predict the rating of a movie; with a deeper analysis of why this is the case, the success rate might be greatly increased.
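The over-prediction tendency can be quantified directly from a confusion matrix: with classes ordered from lowest to highest rating, mass above the diagonal corresponds to predictions higher than the true class. A sketch on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted rating classes (0 = lowest group).
y_true = np.array([1, 1, 2, 2, 2, 3, 3, 0, 2, 1])
y_pred = np.array([1, 2, 2, 3, 2, 3, 3, 1, 2, 2])

cm = confusion_matrix(y_true, y_pred)

# Above-diagonal mass = over-predictions, below-diagonal = under-predictions.
over = np.triu(cm, k=1).sum()
under = np.tril(cm, k=-1).sum()
print(cm)
print("over-predicted:", over, "under-predicted:", under)
```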
The preprocessing of genres could also have been done differently. In our study, each combination of genres was considered a unique genre, which resulted in a very large spread (see appendix C for the complete list). This made genre a weaker classifier than it was predicted to be.
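An alternative to treating each pipe-separated combination as its own category is a multi-hot encoding with one binary column per individual genre, which collapses the hundreds of combinations in appendix C onto roughly twenty columns. A sketch on a few of the listed combinations:

```python
import pandas as pd

# A few genre strings in the same format as the scraped IMDb data.
genres = pd.Series([
    "Action|Adventure|Fantasy",
    "Animation|Adventure|Comedy",
    "Action|Thriller",
])

# Treating each combination as one category gives 3 distinct values here,
# but splitting on '|' yields one binary column per individual genre.
multi_hot = genres.str.get_dummies(sep="|")
print(multi_hot)
```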
Chapter 6
Conclusion
The intent of this study, to create a way of predicting the success of movies using historical data, was not a direct success; it shows that other features are necessary to make an accurate prediction. When trying to predict the rating of a movie, a highest success rate of around 55% was achieved when using 6 groups, and 83% when using 4 groups. This could arguably be a good enough result and shows promise, but compared to previous studies with similar methodologies, the results are merely equal in predictability.
When trying to predict the box office revenue of a movie, a highest success rate of 15% was achieved. This result shows that the methods used in this study are difficult to use for an accurate prediction; compared to previous studies, it shows that our way of preprocessing the data is not the best one. When giving the model the binary decision "will the revenue be greater than the budget", a success rate in the high 60 percent range was achieved. One can therefore draw the following conclusion: using historical data to create a model that predicts the rating of a movie shows promise and deserves to be properly evaluated, but using the same historical data to try to predict the box office revenue shows no promise with our method. Previous studies have shown that accurate predictions are possible, but the methods used in this study did not improve upon them.
Bibliography
[1] Statista Inc. Global box office revenue from 2016 to 2020, 2016. URL https://www.statista.com/statistics/259987/global-box-office-revenue/.
[2] MathWorks. Classification learner, 2017. URL https://se.mathworks.com/help/stats/classificationlearner-app.html.
[3] ROC curves and area under the curve explained, 2017. URL http://www.dataschool.io/roc-curves-and-auc-explained/.
[4] Jeff Schneider. Cross validation, 1997. URL http://www.cs.cmu.edu/~schneide/tut5/node42.html.
[5] IMDb weighted mean rating, 2017. URL http://www.imdb.com/help/show_leaf?votes.
[6] Nikhil Apte, Mats Forssell, and Anahita Sidhwa. Predicting movie revenue. CS229,Stanford University, 2011.
[7] Muhammad Hassan Latif and Hammad Afzal. Prediction of movies popularity using machine learning techniques, 2016. URL http://paper.ijcsns.org/07_book/201608/20160820.pdf.
[8] Sitaram Asur and Bernardo A Huberman. Predicting the future with social media.In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM Inter-national Conference on, volume 1, pages 492–499. IEEE, 2010.
[9] Karl Persson. Predicting movie ratings : A comparative study on random forestsand support vector machines, 2015.
[10] Pojan Shahrivar. Web scraper source code, June 7, 2017. URL https://gits-15.sys.kth.se/pojans/kth.kex.scraper.
A Features in previous works
Studies compared: (1) Predicting Movie Box Office Gross; (2) Prediction of Movies popularity Using Machine Learning Techniques; (3) Predicting movie ratings: A comparative study on random forests and support vector machines.

Feature                               | (1)                                 | (2)                                                                      | (3)
Box Office                            | Into buckets of a logarithmic scale | NaN                                                                      | NaN
Actor                                 | By name                             | NaN                                                                      | NaN
Director                              | By name                             | NaN                                                                      | NaN
Release date                          | Date                                | NaN                                                                      | NaN
Length of movie                       | Int                                 | NaN                                                                      | NaN
MPAA Rating                           | G/PG/PG13/R/NR                      | G/PG/PG13/R/NR                                                           | G/PG/PG13/R/NR
Genre                                 | NaN                                 | 20 different numbers, one for each genre                                 | 20 different genres
Rating                                | NaN                                 | 4 groups: 0.0-2.5-5.0-7.7-10.0                                           | 0.0-10.0
Budget                                | NaN                                 | 9 different groups                                                       | Positive integer
Awards                                | NaN                                 | 4 values: Oscar won, Oscar nominee, Golden Globe won, Golden Globe nominee | NaN
Screens                               | NaN                                 | Number of screens on opening weekend                                     | NaN
Opening weekend                       | NaN                                 | 9 different groups                                                       | NaN
Metascore                             | NaN                                 | A score from 0 to 100                                                    | NaN
Number of votes                       | NaN                                 | Positive integer                                                         | NaN
Actor rating                          | NaN                                 | NaN                                                                      | 0.0-10.0
Director rating                       | NaN                                 | NaN                                                                      | 0.0-10.0
Producer rating                       | NaN                                 | NaN                                                                      | 0.0-10.0
Number of movies with actor           | NaN                                 | NaN                                                                      | Positive integer
Number of movies made by director     | NaN                                 | NaN                                                                      | Positive integer
Number of movies produced by producer | NaN                                 | NaN                                                                      | Positive integer
Table 1: Summary of previous works features
B Confusion matrices
Figure 1: Decision tree, 4 classes (3.6)
Figure 2: Decision tree, 4 classes (3.5)
Figure 3: Decision tree, 5 classes (3.7)
Figure 4: Decision tree, 8 classes (3.4)
Figure 5: KNN, 6 classes (3.2)
Figure 6: Decision tree, 6 classes (3.3)
Figure 7: Coarse KNN, 17 classes (3.8)
C Genres
1. Action|Adventure|Fantasy
2. Animation|Adventure|Comedy
3. Action|Adventure
4. Action|Thriller
5. Action|Adventure|Sci-Fi
6. Adventure|Fantasy
7. Action|Adventure|Comedy
8. Action|Adventure|Family
9. Action|Sci-Fi
10. Action|Adventure|Western
11. Adventure|Family|Fantasy
12. Drama|Romance
13. Action|Adventure|Drama
14. Animation|Drama|Family
15. Action|Adventure|Horror
16. Action|Crime|Thriller
17. Action|Crime|Drama
18. Animation|Adventure|Family
19. Adventure
20. Animation|Action|Adventure
21. Comedy|Family|Fantasy
22. Action|Comedy|Sci-Fi
23. Adventure|Drama|Sci-Fi
24. Action|Sci-Fi|Thriller
25. Action|Adventure|History
26. Family|Fantasy|Musical
27. Action|Adventure|Crime
28. Drama|Horror|Sci-Fi
29. Animation|Comedy|Family
30. Drama|Fantasy|Romance
31. Adventure|Comedy|Family
32. Action|Adventure|Thriller
33. Adventure|Drama|Family
34. Mystery|Thriller
35. Comedy|Fantasy|Horror
36. Action|Drama|Thriller
37. Drama|Fantasy|Horror
38. Action|Comedy|Fantasy
39. Action|Drama|History
40. Action|Comedy|Crime
41. Adventure|Drama|Thriller
42. Adventure|Sci-Fi|Thriller
43. Action|Adventure|Mystery
44. Adventure|Mystery|Sci-Fi
45. Action|Drama|Mystery
46. Animation|Action|Comedy
47. Action|Adventure|Romance
48. Adventure|Drama|Fantasy
49. Action|Fantasy|Sci-Fi
50. Action|Mystery|Sci-Fi
51. Action|Family|Sci-Fi
52. Comedy|Drama|Romance
53. Action|Comedy|Romance
54. Action|Drama|Sci-Fi
55. Action|Comedy|Thriller
56. Action|Drama|War
57. Action|Mystery|Thriller
58. Crime|Thriller
59. Action|Drama|Family
60. Action|Crime|Mystery
61. Action|Drama|Fantasy
62. Action|Fantasy|Thriller
63. Adventure|Drama|Romance
64. Biography|Drama|Sport
65. Drama|History|War
66. Comedy|Crime
67. Drama|Western
68. Biography|Comedy|Crime
69. Biography|Crime|Drama
70. Crime|Drama
71. Fantasy|Horror|Mystery
72. Action|Crime|Sci-Fi
73. Action|Romance|Thriller
74. Action|Fantasy|Horror
75. Comedy|Drama|Sci-Fi
76. Action|Crime|Fantasy
77. Comedy|Romance
78. Action|Adventure|Biography
79. Adventure|Comedy|Sci-Fi
80. Comedy|Action|Sci-Fi
81. Drama|Family|Fantasy
82. Comedy
83. Action|Horror|Sci-Fi
84. Crime|Drama|Mystery
85. Action|Drama|Sport
86. Action|Comedy
87. Drama|History|Romance
88. Crime|Drama|Thriller
89. Adventure|Comedy|Drama
90. Drama|Mystery|Sci-Fi
91. Drama|History|Thriller
92. Biography|Comedy|Drama
93. Comedy|Sci-Fi|Thriller
94. Horror|Sci-Fi|Thriller
95. Drama|History|Sport
96. Action|Crime|Romance
97. Comedy|Fantasy
98. Comedy|Fantasy|Romance
99. Comedy|Drama|Family
100. Action|Comedy|Family
101. Comedy|Romance|Sci-Fi
102. Comedy|Family|Romance
103. Action|Fantasy
104. Comedy|Crime|Sport
105. Comedy|Drama|Fantasy
106. Comedy|Mystery
107. Action
108. Drama|Sci-Fi|Thriller
109. Sci-Fi|Adventure|Action
110. Comedy|Sci-Fi
111. Drama|Horror|Mystery
112. Comedy|Family|Sci-Fi
113. Sci-Fi|Thriller
114. Drama|Fantasy|Sport
115. Comedy|Drama
116. Comedy|Crime|Drama
117. Drama|Thriller
118. Drama|Romance|Sport
119. Adventure|Drama|War
120. Crime|Mystery|Thriller
121. Action|Drama|Romance
122. Adventure|Comedy|Crime
123. Comedy|Drama|Musical
124. Drama
125. Action|Horror|Thriller
126. Animation|Comedy|Fantasy
127. Action|Comedy|Sport
128. Biography|Drama|History
129. Drama|War
130. Adventure|Biography|Drama
131. Animation|Adventure|Drama
132. Drama|Fantasy|Mystery
133. Drama|Music|Musical
134. Drama|Horror|Romance
135. Action|Sci-Fi|Sport
136. Fantasy|Mystery|Romance
137. Drama|Sci-Fi
138. Biography|Drama|Thriller
139. Crime|Drama|History
140. Drama|Fantasy|Thriller
141. Drama|Sport
142. Crime|Drama|Horror
143. Drama|Mystery|Romance
144. Adventure|Biography|Crime
145. Drama|Musical|Romance
146. Comedy|Sport
147. Crime|Drama|Fantasy
148. Drama|Mystery|Thriller
149. Animation|Family|Fantasy
150. Biography|Drama
151. Comedy|Crime|Romance
152. Adventure|Comedy
153. Comedy|Family
154. Drama|Romance|Western
155. Adventure|Thriller|Western
156. Action|Biography|Drama
157. Crime|Drama|Music
158. Drama|Music
159. Action|Crime
160. Adventure|Crime|Drama
161. Comedy|Crime|Music
162. Adventure|Drama|History
163. Comedy|Family|Musical
164. Romance|Sci-Fi|Thriller
165. Mystery|Sci-Fi|Thriller
166. Drama|Horror|Musical
167. Action|Thriller|War
168. Drama|Fantasy
169. Horror|Mystery
170. Adventure|Comedy|Fantasy
171. Horror|Mystery|Thriller
172. Action|Comedy|Mystery
173. Crime|Romance|Thriller
174. Crime|Drama|Romance
175. Comedy|Drama|Thriller
176. Action|Horror
177. Comedy|Crime|Musical
178. Drama|Fantasy|Musical
179. Comedy|Drama|Music
180. Comedy|Musical
181. Drama|Music|Romance
182. Adventure|Drama|Western
183. Horror
184. Drama|Romance|Thriller
185. Fantasy|Horror
186. Comedy|Romance|Western
187. Biography|Drama|Music
188. Horror|Thriller
189. Adventure|Drama|Mystery
190. Comedy|Music|Romance
191. Biography|Drama|Romance
192. Comedy|Music
193. Comedy|Drama|Sport
194. Comedy|Crime|Thriller
195. Mystery|Romance|Thriller
196. Comedy|Western
197. Horror|Mystery|Sci-Fi
198. Comedy|Family|Music
199. Adventure|Drama
200. Drama|Family
201. Crime|Mystery|Comedy
202. Drama|History
203. Comedy|Horror|Romance
204. Action|Comedy|Horror
205. Drama|Thriller|War
206. Comedy|Romance|Sport
207. Biography|Drama|Family
208. Comedy|Romance|Drama
209. Action|Western
210. Drama|Romance|War
211. Animation|Comedy|Musical
212. Comedy|Family|Sport
213. Adventure|Comedy|Romance
214. Comedy|Crime|Mystery
215. Mystery|Romance|Sci-Fi
216. Comedy|Crime|Family
217. Drama|Family|Sport
218. Drama|Mystery
219. Fantasy|Romance
220. Drama|Horror
221. Comedy|Horror
222. Action|Biography|Crime
223. Drama|Horror|Thriller
224. Horror|Sci-Fi
225. Crime|Drama|Sport
226. Adventure|Horror|Thriller
227. Animation|Adventure|Crime
228. Crime|Horror|Mystery
229. Animation|Comedy|Drama
230. Adventure|Biography|History
231. Action|Horror|Romance
232. Adventure|Family|Romance
233. Comedy|Drama|History
234. Comedy|War
235. Action|Drama|Music
236. Fantasy|Horror|Thriller
237. Action|Drama
238. Comedy|Romance|Thriller
239. Action|Drama|Western
240. Adventure|Horror|Mystery
241. Adventure|Animation|Family
242. Adventure|Comedy|Musical
243. Action|Comedy|Music
244. Adventure|Fantasy|Romance
245. Comedy|Horror|Thriller
246. Adventure|Fantasy|Horror
247. Adventure|Comedy|Horror
248. Drama|Romance|Sci-Fi
249. Crime|Drama|War
250. Romance|Sci-Fi
251. Drama|Family|Music
252. Adventure|Crime|Mystery
253. Adventure|Drama|Horror
254. Documentary|Action|Comedy
255. Drama|Sport|Thriller
256. Comedy|Musical|Romance
257. Adventure|Family
258. Action|Crime|Horror
259. Adventure|Comedy|Mystery
260. Comedy|Action
261. Adventure|Fantasy|Mystery
262. Thriller
263. Adventure|Family|Sci-Fi
264. Sci-Fi
265. Drama|Family|Musical
266. Comedy|Thriller
267. Crime|Horror|Thriller
268. Drama|Family|Romance
269. Action|Comedy|Drama
270. Family|Music|Romance
271. Drama|Romance|Family
272. Biography|Drama|Fantasy
273. Drama|Family|History
274. Western
275. Comedy|Drama|War
276. Biography|Drama|War
277. Adventure|Mystery|Thriller
278. Horror|Mystery|Romance
279. Comedy|Mystery|Romance
280. Comedy|Mystery|Sci-Fi
281. Comedy|Crime|Horror
282. Comedy|Fantasy|Musical
283. Comedy|Drama|Mystery
284. Action|Horror|Mystery
285. Drama|Fantasy|Music
286. Family|Sci-Fi
287. Adventure|Comedy|Music
288. Action|Comedy|War
289. Fantasy|Horror|Sci-Fi
290. Drama|History|Mystery
291. Drama|Fantasy|Western
292. Documentary|Drama
293. Animation|Crime|Drama
294. Horror|Musical|Sci-Fi
295. Adventure|Biography|Comedy
296. Comedy|Mystery|Crime
297. Crime|Horror
298. Comedy|Musical|Adventure
299. Comedy|Horror|Sci-Fi
300. Musical|Romance
301. Documentary|Drama|War
302. Crime|Drama|Musical
303. Drama|Musical
304. Fantasy|Horror|Romance
305. Adventure|Horror
306. Comedy|Horror|Mystery
307. Adventure|Sci-Fi
308. Documentary|Music
309. Documentary|Crime|Drama
310. Animation|Action|Crime
311. Documentary
312. Comedy|Drama|Horror
313. Crime|Drama|Film-Noir
314. Documentary|Crime
315. Action|Sport|Thriller
316. Documentary|Comedy
317. Crime|Film-Noir|Thriller
318. Documentary|Crime|War
319. Comedy|Fantasy|Thriller
320. Documentary|Drama|Sport
321. Crime|Fantasy|Horror
322. Crime|Film-Noir|Mystery
323. Drama|Fantasy|Sci-Fi
324. Documentary|Comedy|Drama