Final Project
CS/ECE/ME 539
Professor Hu
UW-Madison
MWF 1:20p
David A. Gerasimow

The Design and Implementation of a Dynamic Data MLP to Predict Motion Picture Revenue


Table of Contents

Introduction: Preface, Past Research, Improvements Over Past Research
Initial Data Collection
Data Collection Improvements, Data Encoding
Pre-analysis of Data, Development of the Dynamic Data Neural Network, Step 1 of the UpdateWizard: Downloading, Step 2 of the UpdateWizard: Updating
Step 3 of the UpdateWizard: Creating Training and Testing Files, Development of the MLP using Dynamic Data
Using the Dynamic Data MLP, Choice 1 of moviesbp.m, Choice 2 of moviesbp.m
Figure 1: DataExtractor Screenshot
Figure 2: DataConcatenator Screenshot
Figure 3: DataConverter Screenshot: Films Removed From Data File
Figure 4: DataConverter Screenshot: Films to be Updated
Figure 5: Results of preanalysis.m
Figure 6: UpdateWizard Screenshot Step 1, Figure 7: UpdateWizard Screenshot Step 2
Figure 8: UpdateWizard Screenshot Step 3, Figure 9: NewMovie Screenshot
Discussion of Results
Bibliography
VB Source Code

Note to Grader: This report is over twenty pages long because I was unsure whether the grader has the ability to read and run Visual Basic 6.0 source code.

Introduction

Preface

For the last century, film has been one of the American public's favorite entertainment media. Large production companies often spend hundreds of millions of dollars to create a single film. However, the amount of money spent on creating a film seems to have little bearing on its success. The Blair Witch Project, for instance, was made for under one million dollars, but it made over twenty-nine million dollars in its first weekend at the box office. On the other hand, Waterworld, starring superstar Kevin Costner, cost roughly one hundred seventy-five million dollars to produce, but made back less than half of that amount in domestic box office revenue.

Predicting how much a movie will earn in opening-weekend box office revenue is notoriously difficult. Many aspects of a movie are subjective, and public taste changes quickly and unpredictably. A mathematical formula that predicts how much a film will make would allow production companies to maximize profit and skip film development projects that would hurt their profit margins.

Past Research

In CS/ECE/ME 539, in the fall semester of 2001, a student attempted to predict the opening weekend box office revenue of a given film using an artificial neural network. He claimed that an accurate prediction of how much a movie will gross in total can be obtained by examining its opening weekend, because the two are proportional to each other: if a film has a huge opening weekend, it is likely to earn a lot of money in the long run. His logic is correct, and I will use it again in this project.

The network's inputs are the film's characteristics, such as genre, rating, and runtime. Despite his thorough work, there are deficiencies in his project, and this project aims to be a major improvement over his results. Specifically, it will produce higher correct classification rates while allowing future users to update the data files easily. The neural network will, over time, accumulate more and more training data. A major component of this report is the development of what I call a dynamic data neural network, whose training data is automatically updated weekly. Instead of the project ending with the end of the semester, future classes will be able to update this project's results easily, and the network's correct classification rates will improve over time.

Improvements over Past Research

As an avid film buff, I independently came up with the idea of applying neural networks to film box office revenue, and I was disappointed to find that it had been done before. I therefore set out to improve upon the previous results. By adding more features and more feature vectors, I hoped to obtain better results. Moreover, I wrote an UpdateWizard that can automatically and repeatedly update the data files. Every week, the top grossing films are listed at www.boxofficeguru.com; the UpdateWizard automatically downloads the list and updates the data file. Finally, it creates new training and testing data files for use with MATLAB and the dynamic data neural network.

Initial Data Collection

Data pertaining to box office revenue is plentiful on the internet, so I sought out the source with the most input features and the greatest reliability. After a thorough search, I decided to use the information found at www.boxofficeguru.com. Its .html files already list all the films since 1989 that have grossed more than fifteen million dollars in their first weekend. Additional data is also posted: the film's opening date, number of theatres at opening, distributor, number of days in the opening weekend, and, most importantly, the exact amount the film grossed in its opening weekend.

Unfortunately, the data is not in a format that is convenient for programming use. Consequently, using Microsoft Visual Basic 6.0 Professional Edition, I wrote a Windows application called DataExtractor (dataextractor.exe) to parse the information out of the *.html files. For this portion of the data collection, which is performed only once, I manually downloaded the data files and renamed them 35plus.htm, 25to35.htm, 20to25.htm, 17to20.htm, and 15to17.htm. Running these five files through the DataExtractor creates five readable files called 35plus_1.txt, 25to35_1.txt, 20to25_1.txt, 17to20_1.txt, and 15to17_1.txt. A screenshot of the DataExtractor is shown in Figure 1.
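The parsing step can be illustrated with a short sketch. It is written in Python rather than the Visual Basic 6.0 used for the DataExtractor, and it assumes, hypothetically, that each film occupies one HTML table row whose cells hold fields such as title, opening date, gross, and theatre count; the actual page layout at www.boxofficeguru.com may differ. The names RowExtractor, extract_rows, and rows_to_text are illustrative only.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:               # skip rows that had no <td> cells
                self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Return a list of rows; each row is a list of cell strings."""
    parser = RowExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def rows_to_text(rows):
    """Render rows as tab-separated lines (an assumed output layout)."""
    return "\n".join("\t".join(row) for row in rows)


# Made-up sample input; the titles and numbers are placeholders, not real data.
sample = """<table>
<tr><td>Example Film A</td><td>5/3/02</td><td>$20.5</td><td>3,000</td></tr>
<tr><td>Example Film B</td><td>7/4/01</td><td>$16.1</td><td>2,500</td></tr>
</table>"""
```

Calling rows_to_text(extract_rows(sample)) would give one tab-separated line per film, which is the kind of readable file the DataExtractor is described as producing.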

The source code for the DataExtractor is found in the VB Source Code section.

After the DataExtractor has parsed the information, the five output files (35plus_1.txt, 25to35_1.txt, 20to25_1.txt, 17to20_1.txt, and 15to17_1.txt) need to be concatenated. Again using Visual Basic 6.0, I developed another Windows application called DataConcatenator (dataconcatenator.exe). It takes the five aforementioned output files as inputs and creates a single file called concatenated_data.txt. A screenshot of the DataConcatenator is shown in Figure 2.
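The concatenation step is simple enough to sketch in a few lines. The following is a Python illustration, not the Visual Basic source; it assumes that each input file holds one film per line and that blank lines can be dropped. The function name concatenate is hypothetical.

```python
from pathlib import Path

# The five DataExtractor output files named in the text.
INPUT_NAMES = [
    "35plus_1.txt",
    "25to35_1.txt",
    "20to25_1.txt",
    "17to20_1.txt",
    "15to17_1.txt",
]


def concatenate(folder, input_names=INPUT_NAMES,
                output_name="concatenated_data.txt"):
    """Join the input files, in order, into one output file.

    Returns the number of non-blank lines written.
    """
    folder = Path(folder)
    lines = []
    for name in input_names:
        text = (folder / name).read_text(encoding="utf-8")
        # Assumption: one film per line; blank lines carry no data.
        lines.extend(line for line in text.splitlines() if line.strip())
    (folder / output_name).write_text("\n".join(lines) + "\n",
                                      encoding="utf-8")
    return len(lines)
```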

The source code for the DataConcatenator is found in the VB Source Code section.

Now that a single readable file existed, more input features needed to be added. To avoid unnecessary re-entering of data, I wrote another Windows application called DataConverter (dataconverter.exe). It takes two files as inputs: data.txt and concatenated_data.txt. The file data.txt contains the data used in the earlier project, while concatenated_data.txt contains the updated film information produced by the DataExtractor and DataConcatenator. The DataConverter compares these two files. If a film from data.txt did not gross over fifteen million dollars in its first weekend, it is removed from the data file. Otherwise, its data is copied into mydata.txt. Films that have been released since data.txt was created (in 2001) are identified and listed in the file titlestoupdate.txt. The DataConverter displays both the films that did not gross over fifteen million dollars and the films that need to be updated (i.e., those released since 2001).
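The comparison performed by the DataConverter can be sketched as a pair of set operations. This Python illustration is not the Visual Basic source. It assumes films are matched by title and that a film's presence in concatenated_data.txt is what marks it as having grossed over fifteen million dollars; both are assumptions, as is the function name convert.

```python
def convert(old_records, new_titles):
    """Compare the earlier data set with the newly extracted title list.

    old_records: dict mapping title -> record line from data.txt
    new_titles:  list of titles found in concatenated_data.txt

    Returns (kept, removed, to_update):
      kept      -> records to copy into mydata.txt
      removed   -> titles dropped because they are not in the new list
      to_update -> titles for titlestoupdate.txt (new since data.txt)
    """
    new_set = set(new_titles)
    kept = {t: rec for t, rec in old_records.items() if t in new_set}
    removed = [t for t in old_records if t not in new_set]
    to_update = [t for t in new_titles if t not in old_records]
    return kept, removed, to_update


# Placeholder titles and records, for illustration only.
old = {"Film A": "Film A,21,-3", "Film B": "Film B,20,-2"}
new = ["Film A", "Film C"]
kept, removed, to_update = convert(old, new)
```

With these placeholder inputs, "Film A" is kept, "Film B" is removed, and "Film C" is queued for manual entry of its remaining features.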

Several screenshots of the DataConverter in action are shown in Figures 3 and 4. The source code for the DataConverter can be found in the VB Source Code section.

Once these three programs have run (DataExtractor, DataConcatenator, DataConverter), a list of film titles that need to be updated is stored under the filename titlestoupdate.txt. With this information, I looked up all the films in the file at www.imdb.com. This website has more information than www.boxofficeguru.com: a film's genre, rating, runtime, color/black & white/animated status, and sequel data are listed. I looked up each movie individually and entered the data in a Microsoft Word document.

Data Collection Improvements

Here, I made some improvements over the last project done on box office revenue. First, I eliminated its use of the IMDB user ratings as an input feature. This data is irrelevant because people do not know whether a movie will be good before its opening weekend. While it would be important for a neural network that predicts total gross revenue, it is not useful in predicting opening weekend gross revenue. In addition, I removed the input feature giving the day of the month on which the film was released. Because a given day of the month does not correspond to any specific day of the week, this input feature was effectively random and therefore of little use in the MLP development.

Two input features were also added. First, from general observations of the film industry, sequels tend to do well: the audience knows what to expect, and a film studio usually releases a sequel only if the original did well. Whether or not a film is a sequel is therefore an important factor in its opening weekend revenue. Second, animated films tend to do very well, so whether or not a film is animated has a significant impact on its opening weekend revenue. The addition of these two input features increased the correct classification rates of the multi-layer perceptron.

Data Encoding

The file movies.txt is the final data file produced by the procedures described above. Many features of a film are not numerical. As such, I created an encoding scheme that allows the non-numerical data fields to be used by the multi-layer perceptron.

Genre      Code   Rating  Code   Distributor         Code
Action     20     G       -5     Sony                1
Comedy     21     PG      -4     Universal           2
Drama      22     PG-13   -3     Warner Brothers     3
Family     23     R       -2     Fox                 4
Horror     24                    New Line            5
Mystery    25                    Buena Vista         6
Animation  26                    Paramount           7
Romance    27                    MGM/United Artists  8
Sci-Fi     28                    MGM                 9
Thriller   29                    DreamWorks          10
Western    30                    Miramax             11
                                 TriStar             12
                                 Columbia            13
                                 Artisan             14
                                 Polygram            15
                                 USA Films           16
                                 Orion               17
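The encoding scheme in the table above amounts to three lookup tables. A minimal Python sketch follows; it is an illustration rather than code from the project. It reads the rating codes as the negative values -5 through -2, which is an assumption about how the table should be read, and the function name encode_film is hypothetical.

```python
# Genre codes from the encoding table.
GENRE_CODES = {
    "Action": 20, "Comedy": 21, "Drama": 22, "Family": 23,
    "Horror": 24, "Mystery": 25, "Animation": 26, "Romance": 27,
    "Sci-Fi": 28, "Thriller": 29, "Western": 30,
}

# Rating codes, read here as negative numbers (an assumption).
RATING_CODES = {"G": -5, "PG": -4, "PG-13": -3, "R": -2}

# Distributor codes from the encoding table.
DISTRIBUTOR_CODES = {
    "Sony": 1, "Universal": 2, "Warner Brothers": 3, "Fox": 4,
    "New Line": 5, "Buena Vista": 6, "Paramount": 7,
    "MGM/United Artists": 8, "MGM": 9, "DreamWorks": 10,
    "Miramax": 11, "TriStar": 12, "Columbia": 13, "Artisan": 14,
    "Polygram": 15, "USA Films": 16, "Orion": 17,
}


def encode_film(genre, rating, distributor):
    """Map the three non-numerical fields to their numeric codes.

    Raises KeyError for a value that is not in the table, so an
    unknown distributor or genre is noticed instead of silently
    receiving an arbitrary code.
    """
    return [
        GENRE_CODES[genre],
        RATING_CODES[rating],
        DISTRIBUTOR_CODES[distributor],
    ]


encoded = encode_film("Comedy", "PG-13", "Fox")
```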

Pre-analysis of the Data

Before developing the MLP, it is helpful to examine the data thoroughly. To assist in this task, I wrote preanalysis.m in MATLAB. It produces graphs showing how many films have certain characteristics. It also computes the mean values and standard deviations of the inputs where applicable. The graphs produced by preanalysis.m are shown in Figure 5.

Development of the Dynamic Data Neural Network

A major component of this project is the development of what I call a dynamic data neural network. The training and testing data used by the MLP is constantly changing. This is an improvement over networks trained on a fixed data set, including the one previously designed to tackle the opening weekend box office revenue problem.

In Visual Basic 6.0, I developed a Windows application called the UpdateWizard (updatewizard.exe). This program performs all the steps necessary to update the data. Consequently, as time goes on, the MLP, which is developed later in this report, will change and improve as its training data is updated.

Step 1 of the UpdateWizard: Downloading

The UpdateWizard begins by downloading the most up-to-date data files from www.boxofficeguru.com. The program contacts the server and downloads five files: open35+.htm, open25-35.htm, open20-25.htm, open17-20.htm, and open15-17.htm. The files are processed and concatenated using methods similar to those found in the DataExtractor and DataConverter. After the files have been downloaded, processed, and linked, the updated data is compared to the current data file (movies.txt). Films that are new since the last update are presented to the user.

A screenshot of the UpdateWizard in step 1 is shown in Figure 6.

Step 2 of the UpdateWizard: Updating

In this step of the UpdateWizard, the user enters the information for the films that are new since the last update. This information can be found at www.imdb.com. After all the films have been updated, they are added to the data file movies.txt, and the data is then up to date. If no updates are available, this step is skipped, and the UpdateWizard proceeds directly to step 3.

A screenshot of the UpdateWizard in step 2 is shown in Figure 7.

Step 3 of the UpdateWizard: Creating Training and Testing Files

In the third and final step of the UpdateWizard, training and testing files are created. The user has several options in this step. First, the user decides how many classes the data will be partitioned into. Second, he or she decides when the testing file begins. Once a date is selected, the training file consists of all the films that were released prior to that date, and the testing file consists of all the films that were released after that date.

User Options in Step 3

Training File Options (i.e., classification scheme):

1. 2 Classes: Class 1: 15m-22.5m, Class 2: 22.5m+

2. 4 Classes: Class 1: 15m-18.5m, Class 2: 18.5m-23m, Class 3: 23m-32m, Class 4: 32m+

3. 5 Classes: Class 1: 15m-17m, Class 2: 17m-20m, Class 3: 20m-25m, Class 4: 25m-35m, Class 5: 35m+

Testing File Options (i.e., training and testing data separation):

Begin Testing File on January 1st of 2001, 2002, or 2003.

Output File Description

training_X_YYYY.txt and testing_X_YYYY.txt, where X is the classification scheme and YYYY is the year at which the testing file begins.

A screenshot of the UpdateWizard in step 3 is shown in Figure 8.
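The two choices made in step 3 can be sketched as a binning function and a date-based split. This Python illustration is not the UpdateWizard source. It assumes that a gross falling exactly on a boundary belongs to the higher class and that a film opening exactly on January 1st goes to the testing file; the report does not state either rule. The names revenue_class and split_by_year and the "opened" field are hypothetical.

```python
from datetime import date

# Upper class boundaries, in millions of dollars, for each scheme
# (keyed by the number of classes).
CLASS_EDGES = {
    2: [22.5],
    4: [18.5, 23.0, 32.0],
    5: [17.0, 20.0, 25.0, 35.0],
}


def revenue_class(gross_millions, n_classes):
    """Return the 1-based class label for an opening-weekend gross.

    Assumption: a value exactly on a boundary goes to the higher class.
    """
    label = 1
    for edge in CLASS_EDGES[n_classes]:
        if gross_millions >= edge:
            label += 1
    return label


def split_by_year(films, test_year):
    """Split films into (training, testing) around January 1st of test_year.

    Each film is a dict with an "opened" date (a hypothetical field name).
    """
    cutoff = date(test_year, 1, 1)
    training = [f for f in films if f["opened"] < cutoff]
    testing = [f for f in films if f["opened"] >= cutoff]
    return training, testing


# Placeholder films, for illustration only.
films = [
    {"title": "Film A", "opened": date(2000, 6, 9), "gross": 16.2},
    {"title": "Film B", "opened": date(2002, 5, 3), "gross": 40.0},
]
training, testing = split_by_year(films, 2002)
labels = [revenue_class(f["gross"], 4) for f in films]
```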

The source code for the UpdateWizard can be found in the VB Source Code section.

Development of the MLP using Dynamic Data

Based on Professor Hu's bp.m, bptest.m, and bpconfig.m, I developed a multi-layer perceptron that uses dynamic data to predict a film's opening weekend revenue. Professor Hu's MATLAB source code was modified and is contained in moviesbp.m, moviesbptest.m, and moviesbpconfig.m. As in previous homework assignments, I ran many trials to determine the optimal configuration for the MLP. After the configuration was determined, it was hard-coded into moviesbpconfig.m so that future users do not have to enter the configuration each time the MLP is run.

After modifying Professor Hu's multi-layer perceptron MATLAB files, I began to test the MLP in order to determine the network's optimal configuration. To aid in this process, I used three-way cross-validation, implemented in a MATLAB program called threeway.m. The m-file prompts the user to choose the classification scheme that is appropriate for the training task. Then, it concatenates the training and testing files for that classification scheme. Next, it repartitions the concatenated data file into three equally sized files. These three files are then normalized using each data column's mean and standard deviation.

Tables Showing Trials for Optimal MLP Configuration

Mean and standard deviation of eight trials to determine the optimal learning rate, α, and momentum constant, μ. The selected learning rate and momentum constant are marked with an asterisk (*).

α   μ    Trial 1  Trial 2  Trial 3  Trial 4  Trial 5  Trial 6  Trial 7  Trial 8  Mean    Std
.1  .1   52.1739  53.8753  48.2246  50.3491  52.4582  52.3498  49.3014  52.9586  51.461  1.9544
.1  .3   54.3248  52.3408  50.3140  51.6512  49.0293  51.6612  50.2323  52.8437  51.550  1.6750
.1  .5   50.3244  54.3324  55.3398  52.3202  53.4839  49.9238  53.3204  54.8942  52.992  2.0092
.1  .7   52.3487  57.2349  54.1437  55.2304  52.2095  58.3094  56.4205  53.3657  54.908  2.2731  *
.3  .1   49.0239  51.0235  50.1203  50.7654  52.5478  49.3140  51.5036  51.1126  50.676  1.1589
.3  .3   50.3249  49.9929  51.3418  53.1238  51.2095  51.5637  50.3140  50.4327  51.038  1.0170
.3  .5   50.8708  51.3047  49.3140  49.0219  54.3014  52.0237  53.2878  53.2049  51.666  1.9033
.3  .7   53.9238  52.377   55.0431  52.2908  52.5438  54.0324  54.1487  55.8714  53.779  1.3041
.5  .1   50.1239  47.1344  46.1238  48.0283  49.8721  45.2938  43.1734  50.1919  47.493  2.5537
.5  .3   49.0973  50.8917  51.8187  50.2398  48.2102  48.1234  50.3209  52.5407  50.155  1.6065
.5  .5   51.3094  50.3410  51.5209  51.5141  52.2984  52.2085  53.3407  50.4310  51.621  0.9953
.5  .7   50.3496  51.2508  51.2540  52.0295  50.0493  54.0321  55.0132  49.0829  51.633  2.0102
.7  .1   43.2308  46.0943  43.0529  43.2308  47.4230  44.0132  46.3209  47.4980  45.108  1.9267
.7  .3   44.4039  42.054   46.4289  47.0319  49.1098  46.4877  45.4827  47.0121  46.001  2.0897
.7  .5   46.3897  44.8724  47.0251  46.1421  48.2437  50.1239  49.3409  48.2138  47.544  1.7537
.7  .7   44.1230  46.2347  45.3047  49.0329  50.0987  49.8713  47.0929  50.1319  47.736  2.3663
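The three-way cross-validation and column normalization described in this section can be sketched as follows. This is a Python illustration, not threeway.m. It assumes the rows are shuffled before partitioning, that leftover rows are dropped so the three parts are equal in size, and that normalization means subtracting the column mean and dividing by the column standard deviation; none of these details is stated in the report. All function names are hypothetical.

```python
import random
import statistics


def three_way_partition(rows, seed=0):
    """Shuffle the rows and cut them into three equally sized parts.

    Rows that do not divide evenly are dropped (an assumption).
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // 3
    return [shuffled[i * size:(i + 1) * size] for i in range(3)]


def normalize_columns(rows):
    """Scale each column to zero mean and unit standard deviation.

    Needs at least two rows; a constant column is only centered.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def cross_validation_rounds(parts):
    """Yield (training_rows, testing_rows), holding out each part once."""
    for held_out in range(len(parts)):
        training = [
            row
            for index, part in enumerate(parts)
            if index != held_out
            for row in part
        ]
        yield training, parts[held_out]


# Placeholder numeric data: ten rows with two feature columns.
data = [[float(i), float(2 * i + 1)] for i in range(10)]
parts = three_way_partition(data)
rounds = list(cross_validation_rounds(parts))
scaled = normalize_columns(data)
```

Each of the three rounds trains on two parts and tests on the third, so every retained row is used for testing exactly once.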

Mean and standard deviation of eight trials to determine the optimal number of hidden layers, HL. The selected number of hidden layers is marked with an asterisk (*).

       Trial 1  Trial 2  Trial 3  Trial 4  Trial 5  Trial 6  Trial 7  Trial 8  Mean    Std
HL=1   56.3094  57.9541  58.3420  57.4092  54.2231  55.3209  58.3428  57.9112  56.977  1.536   *
HL=2   54.1708  52.3407  49.1203  51.8274  53.3479  52.5027  53.0523  49.1042  51.933  1.8776
HL=3   53.4308  52.5499  53.0023  48.2348  50.3299  51.4390  53.2581  50.2109  51.557  1.8458

Mean and standard deviation of eight trials to determine the optimal number of hidden nodes, H, in the single hidden layer. The selected number of hidden neurons is marked with an asterisk (*).

       Trial 1  Trial 2  Trial 3  Trial 4  Trial 5  Trial 6  Trial 7  Trial 8  Mean    Std
H = 2  48.2351  49.4239  47.3402  45.4223  47.3100  47.1236  45.9810  46.3498  47.148  1.2763
H = 4  50.3460  50.0148  48.0129  52.5489  51.0253  50.0544  52.2587  53.3980  50.957  1.7294
H = 6  54.5407  53.4509  57.2098  55.2340  54.2390  53.5409  56.4308  55.2107  54.982  1.328   *
H = 8  52.2308  54.3498  52.1487  51.5023  50.2530  52.4879  53.3205  55.3991  52.712  1.6205

The mean and standard deviation values were computed using alphamu.m, hl.m, and h.m. These MATLAB files read data from files stats_am.txt, stats_h.txt, and stats_hl.txt.
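As a check on how the Mean and Std columns are formed, the following Python sketch recomputes them for the HL=1 row of the hidden-layer table. It is an illustration and not a port of alphamu.m, hl.m, or h.m. It uses the sample standard deviation (dividing by n - 1), which is also what MATLAB's std function does by default; the function name summarize is hypothetical.

```python
import statistics

# The eight trial results for HL=1, copied from the hidden-layer table.
hl1_trials = [56.3094, 57.9541, 58.3420, 57.4092,
              54.2231, 55.3209, 58.3428, 57.9112]


def summarize(trials):
    """Return (mean, sample standard deviation), rounded to 3 decimals."""
    mean = statistics.mean(trials)
    std = statistics.stdev(trials)   # sample form: divides by n - 1
    return round(mean, 3), round(std, 3)


mean_hl1, std_hl1 = summarize(hl1_trials)
```

The result agrees with the tabulated values of 56.977 and 1.536, which suggests the reported Std column is the sample standard deviation rather than the population form.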

Based on the above trials, the MLP configuration is as follows:

Learning Rate: 0.1

Momentum Constant: 0.7

Number of Hidden Layers: 1

Number of Hidden Neurons: 6

Maximum Number of Epochs: 5000

Samples Per Epoch: 64

Scaling of Input: [-5, 5]

Neurons in the hidden layer use the tanh() activation function.

Neurons in the output layer use the sigmoidal() activation function.

These are the default activation functions provided in Professor Hu's bp.m.
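To make the configuration concrete, here is a minimal Python sketch of a forward pass through a network with one tanh hidden layer and a sigmoid output layer, together with a generic gradient-descent weight update with momentum. It is not a port of bp.m or moviesbp.m. The number of inputs, the use of one output neuron per class, the weight initialization range, and all function names are assumptions made for illustration.

```python
import math
import random


def init_weights(n_inputs, n_hidden, n_outputs, seed=0):
    """Create small random weights; index 0 of each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
              for _ in range(n_outputs)]
    return hidden, output


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_output):
    """Return the output activations for one input vector."""
    with_bias = [1.0] + list(inputs)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, with_bias)))
              for row in w_hidden]
    hidden_with_bias = [1.0] + hidden
    return [sigmoid(sum(w * v for w, v in zip(row, hidden_with_bias)))
            for row in w_output]


def predict_class(inputs, w_hidden, w_output):
    """Pick the 1-based class whose output neuron is largest."""
    outputs = forward(inputs, w_hidden, w_output)
    return outputs.index(max(outputs)) + 1


def momentum_update(weight, gradient, previous_step,
                    learning_rate=0.1, momentum=0.7):
    """One gradient-descent step with momentum for a single weight.

    Defaults are the selected learning rate and momentum constant.
    """
    step = -learning_rate * gradient + momentum * previous_step
    return weight + step, step


# Six hidden neurons as selected; three inputs and four classes are
# placeholder sizes chosen only for this example.
w_hidden, w_output = init_weights(n_inputs=3, n_hidden=6, n_outputs=4)
outputs = forward([0.5, -1.0, 2.0], w_hidden, w_output)
predicted = predict_class([0.5, -1.0, 2.0], w_hidden, w_output)
```

The momentum term reuses a fraction of the previous step, which smooths successive updates; with the selected constants, 70 percent of the last step is carried forward.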

Using the Dynamic Data MLP

After the UpdateWizard has been used to update the data file and create training and testing files, the dynamic data MLP is ready to use. From the MATLAB prompt, run moviesbp.m by entering moviesbp. Note that moviesbp.m requires many support m-files that are not included in the *.zip file. They are, however, available for download from the CS/ECE/ME 539 website, http://www.cae.wisc.edu/~ece539/fall03/index.html, in the section entitled "MATLAB Files Used in the Class." When moviesbp.m begins, the user has two choices.

Choice 1 of moviesbp.m

Choice 1, or "Predict the Revenue of a Newly Released Film," allows the user to test the dynamic data MLP on a new movie. The user must first, however, run the Windows application called NewMovie (newmovie.exe). This program, developed using Visual Basic 6.0, provides a graphical user interface that lets the user enter the characteristics of a new film. The NewMovie program then creates a file called testsinglemovie.txt based on the entered characteristics.

Once the output file is created, moviesbp.m can be run. After the user selects option one, the MLP is trained per the user's instructions. Then, the MLP is tested using the film information contained in testsinglemovie.txt. Finally, moviesbp.m classifies the movie under the classification scheme the user chose earlier and thereby predicts its revenue. Consult classes.txt for a description of the classification schemes.

A screenshot of NewMovie in action is shown in Figure 9. The source code for NewMovie can be found in the VB Source Code section.

Choice 2 of moviesbp.m

Choice 2, or "Simply Train and Test the MLP," allows the user to train and test the dynamic data MLP. It runs much like Professor Hu's bp.m. Neuron weights can be found in the variable w. It also outputs the confusion matrices and classification rates for the training and testing datasets.

Figure 1: DataExtractor Screenshot

Figure 2: DataConcatenator Screenshot

Figure 3: DataConverter Screenshot: Films Removed From Data File

Figure 4: DataConverter Screenshot: Films to be Updated

Figure 5: Results of preanalysis.m


Figure 6: UpdateWizard Screenshot Step 1

Figure 7: UpdateWizard Screenshot Step 2

Figure 8: UpdateWizard Screenshot Step 3

Figure 9: NewMovie Screenshot

Discussion of Results

The classification rates of the dynamic data multi-layer perceptron range from fifty-four to fifty-nine percent. This is roughly a four percent improvement over a similar project performed in the fall of 2001. As discussed in the introduction, predicting the box-office success of a film is difficult. Even so, the MLP classifies films correctly more than half of the time. This is a good result because it is achieved with four classes: if the MLP performed no better than random classification, its classification rate would be around twenty-five percent. This project is a success because I improved upon past results.

Moreover, the most interesting aspect of the project was the UpdateWizard. The MLP can easily be retrained on data that is constantly changing, and the UpdateWizard makes this entire process easy and seamless. It is my hope that over time the wizard will accumulate more and more data, which will cause correct classification rates to improve further. The UpdateWizard's functionality is better than I originally expected. It is fairly easy to use and rarely makes mistakes. As such, this component of the project is a success, especially because no CS/ECE/ME 539 students have attempted such an application of neural networks in the past.

Bibliography

Film Industry:

Rand, Philip. A Guide to the Film Industry. London: Emerald, 2003.

Visual Basic References:

David, Harold. Visual Basic 5 Secrets. Foster City, CA: IDG Books Worldwide, 1997.

Mansfield, Richard. The Visual Guide to Visual Basic for Windows: The Illustrated, Plain-English Encyclopedia to the Windows Programming Language Version 3.0, 2nd Edition. Chapel Hill, NC: Ventana Press, Inc., 1993.

Neural Networks:

Haykin, Simon. Neural Networks: A Comprehensive Foundation, 2nd Edition. Upper Saddle River, NJ: Prentice-Hall, Inc., 1999.

Neelakanta, Perambur S., ed. Information-Theoretic Aspects of Neural Networks. Boca Raton, FL: CRC Press, 1999.