

Indian Regional Movie Dataset for Recommender Systems

Prerna Agarwal∗
IIIT Delhi
[email protected]

Richa Verma∗
IIIT Delhi
[email protected]

Angshul Majumdar
IIIT Delhi
[email protected]

ABSTRACT
The Indian regional movie dataset is the first database of regional Indian movies, users and their ratings. It consists of movies in 18 different Indian regional languages and metadata of users with varying demographics. The dataset captures the diversity of Indian regional cinema and its huge viewership. We analyze the dataset, which contains roughly 10K ratings from 919 users for 2,851 movies, using supervised and unsupervised collaborative filtering techniques such as Probabilistic Matrix Factorization, Matrix Completion and Blind Compressed Sensing. The dataset includes user metadata such as age, occupation, home state and known languages, and movie metadata such as genre, language, release year and cast. India has a wide base of viewers, as is evident from the large number of movies released every year and the huge box-office revenue. This dataset can be used for designing recommendation systems for Indian users and regional movies, which do not yet exist. The dataset can be downloaded from https://goo.gl/EmTPv6.

KEYWORDS
Movie dataset, Recommender System, Metadata, Indian Regional Cinema, Crowd Sourcing, Collaborative Filtering

1 INTRODUCTION
Recommendation systems [17, 18, 19, 20, 33] use user ratings to provide personalized suggestions of items such as movies and products. Popular services that provide such recommendations include Amazon [34], Netflix, IMDb and BarnesAndNoble [35]. Collaborative Filtering (CF) and Content-based (CB) recommendation are two commonly used techniques for building recommendation systems. CF systems [21, 22, 23] operate by gathering user ratings for different items in a given domain and comparing users, their similarities and differences, to determine the items to be recommended. Content-based methods recommend items by comparing representations of the content that a user is interested in with representations of the content of an item.

There are several datasets, such as MovieLens [1] and Netflix [2], that are available for testing and benchmarking recommendation systems. We present an Indian regional movie dataset on similar lines. India has been the largest producer of movies in the world for the last few years, with a lot of diversity in languages and viewers. As per the UNESCO cinema statistics [9], India produces around 1,724 movies every year, with as many as 1,500 movies in Indian regional languages. India's importance in the global film industry is largely due to Bollywood, which is based in Mumbai. With a population of 1.3 billion, India has a huge audience base: there are more than two thousand multiplexes in India, where over 2.2 billion movie tickets were sold in 2016. The box office revenue in India keeps rising every year. Therefore, there is a clear need for a dataset like MovieLens in the Indian context that can be used for testing and benchmarking recommendation systems for Indian viewers. As of now, no recommendation system exists for Indian regional cinema that can tap into the rich diversity of such movies and provide regional movie recommendations for interested audiences.

∗ Prerna Agarwal (now working with IBM Research) and Richa Verma (now working with TCS Research) have equal contribution.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Conference'17, Washington, DC, USA
© 2016 Copyright held by the owner/author(s). 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
DOI: 10.1145/nnnnnnn.nnnnnnn

1.1 Motivation
As of now, the Netflix and MovieLens datasets do not have a comprehensive listing of regional productions, as the clipping in Figure 1 (borrowed from [39]) shows. Therefore, a substantial source of data is needed that covers movies of various regions, languages and genres, encompasses a wider folklore, and provides this data in a format suitable for building and benchmarking recommendation systems.

To capture the diversity of Indian regional cinema, popular websites like Netflix are trying to shift their focus towards it [36, 37]. The goal is to bring some of the greatest stories from Indian regional cinema to a global platform. Through this, viewers are exposed to a wide variety of new and diverse stories from India. As a result of this initiative, Indian regional cinema will be available across countries. Building a recommendation system using a dataset of such movies and their audience can prove useful in such situations. Here, we present such a dataset, which is the first of its kind.

1.2 Contributions
• Web portal for data collection: A web portal where a user can sign up by filling in details like email, date of birth, gender, home town, languages known and occupation. The user can then rate movies as like/dislike.

• Indian Regional Cinema Dataset: It is the first dataset of Indian Regional Cinema which contains ratings by users for different regional movies along with user and movie metadata. User metadata is collected while signing up on the portal. Movie metadata consists of genre, release year, description, language, writer, director, cast and IMDb rating.

• Detailed analysis of the dataset using some supervised and unsupervised Collaborative Filtering techniques.

arXiv:1801.02203v1 [cs.IR] 7 Jan 2018


Figure 1: Amazon and Netflix: Focus on Regional Films

2 RELATED WORK
MovieLens [1] is a web-based portal that recommends movies. It uses the film preferences of its users, collected in the form of movie ratings and preferred genres, and then applies collaborative filtering techniques to make movie recommendations. The Department of Computer Science and Engineering at the University of Minnesota houses a research lab known as GroupLens Research, which created MovieLens in 1997 [38]. The Indian Regional Cinema dataset is inspired by MovieLens. The primary goal was to collect data for research on personalized recommendations. MovieLens released three datasets for testing recommendation systems: the 100K, 1M and 10M datasets. A 20M dataset was released as well in 2016. In these datasets, users and movies are represented with integer IDs, while ratings are given in stars up to 5 (whole stars in the 100K and 1M datasets, half-star steps in the 10M and 20M datasets).

Netflix released a training dataset for their contest, the Netflix Prize [8], which consists of about 100,000,000 ratings for 17,770 movies given by 480,189 users. Each rating in the training dataset consists of four entries: user, movie, date of grade, grade. Users and movies are represented with integer IDs, while ratings range from 1 to 5.

These datasets largely cover Hollywood movies and TV series, and their viewers. They are not designed for user communities that are inclined towards watching Indian regional cinema.

From the viewpoint of recommender systems, there has been a lot of work that uses user ratings for items, together with metadata, to predict users' likes and dislikes for other items [4, 5, 6, 11]. Many unsupervised and supervised collaborative filtering techniques have been proposed and benchmarked on the MovieLens datasets. In this paper, we use a simple technique, user-user similarity, to establish a baseline, and then apply more advanced techniques, namely Blind Compressed Sensing, Probabilistic Matrix Factorization, Matrix Completion and Supervised Matrix Factorization, to our dataset to provide benchmarking results. These techniques are chosen over others because they have been shown to provide better accuracy in recent works [6].

3 INDIAN REGIONAL MOVIE DATASET
This is the first dataset of Indian regional cinema. It covers movies in 18 different regional languages and a variety of user ratings for such movies. It consists of roughly 10K ratings given by 919 users with varying demographics to 2,851 movies of different genres.

3.1 Metadata Information
The data for movies has been scraped from IMDb [3]. IMDb has a collection of Indian movies spanning multiple Indian regional languages and genres. Each movie is associated with the following metadata.

• Movie id: Each movie has a unique id for its representation.
• Description: Description of the movie for users.
• Language: Language(s) used in the movie. A movie may have been released in multiple regional languages. The distribution is shown in Table 1.
• Release date: Date of release of the movie.
• Rating count: The number of ratings as per IMDb, used to judge the popularity of the movie.
• Crew: Director, writer and cast of the movie.
• Genre: Movie genre. It can be one or multiple out of the 20 genres available on IMDb. The number of movies for each genre is shown in Figure 2.

Table 1: Language Distribution

Language      Movie Count   User Count
Hindi         615           902
Bengali       582           28
Assamese      22            9
Tamil         313           30
Nepali        51            9
Punjabi       150           78
Rajasthani    18            14
Malayalam     346           16
Bhojpuri      26            21
Kannada       303           11
Haryanvi      3             18
Manipuri      8             4
Urdu          129           23
Marathi       204           14
Telugu        338           18
Oriya         98            6
Gujarati      49            7
Konkani       6             4


Figure 2: Genre Distribution

For better recommendations, it is important to include the factors which influence user ratings the most. The following metadata is collected for each user:

• User id: Each user has a unique id for its representation.
• Languages: The languages known by the user. The user count per language is shown in Table 1.
• State: The state of India that the user belongs to. The region-wise distribution is shown in Figure 3.
• Age: The date of birth of the user is taken as input to calculate the age of the user. The distribution is shown in Figure 4.
• Gender: Denotes the gender of the user. The gender distribution of the data is shown in Figure 5.
• Occupation: Denotes the occupation of the user. It can be any one of student, self-employed, service, retired and others. Its distribution is shown in Figure 6.

Figure 3: Region-wise Distribution

Figure 4: Age Distribution

Figure 5: Gender Distribution


Table 2: Comparison of Datasets

Dataset          Movie Count   User Count   Rating Count   Sparsity (%)   Release Year
Our Dataset      2851          919          10,000         99.96          2017
Movielens 100K   1700          1000         100,000        99.94          4/1998
Movielens 1M     6000          4000         1,000,000      99.96          2/2003
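As a rough illustration of how sparsity can be derived from the counts in Table 2, the sketch below assumes the common definition, sparsity = 1 − ratings / (users × movies). The function name is hypothetical, and the percentages reported in Table 2 may follow a different convention, so the printed values need not match the table.

```python
def sparsity_percent(num_movies: int, num_users: int, num_ratings: int) -> float:
    """Percentage of user-movie cells that carry no rating (assumed definition)."""
    total_cells = num_movies * num_users
    return 100.0 * (1.0 - num_ratings / total_cells)


# Counts copied from Table 2: (movies, users, ratings).
datasets = {
    "Our Dataset": (2851, 919, 10_000),
    "Movielens 100K": (1700, 1000, 100_000),
    "Movielens 1M": (6000, 4000, 1_000_000),
}

for name, counts in datasets.items():
    print(f"{name}: {sparsity_percent(*counts):.2f}% of cells are empty")
```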

Figure 6: Occupation Distribution

3.2 MovieLens vs Indian Regional Cinema Dataset

The key difference between the presented dataset and MovieLens is that the latter contains hardly any movies from Indian regional cinema: MovieLens has only a few Hindi and Urdu movies. Also, our data has been collected mainly from viewers of regional movies in India. The user metadata collected in this way can be used to recommend more relevant movies to such audiences.

The MovieLens datasets are also biased towards a certain category of users. They contain data only from users who have rated at least twenty movies. The datasets do not include data from users who could not find enough movies to rate or did not find the system easy enough to use. There may be a fundamental difference between such users and the users in the datasets. Our dataset makes no such distinction among users based on the number of movies that they have rated.

To make rating multiple movies easier for a user, we use binary ratings: a user can either "like" or "dislike" a movie, denoted by "1" and "-1" in the dataset. MovieLens, on the other hand, uses multi-level star ratings of up to 5 stars (a 10-point scale in half-star steps in its later releases). A basic comparison of these datasets is shown in Table 2. The table lists the number of users, movies and ratings, the release year and the sparsity of each dataset.

3.3 Dataset Collection
For the collection of user information and movie ratings, a web portal named Flickscore [15] was created, where users can sign up by filling in all details as shown in Figure 7. The user has to provide preferred languages so that the portal can ask the user to rate movies in those languages. While signing up, the user is prompted to fill in the metadata information. The user can then log in to the portal to rate movies as either like or dislike, and the responses are recorded as shown in Figure 8.

Figure 7: Sign Up form on Portal

Figure 8: Rating Movies on Portal


4 UNSUPERVISED COLLABORATIVE FILTERING TECHNIQUES

To analyze the dataset, several unsupervised techniques are used: user-user similarity, item-item similarity, Matrix Factorization, Probabilistic Matrix Factorization, Blind Compressed Sensing and Matrix Completion. The main advantages of such techniques are their ease of implementation and their incremental nature. On the other hand, they depend entirely on ratings supplied by users, so their performance decreases as the data becomes sparser. For the same reason, these techniques cannot address the cold start problem, i.e., the case where a new user or item is added to the dataset and no ratings are available for it yet. Bias correction is performed on the dataset by calculating the global mean, user bias and item bias, and then the above techniques are used to predict the rating of a user for an item.
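The bias correction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: biases are taken as plain averages of the remaining residuals (global mean first, then user bias, then item bias), and the function name `bias_correct` and the toy matrices are hypothetical.

```python
import numpy as np


def bias_correct(R, mask):
    """Split observed ratings into global mean, user bias, item bias and residual.

    R    : (n_users, n_items) array of ratings; unobserved cells may hold any value.
    mask : same shape, 1 where a rating is observed and 0 elsewhere.
    """
    mu = (R * mask).sum() / mask.sum()                      # global mean of observed ratings
    user_cnt = np.maximum(mask.sum(axis=1), 1)              # guard users with no ratings
    b_u = ((R - mu) * mask).sum(axis=1) / user_cnt          # user bias
    item_cnt = np.maximum(mask.sum(axis=0), 1)              # guard items with no ratings
    b_i = ((R - mu - b_u[:, None]) * mask).sum(axis=0) / item_cnt  # item bias
    residual = (R - mu - b_u[:, None] - b_i[None, :]) * mask
    return mu, b_u, b_i, residual


# Toy like/dislike matrix (1 / -1); the zero in the first row is an unobserved cell.
R = np.array([[1.0, -1.0, 0.0],
              [1.0, 1.0, 1.0]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
mu, b_u, b_i, residual = bias_correct(R, mask)
```

A model is then fitted to `residual`, and a final prediction adds `mu`, the user bias and the item bias back to the model output.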

4.1 User and Item-based similarity
In the user-user model, a similarity matrix $A$ is calculated in which each entry $A_{ij}$ is the cosine similarity score between user $i$ and user $j$. The higher the score, the more similar the two users are. Similarly, in the item-item model, each entry $A_{ij}$ of the similarity matrix $A$ is the cosine similarity score between item $i$ and item $j$; the higher the score, the more similar the two items are.

Cosine similarity for two users $u$ and $u'$ is calculated as:

$$S_{u,u'} = \frac{\sum_{\forall j} r_{u,j}\, r_{u',j}}{\sqrt{\sum_{\forall j} r_{u,j}^{2}\, \sum_{\forall j} r_{u',j}^{2}}} \tag{1}$$

where $r_{i,j}$ denotes the rating by the $i$th user for the $j$th item. The prediction for the $u$th user and the $j$th item is computed as:

$$\hat{r}_{u,j} = \sum_{u' \in N(u),\ j \in R(u')} w_{u,u'}\, r_{u',j} \tag{2}$$

where $w_{u,u'}$ is the normalized similarity weight and $r_{u',j}$ is the rating by user $u'$ for the $j$th item. Here $N(u)$ is the set of neighbours of user $u$ and $R(u')$ is the set of items rated by user $u'$.

Item-item similarity is calculated analogously, by computing the cosine similarity between two items, and ratings are predicted in the same way.
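Equations (1) and (2) can be sketched in a few lines of NumPy. The code below is an illustrative assumption rather than the authors' implementation: unrated cells are stored as 0, the neighbourhood $N(u)$ is taken to be the `k` most similar users who rated the item, and weights are normalized by the sum of their absolute values. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(R):
    """Pairwise cosine similarity between the rows (users) of R, as in Eq. (1)."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0            # avoid division by zero for users without ratings
    unit = R / norms[:, None]
    return unit @ unit.T


def predict_user_user(R, S, u, j, k=2):
    """Predict user u's rating of item j from the k most similar users who rated j (Eq. (2))."""
    raters = np.flatnonzero(R[:, j] != 0)   # users who rated item j
    raters = raters[raters != u]
    if raters.size == 0:
        return 0.0                          # no neighbour information available
    top = raters[np.argsort(-S[u, raters])[:k]]
    weights = S[u, top]
    denom = np.abs(weights).sum()
    if denom == 0:
        return 0.0
    return float(weights @ R[top, j] / denom)   # normalized similarity-weighted sum


# Toy like/dislike matrix: rows are users, columns are movies, 0 means "not rated".
R = np.array([[1.0, 1.0, -1.0, 0.0],
              [1.0, 1.0, -1.0, 1.0],
              [-1.0, -1.0, 1.0, -1.0]])
S = cosine_similarity_matrix(R)
prediction = predict_user_user(R, S, u=0, j=3)
```

In this toy example user 0 agrees with user 1 and disagrees with user 2, so both neighbours push the prediction for the unrated movie towards "like". Applying the same functions to `R.T` gives the item-item variant.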

4.2 Matrix factorization
There are some hidden traits (latent factors) behind users' likes and dislikes, which may be reflected in the pattern of their ratings. This model maps users and movies to a joint latent factor space. Each item $i$ is associated with a vector $q_i$ and each user $u$ with a vector $p_u$, which measure the extent to which the item possesses, or the user prefers, those factors. The dot product $q_i^{T} p_u$ captures the liking of a user for a specific item and approximates the rating $r_{ui}$ [10]. Computing the mapping of each user and item to factor vectors is the major challenge. Imputing the missing ratings can prove expensive, as it noticeably increases the amount of data involved in the calculation. Instead, the observed ratings are modeled directly, with regularization, using the following equation:

$$\min_{q^{*},\, p^{*}} \sum_{(u,i) \in k} \left(r_{ui} - q_i^{T} p_u\right)^{2} + \lambda \left(\|q_i\|^{2} + \|p_u\|^{2}\right) \tag{3}$$

Here, $k$ is the set of user-item pairs in the training set for which $r_{ui}$ is known.

The system fits a model to the already observed ratings and uses that model to predict new ratings.

The intuition behind using matrix factorization to analyze this dataset is that there should be some latent features that determine how a user rates an item. For example, two users may give high ratings to a certain movie if they both like the actors/actresses of the movie, or if the movie is an action movie, which is a genre preferred by both users. Hence, if we can discover these latent features, we should be able to predict the rating given by a certain user to a certain item, because the features associated with the user should match the features associated with the item.
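One common way to minimize an objective of the form of Eq. (3) is stochastic gradient descent over the observed ratings. The sketch below is an assumption for illustration only: the paper does not state which solver was used, and the function names, hyperparameter values and toy ratings are hypothetical.

```python
import numpy as np


def factorize_sgd(ratings, n_users, n_items, rank=2, lam=0.05, lr=0.05, epochs=200, seed=0):
    """Fit user factors P and item factors Q to (user, item, rating) triples by SGD."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, rank))   # p_u, one row per user
    Q = 0.1 * rng.standard_normal((n_items, rank))   # q_i, one row per item
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - Q[i] @ P[u]                    # r_ui - q_i^T p_u
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - lam * P[u])   # gradient step on p_u
            Q[i] += lr * (err * p_old - lam * Q[i])  # gradient step on q_i
    return P, Q


def train_rmse(ratings, P, Q):
    """Root mean squared error of q_i^T p_u on the training triples."""
    errors = [r - Q[i] @ P[u] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# Toy like/dislike triples: (user, item, rating).
ratings = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, 1.0),
           (1, 2, -1.0), (2, 1, 1.0), (2, 2, 1.0)]
P, Q = factorize_sgd(ratings, n_users=3, n_items=3)
print("training RMSE:", train_rmse(ratings, P, Q))
```

A predicted rating for an unseen pair `(u, i)` is then `Q[i] @ P[u]`, which can be thresholded at zero to obtain a like/dislike label.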

4.3 Probabilistic Matrix Factorization
Probabilistic Matrix Factorization (PMF) can handle large datasets because it scales linearly with the number of observations in the dataset. Let $R_{ij}$ represent the rating of user $i$ for movie $j$. Let $U$ and $V$ be the latent feature matrices for users and movies, respectively. Their column vectors $U_i$ and $V_j$ represent the user-specific and movie-specific latent feature vectors [4]. Maximizing the log posterior over movie and user features, with the hyperparameters kept fixed, is equivalent to minimizing the following objective:

$$\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} \left(R_{ij} - U_i^{T} V_j\right)^{2} + \frac{\lambda_u}{2} \sum_{i=1}^{N} \|U_i\|_{\mathrm{Frob}}^{2} + \frac{\lambda_v}{2} \sum_{j=1}^{M} \|V_j\|_{\mathrm{Frob}}^{2} \tag{4}$$

where $I_{ij}$ is 1 if user $i$ has rated movie $j$ and 0 otherwise, and $\lambda_u$ and $\lambda_v$ are the regularization parameters for users and items, respectively. A local minimum of the objective can be computed by gradient descent in $U$ and $V$. The model performance is measured by computing the mean absolute error (MAE) and root mean squared error (RMSE) on the test set.

This model can be viewed as a probabilistic extension of the SVD model: if all ratings have been observed, the objective in Eq. (4) reduces to the SVD objective in the limit of prior variances going to infinity. This technique better addresses the sparsity and scalability problems and thus improves prediction performance. It also gives an intuitive rationale for recommendation.
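The objective in Eq. (4) and its minimization by gradient descent in $U$ and $V$ can be sketched as below. This is a minimal full-batch illustration under assumed settings (rank, step size, number of steps and the toy rating matrix are all hypothetical), not the configuration used for the reported results.

```python
import numpy as np


def pmf_objective(R, I, U, V, lam_u, lam_v):
    """Value of Eq. (4); columns of U and V are the latent vectors U_i and V_j."""
    err = I * (R - U.T @ V)
    return (0.5 * np.sum(err ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))


def pmf_fit(R, I, rank=2, lam_u=0.1, lam_v=0.1, lr=0.02, steps=1000, seed=0):
    """Full-batch gradient descent on Eq. (4)."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = R.shape
    U = 0.1 * rng.standard_normal((rank, n_users))
    V = 0.1 * rng.standard_normal((rank, n_movies))
    for _ in range(steps):
        err = I * (R - U.T @ V)              # residuals on observed cells only
        grad_U = -V @ err.T + lam_u * U      # derivative of Eq. (4) w.r.t. U
        grad_V = -U @ err + lam_v * V        # derivative of Eq. (4) w.r.t. V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V


# Toy like/dislike matrix; 0 marks an unobserved rating.
R = np.array([[1.0, -1.0, 0.0, 1.0],
              [1.0, 0.0, -1.0, 1.0],
              [0.0, 1.0, 1.0, -1.0]])
I = (R != 0).astype(float)                   # indicator I_ij
U, V = pmf_fit(R, I)
final_objective = pmf_objective(R, I, U, V, 0.1, 0.1)
```

Each gradient step touches only the observed cells through `I`, which is why the cost per step grows with the number of observations rather than with the full matrix size when a sparse representation is used.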

4.4 Blind Compressed Sensing
Assuming that both the user and the item latent factor matrices are dense is not reasonable. Each user will like or dislike every trait to a certain extent [24], so the user matrix can be dense. However, an item will possess only a few of the attributes and never all of them. Hence, the item matrix will ideally have a sparse structure rather than a dense one as formulated in earlier works.

The objective of this approach is to find the user and item latent factor matrices. The user latent factor matrix can be dense, but the same does not logically follow for the item latent factor matrix. Exploiting the sparsity of the item latent factor matrix increases the recommendation accuracy significantly [5]. The following objective is minimized:

$$\min_{U,V} \; \|Y - A(UV)\|^{2} + \lambda_u \|U\|_{\mathrm{Frob}} + \lambda_v \|V\|_{\mathrm{Frob}} \tag{5}$$

where $\lambda_u$ and $\lambda_v$ are the regularization parameters for users and items, respectively, $A$ is the binary mask matrix and $Y$ is the rating matrix. $U$ and $V$ are the user latent matrix and the item latent matrix, respectively, both of which were assumed to be dense in earlier models.


Table 3: Unsupervised Techniques

                                     Movielens 100K     Our Dataset        Movielens 1M
Technique                            MAE      RMSE      MAE      RMSE      MAE      RMSE
User-User similarity                 0.6980   1.026     0.5307   1.03      0.607    0.8810
Item-Item similarity                 0.744    1.061     0.648    1.049     0.671    0.9196
Matrix Factorization                 0.828    1.128     0.471    0.971     0.6863   0.8790
Probabilistic Matrix Factorization   0.7564   0.9639    0.481    0.9372    0.7241   0.9127
Blind Compressed Sensing             0.7356   0.9409    0.463    0.9612    0.6917   0.8789
Matrix Completion                    0.8324   1.102     0.4827   0.9264    0.7196   0.9102

Table 4: Supervised Technique

                                     Movielens 100K     Our Dataset        Movielens 1M
Technique                            MAE      RMSE      MAE      RMSE      MAE      RMSE
Supervised Matrix Factorization      0.7199   0.9196    0.4367   0.9283    0.6709   0.8567

4.5 Matrix Completion
Matrix completion involves filling in the missing entries of a partially observed matrix. It aims to compute the matrix with the lowest rank or, if the rank $r$ of the completed matrix is known, a matrix of rank $r$ that matches the known entries. A popular approach for solving the problem is nuclear-norm-regularized (NN) matrix completion [7], as shown in the following equation:

$$\min_{R} \; \|B - R\|_{\mathrm{Frob}}^{2} + \lambda \|R\|_{NN} \tag{6}$$

where

$$B = R_{k-1} + M^{T}\left(Y - M \cdot R_{k-1}\right) \tag{7}$$

and $M$ is the binary mask, $R$ is the imputed rating matrix, $R_{k-1}$ is its estimate from the previous iteration, and $Y$ is the original rating matrix.
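Equations (6) and (7) suggest an iterative scheme: form $B$ from the previous estimate, then solve Eq. (6) for the new estimate. A standard closed-form solution of Eq. (6) is soft-thresholding of the singular values of $B$ at $\lambda/2$. The sketch below assumes this scheme and treats the mask as an elementwise 0/1 matrix, so that the same elementwise product plays the role of both $M$ and $M^{T}$. The function names, the value of `lam` and the toy data are hypothetical.

```python
import numpy as np


def soft_threshold_svd(B, tau):
    """Shrink the singular values of B by tau; this minimizes Eq. (6) for tau = lam / 2."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def complete_matrix(Y, M, lam=0.1, iters=200):
    """Alternate between Eq. (7) and the solution of Eq. (6)."""
    R = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        B = R + M * (Y - M * R)              # Eq. (7) with an elementwise binary mask
        R = soft_threshold_svd(B, lam / 2.0) # Eq. (6)
    return R


# Rank-one toy like/dislike matrix with two hidden entries.
Y_full = np.outer([1.0, -1.0, 1.0], [1.0, 1.0, -1.0, 1.0])
M = np.ones_like(Y_full)
M[0, 3] = 0.0
M[2, 1] = 0.0
Y = M * Y_full                               # unobserved cells are stored as 0
R_hat = complete_matrix(Y, M)
```

On observed cells `B` equals `Y`, and singular value shrinkage changes any entry by at most `lam / 2`, so the completed matrix stays close to the known ratings while the hidden cells are filled by the low-rank structure.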

5 SUPERVISED COLLABORATIVE FILTERING TECHNIQUES

To analyze the dataset, a supervised technique, supervised Matrix Factorization, is also used. The main advantage of supervised methods is that they can make predictions when a new user or a new item comes in, which unsupervised techniques fail to do [28, 29, 30, 31]. This situation is known as the cold start problem. Supervised methods are scalable and depend on the metadata of users and items, which lets them relate users and items more reliably and give more accurate predictions. Bias correction is performed on the dataset by calculating user bias and item bias, and then the above technique is used to calculate the rating of a new user for an item.

5.1 Supervised Matrix Factorization
Predicting ratings is difficult largely because of the sparsity of the ratings available in the database of a recommender system. Therefore, using knowledge about user demographics and item categories can enhance prediction accuracy [25, 26, 27, 32]. Classes are formed according to the user's age group, gender and occupation. A user can belong to multiple classes at a time. Class label information is used to learn the latent factor vectors of users and movies in a supervised setting, in such a way that they are consistent with the available class labels. The class label information adds constraints that reduce the search space, which makes the problem less under-determined.

Mathematically, within the matrix factorization framework, the additional information from user metadata and item metadata can be used alongside the user latent matrix $U$ and the item latent matrix $V$, and the following objective can be minimized [6]:

$$\min_{U,V,A,C} \; \|Y - M(UV)\|_{\mathrm{Frob}} + \lambda_u \|U\|_{\mathrm{Frob}} + \lambda_v \|V\|_{\mathrm{Frob}} + \mu_u \|W - UC\|_{\mathrm{Frob}} + \mu_v \|Q - AV\|_{\mathrm{Frob}} \tag{8}$$

where $W_{ij} = 1$ if user $i$ belongs to class $j$ and 0 otherwise, $C$ is the linear map from the latent factor space to the classification domain for users, and $A$ is the corresponding linear map for items. $Q$ is the class information matrix for items, created similarly to $W$. The other variables have their usual meanings. Introducing supervised learning into the latent factor model helps to improve the prediction accuracy by reducing the problem of rating matrix sparsity. The values of the regularization parameters are determined using the L-curve technique [16].
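To make the class matrix $W$ concrete, the sketch below builds it from user metadata so that each user has a 1 in the column of every class they belong to. The occupation categories come from Section 3.1; the age bins, the gender labels and the function names are hypothetical choices made only for this illustration, since the exact class definitions are not given in the text.

```python
import numpy as np

OCCUPATIONS = ["student", "self-employed", "service", "retired", "others"]  # from Section 3.1
GENDERS = ["female", "male"]                     # hypothetical labels
AGE_GROUPS = ["<18", "18-35", "36-60", ">60"]    # hypothetical bins

CLASSES = AGE_GROUPS + GENDERS + OCCUPATIONS
CLASS_INDEX = {label: col for col, label in enumerate(CLASSES)}


def age_group(age):
    """Map an age in years to one of the hypothetical age bins."""
    if age < 18:
        return "<18"
    if age <= 35:
        return "18-35"
    if age <= 60:
        return "36-60"
    return ">60"


def build_class_matrix(users):
    """W[i, j] = 1 if user i belongs to class j, else 0."""
    W = np.zeros((len(users), len(CLASSES)))
    for i, (age, gender, occupation) in enumerate(users):
        for label in (age_group(age), gender, occupation):
            W[i, CLASS_INDEX[label]] = 1.0
    return W


# Three made-up users: (age, gender, occupation).
users = [(21, "female", "student"), (45, "male", "service"), (67, "male", "retired")]
W = build_class_matrix(users)
```

Each row of `W` here contains exactly three ones, one per metadata field, which reflects the statement that a user can belong to multiple classes at a time. The item class matrix $Q$ could be built in the same way from genre and language.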

6 EXPERIMENTS AND RESULTS
Three different datasets are used to compare the results of the supervised and unsupervised collaborative filtering techniques used to predict user ratings: MovieLens 100K, MovieLens 1M and our dataset of Indian regional movies. For error calculation, the mean absolute error (MAE) and root mean squared error (RMSE) are calculated between the actual ratings and the predicted ratings. The datasets are divided into 5 folds for evaluation. The ratings are binarized into like/dislike (1/-1) labels for the experiments. Results of the different techniques on these datasets are shown in Table 3 and Table 4.

As the values in Table 3 indicate, the basic cosine similarity measures between users and movies perform fairly well on all datasets. The minimum MAE values result from the experiments using our dataset. Since the sparsity of the regional cinema dataset and the MovieLens 1M dataset is very high (as indicated in Table 2), techniques like Probabilistic Matrix Factorization and Blind Compressed Sensing perform better than the basic similarity measures, and among them the least MAE is again obtained on our dataset.

To use metadata information, it is encoded as one-hot vectors of 1's and 0's; in the case of languages, multiple 1's can be present in the vector. Since the supervised technique uses both user and item metadata, it outperforms the unsupervised collaborative filtering techniques. Among all three datasets, the minimum MAE is obtained on our dataset. This shows that the Indian Regional Cinema dataset can prove useful for building and benchmarking recommendation systems in the Indian context, which has the most diverse languages and demographics.
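The evaluation protocol described above (5 folds, MAE and RMSE between actual and predicted ratings) can be sketched as follows. The splitting strategy, the random seed and the global-mean baseline predictor are assumptions made for illustration; the text does not specify how the folds were drawn, and all function names here are hypothetical.

```python
import numpy as np


def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))


def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2)))


def k_fold_indices(n_ratings, n_folds=5, seed=0):
    """Shuffle rating indices and split them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_ratings), n_folds)


def cross_validate(ratings, predict_fn, n_folds=5, seed=0):
    """Average MAE and RMSE over the folds; predict_fn(train_ratings, n_test) -> predictions."""
    ratings = np.asarray(ratings, dtype=float)
    folds = k_fold_indices(len(ratings), n_folds, seed)
    maes, rmses = [], []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        predicted = predict_fn(ratings[train_idx], len(test_idx))
        maes.append(mae(ratings[test_idx], predicted))
        rmses.append(rmse(ratings[test_idx], predicted))
    return float(np.mean(maes)), float(np.mean(rmses))


def global_mean_baseline(train_ratings, n_test):
    """Trivial predictor: every test rating is predicted as the training mean."""
    return np.full(n_test, train_ratings.mean())


# Made-up like/dislike labels standing in for a rating list.
labels = np.array([1, 1, -1, 1, -1, 1, 1, -1, 1, 1], dtype=float)
avg_mae, avg_rmse = cross_validate(labels, global_mean_baseline)
```

Any of the techniques in Sections 4 and 5 could be plugged in through a `predict_fn` that also receives the user and item indices of each rating; the baseline above only demonstrates the fold handling and the two error measures.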

7 CONCLUSION AND FUTURE WORK
India is a country in which not only are many languages spoken, but the demographics of the population are also very diverse. Indian regional cinema is therefore highly diverse in terms of both the number of languages and the demographics of its viewers. Thousands of such movies are produced annually, and there is a huge community of people who watch them. A recommender system for Indian regional movies is therefore needed to address the preferences of the growing number of their viewers. This dataset has around 10K ratings by Indian users, along with their demographic information. We believe that this dataset could be used to design, improve and benchmark recommendation systems for Indian regional cinema. We plan to release the dataset after its publication. We further want to release another version of this dataset with more ratings and users, which will help to improve the current state of recommender systems for the Indian audience.

REFERENCES
[1] MovieLens: https://movielens.org/
[2] Netflix: https://www.netflix.com/in/
[3] IMDb: http://www.imdb.com/
[4] A. Mnih and R. Salakhutdinov, Probabilistic matrix factorization, in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 1257-1264.
[5] A. Gogna and A. Majumdar. (2015). Blind compressive sensing framework for collaborative filtering. Available: http://arxiv.org/abs/1505.01621
[6] A. Gogna and A. Majumdar, A Comprehensive Recommender System Model: Improving Accuracy for Both Warm and Cold Start Users, in IEEE Access, vol. 3, pp. 2803-2813, 2015.
[7] T. Hastie, R. Mazumder, J. Lee, R. Zadeh, Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares, https://arxiv.org/abs/1410.2596.
[8] Netflix Prize: http://www.netflixprize.com/
[9] http://uis.unesco.org/en/news/record-number-films-produced
[10] Yehuda Koren, Matrix Factorization Techniques for Recommender Systems, Published by the IEEE Computer Society, August 2009.
[11] Xiaoyuan Su and Taghi M. Khoshgoftaar, A survey of collaborative filtering techniques, in Advances in artificial intelligence, vol. 2009, pp. 4, 2009.
[12] Thomas Hofmann, Latent semantic models for collaborative filtering, in ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 89-115, 2004.
[13] Gediminas Adomavicius and Alexander Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, in Knowledge and Data Engineering, IEEE Transactions on, vol. 17, no. 6, pp. 734-749, 2005.
[14] Sivan Gleichman and Yonina C Eldar, Blind compressed sensing, in Information Theory, IEEE Transactions on, vol. 57, no. 10, pp. 6958-6975, 2011.
[15] Flickscore: http://flickscore.iiitd.edu.in
[16] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, vol. 161. Englewood Cliffs, NJ, USA: Prentice-Hall, 1974.
[17] M. F. Hornick and P. Tamayo, Extending recommender systems for disjoint user/item sets: The conference recommendation problem, IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1478-1490, Aug. 2012.
[18] Q. Liu, E. Chen, H. Xiong, Y. Ge, Z. Li and X. Wu, A cocktail approach for travel package recommendation, IEEE Trans. Knowl. Data Eng., vol. 26, no. 2, pp. 278-293, Feb. 2014.
[19] Y. Koren and R. Bell, Advances in collaborative filtering, in Recommender Systems Handbook. New York, NY, USA: Springer, 2011, pp. 145-186.

[20] X. Su and T. M. Khoshgo�aar, A survey of collaborative �ltering techniques, Adv.Artif. Intell., vol. 2009, Jan. 2009, Art. ID 4.

[21] R. M. Bell and Y. Koren, Improved neighborhood-based collaborative �ltering, inProc. KDD-Cup Workshop, 2007, pp. 7-14.

[22] C.Desrosiers, and G.Karypis, A comprehensive survey of neighborhood basedrecommendation methods, in Recommender Systems Handbook. New York, NY,USA: Springer, 2011, pp. 107-144.

[23] J. Wang, A. P. de Vries, and M. J. T. Reinders, Unifying user-based and item-basedcollaborative �ltering approaches by similarity fusion, in Proc. 29th Annu. Int.ACM SIGIR Conf. Res. Develop. Inf. Retr., 2006, pp. 501-508.

[24] S.Gleichman and Y.C.Eldar, Blind compressed sensing,IEEETrans. Inf. �eory, vol.57, no. 10, pp. 6958-6975, Oct. 2011.

[25] H.Ma, D.Zhou, C.Liu, M.R.Lyu, and I.King, Recommender systems with socialregularization, in Proc. 4th ACM Int. Conf. Web Search Data Mining, 2011, pp.287-296.

[26] P. Massa and P. Avesani, Trust-aware recommender systems, in Proc.ACM Conf.Recommender Syst., 2007, pp. 17-24.

[27] X.Tang, Y.Xu, and S.Geva, Learning higher order interactions for user and item pro-�ling based on tensor factorization, in Proc. 20th Int. Conf. Intell. User Interfaces,2015, pp. 213-224.

[28] Q.Gu, J.Zhou, and C.Ding,Collaborative �ltering:Weighted non negative matrixfactorization incorporating user and item graphs, in Proc. SDM, 2010, pp. 199-210.

[29] S.-T. Park, D. Pennock, O. Madani, N. Good, and D. DeCoste, Naive �lterbots forrobust cold-start recommendations, in Proc. 12th ACM SIGKDD Int. Conf. Knowl.Discovery Data Mining, 2006, pp. 699-705.

[30] A. L. V. Pereira and E. R. Hruschka, Simultaneous co-clustering and learning toaddress the cold start problem in recommender systems, Knowl.-Based Syst., vol.82, pp. 11-19, Jul. 2015.

[31] Z.-K. Zhang, C. Liu, Y.-C. Zhang, and T. Zhou, Solving the cold-start problemin recommender systems with social tags, EPL (Europhys. Le�.), vol. 92, no. 2, p.28002, Nov. 2010.

[32] A.-T. Nguyen, N. Denos, and C. Berrut, Improving new user recommendationswith rule-based induction on cold user data, in Proc. ACM Conf. RecommenderSyst., 2007, pp. 121-128.

[33] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommendersystems: A survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl.Data Eng., vol. 17, no. 6, pp. 734-749, Jun. 2005.

[34] Amazon: h�p://www.amazon.in[35] BarnesAndNoble: h�ps://www.barnesandnoble.com[36] h�ps://www.clickondetroit.com/entertainment/

net�ix-steps-up-its-ba�le-with-amazon-in-india[37] h�p://www.�nancialexpress.com/industry/technology/

the-desi-content-ba�leground/798145/[38] h�ps://grouplens.org/datasets/movielens/[39] h�p://timeso�ndia.indiatimes.com/companies/

amazon-to-join-bollywood-�lm-industry-hires-consultants-to-create-a-blueprint\-for-a-hindi-�lm-studio/articleshow/58176571.cms