
An Efficient Collaborative Recommender System based on k-separability

Georgios Alexandridis, Georgios Siolas, Andreas Stafylopatis

Department of Electrical and Computer Engineering
National Technical University of Athens

20th International Conference on Artificial Neural Networks (ICANN 2010)


Outline

1. Current Trends in Recommender Systems: Recommender Systems; Design Issues

2. Theoretical & Practical Aspects of our Contribution: k-Separability; System Architecture

3. Evaluating our System: Experiment; Results; Conclusions


What are Recommender Systems?

Recommender Systems attempt to present information items (e.g. movies, music, books, news stories) that are likely to be of interest to the user.

Some implementations

- Amazon: "Customers Who Bought This Item Also Bought"

- Google News: "Recommended Stories"

- Online Audio Broadcasters: last.fm, Pandora


Taxonomy of Recommender Systems

Criterion: How are the predictions made?

- Content-Based Recommenders: locate "similar" items
- Collaborative Recommenders: find "like-minded" users
- Hybrid Recommenders: a combination of the two

Which method is the best?

- An open academic question
- Highly dependent on the application domain
- We followed the Collaborative Recommender approach
  - Computationally simpler than the Hybrid approach
  - A user rating is more than a mere number; it is an aggregation of various characteristics


Collaborative Recommender Systems

Key Component: The User Ratings Matrix

Ratings indicate how much a user likes an item

- "like"/"dislike"
- 1 star up to 5 stars

An example sparse ratings matrix over items I1-I4 (each user leaves some items unrated): U1 rated three items (5, 3, 2); U2 rated three (3, 5, 2); U3 rated two (1, 2); U4 rated two (2, 3).

Users become each other's predictors, by locating positive and negative correlations among them (a minimal sketch of such a correlation follows).
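Below is a minimal Python sketch of that idea, assuming the convention that 0 marks an unrated item and that the correlation is computed only over co-rated items; the placement of the example matrix's empty cells is an assumption, since the slide does not fix it.

```python
import numpy as np

# Illustrative only: Pearson correlation between two users' rating
# vectors, computed over the items both have rated (0 = unrated here).
def user_correlation(ru, rv):
    mask = (ru > 0) & (rv > 0)          # co-rated items
    if mask.sum() < 2:
        return 0.0                      # not enough overlap
    a, b = ru[mask], rv[mask]
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

R = np.array([[5, 3, 0, 2],             # hypothetical placement of the
              [3, 0, 5, 2],             # slide's example ratings
              [0, 1, 2, 0],
              [2, 0, 0, 3]], dtype=float)
print(user_correlation(R[0], R[1]))     # positive or negative correlation
```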


Challenges in Collaborative Recommender System Design

1. The cold-start problem
   - Recommendations cannot be made unless a user has provided some ratings
   - Solutions: recommend the most popular items, or explicitly ask the user to rate some items prior to making recommendations

2. The sparsity problem
   - The ratings matrix is sparse: more than 90% of its elements are empty
   - Solution: dimensionality-reduction techniques; Singular Value Decomposition (SVD) yields good results (see the sketch below)
   - Pros: the resulting matrix is substantially smaller and denser
   - Cons: the dataset becomes very "noisy", since most elements assume values that are marginally larger than zero
   - Conclusion: we need techniques that can "learn" noisy datasets!
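As an illustration of the SVD step, here is a minimal numpy sketch that keeps only the non-negligible singular values of a small ratings matrix; the example values and the 1e-10 cutoff are assumptions.

```python
import numpy as np

# A minimal sketch of SVD-based dimensionality reduction on a small
# ratings matrix (0 = unrated); keep only the non-zero singular values.
R = np.array([[5, 3, 0, 2],
              [3, 0, 5, 2],
              [0, 1, 2, 0],
              [2, 0, 0, 3]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = int(np.sum(s > 1e-10))              # rank: non-negligible singular values
R_reduced = U[:, :k] * s[:k]            # users projected into k latent dims
print(R_reduced.shape, np.round(s, 3))
```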


"Noisy" Datasets

The added noise in the dataset hinders the discovery of patterns in the data

- Data clusters become difficult to separate

Machine Learning techniques for highly non-separable datasets

- Support Vector Machines, Radial Basis Functions: computing the support vectors (or estimating the surface) can be a computationally intensive task
- Evolutionary Algorithms: meaningful recommendations are not always guaranteed (evolutionary dead-ends)
- Our approach: use k-separability!
  - Originally proposed by W. Duch [1]
  - A special case of the more general method of Projection Pursuit
  - Applicable to feed-forward ANNs
  - Extends linear separability of data clusters into k > 2 segments on the discriminating hyperplane

[1] W. Duch, K-separability. Lecture Notes in Computer Science 4131 (2006) 188-197


Extending linear separability to 3-separability: the 2-bit XOR problem

- A highly non-separable dataset
- It can be learned by a 2-layered perceptron, or ...
- ... by a single-layer perceptron that implements k-separability!
- The activation function must partition the input space into 3 distinct areas (see the sketch below)
  - Soft-windowed activation functions

[Figures: (a) Input space partitioning; (b) Soft-windowed activation function]
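A minimal sketch of the idea on the 2-bit XOR problem: a soft-windowed activation built as a difference of two sigmoids (one possible choice, not necessarily the paper's exact function) responds only on the middle of the three segments that the projection w·x creates; all parameter values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Soft-windowed activation: ~1 when the projection y falls inside the
# window (a, b), ~0 outside, carving the projection line into 3 segments.
def soft_window(y, a=0.5, b=1.5, beta=10.0):
    return sigmoid(beta * (y - a)) - sigmoid(beta * (y - b))

# 2-bit XOR: projecting onto w = (1, 1) yields y in {0, 1, 2};
# the XOR-true points occupy the middle segment only.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = np.array([1.0, 1.0])
print(np.round(soft_window(X @ w)))     # -> [0. 1. 1. 0.]
```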


Generalizing to k-separability

Complex datasets

- Combine the output of two neurons (or more ...)
- e.g. a 5-separable dataset can be learned by the combined output of 2 neurons

Generalization by induction

- An m-neuron output ⇒ 2m + 1 regions on the discriminating line ⇒ a (k = 2m + 1)-separable dataset

Use in a recommendation engine (a sketch follows)

- Create a 2-layered perceptron: an n-sized input vector, an m-sized hidden layer and a single output layer; overall, an n → m → 1 projection
- Build a model (NN) for each user
  - Input: the ratings of the n "neighbors" of the target user on an item he hasn't evaluated
  - Output: a "score" for the unseen item
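A minimal, untrained sketch of such a per-user n → m → 1 network, under the same difference-of-sigmoids assumption as above; the class name, window parameters and initialization are illustrative, and the training procedure is omitted here (it is the constructive algorithm of the next slide).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class KSepNet:
    """n -> m -> 1 projection: m soft-windowed hidden units, linear output."""
    def __init__(self, n, m, width=1.0, beta=10.0):
        self.W = rng.normal(size=(m, n))     # hidden-layer projections
        self.a = rng.normal(size=m)          # window left edges
        self.width, self.beta = width, beta
        self.v = rng.normal(size=m)          # output weights

    def predict(self, x):
        y = self.W @ x                       # project the n neighbor ratings
        h = sigmoid(self.beta * (y - self.a)) - \
            sigmoid(self.beta * (y - self.a - self.width))
        return float(self.v @ h)             # "score" for the unseen item

net = KSepNet(n=8, m=2)                      # 2 hidden units -> 5-separable
print(net.predict(rng.uniform(0, 5, size=8)))
```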


Implementation Details

- The index of separability (k) is not known a priori
  - Setting k to a fixed value is of little help: it can lead either to overspecialization or to large training errors
  - Therefore, k is a problem parameter: it has to be estimated
- Dynamic network architecture
  - A sparse user ratings matrix ⇒ a small overall network size ⇒ a constructive network algorithm
- Our constructive network algorithm was derived from the New Constructive Algorithm [2]

[2] Islam MM et al., A new constructive algorithm for architectural and functional adaptation of artificial neural networks. IEEE Trans Syst Man Cybern B Cybern. 2009 Dec;39(6):1590-1605


Constructive Network Algorithm

1. Create a minimal architecture
2. Train the network in two phases on the whole Training Set
3. Iteratively add neurons to the hidden layer
   - Create new Training Sets based on the Classification Error (boosting algorithm)
   - Only the newly added neuron's weights are adapted; all others remain "frozen"
4. Stop network construction when the Classification Error stabilizes

Boosting Algorithm

- Inspired by AdaBoost and used in network training as a way of avoiding local minima
- Functionality: unlearned samples ⇒ new neurons in the hidden layer ⇒ new clusters discovered

A skeleton of the construction loop is sketched below.
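A runnable, high-level skeleton of the construction loop, under loudly stated assumptions: each "hidden neuron" is a trivial decision-stump stand-in rather than the paper's soft-windowed unit, and the reweighting rule is only AdaBoost-flavored. What it does preserve is the structure the slide describes: add one unit at a time, keep earlier units frozen, reweight unlearned samples, stop when the error stabilizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_new_neuron(X, y, w):
    # Stand-in "training": the feature/threshold pair minimizing the
    # weighted error (a decision stump), NOT the paper's actual unit.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            err = np.sum(w * ((X[:, j] > t).astype(int) != y))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

def predict(hidden, X):
    # Majority vote of the units added so far (stand-in for the
    # network's combined output).
    votes = sum((X[:, j] > t).astype(int) for j, t in hidden)
    return (votes * 2 > len(hidden)).astype(int)

def construct_network(X, y, max_hidden=10, tol=1e-3):
    w = np.full(len(X), 1.0 / len(X))        # boosting-style sample weights
    hidden, prev_err = [], np.inf
    for _ in range(max_hidden):
        hidden.append(train_new_neuron(X, y, w))  # only the new unit is fit
        miss = predict(hidden, X) != y
        err = float(np.sum(w * miss))
        w = np.where(miss, w * 2.0, w)       # emphasize unlearned samples
        w /= w.sum()
        if abs(prev_err - err) < tol:        # classification error stabilized
            break
        prev_err = err
    return hidden

X = rng.uniform(0, 5, size=(40, 4)); y = (X[:, 0] > 2.5).astype(int)
print(len(construct_network(X, y)))          # hidden units actually added
```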


Our Collaborative Recommender System

Input: the user ratings matrix and the target user
Output: a model (NN) for the target user

Steps (sketched in code below)

1. Pick from the user ratings matrix all the co-raters of the target user
2. Compute the SVD of the co-raters matrix, retaining only the non-zero singular values
3. Partition the resulting matrix into 3 different sets: the Training Set, the Validation Set and the Test Set
4. Train a constructive ANN architecture (as discussed previously)
5. Compute the performance metrics on the Test Set
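A minimal end-to-end sketch of the five steps; it reuses construct_network and predict from the skeleton above as stand-ins for the real training routine, binarizes "like" as a rating of 4 or more, and assumes a 60/20/20 split, none of which the slide specifies.

```python
import numpy as np

def build_user_model(R, target, rng=np.random.default_rng(0)):
    # 1. keep only the co-raters of the target user (0 = unrated)
    rated = R[target] > 0
    co = (R[:, rated] > 0).any(axis=1)
    co[target] = False
    M = R[co]

    # 2. SVD of the co-raters matrix, retaining the non-zero singular values
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = int(np.sum(s > 1e-10))
    Z = Vt[:k].T * s[:k]                 # items embedded in the latent space

    # 3. partition the target's rated items into the three sets (60/20/20)
    idx = rng.permutation(np.flatnonzero(rated))
    n_tr, n_va = int(0.6 * len(idx)), int(0.2 * len(idx))
    train, val, test = np.split(idx, [n_tr, n_tr + n_va])

    # 4. train the constructive architecture (stand-in from the sketch above)
    net = construct_network(Z[train], (R[target, train] >= 4).astype(int))

    # 5. score the model on the test set
    return net, predict(net, Z[test]), (R[target, test] >= 4).astype(int)

# e.g. net, scores, truth = build_user_model(R, target=0)  # R as in the SVD sketch
```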


Experiment: The MovieLens Database

- Contains the ratings of 943 users on 1682 movies
- Sparse matrix (6.3% non-zero elements)
- Each user has rated at least 20 movies (106 on average), but ...
- Discrete exponential distribution
  - 60% of all users have rated 100 movies or fewer
  - 40% of all users have rated 50 movies or fewer
- We followed a purely collaborative strategy, taking into account only the user ratings and not any other demographic information

[Figure (a): histogram of rated items per user]


Experiment: Test Sets & Metrics

Many users rate only a few movies. How would our system perform?

- Group A: the few raters user group, containing all users who have rated 20-50 movies

How would our system perform in the average case?

- Group B: the moderate raters user group, containing all users who have rated 51-100 movies; may be used in comparisons with other implementations

We randomly picked 20 users from each group (40 users in total). The results were averaged for each group.

Metrics (a sketch follows)

1. Precision
2. Recall
3. F-measure
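For concreteness, a minimal sketch of the three metrics over a top-N recommendation list, assuming the usual set-based definitions (an item counts as relevant if the user actually rated it highly).

```python
# "recommended" and "relevant" are collections of item ids.
def precision_recall_f1(recommended, relevant):
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    p = hits / len(recommended) if recommended else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f1([1, 2, 3, 4], [2, 3, 5]))  # (0.5, 0.667, 0.571)
```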


Results

Table: Performance Results

Methodology                                   Precision   Recall   F-measure
Our system: User Group B (moderate ratings)    75.38%     82.21%    79.37%
Our system: User Group A (few ratings)         74.07%     88.86%    78.97%
MovieMagician Clique-based                     74%        73%       74%
MovieLens                                      66%        74%       70%
SVD/ANN                                        67.9%      69.7%     68.8%
MovieMagician Feature-based                    61%        75%       67%
MovieMagician Hybrid                           73%        56%       63%
Correlation                                    64.4%      46.8%     54.2%

Observations

- Our system achieves good results in both user groups and outperforms the other approaches
- Recall is higher in the few raters group, because those users seem to rate only the movies they like
  - Therefore, the recommender cannot generalize


Conclusions

- We have presented a complete Collaborative Recommender System that is specifically fit for cases where information is limited
- Our system achieves a good trade-off between Precision and Recall, a basic requirement for recommenders
- This is because k-separability is able to uncover complex statistical dependencies (both positive and negative)
- We don't need to filter the neighborhood of the target user as other systems do (e.g. by using the Pearson correlation formula)
  - All "neighbors" are considered
  - Extremely useful in cases of sparse datasets
