
Building a Relevance Engine with the Interest Graph

Abstract

The Interest Graph is a network of asymmetric connections between people and the things we publicly "Like," follow, pin, post, and check into. These asymmetrical signals of interest (ASIs) map a topology of aspiration, interest, and curiosity between people and the stuff they love, both online and off. By mapping this public graph of what people like, 140 Proof is able to understand not just who likes what, but what topics are hot, what social accounts are influential about what topics, and how companies and small businesses can better optimize the way they try to acquire customers and talk to their constituents.



140 Proof Research

Kumar Dandapani, John Manoogian III

Introduction

The proliferation of electronic social networks since the late 2000s has generated an abundance of data on individuals and their interests. The volume and cacophony of this data create significant challenges in characterizing social media users and identifying relevant information from the social stream. Although relevance is often a nebulous, contextual concept, it can safely be defined as occurring when content engages the interests of a user. Consequently, arriving at relevant content motivates the creation of a systematic and falsifiable mechanism for identifying the interests of network users. At 140 Proof, this effort has become part of a broader development effort called the Relevance Engine.

A commonly used approach for building interest groups is to cluster users with similar characteristics, often by their usage of keywords, but these approaches frequently preclude falsifiability and rely heavily on intense computation as a substitute for acquiring subject-area expertise on social media interests. While such techniques are capable of maximizing a well-defined objective function, such as the number of Facebook 'Likes' over a period of time, they can often lead to conclusions that suggest relationships between variables that are nonsensical or prove to be unstable over time. The Relevance Engine represents a departure from such prevailing techniques by providing a framework for acquiring subject-area expertise from experiments on social media engagement and then combining this information with components of network graph theory. This approach allows us to understand the underlying data-generating processes and the statistical properties of the explanatory variables that expose a user's interests. This paper describes the empirical and theoretical foundations of our approach.

1. Social Networks

Social media networks share many common components. This section describes the participants in these networks, their motivations and behaviors, and how one can use these qualities to obtain relevant information from the interest graph.

1.1 Users

Participants in social media vary widely in their usage of the network and their propensity for engagement. Some individuals focus more heavily on the creation of content, while others use social media primarily as a content consumption mechanism. Our approach to identifying interests accounts for these different uses by relying less on user-generated content and more heavily on observed behavioral responses to social stream content.

1.2 Interests

For the purposes of our investigation, interests are defined as a collection of subjects, activities, or attitudes that are capable of reliably drawing the attention of a subset of the population. Designing content that engages a group of users matching an interest is a challenging, but critical, input to the Relevance Engine. If it proves to be impossible to generate social media content that engages some subset of the population disproportionately, then it is likely that the interest is either too broadly defined (e.g., right-handed people) or represents a concept so abstract (e.g., identifying idealists) that it cannot be captured with social media content. As a result, designing social stream content for a given interest draws heavily on subject-area expertise in social media.

1.3 Engagement

Social networks have several mechanisms for observing a user's response to specific content in the social stream. Some of the most common examples include the 'Like' button on Facebook, the Favorite and Retweet options on Twitter, and the +1 option on Google+. The development of the Relevance Engine does not rely exclusively on a specific engagement mechanism, but it does require that interest-specific or interest-neutral content be presented in a manner that allows engagement to be attributed unambiguously to the stimulating content.

1.4 Content

Social networks offer multiple channels in which a user is presented with content. Tweets, wall posts, and links are some common examples of these content distribution mechanisms. Many of these channels offer a way in which both interest-specific and interest-neutral content can be delivered.

The Relevance Engine draws upon the concept of interest-neutral content for determining a baseline engagement rate for a given network user. By design, certain types of content should not garner a disproportionate amount of attention and should not produce statistically significant differences in engagement rates when compared to the overall network. An example of this broad, general content would be a non-partisan news article. While all content has the potential for unwittingly being interest-specific, if the engagement rate for the social media content is within the margins of random variation centered around the benchmark rate of engagement for the network, it can effectively be described as interest-neutral. The benefit of having such content is that it provides a control for the varying levels of engagement that are inherent to participants in social media. Users vary widely in their baseline propensity to observably engage with any type of social media content, and a failure to recognize and correctly quantify this engagement propensity can create a powerful confound that results in both Type I and Type II inferential errors. By controlling for these biases, the Relevance Engine can evaluate whether a user matches a given interest regardless of that user's baseline engagement propensity.

An additional benefit of interest-neutral content is that it allows us to measure whether interest-specific content in our inventory is anomalous. Content that is universally more engaging for reasons beyond its intent (e.g., an objective news story with a provocative title) will likely observe a higher engagement rate for reasons other than the fact that it resonates with users who share a specific interest. Such content can be readily handled by comparing the statistical properties of each piece of content with the broader network.

2. Defining Interests by Engagement

Identifying the interests of users by their engagement behavior is achieved through inferential statistical procedures in which each piece of social media content is treated as a binomial random variable and the user's decision to observably engage with content is coded as a success. This section describes our approach to designing interest-identification experiments and the application of binomial sequential tests as a way of quantifying a user's propensity to engage with interest-specific content beyond their baseline rate of engagement.

2.1 Experimental Design

Group sequential multiple-sampling procedures provide a compelling framework for inferring the interests of a given user. Sequential sampling is an alternative to fixed sample-size tests, which are expensive in terms of both media inventory and the amount of time consumed before statistically valid conclusions can be drawn. In the context of social networks, sequential sampling is particularly compelling given that the number of sessions per user is not known beforehand and is subject to a high degree of variability due to unpredictable group sizes.

To formalize this model, we start by defining I as a vector of unique interests that are believed to be measurable. The objective of our sequential trials is to map these interests onto the user space U. Using this notation, I_i (i = 1, 2, ...) represents the ith interest under investigation during a given trial. Trials are conducted on a per-user basis, so U_{u,I_i} (u = 1, 2, ...) is a dichotomous variable indicating whether the uth user on the network shares an interest in I_i.

Each interest is represented by an inventory of social media content denoted by the matrix A, where A_j (j = 1, 2, ...) represents the jth piece of content for interest I_i. To make this problem more tenable, it is assumed that each piece of content A_j is contextually independent of A_1 ... A_{N_A}, where a total of N_A pieces of content have been constructed for I_i, and that the order of the content does not affect the likelihood of engagement. Given N_A unique interest-specific pieces of content in set A for I_i, the distribution of engagement per user can be described by a series of independent Bernoulli trials, where π_{I_i} is the sample proportion of interest-specific social stream content with which the user observably engaged (i.e., a success). From this, we can describe a given user's engagement behavior with respect to I_i using a binomial distribution.

π_{I_i} = (1/N_A) ∑_{j=1}^{N_A} A_j    (1)

U_{u,I_i} ∼ B(N_A, π_{I_i})    (2)

From the binomial distribution, the mean and variance of the number of engagements with interest-specific content are

E[U_{u,I_i}] = N_A π_{I_i}    (3)

V[U_{u,I_i}] = N_A π_{I_i} (1 − π_{I_i})    (4)
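As a concrete illustration of Eqs. 1 through 4, the sketch below treats a user's engagement record as a vector of 0/1 Bernoulli outcomes and computes the sample proportion and the binomial mean and variance. The function name and inputs are illustrative, not from the paper's system.

```python
# Sketch: a user's engagements with interest-specific content as Bernoulli
# trials. Each entry of `engagements` is 1 if the user observably engaged
# with a piece of content, else 0.

def engagement_stats(engagements):
    """Return (sample proportion, expected engagements, binomial variance)."""
    n_a = len(engagements)
    pi_hat = sum(engagements) / n_a        # Eq. 1: sample proportion
    mean = n_a * pi_hat                    # Eq. 3: expected engagement count
    var = n_a * pi_hat * (1 - pi_hat)      # Eq. 4: binomial variance
    return pi_hat, mean, var

pi_hat, mean, var = engagement_stats([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
print(pi_hat, mean, var)  # approximately 0.3, 3.0, 2.1
```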

Similarly, we define C to be a matrix with cardinality N_C representing the interest-neutral control content presented to all network users. It is again assumed that each piece of content is independent and can be presented in random order to the user without introducing any pronounced confounds.

π_C = (1/N_C) ∑_{j=1}^{N_C} C_j    (5)

U_{u,C} ∼ B(N_C, π_C)    (6)

The binomially distributed random variable U_{u,C} describes our estimate of the baseline social media engagement rate for user u. Based on asymptotic properties, the standard error of our sample-estimated user engagement rate, E[U_{u,C}], will decrease as the user is exposed to more content from the control inventory.

E[U_{u,C}] = N_C π_C    (7)

V[U_{u,C}] = N_C π_C (1 − π_C)    (8)

The sequence of binomial trials for a given user, where i ∈ (1, N_A) and items are chosen by alternating assignment with random sampling without replacement, is as follows:

C_j, I_i, C_{j+1}, I_{i+1}, ..., C_{j=N_C}, I_{i=N_A}    (9)
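The alternating assignment in Eq. 9 can be sketched as follows: control and interest items are each shuffled (sampling without replacement) and then interleaved. The function name, item labels, and fixed seed are illustrative assumptions.

```python
import random

# Sketch of the alternating control/interest presentation sequence (Eq. 9):
# each inventory is sampled without replacement, then interleaved C, I, C, I...
def presentation_sequence(control_items, interest_items, seed=0):
    rng = random.Random(seed)
    c = rng.sample(control_items, len(control_items))   # without replacement
    a = rng.sample(interest_items, len(interest_items))
    seq = []
    for pair in zip(c, a):                              # alternate C then I
        seq.extend(pair)
    return seq

seq = presentation_sequence(["C1", "C2", "C3"], ["I1", "I2", "I3"])
# even positions hold control items, odd positions interest items
```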

Under the assumptions of independence described by Meeker (1981), the nominal expected engagements x_{I_i} and x_C can be obtained from the formula

P(x_{I_i}, x_C; N_A, N_C, π_{I_i}, π_C) = (N_A choose x_{I_i}) π_{I_i}^{x_{I_i}} (1 − π_{I_i})^{N_A − x_{I_i}} · (N_C choose x_C) π_C^{x_C} (1 − π_C)^{N_C − x_C}    (10)
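Eq. 10 is the product of two independent binomial probability mass functions; a minimal sketch using only the standard library is shown below. The function name and example rates are illustrative.

```python
from math import comb

# Sketch of Eq. 10: joint probability of x_i interest engagements out of
# N_A impressions and x_c control engagements out of N_C impressions,
# assuming the two binomials are independent.
def joint_prob(x_i, x_c, n_a, n_c, pi_i, pi_c):
    p_interest = comb(n_a, x_i) * pi_i**x_i * (1 - pi_i)**(n_a - x_i)
    p_control = comb(n_c, x_c) * pi_c**x_c * (1 - pi_c)**(n_c - x_c)
    return p_interest * p_control

# e.g., 3 of 10 interest engagements at rate 0.3, 1 of 10 control at rate 0.1
print(joint_prob(3, 1, 10, 10, 0.3, 0.1))
```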

2.2 Analyzing Sequential Trial Experiments

Sequential binary testing allows us to accept or reject the null hypothesis that, for a given user u, the engagement rate for interest-specific content, E[U_{u,I_i}], cannot be distinguished from the user's baseline engagement rate, E[U_{u,C}]. The two-sided hypothesis being tested is H_0: Pr(U_{u,I_i}) = Pr(U_{u,C}) vs. H_A: Pr(U_{u,I_i}) ≠ Pr(U_{u,C}). Extending the formulation described by Jennison and Turnbull (1993), we let θ = π_{I_i} − π_C represent the parameter indicating that the proportions are not equivalent. We start by defining the marginal distribution of W_k for the case where the user receives an equal number of control and interest-specific items, n_k. Under this condition, the parameter estimate can be approximated from the normal distribution as follows:

W_k = ∑_{i=1}^{n_k} X_{I_i} − ∑_{i=1}^{n_k} X_C ∼ N(n_k θ, 2 n_k σ²)    (11)

W_k ∼ N(n_k θ, n_k {π_{I_i}(1 − π_{I_i}) + π_C(1 − π_C)})    (12)

We start with the null hypothesis that rates of engagement are identical for treatment and control

W_k ∼ N(0, 2 n_k σ_0²), k = 1, ..., K    (13)

and we can accept or reject a one- or two-sided null hypothesis as follows

W_k > c_k(β) √(2 n_k σ²), k = 1, ..., K    (14)

W_k < −c_k(β) √(2 n_k σ²), k = 1, ..., K    (15)
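The boundary check in Eqs. 14 and 15 reduces to comparing the cumulative difference statistic against a scaled critical value. The sketch below assumes c_k is supplied by the caller; deriving it (Eq. 16) is outside this fragment, and the function name and example values are illustrative.

```python
import math

# Sketch of the stopping boundaries in Eqs. 14-15: reject the null when the
# cumulative difference statistic W_k falls outside +/- c_k * sqrt(2 n_k sigma^2).
def crosses_boundary(w_k, n_k, sigma2, c_k):
    bound = c_k * math.sqrt(2 * n_k * sigma2)
    if w_k > bound:
        return "reject-high"   # engagement above baseline (Eq. 14)
    if w_k < -bound:
        return "reject-low"    # engagement below baseline (Eq. 15)
    return "continue"

print(crosses_boundary(9.0, 20, 0.25, 2.5))  # → reject-high
```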

The complete framework for sequential tests then becomes

Pr{|W_1| < √(2 n_1 σ_0²) c_1(β), ..., |W_{k−1}| < √(2 n_{k−1} σ_0²) c_{k−1}(β), |W_k| ≥ √(2 n_k σ_0²) c_k(β) | W_j ∼ N(0, 2 n_j σ_0²), j = 1, ..., K} = (α/2) [ (n_k/n_max)² − (n_{k−1}/n_max)² ], k = 1, ..., K    (16)

Given a predetermined Type II error rate β and critical values c_k(β), k = 1, ..., K, under the assumption of normality the condition for stopping becomes:

Pr{−√(2 n_k σ²) Φ^{−1}(1 − β/2) < W_k < √(2 n_k σ²) Φ^{−1}(1 − β/2) | θ = Δ} = α    (17)

Statistically valid inferences can then be drawn from this test, with the minimum sample size n_k calculated as follows

n_k = (2σ²/δ²) {Φ^{−1}(1 − β/2) + Φ^{−1}(1 − α)}²    (18)
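The minimum sample size in Eq. 18 can be evaluated directly with the standard normal quantile function. The sketch below follows the formula as the paper states it (note that conventions for where α and β halve differ across texts); the function name and example parameters are illustrative.

```python
from statistics import NormalDist

# Sketch of Eq. 18: minimum sample size from the variance sigma^2, the
# effect size delta, and the Type I / Type II error rates alpha and beta.
def min_sample_size(sigma2, delta, alpha, beta):
    z = NormalDist().inv_cdf          # standard normal quantile function
    return (2 * sigma2 / delta**2) * (z(1 - beta / 2) + z(1 - alpha)) ** 2

n = min_sample_size(sigma2=0.25, delta=0.1, alpha=0.05, beta=0.2)
print(round(n))  # on the order of a few hundred trials for these inputs
```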

One complication in the analysis of social media content is that the proportion of successes can be quite low even when presenting the most engaging content available. Low proportions make it challenging to arrive at unbiased parameter estimates, and thus difficult to make valid inferences under the assumption of normality. To address this issue, we explore the calculation of bias-adjusted maximum-likelihood estimates as described in Brown et al. (2002). Alternative binomial proportion confidence intervals for the engagement rate can also be obtained using the Wald interval, repeated confidence intervals (RCIs), and the Agresti-Coull approach.
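Two of the intervals mentioned above can be sketched in a few lines, following the formulas surveyed in Brown, Cai, and DasGupta (2002). The example below uses a deliberately low engagement count to show why the Wald interval misbehaves at small proportions; the function names are illustrative.

```python
from statistics import NormalDist

# Sketch of two binomial confidence intervals for an engagement rate.
def wald_interval(x, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    half = z * (p * (1 - p) / n) ** 0.5
    return p - half, p + half

def agresti_coull_interval(x, n, alpha=0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_tilde = n + z**2                     # inflate the sample size
    p_tilde = (x + z**2 / 2) / n_tilde     # shrink the estimate toward 1/2
    half = z * (p_tilde * (1 - p_tilde) / n_tilde) ** 0.5
    return p_tilde - half, p_tilde + half

lo_w, hi_w = wald_interval(2, 100)
lo_ac, hi_ac = agresti_coull_interval(2, 100)
# with rare engagements the Wald lower bound can go negative;
# the Agresti-Coull interval stays better behaved
```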

2.3 Content Quality Control

In our sequential testing framework, users consistently receive interest-neutral content, and over time the standard errors around the parameter estimate of the expected proportion of successful engagements with interest-neutral content should decline, while still providing a framework to account for the changing engagement behavior of users on the network. With this approach, the control content provides both a quality-control check and a mechanism for monitoring a user's changing interests. If a given piece of interest-neutral content is generally less compelling than interest-specific content, then users are likely to be erroneously classified as harboring a particular interest, leading to false discovery. To control for this effect, we continually assess whether a particular piece of content is universally more or less compelling by looking at the distributional properties of each piece of interest-neutral social media content across the network. If its engagement rate is abnormally high or low relative to the network response rate, we exclude it from interest identification. This helps ensure that interest-neutral content is truly interest-neutral. As more sequential trials are conducted, such quality-control techniques will improve.

For j ∈ (1 ... J), we define the total number of engagements per interest-specific item as S = ∑_{u=1}^{U} A_j. From maximum-likelihood estimation we arrive at the lower and upper confidence bounds, (p_L, p_U) = (p_L(S), p_U(S)); the acceptance range characterizing a piece of content as being within the range of normal engagement is then defined as

Pr[S ≤ s | p = p_U(S)] = Pr[S ≥ s | p = p_L(S)] = α/2    (19)

2.4 Early Stopping Rules for Sequential Trials

The rules for stopping a given trial are based on the likelihood ratio. We define θ_i to be the log odds ratio (sequential probability ratio) that measures the difference in engagement rate between the content specific to interest I_i and the interest-neutral content C. By running a two-sided hypothesis test, we can stop showing interest-specific impressions to a user when it becomes clear that the user has no propensity for those impressions. Interest-specific content is suspended once this condition is met. The simplest approach to stopping trials early is based on the log odds ratio of the treatment to the control relative to a predetermined threshold

θ_i = log( π_{I_i}(1 − π_C) / (π_C(1 − π_{I_i})) )    (20)
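Eq. 20 is a direct computation once the two rates are estimated; a minimal sketch follows. The example rates are illustrative, and any suspension threshold applied to θ_i is a policy choice outside this fragment.

```python
import math

# Sketch of Eq. 20: the log odds ratio comparing a user's interest-specific
# engagement rate pi_i against the interest-neutral control rate pi_c.
def log_odds_ratio(pi_i, pi_c):
    return math.log((pi_i * (1 - pi_c)) / (pi_c * (1 - pi_i)))

theta = log_odds_ratio(0.09, 0.03)
print(theta > 0)  # → True : the user engages more with interest content
```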

The Relevance Engine also uses the Wald interval. Upon observing a p that falls outside the long-memory bounds relative to our defined Type I significance level, we suspend trials for that interest for the user.

p ± κ n^{−1/2} (p(1 − p))^{1/2}    (21)

where κ = Φ^{−1}(1 − α/2)

2.5 Future Research

The following topics represent areas of future research in both developing and analyzing sequential trials on interest-based engagement.

(1) Time-series analysis: Future models will attempt to address the time-series component of interests. One such model would propose a weighting scheme to emphasize more recent engagement behavior. The existing approach presumes that interests are stationary, creating a need to periodically rerun trials on the user to measure changing interests over time. One such example is a user who, as a new parent, engages with baby-related social media content when that user previously did not align with such interests.

(2) Assignment mechanisms: By acknowledging the possibility that the response rate at time t is influenced by the response rate at t − 1, we can compare alternating assignment, random assignment, and adaptive sequencing assignments. A major assumption behind this experimental design is that the decision to engage with the control is independent of the treatment, and that the decision to engage with one piece of content does not preclude the ability to engage with another. As long as treatment and control do not compete with one another for the user's attention, this independence can generally be assumed; but by controlling for the order in which content is received, we can address concerns about the impact on engagement behavior of fatigue from being shown irrelevant content.

(3) Dropouts: Given the potential for users to enter and exit a social network, future research will attempt to generate estimates that account for users who abandon social media or who receive only a small number of treatments.

(4) Principal component analysis: Several endogenous qualities associated with users plausibly drive engagement. Such factors include the number of network associations, days since the user joined the network, gender, and location. Research in this area would also help us account for exogenous factors, such as major news events and announcements, and their confounding impact on the network. The presence of such events might leave users disinclined to engage with either interest-neutral or interest-specific content for a large window of time.

3. Defining Interests by Network Associations

Exclusively relying on an engagement-based interest graph fails to address the sizable proportion of social media users who never observably engage with any type of content. As a result, the Relevance Engine uses observed engagement data as a seed to identify which network associations are suggestive of a user-interest match. The process within the Relevance Engine is to identify the first-, second-, and third-degree associations and the Simmelian ties between users who have been classified as sharing a particular interest and all other users on the network. With this association matrix, we run logit regressions to see which of those associations are statistically significant. The explanatory power of certain network associations on interest alignment is then tested in a statistical framework, and the forecasting power of these models is tested out of sample.

3.1 Modeling Network Associations

We start this process by identifying the subset of users that engaged with interest-specific content at a rate exceeding their baseline engagement rate, and we create an n × m matrix Z that represents user relationships per interest I_i. In this matrix, we use h to represent the user that engaged with interest-specific content and r to represent another user on the network. Using this notation, Z_hr = X indicates the number of degrees in that connection, where X ∈ (0, 1, 2, 3) corresponds to the degrees of separation of the network association.

3.1.1 Simmelian Tie Measures

As described in Krackhardt (1999), Simmelian ties are a way of capturing strong associations in the social graph by observing reciprocity in associations. Simmelian ties are defined as a variant of symmetric relationships in the network, where Z ∩ Z′ represents all of the mutual relationships between users h and r. From this foundation, we can characterize the subset of Simmelian ties as S = Y ⊗ (Y²). These associations are then codified with a dichotomous variable in our logit framework.
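One plausible reading of S = Y ⊗ (Y²), consistent with Krackhardt's definition, is sketched below: Y is the symmetrized adjacency matrix of reciprocated ties, Y² counts common mutual neighbors, and ⊗ is taken as an elementwise product, so a tie survives only if it is reciprocated and embedded in at least one closed triad. The matrix encoding and function name are illustrative assumptions.

```python
# Sketch of the Simmelian-tie construction S = Y ⊗ (Y^2), reading ⊗ as an
# elementwise (Hadamard) product over a 0/1 adjacency matrix `adj`.
def simmelian_ties(adj):
    n = len(adj)
    # Y: keep only mutual (reciprocated) edges
    y = [[1 if adj[i][j] and adj[j][i] else 0 for j in range(n)] for i in range(n)]
    # Y^2: number of common mutual neighbors between i and j
    y2 = [[sum(y[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    # elementwise product, dichotomized for the logit framework
    return [[1 if y[i][j] and y2[i][j] else 0 for j in range(n)] for i in range(n)]

# a reciprocated triangle 0-1-2 plus a reciprocated pendant edge 2-3
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
s = simmelian_ties(adj)
print(s[0][1], s[2][3])  # → 1 0 : only the triangle edges are Simmelian
```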

3.2 Logit Regressions

Logit regressions serve as the main framework by which we identify which nodes on the network have explanatory power for interest-specific engagement. Following our notation for the expected engagements on a given interest by user u, we specify the following logit model and use maximum-likelihood estimation to arrive at the model parameters.

logit(E[U_{u,I_i} | Z_{u,1} ... Z_{u,m}]) = β_0 + β_1 Z_{u,1} + ⋯ + β_m Z_{u,m}    (22)

The logit model is then reduced to the subset of variables whose ML-estimated t-values, t_i = √n β_i / σ_i, exceed a pre-specified critical value.
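A minimal, dependency-free sketch of fitting the logit model in Eq. 22 by gradient ascent on the log-likelihood is shown below. A production system would use a statistics library; the toy data, learning rate, and step count here are illustrative assumptions, not the paper's estimation procedure.

```python
import math

# Minimal gradient-ascent sketch of the logit model in Eq. 22, mapping
# network-association features Z to engagement probability.
def fit_logit(X, y, lr=0.5, steps=2000):
    n, m = len(X), len(X[0])
    beta = [0.0] * (m + 1)                       # beta_0 ... beta_m
    for _ in range(steps):
        grad = [0.0] * (m + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))       # logistic link
            err = yi - p                         # gradient of the log-likelihood
            grad[0] += err
            for j in range(m):
                grad[j + 1] += err * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# toy data: association with the first node predicts engagement,
# the second association carries no signal
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
beta = fit_logit(X, y)
print(beta[1] > 0)  # → True : the predictive association gets positive weight
```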


3.2.1 Heckman-style Model

The use of a Heckman-style model is motivated by the desire to correct for the inherent self-selection bias associated with interest-based engagement, which results in a non-random sample. This non-random sample has the potential to create a skewed social graph for a given interest by failing to account for users that never observably engage. The basic Heckman procedure for correcting this problem is described by Sartori (2003)

y_i = β′x_i + ε_i    (23)

U_i = γ′w_i + u_i    (24)

This is in many ways similar to an omitted-variable misspecification problem in an OLS regression. Here φ is the standard normal density and Φ is the cumulative standard normal distribution:

E(y_i) = β′x_i + θ [ φ(γ′w_i) / Φ(γ′w_i) ]    (25)

3.2.2 Equivalence Relations

The ubiquity of certain network associations (e.g., celebrities) can pose challenges in a regression framework. One approach to mitigating the effects of multicollinearity in our network-node regressions is to identify equivalence relations between the nodes on the network by performing a Turing reduction on the matrix Z for each interest.

4. Model Validation

Falsifiability is a key feature of the Relevance Engine, and this section outlines the ways in which we can validate the success of our classification mechanism. Two main types of validation need to be performed: (1) assess the stability of content-based engagement, as determined by the ability to replicate the sequential trial experiments on a fresh subset of the network; (2) perform an out-of-sample validation of our second-stage social graph user-interest classification on a second set of users.

4.1 Methodology

To validate our interest graph inferences, the population of network users is divided randomly into two groups, A and B, in which we restrict the samples to users that have participated in the social network for a comparable period of time to control for adoption differences. After defining a set of N reasonable interests, the engagement-based sequential tests are performed on group A. The network associations of the users in group A are defined for each interest, and a Z_i matrix is created for each interest i = 1 ... N. We apply the logit model fit on group A to group B to arrive at expected probabilities of engagement Y_{P2}. Interest-specific content is then shown to users in group B, and the forecasted engagement rate is compared to the observed engagement rate.

4.2 Evaluation

The objective of this evaluation is to see whether the Relevance Engine produces results in which out-of-sample users classified as sharing an interest have a higher engagement rate for interest-specific content than for interest-neutral content. We evaluate our performance by comparing the root-mean-squared error of the out-of-sample data with the in-sample data, where P1 corresponds to estimated engagement based on the in-sample logit estimate and P2 corresponds to the out-of-sample observed value.

RMSD(Y_{P1}, Y_{P2}) = √( E[(Y_{P1} − Y_{P2})²] )    (26)
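Eq. 26 is a one-line computation over paired forecast and observation vectors; a minimal sketch follows, with illustrative input values.

```python
import math

# Sketch of Eq. 26: root-mean-squared deviation between in-sample forecasts
# and out-of-sample observed engagement rates.
def rmsd(forecast, observed):
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast)
    )

print(rmsd([0.10, 0.30, 0.50], [0.10, 0.30, 0.50]))  # → 0.0
```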

References

[1] C. Jennison and B.W. Turnbull. Group Sequential Tests and Repeated Confidence Intervals, with Applications to Normal and Binary Responses. Biometrics, Vol. 49, 1993, pp. 31-43.

[2] W.Q. Meeker. A Conditional Sequential Test for the Equality of Two Binomial Proportions. Journal of the Royal Statistical Society, Series C, Vol. 30, 1981, pp. 109-115.

[3] A.E. Sartori. An Estimator for Some Binary-Outcome Selection Models Without Exclusion Restrictions. 2003.

[4] W. Lehmacher and G. Wassmer. Adaptive Sample Size Calculations in Group Sequential Trials. Biometrics, Vol. 55, 1999, pp. 1286-1290.

[5] R. Simon, G.H. Weiss, and D.G. Hoel. Sequential Analysis of Binomial Clinical Trials. Biometrika, Vol. 62, 1975, pp. 195-200.

[6] D.A. Schoenfeld. A Simple Algorithm for Designing Group Sequential Trials. Biometrics, Vol. 57, 2001, pp. 972-974.

[7] L.D. Brown, T.T. Cai, and A. DasGupta. Confidence Intervals for a Binomial Proportion and Asymptotic Expansions. The Annals of Statistics, Vol. 30, No. 1, 2002, pp. 160-201.

[8] D. Krackhardt. Structure, Culture and Simmelian Ties in Entrepreneurial Firms. Social Networks, Vol. 24, No. 3, 2002, pp. 279-290.
