
1041-4347 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2705101, IEEETransactions on Knowledge and Data Engineering


Mining Competitors from Large Unstructured Datasets

George Valkanas, Theodoros Lappas, and Dimitrios Gunopulos,

Abstract—In any competitive business, success is based on the ability to make an item more appealing to customers than the competition. A number of questions arise in the context of this task: how do we formalize and quantify the competitiveness between two items? Who are the main competitors of a given item? What are the features of an item that most affect its competitiveness? Despite the impact and relevance of this problem to many domains, only a limited amount of work has been devoted toward an effective solution. In this paper, we present a formal definition of the competitiveness between two items, based on the market segments that they can both cover. Our evaluation of competitiveness utilizes customer reviews, an abundant source of information that is available in a wide range of domains. We present efficient methods for evaluating competitiveness in large review datasets and address the natural problem of finding the top-k competitors of a given item. Finally, we evaluate the quality of our results and the scalability of our approach using multiple datasets from different domains.

Index Terms—Data mining, Web mining, Information Search and Retrieval, Electronic commerce


1 INTRODUCTION

A long line of research has demonstrated the strategic importance of identifying and monitoring a firm's competitors [1]. Motivated by this problem, the marketing and management communities have focused on empirical methods for competitor identification [2], [3], [4], [5], [6], as well as on methods for analyzing known competitors [7]. Extant research on the former has focused on mining comparative expressions (e.g. "Item A is better than Item B") from the Web or other textual sources [8], [9], [10], [11], [12], [13]. Even though such expressions can indeed be indicators of competitiveness, they are absent in many domains. For instance, consider the domain of vacation packages (e.g. flight-hotel-car combinations). In this case, items have no assigned name by which they can be queried or compared with each other. Further, the frequency of textual comparative evidence can vary greatly across domains. For example, when comparing brand names at the firm level (e.g. "Google vs Yahoo" or "Sony vs Panasonic"), it is indeed likely that comparative patterns can be found by simply querying the web. However, it is easy to identify mainstream domains where such evidence is extremely scarce, such as shoes, jewelry, hotels, restaurants, and furniture. Motivated by these shortcomings, we propose a new formalization of the competitiveness between two items, based on the market segments that they can both cover. Formally:

Definition 1. [Competitiveness]: Let U be the population of all possible customers in a given market. We consider that an item i covers a customer u ∈ U if it can cover all of the customer's requirements. Then, the competitiveness between two items i, j is proportional to the number of customers that they can both cover.

• G. Valkanas and T. Lappas are with the School of Business, Stevens Institute of Technology, Hoboken, NJ, 07030. E-mails: [email protected], [email protected]

• D. Gunopulos is with the University of Athens. E-mail: [email protected]

Our competitiveness paradigm is based on the following observation: the competitiveness between two items is based on whether they compete for the attention and business of the same groups of customers (i.e. the same market segments). For example, two restaurants that exist in different countries are obviously not competitive, since there is no overlap between their target groups. Consider the example shown in Figure 1.

Fig. 1: A (simplified) example of our competitiveness paradigm

The figure illustrates the competitiveness between three items i, j and k. Each item is mapped to the set of features that it can offer to a customer. Three features are considered in this example: A, B and C. Even though this simple example considers only binary features (i.e. available/not available), our actual formalization accounts for a much richer space including binary, categorical and numerical features. The left side of the figure shows three groups of customers g1, g2, and g3. Each group represents a different market segment. Users are grouped based on their preferences with respect to the features. For example, the customers in g2 are only interested in features A and B. We observe that items i and k are not competitive, since they simply do not appeal to the same groups of customers. On the other hand, j competes with both i (for groups g1 and g2) and k (for


g3). Finally, an interesting observation is that j competes for 4 users with i and for 9 users with k. In other words, k is a stronger competitor for j, since it claims a much larger portion of its market share than i.

This example illustrates the ideal scenario, in which we have access to the complete set of customers in a given market, as well as to specific market segments and their requirements. In practice, however, such information is not available. In order to overcome this, we describe a method for computing all the segments in a given market based on mining large review datasets. This method allows us to operationalize our definition of competitiveness and address the problem of finding the top-k competitors of an item in any given market. As we show in our work, this problem presents significant computational challenges, especially in the presence of large datasets with hundreds or thousands of items, such as those that are often found in mainstream domains. We address these challenges via a highly scalable framework for top-k computation, including an efficient evaluation algorithm and an appropriate index.

Our work makes the following contributions:

• A formal definition of the competitiveness between two items, based on their appeal to the various customer segments in their market. Our approach overcomes the reliance of previous work on scarce comparative evidence mined from text.

• A formal methodology for the identification of the different types of customers in a given market, as well as for the estimation of the percentage of customers that belong to each type.

• A highly scalable framework for finding the top-k competitors of a given item in very large datasets.

2 DEFINING COMPETITIVENESS

The typical user session on a review platform, such as Yelp, Amazon or TripAdvisor, consists of the following steps:

1) Specify all required features in a query.
2) Submit the query to the website's search engine and retrieve the matching items.
3) Process the reviews of the returned items and make a purchase decision.

In this setting, items that cover the user's requirements will be included in the search engine's response and will compete for her attention. On the other hand, non-covering items will not be considered by the user and, thus, will not have a chance to compete. Next, we present an example that extends this decision-making process to a multi-user setting.

Consider a simple market with 3 hotels i, j, k and 6 binary features: bar, breakfast, gym, parking, pool, wi-fi. Table 1 includes the value of each hotel for each feature. In this simple example, we assume that the market includes 6 mutually exclusive customer segments (types). Each segment is represented by a query that includes the features that are of interest to the customers included in the segment. Information on each segment is provided in Table 2. For instance, the first segment includes 100 customers who are interested in parking and wi-fi, while the second segment includes 50 customers who are only interested in parking.

TABLE 1: Hotels and their Features.

Name      Bar   Breakfast   Gym   Parking   Pool   Wi-Fi
Hilton    Yes   No          Yes   Yes       Yes    Yes
Marriot   Yes   Yes         No    Yes       Yes    Yes
Westin    No    Yes         Yes   Yes       No     Yes

TABLE 2: Customer Segments

ID   Segment Size   Features of Interest
q1   100            (parking, wi-fi)
q2   50             (parking)
q3   60             (wi-fi)
q4   120            (gym, wi-fi)
q5   250            (breakfast, parking)
q6   80             (gym, bar, breakfast)

In order to measure the competition between any two hotels, we need to identify the number of customers that they can both satisfy. The results are shown in Table 3. The Hilton and the Marriot can cover segments q1, q2, and q3. Therefore, they compete for (100 + 50 + 60)/660 ≈ 32% of the entire market. We observe that this is the lowest competitiveness achieved for any pair, even though the two hotels are also the most similar. In fact, the highest competitiveness is observed between the Marriot and the Westin, which compete for 70% of the market. This is a critical observation that demonstrates that similarity is not a good proxy for competitiveness. The explanation is intuitive. The availability of both a pool and a bar makes the Hilton and the Marriot more similar to each other and less similar to the Westin. However, neither of these features has an effect on competitiveness. First, the pool feature is not required by any of the customers in this market. Second, even though the availability of a bar is required by segment q6, none of the three hotels can cover all three of this segment's requirements. Therefore, none of the hotels compete for this particular segment.

Another intuitive observation is that the size of the segment has a direct effect on competitiveness. For example, even though the Westin shares the same number of segments (4) with each of the other two hotels, its competitiveness with the Marriot is significantly higher. This is due to the size of the q5 segment, which is more than double the size of q4.

TABLE 3: Common segments for hotel pairs

Hotel Pairs        Common Segments      Common %
Hilton, Marriot    (q1, q2, q3)         32%
Hilton, Westin     (q1, q2, q3, q4)     50%
Marriot, Westin    (q1, q2, q3, q5)     70%
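The numbers in Table 3 can be reproduced with a short script. The sketch below is illustrative only (the hotel and segment data are hardcoded from Tables 1 and 2; in this binary setting, a hotel covers a segment only if it offers every requested feature):

```python
# A minimal sketch of the hotel example from Tables 1 and 2.
hotels = {
    "Hilton":  {"bar", "gym", "parking", "pool", "wi-fi"},
    "Marriot": {"bar", "breakfast", "parking", "pool", "wi-fi"},
    "Westin":  {"breakfast", "gym", "parking", "wi-fi"},
}
segments = {  # query -> number of customers (Table 2)
    ("parking", "wi-fi"): 100,
    ("parking",): 50,
    ("wi-fi",): 60,
    ("gym", "wi-fi"): 120,
    ("breakfast", "parking"): 250,
    ("gym", "bar", "breakfast"): 80,
}
total = sum(segments.values())  # 660 customers in the market

def competitiveness(a, b):
    """Fraction of the market whose requirements both hotels cover."""
    common = sum(size for query, size in segments.items()
                 if set(query) <= hotels[a] and set(query) <= hotels[b])
    return common / total

print(round(competitiveness("Hilton", "Marriot"), 2))  # prints 0.32
print(round(competitiveness("Hilton", "Westin"), 2))   # prints 0.5
print(round(competitiveness("Marriot", "Westin"), 2))  # prints 0.7
```

Note that the pair with the most similar amenities (Hilton, Marriot) scores lowest, matching the discussion above.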

The above example is limited to binary features. In this simple setting, it is trivial to determine if two items can both cover a feature. However, as we discuss in detail in Section 2.1, the items in a market can have different types of features (e.g. numeric) that may be only partially covered by two items. Formally, let p(q) be the percentage of users represented by a query q and let V^q_{i,j} be the pairwise coverage offered by two items i and j to the space defined by the features in q. Then, we define the competitiveness between i and j in a market with a feature subset F as follows:


C_F(i, j) = Σ_{q ∈ 2^F} p(q) × V^q_{i,j}    (1)

This definition has a clear probabilistic interpretation: given two items i, j, their competitiveness C_F(i, j) represents the probability that the two items are included in the consideration set of a random user. This new definition has direct implications for consumers, who often rely on recommendation systems to help them choose one of several candidate products. The ability to measure the competitiveness between two items enables the recommendation system to strategically select the order in which items should be recommended, or the sets of items that should be included together in a group recommendation. For instance, if a random user u shows interest in an item i, then she is also likely to be interested in the items with the highest C_F(i, ·) values. Such competitive items are likely to meet the criteria satisfied by i and even cover additional parts of the feature space. In addition, as the user u rates more items and the system gains a more accurate view of her requirements, our competitiveness measure can be trivially adjusted to consider only those features from F (and only those value intervals within each feature) that are relevant for u. This competitiveness-based recommendation paradigm is a departure from the standard approach that adjusts the weight (relevance) of an item j for a user u based on the rating that u submits for items similar to j. As discussed, this approach ignores that (i) the similarity may be due to irrelevant or trivial features and (ii) for a user who likes an item i, an item j that is far superior to i with respect to the user's requirements (and thus quite different) is a better recommendation candidate than an item j′ that is highly similar to i.

In the following two sections we describe the computation of the two primary components of competitiveness: (1) the pairwise coverage V^q_{i,j} of a query that includes binary, categorical, ordinal or numeric features, and (2) the percentage p(q) of users represented by each query q.

2.1 Pairwise Coverage

We begin by defining the pairwise coverage of a single feature f. We then define the pairwise coverage of an entire query of features q.

Definition 2. [Pairwise Feature Coverage]: We define the pairwise coverage V^f_{i,j} of a feature f by two items i, j as the percentage of all possible values of f that can be covered by both i and j. Formally, given the set of all possible values V^f for f, we define:

V^f_{i,j} = |{v ∈ V^f : v∠f[i] ∧ v∠f[j]}| / |V^f|,

where v∠f[i] represents that v is covered by the value of item i for feature f.

Next, we describe the computation of V^f_{i,j} for different types of features.

[Binary and Categorical Features]: Categorical features take one or more values from a finite space. Examples of single-value features include the brand of a digital camera or the location of a restaurant. Examples of multi-value features include the amenities offered by a hotel or the types of cuisine offered by a restaurant. Any categorical feature can be encoded via a set of binary features, with each binary feature indicating the (lack of) coverage of one of the original feature's possible values. In this simple setting, the feature can be fully covered (if f[i] = f[j] = 1 or, equivalently, f[i] × f[j] = 1), or not covered at all. Formally, the pairwise coverage of a binary feature f by two items i, j can be computed as follows:

V^f_{i,j} = f[i] × f[j]    (binary features)    (2)

[Numeric Features]: Numeric features take values from a pre-defined range. Henceforth, without loss of generality, we consider numeric features that take values in [0, 1], with higher values being preferable. The pairwise coverage of a numeric feature f by two items i and j can be easily computed as the smallest (worst) value achieved for f by either item. For instance, consider two restaurants i, j with values 0.8 and 0.5 for the feature food quality. Their pairwise coverage in this setting is 0.5. Conceptually, the two items will compete for any customer who accepts a quality ≤ 0.5. Customers with higher standards would eliminate restaurant j, which will never have a chance to compete for their business. Formally, the pairwise coverage of a numeric feature f by two items i, j can be computed as follows:

V^f_{i,j} = min(f[i], f[j])    (numeric features)    (3)

[Ordinal Features]: Ordinal features take values from a finite ordered list. A characteristic example is the popular five-star scale used to evaluate the quality of a service or product. For example, consider that the values of two items i and j on the 5-star rating scale are ⋆⋆ and ⋆⋆⋆, respectively. Customers that demand at least 4 stars will not consider either of the two items, while customers that demand at least 3 stars will only consider item j. The two items will thus compete for all customers that are willing to accept 1 or 2 stars. Therefore, as in the case of numeric features, the pairwise coverage for ordinal features is determined by the worst of the two values. In this example, given that the two items compete for 2 of the 5 levels of the ordinal scale (1 and 2 stars), their competitiveness is proportional to 2/5 = 0.4. Formally, the pairwise coverage of an ordinal feature f by two items i, j can be computed as follows:

V^f_{i,j} = min(f[i], f[j]) / |V^f|    (ordinal features)    (4)

Pairwise coverage of a feature query: We now discuss how coverage can be extended to the query level. Figure 2 visualizes a query q that includes two numeric features f1 and f2. The figure also includes two competitive items i and j, positioned according to their values for the two features: f1[i] = 0.3, f2[i] = 0.3, f1[j] = 0.2, and f2[j] = 0.7. We observe that the percentage of the 2-dimensional space that each item covers is equivalent to the area of the rectangle defined by the beginning of the two axes (0, 0) and the item's values for f1 and f2. For example, the covered area for item i is 0.3 × 0.3 = 0.09, equal to 9% of the entire space. Similarly, the pairwise coverage provided by both items is equal to 0.2 × 0.3 = 0.06 (i.e. 6% of the market).

Per our example, the pairwise coverage of a given query q by two items i, j can be measured as the volume of the



Fig. 2: Geometric interpretation of pairwise coverage

hyper-rectangle defined by the pairwise coverage provided by the two items for each feature f ∈ q. Formally:

V^q_{i,j} = ∏_{f ∈ q} V^f_{i,j}    (5)

Eq. 5 allows us to compute the pairwise coverage of any query of features, as required by the definition of competitiveness in Eq. 1.
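Combining Eq. 5 with Eq. 1 gives a direct way to score an item pair once the per-feature coverages and the query probabilities are known. A minimal sketch (function names are our own; the per-feature coverages for Figure 2's pair are precomputed by hand):

```python
from math import prod

def query_coverage(cov, query):
    """Eq. (5): coverage of a query is the product of per-feature coverages.
    cov maps feature name -> V^f_{i,j} for the fixed item pair."""
    return prod(cov[f] for f in query)

def cf_score(cov, p):
    """Eq. (1): competitiveness = sum over queries q of p(q) * V^q_{i,j}."""
    return sum(pq * query_coverage(cov, q) for q, pq in p.items())

# Figure 2's item pair: V^f1 = min(0.3, 0.2) = 0.2 and V^f2 = min(0.3, 0.7) = 0.3.
cov = {"f1": 0.2, "f2": 0.3}
print(round(query_coverage(cov, ("f1", "f2")), 2))  # prints 0.06 (6% of the space)
```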

2.2 Estimating Query Probabilities

The definition of competitiveness given in Eq. 1 considers the probability p(q) that a random customer will be represented by a specific query of features q, for every possible query q ∈ 2^F. In this section, we describe how these probabilities can be estimated from real data. Feature queries are a direct representation of user preferences. Ideally, we would have access to the query logs of the platform's (e.g. Amazon's or TripAdvisor's) search engine. In practice, however, the sensitive and proprietary nature of such information makes it very hard for firms to share publicly. Therefore, we design an estimation process that only requires access to an abundant resource: customer reviews. Each review includes a customer's opinions on a particular subset of features of the reviewed item. Extant research has repeatedly validated the use of reviews to estimate user preferences with respect to different features in multiple domains, such as phone apps [14], movies [15], electronics [16], and hotels [17].

A trivial approach would be to estimate the demand for each feature separately, and then aggregate the individual estimates at the subset level. However, this approach assumes feature independence, a strong assumption that would first have to be validated across domains. To avoid this assumption and capture possible feature correlations, we consider all the features mentioned in each review as a single query. We then compute the frequency of each query q in our review corpus R, and divide it by the sum of the frequencies of all queries. This gives us an estimate of the probability that a random user will be interested in exactly the set of features included in q. Formally:

p(q) = freq(q, R) / Σ_{q′ ∈ 2^F} freq(q′, R)    (6)

Ideally, we would have access to the set of requirements of every possible customer in existence. The maximum likelihood estimate of Eq. 6 would then compute the exact occurrence probability of any query q. While this type of global access is unrealistic, Eq. 6 can still deliver accurate estimates if the number of reviews in R is large enough to accurately represent the customer population. The usefulness of the estimator is thus determined by a simple question: how many reviews do we need to achieve accurate estimates? We address this question in Section 5.7 of the experiments, where we present our results on datasets from different domains.
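Under the assumption that each review's mentioned features form one query, Eq. 6 is a simple relative-frequency estimate. A sketch with a hypothetical mini-corpus:

```python
from collections import Counter

def estimate_query_probabilities(reviews):
    """Eq. (6): treat the set of features mentioned in each review as one
    query, and estimate p(q) by its relative frequency in the corpus."""
    freq = Counter(frozenset(features) for features in reviews)
    total = sum(freq.values())
    return {q: n / total for q, n in freq.items()}

# Hypothetical mini-corpus: each review -> the features it mentions.
reviews = [
    {"parking", "wi-fi"}, {"parking", "wi-fi"}, {"parking"},
    {"gym", "wi-fi"}, {"breakfast", "parking"},
]
p = estimate_query_probabilities(reviews)
print(p[frozenset({"parking", "wi-fi"})])  # prints 0.4 (2 of 5 reviews)
```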

2.3 Extending our Competitiveness Definition

Feature Uniformity: Our competitiveness definition assumes that user requirements are uniformly distributed within the value space of each feature. This assumption allows us to build a computational model for competitiveness, but in practice it may not always be true. For instance, the number of users demanding quality in [0, 0.1] might be different than those demanding a value in [0.4, 0.5]. Moreover, for lack of more accurate information, it provides a conservative lower bound of our model's true effectiveness: having access to the distribution of interest within each feature could only improve the quality of our results.

If such information was indeed available, then the naive approach would be to consider all possible combinations of interest intervals for all possible queries. Henceforth, we refer to these as extended queries. Clearly, the number of possible extended queries is exponential and renders the computational cost of any evaluation algorithm prohibitive. This limitation can be addressed by organizing the dataset into a multi-dimensional grid, where each feature represents a different dimension. Each cell in the grid represents a different extended query (i.e. a set of features and an interest interval for each feature). We can then compute the competitiveness between two items by simply counting the number of data points that fall in the cells that they can both cover.
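Counting the data points that fall in the cells covered by both items reduces to summing an axis-aligned sub-rectangle of the grid, which a prefix-sum array answers in constant time after one linear preprocessing pass. A two-feature sketch with hypothetical cell counts:

```python
def prefix_sums(grid):
    """2-D inclusion-exclusion prefix-sum array over per-cell user counts,
    so any axis-aligned rectangle sum becomes an O(1) lookup."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ps[r + 1][c + 1] = (grid[r][c] + ps[r][c + 1]
                                + ps[r + 1][c] - ps[r][c])
    return ps

def rect_sum(ps, r0, c0, r1, c1):
    """Users whose interest intervals fall in cells [r0, r1) x [c0, c1)."""
    return ps[r1][c1] - ps[r0][c1] - ps[r1][c0] + ps[r0][c0]

# Hypothetical counts of users per (f1-interval, f2-interval) cell.
grid = [[1, 2],
        [3, 4]]
ps = prefix_sums(grid)
print(rect_sum(ps, 0, 0, 2, 2))  # prints 10: all users in the grid
print(rect_sum(ps, 0, 0, 1, 2))  # prints 3: first row of cells only
```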

We can also precompute the sums of each cell offline with the prefix-sum array technique [18], as well as reduce the space complexity via approximations [19], [20] or multi-dimensional histograms [21], [22]. A parameter of the grid-construction process is the cell size, with larger cells sacrificing accuracy for the sake of efficiency. In practice, this parameter will be determined by the granularity of the input data, as well as the practitioner's computational constraints.

Feature Importance: A second assumption of our competitiveness definition is that all the features in a query q are equally important. However, a user who submits the query q = (f1, f2) may care more about f1 than about f2. As with the case of feature uniformity, the consideration of such weights requires the availability of appropriate data that is rarely available in practice. Nonetheless, we can address this limitation by extending our definition of pairwise coverage. For instance, consider that the feature weights are in [0, 1] and that the weights for f1 and f2 are w1 = 0.8 and w2 = 0.4, respectively. We are then given two items i, j such that: f1[i] = 0.5, f2[i] = 0.3, f1[j] = 0.5, f2[j] = 0.6. As per our initial definition, the pairwise coverage of the 2-dimensional space by the two items is min(0.5, 0.5) × min(0.3, 0.6) = 0.5 × 0.3 = 0.15. If we consider the feature weights, the computation becomes: (w1 × 0.5) × (w2 × 0.3) = 0.048. Formally, this extension translates to the introduction of the


feature weight as a multiplier for the right-hand side of Eq. 3. Note that, while this example includes only numeric features, the same extension for categorical and ordinal attributes trivially follows.

3 FINDING THE TOP-K COMPETITORS

Given the definition of competitiveness in Eq. 1, we study the natural problem of finding the top-k competitors of a given item. Formally:

Problem 1. [Top-k Competitors Problem]: We are presented with a market with a set of n items I and a set of features F. Then, given a single item i ∈ I, we want to identify the k items from I that maximize C_F(i, ·).

A naive algorithm would compute the competitiveness between i and every possible candidate. The complexity of this brute-force method is clearly Θ(2^|F| × n^2 × log k), which can be easily dominated by the power-set factor and, as we demonstrate in our experiments, is impractical for large datasets. One option could be to perform the naive computation in a distributed fashion. Even in this case, however, we would need one thread for each of the n^2 pairs. This is far from trivial, if one considers that n could measure in the tens of thousands. In addition, a naive MapReduce implementation would face the bottleneck of passing everything through the reducer to account for the self-join included in the computation. In practice, the self-join would have to be implemented via a customized technique for reduce-side joins, which is a non-trivial and highly expensive operation [23].
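For contrast, the brute-force baseline discussed above can be sketched in a few lines; its cost is dominated by evaluating C_F for all n − 1 candidates (the scoring function and scores below are hypothetical stand-ins):

```python
import heapq

def naive_top_k(items, target, cf, k):
    """Brute-force baseline: score every candidate with the (expensive)
    competitiveness function cf and keep the k best via a heap."""
    candidates = (j for j in items if j != target)
    return heapq.nlargest(k, candidates, key=lambda j: cf(target, j))

# Hypothetical precomputed scores standing in for C_F(i, .).
scores = {("i", "a"): 0.7, ("i", "b"): 0.2, ("i", "c"): 0.5}
print(naive_top_k(["i", "a", "b", "c"], "i",
                  lambda i, j: scores[(i, j)], 2))  # prints ['a', 'c']
```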

These issues motivate us to introduce CMiner, an efficient exact algorithm for Problem 1. Except for the creation of our indexing mechanism, every other aspect of CMiner can also be incorporated in a parallel solution.

First, we define the concept of item dominance, which will aid us in our analysis:

Definition 3. [Item Dominance]: Consider a market with a set of items I and a set of features F. Then, we say that an item i ∈ I dominates another item j ∈ I, if f[i] ≥ f[j] for every feature f ∈ F.
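Definition 3 translates directly into code. A minimal sketch (the dict-based item representation and the function name are hypothetical):

```python
def dominates(i, j, features):
    """Per Definition 3: i dominates j if f[i] >= f[j] for every feature f."""
    return all(i[f] >= j[f] for f in features)
```

Note that under this definition an item trivially dominates itself, and two identical items dominate each other; skyline implementations typically add a strictness condition to break such ties.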

Conceptually, an item dominates another if it has better or equal values across features. We observe that, per Eq. 1, any item i that dominates j also achieves the maximum possible competitiveness with j, since it can cover the requirements of any customer covered by j. This motivates us to utilize the skyline of the entire set of items I. The skyline is a well-studied concept that represents the subset of points in a population that are not dominated by any other point [24]. We refer to the skyline of a set of items I as Sky(I). The concept of the skyline leads to the following lemma:

Lemma 1. Given the skyline Sky(I) of a set of items I and an item i ∈ I, let Y contain the k items from Sky(I) that are most competitive with i. Then, an item j ∈ I can only be in the top-k competitors of i, if j ∈ Y or if j is dominated by one of the items in Y.

We present the proof of Lemma 1 in Appendix B. Lemma 1 verifies that we do not need to consider the entire set of candidates in order to find the top-k competitors. This motivates us to construct the skyline pyramid, a structure that greatly reduces the number of items that need to be considered. We refer to the algorithm used to construct the skyline pyramid as PyramidFinder. The input to PyramidFinder is the set of items I. The output is the skyline pyramid DI. The algorithm relies on the extraction of the skyline layers of the dataset, using a modified version of BBS [25], [26]. Each item from the ith skyline layer is then assigned an inlink from all items of the (i − 1)th layer that dominate it. We present the complete pseudocode and complexity analysis of the algorithm in Appendix C.
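The layered structure can be sketched with a naive quadratic peel; the actual algorithm uses a modified BBS instead, so this is only an illustration (items are tuples of feature values, assumed distinct):

```python
def dominates(i, j):
    # i dominates j: better-or-equal value on every feature
    return all(a >= b for a, b in zip(i, j))

def skyline_layers(items):
    """Layer 0 is Sky(I); each subsequent layer is the skyline of what remains."""
    layers, rest = [], list(items)
    while rest:
        layer = [i for i in rest
                 if not any(dominates(j, i) for j in rest if j != i)]
        layers.append(layer)
        rest = [i for i in rest if i not in layer]
    return layers

def pyramid_links(layers):
    """inlinks[j]: items of the previous layer that dominate j."""
    return {j: [i for i in prev if dominates(i, j)]
            for prev, cur in zip(layers, layers[1:]) for j in cur}
```

For example, the points (3, 3), (2, 2), (1, 1), (3, 1) peel into three layers, with (3, 3) alone on the first skyline.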


Fig. 3: The left side shows the dominance graph for a set of items. An edge Ii → Ij means that Ii dominates Ij. The right side of the figure shows the skyline pyramid.

The CMiner Algorithm: Next, we present CMiner, an exact algorithm for finding the top-k competitors of a given item. Our algorithm makes use of the skyline pyramid in order to reduce the number of items that need to be considered. Given that we only care about the top-k competitors, we can incrementally compute the score of each candidate and stop when it is guaranteed that the top-k have emerged. The pseudocode is given in Algorithm 1.

Discussion of CMiner: The input includes the set of items I, the set of features F, the item of interest i, the number k of top competitors to retrieve, the set Q of queries and their probabilities, and the skyline pyramid DI. The algorithm first retrieves the items that dominate i, via masters(i) (line 1). These items have the maximum possible competitiveness with i. If at least k such items exist, we report those and conclude (lines 2-4). Otherwise, we add them to TopK and decrement our budget of k accordingly (line 5). The variable LB maintains the lowest lower bound from the current top-k set (line 6) and is used to prune candidates. In line 7, we initialize the set of candidates X as the union of the items in the first layer of the pyramid and the set of items dominated by those already in the TopK. This is achieved via calling GETSLAVES(TopK, DI). In every iteration of lines 8-17, CMiner feeds the set of candidates X to the UPDATETOPK() routine, which prunes items based on the LB threshold. It then updates the TopK set via the MERGE() function, which identifies the items with the highest competitiveness from TopK ∪ X. This can be achieved in linear time, since both X and TopK are sorted. In line 13, the pruning threshold LB is set to the worst (lowest) score among the new TopK. Finally, GETSLAVES() is used to expand the set of candidates by including items that are dominated by those in X.

Discussion of UPDATETOPK(): This routine processes the candidates in X and finds at most k candidates with the highest competitiveness with i. The routine utilizes a data structure localTopK, implemented as an associative array: the score of each candidate serves as the key, while its id serves as the value. The array is key-sorted, to facilitate


Algorithm 1 CMiner
Input: Set of items I, item of interest i ∈ I, feature space F, collection Q ∈ 2^F of queries with non-zero weights, skyline pyramid DI, int k
Output: Set of top-k competitors for i

 1: TopK ← masters(i)
 2: if ( k ≤ |TopK| ) then
 3:     return TopK
 4: end if
 5: k ← k − |TopK|
 6: LB ← −1
 7: X ← GETSLAVES(TopK, DI) ∪ DI[0]
 8: while ( |X| ≠ 0 ) do
 9:     X ← UPDATETOPK(k, LB, X)
10:     if ( |X| ≠ 0 ) then
11:         TopK ← MERGE(TopK, X)
12:         if ( |TopK| = k ) then
13:             LB ← WORSTIN(TopK)
14:         end if
15:         X ← GETSLAVES(X, DI)
16:     end if
17: end while
18: return TopK

19: Routine UPDATETOPK(k, LB, X)
20: localTopK ← ∅
21: low(j) ← 0, ∀j ∈ X
22: up(j) ← Σ_{q∈Q} p(q) × V^q_{i,i}, ∀j ∈ X
23: for every q ∈ Q do
24:     maxV ← p(q) × V^q_{i,i}
25:     for every item j ∈ X do
26:         up(j) ← up(j) − maxV + p(q) × V^q_{i,j}
27:         if ( up(j) < LB ) then
28:             X ← X \ {j}
29:         else
30:             low(j) ← low(j) + p(q) × V^q_{i,j}
31:             localTopK.update(j, low(j))
32:             if ( |localTopK| ≥ k ) then
33:                 LB ← WORSTIN(localTopK)
34:             end if
35:         end if
36:     end for
37:     if ( |X| ≤ k ) then
38:         break
39:     end if
40: end for
41: for every item j ∈ X do
42:     for every remaining q ∈ Q do
43:         low(j) ← low(j) + p(q) × V^q_{i,j}
44:     end for
45:     localTopK.update(j, low(j))
46: end for
47: return TOPK(localTopK)

the computation of the k best items. The structure is automatically truncated so that it always contains at most k items. In lines 21-22 we initialize the lower and upper bounds. For every item j ∈ X, low(j) maintains the current competitiveness score of j as new queries are considered, and serves as a lower bound to the candidate's actual score. Each lower bound low(j) starts from 0, and after the completion of UPDATETOPK(), it includes the true competitiveness score CF(i, j) of candidate j with the focal item i. On the other hand, up(j) is an optimistic upper bound on j's competitiveness score. Initially, up(j) is set to the maximum possible score (line 22). This is equal to Σ_{q∈Q} p(q) × V^q_{i,i}, where V^q_{i,i} is simply the coverage provided exclusively by i to q. It is then incrementally reduced toward the true CF(i, j) value as follows. For every query q ∈ Q, maxV holds the maximum possible competitiveness between item i and any other item for that query, which is in fact the coverage of i with respect to q. Then, for each candidate j ∈ X, we subtract maxV from up(j) and then add to it the actual competitiveness between i and j for query q. If the upper bound up(j) of a candidate j becomes lower than the pruning threshold LB, then j can be safely disqualified (lines 27-29). Otherwise, low(j) is updated and j remains in consideration (lines 30-31). After each update, the value of LB is set to the worst score in localTopK (lines 32-33), to employ stricter pruning in future iterations.
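The bound maintenance described above can be sketched as a simplified single pass, with V[q][(a, b)] standing in for the pairwise coverage V^q_{a,b} (all names hypothetical; the actual routine interleaves this with the localTopK updates):

```python
def prune_candidates(i, cands, queries, V, LB):
    """Incremental lower/upper bounds; prune j as soon as up(j) < LB."""
    self_cov = sum(p * V[q][(i, i)] for q, p in queries)  # CF(i, i)
    low = {j: 0.0 for j in cands}
    up = {j: self_cov for j in cands}       # optimistic initial upper bound
    alive = set(cands)
    for q, p in queries:
        max_v = p * V[q][(i, i)]            # best possible gain for query q
        for j in list(alive):
            up[j] += p * V[q][(i, j)] - max_v
            if up[j] < LB:
                alive.discard(j)            # safely disqualified
            else:
                low[j] += p * V[q][(i, j)]
    return alive, low                       # low[j] == CF(i, j) for survivors
```

After all queries are processed, low[j] equals the exact competitiveness score of every surviving candidate.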

If the number of candidates |X| becomes less than or equal to k (line 37), the loop over the queries comes to a halt. This is an early-stopping criterion: since our goal is to retrieve the best k candidates in X, having |X| ≤ k means that all remaining candidates should be returned. In lines 41-46 we complete the competitiveness computation of the remaining candidates and update localTopK accordingly. This takes place after the completion of the first loop, in order to avoid unnecessary bound-checking and improve performance.

Complexity: If the item of interest i is dominated by at least k items, then these will be returned by masters(i). This step can be done in O(k), by iteratively retrieving k items that dominate i. Otherwise, the complexity of CMiner is controlled by UPDATETOPK(), which depends on the number of items in the candidate set X. In its simplest form, in the k-th call of the method, the candidate set contains the entire k-th skyline layer, DI[k]. According to Bentley et al. [27], for n uniformly-distributed d-dimensional data points (items), the expected size of the skyline (1st layer) is |DI[0]| = Θ(ln^{d−1} n / (d−1)!). UPDATETOPK() will be called at most k times, each time fetching (at least) 1 new item, meaning that we will evaluate O(k × ln^{d−1} n / (d−1)!) items. For each candidate, we need to iterate over the |Q| queries and update the TopK structure with the new score, which takes O(log k) time using a Red-Black tree, for a total complexity of O(|Q| × k × log k × ln^{d−1} n / (d−1)!). However, as we discuss next, this is a pessimistic analysis based on the naive assumption that each of the k layers will be considered entirely.

In practice, with the exception of the first layer, we only need to check a small fraction of the candidates in the skyline layers. For instance, in a uniform distribution with consecutive layers of similar size, the number of points to be considered will be in the order of k, since links will be evenly distributed among the skyline points. As we only expand the top-k items in each step, approximately k new items will be evaluated next, making the cost of UPDATETOPK() in subsequent calls O(|Q| × k × log k). Given that this cost is paid for each of the (at most) k − 1 iterations after the first one, the total cost becomes O(|Q| × (k^2 + ln^{d−1} n / (d−1)!) × log k). As we show in our experiments, the actual distributions found in real datasets allow for much faster computations. In the following section, we describe several speed-ups that can achieve significant savings in practice.

In terms of space, the UPDATETOPK() method accepts |X| items as input and operates on that set alone, resulting in O(|X|) space. For each item in X, we maintain its lower and upper bound, which is still O(|X|). As we iterate over the queries, we update those values and discard items, reducing the required space, bringing it closer to O(k). Since


the TopK structure always contains k entries, the space of CMiner is determined by X, which is at its maximum when we retrieve the first skyline layer (line 7). Our assumption that the primary skyline fits in memory is reasonable and shared by prior works on skyline algorithms [24].

4 BOOSTING THE CMINER ALGORITHM

Next, we describe several improvements that we have applied to CMiner in order to achieve computational savings while maintaining the exact nature of the algorithm.

4.1 Query Ordering

Our complexity analysis is based on the premise that CMiner evaluates all queries Q for each candidate item j. However, this assumption naively ignores the algorithm's pruning ability, which is based on using lower and upper bounds on competitiveness scores to eliminate candidates early. Next, we show how to greatly improve the algorithm's pruning effectiveness by strategically selecting the processing order of queries (line 23 of CMiner).

CMiner uses the following update rules for the lower and upper bounds of a candidate j:

low(j) ← low(j) + p(q) × V^q_{i,j}    (7)

up(j) ← up(j) − p(q) × V^q_{i,i} + p(q) × V^q_{i,j}    (8)

By expanding the sequences and using the initial values low(j) = 0 and up(j) = CF(i, i), we can re-write the bounds:

low_m(j) = Σ_{ℓ=1}^{m} p(q_ℓ) × V^{q_ℓ}_{i,j}

up_m(j) = CF(i, i) − Σ_{ℓ=1}^{m} p(q_ℓ) × V^{q_ℓ}_{i,i} + Σ_{ℓ=1}^{m} p(q_ℓ) × V^{q_ℓ}_{i,j},

where low_m(j) and up_m(j) are the values of the bounds after considering the m-th query q_m. We can then define a recursive function T(j) = up(j) − low(j), updated as follows:

T(j) ← T(j) − p(q) × V^q_{i,i}    (9)

T(j) captures the margin of error for the competitiveness between the item of interest i and a candidate j. As more queries are evaluated and the two bounds are updated, the margin decreases. Finally, it becomes equal to zero when we have the final CF(i, j) score. We hypothesize that the ability to minimize this margin faster can increase the number of pruned candidates, due to the existence of stricter bounds in early iterations. Given Eq. 7 and 8, the value of T(j) after considering m queries can be re-written as follows:

Tm(j) = CF (i, i)−m∑ℓ=1

p(qℓ)× V qℓi,i , (10)

where qℓ is the ℓth query processed by the algorithm. GivenEq. 10, it is clear that we can optimally minimize the marginbetween the lower and upper bounds on the competitive-ness of a candidate by processing queries in decreasingorder of their p(q) × V q

i,i values. We refer to this orderingscheme as COV. We evaluate the computational savingsachieved by COV in Section 5.4 of our experiments, wherewe also compare it with alternative approaches.

4.2 Improving UPDATETOPK() and GETSLAVES()

In this section we describe several improvements to CMiner's two main routines. We implement all of these improvements into an enhanced algorithm, which we refer to as CMiner++. We include this version in our experimental evaluation, where we compare its efficiency with that of CMiner, as well as with that of other baselines.

Even though CMiner can effectively prune low-quality candidates, a major bottleneck within the UPDATETOPK() function is the computation of the final competitiveness score between each candidate and the item of interest i (lines 41-46). Speeding up this computation can have a tremendous impact on the efficiency of our algorithm. Next, we illustrate this with an example. Assume that items are defined in a 4-dimensional space with features f1, f2, f3, f4. Without loss of generality, we assume that all features are numeric. We also consider 3 queries q1 = (f1, f2, f3), q2 = (f2, f3, f4) and q3 = (f2, f4), with probabilities w(q1), w(q2), and w(q3), respectively. In order to compute the competitiveness between two items i and j, we need to consider all queries and, according to Eq. 5, compute V^{q1}_{i,j} = V^{f1}_{i,j} × V^{f2}_{i,j} × V^{f3}_{i,j}, V^{q2}_{i,j} = V^{f2}_{i,j} × V^{f3}_{i,j} × V^{f4}_{i,j}, and V^{q3}_{i,j} = V^{f2}_{i,j} × V^{f4}_{i,j}. Given that the three queries include common sequences of factors, we would like to avoid repeating their computation, when possible. First, we sort all features according to their frequency in the given set of queries. In our example, the order is: f2, f3, f4, f1. In this order, (f2, f3) becomes a common prefix for q1 and q2, whereas f2 is a common prefix for all 3 queries. We then build a prefix tree to ensure that the computation of such common prefixes is only completed once. For instance, the computation of V^{f2}_{i,j} × V^{f3}_{i,j} is done only once and used for both q1 and q2. The tree is used in lines 41-46 of CMiner to expedite the computation of the competitiveness between the item of interest and the remaining candidates in X. This improvement is inspired by Huffman encoding, whereby frequent symbols (features in our case) are closer to the root, so that they are encoded with fewer bits. Note that Huffman encoding is optimal if the symbols are independent of each other, as is the case in our own setting.
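The prefix-sharing idea can be sketched as follows (queries as tuples of features, with Vf[f] standing in for the per-feature factor V^f_{i,j} of a fixed pair i, j; names hypothetical, and a flat cache stands in for the prefix tree):

```python
from collections import Counter

def query_coverages(queries, Vf):
    """Order features by frequency (most frequent first), then reuse the
    product of every common prefix, computing it only once."""
    freq = Counter(f for q in queries for f in q)
    cache = {(): 1.0}   # prefix tuple -> product of its factors
    out = {}
    for q in queries:
        prefix = ()
        for f in sorted(q, key=lambda f: (-freq[f], f)):
            nxt = prefix + (f,)
            if nxt not in cache:
                cache[nxt] = cache[prefix] * Vf[f]
            prefix = nxt
        out[q] = cache[prefix]
    return out
```

On the example above, the shared prefix (f2, f3) is multiplied out once and reused for both q1 and q2.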

The GETSLAVES() method is used to extend the set of candidates by including the items that are dominated by those in a provided set (lines 7 and 15). Henceforth, we refer to this as the dominator set. A naive implementation would include all items that are dominated by at least one item in the dominator set. However, as stated in Lemma 1, if an item j is dominated by an item j′, then the competitiveness of j with any item of interest cannot be higher than that of j′. This implies that items that are dominated by the k-th best item of the given set will have a competitiveness score lower than the current k-th score and will thus not be included in the final result. Therefore, we only need to expand the top k − 1 items, and only those that have not been expanded already during a previous iteration. In addition, the GETSLAVES() method can be further improved by using the lower bound LB (the score of the k-th best candidate) as follows: instead of returning all the items that are dominated by those in the dominator set, we only have to consider a dominated item j if CF(j, j) > LB. This is due to the fact that the competitiveness between i and j is upper-bounded by the minimum coverage achieved by


either of the two items (over all queries), i.e., CF(i, j) ≤ min(CF(i, i), CF(j, j)). Therefore, an item with a coverage ≤ LB cannot replace any of the items in the current TopK.
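Both refinements fit in a few lines. A sketch (dominators sorted by competitiveness with i, links mapping each item to the items it dominates, self_cov[j] standing in for CF(j, j); all names hypothetical):

```python
def get_slaves(dominators, links, expanded, self_cov, LB, k):
    """Expand only the top k-1 not-yet-expanded dominators, and keep a
    dominated item j only if CF(j, j) > LB, since
    CF(i, j) <= min(CF(i, i), CF(j, j))."""
    out = []
    for d in dominators[:k - 1]:
        if d not in expanded:
            expanded.add(d)
            out.extend(j for j in links.get(d, ()) if self_cov[j] > LB)
    return out
```

The mutable `expanded` set carries over between calls, so dominators already expanded in a previous iteration are skipped.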

5 EXPERIMENTAL EVALUATION

In this section we describe the experiments that we conducted to evaluate our methodology. All experiments were completed on a desktop with a quad-core 3.5 GHz processor and 2 GB RAM.

5.1 Datasets and Baselines

Our experiments include four datasets, which were collected for the purposes of this project. The datasets were intentionally selected from different domains to portray the cross-domain applicability of our approach. In addition to the full information on each item in our datasets, we also collected the full set of reviews that were available on the source website. These reviews were used to (1) estimate query probabilities, as described in Section 2.2, and (2) extract the opinions of reviewers on specific features. The highly-cited method by Ding et al. [28] is used to convert each review to a vector of opinions, where each opinion is defined as a feature-polarity combination (e.g. service+, food-). The percentage of reviews on an item that express a positive opinion on a specific feature is used as the feature's numeric value for that item. We refer to these as opinion features. Table 4 includes descriptive statistics for each dataset, while a detailed description is provided below.

CAMERAS: This dataset includes 579 digital cameras from Amazon.com. We collected the full set of reviews for each camera, for a total of 147192 reviews. The set of features includes the resolution (in MP), shutter speed (in seconds), zoom (e.g. 4x), and price. It also includes opinion features on manual, photos, video, design, flash, focus, menu options, lcd screen, size, features, lens, warranty, colors, stabilization, battery life, resolution, and cost.

HOTELS: This dataset includes 80799 reviews on 1283 hotels from Booking.com. The set of features includes the facilities, activities, and services offered by the hotel. All three of these multi-categorical features are available on the website. The dataset also includes opinion features on location, services, cleanliness, staff, and comfort.

RESTAURANTS: This dataset includes 30821 reviews on 4622 New York City restaurants from TripAdvisor.com. The set of features for this dataset includes the cuisine types and meal types (e.g. lunch, dinner) offered by the restaurant, as well as the activity types (e.g. drinks, parties) that it is good for. All three of these multi-categorical features are available on the website. The dataset also includes opinion features on food, service, value-for-money, atmosphere, and price.

RECIPES: This dataset includes 100000 recipes from Sparkrecipes.com. It also includes the full set of reviews on each recipe, for a total of 21685 reviews. The set of features for each recipe includes the number of calories, as well as the following nutritional information, measured in grams: fat, cholesterol, sodium, potassium, carb, fiber, protein, vitamin A, vitamin B12, vitamin C, vitamin E, calcium, copper, folate, magnesium, niacin, phosphorus, riboflavin, selenium, thiamin, zinc. All information is openly available on the website.

TABLE 4: Dataset Statistics

Dataset       #Items   #Feats.   #Subsets   Skyline Layers
CAMERAS          579        21      14779                5
HOTELS          1283         8        127                5
RESTAURANTS     4622         8         64               12
RECIPES       100000        22        133               22

For each dataset, the 2nd, 3rd, 4th and 5th columns include the number of items, the number of features, the number of distinct queries, and the number of layers in the respective skyline pyramid, respectively. In order to conclude the description of our datasets, we present some statistics on the skyline-pyramid structure constructed for each corpus. Figure 4 shows the distribution of items in the first 6 skyline layers of each dataset. We observe that, for

Fig. 4: Cumulative distribution of items across the first 6 layers of the skyline pyramid.

all datasets, nearly 99% of the items can be found within the first 4 layers, with the majority of those falling within the first 2 layers. This is due to the large dimensionality of the feature space, which makes it difficult for items to dominate one another. As we show in our experiments, the skyline pyramid enables CMiner to clearly outperform the baselines with respect to computational cost. This is despite the high concentration of items within the first layers, since CMiner can effectively traverse the pyramid and consider only a small fraction of these items.

Baselines: We compare CMiner with two baselines. The Naive baseline is the brute-force approach described in Section 3. The second is a clustering-based approach that first iterates over every query q, identifies the set of items that have the same value assignment for the features in q, and places them in the same group. The algorithm then iterates over the reported groups and updates the pairwise coverage V^q_{i,j} for the item of interest i and an arbitrary item j from each group (it can be any item, since they all have the same values with respect to q). The computed coverage is then used to update the competitiveness of all the items in the group. The process continues until the final competitiveness scores for all items have been computed. Assuming that we have a collection of items I, a set of queries Q, and at most M groups per query, the complexity is O(|I| × M × |Q|).


Obviously, when each group is a singleton, the algorithm is equivalent to the brute-force approach. We refer to this technique as GMiner. We also evaluate our enhanced CMiner++ algorithm, which implements the speedups of Section 4.2. Finally, unless stated otherwise, we always experiment with k ∈ {3, 10, 50, 150, 300}.
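The grouping step of GMiner can be sketched as follows (items as a dict from id to feature values; names hypothetical):

```python
from collections import defaultdict

def group_by_query(items, q):
    """Items with identical values on q's features share the same pairwise
    coverage, so each group needs only one coverage computation per query."""
    groups = defaultdict(list)
    for item_id, feats in items.items():
        groups[tuple(feats[f] for f in q)].append(item_id)
    return list(groups.values())
```

The fewer the distinct value combinations per query, the larger the groups and the greater the savings over the brute-force approach.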

5.2 Evaluating Comparative Methods

Previous work on competitor mining has been based on comparative evidence between two items, found in different types of text data. However, these approaches are based on the assumption that such comparative evidence can be found in abundance in the available data. In this experiment, we evaluate this assumption on our four datasets. For every pair of items in each dataset, we report (1) the number of reviews that mention both items and (2) the number of reviews that include a direct comparison between the two items. We extract such comparative evidence based on the union of "competitive evidence" lexicons used by previous work [8], [9], [10], [11], [12], [13]. Given two items i and j, the lexicon includes the following comparative patterns: i in contrast to j, i unlike j, i compared with j, compare i to j, i beat(s) j, i exceeds j, i outperform(s) j, prefer i to j, i than j, i same as j, i similar to j, i superior to j, i better than j, i worse than j, i more than j, i less than j, i vs j. We present the results in Table 5, in which we report the average number of findings for each pair of items in each dataset.
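For illustration, a few of the lexicon's patterns can be matched with plain regular expressions (a simplified sketch covering only four of the patterns listed above; the item names in the test are hypothetical):

```python
import re

def comparative_hits(text, i, j):
    """Count occurrences of a handful of 'i <pattern> j' comparisons."""
    a, b = re.escape(i), re.escape(j)
    patterns = [
        rf"{a}\s+(?:is\s+)?better\s+than\s+{b}",
        rf"{a}\s+compared\s+with\s+{b}",
        rf"prefer\s+{a}\s+to\s+{b}",
        rf"{a}\s+vs\.?\s+{b}",
    ]
    return sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)
```

Note that the patterns are directional: a hit for the pair (i, j) is not a hit for (j, i).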

TABLE 5: Evidence on Comparative Methods

Dataset       Co-occurrence   Comparative
Cameras                 1.7           1.2
Hotels                 0.06          0.02
Restaurants            0.09          0.04
Recipes                   0             0

The results verify that methods based on comparative evidence are completely ineffective in many domains. In fact, even for CAMERAS, the dataset with the largest count, evidence was limited to a very small number of pairs. Specifically, the expected number of times that any two specific cameras appear together in the same review is 1.7. In addition, only 1.2 of these co-occurrences were actually comparative, a number that is far too low to allow for a confident evaluation of competitiveness. This demonstrates the sparseness of comparative evidence in real data, which greatly limits the applicability of any approach that is based on such evidence. These findings further motivate our work, which has no need for this type of information.

5.3 Computational Time

In this experiment we compare the speed of CMiner with that of the two baselines (Naive and GMiner), as well as with that of the enhanced CMiner++ algorithm. Specifically, we use each algorithm to compute the set of top-k competitors for each item in our datasets. The results are shown in Figure 5. Each plot reports the average time, in seconds, per item (y-axis) against the various k values (x-axis).

The figures motivate some interesting observations. First, the Naive algorithm consistently reports the same computational time regardless of k, since it naively computes the competitiveness of every single item in the corpus with respect to the target item. Thus, any trivial variations in the required time are due to the process of maintaining the top-k set. In general, Naive is outperformed by the two other algorithms, and is only competitive for very large values of k on the HOTELS dataset. The latter case can be attributed to the small number of queries and items included in this dataset, which limit the ability of more sophisticated algorithms to significantly prune the space when the number of required competitors is very large.

For the CAMERAS dataset, CMiner and GMiner exhibit almost identical running times. This is due to (1) the very large number of distinct queries for this dataset (14779), which serves as a computational bottleneck for CMiner, and (2) the highly clustered structure of the item population, which includes 579 items. A deeper analysis reveals that GMiner identifies an average of 443.63 item groups (i.e. groups of identical items) per query. This means that the algorithm saves (in expectation) a total of (579 − 443) × 14779 = 2009944 coverage computations, allowing it to be competitive with the otherwise superior CMiner. In fact, for the other datasets, CMiner displays a clear advantage. This advantage is maximized for the RECIPES dataset, which is the most populous of the four in terms of included items. The experiment on this dataset also illustrates the scalability of the approach with respect to k. For the HOTELS and RESTAURANTS datasets, even though the computational time of CMiner appears to rise as k increases, it never goes above 0.035 seconds. For the CAMERAS dataset, the large number of considered queries has an adverse effect on the scalability of CMiner, since it results in a larger number of required computations for larger values of k. This finding motivates us to consider pruning the set of queries by eliminating those that have a low probability. We explore this direction in the experiment presented in Section 5.6.

Finally, we observe that the enhanced CMiner++ algorithm consistently outperformed all the other approaches, across datasets and values of k. The advantage of CMiner++ is increased for larger values of k, which allow the algorithm to benefit from its improved pruning. This verifies the utility of the improvements described in Section 4.2 and demonstrates that effective pruning can lead to a performance that far exceeds the worst-case complexity analysis of CMiner.

5.4 Ordering Efficiency

In Section 4.1 we introduced the COV ordering scheme, which determines the processing order of queries by CMiner. Next, we demonstrate COV's superiority over the P-INC and P-DCR ordering schemes, which process queries in increasing and decreasing probability order, respectively. For each approach, we compute (1) the number of pairwise query coverages V^q_{i,j} that need to be computed (line 25 of Algorithm 1) and (2) the number of distinct queries that need to be processed (line 22 of Algorithm 1) to compute the top-k competitors of each item. The results are shown in Figures 6 and 13 (Appendix D). The x-axis of each plot holds the value of k, while the y-axis holds the average number of processed queries / coverages.


Fig. 5: Average time (per item) to compute top-k competitors for each dataset: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes.

Fig. 6: Average number of pairwise coverage computations under different ordering schemes: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes.

We observe a consistent advantage for COV, across datasets and values of k. This verifies that COV increases the efficiency of CMiner by rapidly eliminating candidates that fail to cover important queries. We observe that this advantage is more pronounced for RECIPES and CAMERAS. This is a reasonable outcome, as the computational bottleneck of CMiner lies with the nested loop structure of the UPDATETOPK() routine, which iterates over all queries and over all items for each query (lines 22-39 of Algorithm 1). Therefore, datasets with a large number of items (RECIPES) or queries (CAMERAS) can benefit the most from the ability of CMiner to quickly eliminate candidates.

5.5 Pruning Efficiency

Much of CMiner's efficiency stems from its ability to discard or selectively evaluate candidates. We illustrate this in Figure 7. The figure includes one set of bars for each dataset, with each bar representing a different value of k (k ∈ {3, 10, 50, 150, 300}, in the order shown).

The white portion of each bar (post-pruned) represents the average number of items pruned within UPDATETOPK() (line 28). There, an item is pruned if, as we go over the set of queries Q, its upper bound drops below LB (the score of the lowest competitor in the current top-k). The black portion of each bar (pre-pruned) represents the average number of items that were never added to the candidate set X because their best-case scenario (self-coverage) was a priori worse than LB. Such items can therefore be eliminated without evaluating their competitiveness in the context of the queries. We explain this mechanism in detail in the final paragraph of Section 4.2. Finally, the pattern-filled portion (unpruned) at the top of each bar refers to the average number of items that were evaluated in their entirety (i.e. for all queries). We observe that the vast majority of candidates is eliminated by one of the two types of pruning that we consider here. The high number of pre-pruned items is particularly encouraging, as it implies the highest computational savings. It is also important to note that these findings are consistent across datasets.
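The two pruning rules can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the candidate bookkeeping of Algorithm 1 is reduced to a dictionary of self-coverage scores, and the post-pruning upper bound assumes pairwise coverages lie in [0, 1], so each unseen query can contribute at most its probability mass.

```python
import heapq

def top_k_competitors(target, candidates, queries, coverage, k):
    """Sketch of CMiner's two pruning rules (all names illustrative).

    candidates: dict mapping item -> self-coverage (best-case score).
    queries: list of (probability, query) pairs.
    coverage(q, i, j): pairwise coverage of query q by items i and j,
    assumed to lie in [0, 1].
    """
    topk = []                       # min-heap of (score, item)
    LB = float("-inf")              # score of the current k-th competitor

    for cand, best_case in candidates.items():
        # Pre-pruning: the candidate's best-case scenario (its self
        # coverage) already loses to the current k-th competitor, so it
        # is never added to the candidate set.
        if len(topk) == k and best_case <= LB:
            continue
        score = 0.0
        remaining = sum(p for p, _ in queries)   # probability mass unseen
        pruned = False
        for p, q in queries:
            score += p * coverage(q, target, cand)
            remaining -= p
            # Post-pruning: even full coverage of every remaining query
            # cannot lift this candidate above the k-th competitor.
            if len(topk) == k and score + remaining <= LB:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(topk, (score, cand))
        if len(topk) > k:
            heapq.heappop(topk)
        if len(topk) == k:
            LB = topk[0][0]

    return sorted(topk, reverse=True)
```

Pre-pruning is the cheaper of the two, since a pre-pruned candidate incurs no coverage computations at all, which is why the high pre-pruned counts in Figure 7 imply the largest savings.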

Fig. 7: Pruning effectiveness. [One group of bars per dataset (cameras, hotels, restaurants, recipes); y-axis: number of items, log scale from 0.1 to 100K; bar portions: pre-pruned, post-pruned, unpruned.]

5.6 Reducing the Number of Considered Queries

By default, CMiner considers all queries with a non-zero probability, with each query representing a different market segment. As larger segments are more likely to contribute to the competitiveness of two items, an intuitive way to speed up the algorithm is to ignore low-probability queries, i.e. queries with frequency lower than a threshold T. We repeat our experiment with T ∈ {1, 3, 5, 10, 15, 20} and record the time required for each combination of (k, T). The results are shown in Figure 8, which includes a plot for each dataset. The x-axis of each plot holds the different values of k, while the y-axis holds the required computational time in seconds. The plot includes one line for each value of T. We also evaluate the quality of the results, using Kendall's τ coefficient [29] to compare the produced top-k lists with the respective exact solution (i.e. T = 1). The results are shown in Figure 9. For the plots in this figure, the y-axis holds the computed Kendall τ values. For all plots, we report the average values computed over all the items in each dataset.
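Because this speedup only filters the input, it requires no change to the algorithm itself. A minimal sketch, under assumed representations (one hashable query descriptor per review; whether probabilities are re-normalized after filtering is our assumption, not stated in the text):

```python
from collections import Counter

def filter_queries(review_queries, T):
    """Drop market segments (queries) observed fewer than T times.

    review_queries: iterable with one query descriptor per review
    (illustrative representation). Returns the surviving queries with
    probabilities re-normalized over the filtered set (an assumption;
    re-normalization does not affect the resulting top-k ordering).
    """
    freq = Counter(review_queries)
    kept = {q: n for q, n in freq.items() if n >= T}
    total = sum(kept.values())
    return {q: n / total for q, n in kept.items()}
```

With T = 1 every observed query survives, which matches the exact solution used as the reference point for the Kendall τ comparison.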

For hotels and restaurants, query elimination does not yield significant gains. On the other hand, for cameras, the


Fig. 8: Computational times of CMiner for different values of the k and T parameters. [Four panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes. x-axis: top-k competitors; y-axis: time (sec); one line per threshold T ∈ {1, 3, 5, 10, 15, 20}.]

Fig. 9: Kendall Tau values achieved by CMiner for different values of the k and T parameters. [Four panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes. x-axis: top-k competitors; y-axis: Kendall Tau correlation; one line per threshold T ∈ {1, 3, 5, 10, 15, 20}.]

runtime of CMiner drops as T increases. In fact, the algorithm achieves increasingly higher savings as k increases. For recipes, we observe significant savings even for small values of T and k. A deeper study of the data reveals that these discrepancies can be attributed to the number of queries that each dataset includes. Specifically, as shown in Table 4, the number of queries for the cameras dataset is two orders of magnitude larger than that for the hotels and restaurants datasets. As a result, CMiner has a low computational cost for restaurants and hotels, even when all queries are considered. However, for cameras, the large number of queries serves as a significant computational bottleneck, which is relieved as the value of T becomes higher and fewer queries need to be considered. For the recipes dataset, the improvement can be attributed to the number of items (100K): by reducing the number of queries processed in the external loop of CMiner's UPDATETOPK() routine (line 23 of Algorithm 1), we also significantly reduce the executions of the inner loop (line 25), which iterates over the large set of items in this dataset.

The second observation is that, for all datasets except recipes, CMiner achieves near-perfect results even for larger values of T. This is based on the observed values of the Kendall τ coefficient, which was consistently above 0.9 for all evaluated combinations of the k and T parameters. This is an encouraging finding, since it reveals a highly appealing and practical tradeoff between the computational efficiency and the quality of CMiner's results. In addition, it is important to note that reducing the number of considered queries does not require any modifications to the algorithm itself and can thus be applied with minimum effort. A careful examination of the recipes dataset reveals that the low correlation values can be attributed to the fact that most queries have a low frequency and, in fact, their frequency distribution is nearly uniform. As a result, even a low value for the T threshold eliminates a large number of queries and prevents CMiner from computing the exact solution to the top-k problem. This finding reveals that, by studying the frequency distribution of the queries in a given dataset, we can make an informed decision on whether or not eliminating low-frequency queries is advisable, as well as on what the value of the threshold should be.

5.7 Convergence of Query Probabilities

In Section 2.2 we described the process of estimating the probability of each query by mining large datasets of customer reviews. The validity of this approach rests on the assumption that the number of available reviews is sufficient to allow for confident estimates. Next, we evaluate this assumption as follows. First, we merge all the reviews in each dataset into a single set, sort them by their submission date, and split the sorted sequence into fixed-size segments. We then iteratively append segments to the review corpus R considered by Eq. 6 and re-compute the probability of each query on the extended corpus. The vector of probabilities from the i-th iteration is then compared with that from the (i-1)-th iteration via the L1 distance: the sum of the absolute differences of corresponding entries (i.e. the two estimates for the same query in the two vectors). We apply the process for segments of 25 reviews. The results are shown in Figure 10. The x-axis of each plot holds the number of reviews, while the y-axis holds the respective L1 distance.
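The segment-by-segment comparison can be sketched as follows, assuming each review is reduced to a single query descriptor and probabilities are simple relative frequencies (a simplification of Eq. 6):

```python
from collections import Counter

def convergence_curve(review_queries, segment=25):
    """L1 distance between query-probability estimates computed on
    growing prefixes of the (date-sorted) review stream.

    review_queries: one query descriptor per review, in submission
    order (illustrative representation). Returns one L1 distance per
    appended segment after the first.
    """
    counts, total, prev, curve = Counter(), 0, None, []
    for start in range(0, len(review_queries), segment):
        chunk = review_queries[start:start + segment]
        counts.update(chunk)
        total += len(chunk)
        probs = {q: n / total for q, n in counts.items()}
        if prev is not None:
            keys = set(prev) | set(probs)
            # Sum of absolute differences of the two estimates for
            # each query (missing entries count as probability 0).
            curve.append(sum(abs(probs.get(q, 0.0) - prev.get(q, 0.0))
                             for q in keys))
        prev = probs
    return curve
```

A curve that drops toward zero indicates that appending further segments no longer changes the estimated distribution, which is the convergence behavior reported in Figure 10.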

Fig. 10: Convergence of query probabilities. [One line per dataset (Cameras, Recipes, Hotels, Restaurants); x-axis: number of reviews, from 50 to 600; y-axis: L1 distance.]


Based on our results, we see that all datasets exhibited near-identical trends. This is an encouraging finding with useful implications, as it informs us that any conclusions we draw about the convergence of the computed probabilities will be applicable across domains. Second, the figure clearly demonstrates the convergence of the computed probabilities, with the reported L1 distance rapidly dropping to trivial levels (below 0.2) after the consideration of fewer than 500 reviews. The convergence of the probabilities is an especially encouraging outcome that (i) reveals a stable categorical distribution for the preferences of the users over the various queries, and (ii) demonstrates that only a small seed of reviews, orders of magnitude smaller than the thousands of reviews available in each dataset, is sufficient to achieve an accurate estimation of the probabilities.

5.8 A User Study

In order to validate our definition of competitiveness, we conduct a user study as follows. First, we select 10 random items from each of our 4 datasets. We refer to these 10 items as the seed. For each item i in the seed, we compute its competitiveness with every other item in the corpus, according to our definition. As a baseline, we also rank all the items in the corpus based on their distance to i within the feature space of each dataset. The intuition behind this approach is that highly similar items are also likely to be competitive. The L1 distance was used for numeric and ordinal features, and the Jaccard distance was used for categorical attributes. We refer to this as the NN (Nearest Neighbor) approach. We then chose the two items with the highest score, the two items with the lowest score, and two items from the middle of the ranked list. This was repeated for both approaches, for a total of 12 candidates per item in the seed (6 per approach).
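The NN baseline's mixed-type distance can be sketched as follows. The feature encoding and the additive combination of the two components are our assumptions for illustration; the text only specifies L1 for numeric/ordinal features and Jaccard for categorical ones.

```python
def nn_distance(item_a, item_b, numeric_keys, categorical_key):
    """Similarity-based baseline distance (illustrative sketch):
    L1 over numeric/ordinal features plus Jaccard distance over a
    set-valued categorical attribute. Summing the two parts is an
    assumption, not specified in the text.
    """
    l1 = sum(abs(item_a[k] - item_b[k]) for k in numeric_keys)
    sa, sb = set(item_a[categorical_key]), set(item_b[categorical_key])
    union = sa | sb
    # Jaccard distance = 1 - |intersection| / |union| (0 for two
    # empty sets, by convention).
    jaccard = 1.0 - (len(sa & sb) / len(union) if union else 1.0)
    return l1 + jaccard
```

Ranking the corpus by this distance to the seed item yields the NN candidate list used in the study.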

We then created a user study on the online survey site kwiksurveys.com. In total, the survey was taken by 20 different annotators. Each of the 10 seed items was paired with each of its 12 corresponding candidates, for a total of 120 different pairs. The pairs were shown to the annotators in a randomized order. Annotators had access to a table with the values of each item for every feature, as well as a link to the original source. For each pair, each annotator was asked whether she would consider buying the candidate instead of the seed item. The possible answers were "YES", "NO" and "NOT SURE". Table 8 in Appendix F shows an example of the pairs that were shown to the annotators.

Figure 11 contains the results of the user study. The figure shows 3 pairs of bars. The left bar of each pair corresponds to our approach, and the right bar to the NN approach. The leftmost pair represents the user responses to the top-ranked candidates for each approach. The pair in the middle represents the responses to the candidates ranked in the middle, and, finally, the pair on the right represents the responses to the bottom-ranked candidates.

Each bar in Figure 11 captures the fraction of each of the three possible responses. The lower, middle, and upper parts of the bar represent the "YES", "NO" and "NOT SURE" responses, respectively. For example, the first bar on the left reveals that about 90% of the annotators would consider our top-ranked candidates as a replacement for the seed item. The remaining 10% was evenly divided between the "NO" and "NOT SURE" responses.

The figure motivates multiple relevant observations. First, we observe that the vast majority of the top-ranked competitors proposed by our approach were verified as likely replacements for the seed item. These are thus verified as strong competitors that could deprive the seed item of potential customers and decrease its market share. On the other hand, the top-ranked candidates of NN were often rejected by the users, who did not consider these items to be competitive. Both approaches exhibited their worst results for the RECIPES dataset, even though the "YES" percentage of the top-ranked items by our method was almost twice that of NN. The difficulty of the recipes domain is intuitive, as users are less accustomed to considering recipes in a competitive setting. The middle-ranked candidates of our approach attracted mixed responses from the annotators, indicating that it was not trivial to determine whether the item is indeed competitive or not. An interesting observation is that, for some of our datasets, the middle-ranked candidates of NN were more popular than its top-ranked ones, which implies that this approach fails to emulate the way the users perceive the competitiveness between two items. The bottom-ranked candidates of our approach were consistently rejected, verifying their lack of competitiveness with the seed item. The bottom-ranked items by the NN approach were also frequently rejected, indicating that it is easier to identify items that are not competitive to the target.

Finally, to further illustrate the difference between our competitiveness model and the similarity-based approach, we conducted the following quantitative experiment. For each item i in a dataset, we retrieve its 300 top-ranked competitors, as ordered by each of the two methods. We then compute the Kendall τ and the overlap of the two lists. We report the average of these two quantities over all items in the dataset. The results in Table 6 demonstrate that the rankings of the two techniques differ significantly both in their ordering and in the items that they contain.
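The list comparison can be sketched as follows. Kendall τ is only defined over items appearing in both lists; restricting the computation to the common items is one plausible choice, since the text does not specify how non-overlapping items are handled.

```python
def compare_rankings(list_a, list_b):
    """Overlap and a Kendall-τ-style rank correlation between two
    top-k lists, computed over their common items only (an assumption;
    the paper does not specify the treatment of non-shared items).
    """
    common = [x for x in list_a if x in set(list_b)]
    if len(common) < 2:
        return len(common), 0.0
    rank_b = {x: i for i, x in enumerate(list_b)}
    concordant = discordant = 0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            # common is in list_a's order, so the pair (i, j) is
            # concordant iff list_b ranks the two items the same way.
            if rank_b[common[i]] < rank_b[common[j]]:
                concordant += 1
            else:
                discordant += 1
    tau = (concordant - discordant) / (concordant + discordant)
    return len(common), tau
```

Averaging the two returned quantities over all items yields the per-dataset values reported in Table 6; a τ near or below zero together with a small overlap indicates that similarity and competitiveness produce very different rankings.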

In conclusion, the survey and our qualitative analysis validate our definition of competitiveness and demonstrate that similarity is not an appropriate proxy for competitiveness. These results complement the discussion that we presented in Section 2, which includes a realistic example of the shortcomings of the similarity-based paradigm.

TABLE 6: Comparing Competitiveness and NN.

              Kendall τ              #Commons
              Avg       StdDev       Avg       StdDev
Cameras      -0.0658    0.281        181.886   49.922
Hotels        0.0438    0.2074       185.786   32.7872
Restaurants  -0.36      0.1532       82.523    32.737
Recipes      -0.642     0.133        14.524    30.804

6 RELATED WORK

This paper builds on and significantly extends our preliminary work on the evaluation of competitiveness [30]. To the best of our knowledge, our work is the first to address the evaluation of competitiveness via the analysis of large unstructured datasets, without the need for direct comparative evidence. Nonetheless, our work has ties to previous work from various domains.


Fig. 11: Results of the user study comparing our competitiveness paradigm with the Nearest-Neighbor approach. [Four panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes. Each panel shows COMP/NN bar pairs for the top-, middle-, and bottom-ranked groups of candidates; y-axis: response percentage, split into YES, NO, and NOT SURE portions.]

Managerial Competitor Identification: The management literature is rich with works that focus on how managers can manually identify competitors. Some of these works model competitor identification as a mental categorization process in which managers develop mental representations of competitors and use them to classify candidate firms [3], [6], [31]. Other manual categorization methods are based on market- and resource-based similarities between a firm and candidate competitors [1], [5], [7]. Finally, managerial competitor identification has also been presented as a sense-making process in which competitors are identified based on their potential to threaten an organization's identity [4].

Competitor Mining Algorithms: Zheng et al. [32] identify key competitive measures (e.g. market share, share of wallet) and show how a firm can infer the values of these measures for its competitors by mining (i) its own detailed customer transaction data and (ii) aggregate data for each competitor. Contrary to our own methodology, this approach is not appropriate for evaluating the competitiveness between any two items or firms in a given market. Instead, the authors assume that the set of competitors is given and, thus, their goal is to compute the value of the chosen measures for each competitor. In addition, their approach depends on transactional data, a limitation that our method does not have.

Doan et al. explore user visitation data, such as the geo-coded data from location-based social networks, as a potential resource for competitor mining [33]. While they report promising results, the dependence on visitation data limits the set of domains that can benefit from this approach.

Pant and Sheng hypothesize and verify that competing firms are likely to have similar web footprints, a phenomenon that they refer to as online isomorphism [34]. Their study considers different types of isomorphism between two firms, such as the overlap between the in-links and out-links of their respective websites, as well as the number of times that they appear together online (e.g. in search results or news articles). Similar to our own methodology, their approach is geared toward pairwise competitiveness. However, the need for isomorphism features limits its applicability to firms and makes it unsuitable for items and domains where such features are either not available or extremely sparse, as is typically the case with co-occurrence data. In fact, the sparsity of co-occurrence data is a serious limitation of a significant body of work [8], [10], [11], [35] that focuses on mining competitors based on comparative expressions found in web results and other textual corpora. The intuition is that the frequency of expressions like "Item A is better than Item B" or "Item A vs. Item B" is indicative of their competitiveness. However, as we have already discussed in the introduction, such evidence is typically scarce or even non-existent in many mainstream domains. As a result, the applicability of such approaches is greatly limited. We provide empirical evidence on the sparsity of co-occurrence information in our experimental evaluation.

Finding Competitive Products: Recent work [36], [37], [38] has explored competitiveness in the context of product design. The first step in these approaches is the definition of a dominance function that represents the value of a product. The goal is then to use this function to create items that are not dominated by others, or items that achieve the maximum possible dominance value. A similar line of work [39], [40] represents items as points in a multidimensional space and looks for subspaces where the appeal of the item is maximized. While relevant, the above projects have a completely different focus from our own, and hence the proposed approaches are not applicable in our setting.

Skyline computation: Our work leverages concepts and techniques from the extensive literature on skyline computation [24], [25], [41]. These include the dominance concept among items, as well as the construction of the skyline pyramid used by our CMiner algorithm. Our work also has ties to recent publications on reverse skyline queries [42], [43]. Even though the focus of our work is different, we intend to utilize the advances in this field to improve our framework in future work.
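The dominance relation that underlies skyline computation can be sketched as follows. The direction of preference (larger values are better) and the quadratic skyline routine are purely illustrative; the cited literature covers far more efficient algorithms.

```python
def dominates(a, b):
    """True if item a dominates item b: a is at least as good on every
    feature and strictly better on at least one. Larger-is-better is
    assumed here for illustration only.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def skyline(items):
    """Items not dominated by any other item. A quadratic sketch;
    efficient skyline algorithms are discussed in the cited work."""
    return [a for a in items
            if not any(dominates(b, a) for b in items if b != a)]
```

Only skyline items can be the best choice for some customer preference, which is what makes the dominance concept useful for organizing candidate competitors.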

7 CONCLUSION

We presented a formal definition of competitiveness between two items, which we validated both quantitatively and qualitatively. Our formalization is applicable across domains, overcoming the shortcomings of previous approaches. We consider a number of factors that have been largely overlooked in the past, such as the position of the items in the multi-dimensional feature space and the preferences and opinions of the users. Our work introduces an end-to-end methodology for mining such information from large datasets of customer reviews. Based on our competitiveness definition, we addressed the computationally challenging problem of finding the top-k competitors of a given item. The proposed framework is efficient and applicable to domains with very large populations of items. The efficiency of our methodology was verified via an experimental evaluation on real datasets from different domains. Our experiments also revealed that only a small number of reviews is sufficient to confidently estimate the different types of users in a given market, as well as the number of users that belong to each type.


REFERENCES

[1] M. E. Porter, Competitive Strategy: Techniques for Analyzing Industries and Competitors. Free Press, 1980.
[2] R. Deshpandé and H. Gatignon, "Competitive analysis," Marketing Letters, 1994.
[3] B. H. Clark and D. B. Montgomery, "Managerial Identification of Competitors," Journal of Marketing, 1999.
[4] W. T. Few, "Managerial competitor identification: Integrating the categorization, economic and organizational identity perspectives," Doctoral Dissertation, 2007.
[5] M. Bergen and M. A. Peteraf, "Competitor identification and competitor analysis: a broad-based managerial approach," Managerial and Decision Economics, 2002.
[6] J. F. Porac and H. Thomas, "Taxonomic mental models in competitor definition," The Academy of Management Review, 2008.
[7] M.-J. Chen, "Competitor analysis and interfirm rivalry: Toward a theoretical integration," Academy of Management Review, 1996.
[8] R. Li, S. Bao, J. Wang, Y. Yu, and Y. Cao, "Cominer: An effective algorithm for mining competitors from the web," in ICDM, 2006.
[9] Z. Ma, G. Pant, and O. R. L. Sheng, "Mining competitor relationships from online news: A network-based approach," Electronic Commerce Research and Applications, 2011.
[10] R. Li, S. Bao, J. Wang, Y. Liu, and Y. Yu, "Web scale competitor discovery using mutual information," in ADMA, 2006.
[11] S. Bao, R. Li, Y. Yu, and Y. Cao, "Competitor mining with the web," IEEE Trans. Knowl. Data Eng., 2008.
[12] G. Pant and O. R. L. Sheng, "Avoiding the blind spots: Competitor identification using web text and linkage structure," in ICIS, 2009.
[13] D. Zelenko and O. Semin, "Automatic competitor identification from public information sources," International Journal of Computational Intelligence and Applications, 2002.
[14] R. Decker and M. Trusov, "Estimating aggregate consumer preferences from online product reviews," International Journal of Research in Marketing, vol. 27, no. 4, pp. 293–307, 2010.
[15] C. W.-K. Leung, S. C.-F. Chan, F.-L. Chung, and G. Ngai, "A probabilistic rating inference framework for mining user preferences from reviews," World Wide Web, vol. 14, no. 2, pp. 187–215, 2011.
[16] K. Lerman, S. Blair-Goldensohn, and R. McDonald, "Sentiment summarization: evaluating and learning user preferences," in ACL, 2009, pp. 514–522.
[17] E. Marrese-Taylor, J. D. Velasquez, F. Bravo-Marquez, and Y. Matsuo, "Identifying customer preferences about tourism products using an aspect-based opinion mining approach," Procedia Computer Science, vol. 22, pp. 182–191, 2013.
[18] C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant, "Range queries in OLAP data cubes," in SIGMOD, 1997, pp. 73–88.
[19] Y.-L. Wu, D. Agrawal, and A. El Abbadi, "Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes," in CIKM, 2000, pp. 414–421.
[20] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi, "Approximating multi-dimensional aggregate range queries over real attributes," in SIGMOD, 2000, pp. 463–474.
[21] M. Muralikrishna and D. J. DeWitt, "Equi-depth histograms for estimating selectivity factors for multi-dimensional queries," in SIGMOD, 1988, pp. 28–36.
[22] N. Thaper, S. Guha, P. Indyk, and N. Koudas, "Dynamic multidimensional histograms," in SIGMOD, 2002, pp. 428–439.
[23] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
[24] S. Borzsonyi, D. Kossmann, and K. Stocker, "The skyline operator," in ICDE, 2001.
[25] D. Papadias, Y. Tao, G. Fu, and B. Seeger, "An optimal and progressive algorithm for skyline queries," in SIGMOD, 2003.
[26] G. Valkanas, A. N. Papadopoulos, and D. Gunopulos, "Skyline ranking a la IR," in ExploreDB, 2014, pp. 182–187.
[27] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson, "On the average number of maxima in a set of vectors and applications," J. ACM, 1978.
[28] X. Ding, B. Liu, and P. S. Yu, "A holistic lexicon-based approach to opinion mining," in WSDM, 2008.
[29] A. Agresti, Analysis of Ordinal Categorical Data. John Wiley & Sons, 2010, vol. 656.
[30] T. Lappas, G. Valkanas, and D. Gunopulos, "Efficient and domain-invariant competitor mining," in SIGKDD, 2012, pp. 408–416.
[31] J. F. Porac and H. Thomas, "Taxonomic mental models in competitor definition," Academy of Management Review, vol. 15, no. 2, pp. 224–240, 1990.
[32] Z. Zheng, P. Fader, and B. Padmanabhan, "From business intelligence to competitive intelligence: Inferring competitive measures using augmented site-centric data," Information Systems Research, vol. 23, no. 3-part-1, pp. 698–720, 2012.
[33] T.-N. Doan, F. C. T. Chua, and E.-P. Lim, "Mining business competitiveness from user visitation data," in International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer, 2015, pp. 283–289.
[34] G. Pant and O. R. Sheng, "Web footprints of firms: Using online isomorphism for competitor identification," Information Systems Research, vol. 26, no. 1, pp. 188–209, 2015.
[35] K. Xu, S. S. Liao, J. Li, and Y. Song, "Mining comparative opinions from customer reviews for competitive intelligence," Decis. Support Syst., 2011.
[36] Q. Wan, R. C.-W. Wong, I. F. Ilyas, M. T. Ozsu, and Y. Peng, "Creating competitive products," PVLDB, vol. 2, no. 1, pp. 898–909, 2009.
[37] Q. Wan, R. C.-W. Wong, and Y. Peng, "Finding top-k profitable products," in ICDE, 2011.
[38] Z. Zhang, L. V. S. Lakshmanan, and A. K. H. Tung, "On domination game analysis for microeconomic data mining," ACM Trans. Knowl. Discov. Data, 2009.
[39] T. Wu, D. Xin, Q. Mei, and J. Han, "Promotion analysis in multi-dimensional space," PVLDB, 2009.
[40] T. Wu, Y. Sun, C. Li, and J. Han, "Region-based online promotion analysis," in EDBT, 2010.
[41] D. Kossmann, F. Ramsak, and S. Rost, "Shooting stars in the sky: an online algorithm for skyline queries," in VLDB, 2002.
[42] A. Vlachou, C. Doulkeridis, Y. Kotidis, and K. Nørvag, "Reverse top-k queries," in ICDE, 2010.
[43] A. Vlachou, C. Doulkeridis, K. Nørvag, and Y. Kotidis, "Identifying the most influential data objects with reverse top-k queries," PVLDB, 2010.
[44] K. Hose and A. Vlachou, "A survey of skyline processing in highly distributed environments," The VLDB Journal, vol. 21, no. 3, pp. 359–384, 2012.

George Valkanas received his PhD in Computer Science from the University of Athens, Greece, in 2014. Currently, he is a Research Engineer at Detetica, NYC, NY. Prior to that, he was a Post-Doctoral Researcher in the School of Business at Stevens Institute of Technology. His research interests include web and text mining, information retrieval and extraction, with a particular interest in user preferences and behavior.

Theodoros Lappas received his PhD in Computer Science from the University of California, Riverside in 2011. Currently, he is an assistant professor in the School of Business at Stevens Institute of Technology. He joined Stevens after completing his post-doctoral study in the Computer Science Department of Boston University. His research interests include data mining and machine learning, with a particular focus on scalable algorithms for mining text and graphs. He is a member of the IEEE.

Dimitrios Gunopulos received his PhD degree from Princeton University in 1995. Currently, he is a Professor in the Department of Informatics and Telecommunications, University of Athens, Greece. He has held positions as a postdoctoral fellow at MPII, Germany, research associate at IBM Almaden, and assistant, associate, and full professor of computer science and engineering at UC-Riverside. His research interests include data mining, knowledge discovery in databases, databases, sensor networks, and peer-to-peer systems. He is a member of the IEEE.


APPENDIX A
EXTENSION OF PREVIOUS WORK

This manuscript builds on our previous preliminary work on the evaluation of competitiveness, published in the ACM SIGKDD 2012 conference. In addition to including a significantly improved write-up and a much more extensive discussion of both our methods and results, the submitted manuscript extends our previous work by introducing:

1) An updated motivation and description of the problems addressed by our work, including new informative examples and improved notation (Sections 1 and 2).

2) An extended section on related work, including a discussion of papers from management and marketing science, as well as recent papers published after the conference version of this manuscript (Section 6).

3) A formal proof of Lemma 1, which is critical for the proposed methodology and was not included in the conference version (Section 3).

4) An extended discussion of the time and space complexity of the CMiner algorithm (Section 3).

5) A new algorithm, titled CMiner++. This algorithm, which is now included in our experimental evaluation, implements a number of computational speedups that increase the search and pruning capabilities of the previously proposed CMiner algorithm and increase its scalability (Section 4.2).

6) A new experiment that includes the results of the COV ordering scheme (named IC in the conference version) on all four datasets considered by our evaluation, as well as a comparison with other alternatives. The experiment evaluates the scheme in terms of both the number of required coverage computations and the number of queries that need to be considered, for a total of 8 plots. The conference version only included the results for the coverage computations for a single dataset (Section 5.4).

7) A new experiment that evaluates the ability of the proposed algorithm to achieve high-quality results, even when only a subset of the queries is considered. The experiment includes an evaluation of both the quality of the results and the computational savings that are recorded in such cases (Section 5.6).

8) A new experiment that evaluates the ability of the proposed algorithm to achieve high-quality results, even when only a subset of the candidate competitors is considered. The experiment includes an evaluation of both the quality of the results and the computational savings that are recorded in such cases (Appendix E).

9) A new experiment that evaluates the convergence of the query probabilities required by our methodology, for all four datasets (Section 5.7).

10) A comparison with previous work on mining competitors based on competitive evidence from text (Section 5.2).

11) An extended user study that includes results on all 4 datasets included in our evaluation: RECIPES, CAMERAS, HOTELS, and RESTAURANTS (Section 5.8).

12) A new experiment that quantitatively evaluates the pruning capability of the CMiner algorithm (Section 5.5).

13) A new experiment that illustrates the difference between competitiveness and similarity (added to Section 5.8).

14) An extension of our definition of competitiveness that considers (i) the non-uniformity of the distribution of a user’s interest within the value space of each feature, and (ii) the differences in importance among the features included in the same user query (Section 2.3).

APPENDIX B
PROOF OF LEMMA 1

Proof: We will prove this by contradiction. Let $j$ be an item that is not included in $Y$ and has a competitiveness $CF(i, j)$ with $i$ that is higher than or equal to that of at least one of the items in $Y$. According to the lemma, we assume that $j$ is not dominated by any of the items in $Y$. Observe that $j$ cannot be in the skyline, since its competitiveness would then have been enough to also include it in $Y$. Hence, it is guaranteed to be dominated by at least one of the items in the skyline. This means that there is an item $j' \in Sky(I)$ such that $CF(i, j') \ge CF(i, j)$. However, since $j$ has a competitiveness greater than or equal to that of at least one of the items in $Y$, $j'$ will also be included in $Y$. This leads to a contradiction, since we assumed that none of the items in $Y$ dominate $j$.

APPENDIX C
PSEUDOCODE AND COMPLEXITY ANALYSIS FOR PyramidFinder

Algorithm 2 PyramidFinder
Input: Set of items I
Output: Dominance Pyramid DI

1:  DI[0] ← Sky(I)
2:  Z ← I \ Sky(I)
3:  level ← 1
4:  while Z is not empty do
5:      DI[level] ← Sky(Z)
6:      for every item j ∈ DI[level] do
7:          for every item i ∈ DI[level − 1] do
8:              if i dominates j then
9:                  Add a link i → j
10:                 break
11:             end if
12:         end for
13:     end for
14:     Z ← Z \ Sky(Z)
15:     level ← level + 1
16: end while
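A minimal Python sketch of Algorithm 2 may help make the construction concrete. This is illustrative code, not the authors' implementation: items are assumed to be tuples of numeric scores where larger is better, and a naive quadratic skyline routine stands in for BBS.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    """a dominates b: at least as good on every dimension, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep the points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def pyramid(items: List[Point]) -> Tuple[List[List[Point]], Dict[Point, Point]]:
    """Arrange items into skyline layers and link each point to one dominating
    point in the previous layer, mirroring lines 1-16 of Algorithm 2."""
    layers: List[List[Point]] = []
    links: Dict[Point, Point] = {}
    remaining = list(items)
    while remaining:
        layer = skyline(remaining)
        for j in layer:
            for i in (layers[-1] if layers else []):
                if dominates(i, j):
                    links[j] = i  # one incoming edge suffices (the "break" in line 10)
                    break
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers, links
```

A production implementation would replace the quadratic `skyline` helper with a state-of-the-art routine such as BBS [25]; the layering and linking logic would be unchanged.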

Space Complexity: Algorithm 2 describes the creation of the pyramid indexing structure. In principle, the algorithm simply rearranges the original data in skyline layers. Therefore, any additional space required by the structure is due to the edges added between data points of consecutive skyline layers (line 9). As is typical with skyline-related problems [44], this depends greatly on the underlying data distribution. Assuming $n$ data points overall, if there is only one skyline layer, then no additional information is stored. In the worst case, we will need $n^2$ additional space to store the edges. This is extremely unlikely even in artificially generated distributions [24]. On the other hand, if the data were fully correlated, we would require $\Theta(n)$ space for the edges. This is also unrealistic, as it implies a monopolistic market.

Figure 12 provides an empirical analysis of the number of edges stored per dataset, contrasting it to the theoretical bounds of the worst possible and fully correlated cases discussed earlier. The space required to store the pyramid structure is well below the limit of the correlated dataset. This is related to both the high dimensionality of the space and the fact that we are dealing with non-monopolistic settings.

[Fig. 12: Number of edges stored per dataset (cameras, hotels, restaurants, recipes); series: Worst, Correlated, Empirical; y-axis: #Edges, from 100 to 10Bn (log scale).]

Time Complexity: Computationally, the algorithm has two major parts: (i) finding the skyline layers and (ii) adding the edges. The complexity is $level$ times the complexity of any state-of-the-art algorithm that the practitioner chooses for skyline computation (we use BBS [25]), where $level$ is the number of skyline layers.

The standard approach to adding the edges is $O(d \cdot n^2)$, where $d$ is the dimensionality of the data, due to the dominance checks. Let us assume that $m_i$ points exist in the skyline of the $i$-th layer extracted by BBS, for a total of $\lambda$ layers. Adding edges between consecutive layers yields a total runtime of $\sum_{i=1}^{\lambda-1} m_i \cdot m_{i+1}$. From the Cauchy-Schwarz inequality, we know that

$$\sum_{i=1}^{\lambda-1} m_i \cdot m_{i+1} \;\le\; \sqrt{\left(\sum_{i=1}^{\lambda-1} m_i^2\right) \cdot \left(\sum_{i=1}^{\lambda-1} m_{i+1}^2\right)}$$

The left-hand side is bounded, and the upper bound is achieved when all $m_i$ are equal. Given that the $m_i$ are integers, the true bound may not always be attainable, yet it still holds that the maximum value is obtained when $m_i = n/\lambda$, or as close to equally distributed as possible. By plugging this result into the original cost function, the worst-case runtime becomes a function of the number of layers, given by

$$f(\lambda) = \sum_{i=1}^{\lambda-1} \left(\frac{n}{\lambda}\right)^2 = n^2 \cdot \frac{\lambda - 1}{\lambda^2}$$

The first derivative of this function is $f'(\lambda) = n^2 \cdot \frac{2 - \lambda}{\lambda^3}$, which vanishes at $\lambda = 2$, i.e., $f$ attains its global maximum at 2 layers. In that case, $m_1 = m_2 = n/2$, and the worst-case runtime becomes $(n/2)^2 = O(n^2)$. As each comparison takes time $O(d)$, the overall cost is $O(d \cdot n^2)$. We note that, even though the cost could be improved with an in-memory R-Tree, this might degenerate to a linear scan for very high-dimensional data.
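As a quick numerical sanity check on this derivation (illustrative code, not part of the original evaluation), a few lines of Python confirm that $f(\lambda) = n^2(\lambda-1)/\lambda^2$ peaks at $\lambda = 2$:

```python
# Worst-case edge-addition cost as a function of the number of skyline
# layers: f(lambda) = n^2 * (lambda - 1) / lambda^2, for a fixed n.
def f(lam: int, n: int = 1000) -> float:
    return n * n * (lam - 1) / (lam * lam)

# The maximum over any number of layers is attained at lambda = 2,
# where the cost equals (n / 2)^2.
costs = {lam: f(lam) for lam in range(1, 101)}
assert max(costs, key=costs.get) == 2
assert costs[2] == (1000 / 2) ** 2
```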

As we show in Table 7, the pyramid construction is not responsible for the main cost of computing the competitors. The runtime was only substantial (67 minutes) for our biggest dataset (100K items), for which we had to use the naive approach for edge addition, due to the high dimensionality ($d = 22$). It was negligible for all the other cases. However, the structure offered significant runtime savings during search, especially for the competitors at the top of the list. Instead, the brute-force method that computes competitiveness between all possible item pairs requires $\Theta(2^{|F|} \cdot n^2 \cdot \log k)$, which can easily be dominated by the powerset factor.

TABLE 7: Runtime for creating the Pyramid structure.

                Cameras   Hotels   Restaurants    Recipes
Runtime (sec)     0.087    0.152         0.338   3999.322

APPENDIX D
REDUCING THE NUMBER OF CONSIDERED QUERIES

We have already discussed, in Section 5.4, how CMiner is able to evaluate only a fraction of the total pairwise coverages by using the COV ordering scheme introduced in Section 4.1. Figure 13 portrays the average number of queries that we must evaluate for each candidate item before we start computing the exact competitiveness score of the items in the localTopK structure (lines 23-40 of Algorithm 1).

Once again, the CAMERAS and RECIPES datasets benefit the most from this ordering scheme, even for large values of k, with 50% fewer queries being evaluated. This result also sheds some light on our earlier finding regarding the computed pairwise coverages. Interestingly, for those two datasets, the number of evaluated queries seems to remain rather constant, despite an increasing k value. The difference in the number of evaluated queries is also visible for the remaining two datasets, although it is not as prominent and increases slightly as k increases.

Given those results, it is clear that the COV scheme allows us to perform far better pruning without sacrificing the correctness of the result. Furthermore, all of those results, in addition to the ones in Figure 6, validate our theoretical work regarding the convergence of the approach, discussed in detail in Section 4.1.


[Fig. 13: Average number of query considerations under different ordering schemes. Panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes; x-axis: Number of Competitors (3, 5, 10, 50, 150); y-axis: Number of Queries; series: P-INC (RND for Recipes), P-DCR, COV.]

[Fig. 14: Computational cost of CMiner on just the top x% of candidates, according to coverage. Panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes; x-axis: Top-k Competitors (3, 10, 50, 150, 300); y-axis: Time (sec); series: x = 20%, 40%, 60%, 80%, 100%.]

[Fig. 15: Results of using CMiner on just the top x% of candidates, according to coverage. Panels: (a) Cameras, (b) Hotels, (c) Restaurants, (d) Recipes; x-axis: Top-k Competitors (3, 10, 50, 150, 300); y-axis: Kendall Tau Correlation; series: x = 20%, 40%, 60%, 80%.]

APPENDIX E
REDUCING THE NUMBER OF CANDIDATES

Despite the scalability that CMiner demonstrated in our experiments, it is still a fairly complex approach that considers the entire space of candidates. In this experiment, we verify the need for such an algorithm by evaluating a simpler and much faster approximation approach.

According to the definition given in Eq. 1, the competitiveness between two items is based on the percentage of customers that they can both cover. Intuitively, it would be reasonable to expect that candidates with a high coverage are more likely to be included in the top-k competitors of a random item. Thus, we hypothesize that, by focusing exclusively on high-coverage items, we could significantly reduce the number of candidates that need to be considered. We evaluate this hypothesis by comparing the top-k sets retrieved by CMiner on the full dataset with those retrieved when focusing on just the top x% of the candidates, as ranked by coverage. The two top-k lists are compared using Kendall's tau correlation coefficient. We repeat the experiment for all four datasets, for x ∈ {20, 40, 60, 80}, and for k ∈ {3, 10, 50, 150, 300}. We report the average value of the coefficient for each parameter configuration. The results are shown in Figure 15. We also report the computational cost of each configuration in Figure 14.
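For intuition, the comparison metric can be sketched with a minimal, self-contained implementation of the pairwise-concordance form of Kendall's tau (an illustrative sketch; the experiments presumably relied on a standard statistical package):

```python
from itertools import combinations
from typing import Hashable, List

def kendall_tau(rank_a: List[Hashable], rank_b: List[Hashable]) -> float:
    """Kendall's tau between two rankings of the same items (best-first).
    Returns +1 for identical orderings and -1 for fully reversed ones."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # x precedes y in rank_a; check whether rank_b agrees on the pair.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Identical top-k lists give perfect correlation; reversed ones give -1.
assert kendall_tau(["a", "b", "c"], ["a", "b", "c"]) == 1.0
assert kendall_tau(["a", "b", "c"], ["c", "b", "a"]) == -1.0
```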

The figures reveal a number of informative findings. First, as anticipated, higher values of x yield results that are closer to the exact solution. However, for x ≤ 40%, the achieved coefficient values were consistently low. For larger values of x, the algorithm delivers results closer to the correct output, but only for larger values of k. For instance, for hotels and restaurants, even x = 80% was not enough to achieve high correlation values for k > 10. With respect to the computational cost, we observe that significant savings are only achieved for large values of k (> 50), even if small percentages of the entire population of candidates are considered. This is a testament to CMiner's ability to efficiently evaluate and prune the search space, while also computing an exact solution. The results demonstrate that the a priori elimination of a percentage of the search space can have a significant adverse effect on the quality of the results, while only delivering computational savings in limited cases. This provides further motivation for our own approach, which focuses on considering the entire space of candidates and only eliminates candidates when we have enough information to guarantee that they are not in the top-k.

APPENDIX F
EXAMPLE PAIRS OF THE USER STUDY

As an example of the annotation setup, we provide some sample pairs in Table 8 from the restaurants dataset. The top-ranked (most competitive) restaurant is better than the


TABLE 8: Examples of top-, middle-, and bottom-ranked competitors from the user study on RESTAURANTS.

ID          Food   Service   Value   Atm.   Price   Meal Types           Good For                                                  Cuisines
Focal Item  0.83   0.79      0.74    0.78   0.96    Dinner, Lunch        Families                                                  Italian
Top         0.90   0.90      0.90    0.80   0.97    Dinner, Lunch        Families, Bar scene, Budget, Large groups, Special occ.   Italian, Pizza
Middle      0.80   0.80      0.70    0.80   0.91    Dinner, Late Night   Bar scene, Large groups, Business, Special occ.           Mediterranean, Middle Eastern
Bottom      0.60   0.80      0.60    0.80   0.67    Dinner               Large groups, Bar scene                                   American

focal (seed) item for all numeric features and covers more values for all categorical features. On the other hand, the middle-ranked item has better values than the focal item for some numeric features (e.g., service) and worse for others (e.g., price). That being said, the gaps in the values of the two items are consistently small across numeric features. In addition, the two items have no overlap with respect to 2 of the 3 categorical features (Good For, Cuisines), a fact that demonstrates that they compete for different market segments and lowers the competitiveness. Finally, the bottom-ranked item has no overlap with the focal item for any of the categorical features. This item also has significantly lower values for many of the numeric features, a fact that further contributes to its low competitiveness.
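To make the role of categorical overlap concrete, the following illustrative snippet compares the value sets of a single categorical feature for two items via Jaccard overlap. Note that this is only a proxy for intuition, not the paper's competitiveness measure, which is based on jointly covered market segments:

```python
def categorical_overlap(a: set, b: set) -> float:
    """Jaccard overlap between the value sets of one categorical feature.
    Illustrative only: zero overlap suggests the two items target
    disjoint market segments for that feature."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# "Good For" values of the focal vs. bottom-ranked item in Table 8:
focal = {"Families"}
bottom = {"Large groups", "Bar scene"}
assert categorical_overlap(focal, bottom) == 0.0  # no shared segments

# The focal and top-ranked items share "Families", so overlap is positive.
top = {"Families", "Bar scene", "Budget", "Large groups", "Special occ."}
assert categorical_overlap(focal, top) == 0.2
```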