8
Extracting Aribute-Value Pairs from Product Specifications on the Web Petar Petrovski Data and Science Group, University of Mannheim B6, 26 Mannheim, Germany 68159 [email protected] Christian Bizer Data and Science Group, University of Mannheim B6, 26 Mannheim, Germany 68159 [email protected] ABSTRACT Comparison shopping portals integrate product oers from large numbers of e-shops in order to support consumers in their buy- ing decisions. Product oers often consist of a title and a free-text product description, both describing product attributes that are con- sidered relevant by the specic vendor. In addition, product oers might contain structured or semi-structured product specications in the form of HTML tables and HTML lists. As product specications often cover more product attributes than free-text descriptions, being able to extract attribute-value pairs from these specications is a critical prerequisite for achieving good results in tasks such as product matching, product categorisation, faceted product search, and product recommendation. In this paper, we present an approach for extracting attribute- value pairs from product specications on the Web. We use super- vised learning to classify the HTML tables and HTML lists within a web page as product specication or not. In order to extract attribute-value pairs from the HTML fragments identied by the specication detector, we again use supervised learning to classify columns as attribute column or value column. Compared to DEX- TER, the current state-of-the-art approach for extracting attribute- value pairs from product specications, we introduce several new features for specication detection and support the extraction of attribute-value pairs from specications having more than two columns. This allows us to improve the F-score up to 10% for ex- tracting attribute-value pairs from tables and up to 3% for lists. In addition, we report the results of using duplicate-based schema matching to align the product attribute schemata of 32 dierent e-shops. This experiment conrms the suitability of duplicate-based schema matching for product data integration. CCS CONCEPTS Information systems Extraction, transformation and load- ing; Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specic permission and/or a fee. Request permissions from [email protected]. Web Intelligence (WI’17), Leipzig, Germany © 2017 ACM. 978-1-4503-4951-2/17/08. . . $15.00 DOI: 10.1145/3106426.3106449 (a) Example of a free-text product title (b) Example of a product specication table Figure 1: Product attributes within a free-text product title and product attributes given by a specication table KEYWORDS Product Data; Feature Extraction; Schema Matching; Web Tables ACM Reference format: Petar Petrovski and Christian Bizer. 2017. Extracting Attribute-Value Pairs from Product Specications on the Web. In Proceedings of Web Intelligence (WI’17), Leipzig, Germany, August 2017, 8 pages. DOI: 10.1145/3106426.3106449 1 INTRODUCTION The Web has made it easier for organisations to reach out to their customers, eliminating barriers of geographical location 1 , and lead- ing to a steady growth of e-commerce sales 2 . Besides e-shops run by individual vendors, comparison shopping portals which aggregate oers from multiple vendors play a central role in e-commerce. The central challenge for many tasks within the domain of e- commerce, including product matching, product categorisation, faceted product search, and product recommendation, is extract- ing attribute-value pairs with high precision from unstructured 1 Digital buyer penetration worldwide from 2011 to 2018 - http://www.emarketer.com/ public_media/docs/eMarketer_eTailWest2016_Worldwide_ECommerce_Report.pdf 2 Retail e-commerce sales worldwide from 2014 to 2019 - https://www.emarketer.com/ Article/Worldwide-Retail-Ecommerce-Sales-Will-Reach-1915-trillion-This-Year/ 1014369

Extracting Attribute-Value Pairs from Product ...webdatacommons.org/productcorpus/paper/attribute... · Extracting A˛ribute-Value Pairs from Product Specifications on the Web Web

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Extracting A�ribute-Value Pairs from Product Specifications onthe Web

Petar PetrovskiData and Science Group,University of Mannheim

B6, 26Mannheim, Germany 68159

[email protected]

Christian BizerData and Science Group,University of Mannheim

B6, 26Mannheim, Germany 68159

[email protected]

ABSTRACTComparison shopping portals integrate product o�ers from largenumbers of e-shops in order to support consumers in their buy-ing decisions. Product o�ers often consist of a title and a free-textproduct description, both describing product attributes that are con-sidered relevant by the speci�c vendor. In addition, product o�ersmight contain structured or semi-structured product speci�cations inthe form of HTML tables and HTML lists. As product speci�cationsoften cover more product attributes than free-text descriptions,being able to extract attribute-value pairs from these speci�cationsis a critical prerequisite for achieving good results in tasks such asproduct matching, product categorisation, faceted product search,and product recommendation.

In this paper, we present an approach for extracting attribute-value pairs from product speci�cations on the Web. We use super-vised learning to classify the HTML tables and HTML lists withina web page as product speci�cation or not. In order to extractattribute-value pairs from the HTML fragments identi�ed by thespeci�cation detector, we again use supervised learning to classifycolumns as attribute column or value column. Compared to DEX-TER, the current state-of-the-art approach for extracting attribute-value pairs from product speci�cations, we introduce several newfeatures for speci�cation detection and support the extraction ofattribute-value pairs from speci�cations having more than twocolumns. This allows us to improve the F-score up to 10% for ex-tracting attribute-value pairs from tables and up to 3% for lists. Inaddition, we report the results of using duplicate-based schemamatching to align the product attribute schemata of 32 di�erente-shops. This experiment con�rms the suitability of duplicate-basedschema matching for product data integration.

CCS CONCEPTS• Information systems→Extraction, transformation and load-ing;

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor pro�t or commercial advantage and that copies bear this notice and the full citationon the �rst page. Copyrights for components of this work owned by others than ACMmust be honored. Abstracting with credit is permitted. To copy otherwise, or republish,to post on servers or to redistribute to lists, requires prior speci�c permission and/or afee. Request permissions from [email protected] Intelligence (WI’17), Leipzig, Germany© 2017 ACM. 978-1-4503-4951-2/17/08. . . $15.00DOI: 10.1145/3106426.3106449

(a) Example of a free-text product title

(b) Example of a product speci�cation table

Figure 1: Product attributes within a free-text product titleand product attributes given by a speci�cation table

KEYWORDSProduct Data; Feature Extraction; Schema Matching; Web TablesACM Reference format:Petar Petrovski and Christian Bizer. 2017. Extracting Attribute-Value Pairsfrom Product Speci�cations on the Web. In Proceedings of Web Intelligence(WI’17), Leipzig, Germany, August 2017, 8 pages.DOI: 10.1145/3106426.3106449

1 INTRODUCTIONThe Web has made it easier for organisations to reach out to theircustomers, eliminating barriers of geographical location1, and lead-ing to a steady growth of e-commerce sales2 . Besides e-shops run byindividual vendors, comparison shopping portals which aggregateo�ers from multiple vendors play a central role in e-commerce.

The central challenge for many tasks within the domain of e-commerce, including product matching, product categorisation,faceted product search, and product recommendation, is extract-ing attribute-value pairs with high precision from unstructured1Digital buyer penetration worldwide from 2011 to 2018 - http://www.emarketer.com/public_media/docs/eMarketer_eTailWest2016_Worldwide_ECommerce_Report.pdf2Retail e-commerce sales worldwide from 2014 to 2019 - https://www.emarketer.com/Article/Worldwide-Retail-Ecommerce-Sales-Will-Reach-1915-trillion-This-Year/1014369

Web Intelligence (WI’17), August 2017, Leipzig, Germany Petar Petrovski and Christian Bizer

Figure 2: Overall process of speci�cation detection and attribute-value pair extraction

product descriptions or semi-structured product speci�cations. Theextraction of detailed product features from the HTML pages ischallenging, as a single feature may appear in various surface formsin headlines, the product name, free-text product descriptions, aswell as structured or semi-structured product speci�cations. More-over, di�erent vendors use di�erent schemata to describe productsfrom the same product category, in which case to successfully de-termine matching products schema alignment is needed. Productfeature extraction is di�cult as most e-shops publish heterogeneousproduct descriptions having di�erent levels of detail [15].

In [18], we proposed an approach for matching product o�ersfrom HTML pages that contain semantic annotations (i.e Microdatamarkup). We observed that the product titles and product descrip-tions on these pages often do not include enough attributes bywhich the speci�c products can be identi�ed. For example, Figure1a shows a product title from which only �ve attributes can beextracted. These attributes are not enough to identify the speci�cproduct: Namely, the Galaxy S4 can have a di�erent storage ca-pacity (32 GB or 64 GB) and be equipped with a di�erent chip set(Cortex-A57 or Cortex-A53). This makes Galaxy S4 phones havingdi�erent storage capacities or di�erent chip sets ultimately di�er-ent products. In contrast to product tiles and product descriptionswhich often contain only are relatively small number of attributes,product speci�cations, such as the HTML table shown in Figure1b, often contain a much larger number of very detailed productattributes, e.g. 20 attributes in the example.

In order to overcome the challenge of the low number of at-tributes covered by product descriptions, we present in this papera two step approach for extracting attribute-value pairs from tech-nical speci�cations rather than from free-text product descriptions.In addition, we demonstrate a matching method for aligning theproduct attribute schemata used by di�erent vendors. Thus thecontributions of this paper are the following:

Speci�cation Detection: As previous work, we also use su-pervised learning to classify the HTML tables and HTMLlists within a web page as product speci�cations or not.We build on the state of the art method for this task (DEX-TER [21]) and introduce a set of additionall features whichallow us to improve the F-score for the detection task byup to 5%.

Speci�cation Extraction: In order to extract attribute-valuepairs from the HTML fragments identi�ed by the speci�ca-tion detector, we again use supervised learning to classifycolumns as an attribute or value column. In contrast to thestate-of-the-art system [21], we also support the extractionof attribute-value pairs from speci�cations having morethan two columns.

Schema Alignment: We employ a duplicate-based schemamatching approach to align the product attribute schemataof 32 di�erent e-shops. This experiment con�rms the suit-ability of duplicate-based methods for matching productattribute schemata.

The rest of this paper is structured as follows. Section 2 de�nesthe tasks of product feature extraction and product schema align-ment. We proceed by giving an overview of the related work inSection 3. Section 4 describes our methods for product feature ex-traction and product schema alignment. Subsequently, Section 5presents evaluation of the methods using a Web Data Commonsgold standard for product matching and product feature extraction.We conclude with a summary and an outlook on future work.

2 PROBLEM STATEMENTDe�nition 1. Product Feature Extraction from Technical

Speci�cations: We have a set of HTML pages W extracted fromthe Web. Every record s ∈W consists of a product title and a productdescription as unstructured textual �elds, as well as a technical spec-i�cation embedded in a HTML table or list. Our objective is extractattribute-value pairs from the embedded technical speci�cations ofthe records in W. More precisely, we �rst build a model that candetect the speci�cations in the records in W. Next, we build a fea-ture extraction model able to extract attribute-value pairs from thetechnical speci�cations in W. After the feature extraction model isapplied, each product speci�cation s ∈ W is represented as a list ofattributes-values pairs Ap =

{aw1 ,a

w2 , ...,a

wn}, where the attributes

are numerical or categorical.

De�nition 2. Product Schema Alignment: We have a cen-tralised product catalog C of structured product speci�cations. Ev-ery record r ∈ C consists of a set of attribute-value pairs Rp ={ac1 ,a

c2 , ...,a

cn}, where the attributes are numeric, categorical, or

free-text. Our goal is to use the product set C as a background

Extracting A�ribute-Value Pairs from Product Specifications on the Web Web Intelligence (WI’17), August 2017, Leipzig, Germany

knowledge to infer schema alignment rules to align the schemas forthe extracted attribute-value pair setAp from W. More precisely, letAc be an attribute from the catalog schema. Let Aw be an attributefrom the schema of record fromW (i.e.,Aw appears in speci�cationsfor a given vendor in W ). We say that (Ac ,Aw ) is an attribute cor-respondence from Ac → Aw if Ac and Aw have the same meaningin their values given a similarity function s (vAc ,vAw ).

3 RELATEDWORKThis section gives an overview of the existing research on productfeature extraction from free-text product descriptions, as well asexisting work on feature extraction from product speci�cations.

Feature Extraction fromProductDescriptions: Several meth-ods for extracting attribute-value pairs form product descriptionshave been developed for the use case of product matching. Themethods either use bag-of-words approaches to extract attribute-value pairs from the descriptions [3, 4, 12, 24, 25], a dictionary-basedapproach [6], or a combination of both [9, 18, 19].

In contrast, named entity recognition based feature extractionmodels are developed in [14, 17, 23]. All approaches use a similarmodels for feature extraction. In [14] propose an approach forannotating products descriptions based on a sequence BIO taggingmodel, following an NLP text chunking process. Speci�cally, theauthors train a linear-chain conditional random �eld model on amanually annotated training dataset, to identify only eight generalclasses of terms. However, the approach is not able to extract explicitattribute-value pairs. Ristoski and Mika [23] improved upon thisshortcoming employing a CRF model using a comprehensive setof discrete features that comes from the standard distribution ofthe Stanford NER3 mode. Ortona et al. [17] propose a three foldapproach that performs the following functions: validation of theo�ers values, blocking to reduce the number of compared o�ers,and scoring of the pairwise o�ers. For the validation, an annotatoris used which performs NER extraction (places, locations, names,organizations), and ontology which contains some domain speci�cconstrains. In the blocking step, all pairs of products that violatesome of the ontology constrains are clustered in di�erent clusters.In the third step, pairwise scores are calculated for the o�ers ineach cluster.

Recently, several approaches employ word embeddings as addi-tional knowledge for extracting features from product descriptionsfor the use cases of product matching [7, 26], product recommen-dation [5, 13, 28], and product classi�cation [10]. However, theapproaches can not bypass the problem of free-text product descrip-tions often covering only small number of features.

Feature extraction fromProduct Speci�cations: While thereis a relatively large body of research for extracting product featuresfrom product descriptions, only a handful of works have studiedthe problem of feature extraction from semi-structured data withinweb pages such as HTML tables and HTML lists.

Etzioni et al. [2] relies on a approach from [8] to extract planeticket prices from HTML tables within web pages. Speci�cally, themethod involves automatically learning wrappers relaying on socalled "landmarks" (i.e., groups of consecutive tokens) that enable

3http://nlp.stanford.edu/software/CRF-NER.shtml

a wrapper to locate the start and end of the item within the page.However, it is widely known that wrappers are ine�cient wayto solve the problem of automatic feature extraction from semi-structured data, since they have to be learned for every website.

The DEXTER system presented by Qui et al. [21] is the mostprominent approach for extracting attribute-value pairs from HTMLtables and HTML lists. The system is evaluated on the task of prod-uct data extraction. In order to identify HTML tables and lists withinproduct web pages, DEXTER uses a binary classi�er. To extractattribute-value pair, the system relies on straight forward heuristics.Namely, for extraction of speci�cation tables the authors proposeto extract the left column as the attribute name and everything elseas a concatenated attribute value. For extraction of list speci�cationthe authors propose to split each list item by popular delimiters,and employ the above rule afterward.

An earlier work was presented by Wong et al. [27], where theauthors build a graphical model, which employs latent Dirichlet pro-cess mixture model for extracting and aligning product attributes.To build the latent Dirichlet process mixture model, an unsuper-vised inference algorithm based on variational method is derived.This idea, was extended in [1], where the authors only focus onextracting “positive” features from speci�cations, by employing CRFmodel similar to [23]. The list of “positive” features is constructedby employing a sentiment analysis of product reviews found in theweb page.

The authors of [16] perform product matching on a dataset ofthe Bing search engine. In their approach, the authors use histor-ical knowledge to generate the attributes and to perform schemamatching. In particular, they visit the merchant web page to com-pare the values of the products in the catalog with the values onthe web page, converting the problem to a standard table schemamatching problem. Next, similar to our approach the authors useinstance-based schema matching to align the attributes’ names inthe catalog to the ones on the merchant web page.

4 METHODOLOGYThis section introduces our two step approach for extracting attribute-value pairs from technical speci�cations. Afterwards, we describethe matching method that we employ for aligning the productattribute schemata used by di�erent vendors.

4.1 Product Feature ExtractionAutomatically extracting speci�cations from HTML pages is not atrivial task. The technical speci�cations can be contained in di�er-ent HTML structures, however they are primarily found in tablesand lists [20, 21].

Similar to [21], we employ a two step approach: First, we detectspeci�cation tables and list in HTML pages: afterwards, we extractattribute-values pairs from the detected product speci�cations. Westart by training a model for detection of speci�cation tables andlists. Next, we apply that model on the web pages. Subsequently,we use a sample of the detected speci�cations to learn a modelfor column attribute and value detection. Finally, with the learnedmodel we extract attribute value pairs from the speci�cations. Theoverall speci�cation detection and attribute-value pair extractionprocess is shown in Figure 2.

Web Intelligence (WI’17), August 2017, Leipzig, Germany Petar Petrovski and Christian Bizer

Speci�cation Detection. As stated above product speci�cationsare mostly found in tables and list. Considering that HTML tablesand lists have a di�erent base structure, we train separate classi-�ers for detecting speci�cation tables and lists. The models aretrained by learning a binary classi�er that classi�es tables/list intospeci�cation and non-speci�cation.

The classi�ers use the following features which have also beenused previously by Qui et al. [21]: Average text length per row, num-ber of rows, overall frequency of the word "speci�cation", numberof links and number of images, standard deviation of text size. Inaddition to these features, we introduce the following new featuresin order to improve the detection accuracy: Average DOM nodedepth of the items relative to the root, average number of columns,standard deviation of columns, maximum number of columns, num-ber of non table/list tags, average ratio between numerical andalphabetical characters in a cell, and maximum number of rows.In order to illustrate the correlation of these features with the tar-get attribute (speci�cation vs. non-speci�cation), Figure 3 showssummary statistics about the evaluation dataset that we use in Sec-tion 5. As it is evident from the Figure 3, there is a clear di�erencebetween speci�cations and non-speci�cations when consideringthese features. For instance, when considering “standard deviationof columns”, speci�cation tables hardly deviate from a given layout,while non-speci�cation tables deviate form the layout much more.Another interesting feature is the “number of non-table\list tags”where non-speci�cation structures contain much more tags like“<p>, <span>” etc. than the speci�cation ones. With that said, it isintuitive that the binary classi�er could learn a better model if bothfeature sets are used.

Speci�cation Extraction. The authors of [21] apply straightforward heuristics to extract attribute-value pairs from productspeci�cations. Speci�cally, for tables they use a heuristic that, foreach table row, extracts the �rst cell as attribute name and the re-maining ones are extracted as concatenated values. Main limitationof this heuristics is that it can not handle tables that have morethan two attribute name columns, like the one in Figure 1b. Namely,instead yielding a correct result like in Figure 4, this heuristic willtreat values in the third column as attribute values, which is ob-viously incorrect . To solve this issue we train a model which candetect whether a given column is an attribute or a value column.Much like the detection model, we train a binary classi�er withthe following features: average text length per cell, average textlength per column, number of non table tags per column, standarddeviation of text size per column, ratio between alphabetical andnumerical characters in a cell.

Taking into account that lists follow di�erent structure (nocolumns), we convert list items to columns by separating the itemsby a delimiter and organise the separation result into columns. Weconsider the common delimiter characters: “:” and “;”. We don’tuse delimiters like “,”, “-” , “/” and “\” since they might be a part ofa product identi�cation numbers like MPN. After the conversionprocess is done we are able to train the same model as describedabove. However, like in [21], this approach falls short when the listsdo not contain any delimiter. Since the percentage of speci�cationwith out delimiters is less than 8%, we do not pursue a solution ofthis case.

Figure 3: Speci�cation vs. Non-speci�cation table/list fea-tures

Figure 4: Example of extracted Product Speci�cation Table

After the columns have been classi�ed, we continue with pairingattribute with value columns, after which each attribute and valueitem, row wise, is considered a pair. The pairing of columns is doneleft to right, that is starting from the most left we pair the �rstattribute and value columns and we continue to the right. In thecase of consecutive attribute or value columns, we concatenatethem. Figure 4 shows an example of a table with tagged attributeand value columns, where the �rst and second column constitute

Extracting A�ribute-Value Pairs from Product Specifications on the Web Web Intelligence (WI’17), August 2017, Leipzig, Germany

Figure 5: Examples of correspondences between product attributes: (left) Speci�cation fromwalmart.com, (center) Centralisedproduct Catalog and (right) Speci�cation from ebay.com

the �rst attribute-value pairing, while the third and the fourthcolumn constitute the second attribute-value pairing.

4.2 Product Schema AlignmentGiven that vendors use di�erent schemata to describe products,there is a need to align these schemata before tasks such as prod-uct matching, faceted product search, or content-based productrecommendation can be executed.

Motivated by [16, 22], we employ a duplicate-based approach formatching the product attribute schema that are used by di�erente-shops. The prerequisite for being able to employ this approach ishaving access to a mid-sized amount of correspondences betweeno�ers for the same product by di�erent e-shops. We use the corre-spondences between products to identify correspondences betweenthe attributes that are used to describe the products. These productcorrespondences can be obtained using di�erent methods, includingexploiting shared identi�ers such as GTIN, UPC, or EAN numbersor by relying on manual labeling. We use the Web Data CommonsGold Standard for Product Matching and Product Feature Extrac-tion [20] (see Section 5.1), which contains 564 correspondencesbetween product o�ers form three product categories (headphones,phones and TVs). The product o�ers originate from 32 web sites.

We start by assuming that same attributes have the same orvery similar attribute values. This is why a core task of the schemamatching method is to determine how similar two attribute valuesare.

We do not rely on exact equality of attribute values, since dif-ferent speci�cations can have small variations for their attributevalues. Moreover, attributes values do not have explicit data types,making an e�ective determination of similarity between these val-ues considerably more di�cult. After careful analysis of the theextracted attributes, we end up with three data types: categorical(strings), numerical and units of measurements. Since attributevalues often contain only one data type, prediction of that typeis a straightforward task. For the data type detection we use theheuristic introduced in [11]. The data type of each attribute is iden-ti�ed based on its values. First, the data type of each value of theattribute is detected by using about 100 manually de�ned regularexpressions which are able to detect the data types number (with or

without unit of measurement). Additionally, there are around 200manually generated rules for converting units of measurements tothe corresponding base unit.

In the following we explain the similarity measures for each datatype:

Categorical (strings): We employ Soft-Jaccard similarity onword-level n-grams (n=(2,3)), where we consider values assimilar if the threshold θ > 0.822 (the threshold is set byrunning exhaustive search on the possible pairs).

Numbers: A simple and in our experiments e�ective ap-proach is to calculate the ratio between the absolute val-ues of the numbers. As with strings, numbers can havesmall variations because of typos or rounding (e.g. 5 vs.4.7). Therefore, we consider to number to be similar if thethreshold θ > 0.901 (the threshold is set by running ex-haustive search on the possible pairs).

Unit of measurements: Product speci�cations often use dif-ferent units of measurements (metric vs imperial). This canmake simple number comparison ine�ective. For that pur-pose, we �rst convert every unit of measurement to themetric system and then employ the numeric comparison.

Having de�ned the similarity measures, we continue with apply-ing them to each attribute pair with compatible data types. Similarto [22], we apply a straightforward approach to match the twoschemata. For each attribute in both product speci�cations, thematch with the best score is determined and the �nal mapping isadopted only if an attribute a from the �rst product speci�cationhas an attribute b as their best match and vice versa, i.e we onlyconsider a match if the best scores are bi-directional. Finally, theapproach yields a set of bi-directional schema correspondencessorted per attributes in C . Figure 5 shows examples of extractedspeci�cations together with discovered correspondences betweenproduct attributes.

5 EVALUATIONIn this section we provide the evaluation for both the feature extrac-tion from technical speci�cation and schema alignments methods.We �rst detail the datasets used, highlighting some of the most

Web Intelligence (WI’17), August 2017, Leipzig, Germany Petar Petrovski and Christian Bizer

Table 1: Number of product o�ers per category in the WDCGold Standard

Category # of O�ersHeadphones 156Phones 254TVs 154

interesting properties used within the dataset. Next, we describethe experiment setup. Finally we discuss the results.

5.1 DatasetsWe use the Web Data Commons (WDC) Gold Standard for ProductMatching and Product Feature Extraction [20]4. The gold standardconsists of product o�ers from 32 di�erent websites. For each o�er,all product features have been manually annotated which appearin: (i) the name of the product marked up using the Microdatasyntax, (ii) description of the product marked up with Microdata,(iii) speci�cation tables, and (iv) speci�cation lists.

Product Feature Extraction Dataset. The gold standard forproduct feature extraction contains a set 564 product o�ers foundon the web, with annotated attributes found in speci�cation tablesand/or list. The o�ers were chosen from three categories: Head-phones, Phones and TVs. Table 1 shows the number of o�ers by cat-egory. From this set, for the purposes of the speci�cation detectionevaluation, we used 334 speci�cation tables and 169 speci�cationlists. Additionally we annotated 304 non-speci�cation tables (layout,listing etc.) and 150 non-speci�cation lists as negative examples.

Table 2 shows the product attribute densities for both the At-tribute Density in Descriptions (ADD) (Product Title and Text) andAttribute Density in Speci�cations (ADS). Since there are more than30 attributes per category only a sample of ten attributes per cate-gory is shown. Generally, attribute density for ADD is lower thanfor ADS, i.e attributes can be found more often in product speci-�cations than in product descriptions. Moreover, the table showsthat some attributes could not be found in the product descriptionsat all, whereas they can be found in product speci�cation.

Product Schema Alignment Dataset. We use the same datasetused for the product feature extraction, with the addition of corre-spondences to a uni�ed product catalog containing 150 productsfrom the following categories: headphones (50), phones (50), andTVs (50). The attributes in the catalog were scraped from leadingshopping services, like Google Shopping, or directly from the ven-dor’s website. In total, the data set contains 564 correspondencesbetween product o�ers from the 32 e-shops and the central catalog.

5.2 Experiment SetupThe models for table and list detection as well as the column detec-tion were trained and evaluated with 5 cross-fold validation. Weused a SVM classi�er implementation provided within the libSVM5.The SVM model was built with a linear kernel to train both thedetection and extraction models.

4http://webdatacommons.org/productcorpus/5https://www.csie.ntu.edu.tw/ cjlin/libsvm/

Table 2: Densities of the attributes within the Product Fea-ture Extraction Gold Standard. A�ribute Density in Descrip-tions (in the following ADD) represents the attribute den-sity of attributes found in the product description, while At-tributeDensity in Speci�cations (in the followingADS) repre-sents the attribute density of attributes found in the productspeci�cations.

Attribute ADD % ADS %Headphones

Brand 84 97Product Name 81 87MPN N/A 81Color 39 56Sensitivity N/A 53Impedance 10 53Cup Type N/A 47Form Factor 24 43Magnet Mat. 2 27Diaphragm N/A 25

PhonesProduct Name 81 91Memory 73 87Brand 87 86Color 71 79Display Size 18 71Rear Cam. Res. 10 70OS 11 64Display Res. 6 48Processor 11 28Front Cam. Res. 1 20

TVsBrand 79 100Product Name 67 91Display Type 70 81Display Size 50 65Display Res 20 55Tot. Size 45 51Ref. Rate 20 50Img. Asp. Rat. 3 38Connectivity 2 35Resp. Time N/A 10

The schema correspondences for the schema alignment taskwere inferred from 70% of the correspondences and tested on therest of the data6.

5.3 ResultsProduct Feature Extraction. Table 3 gives results on table detec-tion. As a baseline for the experiments we use DEXTER introducedin [21]. Our approach outperforms DEXTER in F-score by 5%, as aresult of adding more subtle features to our classi�er. On the other

6The models, data and code for both feature extraction and schema alignment can befound at http://webdatacommons.org/productcorpus/

Extracting A�ribute-Value Pairs from Product Specifications on the Web Web Intelligence (WI’17), August 2017, Leipzig, Germany

Table 3: Detection Results for HTML Tables and Lists

TablesPrecision Recall F-score

DEXTER[21] 0.820 0.887 0.842Our approach 0.904 0.883 0.892

ListsPrecision Recall F-score

DEXTER[21] 0.601 0.604 0.602Our approach 0.623 0.620 0.621

Table 4: Extraction Results for HTML Tables and Lists

TablesPrecision Recall F-score

Dictionary[20] 0.560 0.609 0.583DEXTER[21] 0.691 0.778 0.724Our approach 0.800 0.826 0.817

ListsPrecision Recall F-score

Dictionary[20] 0.414 0.527 0.463DEXTER[21] 0.548 0.623 0.583Our approach 0.700 0.556 0.619

Table 5: Product Schema Alignment results per Category

Precision Recall F-scoreHeadphones schema 0.946 0.967 0.956Phones schema 0.885 0.958 0.920TVs schema 0.862 0.953 0.905

hand, as a result of not using additional features for the list classi-�er, our approach for list detection does not show any signi�cantimprovement over the baseline.

Tables 4 shows the comparative results on the attribute-valuepair extraction for tables and list against the baselines. In particular,we compare to: (1) the dictionary approach presented in [20] and(2) the approach introduced in [21] (DEXTER). Comparably, listspeci�cations proved to be more di�cult to extract than tables.However, in both cases our approach outperforms the DEXTER andthe dictionary approach baselines. In the case for table speci�cationextraction by almost 10% and in the case of lists by 3%. The signi�-cant increase in F-score for table speci�cation can be explained bythe fact that our model can extract attribute-value pairs from tableswhich have more than a single attribute column. For instance, whenextracting attribute-value pairs from speci�cation like in Figure 4with the DEXTER approach the �rst column would be extracted asattributes and all other would be concatenated values introducingfalse positives and moreover not being able to extract the third andfourth column as separate attribute-value pairs introduces falsenegatives. On the other hand, list attribute-value pair extractionstill remains di�cult task, since: (i) the detection of product speci�-cation within lists is a more di�cult task, and (ii) a small number ofthe lists in the used dataset do not contain any kind of delimiters.

Product Schema Alignment. The result of our schema align-ment experiment is shown in Table 5. Our duplicate-based matchingmethods works quite well for all three product categories (all threeF-scores are above 90%). Noteworthy, is that for the headphonescategory we get signi�cantly higher F-score. This is a result of theconsistency of the attributes with which headphones are describedby the e-shops, as well as the lower number of distinct attributes ingeneral. Contrary to the the headphones category, the phones andTVs categories both have larger number of distinct attributes andtherefore attributes in these categories are not consistently used todescribe them.

6 CONCLUSIONAttribute-value pair extraction is a crucial prerequisite for many e-commerce tasks including product matching, recommendation andcategorisation. However, product names and descriptions often onlycover a relatively small number of product attributes and it is thusbene�cial to extract product features from product speci�cationswhich often cover a much larger number of features. In this paper,we have proposed an approach for extracting attribute-value pairsfrom product speci�cations. To perform the detection, and handlethe high diversity of speci�cations in terms of content, size andformat, our approach uses supervised learning to classify HTMLtables and lists present in web pages as speci�cations or not. Toperform extraction of the attribute-value pairs from the HTMLfragments identi�ed by the speci�cation detector, we again usesupervised learning to classify columns as an attribute or valuecolumn. We show improvements in F-score over the state-of-the-art[21] of up to 10% in extracting attribute-value pairs from tables andup to 3% for lists.

There are a number of potential future research directions. Onecould certainly try to improve list speci�cation detection and extrac-tion by introducing better features for the detection and extractionmodels. Additionally, with the of rise HTML5 more and more spec-i�cations are embedded into other layout friendly elements, thusincreasing the potential for �nding features in those elements.

REFERENCES[1] Lidong Bing, Tak-Lam Wong, and Wai Lam. 2016. Unsupervised Extraction

of Popular Product Attributes from E-Commerce Web Sites by ConsideringCustomer Reviews. ACM Trans. Internet Technol. 16, 2, Article 12 (April 2016),17 pages. DOI:http://dx.doi.org/10.1145/2857054

[2] Oren Etzioni, Rattapoom Tuchinda, Craig A. Knoblock, and Alexander Yates. 2003.To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price.In Proceedings of the Ninth ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD ’03). ACM, New York, NY, USA, 119–128. DOI:http://dx.doi.org/10.1145/956750.956767

[3] Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema, and Andrew Fano. 2006.Text mining for product attribute extraction. ACM SIGKDD Explorations Newslet-ter 8, 1 (2006), 41–48.

[4] Vishrawas Gopalakrishnan, Suresh Parthasarathy Iyengar, Amit Madaan, Ra-jeev Rastogi, and Srinivasan Sengamedu. 2012. Matching product titles usingweb-based enrichment. In 21st ACM international conference on Information andknowledge management. 605–614.

[5] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati,Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in Your Inbox:Product Recommendations at Scale. In Proceedings of the 21th ACM SIGKDD.ACM, 1809–1818. DOI:http://dx.doi.org/10.1145/2783258.2788627

[6] Anitha Kannan, Inmar E Givoni, Rakesh Agrawal, and Ariel Fuxman. 2011.Matching unstructured product o�ers to structured product speci�cations. In17th ACM SIGKDD.

[7] M. H. Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. 2015. Where to BuyIt: Matching Street Clothing Photos in Online Shops. In 2015 IEEE International

Web Intelligence (WI’17), August 2017, Leipzig, Germany Petar Petrovski and Christian Bizer

Conference on Computer Vision (ICCV). 3343–3351. DOI:http://dx.doi.org/10.1109/ICCV.2015.382

[8] Craig A. Knoblock, Kristina Lerman, Steven Minton, and Ion Muslea. 2003. Ac-curately and Reliably Extracting Data from the Web: A Machine Learning Ap-proach. Physica-Verlag HD, Heidelberg, 275–287. DOI:http://dx.doi.org/10.1007/978-3-7908-1772-0_17

[9] Hanna Köpcke, Andreas Thor, Stefan Thomas, and Erhard Rahm. 2012. Tai-loring entity resolution for matching product o�ers. In Proceedings of the 15thInternational Conference on Extending Database Technology. ACM, 545–550.

[10] Zornitsa Kozareva. 2015. Everyone Likes Shopping! Multi-class Product Catego-rization for e-Commerce. In The 2015 Annual Conference of the North AmericalChapter for the ACL. 1329–1333.

[11] Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paul-heim, and Christian Bizer. 2015. The Mannheim Search Join Engine. Web Se-mantics: Science, Services and Agents on the World Wide Web 35 (2015), 159 – 166.DOI:http://dx.doi.org/10.1016/j.websem.2015.05.001 Semantic Web Challenge2014.

[12] Nikhil Londhe, Vishrawas Gopalakrishnan, Aidong Zhang, Hung Q Ngo, andRohini Srihari. 2014. Matching titles with cross title web-search enrichmentand community detection. Proceedings of the VLDB Endowment 7, 12 (2014),1167–1178.

[13] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel.2015. Image-based recommendations on styles and substitutes. In Proceedings ofthe 38th International ACM SIGIR Conference. ACM, 43–52.

[14] Gabor Melli. 2014. Shallow Semantic Parsing of Product O�ering Titles (forbetter automatic hyperlink insertion). In Proceedings of the 20th ACM SIGKDDinternational conference on Knowledge discovery and data mining. ACM, 1670–1678.

[15] Robert Meusel, Petar Petrovski, and Christian Bizer. 2014. The webdatacommonsmicrodata, RDFa and microformat dataset series. In The Semantic Web–ISWC.277–292.

[16] Hoa Nguyen, Ariel Fuxman, Stelios Paparizos, Juliana Freire, and Rakesh Agrawal.2011. Synthesizing products for online catalogs. Proceedings of the VLDB Endow-ment 4, 7 (2011), 409–418.

[17] Stefano Ortona. 2014. An analysis of duplicate on web extracted objects. InProceedings of the companion publication of the 23rd international conference onWorld wide web companion. 1279–1284.

[18] Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Integrating productdata from websites o�ering microdata markup. In Proceedings of the companion

publication of the 23rd international conference on World wide web companion.1299–1304.

[19] Petar Petrovski, Volha Bryl, and Christian Bizer. 2014. Learning Regular Ex-pressions for the Extraction of Product Attributes from E-commerce Microdata.(2014).

[20] Petar Petrovski, Anna Primpeli, Robert Meusel, and Christian Bizer. 2017.The WDC Gold Standards for Product Feature Extraction and Product Matching.Springer International Publishing, Cham, 73–86. DOI:http://dx.doi.org/10.1007/978-3-319-53676-7_6

[21] Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, and Divesh Srivas-tava. 2015. Dexter: large-scale discovery and extraction of product speci�cationson the web. Proceedings of the VLDB Endowment 8, 13 (2015), 2194–2205.

[22] Daniel Rinser, Dustin Lange, and Felix Naumann. 2013. Cross-lingual EntityMatching and Infobox Alignment in Wikipedia. Inf. Syst. 38, 6 (Sept. 2013),887–907. DOI:http://dx.doi.org/10.1016/j.is.2012.10.003

[23] Petar Ristoski and Peter Mika. 2016. Enriching Product Ads with Metadata fromHTML Annotations. In Proceedings of the 13th Extended Semantic Web Conference.(To Appear).

[24] Ronald van Bezu, Sjoerd Borst, Rick Rijkse, Jim Verhagen, Damir Vandic, andFlavius Frasincar. 2015. Multi-component Similarity Method for Web ProductDuplicate Detection. (2015).

[25] Damir Vandic, Jan-Willem Van Dam, and Flavius Frasincar. 2012. Faceted productsearch powered by the Semantic Web. Decision Support Systems 53, 3 (2012),425–437.

[26] Xi Wang, Zhenfeng Sun, Wenqiang Zhang, Yu Zhou, and Yu-Gang Jiang. 2016.Matching User Photos to Online Products with Robust Deep Features. In Proceed-ings of the 2016 ACM on International Conference on Multimedia Retrieval (ICMR’16). ACM, New York, NY, USA, 7–14. DOI:http://dx.doi.org/10.1145/2911996.2912002

[27] Tak-Lam Wong, Wai Lam, and Tik-Shun Wong. 2008. An Unsupervised Frame-work for Extracting and Normalizing Product Attributes from Multiple WebSites. In Proceedings of the 31st Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval (SIGIR ’08). ACM, New York,NY, USA, 35–42. DOI:http://dx.doi.org/10.1145/1390334.1390343

[28] W. X. Zhao, S. Li, Y. He, E. Chang, J. R. Wen, and X. Li. 2015. Connecting SocialMedia to E-Commerce: Cold-Start Product Recommendation On Microblogs.IEEE Transactions on Knowledge and Data Engineering PP, 99 (2015), 1–1. DOI:http://dx.doi.org/10.1109/TKDE.2015.2508816