DEGREE PROJECT IN INDUSTRIAL MANAGEMENT, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

The (underestimated) role of product data for winning online retail

JOHN BOLMGREN

HENRIK LINDSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF INDUSTRIAL ENGINEERING AND MANAGEMENT

Master of Science Thesis TRITA-ITM-EX 2020:365

KTH Industrial Engineering and Management

Industrial Management

SE-100 44 STOCKHOLM

Swedish title: Den (underskattade) rollen av produktdata för att vinna e-handeln

Approved: 2020-06-15

Examiner: Lars Uppvall

Supervisor: Pernilla Ulfvengren

Commissioner

Contact person

Abstract

As e-commerce continues to take market share from traditional brick-and-mortar businesses, managers have few choices left apart from migrating their sales online. While the topic of online adoption has been studied extensively, this thesis investigates one of the major drivers of complexity within the industry: the role of structured product data. The study was performed at a major Nordic online retailer and identified a set of six guiding propositions on the topic of structured product data in e-commerce from interviews with industry professionals. Contemporary data science literature contributes to the body of evidence suggesting that a strategically prioritized focus on creating and maintaining structured product data is the way of the future for e-commerce, which aligns with much of the interview results. Furthermore, the propositions were examined through multiple linear regression analysis on data from the same firm. The study gives empirical support for a significant positive impact on most studied metrics from having structured product data available on the website as well as within the internal systems, with slight discrepancies across product categories.

Keywords: E-commerce, Product data, Structured product data

Sammanfattning (Summary)

As e-commerce continues to take market share from traditional physical stores, management teams have few alternatives other than migrating their sales online. Online migration as a topic has been studied extensively before, but this thesis seeks to explore one of the main drivers of the industry's complexity: the role of structured product data. The study was carried out at a major Nordic online retailer and identified six guiding themes on the topic of structured product data in e-commerce through interviews with experts at the company. Contemporary data science literature contributes evidence that a strategically prioritized focus on creating and managing structured product data is the way forward for e-commerce, which is in line with the results of the interviews in the study. Furthermore, the identified themes were analysed through multiple linear regression on data from the company. The study gives empirical support that structured product data on the retailer's website and in its internal systems has a significant and positive impact on most response variables, with some discrepancies between product categories.

Keywords: E-commerce, Product data, Structured product data

Contents

1 Introduction
1.1 Scope and delimitations of the paper
1.2 Setting the stage for discussing e-commerce data
1.3 Research questions

2 Theoretical background
2.1 The current state of academic e-commerce literature
2.2 E-commerce from the perspective of Data Science
2.2.1 Product data come in many shapes
2.2.2 Towards the mighty concept of a structured product catalogue
2.2.3 The value proposition of structured product data
2.3 Building a successful e-commerce business
2.3.1 Critical success factors in E-commerce
2.4 The different kinds of data affecting customer experience

3 Method
3.1 Proposition analysis
3.1.1 Interviews
3.2 Proposition validation
3.2.1 Multiple linear regression
3.2.2 Data
3.2.3 Model specification
3.2.4 Validity of assumptions
3.3 Research ethics

4 Results
4.1 Proposition analysis
4.1.1 Structured vs. unstructured product data
4.1.2 Structured data in online marketing
4.1.3 Structured data in website design
4.1.4 Structured data in assortment curation
4.1.5 Structured data in business intelligence
4.1.6 Risks of working with structured data
4.2 The propositions
4.3 Proposition validation
4.3.1 Data transformations
4.3.2 Coefficients of interest

5 Discussion
5.1 Evaluating the propositions
5.1.1 Proposition 1: Structured product data, in contrast to its unstructured counterpart, is significantly more valuable in terms of its potential application in all parts of the e-commerce value chain.
5.1.2 Proposition 2: Structured product data improves navigation
5.1.3 Proposition 3: Structured product data is crucial in search engine optimization
5.1.4 Proposition 4: Optimizing product titles is important for long-tail SEO, and structured product data makes them seamless to create
5.1.5 Proposition 5: High quality product images are important for selling products online
5.1.6 Proposition 6: Structured data is highly valuable for business intelligence and on-site curation
5.2 General implications of the results
5.2.1 Product catalogue creation
5.2.2 Toward a common product taxonomy
5.2.3 Critical success factors and their relation to product data
5.3 Limitations of the paper
5.3.1 Proposition validation
5.3.2 Limitations of the proposition analysis
5.3.3 Sustainability aspects of this paper
5.4 Conclusion

A Appendix

List of Figures

2.1 Illustration of structured vs. unstructured product page
2.2 Table of success factors from Varela et al. (2017)
3.1 Example QQ plot for the pageviews model of category bath
4.1 Example of different types of searches

List of Tables

4.1 Summary of the image count attribute regression coefficient per category
4.2 Summary of the category specific attribute regression coefficient per category
4.3 Summary of the base attribute regression coefficient per category
4.4 Summary of the standard attribute regression coefficient per category
4.5 Summary of the dimensions attribute regression coefficient per category
4.6 Summary of the title length attribute regression coefficient per category
4.7 Summary of the description length attribute regression coefficient per category
A.1 Regression table for category full
A.2 Regression table for category bath
A.3 Regression table for category construction
A.4 Regression table for category floor
A.5 Regression table for category int
A.6 Regression table for category kitchen
A.7 Regression table for category garden

Chapter 1

Introduction

E-commerce has gained tremendous ground over the past thirty years, and its growth has accelerated further in recent years. Today, there is no debate on whether e-commerce will account for a significant share of the consumer retail industry going forward; the question is rather how large that share will ultimately become. Strong structural trends such as digitization, online adoption, demographic shifts and, most recently, the consequences of the Covid-19 pandemic all support continued growth of the e-commerce industry. Amazon has become one of the world’s most valuable companies, with a significant part of its revenues attributable to its e-commerce business.

In light of these structural trends, many (if not most) brick-and-mortar businesses have been forced to adapt to the new market conditions by taking their business online, while new online-native businesses have entered the market. The competition for consumers’ online spending has become fierce, and the dynamics have shifted significantly as retailers, search engine companies, online aggregators and manufacturing companies all want their piece of the seemingly ever-growing e-commerce sector. The tough competition and rapid changes have prompted researchers to ask how a successful e-commerce business is built, what the key success factors are and how technological developments can be leveraged to win over the hearts and minds of online shoppers.

This paper asks what role data in general, and product data specifically, plays in the realm of e-commerce. When products are not on physical display but presented through images, descriptions and attributes, and when stores are not visible from the street but accessed through specific entries on a keyboard or smartphone, companies must adapt all parts of their business, from marketing to purchasing, in order to survive and thrive. In this study, we investigate how product data is used in all parts of an e-commerce business, what role it plays, how it should be treated and prioritized, and how it relates to a company’s ability to prevail in a harshly competitive landscape.

Since the theme of data applied specifically to e-commerce has not been widely discussed, we approach the topic with an open mind and simply ask what role it plays in creating a successful online business. This is done in the context of a case study involving interviews with several industry professionals working in different functions at a large Nordic e-commerce company, henceforth referred to as “the Company”. This part of the paper will be referred to as the “proposition analysis”. Insights and conclusions from the proposition analysis are then consolidated into a set of propositions about the role and significance of data in e-commerce, which are further investigated, benchmarked against previous research and tested using statistical methods on company data.

1.1 Scope and delimitations of the paper

For clarity, we begin by defining e-commerce. A general definition was proposed by Frost et al. (2018): “E-commerce refers to the online transactions: selling goods and services on the internet, either in one transaction (e.g., Amazon, Zappos, Ebay, Expedia) or through an ongoing transaction (e.g., Netflix, Match.com, Linkedin etc.)”. Given that this paper focuses on the trade of physical goods over the internet, we narrow the definition used in this study to: E-commerce refers to the transaction of physical goods over the internet.

We also clarify what we mean by data. There are many different kinds of data in the e-commerce space. They are collected and used for different purposes, and while some can be seen as generic to all businesses, sales data being the obvious example, other kinds exist more or less uniquely in the e-commerce sector. Akter and Wamba (2016) divide e-commerce data into four categories:

(a) Transactional data

(b) Click-stream data

(c) Data in the form of video

(d) Voice data

Given the narrowed scope of this paper and the resulting narrower definition of e-commerce, only the first two types in the categorization by Akter and Wamba (2016) apply to our definition in a meaningful way. We suggest a different classification of data types, based on the source of the collected data, as follows:

(a) Transactional data: Refers to data collected from transactions with the customer. This data type includes sales, profitability, pricing and return rates, to name a few.

(b) Behavioural data: Refers to data collected from the customers’ online behaviour and interactions with the e-commerce platform. This data type includes conversion rates, site visits, session lengths and points of entrance, among others.

(c) Logistical data: Refers to data collected from the process of shipping products to customers. This data type includes delivery times, delivery methods, stock levels etc.

(d) Product data: Refers to data collected from the products themselves. This type of data includes product features, images, titles and descriptions.

The focus of this paper going forward will be mainly on the impact of product data.

1.2 Setting the stage for discussing e-commerce data

For purposes of clarity, some key concepts are defined as they relate to e-commerce websites. While they are not commonly used in the literature on e-commerce, these concepts play an integral role in understanding the role of product data and will be referred to throughout this paper.

• Home page: The home page is the webpage that a customer is directed to if they enter the store using only the website’s domain without additions. Commonly, the home page in the e-commerce context is the first point of contact with the customer and can be used to browse the website’s assortment. A real-life analogy is the entrance of a large mall, where a visitor is guided by signs to the appropriate store or department.

• Landing page: Landing pages display several products within the same category or with other kinds of similarities. These pages are often used as intermediate stages between the home page and the product page. Here, customers can browse through a subset of the website’s assortment, often with the help of filters. A real-life analogy is the entrance of a store within the mall that sells a specific kind of product.

• Product page: The product page is where the customer can make the actual purchase of a product. The product page is dedicated to a specific product and contains information and images relating to that product.

This paper uses both qualitative and quantitative methods to answer the research questions (see section 1.3). First, a proposition analysis was conducted as a case study of a large Nordic e-commerce company, as proposed by Baxter and Jack (2008). Interviews were conducted with leading professionals at the company with the aim of deriving propositions relating to product data and its role in e-commerce. Furthermore, a rigorous review of current literature on the subject was conducted to give perspective to the data gathered in the proposition analysis. Given these propositions (see below), the proposition validation tested their legitimacy in the context of this specific company through a quantitative analysis of company data (see section 4.3). While we recognize that a single company cannot be generalized to the industry as a whole, since it is bound by its specific circumstances, we consider the Company a good subject for study given its presence in many different product categories as well as its size and market share. The conclusions might not hold in the general case, especially those drawn from the quantitative analysis of company data. We will nevertheless try to argue as generally as possible, since it is our sense that the assumptions laid out in the propositions are broadly considered to be true, even outside the Company and its subsidiaries.

For reference, the following propositions were extracted from the proposition

analysis, and are elaborated on in section 4.1:

• Proposition 1: Structured product data, in contrast to its unstructured counterpart, is significantly more valuable in terms of its potential application in all parts of the e-commerce value chain

• Proposition 2: Structured product data improves navigation

• Proposition 3: Structured product data is crucial in search engine optimization

• Proposition 4: Optimizing product titles is very important for long-tail SEO, and structured product data makes them seamless to create

• Proposition 5: High quality product images are important for selling products online

• Proposition 6: Structured data is highly valuable for business intelligence and on-site curation

1.3 Research questions

The following research questions are proposed for the study, and are intimately linked to the identified propositions:

1. What role do online retailers assign to structured product data?

2. How well does the online retailers’ appreciation of structured product data align with measurable outcomes?

Chapter 2

Theoretical background

The purpose of Chapter 2 is to give the reader an introduction to contemporary academic literature on e-commerce in general, and on product data in e-commerce in particular. Furthermore, this chapter provides a critical academic reference for discussing the propositions identified in section 4.1. Given that the interviews conducted within the study were confined to a single company, this literature review is deemed necessary to indicate whether the findings from the proposition analysis have the potential to be valid in the general case as well.

2.1 The current state of academic e-commerce literature

The role of data in e-commerce has been studied from multiple perspectives. Little has been written in the field of management on the necessity of placing data at the heart of every e-commerce business. This is surprising, given that data management is a core struggle in the daily operations of these businesses, taking up the vast majority of all operational activities. However, a lot has been written from a technological perspective, ranging from the potential analytical value that could be extracted from e-commerce data to the field of data science, which has extensively studied methods for mining, deploying and enriching product data as well as the potential of that same data for search engine and UX applications.

From the perspective of business and management, the topic has mainly been approached by studying critical success factors in e-commerce more generally, and by studying the potential of Big Data Analysis (BDA) in the e-commerce setting, given its nature as an industry with great access to many kinds of data in tremendous volumes. An extensive position paper on the current state of research on BDA in e-commerce is offered by Akter and Wamba (2016), from which we have drawn several references for this paper. The overall conclusion from the study of BDA applications in e-commerce is nicely summed up by Loebbecke and Picot (2015) as “the platform for growth of employment, increased productivity, and increased consumer surplus”.

The data science field has approached the topic of data in e-commerce from a more practical standpoint. The value of having high quality data is seen as axiomatic, and much of the research is centered around how data on product specifications, reviews and prices can be mined, structured and leveraged to fit applications such as search engine optimization, product catalogue creation and product matching. Research on product feature extraction from unstructured online data sources has made significant progress in recent years, and the most successful methods from the area are summarized by Rao and Sashikuma (2016). Methods for solving the far from trivial problem of matching identical products from different sources have been proposed by Ristoski et al. (2018), and a method for synthesizing product catalogues from unstructured data sources was given by Nguyen et al. (2011).

The common theme of the data science papers on e-commerce data is that the proposed applications are rarely aimed at the e-tailers themselves, but rather at platform-type applications such as product search engines and other recommendation engines for consumer use. This approach is taken by Nguyen et al. (2011), who describe a method for synthesizing products for online catalogues using novel methods in computer science, with the explicit aim of creating generalized product catalogues that consolidate data from many e-commerce websites. On the same general topic, Ristoski et al. (2018) lay out a method for both categorization and matching of products using neural language models and deep learning. The paper mentions Google Product Search explicitly as a target use case for their methods, but implicitly makes the same assumption as Nguyen et al. (2011), namely that e-commerce companies have already solved the problem of data quality and reliability internally, and that the next natural step in the data accessibility value chain is democratizing the data by consolidating it across all e-commerce actors.

The aim of the following sections of the literature review is to provide an overview of recent academic efforts adjacent to the topic of data in e-commerce. Publications in the field are dominated by data science papers, which we will try to summarize in language understandable to those not versed in the field. The key point is to stress two important facts that become evident from the literature:

1. There is a vibrant discussion in the data science community on methods for, and applications of, e-commerce data, driven and financed not primarily by the e-commerce sector but by the technology giants and search engine companies. The value of structured product data is treated as axiomatic, and much of the research rests on the assumption that high quality data is already “out there”, so that the problem to be solved becomes 1) collecting the data, and 2) structuring the collected data.

2. Regardless of which field of study we turn to, there is little emphasis on the value of data for the e-commerce companies themselves. Very little is written on topics such as management priorities, operational challenges and marketing opportunities in e-commerce in general. In particular, none of that research has the same axiomatic conviction about the value of data that permeates the data science community.

2.2 E-commerce from the perspective of Data Science

Sticking to our categorization of e-commerce data, it becomes evident that the focus of data science research is on product data. Keep in mind that much of this research is aimed at finding solutions for consolidated product databases such as price comparison sites and product search engines or, to borrow an expression from Krys and Bagheri (2016), at finding solutions for “online aggregators”. The interest in product data has emerged as the growth of e-commerce has continued to accelerate (Nguyen et al., 2011). We focus this part of the literature review on data in text form, meaning that media is left for a later part of the discussion.

2.2.1 Product data come in many shapes

An important distinction that is often made in the data science community (but rarely, if ever, in the business community) is whether a set of e-commerce data is unstructured, semi-structured or structured (Rao and Sashikuma, 2016). Unstructured data is difficult to use in its original form for applications ranging from BDA (Kang et al., 2003) to search engine optimization and product catalogue creation (Nguyen et al., 2011). Nguyen et al. (2011) conclude on the topic of structured data that “This structured data is fundamental to drive the user experience: it enables faceted search, comparison of products based on their specifications, and ranking of products based on their attributes.” To shed some light on the distinction between structured and unstructured data, we refer the reader to Figure 2.1. In the case of the unstructured product page, the data is in free-text format, and even though the reader can get a sense of the product, the ability to leverage this data is very limited for most applications. A basic example relates to on-site navigation: if there is no product-level structured data, it is not possible to create filtering functionality that the user can apply to find relevant results among large assortments of products. As further examples, applying AI/ML algorithms to unstructured data generally yields inferior results compared to structured data (Shimada and Endo, 2005), and the ability to generate relevant search results is significantly improved by searching in a structured database rather than an unstructured one (Duan et al., 2013).

Figure 2.1: Illustration of structured vs. unstructured product page

This distinction between the different kinds of product data makes clear the need for structured data, preferably in the form of key-value pairs (i.e. a key along with a connected value, where “Color” is an example of a key associated with the value “Blue”). Structured data of this kind has emerged as an integral component for achieving better customer experience (Ristoski et al., 2018) as well as improved search performance (Nguyen et al., 2011).
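As a minimal sketch of why key-value pairs matter for navigation, the following Python snippet contrasts a free-text product description with the same kind of product stored as key-value pairs, and shows a simple facet filter over the structured form. The product records, the attribute names and the helpers `facet_filter` and `facet_values` are hypothetical and chosen for illustration only; they are not taken from the Company's systems or from the cited papers.

```python
# Hypothetical product records; keys and values are illustrative only.
unstructured = "Blue ceramic washbasin, 60 cm wide, wall-mounted."

catalogue = [
    {"title": "Washbasin A", "Color": "Blue", "Material": "Ceramic", "Width (cm)": 60},
    {"title": "Washbasin B", "Color": "White", "Material": "Ceramic", "Width (cm)": 45},
    {"title": "Washbasin C", "Color": "Blue", "Material": "Steel", "Width (cm)": 60},
]


def facet_filter(products, **criteria):
    """Return the products whose attributes equal every requested key-value pair."""
    return [
        p for p in products
        if all(p.get(key) == value for key, value in criteria.items())
    ]


def facet_values(products, key):
    """List the distinct values of one key, i.e. the options a filter menu could show."""
    return sorted({p[key] for p in products if key in p})


blue_ceramic = facet_filter(catalogue, Color="Blue", Material="Ceramic")
print([p["title"] for p in blue_ceramic])
print(facet_values(catalogue, "Color"))
```

The free-text string carries similar facts, but a filter over it would first require parsing the text, which is the gap the structuring methods discussed in the literature aim to close.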

2.2.2 Towards the mighty concept of a structured product catalogue

As such, the task of data science research in the area can be thought of as three-fold, remembering its desired application for “online aggregators”: 1) Collect the raw data (unstructured, or structured according to an unknown structure) from publicly available sources on the web. 2) Make the unstructured data structured by a) categorizing the products along a predefined “category tree” and b) extracting key-value pairs from the unstructured product data according to a predefined schema of keys associated with the chosen category. 3) Aggregate the products into a product catalogue (Rao and Sashikuma, 2016). Several methods have been proposed for these three tasks, including web mining via crawler scripts for collection, regular expressions and/or machine learning for structuring data, and other machine learning methods and feature comparison for aggregation. Worth noting is that all efforts in this area aim to build fully automated systems that perform all of the steps above.
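The structuring and aggregation steps can be sketched in Python as follows. This is a toy illustration using regular expressions, one of the method families mentioned above. The category keywords, the schema, the patterns and the sample texts are all hypothetical, and the collection step (web crawling) is replaced by hard-coded strings. It is not the method of any of the cited papers.

```python
import re

# Step 1 (collection) is stubbed out: hypothetical raw texts stand in for crawled pages.
raw_pages = [
    "Oak laminate floor, colour: brown, thickness 8 mm",
    "Ceramic washbasin. Colour: white. Width 60 cm",
]

# Hypothetical category tree (flattened to keywords) and per-category key schema.
CATEGORY_KEYWORDS = {"floor": ["floor", "laminate"], "bath": ["washbasin", "bathtub"]}
SCHEMA = {
    "floor": {"Color": r"colour:\s*(\w+)", "Thickness (mm)": r"thickness\s*(\d+)\s*mm"},
    "bath": {"Color": r"colour:\s*(\w+)", "Width (cm)": r"width\s*(\d+)\s*cm"},
}


def categorize(text):
    """Step 2a: assign the first category whose keyword occurs in the text."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return category
    return None


def extract_pairs(text, category):
    """Step 2b: extract key-value pairs for the keys in the category's schema."""
    pairs = {}
    for key, pattern in SCHEMA[category].items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            pairs[key] = match.group(1)
    return pairs


def build_catalogue(pages):
    """Step 3: aggregate the structured products into a catalogue keyed by category."""
    catalogue = {}
    for text in pages:
        category = categorize(text)
        if category is None:
            continue  # Uncategorized products are skipped in this sketch.
        catalogue.setdefault(category, []).append(extract_pairs(text, category))
    return catalogue


catalogue = build_catalogue(raw_pages)
print(catalogue)
```

Hand-written patterns like these break as soon as a source changes its wording, which hints at why the literature also turns to machine learning for the structuring step.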

For the purposes of this paper, where the emphasis lies on the e-commerce sector, which is the data source for this field of research, the same three-fold process can be applied if the data source in step 1 is exchanged for the e-tailer's suppliers. This effectively moves the whole process one step backward in what can be thought of as the “data supply chain” or “layers of data consolidation”. It should be recognized that many suppliers to the retail industry in general, and the e-commerce sector in particular, do not have sophisticated websites that make the full set of raw data publicly available. However, if a supplier product database is substituted for the supplier website, the comparison still holds. Efforts have been made to use supplier websites as the source of raw data, though to a significantly lesser extent than using e-commerce websites directly (Walther et al., 2010).

Given the similarities in approach to the data supply chain between the use cases, it is relevant to address some key challenges faced by the data science community in these tasks. Rao and Sashikuma (2016) describe the major hurdles researchers face in structuring data. These include the volatility of the source data (i.e. the e-commerce websites), the differing data formats of different sources (i.e. structured tabular formats vs. unstructured text formats) and the incompleteness of the source data with regard to the target schema and its keys. It is not far-fetched to assume that e-commerce companies face similar challenges in relation to their suppliers.
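One way to make the incompleteness hurdle concrete is to measure how many keys of a target schema a supplier record actually fills. The sketch below is a hypothetical illustration: the schema, the record and the functions `missing_keys` and `completeness` are assumptions introduced here, not measures used in the cited papers or in this study.

```python
# Hypothetical target schema for one product category.
TARGET_SCHEMA = ["Color", "Material", "Width (cm)", "Depth (cm)"]

# Hypothetical supplier record; missing keys and empty values both count as gaps.
supplier_record = {"Color": "White", "Material": "", "Width (cm)": 60}


def missing_keys(record, schema):
    """Return the schema keys that are absent or empty in the record."""
    return [key for key in schema if record.get(key) in (None, "")]


def completeness(record, schema):
    """Share of schema keys that carry a non-empty value (0.0 to 1.0)."""
    return 1 - len(missing_keys(record, schema)) / len(schema)


print(missing_keys(supplier_record, TARGET_SCHEMA))
print(completeness(supplier_record, TARGET_SCHEMA))
```

A score of this kind could in principle be tracked per supplier or per category, though which keys belong in the schema remains a judgment call for each assortment.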

2.2.3 The value proposition of structured product data

To conclude the review of data science progress in this field, we address the topic of value creation and try to answer why working toward complete and structured data is important for e-commerce actors and “online aggregators” alike.

Considering the main objective of the research, that is, creating a structured

product catalogue, Nguyen et al. (2011) says ”The product catalog is to

online shopping what the Web index is to Web search” and elaborates by

”[...] structured data is fundamental to drive the user experience: it enables

faceted search, comparison of products based on their specifications, and

ranking of products based on their attributes.”. Thus Nguyen et al. (2011)

regards structured data as an important enabler for a wide range of further

applications. Petrovski and Bizer (2017) make a similar analysis and argues

”The central challenge for many tasks within the domain of e-commerce, in-

cluding product matching, product categorization, faceted product search,

and product recommendation, is extracting attribute-value pairs with high

15

precision from unstructured product descriptions or semi-structured prod-

uct specifications.”. Ristoski et al. (2018) takes the perspective of the e-

commerce customer and argues that as the aggregated online assortment of

products has expanded it has become increasingly difficult for customers to

find and compare products online. Investigating the cause of this experi-

enced hardship on the part of the customer, Ristoski et al. (2018) find that

the majority of products for sale online is presented only in terms of a ti-

tle and description, meaning that unstructured product data dominate the

online retail environment. Looking at e-commerce websites’ input feeds of

product information, where target schemas for “online aggregators’” product

catalogues are clearly stated, the authors find that the data is often incom-

plete in comparison to the input schema - making the search performance of

those products orders of magnitude less effective than products fulfilling the

schema requirements. Staying in the customer perspective, Walther et al.

(2010) argue that structured product specifications are the most valuable

data for the online consumer as it creates a comprehensive understanding of

the product and allows comparison with other similar products.

We have briefly addressed the underlying assumption of completeness in the

data that is prerequisite for the success of aggregation systems of product

data. To fully address the problems of the assumption we turn to Walther

et al. (2010), whose thesis is built on using supplier websites as the source for raw

data collection given the flaws in e-commerce data. On e-commerce

data they argue that “The information in individual online shops is restricted

to only the sold products and often error prone and not comprehensive”

and drive the thesis that supplier data is in contrast “complete, correct and

up-to-date”. Along with Rao and Sashikuma’s (2016) identification of data

incompleteness as a core obstacle in the journey towards building comprehensive

product catalogues, we conclude that e-commerce websites cannot be

considered a reliable source of complete product information.

Lastly, the value of complete and structured data is evident in terms of ma-

chine learning applications. Having incomplete data generates substantially

weaker classifiers from ML algorithms (Shimada and Endo, 2005), and

structured data works better in creating strong ML-based systems than its

unstructured counterpart.

2.3 Building a successful e-commerce business

Given the technological developments in recent decades, many businesses

have had to rethink traditional ways of conducting commerce and adapt

their business to emerging technologies. Online commerce has been one such

example, where brick-and-mortar retail businesses in particular have been

forced to go online to stay competitive in a new market environment. Given

these developments, transitioning businesses online and adapting them to the

digital era has become a major research area. E-commerce in particular has

been the target for much of this research to address the challenges companies

face during this transition.

Transitioning brick-and-mortar businesses online appears to be easy. However,

constructing a profitable online based model including everything from prod-

uct presentation to fulfillment of logistical promises and after-sale activities

is evidently a big challenge (Atchariyachanvanich et al., 2008). While the

online supply of products and destinations where they can be purchased has

grown tremendously in the past decade, E-commerce has not evolved at the

same rate in quality and the possibility of setting up an online store without

huge initial investments has driven many without domain knowledge to in-

vest in this area (Varela et al., 2017). The strong trend of internet adoption

on the part of the consumer has forced companies online rapidly in order for

them to stay relevant, but winning online takes more than presence and as

the competition has grown stronger, the need for domain knowledge to create

competitive advantages has become painfully evident for market participants.

2.3.1 Critical success factors in E-commerce

Varela et al. (2017) summarize the research on success factors for e-commerce

companies and find that the majority of the studies identify five categories

that need addressing to stay competitive online: technology acceptance fac-

tors, social factors, cognitive factors, ethical factors and environmental fac-

tors. Technology acceptance factors aside, the critical success factors relate

to organizational challenges that emerge from the effort of transitioning a

business from offline to online as well as behavioural challenges in getting

the consumers to adapt to online purchasing. Breaking down the larger

themes laid out by the categories, Varela et al. (2017) suggest twelve critical

success factors that must be addressed for building a successful e-commerce

website. These are presented in Figure 2.2.

Figure 2.2: Table of success factors from Varela et al. (2017)

While the success factors are often discussed in general terms in the literature

without touching on the topic of key enablers for the different dimensions

of building a website, some have touched upon the topic of data completeness

and quality. Burgess and Karanasios (2008) and Cebi (2013) identify

information quality as a main factor in building a competitive e-commerce

business and Chaudhuri et al. (2019) argue that ”In e-commerce, content

quality of the product catalog plays a key role in delivering a satisfactory

experience to the customers”. The most widely discussed factor relates to

website usability and Varela et al. (2017) discuss on-site navigation as a crit-

ical problem in terms of usability. Moreover, the aspect of trust has been

discussed at length within this research area as it relates to both social and

ethical success factors (Lee and Lin, 2005; Machado, 2011). Trust is important

in every aspect of e-commerce, from describing products objectively

and honestly to practicing solid privacy policies (Ngai, 2003).

Another example of research on the topic conducted at a higher level

of abstraction is provided by Choshin and Ghaffari (2017), who investigate

important factors for small- and medium-sized companies in creating online

businesses and find statistical support for customer satisfaction, cost, technological

infrastructure and customer awareness and knowledge being integral

factors for success. Furthermore, Nisar and Prabhakar (2017) find perceived

value, customer expectations, perceived quality and loyalty to be important.

To summarize, the research done in the realm of business and management

has accurately depicted the broad strokes of the many factors that are nec-

essary to keep in mind when pursuing the e-commerce space. However, the

field has yet to discuss the connection between these general factors and the

underlying data that is needed to support many of them.

2.4 The different kinds of data affecting customer experience

So far in our discussion on e-commerce in general and the aspect of data in

particular, we have focused on data in textual format. An important note

is that selling products online demands complementary data to ensure a

good customer experience. Product pages today contain reviews, comments,

images and videos along with the textual product data which all contribute

to the customer experience. The impact of these different forms of product

data on how a product is perceived online has been discussed individually.

Chaudhuri et al. (2019) discuss the impact of product images and argue that

”Images play a key role in influencing the quality of customer experience and

the customers’ decision-making path in e-commerce transactions. Images

provide detailed product information that helps the customer build confi-

dence in the product quality and fulfillment promises.” and further argue

that bad or incorrect images can have a significant negative impact on the

customers’ willingness to purchase a product online.

Similar studies have been made on the impact of product reviews by for ex-

ample Singh et al. (2017) and Wan et al. (2018). We want to highlight that

complete and structured data goes beyond the realm of textual data and end

on an important point made by Chaudhuri et al. (2019): “Human errors in

compiling product information and limitations of software systems severely

hinder the ability to provide a homogeneous content experience across cate-

gories to the customer.”

Chapter 3

Method

This chapter aims to give the reader an understanding of the methodology

used, and methods applied, when conducting the study. On the highest

level, a qualitative method using a case study approach was used in order to

evaluate the first research question: what role does an online retailer place

on structured product data? The findings from this analysis resulted in a set

of six key propositions. These propositions were used as input for a validity

analysis in the form of a multiple linear regression model, where data from the

subject company was used in an attempt to validate each of the propositions.

3.1 Proposition analysis

The proposition analysis is structured as a single case study with embedded

units as described by Baxter and Jack (2008). In this case, the embedded

units are the subsidiaries of the Company, and the analysis will largely be

considered a cross-case analysis. The results of the interviews are analyzed

and consolidated into a set of propositions; in this methodological context they

can be directly related to the propositions in the case study framework put

forward by Yin (2003). The design choice of linking data to propositions

was made in order to create a solid foundation for the latter part of the

study. The use of pattern matching (Yin, 2003) is deemed appropriate in order

to determine patterns observed from individuals close to, or within, the data

management teams at the Company and its subsidiaries. This would require

interviews as the main data collection method, which will be discussed in

greater detail below (Yin, 2003).

The proposition analysis encompassed 15 exploratory interviews with em-

ployees and management at a large Nordic e-commerce company. The main

purpose of this analysis was to gain insights into the role of data in e-

commerce. This was done by identifying themes where the importance of

data is prevalent; these themes then acted as input to the proposition validation

analysis. The interviews were conducted in January and February of

2020.

3.1.1 Interviews

The guiding question of the role of data in e-commerce will be analysed

through interviews using a qualitative lens as outlined by Creswell (2009).

The process can, in short, be described in the following steps:

1. Collecting raw data (transcripts, notes etc.)

2. Organizing and preparing data for analysis

3. Reading through the data

4. Coding the data

5. Identifying themes and descriptions for themes

6. Interrelating themes/descriptions

7. Interpretation

8. Validating accuracy of information (through cross-validation)

The interviews serve the main purpose of acting as input data for the formu-

lation of the propositions. The interviews were semi-structured in the sense

that they related to the guiding theme, while allowing the interviewees the

freedom to potentially add propositions of their own, which may or may not

be included in an extended scope.

The interviews were conducted in ten separate sessions either in person or

via video-conference. Interviewees were picked from multiple organizational

levels and are listed below, categorized by organizational function:

• Management

– Chief Operating Officer

– Head of Business Development & Strategic Projects

• Merchandising

– Head of Merchandising

– Merchandiser (x2)

• Product management

– Senior Category manager

– Junior Category manager (x2)

• Online marketing

– Head of online marketing

– Online marketing specialist (x2)

• Business controlling

– Controller (x2)

• Content & marketing

– Content curator (x2)

3.2 Proposition validation

The following propositions (refer to section 4.2) were deemed appropriate for

a quantitative analysis given the data available: Proposition 1, Proposition

2, Proposition 3, Proposition 4, Proposition 5. These propositions crucially

relate to tangible response variables in the form of internal traffic, external

traffic and quantity of orders. These response variables are described below.

The focus of the proposition validation lies in conducting a quantitative anal-

ysis of the propositions from section 4.1 in order to evaluate their legitimacy

connected to actual sales and product data within the scope of the specific

company. Note, again, that this single company is not to be used as a direct

generalization, but is considered an adequate subject for the scope of the

study as a whole.

• Quantity of orders, in the models denoted as ”quantity”, is defined as

the number of orders placed from a single product page. This response

facilitates the evaluation of propositions 1, 2 and 5, as we can evaluate

the impact that our meta-attributes and images have directly on sales.

• External traffic, in the models denoted as ”sessions”, is defined as the

number of times a user has started their session on the e-commerce

website on a specific product page. That is, a session is only counted

where the user enters the e-commerce website from an external link on

e.g. a search engine. This response variable is thus suited to quantify

the external traffic that a single product page generates. This response

facilitates the evaluation of propositions 3 and 4, as we can evaluate

the impact of our chosen meta-attributes and the product title on the

external traffic that they generate.

• Internal traffic, in the models denoted as ”pageviews”, is defined as the

number of times any user has visited a product page, but not started

their session on that specific product page. This response variable is

thus suited to quantify the internal traffic that a single product page

generates. This response facilitates the evaluation of propositions 1

and 2, as we can evaluate the impact of our chosen meta-attributes on

the internal traffic that they generate.
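As a minimal sketch of how the three response variables defined above could be tallied per product page: the record layout (`session_id`, `page`, `is_entry`), the function name and the sample data are hypothetical and do not reflect the Company's actual Google Analytics or ERP exports.

```python
from collections import Counter

def response_variables(hits, orders):
    """Count sessions, pageviews and quantity per product page.

    hits   : iterable of (session_id, page, is_entry) tuples, where is_entry
             is True when the session started on that page (external traffic).
    orders : iterable of page identifiers, one entry per order placed from that page.
    """
    sessions = Counter()   # external traffic: visits that start a session on the page
    pageviews = Counter()  # internal traffic: visits that did not start the session
    for _session_id, page, is_entry in hits:
        if is_entry:
            sessions[page] += 1
        else:
            pageviews[page] += 1
    quantity = Counter(orders)
    pages = set(sessions) | set(pageviews) | set(quantity)
    return {
        p: {"sessions": sessions[p], "pageviews": pageviews[p], "quantity": quantity[p]}
        for p in pages
    }

# Hypothetical sample: three sessions touching two product pages, one order.
hits = [
    ("s1", "sofa-a", True),
    ("s1", "sofa-b", False),
    ("s2", "sofa-b", True),
    ("s3", "sofa-a", False),
]
orders = ["sofa-b"]
table = response_variables(hits, orders)
```

Keeping the entry flag on each hit makes the split between external and internal traffic explicit, which mirrors the definitions of "sessions" and "pageviews" given above.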

The only proposition left out of the quantitative analysis is thus Proposition

6. This proposition captures the value of structured data on business intel-

ligence, and the benefits of exploiting such assets are not as direct as with

the former propositions.

3.2.1 Multiple linear regression

In order to evaluate how rich and structured data on product features is a

driver of online sales and traffic, a multiple linear regression model is proposed

as it is widely used for this kind of problem (see e.g. Ye, Law and Gu, 2009).

This method of analysis allows us to not only evaluate whether there is a

significant impact on sales, but also to control for differing product/retailer

contexts in the analysis.

The full quantitative analysis will be made on the aforementioned response

variables on the company subject to study. The analysis consists of differ-

ent product categories which will be the main analysis in investigating the

legitimacy of propositions 1-5, but will also strengthen our analysis towards

a generalized conclusion. Data for the analysis will be made available to us

by the company and will be drawn from internal ERP-systems, PIM-systems

as well as from Google Analytics.

As mentioned, there are three response variables of interest. Each of the

response variables is to be modelled individually:

1. Number of visits to the product page from external links

2. Number of visits to the product page from internal links

3. Quantity of orders on a product page

These variables were modelled using essentially the same predictors where

the predictors were different measures of the data quality of the product page

in question. These measures included (but were not limited to): quality of

product title, length and quality of product description, number and type of

product attributes, number of high quality images and classification of the

product. The construction of the model and the choice of predictors have

been careful and deliberate, drawing from the interviews with industry pro-

fessionals from the proposition analysis as well as the theoretical background

in Chapter 2. Furthermore, a number of control variables that are well es-

tablished to correlate with the responses were used in order to limit model

variance.

In summary, the multiple linear regression model will not try to predict sales

or traffic, since we are aware that the aspect of product data is only one theme

among many that impact these variables. Instead, we want to investigate

the aspect of product data as it relates to sales and online traffic to see 1)

whether they have a significant role in predicting how well a product sells

online and thus further validate the propositions, and 2) how big of an impact

the different aspects have individually and in relation to one another.

3.2.2 Data

This section mainly aims to describe the quantitative data collected through

the Company’s various databases, but will also give a brief discussion on the

format of the interviews conducted.

For the data compiled from the Company’s internal databases, the chosen

time span ranges over two years - from 2018-01-01 through 2019-12-31.

Product data

The product data set is compiled from multiple exports from the Company’s

own PIM (product information management) system. The complete data set

contains all the relevant information on the SKU (stock keeping unit) level of

the product that is presented on the website. This is crucial for the analysis,

as we can utilize the category groupings on multiple levels to infer different

rules in the analysis.

A full list of parameters used in the models of analysis will be provided in

the appropriate section.

Sales data

The sales data is collected from the Company’s ERP (Enterprise resource

planning) system. In practice, the data describes the sales on SKU level,

both in terms of total revenue and number of SKU sold.

Traffic data

The traffic data set is generated from the Google Analytics platform, and

provides us with information on page hits, the customers’ journeys through

the website and conversion rates on the level of web pages. I.e., we can utilize

this data to track where the visitor entered the site, and how the journey

towards a specific product is conducted in order to model the importance of

certain data features.

Product attribute metadata

The product attribute metadata data set is a consolidated set on the product

data, where we define units of analysis relevant to the propositions. Firstly,

the key objective is to find measures indicating to what extent the products

have structured product data. Our approach is to consolidate the data in

different groupings, and count how many structured data points are present

for the different products in the data set. Secondly, we want to measure other

aspects of the data in one way or the other relating to the propositions. We

try to find measures for the quality of the product titles and to what extent

the underlying structured data have been leveraged in their creation and also

seek measures for images and descriptions. The following set of metadata

attributes have been carefully selected:

• Number of populated base attributes

– These attributes include, but are not limited to: product brand,

method of delivery, country of origin and unit type

• Number of populated standard attributes

– These attributes include, but are not limited to: design series,

material and model number

• Number of populated Dimensions attributes

• Number of populated category specific attributes

• Number of high-quality images on the product page

• The length of the product description (number of words)

• Whether or not the following attributes are present in the title:

– Design series

– Colour

– Brand

– Material

• Number of available colour attributes

• Length of the product title when adjusted for automated title creation
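To illustrate how a few of the metadata attributes listed above might be derived from a single product record, the following is a minimal sketch. The field names (`design_series`, `country_of_origin`, `images` and so on), the key lists and the sample product are assumptions made for illustration, not the Company's PIM schema; every listed image is simply counted, since the criteria for a "high-quality" image are not specified here.

```python
def metadata_features(product, base_keys, standard_keys):
    """Derive a few of the metadata measures for one product record (a dict)."""

    def populated(keys):
        # An attribute counts as populated when it is present and non-empty.
        return sum(1 for k in keys if str(product.get(k) or "").strip())

    title = product.get("title", "").lower()

    def in_title(key):
        # 1 if the attribute value occurs in the title (case-insensitive), else 0.
        value = str(product.get(key) or "").strip().lower()
        return int(bool(value) and value in title)

    return {
        "base": populated(base_keys),
        "standard": populated(standard_keys),
        "imagecount": len(product.get("images", [])),
        "longDescription": len(product.get("description", "").split()),
        "intitleSeries": in_title("design_series"),
        "intitleMaterial": in_title("material"),
    }

# Hypothetical attribute groupings and product record.
BASE_KEYS = ["brand", "delivery_method", "country_of_origin", "unit_type"]
STANDARD_KEYS = ["design_series", "material", "model_number"]

product = {
    "title": "Aurora bath mixer, brass",
    "brand": "Acme",
    "country_of_origin": "",
    "design_series": "Aurora",
    "material": "Brass",
    "model_number": None,
    "images": ["a.jpg", "b.jpg"],
    "description": "A wall mounted mixer in solid brass",
}
features = metadata_features(product, BASE_KEYS, STANDARD_KEYS)
```

Counting populated attributes per grouping keeps the measure comparable across products, while the title indicators are binary, matching the "whether or not" formulation in the list above.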

3.2.3 Model specification

From the reasoning above, the following multiple linear regression equations

are proposed for each category c:

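\[ \log(\text{pageviews}_{c,i}) = \beta_{c,0} + \beta_c^{T} x_{c,i} + \varepsilon_i \tag{3.1} \]

\[ \log(\text{sessions}_{c,i}) = \beta_{c,0} + \beta_c^{T} x_{c,i} + \varepsilon_i \tag{3.2} \]

\[ \log(\text{quantity}_{c,i}) = \beta_{c,0} + \beta_c^{T} x_{c,i} + \varepsilon_i \tag{3.3} \]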

For the models specified in equations 3.1 and 3.2, the vector of regressors is

defined as:

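\[
x_{c,i} =
\begin{pmatrix}
\text{base}_{c,i} \\
\text{standard}_{c,i} \\
\text{dimensions}_{c,i} \\
\text{imagecount}_{c,i} \\
\text{categorySpecific}_{c,i} \\
\text{shortDescription}_{c,i} \\
\text{longDescription}_{c,i} \\
\text{intitleSeries}_{c,i} \\
\text{intitleMaterial}_{c,i} \\
\text{colour}_{c,i} \\
\log(\text{averagePrice}_{c,i}) \\
\text{adjustedTitle}_{c,i} \\
\log(\text{averagePrice}_{c,i}) \times \text{longDescription}_{c,i} \\
\log(\text{averagePrice}_{c,i}) \times \text{imagecount}_{c,i} \\
\log(\text{averagePrice}_{c,i}) \times \text{adjustedTitle}_{c,i}
\end{pmatrix}
\tag{3.4}
\]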

where the vector $\beta_c$ is then simply the corresponding coefficient vector for

the regressor vector $x_{c,i}$. For equation 3.3, the vector of regressors is

identical to $x_{c,i}$, but appended with an interaction term with the delivery

time $\text{delivery}_{c,i}$, $\text{pageviews}_{c,i}$ as well as an interaction term between

$\text{delivery}_{c,i}$ and $\log(\text{averagePrice}_{c,i})$.
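The following sketch shows how a log-linear model of this form could be estimated by ordinary least squares. It is not the estimation code used in the study: it assumes OLS via the normal equations, uses a deliberately reduced regressor vector (image count, log price and their interaction), and runs on synthetic, noise-free data so that the known coefficients can be recovered. All function names and values are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations X'X b = X'y."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def regressors(imagecount, average_price):
    """Reduced regressor vector: image count, log price and their interaction."""
    log_price = math.log(average_price)
    return [imagecount, log_price, log_price * imagecount]

# Synthetic products: (imagecount, averagePrice). The response is log(pageviews),
# generated without noise from known coefficients.
products = [(1, 100.0), (2, 150.0), (3, 80.0), (4, 300.0), (5, 120.0), (2, 500.0)]
true_beta = [1.0, 0.3, 0.5, -0.02]
rows = [regressors(n, price) for n, price in products]
log_pageviews = [
    true_beta[0] + sum(b * v for b, v in zip(true_beta[1:], r)) for r in rows
]
beta = fit_ols(rows, log_pageviews)
```

Because the response is log-transformed, a coefficient on a count regressor such as the image count is read as an approximate relative change in the response per additional unit, holding the other regressors fixed.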

3.2.4 Validity of assumptions

Homoscedasticity

One key assumption of the multiple linear regression model is the homoscedasticity

assumption – that the error terms of the regression have a constant variance

across the sample. To ensure that the model yielded no heteroscedastic

error terms, quantile-quantile plots were evaluated for each model. Figure 3.1

illustrates an example for the bath category. To ensure homoscedasticity,

the empirical and theoretical quantiles should match as closely as possible,

as shown in the figure.

In order to achieve homoscedastic error terms, however, the response vari-

ables had to be log-transformed in all cases. This is a common transformation

technique used for this type of problem.
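The comparison of empirical and theoretical quantiles that underlies a quantile-quantile plot can be sketched as follows. The sketch assumes a standard normal reference distribution and the plotting positions (i - 0.5)/n, neither of which is stated in the text; the residual values are made up for illustration.

```python
from statistics import NormalDist, mean, stdev

def qq_pairs(residuals):
    """Return (theoretical, empirical) quantile pairs for standardized residuals."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    empirical = sorted((r - m) / s for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, empirical))

# Hypothetical residuals from one fitted model.
residuals = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 0.9]
pairs = qq_pairs(residuals)
```

Plotting the pairs and checking how closely they follow the diagonal gives the visual comparison described above.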

Multicollinearity

While the existence of multicollinearity in the model is only a violation of the

model assumptions in the case of perfect multicollinearity, high levels can still

cause some issues. A common approach to detect potential multicollinearity

in the model is to utilize the variance inflation factor (VIF). Each of the models

run were checked using VIF, resulting in no highly correlated regressors – with

the exception of the interaction terms, which should be expected.
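As a minimal sketch of the check described above, the variance inflation factor of regressor j can be computed as 1 / (1 - R_j^2), where R_j^2 is obtained by regressing regressor j on all the others. The column values below are invented for illustration, and the helper names are hypothetical; the study itself presumably relied on a statistical package rather than hand-written code.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def r_squared(rows, y):
    """R^2 of an OLS fit of y on the given rows (an intercept is added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    result = []
    for j, target in enumerate(columns):
        others = [c for k, c in enumerate(columns) if k != j]
        rows = list(zip(*others))
        result.append(1.0 / (1.0 - r_squared(rows, target)))
    return result

# Hypothetical regressors: image count and description length move together,
# while the third column is only loosely related to them.
imagecount = [1, 2, 3, 4, 5, 6]
description_words = [10, 12, 31, 39, 52, 58]
colour = [5, 1, 4, 2, 6, 3]
vifs = vif([imagecount, description_words, colour])
```

In this made-up sample the first two columns receive high factors because each is almost a linear function of the other, which is the pattern the check is meant to surface.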

Figure 3.1: Example QQ plot for the pageviews model of category bath

Omitted variable bias

One crucial point in the estimations of the regression models is the issue of

omitted variable bias. For a model to be biased through omitted variables,

two conditions must hold:

• $x_i$ is correlated with the omitted variable $x_o$ for some $i$

• $x_o$ is a determinant of the response variable $y$

In the construction of the models, significant care was taken in order to

reduce the risk of bias from omitted variables. Since the models are not

aimed to be predictive by construction, this issue is largely simplified.
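The two conditions can be made concrete with the standard two-regressor textbook case, which is not derived in this study; the coefficients $\beta_0$, $\beta_1$, $\beta_2$ and the estimator $\tilde{\beta}_1$ are symbols introduced here for illustration only. Suppose the true relationship includes both an observed regressor $x_i$ and the omitted variable $x_o$, with an error term uncorrelated with both:

```latex
\[
y = \beta_0 + \beta_1 x_i + \beta_2 x_o + \varepsilon
\]
% Regressing y on x_i alone gives a slope estimate \tilde{\beta}_1 that converges to
\[
\operatorname{plim}\, \tilde{\beta}_1
  = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_i, x_o)}{\operatorname{Var}(x_i)}
\]
```

The second term is the bias. It vanishes when $\operatorname{Cov}(x_i, x_o) = 0$ (the first condition fails) or when $\beta_2 = 0$ (the second condition fails), which is why both conditions must hold for the estimate to be biased.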

3.3 Research ethics

The study was conducted with great regard to current research ethical con-

siderations. Specifically, the study utilized the four principles for ethical re-

search proposed by the Swedish Research Council (Vetenskapsrådet, 2002).

These four principles, or criteria, are presented below and discussed in rela-

tion to the study.

The criterion of information states that the researcher shall inform the peo-

ple included in the study about its aim. Specifically, the researcher shall

inform them about their role in the study, that participation is optional and

the terms which are at play. To accommodate this set of rules,

all interviewees were asked whether or not they wanted to participate in the

study, with full disclosure of the terms of personal anonymity. The

interviewees were also informed of the aim of the study either via e-mail, a

workplace instant messaging application or verbally. All interviewees com-

plied in full.

The criterion of consent states that any participant in a study has the right

to control their own contribution. That is, the researcher shall collect the

participant’s consent (and possibly the consent of a legal guardian). Fur-

thermore, the participant has the right to independently decide the terms of

their involvement and be able to abort their involvement without any neg-

ative consequences. Finally, the participant shall not be the subject of any

undue pressure. As stated previously, consent was collected from every in-

terviewee in the study, and they were informed that they should only convey

information that they deem appropriate for sharing. Furthermore, as the

interviews were recorded, consent was asked for (and approved) before the

start of each interview.

The criterion of confidentiality concerns information about the research par-

ticipants. Any information on the participants shall be given as much con-

fidentiality as possible, and any personal data shall be stored so that none

other than the researcher has access to them. During the interview process,

no personal data was stored in the transcripts except the first name and func-

tion of the participant. The first name was collected in order to facilitate

discussions between the authors. When presenting findings, the intervie-

wees were simply referred to by their function at the Company. While some

of the employee functions only employ a few people, leaving the Company

anonymous throughout the thesis aids in keeping confidentiality.

The criterion of good use states that any information collected on single

participants shall only be used for the purpose of research. In the study, no

data was passed on from the researchers to any function of the Company

apart from the finished thesis. This means that the interview transcripts

were only seen by the authors, and any information relating directly to a

participant was thus ensured not to be used for other purposes.

Chapter 4

Results

4.1 Proposition analysis

The overall aim of the proposition analysis was to explore the topic of prod-

uct data, its application and potential, in e-commerce with an open mind.

In the pursuit of achieving an understanding as complete as possible we in-

terviewed people in most parts of the organization and let them explain their

thoughts and daily struggles relating to product data. A high-level takeaway

that became evident from our sessions was that the value ascribed to data

differed significantly between people from different organizational functions,

which we will explore further below. In terms of structure, we present our

findings under six headlines representing the most common themes discussed

in the interviews. Moreover, all of the interviewees were in agreement when

discussing the value of product images with the message that images are in-

tegral for successfully selling products online. As such, the findings below

refer to textual product data.

4.1.1 Structured vs. unstructured product data

It is evident that product data cannot be discussed without making the

initial distinction between unstructured and structured data. The terms

are assigned by the authors with inspiration from data science literature but

were referred to in the interviews as “tabular data” instead of structured,

and “free text data” instead of unstructured. The consensus from all parts

of the organization was that structured data is preferable given the many

applications in the e-commerce value-chain. However, there is a significant

trade-off between working towards structured data formatting and the cost

of pursuing that structured data (in terms of time, effort and quality).

The teams working with assortment onboarding, including category man-

agers with the responsibility for supplier relations, pricing and marketing

within categories and merchandisers with responsibility for data curation,

both stressed the value of structured data, and the onboarding process has

been tailored to achieve it by the best means available. When onboarding new

assortment, the suppliers must structure their data according to a template

defined by the category manager. The template represents a “blueprint” or

a “schema” for what data is necessary depending on which product category

it belongs to. The main purpose of a pre-defined schema is that it ensures

that products in the same category are presented in a consistent way, allow-

ing the customer to compare products across suppliers. A consistent set of

structured data within a category also allows for sitelist filtering, for example

on color or width, to allow the customer better on-site navigation in large

assortments.
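The idea of a category template acting as a schema can be sketched as a simple completeness check on a supplier record. The template contents, attribute names and sample record below are hypothetical and are not taken from the Company's actual onboarding templates.

```python
def validate_against_template(product, template):
    """Return the mandated attributes that are missing or empty in a supplier record."""
    return [
        attribute
        for attribute in template
        if not str(product.get(attribute) or "").strip()
    ]

# Hypothetical template for a sofa category.
SOFA_TEMPLATE = ["brand", "colour", "width_cm", "material"]

supplier_record = {"brand": "Acme", "colour": "Grey", "width_cm": 210, "material": ""}
missing = validate_against_template(supplier_record, SOFA_TEMPLATE)

complete_record = dict(supplier_record, material="Wool")
nothing_missing = validate_against_template(complete_record, SOFA_TEMPLATE)
```

A record that passes such a check for every product in a category is what makes consistent presentation and filtering on attributes such as colour or width possible.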

The argument against working towards achieving structured product data

is that it consumes a lot of time. Suppliers are seldom capable of quickly

packaging their data to a pre-defined format. Instead, each supplier has

their own blueprint for how they store their data in different categories. This

forces suppliers to, often through manual effort, re-structure their data to fit

the mandated format, a process that often takes significant amounts of time.

When the data arrives at merchandising, it is re-packaged and enriched to ensure

optimal site-presentation and compliance with the existing assortment’s

packaging. Moreover, many suppliers lack parts of the mandated data internally,

which creates a difficult situation: the supplier can be pressured into

“creating” the mandated data, but more often than not the suppliers lack

the willingness to do so, forcing the onboarding team to regularly make exceptions

with regard to the blueprint. While the process generally achieves

the desired result of consistency, it is painfully manual for everyone involved,

has significant lead times and is prone to errors.

Interviewees from functions not involved in the process of assortment on-

boarding were in agreement over the necessity of structured data for multi-

ple reasons. Considering an assortment of products with unstructured data,

the possibilities for automated applications decrease significantly. Optimiz-

ing on-site navigation through filtering functionality was considered to be

near impossible, and the ability to understand the in-house assortment in

terms of white-spots and weak-spots would only be possible in terms of the

structured data available (namely the product categorization). Furthermore,

the ability for search engine optimization of the assortment would be very

limited without significant manual effort.

The most important finding from our discussions on structured versus un-

structured data was that all organizational functions are in agreement on the

necessity of structuring product data but from many different angles. Most

interviewees mentioned the obvious application in filtering functionality, but

other perspectives and levers of structured data were only raised by specific

organizational functions indicating that even though the value is appreciated

by everyone, there is a knowledge gap between internal functions in their un-

derstanding of how product data is leveraged throughout the organization.

Going forward, we discuss our findings relating to the current and potential

applications of structured product data, that is, taking the perspective of an

e-commerce business where the data is perfectly structured and complete.

4.1.2 Structured data in online marketing

Results from this section are derived from interviews with two online mar-

keting experts within the company.

Online marketing encompasses several channels and methods but the over-

whelming majority of online traffic arriving at the e-commerce website from

marketing efforts enters either from search engines such as Google or from

social media platforms such as Facebook. Social media marketing was only

discussed briefly since it was not among the interviewees’ day-to-day responsibilities,

but search engine optimization was discussed at length and particularly how

structured data can be leveraged for ranking higher on the organic search

results for the company’s target keywords and categories.

Those familiar with SEO (Search Engine Optimization) recognized that the

overarching target in the e-commerce context is to get one’s website listed as

high up in the search results as possible in searches using specific keywords

that are related to one’s products. The underlying ranking algorithms used by

the search engines are proprietary; there are, however, some intuitive basics

that experts in the field agree are the most important for making a website

rise in the search engine rankings, and two of the three directly relate to

product data.

The first method for achieving a good search engine ranking relates to key-

words used in search queries. Words that relate to products in different cate-

gories are referred to as keywords, and the main concept here is that content

on the e-commerce website should include the same keywords that poten-

tial customers might use when searching for products in relevant categories.

Consider the scenario where a potential customer enters a search engine with

the intention of finding a suitable sofa. That customer will likely use keywords

such as ‘sofa’, ‘couch’, ‘settee’ or ‘divan’. For the e-commerce website

selling sofas, it is important that those keywords are present in the website

content to indicate to the search engine that this is a relevant website for a

consumer searching for sofas.

The second method relates to content relevance. The idea is that a website

yielded by a specific search query or keyword should have content directly

related to that query or keyword. The more specific results the better. Con-

tinuing with the same example, a result that links directly to a landing page

containing an assortment of sofas will rank higher than a result that links

to a homepage for a website selling a variety of furniture. The relevance is

measured by customers’ tendency to stay on a website after entering from a

search engine and also how many clicks a customer must use to navigate to

achieve a desired result.

The last method has to do with linking to a landing page. This method is

somewhat more technical and is excluded from these results as it does not

relate to product data.

Search queries can be categorized into general, specific and long-tail depend-

ing on their level of specificity as demonstrated in Figure 4.1. General queries

have the highest competition and are as such the hardest to achieve good

rankings for. Just imagine how many websites would like to be the preferred

results for queries such as ‘nice clothing’, ‘cheap furniture’ or ‘buy laptop’.

These queries are generally used by individuals wanting to explore assortments

and options and as such relate to broad categories of products. Relevant

results for these queries are often e-commerce homepages or category

landing pages. Given the fierce competition and the fact that the number of

pages at each e-commerce website that are relevant for general queries is

generally very small, the content on these pages is curated manually by SEO

experts.

However, with increasing specificity in search queries, the number of landing

pages in need of content curation and optimization increases exponentially,

and with the increase in number of landing pages follows an ever growing

burden in manually managing the content on thousands or even millions of

landing pages. In this context, structured product data can play an integral

role for success in the online marketing space.

Figure 4.1: Example of different types of searches

Keeping in mind the important concepts of keywords and relevance, take the

example of an e-commerce website offering a large assortment of furniture

and the specific search query ‘green sofa’. The website in question likely has

several other product categories besides sofas, including tables, chairs, beds

and storage furniture and all of these categories likely contain products of

different colors. Furthermore, all of the mentioned categories likely have one,

two or even three levels of subcategories resulting in hundreds of cumulative

categories on a single website. To present the most relevant results relating to

the query ‘green sofa’ the website would naturally want to refer to a landing

page containing all of the website’s green sofas (and no products that are not

both green and sofas to maximize relevance) and would further want that

landing page’s content to include the keywords ‘green’ and ‘sofa’.

Here, the first use case of how structured data is a core prerequisite for

online marketing becomes evident: The only way one can easily, scalably and

without manual effort create a landing page containing all of the website’s

green sofas is if all sofas have a structured attribute where the key refers to

color and the associated value is green, effectively using a category along

with an attribute filter for that same category. While this can be done

manually, with 15-20 different colors and hundreds of categories, the

landing pages for the set of relatively simple search queries containing a

product type and a color are counted in the thousands. To make matters worse,

color is only one key, or attribute, relevant for the assortment. Customers

could use simple queries such as ‘leather sofa’ or ‘vintage sofa’ relating to the

keys material and style respectively, implying the addition of thousands

more landing pages to maximize search engine relevance. The

manual effort in creating this volume of pages and content is overwhelming,

calling for automated solutions. With a complete set of structured product

data, these pages and the related keywords can be created automatically

by combining categories and keys using simple algorithms without need for

manual efforts.
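The combination of categories and keys described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the company's implementation: the product records, the key names (category, color, material) and the function build_landing_pages are hypothetical.

```python
from collections import defaultdict

# Hypothetical product records; key names are illustrative only.
products = [
    {"id": 1, "category": "sofa", "color": "green", "material": "velvet"},
    {"id": 2, "category": "sofa", "color": "green", "material": "leather"},
    {"id": 3, "category": "sofa", "color": "grey", "material": "leather"},
    {"id": 4, "category": "chair", "color": "green"},  # material not classified
]


def build_landing_pages(products, keys):
    """Group product ids by (category, key, value) for every attribute key."""
    pages = defaultdict(list)
    for product in products:
        for key in keys:
            value = product.get(key)
            if value is None:
                continue  # an unclassified product cannot be placed on a page
            pages[(product["category"], key, value)].append(product["id"])
    return dict(pages)


pages = build_landing_pages(products, keys=["color", "material"])
for (category, key, value), ids in sorted(pages.items()):
    # The page keywords follow directly from the structured value and category.
    print(f"{value} {category}: products {ids}")
```

Each additional key multiplies the number of candidate pages without adding manual work, while products lacking a value for a key simply never appear on the corresponding pages.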

So far, we have considered specific search queries as they relate to landing

pages and concluded that thousands of landing pages are necessary for

relevance optimization in the SEO context. Intuitively, thousands could be

exchanged for several millions depending on the size of the assortment and

the level of detail as well as the number of dimensions in the structured data.

Using the same example of a website selling furniture, we instead consider

the example of a long-tail query, namely ‘green velvet chesterfield sofa’. De-

pending on the depth of the assortment, the website could have none, one

or several products fulfilling the requirements stipulated by the query. If the

website has no such products it has no direct incentive to pursue a good

ranking on the query and if it has several such products the website can

extend the logic described for landing pages by using combinations of cat-

egories and several attributes to achieve a relevant landing page. However,

the most common situation for long-tail queries is that the website has a

single product that matches the description, making the product page the

most relevant result for that keyword.

Here, the second and equally important application of structured product

data appears: structured data can be used to automatically create

product titles that are used by the search engines to find the most relevant

results. If all of the words in the queries appear in structured form on the

product pages, the title can be automatically created as a combination of the

values associated with the relevant keys. In this particular case the logic for

a title could be set as color + material + design + category, rendering the

desired result given a complete and structured set of data for that product.

In this context, the number of potential relevant landing pages is equal to,

or even greater than, the number of products in the assortment and the

only effort necessary is to find which keys are the most relevant for different

categories to define a title structure; the rest can be done automatically.

Worth noting is that this approach is deemed impossible in the context of

unstructured product data.
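The title logic above (color + material + design + category) can be sketched as a simple template. The sketch is hypothetical: the key names, the constant TITLE_KEYS and the function build_title are our own, and returning None for incomplete products is one possible design choice rather than an observed practice.

```python
# Hypothetical title template following "color + material + design + category".
TITLE_KEYS = ["color", "material", "design", "category"]


def build_title(product, keys=TITLE_KEYS):
    """Return a title built from structured values, or None if a key is missing."""
    values = [product.get(key) for key in keys]
    if not all(values):
        return None  # incomplete data: no automatic title can be generated
    return " ".join(values).capitalize()


sofa = {"color": "green", "material": "velvet",
        "design": "chesterfield", "category": "sofa"}
incomplete = {"color": "green", "category": "sofa"}

print(build_title(sofa))
print(build_title(incomplete))
```

Changing the title structure for a whole category then amounts to editing the list of keys, which is the flexibility referred to in the discussion of changing consumer preferences.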

On a final note, staying in the context of product pages, an automated mech-

anism for creating product titles also allows significantly more flexibility to

keep up to speed with changing consumer preferences. Heavily searched keywords

should always be present in the titles to optimize the relevance towards

search engines, but if trends change and new keywords become relevant, the

effort of changing titles is much less demanding if they are built from

structured data. Without structured data, a manual approach would be the only

alternative.

4.1.3 Structured data in website design

Results from this section are derived from interviews with two people working

with on-site content curation at different websites and one working with front-

end development.

Firstly, all interviewees agreed with the statement that structured product

data is an important factor in designing an e-commerce website. The key point

we drew from these interviews was that the underlying product data was a

prerequisite for much of the work being done in front-end development

and content curation, meaning that many features being developed build

directly on the product data schemas and would not work, or would work poorly,

without a set of complete and structured product data. We will begin by

addressing the important topic of on-site navigation and then give examples

of design-features enabled by the product data.

In terms of on-site navigation, there are many similarities with the previous

discussion on landing pages in the SEO context. In essence, on-site navigation

refers to how a customer navigates through an e-commerce website in search

of a suitable product. A rule of thumb is that the customer should have to

put in as little effort as possible in order to reach the desired outcome (be

it inspiration, comparison or purchasing). A larger assortment implies a

greater need for efficient navigation. The two main components of on-site

navigation are the category navigation and the filtering functionality. Most

e-commerce websites have a categorization of their products in some shape

or form allowing the customer to find a subset of products resonating with

the interests of the customer. This constitutes the basic navigation feature

and can be constructed in many ways depending mainly on the size of the

assortment.

Once the customer has found the right category of products, the next step

in guiding the customer towards the desired outcome is through filtering

functionality. If there are hundreds of products in each category it takes

significant effort from the side of the customer to find the right products in

the absence of filters. However, using filters can quickly and with minimal

effort help the customer exclude significant parts of the assortment that are

not relevant to that particular customer. Examples of frequently used filters

in e-commerce could be color, material, dimensions, and size. The part

played by structured product data in this context is equivalent to the case of

landing pages, meaning the possibility to create filters and to have flexibility

in choosing which filters are offered to assist the customer is solely dependent

on having complete and structured product data in the product database.
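To illustrate why filtering depends on structured data, the sketch below filters a small hypothetical catalog by category and attribute values. The record layout and the functions apply_filters and available_filter_values are assumptions made for illustration only.

```python
def apply_filters(products, category, filters):
    """Return products in `category` whose attributes match every filter."""
    return [
        p for p in products
        if p.get("category") == category
        and all(p.get(key) == value for key, value in filters.items())
    ]


def available_filter_values(products, category, key):
    """List the distinct values of `key` that could be offered as filter options."""
    return sorted({p[key] for p in products
                   if p.get("category") == category and key in p})


# Hypothetical catalog; key names are illustrative only.
catalog = [
    {"id": 1, "category": "sofa", "color": "green", "material": "velvet"},
    {"id": 2, "category": "sofa", "color": "grey", "material": "leather"},
    {"id": 3, "category": "sofa", "color": "green", "material": "leather"},
    {"id": 4, "category": "table", "color": "green", "material": "oak"},
]

green_leather = apply_filters(catalog, "sofa",
                              {"color": "green", "material": "leather"})
print([p["id"] for p in green_leather])
print(available_filter_values(catalog, "sofa", "color"))
```

Both the filter options and the filtered result are derived entirely from the key-value pairs, so neither can be produced from free-text descriptions without first structuring them.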

Turning to website design features and continuous development the story is

similar. Many of the desired applications are thought out with the customer

in mind to help them find inspiration, compare products and create a better

overall website experience. An example of a commonly used feature is direct

product comparisons, where products in the same category are displayed in

connection with each other along with their respective features - allowing

the customer to compare products along all relevant product dimensions.

This kind of feature can only be built if all included products have the same

features classified completely. Another example is suggestion engines. These

help suggest similar products based on other products that customers have

already shown an interest in. For these engines to make good suggestions

an important input is the structured product data allowing the engine to

identify similarities and differences between products.
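One simple way a suggestion engine could use structured data is to compare the key-value pairs of two products. The sketch below uses Jaccard similarity, which is our own choice for illustration; the interviews did not specify which similarity measure the engines use, and the product records and function names are hypothetical.

```python
def attribute_similarity(a, b):
    """Jaccard similarity between the (key, value) pairs of two products."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


def suggest(target, candidates, top_n=2):
    """Return ids of the candidates most similar to the target product."""
    ranked = sorted(
        candidates,
        key=lambda c: attribute_similarity(target["attrs"], c["attrs"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_n]]


# Hypothetical products with structured attributes.
viewed = {"id": "A", "attrs": {"color": "green", "material": "velvet",
                               "style": "chesterfield"}}
others = [
    {"id": "B", "attrs": {"color": "green", "material": "velvet", "style": "modern"}},
    {"id": "C", "attrs": {"color": "grey", "material": "leather", "style": "vintage"}},
    {"id": "D", "attrs": {"color": "green", "material": "leather", "style": "modern"}},
]

print(suggest(viewed, others))
```

The same key-value pairs could also populate a side-by-side comparison view, which is only meaningful when every compared product has values for the same keys.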

4.1.4 Structured data in assortment curation

Results from this section are derived from interviews with two people working

with on-site content curation at different websites.

The necessity for flexibility in light of swift changes in demand is evident and

a recurring challenge throughout the retail industry. E-commerce websites

with a broad and deep assortment face the challenge of curating their as-

sortment in such a way that it becomes inspirational, and to market the

parts of the assortment that are currently trending among consumers. To

this end, structured product data can be leveraged to easily browse, filter

and understand the in-house assortment both within and across categories.

This method is leveraged regularly to filter out subsets of products that have

common classifications along one or several dimensions of the structured data

to quickly find a manageable number of products to leverage in addressing

trends, creating marketing content and building inspirational entries for po-

tential customers.

The process of assortment curation had previously been done manually but

leveraging structured product data has increased the efficiency in the process

significantly.

4.1.5 Structured data in business intelligence

The final area, where structured product data is being leveraged to some

extent today but where the perceived potential is very promising, is

business intelligence. Findings from this section are derived from interviews

with management and a business controller within the company.

The potential value comes mainly from several different aspects of assort-

ment analysis. In today’s set-up, assortment analysis is mainly performed

along the dimensions of category and price. The main objective is to achieve

completeness in the assortment meaning that defined categories should have

a satisfactory number of products and preferably products in all price ranges

in order to offer a complete assortment to the consumer. Using this methodology,

the company has been able to continuously identify weak-spots and

white-spots within its own assortment, which have then served as valuable

intelligence for category managers when prioritizing onboarding of new

assortments.

An important realization is that this kind of analysis can be done in many

more dimensions to inform the effort towards an ever more complete

assortment. For example, the company might find from the initial analysis

that there are 20 daybeds on offer and that they range between all desired

price points from low to high. But by adding new dimensions using struc-

tured product data, the company might realize that the assortment in terms

of e.g. colors, materials, styles and designs is homogeneous. As such, in-

putting structured product data into big data analysis applications could

lead to valuable intelligence and strategic decision making support for future

assortment expansion.
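The daybed example can be sketched as a small white-spot analysis along one structured dimension. The records, key names and the functions value_distribution and white_spots are hypothetical and only illustrate the idea of extending assortment analysis beyond category and price.

```python
from collections import Counter


def value_distribution(products, category, key):
    """Count how many products in `category` carry each value of `key`."""
    return Counter(p[key] for p in products
                   if p.get("category") == category and key in p)


def white_spots(products, category, key):
    """Values of `key` present elsewhere in the assortment but absent in `category`."""
    all_values = {p[key] for p in products if key in p}
    in_category = set(value_distribution(products, category, key))
    return sorted(all_values - in_category)


# Hypothetical assortment: the daybeds cover several price points but one color.
assortment = [
    {"category": "daybed", "price": 1500, "color": "grey"},
    {"category": "daybed", "price": 4000, "color": "grey"},
    {"category": "daybed", "price": 9000, "color": "grey"},
    {"category": "sofa", "price": 5000, "color": "green"},
    {"category": "sofa", "price": 7000, "color": "beige"},
    {"category": "sofa", "price": 3000, "color": "grey"},
]

print(value_distribution(assortment, "daybed", "color"))
print(white_spots(assortment, "daybed", "color"))
```

Repeating the same count for material, style or design would reveal homogeneity along those dimensions in the same way.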

Another application for structured product data discussed in the context of

business intelligence was trend identification on all levels. As with

assortment analysis, product data could be analyzed along with sales or

traffic data in order to quickly be able to identify and act upon trends that

go beyond categories, brands or other current dimensions of analysis. Intel-

ligence from this sort of analysis could be leveraged by many organizational

functions including marketing, onboarding and purchasing.

Both of these use cases have grown more relevant and necessary as the

availability of analytical software has exploded in recent years.

4.1.6 Risks of working with structured data

Lastly, a point of caution that was raised in several interviews was the issue

of data completeness. While everyone agreed that working towards structured

data is core for future success and the potential for business development, if

not all products are classified according to the defined schemas for the

structured data, many applications lose much of their leverage. Take filtering

for example: say that a customer browses for a green sofa and applies the filter

‘green’ in the sofa category, but only a subset of the green sofas in the

assortment has the value ‘green’ connected to the key ‘color’. Then only the

correctly classified sofas will appear in the filtered results, limiting the

customer’s options. Many examples of undesirable outcomes that can appear through

incomplete or incorrectly classified structured data can be imagined. As such,

the effort towards structuring data must be accompanied by an equal effort

towards correctness and completeness.
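The completeness risk can be made concrete with a small sketch. The data, the key names and the function completeness are hypothetical; the point is only that a per-key completeness share is straightforward to monitor once the data is structured.

```python
def completeness(products, keys):
    """Share of products that have a non-empty value for each key."""
    total = len(products)
    if total == 0:
        return {key: 0.0 for key in keys}
    return {key: sum(1 for p in products if p.get(key)) / total for key in keys}


# Hypothetical sofas: product 3 is green in reality but lacks the color value.
sofas = [
    {"id": 1, "color": "green", "material": "velvet"},
    {"id": 2, "color": "green", "material": "leather"},
    {"id": 3, "material": "velvet"},
    {"id": 4, "color": "grey", "material": "leather"},
]

shares = completeness(sofas, ["color", "material"])
shown = [p["id"] for p in sofas if p.get("color") == "green"]

print(shares)
print(shown)  # product 3 is silently excluded from the 'green' filter
```

A completeness share below one for a key used in filters indicates that some products will be hidden whenever that filter is applied.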

4.2 The propositions

Reviewing the results of the proposition analysis, we identify several interesting

themes that invite further investigation. First of all, the distinction

between structured and unstructured product data is identified as a core con-

cept playing an integral role in how to view product data in the e-commerce

space. Thus, the propositions below all relate to product data in its struc-

tured form:

• Proposition 1: Structured product data, in contrast to its unstructured

counterpart, is significantly more valuable in terms of its potential ap-

plication in all parts of the e-commerce value chain

– Given that products which have values classified for the keys that

are used in filters are the only ones appearing once filters are applied,

such products are likely exposed more frequently than products

that do not, which could imply comparatively larger sales for

the products with structured data.

• Proposition 2: Structured product data improves navigation

– Given that products which have values classified for the keys that

are used in filters are the only ones appearing once filters are applied,

such products are likely exposed more frequently than products

that do not, which could imply comparatively larger sales for

the products with structured data.

• Proposition 3: Structured product data is crucial in search engine op-

timization

– Given that structured data seemingly plays a key role in search engine

optimization, it appears likely that products with well structured

data will have more traffic to their product pages and consequently

larger sales than other products.

• Proposition 4: Optimizing product titles is very important for long-tail

SEO, and structured product data makes them seamless to create

– Creating good titles for products that contain values for important

product attributes was discussed as a key part in long-tail SEO.

The implication is that products with well structured titles should

attract more traffic than products with weaker titles in terms of

included attributes and thus also more sales.

• Proposition 5: High quality product images are important for selling

products online

– While not discussed at length in the results, all agreed that product

images were of utmost importance, implying that the number and

quality of images likely affect sales of products.

• Proposition 6: Structured data is highly valuable for business intelligence

and on-site curation

– Structured product data is described as highly valuable for marketing

and analytical purposes. While these effects would not be

visible in sales figures, this has interesting implications for the broad

necessity of working towards achieving structured product data.

The above propositions serve as input for the quantitative analysis in

section 4.3.

4.3 Proposition validation

The quantitative analysis was conducted on a data set of roughly 67,000

observations. In addition, the data set was grouped by the top-level product

category, resulting in 8 group regressions.

Furthermore, for each group, three regression models were fitted with the

following response variables:

• Page views

• Sessions

• Sales quantity

For the full tables of coefficients, see Appendix.

All of the models were checked for multicollinearity, significance and het-

eroskedasticity individually.

4.3.1 Data transformations

This part focuses on the transformations made on the data set in order to

accommodate the general linear regression assumptions.

Response variables

For all models, the response variables were log-transformed in order to reduce

the risk of heteroskedastic error terms. For each model, a quantile-quantile

plot of theoretical residual quantiles versus empirical residual quantiles was

evaluated and approved.

Regressors

The regressor for average price point was transformed across all regression

models through a log-transform in order to homogenize the variance of the

residuals. For each model, a quantile-quantile plot of theoretical residual

quantiles versus empirical residual quantiles was evaluated and approved.

Another key transformation made was to include interaction terms in the

regression. These interaction terms serve the purpose of isolating the effect

of a regressor, e.g. length of description, conditional on e.g. average price.

The full set of interaction terms is listed in the Appendix.
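As a sketch of how the transformations above fit together, the model for a single response variable could be written as below. This is an assumed functional form for illustration, not the exact specification estimated in the study: the symbols $y_i$ (response for product $i$), $x_{ij}$ (regressor $j$), $p_i$ (average price) and the choice of interacting each regressor with $\log p_i$ are introduced here, and the actual set of interaction terms is the one listed in the Appendix.

```latex
\log y_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \gamma \log p_i
         + \sum_{j} \delta_j \, x_{ij} \log p_i + \varepsilon_i
```

Under this form, the marginal effect of regressor $j$ on $\log y_i$ is $\beta_j + \delta_j \log p_i$, which is the sense in which an interaction term isolates the effect of e.g. description length conditional on price.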

4.3.2 Coefficients of interest

This section provides some key findings of the coefficients of interest. For a

full list of coefficients for the different regressions, refer to the Appendix. For

definitions of the relevant regressors, refer to section 3.2.2.

Note further that the response variables of all regressions were log-transformed.

Image count

The number of images that are displayed for a product was significant and

positive for all categories except for interior decoration and kitchen. Table 4.1

presents a condensed view of each model. This indicates that, on average,

presenting an additional image on a product page yields significant increases

in volumes sold, page visits and session starts. This result is consistent with

Proposition 5.

Category       Pageviews     Sessions      Quantity
All            0.37143***    0.12771***    0.22436***
Bath           0.49106***    0.34936***    0.33297***
Construction   0.48946***    0.27863***    0.15767***
Floor          0.92903***    0.26854*      0.41708***
Furnishing     (-0.10371)    -0.20705***   -0.13658**
Kitchen        (0.02649)     0.31376*      (-0.02588)
Garden         0.41229***    0.30494***    0.13344***

Table 4.1: Summary of the image count attribute regression coefficient per category

Category specific attributes

Looking at the category specific attributes, that is the number of struc-

tured product attributes that have category specific keys, we can see that

the kitchen category responds most positively across the board to increases in

these types of attributes. However, for the full regression the quantity sold is

seemingly negatively impacted on average, while the amount of traffic that

the page drives is positively impacted. The significance of the result across

most models was expected from Proposition 1; the negative impact in some

models, however, was not aligned with the propositions. Table 4.2 presents a

condensed view of each model.


Category       Pageviews     Sessions      Quantity
All            (0.00274)     0.00726**     -0.14312***
Bath           (-0.00112)    (0.00249)     0.23988***
Construction   -0.18675***   -0.09331***   -0.22263***
Floor          -0.10195***   -0.04430***   -0.11221**
Furnishing     0.07192***    0.04560***    -0.07404***
Kitchen        0.05799***    0.07523***    0.33701***
Garden         (-0.01732)    (0.02139)     (-0.08985)

Table 4.2: Summary of the category specific attribute regression coefficient per category

Base attributes

From table 4.3, we see that on average, base attributes had a significant

positive impact in all models. However, on category level it was only positive

for kitchen, floor and bath, with negative values for interior design, outdoors

and construction. The base attributes are rarely presented on the websites

and not used in titles, thus the volatile impact is not surprising.

Category       Pageviews     Sessions      Quantity
All            0.00531*      0.01167***    0.03372***
Bath           (-0.00598)    0.01005*      0.02887***
Construction   -0.13586***   -0.09686***   -0.04646***
Floor          0.02362***    0.01705***    -0.01264***
Furnishing     -0.10451***   -0.00868***   -0.02117***
Kitchen        0.09435***    0.11426***    0.05420***
Garden         -0.04712***   (-0.01328)    (0.00375)

Table 4.3: Summary of the base attribute regression coefficient per category

Standard attributes

Standard attributes had a significant impact across the board in every model

but internal traffic for the floor category. These attributes are often leveraged

for filters and titles making the result consistent with propositions 2, 3 and

4. Table 4.4 presents a condensed view of each model.


Category       Pageviews     Sessions      Quantity
All            -0.14395***   -0.17414***   -0.06618***
Bath           (-0.02480)    -0.05476***   -0.03285***
Construction   -0.43771      -0.40815***   -0.10417***
Floor          0.08104***    (-0.03233)    -0.05021**
Furnishing     -0.17463***   -0.20492***   -0.11726***
Kitchen        (-0.05944)    -0.13902***   -0.09711***
Garden         (-0.03076)    (-0.01023)    (-0.05902)

Table 4.4: Summary of the standard attribute regression coefficient per category

Dimensions

The dimensions regressor, measuring the number of structured dimension

attributes for a product, was in general positive and significant across the

categories in terms of pageviews and sessions, but not in the quantity sold.

On average we saw that adding a dimension attribute roughly increases the

page hits by 11%, and the external traffic by 15% as seen in table 4.5. This

is consistent with propositions 2 and 3.


Category       Pageviews     Sessions      Quantity
All            0.11029***    0.14539***    (-0.00628)
Bath           (0.02414)     0.12040***    (-0.01054)
Construction   0.33961***    0.334432***   0.05380***
Floor          0.10506***    -0.09643***   -0.13634***
Furnishing     (0.00533)     0.03017*      (0.01389)
Kitchen        0.11295***    0.04578*      (-0.01316)
Garden         0.17431***    0.20669***    0.06819***

Table 4.5: Summary of the dimensions attribute regression coefficient per category
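Because the response variables are log-transformed, a coefficient can be read as an approximate relative change per additional attribute. The sketch below converts the two significant coefficients for all categories (pageviews and sessions) into exact percentage changes. It ignores the interaction terms with price, which is a simplifying assumption, and the function name percent_change is our own.

```python
import math


def percent_change(beta):
    """Percentage change in the response for a one-unit increase in a regressor
    when the response is log-transformed: 100 * (exp(beta) - 1)."""
    return math.expm1(beta) * 100.0


# Coefficients for all categories taken from the dimensions regressor.
for name, beta in [("pageviews", 0.11029), ("sessions", 0.14539)]:
    print(f"{name}: {percent_change(beta):.1f}%")
```

For small coefficients the exact change lies slightly above the coefficient itself, so reading 0.11 and 0.15 as roughly 11% and 15% is a reasonable approximation.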

Information in title

Regarding information in the title, several different coefficients were used to

evaluate the effect on traffic and sales. Table 4.6 presents a condensed view

of the coefficient of information in the title not attributable to Series, Brand,

Colour or Material. For the quantity sold, this regressor was not significant

in any category. In general, however, the traffic driven from within the site

was positively correlated with the amount of extra information in the title

with the exception of bath and interior design.

Regarding structured information in the title, colour, series and material

had significant positive correlation with the traffic, while the brand was not

significant for internal traffic and negative for external traffic.

Length of description

For all categories where the description length was significant, the coefficient

was also positive with the exception of the bath category. Table 4.7 presents

a condensed view of each model.


Category       Pageviews     Sessions      Quantity
All            0.25592***    0.08663***    (-0.02138)
Bath           -0.21350***   -0.31870***   (-0.07646)
Construction   0.09874*      (0.07178)     (0.05279)
Floor          0.63347***    0.60028***    (0.08490)
Furnishing     (0.08128)     -0.20377***   (-0.05623)
Kitchen        (0.20987)     (0.10165)     (-0.00713)
Garden         0.48688***    0.17667*      (0.10517)

Table 4.6: Summary of the title length attribute regression coefficient per category


Category       Pageviews     Sessions      Quantity
All            0.01095***    0.00633**     (0.00073)
Bath           (-0.00112)    (-0.00287)    -0.00762***
Construction   0.01142***    0.00943***    0.00411***
Floor          (0.00328)     (0.00471)     (-0.00253)
Furnishing     (0.00058)     (-0.00320)    0.01740***
Kitchen        0.02082***    0.02841***    (0.00490)
Garden         0.02936***    0.02717       0.01449***

Table 4.7: Summary of the description length attribute regression coefficient per category

Interactions

The interaction terms were constructed with the average price as a basis.

In brief, the importance of a quick delivery increases with the price of the

product on a significant level. Furthermore, most of the coefficients had a

negative conditional effect with the average price, indicating that cheaper

products on average rely more heavily on structured product information.

Chapter 5

Discussion

The intention of this section is to merge the findings from chapters two

through four in order to discuss what conclusions can be drawn as well as

potential implications of the findings. Firstly, we discuss our findings in

terms of the propositions put forward in section 4.2. We will continue on to

discuss more general implications of the results put forward and lastly, we

discuss the limitations of the paper.

5.1 Evaluating the propositions

5.1.1 Proposition 1: Structured product data, in contrast

to its unstructured counterpart, is significantly

more valuable in terms of its potential

application in all parts of the e-commerce value

chain.

This proposition is considered the key distinction. As such, it is the proposition

on which the majority of other results are evaluated.

In terms of the current literature on the topic of e-commerce, we find that this

proposition holds under scrutiny. While rarely discussed explicitly, Rao and

Sashikuma (2016), Kang et al. (2003) and Nguyen et al. (2011) all directly

argue for the value of structured data. Moreover, the applications being

researched in the data science community all include methods for structuring

unstructured data or re-structuring already structured data before it can be

leveraged in different applications (Nguyen et al., 2011; Krys and Bagheri,

2016).

From the quantitative part of the analysis we find further support that the

proposition holds. The results of the regression state that while some at-

tribute types seem more important than others, the total number of struc-

tured data points for a product has a significant and positive impact on both

sales and online traffic for that same product. Thus, the more of the prod-

uct data that can be presented in a structured fashion, the more likely the

product is to drive traffic and, ultimately, sell.

The direct importance of having structured data for concrete functionality

such as filtering, product comparisons and creation of large numbers of land-

ing pages is evident from the proposition analysis, and is coherent with the

intuitive hypothesis. These are the applications that the Company struggles

with in daily operations to optimize the performance of their websites. The

consensus was that the effort of structuring product data, while tedious and

difficult to sustain, can be directly related to positive

developments in terms of traffic and sales. Thus, the efforts are considered

worthwhile for basic applications, but from the literature review we find that

the potential for extracting value from a well maintained structured product

database is considerably broader. Ranging from BDA and SEO to better

customer experiences, the potential is significant and we conclude that not

only does the proposition hold, but the effort of creating these product data

sets should be a core activity for all e-commerce companies if they want to

stay competitive in online retail.

5.1.2 Proposition 2: Structured product data improves

navigation

While this proposition is intuitively true from the very construction of database

filters, we find some proof that the implication of the proposition is that it

can generate more sales and traffic. Nguyen et al. (2011) argue explicitly

for the positive impact on user experience from filtering functionality and

how structured data is its enabler. Petrovski and Bizer (2017) and Ristoski

et al. (2018) argue in similar fashion and we conclude that the proposition

has significant support in academic literature.

While the quantitative method does not allow investigation of this proposi-

tion directly, it gives some insight into the implications of improved naviga-

tion. Given that filtering is only possible once structured data is in place, we

earlier argued that products with structured data should get more exposure

than similar products that lack in this property. The regression yielded re-

sults implying that both page views and sessions increased with the number

of structured attributes present, in line with our expectation, but also that

the same structured data had a significant positive impact on the quantity

sold over the two-year period investigated in the study. The key response

variable in this case is the page views, as it models the traffic to a product

page from internal sources. Having established that improved data structure

for a product does have a significant positive correlation with internal traffic,

it remains to show directly that these products are also more likely to sell

as a result. The modelling of direct sales for a product, controlling for the

number of page visits, gives an indication that this is the case and thus supports

the proposition, but section 5.3.1 discusses potential issues with this approach and

potential remedies to consider in further studies.

Thus we conclude that proposition two has both support in the literature

and that the regression results were aligned with the expected implication of

the proposition.
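
The exposure argument above can be illustrated with a minimal sketch of an attribute filter. The product records, the attribute names and the `filter_products` helper are hypothetical and not taken from the Company's systems; the sketch only shows that a product lacking a structured value cannot match a filter, and therefore never appears in the filtered listing even if the same information is present in its title.

```python
from typing import Any

# Hypothetical catalogue: each product carries a dict of structured attributes.
PRODUCTS: list[dict[str, Any]] = [
    {"id": "A1", "title": "Bath mixer chrome", "attributes": {"colour": "chrome", "width_mm": 150}},
    {"id": "B2", "title": "Bath mixer", "attributes": {"colour": "chrome"}},
    {"id": "C3", "title": "Chrome mixer 150 mm", "attributes": {}},  # information only in the title
]


def filter_products(products, **criteria):
    """Return the products whose structured attributes match every criterion."""
    return [
        p for p in products
        if all(p["attributes"].get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    hits = filter_products(PRODUCTS, colour="chrome", width_mm=150)
    print([p["id"] for p in hits])  # only A1 has both values in structured form
```

In this toy catalogue, product C3 is never returned by any filter, which is the mechanism behind the expected difference in internal traffic.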

5.1.3 Proposition 3: Structured product data is crucial

in search engine optimization

While reviewing the literature, we were surprised to find that very little has

been written on the topic of search engine optimization as it relates to e-

commerce. The interviews conducted within the proposition analysis found

that SEO was a highly prioritized subject within the organization and that

it is considered key in staying competitive over time. However, we did find

evidence of the important role of structured data as it relates to generalized

database searches. Petrovski and Bizer (2017) and Nguyen et al. (2011)

both argue that searching product databases, be that through actual search

engines or with database queries, is significantly more effective if the product

data is in structured format. We suspect that while search engine algorithms

are generally proprietary, these insights do in fact give some support for the

proposition.

More important, however, are the results of the regressions in this matter.

The quantitative analysis yielded support for structured data in terms of

page views, sessions and quantity sold, as discussed for earlier propositions,

which suggests that it is valuable in the context of SEO. Notably, the outcome

of the proposition analysis suggested that the SEO value of structured

data was mainly implicit, meaning that its existence was more of an enabler

for further activities (the creation of new landing pages) rather than valuable

in and of itself.

Although, in general, the above holds, there are some considerations to be

taken when interpreting the data. For example, there is a seemingly negative

impact (or at least correlation) of standard attributes on the internal and ex-

ternal traffic driven to the products. This could be interpreted as an error in

the model, since the trivial hypothesis would be that extra attributes would

not decrease the traffic or quantity sold. On the other hand, it is possible

that these attributes, especially if they are considered as equally weighted as

e.g. dimensions, brand and category specific attributes in the search engine

algorithm, would serve to dilute the critical information. This could poten-

tially rank the products lower in the search engine perspective compared to

products which display only what are considered critical attributes. Testing

this would require entirely new hypotheses and potential interaction terms in

a regression, and is left for further research or a continuation of the results of

this paper. With the search engine algorithms being proprietary, we further

consider this a difficult issue to solve in any case.
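
One way to phrase the dilution hypothesis as a testable specification is sketched below. The notation is ours and purely illustrative: $S_i$ and $C_i$ denote the number of standard and category-specific attributes of product $i$, $\mathbf{x}_i$ collects the remaining regressors, and the interaction coefficient $\beta_3$ is a hypothetical addition that is not part of the model estimated in Chapter 4.

```latex
\log(\text{Sessions}_i) = \beta_0 + \beta_1 S_i + \beta_2 C_i
  + \beta_3 \,(S_i \times C_i) + \boldsymbol{\gamma}^{\top}\mathbf{x}_i + \varepsilon_i
```

Under this sketch, the marginal effect of one more category-specific attribute is $\beta_2 + \beta_3 S_i$, so a significantly negative $\beta_3$ would be consistent with standard attributes diluting the critical ones.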

We conclude that the literature implies that structured data also has an explicit value for SEO,

strengthening the support for the proposition further.

5.1.4 Proposition 4: Optimizing product titles is im-

portant for long-tail SEO, and structured prod-

uct data makes them seamless to create

Given the scarcity of academic literature on SEO in the context of e-commerce,

we could not find sufficient support for this proposition in the research. In-

tuitively, the latter part of the proposition relating to the automatic creation

of titles seems to hold just given the trivial logic that the process is based

on. And as was suggested by professionals on the topic we see no reason to

doubt its validity – at least in the context of the case subject. The former

part of the proposition on the other hand is more interesting as it should

have a direct effect on traffic and sales.
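
The "trivial logic" behind automatic title creation can be sketched as follows. The field order and the `build_title` helper are assumptions made for illustration and do not describe the Company's actual title template; the field names loosely mirror the title regressors reported in the appendix (brand, series, material, colour).

```python
# Hypothetical attribute order for generated titles; not the Company's actual template.
TITLE_FIELDS = ("brand", "series", "product_type", "material", "colour")


def build_title(attributes: dict[str, str], fields: tuple[str, ...] = TITLE_FIELDS) -> str:
    """Join the structured attributes that are present, in a fixed order."""
    parts = [attributes[f].strip() for f in fields if attributes.get(f, "").strip()]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_title({"brand": "Acme", "product_type": "Bath mixer", "colour": "Chrome"}))
```

Once the attributes exist in structured form, a rule of this kind can be applied to the whole assortment at no additional manual cost, which is the sense in which titles become seamless to create.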

From Chapter 4, we can conclude that the length and content of the product

titles have a significant and positive impact on most measured dimensions.

Most importantly, one would expect a positive impact on the number of

sessions since long-tail SEO implies traffic directed straight to the product

page from external sources. This effect is confirmed by our analysis and

we conclude that the proposition is supported at least in part. Regarding

the quantity sold directly, no coefficient was significant at the 5%

significance level for any category or in the full data set. This would imply that

there is, in general, a positive correlation with external and internal traffic

that does not coincide with increased sales when controlling for the number

of page visits. In fact, the blunt interpretation would be that while products

with more information in the title drive more traffic, there is no support

for an argument that these products sell in larger quantities. This latter

argument does not, however, contradict the proposition as such, but is an

interesting observation nonetheless.

5.1.5 Proposition 5: High quality product images are

important for selling products online

Once again, the proposition can be argued to make strong intuitive sense in

the context of e-commerce. Moreover, Chaudhuri et al. (2019) give support

that image quality is key for increasing online sales. While the quantitative

analysis could not capture the relative quality aspect of product images, it

does give evidence that the number of product images had a significant positive

impact on both sales and traffic for the Company.

In fact, in the general case, the number of images on a product page yielded

the highest significant regression coefficients of all considered meta-attributes.

This indicates that the number of images is a key factor to consider for online

retailers when onboarding new assortment. However, it is also likely

a tedious process to engage in if images are not available in the suppliers’

databases since this would require an in-house or outsourced unit with the

responsibility to take new high-quality images of products. The magnitude

of this issue increases if the online retailer is employing drop-shipping, and

hence would not keep the product units in stock themselves.

Finally, the coefficients of the number of images should be interpreted with

caution. It is not likely that increasing the number of images ad infinitum

would generate constant marginal returns to traffic and sales. A model with

decreasing marginal returns could likely be constructed to deal with the in-

terpretation of the coefficient in a predictive model.
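
A simple way to build decreasing marginal returns into the specification would be to let the image count enter concavely, for example through log(1 + n). The sketch below is illustrative only: the coefficient `beta` is a made-up value rather than an estimate from this study, and the functional form is an assumption.

```python
import math


def image_effect(n_images: int, beta: float = 0.3) -> float:
    """Contribution of n images to the (log) response under a log(1 + n) specification."""
    return beta * math.log1p(n_images)


def marginal_gain(n_images: int, beta: float = 0.3) -> float:
    """Extra contribution from adding one more image to a page that already has n images."""
    return image_effect(n_images + 1, beta) - image_effect(n_images, beta)


if __name__ == "__main__":
    # The gain from each additional image shrinks as the page fills up.
    for n in range(5):
        print(n, round(marginal_gain(n), 4))
```

With such a term, the first image contributes the most and each further image contributes less, which avoids the implausible prediction of constant returns from a linear term.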

5.1.6 Proposition 6: Structured data is highly valuable

for business intelligence and on-site curation

This proposition was not considered to be possible to investigate with means

of quantitative analysis with the data set available. In terms of current aca-

demic literature, we could not find support for the explicit use of structured

product data for analytical purposes. On the other hand, Akter and Wamba

(2016) discuss the benefits of structured data for BDA applications in more

general applications. We can thus conclude that proposition 6 needs further

investigation to be able to be considered fully supported, but with confidence

in the intuitive hypothesis that the proposition holds even in the general case.

5.2 General implications of the results

5.2.1 Product catalogue creation

Upon reviewing the data science research on methods for consolidation of

products, it became clear that the incompleteness and lack of structure in

the e-commerce data was a major hurdle for achieving better results (Rao and

Sashikuma, 2016; Ristoski et al., 2018). This is consistent with the findings

from the proposition analysis, which indicated that many suppliers simply

cannot provide all of the requested data and that in some cases the manual

workload of structuring the data for large assortments is too overwhelming to

pursue, and thus products with less information than desired are allowed to

appear on e-commerce websites due to lack of alternatives. This mechanism

limits the value that can be created in all parts of the product-data value

chain and consequently the user experience for the consumer.

Moreover, the pursuit of automation in the data science community in solving

these issues is evident, and a wide range of methods for structuring data using novel

technologies such as machine learning have been proposed and successfully tested.

In contrast, the efforts of doing those exact same tasks in the company inves-

tigated (and possibly other e-commerce companies as well) are highly manual

and thus costly in terms of time and effort. This might very well present an

interesting opportunity for the e-commerce sector. The three-step process

for automatically collecting, structuring and aggregating product data could

potentially be adopted by online retailers themselves with the purpose of

reducing cost and potentially increasing the quality of data.

5.2.2 Toward a common product taxonomy

There seem to be great inefficiencies generally in transferring and leveraging

data between different parts of the product-data value chain. The proposi-

tion analysis identifies barriers between the suppliers and the e-commerce

companies, and the literature review identifies similar struggles in collecting

and structuring the e-commerce websites’ data. While novel technologies can

play a role in making these inefficiencies less prevalent, one way of elimi-

nating these struggles more efficiently could be creating common product

taxonomies.

There are such taxonomies relating to product categorization that are lever-

aged by several online aggregators in their classification of products. The

next natural step would be to elaborate on those categories and enrich each

category node with a schema for structured data points that relate to prod-

ucts in that category. With such an approach, it would be clear for suppliers

and e-commerce companies alike how the data should be structured and

eliminate much of the tension in transferring data between systems. We

recognize, however, that creating common standards is difficult and requires

participation from many stakeholders, and might also be vitiated by other

problems. Yet, the approach is intriguing in light of the results in this paper

and would be of interest for further research.
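
To make the idea of enriching category nodes with a schema more concrete, a minimal sketch is given below. The class names, the example category and its attributes are invented for illustration and do not correspond to any existing taxonomy standard.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeSpec:
    """One structured data point expected for products in a category."""
    name: str
    dtype: type
    required: bool = True


@dataclass
class CategoryNode:
    """A node in a (hypothetical) shared product taxonomy, enriched with an attribute schema."""
    name: str
    schema: list[AttributeSpec] = field(default_factory=list)
    children: list["CategoryNode"] = field(default_factory=list)

    def validate(self, product: dict) -> list[str]:
        """Return a list of problems; an empty list means the product conforms."""
        problems = []
        for spec in self.schema:
            if spec.name not in product:
                if spec.required:
                    problems.append(f"missing: {spec.name}")
            elif not isinstance(product[spec.name], spec.dtype):
                problems.append(f"wrong type: {spec.name}")
        return problems


# Example node; category and attribute names are invented for illustration.
bath_mixers = CategoryNode(
    name="Bath mixers",
    schema=[
        AttributeSpec("brand", str),
        AttributeSpec("width_mm", int),
        AttributeSpec("colour", str, required=False),
    ],
)

if __name__ == "__main__":
    print(bath_mixers.validate({"brand": "Acme", "width_mm": "150"}))
```

A supplier and a retailer sharing such a node would both know which data points are expected and in what form, which is the source of the reduced friction argued for above.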

5.2.3 Critical success factors and their relation to prod-

uct data

As is evident from the proposition analysis, the topic of product data is core

for many parts of running a successful e-commerce company. Collecting,

structuring and managing the data is costly in terms of time and effort but

makes a significant impact on the success of the business. While we recog-

nize the risk of our assessment being somewhat biased by the fact that our

perspective was data-centered to begin with, we find it surprising that the

topic has not been discussed more frequently in the literature outside the

data science community. Some insights on the direct relation between prod-

uct data and success factors are given by Burgess and Karanasios (2008),

Cebi (2013) and Chaudhuri et al. (2019) while most others discuss the im-

portance of applications that leverage structured product data for trust and

user-experience (navigation for example) without touching the topic of the

underlying data.

Our deduction is that structured product data lies at the core of many of the

critical success factors discussed in the literature. The creation, growth

and potential scalability of an e-commerce business all require a data-centered

mindset. While the current research does a good job of

highlighting the importance of considering all parts of the e-commerce value

chain, it does not do justice to the role of product data in achieving the

desired outcomes. A final note on the topic relates to the way e-commerce

is discussed in the academic community. As we point out in the review

of the literature, much research is written with the objective of supporting

traditional retailers’ transition to the online marketplace, but given the fact

that e-commerce has evolved into an industry in its own right with many

participants being online-native, we suggest a more e-commerce centered

focus going forward that can better account for the intricacies of conducting

e-commerce that are not necessarily related to the dynamics of traditional

retail.

5.3 Limitations of the paper

This section aims to discuss both limitations in the study and potential

weaknesses of the different chapters. We consider it appropriate to separate

the proposition analysis and the quantitative analysis. We start with the

latter, as its apparent weaknesses are more straightforward to introduce.

We do, however, want to re-iterate that this study was conducted on a single

case (although multiple subsidiaries in the proposition analysis). This means

that conclusions drawn in the paper might not always hold in the general case,

as interviewees are undoubtedly shaped by their organizational context, and

the data only represents a fraction, albeit a relatively large one, of total online

sales in the Nordics. We hope to have created a solid foundation for future

research with different case subjects, where our results can be evaluated in

contexts differing in product space, company size and geographic location.

5.3.1 Proposition validation

Firstly, the issue of causality versus correlation and reverse-causality needs

to be addressed. In a regression analysis, significant

coefficients indicate correlation, but they do not necessarily provide a basis for

causality. In the context of this study, reverse-causality is a valid concern.

If it were the case that products that either sold better or drove more traffic

were to be retroactively amended with more structured data, reverse causal-

ity would indeed be an issue. However, interviews with employees at the

Company did not provide any evidence that products are amended on the

basis of sales or traffic, which gives validity to the results. Furthermore, the

model was not constructed ad hoc, but was deliberately specified together

with professionals within the company. This yields additional validity in the

interpretation of the model, as not only does the careful specification lower

the risk of omitted variable bias, it also provides some confidence in the pro-

posed causal relationship between the regressors and the response variables.

Secondly, the data sample from the Company only consisted of data from

2018 and 2019, yielding 2 years of data. While the time aspect is not a direct

issue, the analysis is solely built upon data from a single e-commerce website,

and thus the results are not guaranteed to hold in the general case. Further-

more, a larger data set from different sources would likely have facilitated

a more thorough analysis of the long-tail products. Almost half of the products

in the original data set provided by the Company had fewer than 10 orders placed

during the 2-year period, which made them unsuitable for analysis. These

data points were thus excluded, and effects from the long tail would then not

be captured. From the proposition analysis, it is clear that these kinds of

products are of significant value, which is the reason why many e-tailers aim for

full assortments within the categories. A further data limitation

was that not all relevant categories were eligible for regression

due to a lack of observations. This means that insights into these categories

were lost; the analysis would have benefited from a data set either from a longer sales

period or from more online retailers.

Furthermore, one desired response variable in the study was the conversion

rate. With the type of data provided from the Company, there was no way to

properly model the conversion rate with a standard multiple linear regression

model - even with a logit transform. Thus, we attempted to use the quantity

of orders as a proxy, controlling for page views. We recognize that this is not

a perfect substitute, but the analysis is still deemed valid for the purposes

of understanding the relative impact of different types of structured product

data. This leaves an opening for further research, where a potential data set

tracking customer journeys could be utilized to model the conversion rate in

a more direct way.
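
For reference, the proxy approach and the more direct alternative can be written side by side. The notation is ours and illustrative: $Q_i$ is the quantity sold, $PV_i$ the page views and $\mathbf{x}_i$ the product data regressors of product $i$. The second equation assumes session-level data with a binary purchase indicator $y_{ij}$ for session $j$ on product $i$, which was not available for this study.

```latex
% Proxy used here: quantity sold, controlling for page views
Q_i = \beta_0 + \beta_{PV}\, PV_i + \boldsymbol{\beta}^{\top}\mathbf{x}_i + \varepsilon_i

% Possible direct alternative with customer-journey data (logistic regression)
\log\frac{\Pr(y_{ij} = 1)}{1 - \Pr(y_{ij} = 1)} = \alpha_0 + \boldsymbol{\alpha}^{\top}\mathbf{x}_i
```

In the first form, the coefficients in $\boldsymbol{\beta}$ describe how quantity varies with product data at a given level of traffic, which is why it can only approximate a conversion effect.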

On the topic of regressors, there are some considerations that need to be

addressed. Firstly, the description regressors were constructed as the number

of words in the short and long descriptions on the product page. We recognize

that this might not be a perfect method of controlling for the information

in the product description. An ideal scenario would have been to use as a

regressor a modified description which excludes information that is (or would

have been) present in the structured data of the product. This would then

have served as a better indicator of how the description text impacts sales

and website traffic. This approach was, however, deemed unfeasible with the

limitations of the data set that was available to us at this time.
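
The difference between the word-count regressor used here and the modified description regressor described above can be sketched as follows. The function names and the phrase-removal rule are assumptions made for illustration; a production version would need proper tokenization and handling of units and synonyms.

```python
import re


def word_count(text: str) -> int:
    """Number of whitespace-separated words, as in a plain word-count regressor."""
    return len(text.split())


def adjusted_word_count(description: str, attribute_values: list[str]) -> int:
    """Word count after removing phrases already covered by structured attribute values."""
    remaining = description
    for value in attribute_values:
        if not value:
            continue  # skip empty attribute values
        # Case-insensitive removal of each attribute value as a whole phrase.
        remaining = re.sub(rf"\b{re.escape(value)}\b", " ", remaining, flags=re.IGNORECASE)
    return word_count(remaining)


if __name__ == "__main__":
    text = "Chrome bath mixer from Acme with ceramic cartridge"
    print(word_count(text), adjusted_word_count(text, ["chrome", "Acme"]))
```

The adjusted count only measures description content that is not already available as structured data, which is the separation the ideal regressor would aim for.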

One potential problem that was evaluated was the common occurrence of

sales campaigns on the Company’s website, which severely lowered prices and likely

affected quantities sold at different times. While we have argued that there is a

low risk of generating an omitted variable bias from these occurrences, since

there is likely no correlation between campaigns and the amount of product data,

we recognize that there is a potential for additional uncaptured variation in

the sales and traffic driven. The assumption of no correlation between

campaigns and product data might not, however, hold true if the Company had used

an automated system to drive campaigns. At this time we did not receive

any indication that this was the case, but also not a firm confirmation of the

alternative. In the same vein, there could be some correlation between our

meta-attributes and other factors of which we are still unaware, as automation of

sales and advertisement becomes more prominent - especially in a company

with the resources of the subject of study.

A final note concerns seasonal differences, which could potentially

have a dramatic effect on sales. This point is intimately connected

to the sales campaigns discussed above, especially since the emergence of more and more annual “sales holidays”,

and it is likely that these time periods exhibit different

sales and traffic behaviour than more ordinary days. The fact

that the data set covers two whole years does however mean that every season

is indeed captured. A suggestion for further research would be to control for

the biggest sales holidays (e.g. Black Friday weekend and the post-Christmas

sale). This could be done by excluding these periods or finding a set of suitable control

variables. For the purpose of generating new intelligence, it would also be an

interesting case to conduct an analysis specifically for these types of events,

as there is a significant potential for retailers to generate unusual amounts

of revenue.
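
As a sketch of what such a control variable could look like, the snippet below flags observations that fall on major sales holidays. It assumes daily observations, which differs from the aggregated data set used in this study, and the holiday windows (Black Friday through the following Monday, and 26 to 31 December) are our own illustrative choices.

```python
from datetime import date, timedelta


def black_friday(year: int) -> date:
    """Black Friday: the day after the fourth Thursday of November."""
    nov1 = date(year, 11, 1)
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    return first_thursday + timedelta(weeks=3, days=1)


def is_sales_holiday(day: date) -> int:
    """Dummy variable: 1 for Black Friday weekend (Fri-Mon) and 26-31 December, else 0."""
    start = black_friday(day.year)
    in_black_weekend = start <= day <= start + timedelta(days=3)
    in_post_christmas = day.month == 12 and day.day >= 26
    return int(in_black_weekend or in_post_christmas)
```

The resulting dummy could either be added as a regressor or used to exclude the flagged days before aggregating sales and traffic per product.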

5.3.2 Limitations of the proposition analysis

The proposition analysis is of course limited by the fact that it only encompasses

interviews with employees from a single company group. As such, the

nature of the proposition analysis findings can in general only be considered

valid in the context of that group of companies. Moreover, the selection of

interviewees did not include representation from all of the companies’

organizational functions. Thus, potential insights into or challenges to our

propositions might not be included in this paper. We further acknowledge that

our focus on product data going into the interviews might have affected the

interviewees’ notions of the importance of product data relative to

other important aspects of the e-commerce business.

5.3.3 Sustainability aspects of this paper

As we are closing in on the end of this paper we will discuss sustainability

aspects as they relate to this paper. A framework of three sustainability

aspects is used to guide the discussion: ecological, social and economic

sustainability.

Ecological sustainability is difficult to relate directly to this work. There

are, however, second-order considerations in terms of ecological factors worth

considering. In terms of consumer retail, the main ecological factors to con-

sider are what products people are buying and how they are being delivered.

In terms of the products themselves, the aspect of sustainability metrics for

judging the quality of products from a sustainability standpoint relates directly

to the data available to the customer. Including structured data on

product sustainability is thus integral to empowering the consumer to make

informed decisions. E-commerce companies can also

leverage such data in filters, landing pages and marketing

materials. As such, sustainability metrics should be included in data schemas

all across the data value chain.

Since this paper is concerned with data on existing products, the social aspects

of this work are negligible. In terms of the method, the interviews were

made with a diverse group of people in terms of gender and age but the se-

lection was of course limited by the fact that the study was made at a single

company.

Lastly, in terms of economic sustainability, we consider approaches for au-

tomation of manual efforts in the realm of product data to be the only

reasonable approach to create a dynamic and scalable business model for

e-commerce considering the huge amount of manual effort dedicated to these

activities today.

5.4 Conclusion

This study has served the purpose of exploring the impact of structured

product data in the e-commerce space through means of a limited case study

on a large Nordic online retailer, a rigorous review of contemporary academic

literature as well as a quantitative study on data provided by the Company.

Through the proposition analysis, six propositions were deduced

and used to guide the rest of the study. All six of the

propositions were addressed in relation to the literature review conducted,

and were later evaluated through a multiple regression model.

On an overarching level, it is clear that there is a significant positive corre-

lation between most of the meta-attributes that were defined in the scope

of the study and the three response variables internal traffic, external traffic

and quantity sold. One exception is the negative coefficient for the standard

attributes in relation to SEO, which could be attributed to a

diluting effect of such attributes on the response of a search engine algorithm.

These correlations are consistent with the current academic literature

on the subject of product data, although literature in the specific context of

e-commerce is surprisingly limited.

In conclusion, the paper gives strong support for propositions 1-5, indicating

that online retailers are currently well aware of the potential positive implications

of structured product data for their business. However, there are

significant knowledge gaps within the firm, as well as between the firm and

the state-of-the-art research on BDA. We propose that further research needs

to apply a context-specific lens on e-commerce as a whole in order to reduce

this knowledge gap and ultimately make the solutions accessible for online

retailers with fewer resources than, for example, Amazon.

Bibliography

Akter, S. and Wamba, S. F. (2016), ‘Big data analytics in E-commerce:

a systematic review and agenda for future research’, Electronic Markets

26(2), 173–194.

Atchariyachanvanich, K., Okada, H. and Sonehara, N. (2008), Critical success

factors of Internet shopping: The case of Japan, in ‘Communications in

Computer and Information Science’, Vol. 23 CCIS, pp. 98–109.

Baxter, P. and Jack, S. (2008), ‘Qualitative Case Study Methodology’, The

Qualitative Report 13(4), 544–559.

Burgess, S. and Karanasios, S. (2008), ‘Electronic commerce and business-

to-consumer (B2C) relations’, Journal of Electronic Commerce in Organi-

zations 6(4), 1–7.

Cebi, S. (2013), ‘Determining importance degrees of website design parame-

ters based on interactions and types of websites’, Decision Support Systems

54(2), 1030–1043.

Chaudhuri, A., Messina, P., Kokkula, S., Subramanian, A., Krishnan, A.,

Gandhi, S., Magnani, A. and Kandaswamy, V. (2019), A Smart System

for Selection of Optimal Product Images in E-Commerce, in ‘Proceed-

ings - 2018 IEEE International Conference on Big Data, Big Data 2018’,

pp. 1728–1736.

Choshin, M. and Ghaffari, A. (2017), ‘An investigation of the impact of

effective factors on the success of e-commerce in small- and medium-sized

companies’, Computers in Human Behavior 66, 67–74.

Creswell, J. W. (2009), Research Design: Qualitative, Quantitative and Mixed

Approaches (3rd Edition).

Duan, H., Zhai, C. X., Cheng, J. and Gattani, A. (2013), ‘Supporting key-

word search in product database: A probabilistic approach’, Proceedings

of the VLDB Endowment 6(14), 1786–1797.

Frost, R., Fox, A. K. and Strauss, J. (2018), E-marketing, eighth edition.

Kang, K.-D., Son, S. and Stankovic, J. (2003), ‘Differentiated Real-Time

Data Services for E-Commerce Applications’, Electronic Commerce Re-

search 3(1/2), 113–142.

Krys, G. and Bagheri, E. (2016), Semi-Supervised Product Specification Ex-

traction From The Web.

Lee, G. G. and Lin, H. F. (2005), ‘Customer perceptions of e-service quality

in online shopping’.

Loebbecke, C. and Picot, A. (2015), ‘Reflections on societal and business

model transformation arising from digitization and big data analytics: A

research agenda’, Journal of Strategic Information Systems 24(3), 149–157.

Machado, A. (2011), ‘Usability: impact on e-commerce’.

Ngai, E. W. (2003), ‘Selection of web sites for online advertising using the

AHP’, Information and Management 40(4), 233–242.

Nguyen, H., Fuxman, A., Paparizos, S., Freire, J. and Agrawal, R. (2011),

‘Synthesizing products for online catalogs’, Proceedings of the VLDB En-

dowment 4(7), 409–418.

Nisar, T. M. and Prabhakar, G. (2017), ‘What factors determine e-

satisfaction and consumer spending in e-commerce retailing?’, Journal of

Retailing and Consumer Services 39, 135–144.

Petrovski, P. and Bizer, C. (2017), Extracting Attribute-Value Pairs

from Product Specifications on the Web, in ‘Proceedings - 2017

IEEE/WIC/ACM International Conference on Web Intelligence, WI 2017’,

pp. 558–565.

Rao, H. and Sashikuma, M. (2016), ‘A Survey on Automated Web Data

Extraction Techniques for Product Specification from E-commerce Web

Sites’, International Journal of Advanced Research in Computer Science

and Software Engineering 6(8).

Ristoski, P., Petrovski, P., Mika, P. and Paulheim, H. (2018), ‘A machine

learning approach for product matching and categorization’, Semantic Web

9(5), 707–728.

Shimada, K. and Endo, T. (2005), Acquisition of new training data from un-

labeled data for product specification extraction, in ‘Pacling 2005’, p. 284.

Singh, J. P., Irani, S., Rana, N. P., Dwivedi, Y. K., Saumya, S. and Kumar

Roy, P. (2017), ‘Predicting the “helpfulness” of online consumer reviews’,

Journal of Business Research 70, 346–355.

Varela, M. L. R., Araujo, A. F., Vieira, G. G., Manupati, V. K. and Manoj,

K. (2017), ‘Integrated Framework based on Critical Success Factors for E-

Commerce’, Journal of Information Systems Engineering & Management

2(1).

Vetenskapsrådet (2002), ‘Forskningsetiska principer inom humanistisk-

samhällsvetenskaplig forskning’, Stockholm.

Walther, M., Hahne, L., Schuster, D. and Schill, A. (2010), Locating and

extracting product specifications from producer websites, in ‘ICEIS 2010

- Proceedings of the 12th International Conference on Enterprise Informa-

tion Systems’, Vol. 4 SAIC, pp. 13–22.

Wan, Y., Ma, B. and Pan, Y. (2018), ‘Opinion evolution of online consumer

reviews in the e-commerce environment’, Electronic Commerce Research

18(2), 291–311.

Yin, R. K. (2003), Case Study Research: Design and Methods, 3rd edition,

Sage, Thousand Oaks, CA, pp. 19–39; 96–106.

Appendix A

Appendix

Regressor                                               logPageviews  logSessions   Quantity
(Intercept)                                             2.18945∗∗∗    2.50456∗∗∗    0.93710∗∗
                                                       (0.10786)     (0.10913)     (0.28954)
base                                                    0.00531∗      0.01167∗∗∗    0.03372∗∗∗
                                                       (0.00216)     (0.00218)     (0.00165)
Standard                                               −0.14395∗∗∗   −0.17414∗∗∗   −0.06618∗∗∗
                                                       (0.00759)     (0.00767)     (0.00568)
Dimensions                                              0.11029∗∗∗    0.14539∗∗∗   −0.00628
                                                       (0.00562)     (0.00564)     (0.00426)
count                                                   0.37143∗∗∗    0.12771∗∗∗    0.22436∗∗∗
                                                       (0.02440)     (0.02470)     (0.01861)
CategorySpecific                                        0.00274       0.00726∗∗    −0.14312∗∗∗
                                                       (0.00259)     (0.00262)     (0.00826)
short desc words                                        0.00079∗      0.00423∗∗∗   −0.00019
                                                       (0.00032)     (0.00032)     (0.00024)
long desc words                                         0.01095∗∗∗    0.00633∗∗∗    0.00073
                                                       (0.00058)     (0.00059)     (0.00044)
intitle seriesTrue                                      0.27084∗∗∗    0.10665∗∗∗    0.13524∗∗∗
                                                       (0.02035)     (0.02049)     (0.01524)
intitle colourTrue                                      0.46694∗∗∗    0.24787∗∗∗    0.32234∗∗∗
                                                       (0.02476)     (0.02510)     (0.01867)
intitle brandTrue                                       0.00084      −0.25386∗∗∗   −0.07360∗∗
                                                       (0.03092)     (0.03127)     (0.02334)
intitle materialTrue                                    0.09293∗      0.11967∗∗    −0.22940∗∗∗
                                                       (0.03761)     (0.03826)     (0.02843)
Colour                                                 −0.14712∗∗∗   −0.15470∗∗∗   −0.01139
                                                       (0.00980)     (0.00992)     (0.00739)
log(Average.Price)                                      0.17729∗∗∗    0.12220∗∗∗   −0.13795∗∗∗
                                                       (0.01057)     (0.01073)     (0.03681)
adjusted title words                                    0.25592∗∗∗    0.08663∗∗∗   −0.02138
                                                       (0.02513)     (0.02551)     (0.01901)
long desc words:log(Average.Price)                     −0.00045∗∗∗   −0.00018∗     −0.00009
                                                       (0.00008)     (0.00008)     (0.00006)
count:log(Average.Price)                               −0.02521∗∗∗   −0.00227      −0.02756∗∗∗
                                                       (0.00332)     (0.00336)     (0.00254)
log(Average.Price):adjusted title words                −0.02738∗∗∗   −0.01306∗∗∗   −0.00204
                                                       (0.00374)     (0.00380)     (0.00282)
Pageviews                                                                           0.00028∗∗∗
                                                                                   (0.00000)
deliverydelivery within 40 days                                                     0.16264
                                                                                   (0.28691)
deliverydelivery within five days                                                  −0.09209
                                                                                   (0.27976)
deliverydelivery within ten days                                                    1.21731∗∗∗
                                                                                   (0.28201)
CategorySpecific:log(Average.Price)                                                 0.01636∗∗∗
                                                                                   (0.00115)
log(Average.Price):deliverydelivery within 40 days                                 −0.02200
                                                                                   (0.03725)
log(Average.Price):deliverydelivery within five days                                0.08572∗
                                                                                   (0.03636)
log(Average.Price):deliverydelivery within ten days                                −0.11784∗∗
                                                                                   (0.03669)
R²                                                      0.16897       0.09554       0.23189
Adj. R²                                                 0.16866       0.09523       0.23148
Num. obs.                                               45493         49367         46322
RMSE                                                    1.68631       1.77295       1.26958

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05. Standard errors in parentheses.

Table A.1: Regression table for category full


(Intercept) 3.03324∗∗∗ 1.81521∗∗∗ −0.00689

(0.30925) (0.31647) (2.02519)

base −0.00598 0.01005∗ 0.02887∗∗∗

(0.00444) (0.00454) (0.00359)

Standard −0.02480 −0.05476∗∗∗ −0.03285∗∗

(0.01371) (0.01403) (0.01068)

Dimensions 0.02414 0.12040∗∗∗ −0.01054

(0.01451) (0.01480) (0.01147)

count 0.49106∗∗∗ 0.34936∗∗∗ 0.33297∗∗∗

(0.06749) (0.06943) (0.05308)

CategorySpecific −0.00112 0.00249 0.23988∗∗∗

(0.00583) (0.00598) (0.02771)

short desc words 0.00272∗∗ 0.00759∗∗∗ 0.00118

(0.00083) (0.00085) (0.00066)

long desc words −0.00112 −0.00287 −0.00762∗∗∗

(0.00177) (0.00181) (0.00137)

intitle seriesTrue 0.08009 0.08191 −0.02379

(0.04782) (0.04891) (0.03743)




(0.04712) (0.04815) (0.03661)

intitle brandTrue 1.88661∗∗∗ 1.56287∗∗∗ 0.63929∗∗∗

(0.18017) (0.18459) (0.13868)

intitle materialTrue 0.07012 0.00864 −0.16597

(0.13106) (0.13345) (0.10136)

Colour −0.15287∗∗∗ −0.17552∗∗∗ −0.10379∗∗∗

(0.02571) (0.02636) (0.01997)

log(Average.Price) −0.09129∗∗∗ −0.06943∗∗ −0.11778

(0.02598) (0.02652) (0.23302)

adjusted title words −0.31450∗∗∗ −0.31870∗∗∗ −0.07646

(0.07647) (0.07790) (0.05928)

long desc words:log(Average.Price) 0.00070∗∗ 0.00066∗∗ 0.00092∗∗∗

(0.00023) (0.00024) (0.00018)

count:log(Average.Price) −0.03814∗∗∗ −0.02635∗∗ −0.03687∗∗∗

(0.00873) (0.00899) (0.00685)

log(Average.Price):adjusted title words 0.04080∗∗∗ 0.03550∗∗∗ 0.01036

(0.01011) (0.01031) (0.00785)




(0.00000)


(2.02546)

deliverydelivery within five days 1.39758

(2.01291)

deliverydelivery within ten days 1.89899

(2.01356)

CategorySpecific:log(Average.Price) −0.02858∗∗∗

(0.00327)


(0.23444)

log(Average.Price):deliverydelivery within five days −0.02737

(0.23290)

log(Average.Price):deliverydelivery within ten days −0.17054

(0.23293)

R2 0.10562 0.06399 0.33129

Adj. R2 0.10418 0.06258 0.32971



Num. obs. 10532 11307 10623

RMSE 1.68172 1.78129 1.30663

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.2: Regression table for category bath
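
Because count enters the models both as a main effect and through the interaction count:log(Average.Price), its estimated association with the dependent variable depends on the price level. The Python sketch below illustrates this under stated assumptions: that the fitted model is linear in the listed terms and that log denotes the natural logarithm (neither is restated in this appendix). The function name and the example price of 1000 are hypothetical; the coefficients are those in the first column of Table A.2.

```python
import math

def marginal_effect(main_coef, interaction_coef, average_price):
    """Slope of a predictor that also interacts with log(Average.Price).

    Assumption: log is the natural logarithm and the model is linear
    in the main effect and the interaction term.
    """
    return main_coef + interaction_coef * math.log(average_price)

# Table A.2, first column: count and count:log(Average.Price)
COUNT_MAIN = 0.49106
COUNT_X_LOG_PRICE = -0.03814

# Hypothetical price level chosen only for illustration
effect_at_1000 = marginal_effect(COUNT_MAIN, COUNT_X_LOG_PRICE, 1000)
print(round(effect_at_1000, 4))
```

Since the interaction coefficient is negative, the implied slope for count is smaller at higher prices than the main effect alone suggests.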



(Intercept) 7.54899∗∗∗ 6.10407∗∗∗ 3.07121∗∗∗

(0.32559) (0.33810) (0.46891)

base −0.13586∗∗∗ −0.09686∗∗∗ −0.04646∗∗∗

(0.00744) (0.00771) (0.00502)

Standard −0.43771∗∗∗ −0.40815∗∗∗ −0.10417∗∗∗

(0.02167) (0.02265) (0.01466)

Dimensions 0.33961∗∗∗ 0.33432∗∗∗ 0.05380∗∗∗

(0.01183) (0.01233) (0.00856)

count 0.48956∗∗∗ 0.27863∗∗∗ 0.15767∗∗∗

(0.04985) (0.05175) (0.03373)

CategorySpecific −0.18675∗∗∗ −0.09331∗∗∗ −0.22263∗∗∗

(0.00919) (0.00940) (0.02510)

short desc words 0.00196∗ 0.00454∗∗∗ 0.00184∗∗∗

(0.00078) (0.00081) (0.00051)

long desc words 0.01142∗∗∗ 0.00943∗∗∗ 0.00411∗∗∗

(0.00087) (0.00092) (0.00058)

intitle seriesTrue 1.27074∗∗∗ 0.93653∗∗∗ 0.31278∗∗∗

(0.06022) (0.06247) (0.04049)




(0.08970) (0.09444) (0.06030)

intitle brandTrue −0.13142∗ −0.24649∗∗∗ −0.11654∗∗

(0.05161) (0.05415) (0.03553)

intitle materialTrue 0.30226∗∗∗ 0.39449∗∗∗ −0.20829∗∗∗

(0.07142) (0.07557) (0.04970)

Colour −0.20747∗∗∗ −0.13351∗∗∗ −0.06542∗∗∗

(0.02010) (0.02105) (0.01393)

log(Average.Price) 0.14459∗∗∗ 0.15192∗∗∗ −0.06710

(0.02244) (0.02343) (0.05675)

adjusted title words 0.09874∗ 0.07178 0.05279

(0.04464) (0.04730) (0.03049)

long desc words:log(Average.Price) −0.00027∗ −0.00042∗∗∗ −0.00051∗∗∗

(0.00012) (0.00012) (0.00008)

count:log(Average.Price) −0.03108∗∗∗ −0.01762∗∗ −0.01684∗∗∗

(0.00654) (0.00682) (0.00445)

log(Average.Price):adjusted title words 0.00676 0.00075 −0.01534∗∗

(0.00708) (0.00749) (0.00485)




(0.00001)


(0.40389)

deliverydelivery within five days 0.86126∗

(0.40475)


(0.40362)


(0.00381)


(0.05349)


(0.05426)

log(Average.Price):deliverydelivery within ten days 0.00343

(0.05372)

R2 0.43917 0.29011 0.25348

Adj. R2 0.43801 0.28879 0.25132



Num. obs. 8208 9132 8692

RMSE 1.55202 1.69530 1.03558

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.3: Regression table for category construction



(Intercept) −0.57652 0.49963 2.09927∗

(0.39538) (0.42785) (0.88381)

base 0.02362∗∗∗ 0.01705∗∗ −0.01264∗∗

(0.00600) (0.00645) (0.00446)

Standard 0.08104∗∗∗ −0.03233 −0.05021∗∗

(0.02228) (0.02383) (0.01700)

Dimensions 0.10506∗∗∗ −0.09643∗∗∗ −0.13634∗∗∗

(0.02073) (0.02219) (0.01515)

count 0.92903∗∗∗ 0.26854∗ 0.41708∗∗∗

(0.10445) (0.11298) (0.07815)

CategorySpecific −0.10195∗∗∗ −0.04430∗∗∗ −0.11221∗∗

(0.01166) (0.01254) (0.04352)

short desc words 0.00069 0.00086 −0.00113∗∗

(0.00056) (0.00058) (0.00041)

long desc words 0.00328 0.00471 −0.00253

(0.00278) (0.00302) (0.00205)

intitle seriesTrue −0.11627∗ 0.07472 −0.26285∗∗∗

(0.04812) (0.05169) (0.03555)



intitle colourTrue 0.03351 0.04248 0.18538∗∗∗

(0.05764) (0.06225) (0.04238)

intitle brandTrue −0.26515∗∗ 0.13896 −0.35473∗∗∗

(0.08951) (0.09560) (0.06504)

intitle materialTrue 0.30114∗∗∗ 0.40084∗∗∗ 0.18510∗∗∗

(0.06135) (0.06603) (0.04530)

Colour 0.06250∗ −0.14860∗∗∗ −0.21754∗∗∗

(0.02615) (0.02822) (0.01915)


(0.05266) (0.05718) (0.13974)

adjusted title words 0.63347∗∗∗ 0.60028∗∗∗ 0.08490

(0.09793) (0.10620) (0.07790)

long desc words:log(Average.Price) 0.00035 −0.00029 0.00057

(0.00048) (0.00052) (0.00036)

count:log(Average.Price) −0.12407∗∗∗ −0.02167 −0.05148∗∗∗

(0.01813) (0.01962) (0.01357)

log(Average.Price):adjusted title words −0.09846∗∗∗ −0.08962∗∗∗ −0.01634

(0.01698) (0.01843) (0.01350)




(0.00001)


(0.84262)


(0.88001)

CategorySpecific:log(Average.Price) 0.01878∗

(0.00738)


(0.13527)


(0.14197)

R2 0.22140 0.07617 0.34110

Adj. R2 0.21933 0.07391 0.33874

Num. obs. 6416 6972 6447

RMSE 1.48819 1.66986 1.08607

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.4: Regression table for category floor



(Intercept) 8.63638∗∗∗ 5.15022∗∗∗ 4.18897∗

(0.30885) (0.32293) (1.96461)

base −0.10451∗∗∗ −0.00868 −0.02117∗∗∗

(0.00658) (0.00689) (0.00549)

Standard −0.17463∗∗∗ −0.20492∗∗∗ −0.11726∗∗∗

(0.02016) (0.02109) (0.01650)

Dimensions 0.00533 0.03017∗ 0.01389

(0.01263) (0.01319) (0.01024)

count −0.10371 −0.20705∗∗∗ −0.13658∗∗

(0.05796) (0.06107) (0.04777)

CategorySpecific 0.07192∗∗∗ 0.04560∗∗∗ −0.07404∗∗∗

(0.00432) (0.00451) (0.01897)

short desc words −0.00418∗∗∗ 0.00170 −0.00202∗∗

(0.00091) (0.00096) (0.00073)

long desc words 0.00058 −0.00320 0.01740∗∗∗

(0.00356) (0.00373) (0.00287)

intitle seriesTrue −0.02734 −0.03125 −0.08924∗

(0.04400) (0.04596) (0.03516)




(0.05161) (0.05382) (0.04152)

intitle brandTrue −0.61969∗∗∗ −0.87628∗∗∗ −0.33349∗∗∗

(0.05410) (0.05665) (0.04438)

intitle materialTrue −0.17839∗ −0.07101 0.26948∗∗∗

(0.07849) (0.08277) (0.06316)

Colour −0.07070∗∗ 0.01366 −0.01043

(0.02488) (0.02599) (0.02014)

log(Average.Price) 0.01286 −0.02831 −0.18602

(0.02165) (0.02274) (0.21589)

adjusted title words 0.08128 −0.20366∗∗∗ −0.05623

(0.05598) (0.05862) (0.04522)

long desc words:log(Average.Price) 0.00246∗∗∗ 0.00227∗∗∗ −0.00215∗∗∗

(0.00057) (0.00059) (0.00046)

count:log(Average.Price) 0.01904∗ 0.02919∗∗∗ 0.00880

(0.00817) (0.00860) (0.00680)

log(Average.Price):adjusted title words −0.01695 0.01736 0.00796

(0.00874) (0.00915) (0.00706)




(0.00001)


(1.95152)


(1.94176)


(1.94721)


(0.00307)


(0.21699)


(0.21559)

log(Average.Price):deliverydelivery within ten days −0.55743∗

(0.21675)

R2 0.12823 0.06955 0.32994

Adj. R2 0.12661 0.06795 0.32814



Num. obs. 9201 9960 9325

RMSE 1.49467 1.62712 1.19979

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.5: Regression table for category int



(Intercept) −1.12798 −3.44343∗∗∗ 1.32183

(0.61139) (0.63267) (1.23178)

base 0.09435∗∗∗ 0.11426∗∗∗ 0.05420∗∗∗

(0.01015) (0.01056) (0.00819)

Standard −0.05944 −0.14902∗∗∗ −0.09711∗∗∗

(0.03394) (0.03541) (0.02580)

Dimensions 0.11295∗∗∗ 0.04578∗ −0.01316

(0.02224) (0.02307) (0.01707)

count 0.02649 0.31376∗ −0.02588

(0.13345) (0.13916) (0.10399)

CategorySpecific 0.05788∗∗∗ 0.07523∗∗∗ 0.33701∗∗∗

(0.00872) (0.00902) (0.05775)

short desc words −0.00023 0.00387∗∗∗ 0.00026

(0.00072) (0.00077) (0.00059)

long desc words 0.02082∗∗∗ 0.02841∗∗∗ 0.00490

(0.00548) (0.00566) (0.00422)

intitle seriesTrue −0.26639∗∗ −0.35160∗∗∗ −0.08468

(0.09125) (0.09485) (0.06955)




(0.09642) (0.10167) (0.07464)

intitle brandTrue 0.84117∗∗∗ 0.63934∗∗ 0.31881

(0.23383) (0.24417) (0.17474)

intitle materialTrue 0.72365∗∗ 0.30164 −0.43707∗

(0.22612) (0.23892) (0.17234)

Colour −0.33027∗∗∗ −0.18376∗∗ −0.22576∗∗∗

(0.05333) (0.05584) (0.04061)

log(Average.Price) 0.21710∗∗∗ 0.41485∗∗∗ −0.18596∗∗

(0.05972) (0.06160) (0.06884)

adjusted title words 0.20987 0.10165 −0.00713

(0.12334) (0.12692) (0.09311)

long desc words:log(Average.Price) −0.00183∗∗ −0.00319∗∗∗ −0.00031

(0.00065) (0.00067) (0.00050)

count:log(Average.Price) −0.00584 −0.04268∗ −0.00158

(0.01672) (0.01739) (0.01305)

log(Average.Price):adjusted title words −0.03499∗ −0.02036 −0.00338

(0.01570) (0.01616) (0.01184)




(0.00001)

deliverydelivery within 40 days −2.74612

(1.51541)


(1.14438)

deliverydelivery within ten days −0.30310

(1.04480)

CategorySpecific:log(Average.Price) −0.03991∗∗∗

(0.00694)

log(Average.Price):deliverydelivery within 40 days 0.35621∗∗

(0.13562)

log(Average.Price):deliverydelivery within five days 0.08568

(0.05999)

R2 0.23114 0.27673 0.44188

Adj. R2 0.22501 0.27145 0.43561

Num. obs. 2152 2346 2161

RMSE 1.37762 1.49698 1.03938



∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.6: Regression table for category kitchen



(Intercept) 2.29032∗∗∗ 1.98878∗∗∗ 1.23828

(0.41470) (0.42813) (4.69087)

base −0.04712∗∗∗ −0.01328 0.00375

(0.00785) (0.00812) (0.00665)

Standard −0.03076 −0.01023 −0.05902

(0.03888) (0.04001) (0.03204)

Dimensions 0.17431∗∗∗ 0.20669∗∗∗ 0.06819∗∗∗

(0.01801) (0.01841) (0.01500)

count 0.41229∗∗∗ 0.30494∗∗∗ 0.13344∗

(0.07178) (0.07418) (0.06208)

CategorySpecific −0.01732 0.02139 −0.08985

(0.01473) (0.01529) (0.06947)

short desc words 0.00423∗∗∗ 0.00419∗∗ 0.00341∗∗

(0.00126) (0.00130) (0.00104)

long desc words 0.02836∗∗∗ 0.02717∗∗∗ 0.01449∗∗∗

(0.00258) (0.00271) (0.00223)

intitle seriesTrue 0.15154 −0.14573 0.21382∗∗

(0.08681) (0.08964) (0.07251)



intitle colourTrue −0.13275 −0.27339∗∗ −0.07883

(0.09258) (0.09661) (0.07773)

intitle brandTrue 0.05399 −0.16113 0.07269

(0.12105) (0.12527) (0.10323)

intitle materialTrue −0.32477∗ 0.02558 −0.15691

(0.16528) (0.17414) (0.13660)

Colour −0.25447∗∗∗ −0.20923∗∗∗ −0.10157∗∗

(0.03785) (0.03919) (0.03166)


(0.03756) (0.03910) (0.88918)

adjusted title words 0.48688∗∗∗ 0.17667∗ 0.10571

(0.08312) (0.08674) (0.06915)

long desc words:log(Average.Price) −0.00256∗∗∗ −0.00253∗∗∗ −0.00151∗∗∗

(0.00029) (0.00030) (0.00025)

count:log(Average.Price) −0.04203∗∗∗ −0.03577∗∗∗ −0.01756∗

(0.00905) (0.00931) (0.00785)

log(Average.Price):adjusted title words −0.07127∗∗∗ −0.03166∗ −0.01696

(0.01234) (0.01287) (0.01026)




(0.00001)


(4.68870)


(4.67073)


(4.67310)

CategorySpecific:log(Average.Price) 0.00653

(0.00878)

log(Average.Price):deliverydelivery within 40 days 0.03036

(0.89117)


(0.88845)


(0.88868)

R2 0.32325 0.23606 0.25935

Adj. R2 0.32031 0.23299 0.25464



Num. obs. 3933 4247 3955

RMSE 1.54132 1.65990 1.26853

∗∗∗p < 0.001, ∗∗p < 0.01, ∗p < 0.05

Table A.7: Regression table for category garden

