Web Scraping for Consumer Price Statistics Robert Breton


Web Scraping for Consumer Price Statistics

Robert Breton


Uses of inflation measures

• Inflation targeting
• Index-linked government bonds
• Indexing pensions and benefits
• Indirect taxes (e.g. fuel & alcohol duty)
• Income tax thresholds
• Regulated charges (e.g. rail fares)
• Wage bargaining
• GDP deflation


Rationale for web scraping pilot

• Price collection is currently done manually
• Web scraping offers more detailed, more frequent data at lower cost
• Long-term aim is to obtain retail scanner data (as it includes quantity information), but access is difficult
• Web-scraped data provides an opportunity to gain experience in processing high-volume price data


Prototype web scrapers

• 3 supermarkets
• 35 CPI/RPI item categories
• Written in Python using the open-source Scrapy framework
• Daily collection (around 6,500 price quotes)
• Item counts monitored daily
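The daily item-count monitoring could look something like the sketch below. The slides do not say how counts were checked, so the median comparison, the 20% tolerance, the function name `flag_unusual_counts` and the example counts are all assumptions made for illustration.

```python
from statistics import median


def flag_unusual_counts(daily_counts, tolerance=0.2):
    """Return the days whose item count is far from the typical (median) count.

    daily_counts: dict mapping a day label to the number of price quotes scraped.
    tolerance:    allowed relative deviation from the median (assumed value).
    """
    typical = median(daily_counts.values())
    return [
        day
        for day, count in daily_counts.items()
        if abs(count - typical) > tolerance * typical
    ]


# Hypothetical counts: one day where a scraper probably broke.
example_counts = {
    "day 1": 6480,
    "day 2": 6510,
    "day 3": 3100,
    "day 4": 6495,
}
suspect_days = flag_unusual_counts(example_counts)
```

A sudden drop in the count is usually a sign that a site layout changed and the scraper needs maintenance, rather than a real change in the product range.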


Web scraping

Rendered webpage:

HTML source (excerpt):

......</div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div

class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348" class="si_pl_254942348-title"><span class="image"><img src="http://img.tesco.com/Groceries/pi/121\5010044000121\IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-and-freshness">Delivering the freshest food to your door- Find out more &gt;</a></p><div class="descContent"><!----><div class="promo"><a href="/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer" id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until 10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348" class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd" src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/infoBlue.gif" alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div class="links"><ul><li><a href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&amp;N=4294793217" class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium White Bread <!----></span>shelf </a></li></ul></div></div></div></div><div class="quantity"><div class="content addToBasket"><p class="price"><span class="linePrice">£1.45<!----></span><span class="linePriceAbbr"> (£0.18/100g)</span></p><h4 class="hide">Add to basket</h4><form 
method="post" id="fMultisearch-254942348"

.....
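To show how a product name and price can be pulled out of markup like the excerpt above, here is a minimal sketch. The prototype scrapers were written with Scrapy; this example instead uses Python's standard-library `html.parser` so that it runs without extra packages. The class names `inBasketInfoContainer` and `linePrice` come from the excerpt, while `ProductParser`, `parse_product` and the trimmed `SAMPLE_HTML` string are hypothetical names introduced for illustration.

```python
from decimal import Decimal
from html.parser import HTMLParser

# Trimmed version of the excerpt (image path shortened, promo block omitted).
SAMPLE_HTML = (
    '<li id="p-254942348-3" class=" first"><div class="desc">'
    '<h3 class="inBasketInfoContainer"><a id="h-254942348" '
    'href="/groceries/Product/Details/?id=254942348">'
    '<span class="image"><img src="IDShot_90x90.jpg" alt="" /><!----></span>'
    'Warburtons Toastie Sliced White Bread 800G</a></h3></div>'
    '<div class="quantity"><p class="price">'
    '<span class="linePrice">£1.45<!----></span>'
    '<span class="linePriceAbbr"> (£0.18/100g)</span></p></div></li>'
)


class ProductParser(HTMLParser):
    """Collect the text inside the product heading and the line-price span."""

    def __init__(self):
        super().__init__()
        self._field = None  # field that the current text belongs to, if any
        self.name_parts = []
        self.price_parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "inBasketInfoContainer" in classes:
            self._field = "name"
        elif tag == "span" and "linePrice" in classes:
            self._field = "price"

    def handle_endtag(self, tag):
        if tag == "h3" and self._field == "name":
            self._field = None
        elif tag == "span" and self._field == "price":
            self._field = None

    def handle_data(self, data):
        if self._field == "name":
            self.name_parts.append(data)
        elif self._field == "price":
            self.price_parts.append(data)


def parse_product(html):
    """Return (product name, price in pounds) for a single product block."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    name = " ".join("".join(parser.name_parts).split())
    price = Decimal("".join(parser.price_parts).strip().lstrip("£"))
    return name, price


name, price = parse_product(SAMPLE_HTML)
```

The price is kept as a `Decimal` rather than a float so that pence values are stored exactly.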


Dealing with messy data


Classification Challenge

ONS Item Category | Item Description | Search Term | Correct Match
Apples, dessert, per kg | WAITROSE PINK LADY APPLES 4S | 'APPLE*' | Yes
Apples, dessert, per kg | SAINSBURY'S APPLE, KIWI & STRAWBERRY 160G | 'APPLE*' | No

a) Hardcoding:

'APPLE*' =
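The weakness of a hard-coded search term can be reproduced in a few lines. This is a sketch only: the slides do not say how the 'APPLE*' term was applied, so matching the wildcard against each word of the description is an assumption, and `matches_search_term` is a hypothetical helper. The two descriptions are the ones from the table.

```python
from fnmatch import fnmatchcase


def matches_search_term(description, pattern):
    """True if any word in the description matches the wildcard pattern."""
    return any(fnmatchcase(word, pattern) for word in description.upper().split())


descriptions = [
    "WAITROSE PINK LADY APPLES 4S",               # a dessert apple
    "SAINSBURY'S APPLE, KIWI & STRAWBERRY 160G",  # not a dessert apple
]
results = [matches_search_term(d, "APPLE*") for d in descriptions]
```

Both descriptions match the pattern, so the second product is wrongly assigned to the "Apples, dessert, per kg" category. Adding exclusion terms by hand fixes individual cases but has to be repeated for every new product that slips through.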


Classification Challenge

b) Supervised Machine Learning:

“This is a dessert apple”

“This is fruit juice (not orange)”

Training Set

“This is fruit juice (not orange)” and not a dessert apple!
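The idea of learning from a labelled training set can be sketched with a very small naive Bayes text classifier. The slides do not name the algorithm used, so naive Bayes is a stand-in chosen for brevity. Apart from the two descriptions taken from the earlier table, the training descriptions, the label names and the `NaiveBayes` class are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(description):
    """Lower-case the description and strip simple punctuation from each word."""
    words = (w.strip(",.&").lower() for w in description.split())
    return [w for w in words if w]


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, descriptions, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(descriptions, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, description):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(description):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training set (first item of each class is from the slides).
training = [
    ("WAITROSE PINK LADY APPLES 4S", "dessert apple"),
    ("GALA APPLES 6 PACK", "dessert apple"),
    ("BRAEBURN APPLES 1KG", "dessert apple"),
    ("SAINSBURY'S APPLE, KIWI & STRAWBERRY 160G", "other"),
    ("APPLE JUICE 1L", "other"),
    ("APPLE AND MANGO JUICE 1L", "other"),
]
texts, labels = zip(*training)
model = NaiveBayes().fit(texts, labels)

apple_label = model.predict("ROYAL GALA APPLES 5 PACK")
juice_label = model.predict("PRESSED APPLE JUICE 1L")
```

Unlike the hard-coded search term, the classifier uses every word in the description, so words such as "juice" count against the dessert apple category even though "apple" is present.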


Price quote distributions

Whisky:

Onions:


Experimental Indices


Next Steps

• Analysis of mySupermarket data
• Expanded list of items (all groceries)
• Machine learning for product categorisation
• Further development of experimental indices


Key Lessons for data collection

• Be legal: check terms and conditions
• Expertise needed for set-up and maintenance
• Data sourced from the web may need to be restructured to make it useful