Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas And How To Train Them
TALK TO THE SPIDER
Dawn Anderson @dawnieando
02 THE KEY PERSONAS
9 types of Googlebot
SUPPORTING ROLES
- Indexer / Ranking Engine
- The URL Scheduler
- History Logs
- Link Logs
- Anchor Logs
03 GOOGLEBOT'S JOBS
- 'Ranks nothing at all'
- Takes a list of URLs to crawl from the URL Scheduler
- Job varies based on 'bot' type
- Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs
- Makes notes of outbound linked pages and additional links for future crawling
- Takes 'hints' from the URL Scheduler when crawling
- Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary-data equivalent of web content) for comparison with past visits by the history and link logs
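The checksum comparison is easy to picture in code. Below is a minimal sketch, not Google's implementation: it hashes a page body and compares it with the checksum stored from the previous visit to decide whether the URL has changed. The URL, the stored value and the `history_log` dictionary are all hypothetical stand-ins.

```python
import hashlib
import urllib.request

def content_checksum(url: str) -> str:
    """Fetch a URL and return a checksum of its raw body bytes."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

# 'history_log' stands in for Google's history logs: the checksum recorded
# on the previous visit to each URL (the value here is a placeholder).
history_log = {"https://www.example.com/": "<sha256 recorded on last visit>"}

url = "https://www.example.com/"
changed = history_log.get(url) != content_checksum(url)  # True => content changed since last crawl
```

A plain hash flags any byte-level difference; the 'critical material content change' idea discussed later implies something more selective, such as hashing only the main content block so rotating ads or date stamps don't count as change.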
04 ROLES – MAJOR PLAYERS – A 'BOSS' – THE URL SCHEDULER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.
- Schedules Googlebot visits to URLs
- Decides which URLs to 'feed' to Googlebot
- Uses data from the history logs about past visits
- Assigns the regularity of Googlebot visits to URLs
- Drops 'hints' to Googlebot about types of content NOT to crawl, and excludes some URLs from schedules entirely
- Analyses past 'change' periods and predicts future 'change' periods for URLs when scheduling Googlebot visits
- Checks 'page importance' when scheduling visits
- Assigns URLs to 'layers / tiers' of crawling schedules
Indexed Web contains at least 4.73 billion pages (13/11/2015)
05 TOO MUCH CONTENT
[Chart: total number of websites, 2000–2014, on a scale up to 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
06 TOO MUCH CONTENT
Capacity limits on Google's crawling system – how have search engines responded?
- By prioritising URLs for crawling
- By assigning crawl period intervals to URLs
- By creating work 'schedules' for Googlebots
07 GOOGLE CRAWL SCHEDULER PATENTS
Patents include:
- 'Managing items in a crawl schedule'
- 'Scheduling a recrawl'
- 'Web crawler scheduler that utilizes sitemaps from websites'
- 'Document reuse in a search engine crawler'
- 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
- 'Scheduler for search engine'
08 MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers – URLs are moved in and out of layers based on past visits data:
- Real Time Crawl – crawled multiple times daily
- Daily Crawl – crawled daily or bi-daily
- Base Layer Crawl – split into segments on random rotation; crawled least, on a 'round robin' basis – only the 'active' segment is crawled
The scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'.
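To make the layer idea concrete, here is a minimal sketch of how a tiered scheduler might assign URLs. The scores, thresholds and example URLs are invented for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float          # link-equity-style score, 0..1 (assumed)
    change_probability: float  # estimated from past visits, 0..1 (assumed)

def assign_layer(record: UrlRecord) -> str:
    """Toy layer assignment: important, fast-changing URLs float upwards."""
    score = record.importance * record.change_probability
    if score > 0.5:
        return "real-time"  # crawled multiple times daily
    if score > 0.2:
        return "daily"      # crawled daily or bi-daily
    return "base"           # crawled round-robin, one active segment at a time

for rec in [UrlRecord("/breaking-news", 0.9, 0.9),
            UrlRecord("/category/widgets", 0.6, 0.5),
            UrlRecord("/terms-and-conditions", 0.4, 0.05)]:
    print(rec.url, "->", assign_layer(rec))
```

The real system also rotates base-layer segments and weighs 'boost factor' candidacy, but the takeaway is the same: URLs earn their way into faster layers.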
09 GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
- The URL Scheduler controls the meal planner
- It carefully controls the list of URLs Googlebot visits
- 'Budgets' are allocated
10 CRAWL BUDGET
WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to the URLs on a site.
- Apportioned by the URL Scheduler to Googlebots
- Roughly proportionate to page importance (link equity) and speed
- Pages with a lot of healthy links get crawled more (this can include internal links?)
- But there are other factors affecting the frequency of Googlebot visits aside from importance and speed
- The vast majority of URLs on the web don't get much budget allocated to them
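As a toy illustration of 'roughly proportionate', this sketch splits a fixed number of daily crawl visits across URLs in proportion to an assumed importance score. The scores, URLs and total are invented:

```python
def allocate_crawl_budget(importance: dict, total_visits: int) -> dict:
    """Split total_visits across URLs proportionally to their importance scores."""
    total_importance = sum(importance.values())
    return {url: round(total_visits * score / total_importance)
            for url, score in importance.items()}

visits = allocate_crawl_budget(
    {"/home": 0.5, "/category/widgets": 0.3, "/old-press-release": 0.2},
    total_visits=100,
)
# -> {'/home': 50, '/category/widgets': 30, '/old-press-release': 20}
```

In reality the budget behaves more like a frequency than a quota, and site speed caps how much of it Googlebot can actually spend.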
11 HINTS & CRITICAL MATERIAL CONTENT CHANGE
Change is scored as a weighted sum of content features:

C = ∑_{i=0}^{n−1} weight_i × feature_i
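Read it as: each measurable feature of a change (how much body text changed, whether only boilerplate changed, and so on) is multiplied by a weight and summed. A worked example with invented weights and feature values:

```latex
% Hypothetical: two features, body-text change weighted far above boilerplate change
C = \sum_{i=0}^{n-1} \mathit{weight}_i \times \mathit{feature}_i
  = \underbrace{0.9 \times 1.0}_{\text{body text changed}}
  + \underbrace{0.1 \times 0.2}_{\text{boilerplate tweaked}}
  = 0.92
```

A high C suggests critical material change worth recrawling; a page where only the boilerplate moves scores near zero.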
12 POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
- Current capacity of the web crawling system is high
- Your URL is 'important'
- Your URL is in the real time crawl, the daily crawl or an 'active' base layer segment
- Your URL changes a lot, with critical material content change
- The probability and predictability of critical material content change is high for your URL
- Your website speed is fast and Googlebot gets the time to visit your URL
- Your URL has been 'upgraded' to a daily or real time crawl layer
13 NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
- Current capacity of the web crawling system is low
- Your URL has been detected as a 'spam' URL
- Your URL is in an 'inactive' base layer segment
- Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
- The probability and predictability of critical material content change is low for your URL
- Your website speed is slow and Googlebot doesn't get the time to visit your URL
- Your URL has been 'downgraded' to an 'inactive' base layer segment
- Your URL has returned an 'unreachable' server response code recently
14 IT'S NOT JUST ABOUT 'FRESHNESS'
It's about the probability & predictability of future 'freshness'.
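One way to picture 'predictability of future freshness': from the timestamps of detected changes in a history log, estimate how often a URL materially changes and how likely a change is before the next scheduled visit. A minimal sketch, assuming an exponential (memoryless) change model rather than anything Google has published:

```python
import math

def change_probability(change_days: list, horizon_days: float) -> float:
    """P(at least one change within horizon_days), assuming changes arrive
    at the average historical rate (exponential inter-arrival model)."""
    if len(change_days) < 2:
        return 0.0  # not enough history to predict anything
    spans = [b - a for a, b in zip(change_days, change_days[1:])]
    mean_interval = sum(spans) / len(spans)  # average days between changes
    rate = 1.0 / mean_interval               # changes per day
    return 1.0 - math.exp(-rate * horizon_days)

# URL that changed on days 0, 2, 4 and 6: ~92% chance of another change within 5 days
print(round(change_probability([0, 2, 4, 6], horizon_days=5), 2))
```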
15 CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER LIKES & DISLIKES
Based on data from the history logs – how can we influence them to escape the base layer?

LIKES
- Going 'where the action is' in sites
- The 'need for speed'
- Logical structure
- Correct 'response' codes
- XML sitemaps
- Successful crawl visits
- 'Seeing everything' on a page
- Taking 'hints'
- Clear, unique, single 'URL fingerprints' (no duplicates)
- Being able to predict the likelihood of 'future change'

DISLIKES
- Slow sites
- Too many redirects
- Being bored (meh) – 'hints' to spot boring URLs are built into the search engine systems, and Googlebot takes them
- Being lied to (e.g. on XML sitemap priorities)
- Crawl traps and dead ends
- Going round in circles (infinite loops)
- Spam URLs
- Crawl-wasting minor-content-change URLs
- 'Hidden' and blocked content
- Uncrawlable URLs

CHANGE IS KEY
- Not just any change – critical material change
- Predicting future change
- Dropping 'hints' to Googlebot
- Sending Googlebot where 'the action is'
16 FIND GOOGLEBOT
AUTOMATE SERVER LOG RETRIEVAL VIA A CRON JOB

grep Googlebot access_log > googlebot_access.txt
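For the automation itself, a crontab entry along these lines does the job. The paths and schedule are illustrative, so adjust them to your server; note too that filtering on the user-agent string alone will also catch fake 'Googlebots', since anyone can spoof that string.

```
# crontab -e : extract Googlebot hits from the Apache log every night at 02:00
0 2 * * * grep Googlebot /var/log/apache2/access.log > /home/user/logs/googlebot_access.txt
```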
17 LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
PREPARE TO BE HORRIFIED:
- Incorrect URL header response codes (e.g. 302s)
- 301 redirect chains
- Old files or XML sitemaps left on the server from years ago
- Infinite / endless loops (circular dependency)
- On parameter-driven sites, multiple crawled URLs which produce the same output
- URLs generated by spammers
- Dead image files still being visited
- Old CSS files still being crawled
Identify your 'real time', 'daily' and 'base layer' URLs. ARE THEY THE ONES YOU WANT THERE?
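A small script over the filtered log gives a first 'spider eyes' view: which URLs Googlebot actually hits, how often, and with which response codes. A minimal sketch, assuming the common Apache/Nginx combined log format and the file produced by the grep above:

```python
import re
from collections import Counter

# Combined log format contains: "GET /path HTTP/1.1" 200
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

hits, statuses = Counter(), Counter()
with open("googlebot_access.txt") as log:
    for line in log:
        m = LINE.search(line)
        if m:
            hits[m["path"]] += 1
            statuses[m["status"]] += 1

print("Response codes seen:", dict(statuses))      # lots of 302s? redirect chains?
print("Most-crawled URLs:", hits.most_common(10))  # is budget going where you want?
```

URLs Googlebot hits many times a day are effectively in your 'real time' or 'daily' layers; URLs it rarely touches are stuck in the base layer.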
18 FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES':
- Speed up your site: implement compression, minification and caching
- Fix incorrect header response codes
- Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
- Use absolute rather than relative internal links
- Ensure no part of your content is blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
- Ensure no CSS or JavaScript files are blocked from crawlers
- Unpick 301 redirect chains
- Minimise 301 redirects
- Minimise canonicalisation
- Use 'if modified' headers on low-importance 'hygiene' pages (see the sketch after this list)
- Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
- Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt)
- Use 410 'gone' headers liberally on dead URLs
- Revisit your .htaccess file and review legacy pattern-matched 301 redirects
- Combine CSS and JavaScript files
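'If modified' headers let Googlebot revalidate a page without re-downloading it: the crawler sends If-Modified-Since, and the server answers 304 Not Modified when nothing has changed, which costs almost no crawl budget. A minimal, framework-free sketch of that server-side decision; the file path is hypothetical:

```python
import os
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional

def respond(if_modified_since: Optional[str], path: str = "hygiene-page.html"):
    """Return (status, headers) for a conditional GET on a static page."""
    mtime = os.path.getmtime(path)
    # HTTP dates have second precision, so truncate before comparing
    last_modified = datetime.fromtimestamp(mtime, tz=timezone.utc).replace(microsecond=0)
    if if_modified_since:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            return 304, {}  # 'Not Modified' - crawler keeps its cached copy
    return 200, {"Last-Modified": formatdate(mtime, usegmt=True)}
```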
19 FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES:
- Revisit 'votes for self' via the internal links report in GSC
- Keep clear, 'unique' URL fingerprints
- Use XML sitemaps for your important URLs – don't put everything on them (see the example after this list)
- Use 'mega menus' (very selectively) to key pages
- Use 'breadcrumbs' (for hierarchical structure)
- Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages
- Consolidate (merge) important but similar content (e.g. merge FAQs)
- Consider flattening your site structure so 'importance' flows further
- Reduce internal linking to low-priority URLs
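For the sitemap point, a minimal example in the standard sitemaps.org format, listing only key URLs with honest lastmod dates. The URLs, dates and priorities are invented; and remember the 'being lied to' dislike – don't inflate priorities you can't justify.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2015-11-10</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/category/widgets</loc>
    <lastmod>2015-11-08</lastmod>
    <priority>0.8</priority>
  </url>
  <!-- deliberately NOT listed: tag pages, filtered URLs, near-duplicates -->
</urlset>
```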
20 TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE – TRAIN ON CHANGE. Googlebot goes where the action is, and where it is likely to be in the future:
- Not just any change – critical material change
- Keep the 'action' in the key areas, NOT JUST THE BLOG
- Use relevant supplementary content to keep key pages 'fresh'
- Remember the negative impact of 'crawl hints'
- Regularly update key content
- Consider 'updating' rather than replacing seasonal content URLs
- Build 'dynamism' into your web development (sites that 'move' win)
21 TOOLS YOU CAN USE
SPEED: YSlow; Pingdom; Google PageSpeed tests; minification (JS Compress, CSS Minifier); image compression (compressjpeg.com, tinypng.com)
SPIDER EYES: GSC crawl stats; Deepcrawl; Screaming Frog; server logs; SEMrush (auditing tools); Webconfs (header responses / similarity checker); Powermapper (bird's-eye view of a site)
URL IMPORTANCE: GSC internal links report (URL importance); Link Research Tools (strongest subpages reports); GSC internal links (add site categories and sections as additional profiles); Powermapper
SAVINGS & CHANGE: GSC index levels (over-indexation checks); GSC crawl stats; last-accessed tools (versus competitors); server logs
Plus: Webmaster Hangout Office Hours
22 WARNING SIGNS – TOO MANY 'VOTES FOR SELF' FOR THE WRONG PAGES
IS THIS YOUR BLOG?? HOPE NOT
[Diagram labels: Most Important Page 1, Most Important Page 2, Most Important Page 3]
23 WARNING SIGNS – OVER-INDEXATION
FIX IT FOR A BETTER CRAWL
24 WARNING SIGNS – TAG MAN
Tagging like this creates 'thin' content and even more URLs to crawl:
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image credit: Buzzfeed
25 REMEMBER – GOOGLE THINKS SO
"Googlebot's On A Strict Diet"
"Make sure the right URLs get on the menu"
Dawn Anderson @dawnieando