Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas And How To Train Them
TALK TO THE SPIDER
Dawn Anderson @dawnieando
02 THE KEY PERSONAS
9 types of Googlebot
SUPPORTING ROLES
- Indexer / Ranking Engine
- The URL Scheduler
- History Logs
- Link Logs
- Anchor Logs
03 GOOGLEBOT'S JOBS
- 'Ranks nothing at all'
- Takes a list of URLs to crawl from the URL Scheduler
- Job varies based on 'bot' type
- Runs errands and makes deliveries for the URL server, the indexer / ranking engine and the logs
- Makes notes of outbound linked pages and additional links for future crawling
- Takes 'hints' from the URL Scheduler when crawling
- Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary-data equivalent of web content) for comparison with past visits by the history and link logs
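The checksum comparison is easy to picture in code. Below is a minimal sketch, not Google's implementation: it hashes a page body and compares it with the checksum stored from the previous visit to decide whether the URL has changed. The URL, the stored value and the `history_log` dictionary are all hypothetical stand-ins.

```python
import hashlib
import urllib.request

def content_checksum(url: str) -> str:
    """Fetch a URL and return a checksum of its raw body bytes."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

# 'history_log' stands in for Google's history logs: the checksum recorded
# on the previous visit to each URL (the value here is a placeholder).
history_log = {"https://www.example.com/": "<sha256 recorded on last visit>"}

url = "https://www.example.com/"
changed = history_log.get(url) != content_checksum(url)  # True => content changed since last crawl
```

A plain hash flags any byte-level difference; the 'critical material content change' idea discussed later implies something more selective, such as hashing only the main content block so rotating ads or date stamps don't count as change.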
04 ROLES – MAJOR PLAYERS – A 'BOSS' – THE URL SCHEDULER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.
- Schedules Googlebot visits to URLs
- Decides which URLs to 'feed' to Googlebot
- Uses data from the history logs about past visits
- Assigns the regularity of Googlebot visits to URLs
- Drops 'hints' to Googlebot about types of content NOT to crawl, and excludes some URLs from schedules entirely
- Analyses past 'change' periods and predicts future 'change' periods for URLs when scheduling Googlebot visits
- Checks 'page importance' when scheduling visits
- Assigns URLs to 'layers / tiers' of crawling schedules
Indexed Web contains at least 4.73 billion pages (13/11/2015)
05 TOO MUCH CONTENT
[Chart: total number of websites, 2000–2014, on a scale up to 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
06 TOO MUCH CONTENT
Capacity limits on Google's crawling system – how have search engines responded?
- By prioritising URLs for crawling
- By assigning crawl period intervals to URLs
- By creating work 'schedules' for Googlebots
07 GOOGLE CRAWL SCHEDULER PATENTS
Patents include:
- 'Managing items in a crawl schedule'
- 'Scheduling a recrawl'
- 'Web crawler scheduler that utilizes sitemaps from websites'
- 'Document reuse in a search engine crawler'
- 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
- 'Scheduler for search engine'
08 MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers – URLs are moved in and out of layers based on past visits data:
- Real Time Crawl – crawled multiple times daily
- Daily Crawl – crawled daily or bi-daily
- Base Layer Crawl – split into segments on random rotation; crawled least, on a 'round robin' basis – only the 'active' segment is crawled
The scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'.
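To make the layer idea concrete, here is a minimal sketch of how a tiered scheduler might assign URLs. The scores, thresholds and example URLs are invented for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class UrlRecord:
    url: str
    importance: float          # link-equity-style score, 0..1 (assumed)
    change_probability: float  # estimated from past visits, 0..1 (assumed)

def assign_layer(record: UrlRecord) -> str:
    """Toy layer assignment: important, fast-changing URLs float upwards."""
    score = record.importance * record.change_probability
    if score > 0.5:
        return "real-time"  # crawled multiple times daily
    if score > 0.2:
        return "daily"      # crawled daily or bi-daily
    return "base"           # crawled round-robin, one active segment at a time

for rec in [UrlRecord("/breaking-news", 0.9, 0.9),
            UrlRecord("/category/widgets", 0.6, 0.5),
            UrlRecord("/terms-and-conditions", 0.4, 0.05)]:
    print(rec.url, "->", assign_layer(rec))
```

The real system also rotates base-layer segments and weighs 'boost factor' candidacy, but the takeaway is the same: URLs earn their way into faster layers.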
09 GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
- The URL Scheduler controls the meal planner
- It carefully controls the list of URLs Googlebot visits
- 'Budgets' are allocated
10 CRAWL BUDGET
WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to the URLs on a site.
- Apportioned by the URL Scheduler to Googlebots
- Roughly proportionate to page importance (link equity) and speed
- Pages with a lot of healthy links get crawled more (this can include internal links?)
- But there are other factors affecting the frequency of Googlebot visits aside from importance and speed
- The vast majority of URLs on the web don't get much budget allocated to them
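As a toy illustration of 'roughly proportionate', this sketch splits a fixed number of daily crawl visits across URLs in proportion to an assumed importance score. The scores, URLs and total are invented:

```python
def allocate_crawl_budget(importance: dict, total_visits: int) -> dict:
    """Split total_visits across URLs proportionally to their importance scores."""
    total_importance = sum(importance.values())
    return {url: round(total_visits * score / total_importance)
            for url, score in importance.items()}

visits = allocate_crawl_budget(
    {"/home": 0.5, "/category/widgets": 0.3, "/old-press-release": 0.2},
    total_visits=100,
)
# -> {'/home': 50, '/category/widgets': 30, '/old-press-release': 20}
```

In reality the budget behaves more like a frequency than a quota, and site speed caps how much of it Googlebot can actually spend.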
11 HINTS & CRITICAL MATERIAL CONTENT CHANGE
Change is scored as a weighted sum of content features:

C = ∑_{i=0}^{n−1} weight_i × feature_i
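Read it as: each measurable feature of a change (how much body text changed, whether only boilerplate changed, and so on) is multiplied by a weight and summed. A worked example with invented weights and feature values:

```latex
% Hypothetical: two features, body-text change weighted far above boilerplate change
C = \sum_{i=0}^{n-1} \mathit{weight}_i \times \mathit{feature}_i
  = \underbrace{0.9 \times 1.0}_{\text{body text changed}}
  + \underbrace{0.1 \times 0.2}_{\text{boilerplate tweaked}}
  = 0.92
```

A high C suggests critical material change worth recrawling; a page where only the boilerplate moves scores near zero.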
12 POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
- Current capacity of the web crawling system is high
- Your URL is 'important'
- Your URL is in the real time crawl, the daily crawl or an 'active' base layer segment
- Your URL changes a lot, with critical material content change
- The probability and predictability of critical material content change is high for your URL
- Your website speed is fast and Googlebot gets the time to visit your URL
- Your URL has been 'upgraded' to a daily or real time crawl layer
13 NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
- Current capacity of the web crawling system is low
- Your URL has been detected as a 'spam' URL
- Your URL is in an 'inactive' base layer segment
- Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
- The probability and predictability of critical material content change is low for your URL
- Your website speed is slow and Googlebot doesn't get the time to visit your URL
- Your URL has been 'downgraded' to an 'inactive' base layer segment
- Your URL has returned an 'unreachable' server response code recently
14 IT'S NOT JUST ABOUT 'FRESHNESS'
It's about the probability & predictability of future 'freshness'.
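One way to picture 'predictability of future freshness': from the timestamps of detected changes in a history log, estimate how often a URL materially changes and how likely a change is before the next scheduled visit. A minimal sketch, assuming an exponential (memoryless) change model rather than anything Google has published:

```python
import math

def change_probability(change_days: list, horizon_days: float) -> float:
    """P(at least one change within horizon_days), assuming changes arrive
    at the average historical rate (exponential inter-arrival model)."""
    if len(change_days) < 2:
        return 0.0  # not enough history to predict anything
    spans = [b - a for a, b in zip(change_days, change_days[1:])]
    mean_interval = sum(spans) / len(spans)  # average days between changes
    rate = 1.0 / mean_interval               # changes per day
    return 1.0 - math.exp(-rate * horizon_days)

# URL that changed on days 0, 2, 4 and 6: ~92% chance of another change within 5 days
print(round(change_probability([0, 2, 4, 6], horizon_days=5), 2))
```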
15 CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER LIKES & DISLIKES
Based on data from the history logs – how can we influence them to escape the base layer?

LIKES
- Going 'where the action is' in sites
- The 'need for speed'
- Logical structure
- Correct 'response' codes
- XML sitemaps
- Successful crawl visits
- 'Seeing everything' on a page
- Taking 'hints'
- Clear, unique, single 'URL fingerprints' (no duplicates)
- Being able to predict the likelihood of 'future change'

DISLIKES
- Slow sites
- Too many redirects
- Being bored (meh) – 'hints' to spot boring URLs are built into the search engine systems, and Googlebot takes them
- Being lied to (e.g. on XML sitemap priorities)
- Crawl traps and dead ends
- Going round in circles (infinite loops)
- Spam URLs
- Crawl-wasting minor-content-change URLs
- 'Hidden' and blocked content
- Uncrawlable URLs

CHANGE IS KEY
- Not just any change – critical material change
- Predicting future change
- Dropping 'hints' to Googlebot
- Sending Googlebot where 'the action is'
16 FIND GOOGLEBOT
AUTOMATE SERVER LOG RETRIEVAL VIA A CRON JOB

grep Googlebot access_log > googlebot_access.txt
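For the automation itself, a crontab entry along these lines does the job. The paths and schedule are illustrative, so adjust them to your server; note too that filtering on the user-agent string alone will also catch fake 'Googlebots', since anyone can spoof that string.

```
# crontab -e : extract Googlebot hits from the Apache log every night at 02:00
0 2 * * * grep Googlebot /var/log/apache2/access.log > /home/user/logs/googlebot_access.txt
```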
17 LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
PREPARE TO BE HORRIFIED:
- Incorrect URL header response codes (e.g. 302s)
- 301 redirect chains
- Old files or XML sitemaps left on the server from years ago
- Infinite / endless loops (circular dependency)
- On parameter-driven sites, multiple crawled URLs which produce the same output
- URLs generated by spammers
- Dead image files still being visited
- Old CSS files still being crawled
Identify your 'real time', 'daily' and 'base layer' URLs. ARE THEY THE ONES YOU WANT THERE?
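A small script over the filtered log gives a first 'spider eyes' view: which URLs Googlebot actually hits, how often, and with which response codes. A minimal sketch, assuming the common Apache/Nginx combined log format and the file produced by the grep above:

```python
import re
from collections import Counter

# Combined log format contains: "GET /path HTTP/1.1" 200
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

hits, statuses = Counter(), Counter()
with open("googlebot_access.txt") as log:
    for line in log:
        m = LINE.search(line)
        if m:
            hits[m["path"]] += 1
            statuses[m["status"]] += 1

print("Response codes seen:", dict(statuses))      # lots of 302s? redirect chains?
print("Most-crawled URLs:", hits.most_common(10))  # is budget going where you want?
```

URLs Googlebot hits many times a day are effectively in your 'real time' or 'daily' layers; URLs it rarely touches are stuck in the base layer.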
18 FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES':
- Speed up your site: implement compression, minification and caching
- Fix incorrect header response codes
- Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
- Use absolute rather than relative internal links
- Ensure no part of your content is blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
- Ensure no CSS or JavaScript files are blocked from crawlers
- Unpick 301 redirect chains
- Minimise 301 redirects
- Minimise canonicalisation
- Use 'if modified' headers on low-importance 'hygiene' pages (see the sketch after this list)
- Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
- Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt)
- Use 410 'gone' headers liberally on dead URLs
- Revisit your .htaccess file and review legacy pattern-matched 301 redirects
- Combine CSS and JavaScript files
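'If modified' headers let Googlebot revalidate a page without re-downloading it: the crawler sends If-Modified-Since, and the server answers 304 Not Modified when nothing has changed, which costs almost no crawl budget. A minimal, framework-free sketch of that server-side decision; the file path is hypothetical:

```python
import os
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime
from typing import Optional

def respond(if_modified_since: Optional[str], path: str = "hygiene-page.html"):
    """Return (status, headers) for a conditional GET on a static page."""
    mtime = os.path.getmtime(path)
    # HTTP dates have second precision, so truncate before comparing
    last_modified = datetime.fromtimestamp(mtime, tz=timezone.utc).replace(microsecond=0)
    if if_modified_since:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            return 304, {}  # 'Not Modified' - crawler keeps its cached copy
    return 200, {"Last-Modified": formatdate(mtime, usegmt=True)}
```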
19 FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES:
- Revisit 'votes for self' via the internal links report in GSC
- Keep clear, 'unique' URL fingerprints
- Use XML sitemaps for your important URLs – don't put everything on them (see the example after this list)
- Use 'mega menus' (very selectively) to key pages
- Use 'breadcrumbs' (for hierarchical structure)
- Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages
- Consolidate (merge) important but similar content (e.g. merge FAQs)
- Consider flattening your site structure so 'importance' flows further
- Reduce internal linking to low-priority URLs
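For the sitemap point, a minimal example in the standard sitemaps.org format, listing only key URLs with honest lastmod dates. The URLs, dates and priorities are invented; and remember the 'being lied to' dislike – don't inflate priorities you can't justify.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2015-11-10</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/category/widgets</loc>
    <lastmod>2015-11-08</lastmod>
    <priority>0.8</priority>
  </url>
  <!-- deliberately NOT listed: tag pages, filtered URLs, near-duplicates -->
</urlset>
```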
20 TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE – TRAIN ON CHANGE. Googlebot goes where the action is, and where it is likely to be in the future:
- Not just any change – critical material change
- Keep the 'action' in the key areas, NOT JUST THE BLOG
- Use relevant supplementary content to keep key pages 'fresh'
- Remember the negative impact of 'crawl hints'
- Regularly update key content
- Consider 'updating' rather than replacing seasonal content URLs
- Build 'dynamism' into your web development (sites that 'move' win)
21 TOOLS YOU CAN USE
SPEED: YSlow; Pingdom; Google PageSpeed tests; minification (JS Compress, CSS Minifier); image compression (compressjpeg.com, tinypng.com)
SPIDER EYES: GSC crawl stats; Deepcrawl; Screaming Frog; server logs; SEMrush (auditing tools); Webconfs (header responses / similarity checker); Powermapper (bird's-eye view of a site)
URL IMPORTANCE: GSC internal links report (URL importance); Link Research Tools (strongest subpages reports); GSC internal links (add site categories and sections as additional profiles); Powermapper
SAVINGS & CHANGE: GSC index levels (over-indexation checks); GSC crawl stats; last-accessed tools (versus competitors); server logs
Plus: Webmaster Hangout Office Hours
22 WARNING SIGNS – TOO MANY 'VOTES FOR SELF' FOR THE WRONG PAGES
IS THIS YOUR BLOG?? HOPE NOT
[Diagram labels: Most Important Page 1, Most Important Page 2, Most Important Page 3]
23 WARNING SIGNS – OVER-INDEXATION
FIX IT FOR A BETTER CRAWL
24 WARNING SIGNS – TAG MAN
Tagging like this creates 'thin' content and even more URLs to crawl:
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image credit: Buzzfeed
25 REMEMBER – GOOGLE THINKS SO
"Googlebot's On A Strict Diet"
"Make sure the right URLs get on the menu"
Dawn Anderson @dawnieando