
SAScon Beta 2015 – Dawn Anderson – Talk To The Spider


TALK TO THE SPIDER

Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas And How To Train Them

Dawn Anderson @dawnieando


02 – THE KEY PERSONAS

9 types of Googlebot

SUPPORTING ROLES
- Indexer / Ranking Engine
- The URL Scheduler
- History Logs
- Link Logs
- Anchor Logs


03 – GOOGLEBOT'S JOBS

- 'Ranks nothing at all'
- Takes a list of URLs to crawl from the URL Scheduler
- Job varies based on 'bot' type
- Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
- Makes notes of outbound linked pages and additional links for future crawling
- Takes note of 'hints' from the URL Scheduler when crawling
- Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (a binary equivalent of the web content) for comparison with past visits by the history and link logs


04 – ROLES – MAJOR PLAYERS – A 'BOSS': THE URL SCHEDULER

Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system.

- Schedules Googlebot visits to URLs
- Decides which URLs to 'feed' to Googlebot
- Uses data from the history logs about past visits
- Assigns visit regularity of Googlebot to URLs
- Drops 'hints' to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
- Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits
- Checks 'page importance' when scheduling visits
- Assigns URLs to 'layers / tiers' for crawling schedules
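To make the scheduler's role concrete, here is a minimal Python sketch of how a crawl scheduler might pick a layer for a URL from history-log signals. The signal names, scoring and thresholds are illustrative assumptions, not Google's actual implementation.

    # Illustrative sketch only - signals, scoring and thresholds are
    # invented for the example, not Google's actual scheduler logic.
    def assign_crawl_layer(page_importance: float, change_probability: float) -> str:
        """Pick a crawl layer from two history-log signals, both in [0, 1]:
        page importance (think link equity) and the predicted probability
        of critical material content change."""
        score = page_importance * change_probability
        if score > 0.7:
            return "real-time"   # crawled multiple times daily
        elif score > 0.4:
            return "daily"       # crawled daily or bi-daily
        return "base-layer"      # crawled least, in round-robin segments

    # An important URL whose content rarely changes still lands in the base layer:
    print(assign_crawl_layer(page_importance=0.9, change_probability=0.1))  # base-layer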


05 – TOO MUCH CONTENT

The indexed Web contains at least 4.73 billion pages (13/11/2015).

[Chart: total number of websites, 2000-2014, climbing towards 1,000,000,000]

SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3


06 – TOO MUCH CONTENT

There are capacity limits on Google's crawling system. How have search engines responded?

- By prioritising URLs for crawling
- By assigning crawl period intervals to URLs
- By creating work 'schedules' for Googlebots


07 – GOOGLE CRAWL SCHEDULER PATENTS

These include:
- 'Managing items in a crawl schedule'
- 'Scheduling a recrawl'
- 'Web crawler scheduler that utilizes sitemaps from websites'
- 'Document reuse in a search engine crawler'
- 'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
- 'Scheduler for search engine'


08 – MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)

3 layers / tiers; URLs are moved in and out of layers based on past visits data:

- Real Time Crawl – crawled multiple times daily
- Daily Crawl – crawled daily or bi-daily
- Base Layer Crawl – crawled least; split into segments on random rotation and crawled on a 'round robin' basis, with only the 'active' segment crawled
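A minimal Python sketch of that base-layer mechanic: segments rotate, and only the currently 'active' segment is crawled in each period. The segment count and rotation policy are illustrative assumptions, not the patent's exact scheme.

    import random

    # Illustrative base-layer rotation sketch - the segment count and
    # shuffle are assumptions, not Google's exact mechanism.
    def make_segments(urls, n_segments=12):
        """Split base-layer URLs into segments for round-robin crawling."""
        urls = urls[:]
        random.shuffle(urls)
        return [urls[i::n_segments] for i in range(n_segments)]

    def active_segment(segments, epoch):
        """Only one segment is 'active' (crawled) per epoch, round robin."""
        return segments[epoch % len(segments)]

    segments = make_segments([f"https://example.com/page{i}" for i in range(100)])
    for epoch in range(3):
        print(f"Epoch {epoch}: crawling {len(active_segment(segments, epoch))} base-layer URLs")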


09 – GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET

- The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
- The URL Scheduler controls the meal planner
- It carefully controls the list of URLs Googlebot visits
- 'Budgets' are allocated


10 – CRAWL BUDGET

WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to URLs on a site.

- Apportioned by the URL Scheduler to Googlebots
- Roughly proportionate to page importance (link equity) & speed
- Pages with a lot of healthy links get crawled more (can include internal links??)
- But there are other factors affecting frequency of Googlebot visits aside from importance / speed
- The vast majority of URLs on the web don't get a lot of budget allocated to them
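As a back-of-envelope illustration of 'roughly proportionate to importance & speed', here is a Python sketch that shares a fixed daily crawl capacity across URLs. The formula and the numbers are invented for the example, not Google's actual allocation.

    # Illustrative budget split - the importance x speed scoring and the
    # normalisation are assumptions, not Google's actual allocation.
    def apportion_budget(pages, total_crawls_per_day=1000):
        """pages: {url: (importance, speed_factor)}, both in (0, 1]."""
        raw = {url: imp * speed for url, (imp, speed) in pages.items()}
        total = sum(raw.values())
        return {url: round(total_crawls_per_day * score / total, 1)
                for url, score in raw.items()}

    print(apportion_budget({
        "/": (0.9, 0.8),                   # important, fast
        "/category/widgets": (0.5, 0.8),   # moderately important
        "/blog/old-post": (0.05, 0.8),     # the long tail gets crumbs
    }))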


11 – HINTS & CRITICAL MATERIAL CONTENT CHANGE

C = Σ (i = 0 to n-1) weight_i × feature_i

(A content change score: a weighted sum over n change 'features', where each weight_i reflects how much feature_i matters.)
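A toy worked example of that weighted sum, with invented feature names, weights and values (they are not taken from the patent):

    # Toy example of C = sum(weight_i * feature_i).
    # Feature names, weights and values are invented for illustration.
    weights  = {"body_text_changed": 0.6, "boilerplate_changed": 0.1, "price_changed": 0.3}
    features = {"body_text_changed": 1.0, "boilerplate_changed": 1.0, "price_changed": 0.0}

    C = sum(weights[name] * features[name] for name in weights)
    print(C)  # 0.7 - body text change dominates; boilerplate change barely registers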


12 – POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY

- Current capacity of the web crawling system is high
- Your URL is 'important'
- Your URL is in the real-time, daily crawl or 'active' base layer segment
- Your URL changes a lot, with critical material content change
- Probability and predictability of critical material content change is high for your URL
- Your website speed is fast and Googlebot gets the time to visit your URL
- Your URL has been 'upgraded' to a daily or real-time crawl layer


13 – NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY

- Current capacity of the web crawling system is low
- Your URL has been detected as a 'spam' URL
- Your URL is in an 'inactive' base layer segment
- Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
- Probability and predictability of critical material content change is low for your URL
- Your website speed is slow and Googlebot doesn't get the time to visit your URL
- Your URL has been 'downgraded' to an 'inactive' base layer segment
- Your URL has returned an 'unreachable' server response code recently


14 – IT'S NOT JUST ABOUT 'FRESHNESS'

It's about the probability & predictability of future 'freshness'.

BASED ON DATA FROM THE HISTORY LOGS – HOW CAN WE INFLUENCE THEM TO ESCAPE THE BASE LAYER?


15 – CRAWL OPTIMISATION – STAGE 1 – UNDERSTAND GOOGLEBOT & URL SCHEDULER: LIKES & DISLIKES

LIKES
- Going 'where the action is' in sites
- The 'need for speed'
- Logical structure
- Correct 'response' codes
- XML sitemaps
- Successful crawl visits
- 'Seeing everything' on a page
- Taking 'hints' ('hints' are built in by the search engine systems)
- Clear, unique, single 'URL fingerprints' (no duplicates)
- Predicting likelihood of 'future change'

DISLIKES
- Slow sites
- Too many redirects
- Being bored ('Meh')
- Being lied to (e.g. on XML sitemap priorities)
- Crawl traps and dead ends
- Going round in circles (infinite loops)
- Spam URLs
- Crawl-wasting minor content change URLs
- 'Hidden' and blocked content
- Uncrawlable URLs

CHANGE IS KEY
- Not just any change – critical material change
- Predicting future change
- Dropping 'hints' to Googlebot
- Sending Googlebot where 'the action is'


16 – FIND GOOGLEBOT

AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB

grep Googlebot access_log > googlebot_access.txt

For example, a crontab entry to extract Googlebot hits daily (the paths are illustrative; substitute your own server's):

0 1 * * * grep Googlebot /var/log/apache2/access_log > /home/user/googlebot_access.txt


17 – LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT

PREPARE TO BE HORRIFIED:
- Incorrect URL header response codes (e.g. 302s)
- 301 redirect chains
- Old files or XML sitemaps left on the server from years ago
- Infinite / endless loops (circular dependency)
- On parameter-driven sites, URLs crawled which produce the same output
- URLs generated by spammers
- Dead image files being visited
- Old CSS files still being crawled

Identify your 'real time', 'daily' and 'base layer' URLs. ARE THEY THE ONES YOU WANT THERE?
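A minimal Python sketch of this kind of log triage, reading the googlebot_access.txt file produced earlier. It assumes the common Apache/nginx 'combined' log format (request path in the 7th field, status code in the 9th); adjust the indexes for your own logs.

    from collections import Counter

    # Rough log triage sketch - assumes combined log format; adapt the
    # field indexes to your own server logs.
    hits, statuses = Counter(), Counter()

    with open("googlebot_access.txt") as log:
        for line in log:
            parts = line.split()
            if len(parts) < 9:
                continue
            hits[parts[6]] += 1       # request path
            statuses[parts[8]] += 1   # response code

    print("Response code mix:", dict(statuses))  # watch for 302s, 404s, 500s
    print("Most-crawled URLs (your de facto 'real time' candidates):")
    for url, count in hits.most_common(10):
        print(f"{count:6d}  {url}")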


18 – FIX GOOGLEBOT'S JOURNEY – SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE

TECHNICAL 'FIXES':
- Speed up your site: implement compression, minification, caching
- Fix incorrect header response codes
- Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
- Use absolute rather than relative internal links
- Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
- Ensure no CSS or JavaScript files are blocked from crawlers
- Unpick 301 redirect chains
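A quick way to sanity-check the compression fix from the list above: this small Python sketch requests a page with gzip accepted and reports the Content-Encoding header (the URL is a placeholder for your own).

    import urllib.request

    # Diagnostic sketch: is this URL served compressed? The URL is illustrative.
    req = urllib.request.Request(
        "https://example.com/",
        headers={"Accept-Encoding": "gzip", "User-Agent": "compression-check"},
    )
    with urllib.request.urlopen(req) as resp:
        print("Content-Encoding:", resp.headers.get("Content-Encoding", "none"))
        # expect 'gzip' (or 'br') if compression is enabled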


19 – FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET

- Minimise 301 redirects
- Minimise canonicalisation
- Use 'if modified' headers on low-importance 'hygiene' pages (see the sketch after this list)
- Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
- Noindex low search volume or near-duplicate URLs (use the noindex directive in robots.txt)
- Use 410 'gone' headers on dead URLs liberally
- Revisit the .htaccess file and review legacy pattern-matched 301 redirects
- Combine CSS and JavaScript files
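On the 'if modified' point above: a minimal Python sketch of the idea, answering If-Modified-Since with a bodyless 304 when nothing has changed. In practice your web server or CMS handles this; the file name and port are illustrative.

    import os
    from datetime import datetime, timezone
    from email.utils import format_datetime, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = "hygiene-page.html"  # illustrative low-importance page

    class IfModifiedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            last_mod = datetime.fromtimestamp(os.path.getmtime(PAGE), timezone.utc)
            last_mod = last_mod.replace(microsecond=0)
            since = self.headers.get("If-Modified-Since")
            if since and parsedate_to_datetime(since) >= last_mod:
                self.send_response(304)  # unchanged: no body, crawl budget saved
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Last-Modified", format_datetime(last_mod, usegmt=True))
            self.end_headers()
            with open(PAGE, "rb") as f:
                self.wfile.write(f.read())

    HTTPServer(("", 8000), IfModifiedHandler).serve_forever()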


20 – TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)

EMPHASISE PAGE IMPORTANCE – BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES
- Revisit 'votes for self' via internal links in GSC
- Clear 'unique' URL fingerprints
- Use XML sitemaps for your important URLs (don't put everything on them)
- Use 'mega menus' (very selectively) to key pages
- Use 'breadcrumbs' (for hierarchical structure)
- Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular', 'related' internal linking to key pages
- Consolidate (merge) important but similar content (e.g. merge FAQs)
- Consider flattening your site structure so 'importance' flows further
- Reduce internal linking to low-priority URLs

TRAIN ON CHANGE – GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE
- Not just any change – critical material change
- Keep the 'action' in the key areas, NOT JUST THE BLOG
- Use relevant supplementary content to keep key pages 'fresh'
- Remember the negative impact of 'crawl hints'
- Regularly update key content
- Consider 'updating' rather than replacing seasonal content URLs (e.g. refresh one recurring URL each year rather than minting a new dated one)
- Build 'dynamism' into your web development (sites that 'move' win)


21 – TOOLS YOU CAN USE

SPEED
- YSlow
- Pingdom
- Google Page Speed tests
- Minification: JS Compress and CSS Minifier
- Image compression: compressjpeg.com, tinypng.com

SPIDER EYES
- GSC Crawl Stats
- DeepCrawl
- Screaming Frog
- Server logs
- SEMrush (auditing tools)
- Webconfs (header responses / similarity checker)
- PowerMapper (bird's eye view of a site)

URL IMPORTANCE
- GSC Internal Links report (URL importance)
- Link Research Tools (strongest sub-pages reports)
- GSC Internal Links (add site categories and sections as additional profiles)
- PowerMapper

SAVINGS & CHANGE
- GSC index levels (over-indexation checks)
- GSC Crawl Stats
- Last-accessed tools (versus competitors)
- Server logs

Plus: Google Webmaster Hangout office hours


22 – WARNING SIGNS – TOO MANY VOTES BY SELF FOR THE WRONG PAGES

IS THIS YOUR BLOG?? HOPE NOT

[Diagram: 'Most Important Page 1', 'Most Important Page 2', 'Most Important Page 3' – the pages receiving the most internal 'votes']


23 – WARNING SIGNS – OVER-INDEXATION

FIX IT FOR A BETTER CRAWL


24 – WARNING SIGNS – TAG MAN

Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it

Creating 'thin' content and even more URLs to crawl.

Image credit: BuzzFeed


25 – GOOGLE THINKS SO


26 – REMEMBER

"Googlebot's on a strict diet."
"Make sure the right URLs get on the menu."

Dawn Anderson @dawnieando