Your First Sitemap.xml & Robots.txt Implementation, by Jérôme Verstrynge for Ligatures.net, December 2014. License: CC BY-ND 4.0




Page 2: Your first sitemap.xml and robots.txt implementation

Table Of Contents

● Introduction
● Sitemaps: XML vs HTML
● Locations:
  – Sitemap.xml
  – Robots.txt
● Sitemap.xml:
  – Content I & II
  – Generators
  – Recommendations
● Robots.txt:
  – Content I & II
  – Basic example
  – Recommendations & Warnings
● Additional References:
  – Further readings


Introduction

● Web Crawler:
  – A search engine computer searching for content on the Internet for later indexation
  – Web crawlers read the robots.txt and sitemap.xml files found on websites
● Robots.txt:
  – A text file containing instructions for web crawlers
● Sitemap.xml:
  – A text file listing page URLs, to help web crawlers find content on a website


Sitemaps: XML vs HTML (confusion)

● HTML Sitemap:
  – A web page containing links facilitating user navigation on a website
  – Displayed in web browsers
  – Visited by users and web crawlers
● XML Sitemap:
  – A structured text file containing the URLs of a website's pages, for web crawlers
  – Never displayed to users
  – Read by web crawlers only

That's what we are interested in!


Sitemap.xml Locations

By default, most web crawlers search for a sitemap.xml file in the website root.

But sitemaps can be located anywhere... although the recommended practice is to put them all in the root!

A website can have more than one sitemap!

[Diagram: website root '/' containing sitemap.xml and sitemap2.xml, next to a /mydir subdirectory]


Robots.txt Location

By default, all web crawlers search for a robots.txt file in the website root.

A website may not have a robots.txt file... but it is recommended to always have one (even if minimal).

[Diagram: website root '/' containing robots.txt and sitemap.xml, next to a /mydir subdirectory]


Sitemap.xml Content - I

● A structured document defining a <urlset>
● One <url>...</url> section per web page URL
● <loc> is required; the other elements are optional

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/page.html</loc>
    <lastmod>2014-10-04T13:27:58+03:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  ...
</urlset>


Sitemap.xml Content - II

● <loc>: the URL of a page on the website
● <lastmod>: when the page was last modified
● <changefreq>: how often it is modified
● <priority>: your opportunity to tell web crawlers which pages you think they should spend their time on first (it has no impact on rankings)

<loc>http://mysite.com/page.html</loc>
<lastmod>2014-10-04T13:27:58+03:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.7</priority>


Sitemap.xml Generators

● Creating a sitemap.xml manually can be very time consuming
● It can be generated automatically instead...
● ...but not everyone is technical!
● Solution?
  – Use free online sitemap generators
  – Some plugins are available for blog platforms
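As a sketch of what such a generator does under the hood, the minimal Python script below builds a sitemap from a list of page URLs using only the standard library (the page list is a hypothetical example):

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Build a minimal sitemap.xml document from a list of page URLs."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in urls:
        url = ET.SubElement(urlset, "url")
        # <loc> is the only required element of a <url> section
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical page list, for illustration only:
print(build_sitemap(["http://mysite.com/", "http://mysite.com/page.html"]))
```

A real generator would crawl the website to collect the URLs; this sketch only shows the XML-building step.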


Sitemap.xml Recommendations

● Create at least one sitemap.xml in the root
● Be as exhaustive as possible
● Leave out <lastmod> and <changefreq> if you can't set reliable values
● Don't try to fool search engines with <lastmod>, <changefreq> and <priority>; it does not work and can backfire
● You may submit your sitemaps to search engines (but it is not mandatory)
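If your pages are plain files, one reliable way to set <lastmod> is to derive it from the file's modification time. A small Python sketch (the file path is hypothetical):

```python
from datetime import datetime, timezone
from pathlib import Path

def lastmod_for(path):
    """Return a file's modification time in the W3C datetime format
    used by <lastmod>, e.g. 2014-10-04T13:27:58+00:00."""
    mtime = Path(path).stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat(timespec="seconds")

# Hypothetical usage: lastmod_for("page.html")
```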


Robots.txt Content - I

● Rules apply top-down; the last matching rule prevails
● User-agent: tells which web crawler (a.k.a. robot) the rules apply to; * means all
● Disallow: forbids access, but if empty, forbids access to nothing (in other words, allows all)
● Allow: authorizes access

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html


Robots.txt Content - II

● This robots.txt says:
  – All web crawlers (but Google's) can access everything on the website
  – Google's web crawler cannot access the content of the /mydir directory, except myfile.html in this directory

User-agent: *
Disallow:

User-agent: Googlebot
Disallow: /mydir/
Allow: /mydir/myfile.html
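Such rules can be verified programmatically with Python's standard urllib.robotparser. One caveat: Python's parser applies a group's rules in order (first match wins) rather than by most-specific match as Google does, so in this sketch the Allow line is placed before the Disallow it carves an exception out of:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:

User-agent: Googlebot
Allow: /mydir/myfile.html
Disallow: /mydir/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("SomeBot", "http://mysite.com/mydir/page.html"))     # allowed for everyone else
print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/page.html"))   # /mydir/ is disallowed
print(rp.can_fetch("Googlebot", "http://mysite.com/mydir/myfile.html")) # explicitly allowed
```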


Robots.txt – Basic Example

● Use the example below for a start
● Allow all web crawlers access to all your website content
● Register all your sitemaps in robots.txt, otherwise web crawlers likely won't find them
● Locations are case-sensitive
● Directory locations should end with a '/'

User-agent: *
Disallow:

Sitemap: http://www.mysite.com/sitemap.xml
Sitemap: http://www.mysite.com/sitemap2.xml
...
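Because the sitemaps are declared in robots.txt, they can be discovered from there. As a sketch, Python 3.8+ exposes the Sitemap: lines directly (the URLs are the hypothetical ones from the example):

```python
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: *
Disallow:

Sitemap: http://www.mysite.com/sitemap.xml
Sitemap: http://www.mysite.com/sitemap2.xml
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# site_maps() returns the URLs declared on Sitemap: lines (Python 3.8+)
print(rp.site_maps())
```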


Robots.txt Recommendations & Warnings

● Always create (at least) a minimal robots.txt where all sitemaps are declared

● Never block access to CSS and JavaScript content

● Disallow instructions can be bypassed by malicious web crawlers; they are not a means of protecting access to content

● Debug your robots.txt with online checkers