LOG FILE ANALYSIS
5 CRITICAL TECH SEO QUESTIONS YOUR LOGS CAN ANSWER
#BrightonSEO | @SearchMATH
As used by…
About Botify
Here’s the problem…
> Google doesn’t crawl every page of your website
>> If a page isn’t crawled, it won’t be indexed
>>> If a page isn’t indexed, it won’t make you money
Identify Desired Outcomes and
Objectives
Information Gathering
Action Planning
Implementation and Review
New Initiative Planning Process
This presentation will focus on the “Information Gathering” stage of the
process.
Dawn Anderson’s slide deck “BRINGING IN THE FAMILY DURING CRAWLING” is an insightful guide to help you identify crawl budget opportunities. Dawn also suggests powerful actions you should explore.
Log File 101
Hypertext Transfer Protocol (HTTP)
Client → Server (HTTP Request):

GET /index.html HTTP/1.1
Host: www.exampleshop.com
User-Agent: Mozilla/5.0

Server → Client (HTTP Response):

HTTP/1.1 200 OK
Date: Mon, 11 Jul 2016 08:06:45 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 04 Feb 2016 23:11:55 GMT
Etag: “3f84f-1b9-3elcd16b”
Accept-Ranges: bytes
Content-Length: 458
Connection: close
Content-Type: text/html; charset=UTF-8
Fig 1: HTTP Client/Server Communication
This is a standard HTTP/1.1 exchange between Client (e.g. Browser or Googlebot) & your Server.
Server Log Files
Server
188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] "GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Server IP | Timestamp (date & time) | Method (GET/POST) | Request URI | HTTP status code | User-agent
Fig 2: Example Server Output (WHO’S REQUESTING? | WHEN? | HOW? | WHAT FILE? | SERVER RESPONSE)

Server logs are the SINGLE SOURCE OF TRUTH when it comes to seeing how search engines, such as Googlebot, assess your website.
Your web server keeps a file of every hit it receives during exchanges like the one on the previous slide. Your very own data treasure chest.
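A line like the one in Fig 2 is easy to pull apart programmatically. A minimal sketch in Python, assuming the combined log format shown above (the pattern, field names, and helper are my own, not from the talk):

```python
import re

# Pattern for a combined-format access log line:
# IP, identd, user, timestamp, request line, status, referer, user-agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<referer>\S+) "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] '
        '"GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

entry = parse_line(line)
print(entry['path'], entry['status'], 'Googlebot' in entry['user_agent'])
```

Run this over every line in the file and you have the raw material for all of the questions that follow.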
“[Cleanup your architecture because] we get lost crawling unnecessary URLs and we might not be able to crawl and index your new and updated content as quickly as we would otherwise… There are a number of crawlers you can use to crawl your website on your own, to run across your website.” Google Webmaster Central office hours hangout, 16 Oct 2015
@JohnMu: Crawl your website with a THIRD-PARTY CRAWLER

@JohnMu: Conduct LOG FILE ANALYSIS
How does Log File Analysis differ from Web Crawl Analysis?
Home
Category
Subcategory
Detail
Web Crawl: systematically fetch, retrieve, and validate the HTML on every page of your website to simulate Googlebot’s/Bingbot’s analysis of your pages.
Let’s consider how the information is collected…
This is great for optimising your HTML code and helps you try to produce a best-in-class website.
But that’s not how search engines operate and crawling alone lacks the evidence to back up your strategy.
For example, Googlebot might enter through a popular category and crawl the same pages time after time. Search Console won’t tell you this, and neither will simulating a crawl from your homepage.
So, you need to crawl your architecture and compare the data to Google’s activity (via your log files) to gain an insight into how you’ll get more of your
money making pages crawled and indexed.
What barriers do people face when trying to study this vital information?
• Access to Server Logs
• File Sizes
• Misplacing trust in Search Console
• Time required to process the data
But I don’t think you should be deterred and here’s why…
Accessing your logs is simpler than you think. Your organisation is probably already using them.
Common Log Analysis use cases for eCommerce organisations include:
>> Application Management
>> Access Management
>> Network Forensics
>> Compliance
Popular products used by Applications and Security teams at major Enterprise companies include: LogRhythm, Loggly, and Splunk.
Splunk (a log file storage and processing company): Market Cap $8.6bn, 11,000 Customers
http://www.slideshare.net/Splunk/splunklive-london-john-lewis
This is a picture from a presentation I watched at SplunkLive London in 2016, where John Lewis visualised their operational intelligence from log files. You can get your logs!
It’s true that the volume of data involved can make working with the files prohibitive.
For example, if a site receives 50,000 visitors a day browsing an average of 5 pages per session, that’s 250,000 log entries per day for the HTML alone, or 7.5M entries per month. Now add 10 assets requested from the server for each page: 75,000,000 lines in your log files per month.
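The arithmetic above scales linearly, so it is simple to re-run with your own traffic figures. A quick sketch using the slide’s numbers:

```python
# Figures from the slide: a back-of-the-envelope log volume estimate.
visitors_per_day = 50_000
pages_per_session = 5
assets_per_page = 10
days_per_month = 30

html_hits_per_day = visitors_per_day * pages_per_session       # 250,000
html_hits_per_month = html_hits_per_day * days_per_month       # 7,500,000
asset_lines_per_month = html_hits_per_month * assets_per_page  # 75,000,000

print(f"{html_hits_per_month:,} HTML entries and "
      f"{asset_lines_per_month:,} asset lines per month")
```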
SEOs regularly monitor and trend site architecture data (HTTP codes, etc.) in third-party apps
but it’s not possible to scrutinise Search Console’s crawling and indexing charts, but you really should.
So, how is engineering helping us overcome these barriers and expand our knowledge?
>>> Secure File Transfer Protocol (SFTP)
>>> Storing and trending log data thanks to cloud services
>>> Processing automation (saving TIME)
>>> Diffing log data with simulated crawl data for greater insights
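That last point, diffing log data against a simulated crawl, boils down to a set comparison. A minimal sketch, assuming you have already extracted the two URL sets (all URLs here are illustrative):

```python
# URLs found by a simulated crawl of the site structure.
crawled_structure = {"/", "/women/", "/women/shoes/", "/women/shoes/converse14579/"}

# URLs Googlebot actually requested, extracted from 30 days of logs.
googlebot_urls = {"/", "/women/", "/old-promo/", "/women/shoes/"}

# Pages in the structure Google never visited: candidates for better linking.
not_crawled = crawled_structure - googlebot_urls

# URLs Google hits that aren't in the structure: orphans or crawl waste.
orphan_or_waste = googlebot_urls - crawled_structure

print(sorted(not_crawled))
print(sorted(orphan_or_waste))
```

Both differences are actionable: the first shows where crawl budget never reaches, the second shows where it is being wasted.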
Let’s move onto the questions I think you should be looking to answer.
What are the typical questions SEOs try and answer with Log Analysis?
• Where do I have accessibility errors?
• Which pages are being spidered most frequently?
• Is spammer activity proving detrimental to performance?
• Which pages haven’t been crawled by search engines?
And these are all very valid and helpful but I suggest looking at the next list too…
# 5 Critical Questions / KPIs
1. What is my ‘Crawl Ratio’?
2. What percentage of my compliant pages (2xx & unique) will Google crawl each month?
3. How deep will Google crawl into my site architecture?
4. What does Google consider to be my Top, Middle and Long Tail pages?
5. What is my ‘Crawl Window’ score?

HOW MANY MORE PAGES NOW HAVE THE POTENTIAL TO MAKE US MONEY, THANKS TO MY EFFORTS OVER THE PAST 30 DAYS?
Crawl Rate: the number of requests per second Googlebot makes to your site while crawling it
Crawl Budget: the maximum number of pages that Google crawls on a website
Crawl Frequency: the program determining which sites to crawl, how often, and how many pages to fetch from each site
Crawl Rank: the frequency a page is crawled compared with the ranking position of that page
Crawl Space: the totality of possible URLs for a website
Crawl Ratio: the percentage of my website structure Google is crawling every 30 days
Crawl Window: the percentage of the compliant (unique & 200) pages on my website Google usually crawls in a 14-day period
I’ve mentioned a few terms you might not be familiar with so here’s a list of old friends with a couple of new additions.
Critical Question 1 – What is my ‘Crawl Ratio’?
Crawl Ratio: the percentage of my website structure Google is crawling every 30 days
Crawl Ratio = (Total pages in the website structure crawled by Google in 30 days ÷ Total pages in the website structure) × 100
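Given a set of URLs from a crawl of your structure and a set of URLs Googlebot requested in your logs, the ratio takes only a few lines to compute (the inputs below are illustrative):

```python
def crawl_ratio(structure_urls, urls_crawled_by_google):
    """Percentage of the site structure Google crawled in the period."""
    crawled = structure_urls & urls_crawled_by_google
    return 100 * len(crawled) / len(structure_urls)

structure = {"/a", "/b", "/c", "/d"}
seen_in_logs = {"/a", "/c", "/x"}  # /x is outside the structure, so it doesn't count
print(f"{crawl_ratio(structure, seen_in_logs):.1f}%")  # 50.0%
```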
Organic Growth Opportunities
Examples across verticals: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified.
The Venn diagram clearly illustrates the mismatch between the URLs you hope Google is looking at and the accurate picture from your server logs.
Critical Question 2 – What percentage of my compliant pages (200 & unique) will Google crawl each month?
% of key pages crawled
(Total compliant pages crawled by Google in 30 days ÷ Total compliant pages in the website structure) × 100
Potential:
Lifestyle Publisher: 76.4%
Business Equipment Retail: 92.2%
Real Estate Classified: 42%
These examples reflect just how varied Google’s crawling of compliant pages can be.
Organic Growth Measure
Critical Question 3 – How deep will Google crawl into my website architecture?
To what depths will Google plunge?
This chart indicates the correlation between the depth of your content and Google’s crawling activity.
This chart indicates Google's crawl rate (URL crawled or not by any bot) by Internal Pagerank.
How can I more effectively use Pagerank to increase visibility?
Critical Question 4 – What does Google consider to be my Top, Middle and Long Tail pages?
This graph details visit frequency from Google search result pages for all URLs analysed by the crawler: how often URLs get organic visits from Google.
Then compare Organic Traffic with a measure of how often URLs are crawled by any Google bot.
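One way to sketch that comparison: bucket each URL by its organic visit count, then line the buckets up against Googlebot hit counts. The thresholds and counts below are illustrative, not from the talk:

```python
from collections import Counter

# Hypothetical per-URL counts over 30 days, extracted from your logs.
organic_visits = Counter({"/a": 900, "/b": 40, "/c": 2, "/d": 0})
googlebot_hits = Counter({"/a": 120, "/b": 15, "/c": 1, "/d": 0})

def tail_bucket(visits):
    """Classify a URL as top/middle/long tail; tune thresholds to your site."""
    if visits >= 100:
        return "top"
    if visits >= 10:
        return "middle"
    return "long"

# URLs where Google crawls a lot but sends little traffic (or vice versa)
# are the interesting ones to investigate.
for url in organic_visits:
    print(url, tail_bucket(organic_visits[url]), "crawls:", googlebot_hits[url])
```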
Increase your Middle Tail
Critical Question 5 – What is my ‘Crawl Window’?
Crawl Window: the percentage of my compliant URLs Google usually crawls in a 14-day period*
When a change appears on the website, either voluntary or involuntary, understanding your Crawl Window value will help you know precisely how long it will take to identify a positive/negative impact.
*This is a simplified calculation of Botify’s Crawl Window metric.
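Under that simplified definition, the score can be sketched as a function of which dates each compliant URL appears in the logs (names and sample data are illustrative, and this is not Botify’s exact formula):

```python
from datetime import date, timedelta

def crawl_window_score(compliant_urls, crawl_dates, window_start, days=14):
    """Percentage of compliant URLs Googlebot requested inside the window.

    crawl_dates maps each URL to the dates it appears in the logs.
    """
    window_end = window_start + timedelta(days=days)
    crawled = {
        url for url in compliant_urls
        if any(window_start <= d < window_end for d in crawl_dates.get(url, ()))
    }
    return 100 * len(crawled) / len(compliant_urls)

compliant = {"/a", "/b", "/c", "/d"}
dates = {"/a": [date(2016, 7, 2)], "/b": [date(2016, 7, 10)], "/c": [date(2016, 8, 1)]}
print(crawl_window_score(compliant, dates, date(2016, 7, 1)))  # 50.0
```

Slide the window across your log history and you can see whether the score is stable or drifting after a site change.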
Real Estate Classified: 25.5%
Business Equipment Retail: 80.8%
Lifestyle Publisher: 66.3%
You might find the five-question checklist above helpful.
THANK YOU!
Take a Free Trial via www.botify.com