LOG FILE ANALYSIS
5 CRITICAL TECH SEO QUESTIONS YOUR LOGS CAN ANSWER
#BrightonSEO | @SearchMATH
As used by…
About Botify
Here’s the problem…
> Google doesn’t crawl every page of your website
>> If a page isn’t crawled, it won’t be indexed
>>> If a page isn’t indexed, it won’t make you money
Identify Desired Outcomes and
Objectives
Information Gathering
Action Planning
Implementation and Review
New Initiative Planning Process
This presentation will focus on the “Information Gathering” stage of the
process.
Dawn Anderson’s slide deck “BRINGING IN THE FAMILY DURING CRAWLING” is an insightful guide to help you identify crawl budget opportunities. Dawn also suggests powerful actions you should explore.
Log File 101
Hypertext Transfer Protocol (HTTP)
Client → Server (HTTP Request):

GET /index.html HTTP/1.1
Host: www.exampleshop.com
User-Agent: Mozilla/5.0

Server → Client (HTTP Response):

HTTP/1.1 200 OK
Date: Mon, 11 Jul 2016 08:06:45 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 04 Feb 2016 23:11:55 GMT
Etag: “3f84f-1b9-3elcd16b”
Accept-Ranges: bytes
Content-Length: 458
Connection: close
Content-Type: text/html; charset=UTF-8
Fig 1: HTTP Client/Server Communication
This is a standard HTTP/1.1 exchange between Client (e.g. Browser or Googlebot) & your Server.
Server Log Files
Server
188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] "GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Server IP | Timestamp (date & time) | Method (GET/POST) | Request URI | HTTP status code | User-agent
Fig 2: Example Server Output (WHO’S REQUESTING? | WHEN? | HOW? | WHAT FILE? | SERVER RESPONSE)

Server logs are the SINGLE SOURCE OF TRUTH when it comes to seeing how search engines, such as Googlebot, assess your website.
Your web server keeps a file of every hit it receives during exchanges like the one on the previous slide. Your very own data treasure chest.
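A line like the one in Fig 2 is easy to pull apart programmatically. A minimal sketch in Python, assuming the combined log format shown above (the pattern, field names, and helper are my own, not from the talk):

```python
import re

# Pattern for a combined-format access log line:
# IP, identd, user, timestamp, request line, status, referer, user-agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<referer>\S+) "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

line = ('188.65.114.122 - - [19/Jul/2016:08:07:05 -0400] '
        '"GET /women/shoes/converse14579/ HTTP/1.1" 200 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')

entry = parse_line(line)
print(entry['path'], entry['status'], 'Googlebot' in entry['user_agent'])
```

Run this over every line in the file and you have the raw material for all of the questions that follow.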
“[Cleanup your architecture because] we get lost crawling unnecessary URLs and we might not be able to crawl and index your new and updated content as quickly as we would otherwise… There are a number of crawlers you can use to crawl your website on your own, to run across your website.” Google Webmaster Central office hours hangout, 16 Oct 2015
@JohnMu: Crawl your website with a THIRD-PARTY CRAWLER

@JohnMu: Conduct LOG FILE ANALYSIS
How does Log File Analysis differ from Web Crawl Analysis?
Home
Category
Subcategory
Detail
Web Crawl: systematically fetch, retrieve, and validate the HTML on every page of your website to simulate Googlebot’s/Bingbot’s analysis of your pages.
Let’s consider how the information is collected…
This is great for optimising your HTML code and helps you try to produce a best-in-class website.
But that’s not how search engines operate and crawling alone lacks the evidence to back up your strategy.
For example, Googlebot might enter through a popular category and crawl the same pages time after time. Search Console won’t tell you this, and neither will simulating a crawl from your homepage.
So, you need to crawl your architecture and compare the data to Google’s activity (via your log files) to gain an insight into how you’ll get more of your
money making pages crawled and indexed.
What barriers do people face when trying to study this vital information?
• Access to Server Logs
• File Sizes
• Misplacing trust in Search Console
• Time required to process the data
But I don’t think you should be deterred and here’s why…
Accessing your logs is simpler than you think. Your organisation is probably already using them.
Common Log Analysis use cases for eCommerce organisations include:
>> Application Management
>> Access Management
>> Network Forensics
>> Compliance
Popular products used by Applications and Security teams at major Enterprise companies include: LogRhythm, Loggly, and Splunk.
Splunk (a log file storage and processing company): Market Cap $8.6bn, 11,000 Customers
http://www.slideshare.net/Splunk/splunklive-london-john-lewis
This is a picture from a presentation I watched at SplunkLive London in 2016, where John Lewis visualised their operational intelligence from log files. You can get your logs!
It’s true that the volume of data involved can make working with the files prohibitive.
For example, if a site receives 50,000 visitors a day browsing an average of 5 pages per session, that’s 250,000 log entries per day for the HTML alone, or 7.5M entries per month. Now add 10 assets requested from the server for each page: 75,000,000 lines in your log files per month.
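The arithmetic above scales linearly, so it is simple to re-run with your own traffic figures. A quick sketch using the slide’s numbers:

```python
# Figures from the slide: a back-of-the-envelope log volume estimate.
visitors_per_day = 50_000
pages_per_session = 5
assets_per_page = 10
days_per_month = 30

html_hits_per_day = visitors_per_day * pages_per_session       # 250,000
html_hits_per_month = html_hits_per_day * days_per_month       # 7,500,000
asset_lines_per_month = html_hits_per_month * assets_per_page  # 75,000,000

print(f"{html_hits_per_month:,} HTML entries and "
      f"{asset_lines_per_month:,} asset lines per month")
```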
SEOs regularly monitor and trend site architecture data (HTTP codes, etc.) in third-party apps
but it’s not possible to scrutinise Search Console’s crawling and indexing charts, but you really should.
So, how is engineering helping us overcome these barriers and expand our knowledge?
>>> Secure File Transfer Protocol (SFTP)
>>> Storing and trending log data thanks to cloud services
>>> Processing automation (saving TIME)
>>> Diffing log data with simulated crawl data for greater insights
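That last point, diffing log data against a simulated crawl, boils down to a set comparison. A minimal sketch, assuming you have already extracted the two URL sets (all URLs here are illustrative):

```python
# URLs found by a simulated crawl of the site structure.
crawled_structure = {"/", "/women/", "/women/shoes/", "/women/shoes/converse14579/"}

# URLs Googlebot actually requested, extracted from 30 days of logs.
googlebot_urls = {"/", "/women/", "/old-promo/", "/women/shoes/"}

# Pages in the structure Google never visited: candidates for better linking.
not_crawled = crawled_structure - googlebot_urls

# URLs Google hits that aren't in the structure: orphans or crawl waste.
orphan_or_waste = googlebot_urls - crawled_structure

print(sorted(not_crawled))
print(sorted(orphan_or_waste))
```

Both differences are actionable: the first shows where crawl budget never reaches, the second shows where it is being wasted.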
Let’s move onto the questions I think you should be looking to answer.
What are the typical questions SEOs try and answer with Log Analysis?
• Where do I have accessibility errors?
• Which pages are being spidered most frequently?
• Is spammer activity proving detrimental to performance?
• Which pages haven’t been crawled by search engines?
And these are all very valid and helpful but I suggest looking at the next list too…
# 5 Critical Questions / KPIs
1. What is my ‘Crawl Ratio’?
2. What percentage of my compliant pages (2xx & unique) will Google crawl each month?
3. How deep will Google crawl into my site architecture?
4. What does Google consider to be my Top, Middle and Long Tail pages?
5. What is my ‘Crawl Window’ score?

HOW MANY MORE PAGES NOW HAVE THE POTENTIAL TO MAKE US MONEY, THANKS TO MY EFFORTS OVER THE PAST 30 DAYS?
Crawl Rate: the number of requests per second Googlebot makes to your site while crawling it
Crawl Budget: the maximum number of pages that Google crawls on a website
Crawl Frequency: the program determining which sites to crawl, how often, and how many pages to fetch from each site
Crawl Rank: the frequency a page is crawled compared with the ranking position of that page
Crawl Space: the totality of possible URLs for a website
Crawl Ratio: the percentage of my website structure Google is crawling every 30 days
Crawl Window: the percentage of the compliant (unique & 200) pages on my website Google usually crawls in a 14-day period
I’ve mentioned a few terms you might not be familiar with so here’s a list of old friends with a couple of new additions.
Critical Question 1 – What is my ‘Crawl Ratio’?
Crawl Ratio: the percentage of my website structure Google is crawling every 30 days
Crawl Ratio = (Total pages in the website structure crawled by Google in 30 days ÷ Total pages in the website structure) × 100
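Given a set of URLs from a crawl of your structure and a set of URLs Googlebot requested in your logs, the ratio takes only a few lines to compute (the inputs below are illustrative):

```python
def crawl_ratio(structure_urls, urls_crawled_by_google):
    """Percentage of the site structure Google crawled in the period."""
    crawled = structure_urls & urls_crawled_by_google
    return 100 * len(crawled) / len(structure_urls)

structure = {"/a", "/b", "/c", "/d"}
seen_in_logs = {"/a", "/c", "/x"}  # /x is outside the structure, so it doesn't count
print(f"{crawl_ratio(structure, seen_in_logs):.1f}%")  # 50.0%
```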
Organic Growth Opportunities
Examples across verticals: Lifestyle Publisher, Business Equipment Retail, Real Estate Classified.
The Venn diagram clearly illustrates the mismatch between the URLs you hope Google is looking at and the accurate picture from your server logs.
Critical Question 2 – What percentage of my compliant pages (200 & unique) will Google crawl each month?
% of key pages crawled
(Total compliant pages crawled by Google in 30 days ÷ Total compliant pages in the website structure) × 100
Potential:
Lifestyle Publisher: 76.4%
Business Equipment Retail: 92.2%
Real Estate Classified: 42%
These examples reflect just how varied Google’s crawling of compliant pages can be.
Organic Growth Measure
Critical Question 3 – How deep will Google crawl into my website architecture?
To what depths will Google plunge?
This chart indicates the correlation between the depth of your content and Google’s crawling activity.
This chart indicates Google's crawl rate (URL crawled or not by any bot) by Internal Pagerank.
How can I more effectively use Pagerank to increase visibility?
Critical Question 4 – What does Google consider to be my Top, Middle and Long Tail pages?
This graph details visit frequency from Google search result pages for all URLs analysed by the crawler: how often URLs get organic visits from Google.
Then compare Organic Traffic with a measure of how often URLs are crawled by any Google bot.
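One way to sketch that comparison: bucket each URL by its organic visit count, then line the buckets up against Googlebot hit counts. The thresholds and counts below are illustrative, not from the talk:

```python
from collections import Counter

# Hypothetical per-URL counts over 30 days, extracted from your logs.
organic_visits = Counter({"/a": 900, "/b": 40, "/c": 2, "/d": 0})
googlebot_hits = Counter({"/a": 120, "/b": 15, "/c": 1, "/d": 0})

def tail_bucket(visits):
    """Classify a URL as top/middle/long tail; tune thresholds to your site."""
    if visits >= 100:
        return "top"
    if visits >= 10:
        return "middle"
    return "long"

# URLs where Google crawls a lot but sends little traffic (or vice versa)
# are the interesting ones to investigate.
for url in organic_visits:
    print(url, tail_bucket(organic_visits[url]), "crawls:", googlebot_hits[url])
```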
Increase your Middle Tail
Critical Question 5 – What is my ‘Crawl Window’?
Crawl Window: the percentage of my compliant URLs Google usually crawls in a 14-day period*
When a change appears on the website, either voluntary or involuntary, understanding your Crawl Window value will help you know precisely how long it will take to identify a positive/negative impact.
*This is a simplified calculation of Botify’s Crawl Window metric.
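Under that simplified definition, the score can be sketched as a function of which dates each compliant URL appears in the logs (names and sample data are illustrative, and this is not Botify’s exact formula):

```python
from datetime import date, timedelta

def crawl_window_score(compliant_urls, crawl_dates, window_start, days=14):
    """Percentage of compliant URLs Googlebot requested inside the window.

    crawl_dates maps each URL to the dates it appears in the logs.
    """
    window_end = window_start + timedelta(days=days)
    crawled = {
        url for url in compliant_urls
        if any(window_start <= d < window_end for d in crawl_dates.get(url, ()))
    }
    return 100 * len(crawled) / len(compliant_urls)

compliant = {"/a", "/b", "/c", "/d"}
dates = {"/a": [date(2016, 7, 2)], "/b": [date(2016, 7, 10)], "/c": [date(2016, 8, 1)]}
print(crawl_window_score(compliant, dates, date(2016, 7, 1)))  # 50.0
```

Slide the window across your log history and you can see whether the score is stable or drifting after a site change.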
Real Estate Classified: 25.5%
Business Equipment Retail: 80.8%
Lifestyle Publisher: 66.3%
You might find the five-question checklist above helpful.
THANK YOU!
Take a Free Trial via www.botify.com