Slide deck from Tom Bennet's presentation at Brighton SEO, September 2014. Accompanying guide can be found here: http://builtvisible.com/log-file-analysis/
Image Credits:
https://www.flickr.com/photos/nullvalue/4188517246
https://www.flickr.com/photos/small_realm/11189803763/
https://www.flickr.com/photos/florianric/7263382550
http://fotojenix.wordpress.com/2011/07/08/weekly-photo-challenge-old-fashioned/
4. What is a log file?
A record of all hits that a server has received – humans and robots.
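For illustration (a hypothetical entry, not from the deck), a single hit in an Apache access log in the common "combined" format looks something like this:
66.249.66.1 - - [18/Sep/2014:10:15:32 +0000] "GET /about/ HTTP/1.1" 200 5124 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
One line per request: requesting IP, timestamp, the request itself, response code, bytes sent, referrer and user-agent.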
http://www.brightonseo.com/about/
1. Protocol
2. Host name
3. File name
Host name -> IP address via DNS -> Connection to server -> HTTP GET request (via the protocol) for the file -> HTML returned to the browser
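For example, the request sent over that connection for the URL above is plain text, roughly:
GET /about/ HTTP/1.1
Host: www.brightonseo.com
It is this request that ends up recorded in the access log.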
11. Preparing Your Data
Extraction: Varies by server. See accompanying guide.
Filter: By Googlebot user-agent, then verify the requesting IPs really belong to Google (reverse-DNS check): https://support.google.com/webmasters/answer/80553?hl=en (example flag formula after this slide).
Tools: Gamut and Splunk are great, but you can’t beat Excel.
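A minimal Excel sketch of the user-agent filter, assuming the log data sits in a table named Logs with a User_Agent column (both names are assumptions, not from the deck): add a helper column and filter it for TRUE.
=ISNUMBER(SEARCH("Googlebot", [@User_Agent]))
This only checks the claimed user-agent; the reverse-DNS check linked above is what catches spoofed bots.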
12. Working in Excel
1. Convert .log to .csv
(cool tip: just change the file extension)
13. Working in Excel
2. Sample size
(60-120k Googlebot requests / rows is a good size)
14. Working in Excel
3. Text-to-columns
(a space will usually be a suitable delimiter)
15. Working in Excel
4. Create a table
(Label your columns, sort by timestamp)
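As a rough guide (an assumption based on the combined-format example earlier, not part of the deck), a combined-format line yields columns along the lines of: requesting IP, identity fields, date/time, request method, request URL, protocol, response code, bytes, referrer and user-agent (the user-agent may spill across several cells; exact columns depend on your log format and text-qualifier settings). Label accordingly.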
17. Most vs Least Crawled
Formula: Use COUNTIF on Request URL.
Tip: Extract the top-level category to see crawl distribution by site section (formula sketches below).
http://www.brightonseo.com/speakers/person-name/
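Two hedged sketches, assuming a Logs table whose Request_URL column holds root-relative paths like the example above (table and column names are assumptions).
Next to a deduped list of URLs (one per row, e.g. in A2):
=COUNTIF(Logs[Request_URL], A2)
As a helper column inside the Logs table, pulling the first path segment ("speakers" for the URL above; root-level pages fall back to "root"):
=IFERROR(MID([@Request_URL], 2, FIND("/", [@Request_URL], 2) - 2), "root")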
18. Crawl Frequency Over Time
Formula: Pivot date against count of requests.
Tip: Segment by site section or by user-agent (Googlebot-Mobile, Googlebot-Image, Googlebot-Video, etc.). A formula alternative to the pivot follows.
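If you would rather not pivot, a rough equivalent with COUNTIFS (assuming a Date column holding real date values plus the User_Agent column from earlier; names are assumptions) counts requests per day, optionally narrowed to one crawler. With a date in A2:
=COUNTIFS(Logs[Date], $A2)
=COUNTIFS(Logs[Date], $A2, Logs[User_Agent], "*Googlebot-Image*")
The wildcards work because the user-agent cells are plain text.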
19. HTTP Response Codes
Formula: Total up HTTP Response Codes.
Tip: Find the most common 302s or 404s: filter by response code, then sort by URL occurrence (sketch below).
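A sketch, again assuming Response_Code and Request_URL columns in the Logs table (names are assumptions): the first formula totals a status code across the log, the second counts how often one URL (in A2) returned it.
=COUNTIF(Logs[Response_Code], 404)
=COUNTIFS(Logs[Response_Code], 404, Logs[Request_URL], A2)
If text-to-columns imported the codes as text rather than numbers, use "404" as the criterion instead.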
21. Level Up
Robots.txt – Crawl all URLs with Screaming Frog to determine if they are blocked in robots.txt. Investigate most frequently crawled.
Faceted Nav Issues – Dedupe the requested URLs into a list of unique resources and sort by times requested.
Sitemap – Add your sitemap URLs into an Excel table and VLOOKUP them against your logs (sketch after this list). Which mapped URLs are crawl deficient?
CSS / JS – These resources should be crawlable, but are files unnecessary for rendering absorbing an inordinate amount of crawl budget?
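A minimal sketch of the sitemap check, assuming a Sitemap table with a URL column trimmed to the same root-relative form as the logged request paths (table and column names are assumptions): add a helper column to the Sitemap table.
=IF(ISNA(VLOOKUP([@URL], Logs[Request_URL], 1, FALSE)), "Never crawled", "Crawled")
A COUNTIF against the same log column gives crawl frequency rather than a simple yes/no.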
22. Top Level Crawl Waste
Formula: Use IF statements to check for each cause of waste (see the sketch below).
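A rough sketch of one such check as a helper column in the Logs table, assuming the Response_Code and Request_URL columns from earlier (the causes of waste worth flagging will vary by site): non-200 responses and parameterised URLs.
=IF([@Response_Code]<>200, "Non-200", IF(ISNUMBER(FIND("?", [@Request_URL])), "Parameter URL", "OK"))
Extend the nesting, or add further helper columns, for other causes such as URLs disallowed in robots.txt or missing from the sitemap.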