
Pubcon Florida 2018: Logs Don’t Lie (Dawn Anderson)



Here we take a look at server log file analysis for SEO and explore not only the benefits but also the process of finding, gathering, shipping and analysing user agent logs

Published in: Marketing

  1. 1. #pubcon Logs Don’t Lie – SEO Wins & Server Log File Analysis Presented by: Dawn Anderson @dawnieando
  2. 2. #pubcon Dawn Anderson • Move It Marketing • University Lecturer – Digital Marketing • From Manchester, UK (rains a lot) • International SEO Consultant – 11+ yrs in SEO • Pomeranian pooch lover – Bert & Tedward • Fascinated by crawling (practice & academia) • Doesn’t fare well in YouTube screen grabs ;P • Party trick: Remembering UK postcode areas (US Zip code equivalent) • Search Awards Judge • Twitter chatterer @dawnieando
  3. 3. #pubcon What Server Log File Analysis is NOT
  4. 4. #pubcon It’s not just about ‘crawl budget’
  5. 5. #pubcon Crawl Budget: 1. It’s not a real term, but 2 notions together 2. Host Load + URL Scheduling combined 3. Crawling ‘Politeness’ IS A BIG THING 4. How much can we crawl & how often? 5. What is important to crawl & how often? 6. We need to not ‘obsess’ over this if we have small sites 7. WE NEED TO TAKE A LOOK THROUGH ‘SPIDER EYES’
  6. 6. #pubcon When could it be worth it? • You have a huge site with many parameters • You’ve spotted anomalies in Google Analytics traffic URLs • You’re trying to consolidate site URLs • You’ve got big legacy
  7. 7. #pubcon Why Server Log File Analysis?
  8. 8. #pubcon Many reasons… But… Here’s one… An infinite loop can kill a site over time
  9. 9. #pubcon Exponentially Multiplicative URLs From Faceted Navigation… 100 DRESSES • 5 COLOURS • 10 SIZES • 2 LENGTHS • 4 SUPPLIERS: 100 x 5 x 10 x 2 x 4 = 40,000 URLs
  10. 10. #pubcon And that’s without HTTPS, WWW/non-WWW or internationalization: 100 DRESSES • 5 COLOURS • 10 SIZES • 2 LENGTHS • 4 SUPPLIERS: 100 x 5 x 10 x 2 x 4 = 40,000 URLs • X 2 BECAUSE… HTTPS VERSION = 80,000 URLs • X 2 BECAUSE… WWW / NON-WWW VERSION = 160,000 URLs • X 5 BECAUSE… EN / FR / ES / DE / IT (e.g.) = 800,000 URLs
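The multiplication on these two slides can be checked with a few lines of Python. This is only a sketch of the arithmetic; the dictionary names are illustrative, and the counts are the ones from the dress example.

```python
from math import prod

# Facet counts taken from the dress example on the slide.
facets = {"dresses": 100, "colours": 5, "sizes": 10, "lengths": 2, "suppliers": 4}
base_urls = prod(facets.values())

# Each further variant multiplies the total again.
variants = {"http/https": 2, "www/non-www": 2, "languages": 5}
total_urls = base_urls * prod(variants.values())

print(f"{base_urls:,} faceted URLs, {total_urls:,} with all variants")
```

Adding one more facet with only three values would already triple both totals, which is why faceted navigation grows so quickly.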
  11. 11. #pubcon Why is Server Log File Analysis Important? • Detecting orphaned URLs • Understanding URL crawl frequency • Detecting server errors • Understanding the % of ‘healthy’ crawling • IDENTIFYING WEAK AREAS IN A SITE
  12. 12. #pubcon A consolidation of signals to preferred URLs can win with SEO CONSISTENCY IS KING
  13. 13. #pubcon Detective Meets Detective
  14. 14. #pubcon We are stalking Googlebots (as detectives) and trying to walk their paths to understand their experiences as they traverse a site WHAT ‘CLUES’ ARE WE PROVIDING?
  15. 15. #pubcon Is Google (the detective) picking up on our clues? • Canonical tags • XML sitemaps • Hreflang • Internal links • Pagination • URL parameters
  16. 16. #pubcon Is Google getting your ’hints’? ONLY HINTS ‘NOT’ DIRECTIVES??
  17. 17. #pubcon Directive or Hint?… Either way, we need to ensure our clues are working Likely pretty strong hints… or maybe ’nearly’ directives ;P ;P ;P… It depends ;P
  18. 18. #pubcon Every site will have its own crawling rules DUSTBUSTER CRAWLING RULES BUILDS ‘HINTS’ ON WHAT NOT TO CRAWL DO NOT CRAWL IN THE DUST
  20. 20. #pubcon Why ‘Sampling’ in crawling for efficiency? Is it worth it? Should we crawl more? Are there lots of important URLs here? Do the URLs ‘genuinely’ change frequently? Are the changes ‘important’ (weighted) or is it just ‘DUST’?
  21. 21. #pubcon There is likely also ‘quilting’ detection: Detecting Quilted Web Pages at Scale (Najork, M., 2012). Finds pages ‘stitched’ together to make other pages. Image credit: Najork, Marc
  22. 22. #pubcon Breadth First Crawling or Other Crawling Strategies (OR SOMETHING MUCH BETTER THAN THIS SINCE CAFFEINE??)
  23. 23. #pubcon Do NOT get me started on JavaScript & dependent files THEY ARE ALL NEEDED IN INDEXING & GATHERED IN CRAWLING
  24. 24. #pubcon If you use a ‘Builder’ theme in WordPress this will be very evident THE CACHES CREATED GET CRAWLED… a LOT
  25. 25. #pubcon Yoast and Googlebot access… • Yoast has unblocked access to Googlebot in its plugin pretty much everywhere • You might even find Googlebot is trying to access wp-admin • Googlebot needs all dependent files to render the page (including JavaScript and CSS files)
  26. 26. #pubcon Hunting for Googlebots? Where can we find them?
  27. 27. #pubcon In your quest…You may face some challenges along the way
  28. 28. #pubcon Google Search Console is Where It’s At… Right?
  29. 29. #pubcon We Can See Some Symptoms Here Though • SPEED ISSUES • ‘BOREDOM’ ISSUES • ROGUE CODE
  30. 30. #pubcon We Can See Some Symptoms Here Too AFTER REMOVAL OF CANONICAL AT SCALE
  31. 31. #pubcon Signs in Google Search Console Crawl Stats • Possible signs of ‘quality-impacted’ content • Near duplicates & duplicates • Not just speed… ‘boredom’ too • Major site changes or switches to https protocol • Signs of ‘Sampling’ visits for quality • Getting the best yield for crawling • Transitive nature • Slow sites
  32. 32. #pubcon GOOGLE SEARCH CONSOLE IS NOT REALLY JUST ‘WEB PAGES’ • Includes ALL CSS, JS, Zip, XML, PDF, AMP, HTML files crawled • Pages are NOT just single webpages • 5253: not just ‘web pages’
  33. 33. #pubcon VISITS BY ALL THE TYPES OF GOOGLEBOTS ARE RECORDED TOGETHER IN GSC • Web • Image • News • Video • Feature Phone • Smartphone • Mobile AdSense • AdSense • AdsBot • App Crawler • ALL The Googlebot Family
  35. 35. #pubcon Don’t Be Fooled By Those ‘Big Success’ Screen Grabs on Twitter CAN BE BOTH HTTP AND HTTPS OR SIMPLY A MAJOR IN-SITE REDIRECTION EXERCISE
  36. 36. #pubcon Google Search Console is Like a Visit To The GP Before Referral VAGUE AT BEST
  37. 37. #pubcon So…We Need To Dig Deeper… Be More Curious
  38. 38. #pubcon Finding Awstats on cPanel
  39. 39. #pubcon What URLs Though?
  40. 40. #pubcon REALITY – Server Logs & Log Analysis Is Where It’s At AUTOMATE SERVER LOG RETRIEVAL VIA CRON JOB grep Googlebot access_log >googlebot_access.txt
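For anyone who would rather schedule a script than a shell one-liner, the following is a rough Python equivalent of the `grep` command on the slide. It is a minimal sketch: the function name `filter_googlebot` is made up for illustration, and the file names are simply the ones used in the slide's command.

```python
from pathlib import Path


def filter_googlebot(source: str, target: str) -> int:
    """Copy every line mentioning 'Googlebot' from source to target; return the count."""
    kept = 0
    with open(source, encoding="utf-8", errors="replace") as src, \
            open(target, "w", encoding="utf-8") as dst:
        for line in src:
            if "Googlebot" in line:
                dst.write(line)
                kept += 1
    return kept


if __name__ == "__main__":
    # Only runs when an access_log file sits next to the script.
    if Path("access_log").exists():
        print(filter_googlebot("access_log", "googlebot_access.txt"))
```

Like the `grep` version, this matches on the user agent string only, so spoofed bots will still be included until the IPs are verified.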
  41. 41. #pubcon But… what are logs? Log files? Log file analysis… really?
  42. 42. #pubcon Not just this… :) But kind of this :)
  43. 43. #pubcon Not just this… :) But kind of this :) Logs are everywhere
  44. 44. #pubcon So… What’s a log file? • A document containing one or more logs • Usually exported as a .txt file in ‘Common Log Format’ (W3C) from the server • Common log format contains specific fields • May be bundled in a tar or .gz file • Can be exported from ‘raw access’ in the server
  45. 45. #pubcon Elements of Server Log File (Common Log Format) • IP Address of bot • Date Accessed (and time) • Request type (GET, HEAD, POST) • URL Requested • Server Response Code Returned • HTTP Code • Bytes Served (file size) • User Agent
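The fields listed above can be pulled out of a raw line with a regular expression. The sketch below assumes the Apache-style "combined" layout, which is the Common Log Format plus referrer and user agent (the user agent is not part of plain Common Log Format). The pattern, the function name `parse_line` and the sample line are all made up for illustration; real servers can be configured with different field orders.

```python
import re
from typing import Optional

# Assumed layout: ip ident user [time] "method url protocol" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample line using a documentation-only IP address.
sample = ('192.0.2.1 - - [10/Apr/2018:06:25:14 +0000] "GET /dresses?colour=red HTTP/1.1" '
          '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"')
entry = parse_line(sample)
print(entry["url"], entry["status"], entry["user_agent"])
```

All values come back as strings, so the status code is `"200"` rather than the number 200.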
  46. 46. #pubcon Purposes & Types of logs on Servers 01 Error logs (database errors + server response codes, bad code warnings) 02 Suspicious activity (Security implications / monitoring) – DDOS, spammers, hackers 03 Visits to a file / page / Page URL requests (both human & bots)
  47. 47. #pubcon Monitor Other Types of Logs Too Error logs provide great insight into where there may be issues with overloading the server etc. If errors (e.g. 500 codes) are sent frequently, Google will pull back and crawl less
  48. 48. #pubcon A Better Explanation of Log File Analysis • Interrogation of data • Looking for patterns • Looking for anomalies • Looking for split messaging about URL importance to Google and finding ways to consolidate consistent signaling to single content fingerprints - STRENGTH
  49. 49. #pubcon Recap - Simple Explanation • Logs – simply a notation or record of something • Log files – simply the document where the log is stored • Log file analysis – simply analysing and exploring the log files to identify areas where optimisation is possible or wastage occurs
  50. 50. #pubcon Hunting for Googlebots? How will we know we’ve found them?
  51. 51. #pubcon The Good, The Bad & The Ugly Good bots (polite, non-malicious, usually from respectable organisations, e.g. search engines & good tool providers) Bad bots (impolite, may be malicious, scraper bots, spammers)
  52. 52. #pubcon ‘Politeness’ Crawling Rules 01 Do NOT damage the server 02 Do NOT damage the server 03 Do NOT damage the server
  53. 53. #pubcon There are ALSO many, many ‘POLITE’ bots • Yahoo Slurp • Bingbot • Other Search Engine Bots • 10 Types of Googlebot • SEO Tool Bots (OnPage, Sistrix, Deepcrawl, etc)
  54. 54. #pubcon Verifying it’s really Googlebot • Spoofing • Google don’t publish a list of IPs any longer (they change too frequently) • Need to verify by reverse DNS lookup on server using HOST to ensure visits are really from Googlebot and not spoof bots (e.g. Screaming Frog)
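The reverse DNS check described above can also be scripted. The sketch below follows the usual two-step approach: a reverse lookup on the IP, a check that the host name ends in `googlebot.com` or `google.com`, then a forward lookup to confirm the host name maps back to the same IP. The function names are made up for illustration, the resolvers are passed in as parameters so the logic can be tried without network access, and `socket.gethostbyname_ex` only covers IPv4.

```python
import socket
from typing import Callable, List

# Host name suffixes accepted as genuine Google crawlers (an assumption of this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to host name."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    """Forward DNS: host name to its IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(ip: str,
                      reverse: Callable[[str], str] = _reverse,
                      forward: Callable[[str], List[str]] = _forward) -> bool:
    """True if the IP reverse-resolves to a Google host that resolves back to the same IP."""
    try:
        host = reverse(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers failed lookups (socket.herror, socket.gaierror)
        return False
```

In practice it is worth caching the result per IP, since the same crawler addresses appear many times in a log.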
  55. 55. #pubcon Examples of Googlebot Organic Search Calling Cards (user agents) • Desktop - Mozilla/5.0 (compatible; Googlebot/2.1; + • Smartphone - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; + • Googlebot-News • Googlebot-Image/1.0
  56. 56. #pubcon So what’s the server log file analysis process? Gather • Ship • Analyse
  57. 57. #pubcon Gathering server logs
  58. 58. #pubcon Gathering logs straight from the server FIND ‘RAW ACCESS’
  59. 59. #pubcon Gathering logs straight from the server These only represent a few hours’ worth of data / activity GOOD FOR A QUICK LOOK IF YOU SEE SOME PROBLEMS OCCURRING
  60. 60. #pubcon Archived raw logs… is what you want • A log of everything on the server • All of the separate subdomains • All of the separate protocols • Zipped Up
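Archived logs usually arrive compressed, so a small helper that reads both plain and gzipped files saves unpacking them by hand. This is a sketch only: `read_log_lines` is a made-up name, and it handles single `.gz` files, not tar bundles (those would need the `tarfile` module).

```python
import gzip
from typing import Iterator


def read_log_lines(path: str) -> Iterator[str]:
    """Yield lines from a plain-text or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Because it yields one line at a time, it can walk through large archives without loading them fully into memory.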
  61. 61. #pubcon Shipping server logs
  62. 62. #pubcon Shipping logs to analytical tools • Docker • Naked Eye • Excel • Text File • Cloud based log analysis software • GREP • Command Line • Downloaded log analysis software Carrying ‘ALL’ the logs
  63. 63. #pubcon Opening log files manually YOU’LL NEED A TEXT EDITOR OR EVEN BETTER AN IDE (INTEGRATED DEVELOPMENT ENVIRONMENT) e.g. NOTEPAD++ BRACKETS (on MAC) • Notepad++ • Notepad • Komodo • Aptana • Netbeans • Eclipse • Brackets (Mac)
  64. 64. #pubcon They look something like this… SERVER LOGS EXAMPLE IN ‘BRACKETS’ IDE TEXT EDITOR
  65. 65. #pubcon Move them all to Excel CTRL+A, copy, and paste all into an Excel spreadsheet
  66. 66. #pubcon Convert text to data in Excel Choose ‘DATA’ and convert text to columns. Delimit with ‘space’ (usually)
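The same 'text to columns' step can be done before the data reaches Excel. The sketch below uses Python's `csv` module with a space delimiter and a double quote as text qualifier, so quoted fields such as the request and the user agent stay in one column. The function names and the sample line are invented for illustration; lines containing escaped quotes may still split imperfectly.

```python
import csv
import io
from typing import Iterable, List


def log_to_rows(lines: Iterable[str]) -> List[List[str]]:
    """Split space-delimited log lines into columns, keeping quoted fields whole."""
    return list(csv.reader(lines, delimiter=" ", quotechar='"'))


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialise the rows as comma-separated text that Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Invented sample line using a documentation-only IP address.
sample = ('192.0.2.1 - - [10/Apr/2018:06:25:14 +0000] "GET /x HTTP/1.1" '
          '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"')
rows = log_to_rows([sample])
print(rows[0])
```

Note that the timestamp splits into two columns because it contains a space, which is also what Excel does with a space delimiter.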
  67. 67. #pubcon Filter by verified user agent FILTER USER AGENT on verified DNS lookup HOST
  68. 68. #pubcon Many ‘scale’ tools for log file analysis
  70. 70. #pubcon You cannot ‘emulate’ a Googlebot crawl It is not a ‘from start to finish’ crawl through a site
  71. 71. #pubcon You may see strange URLs • Old .htaccess rules run in order of their placement in the .htaccess file • Old .htaccess rules on ‘migrated-from’ sites • Old XML sitemaps left on server • Old plugins removed but folders left • Un-optimized MySQL or other database (cluttered with legacy) • Spammers hitting your search queries & randomly spinning new links
  72. 72. #pubcon Understanding the problems
  74. 74. #pubcon If you migrate or switch protocol… import & monitor all logs… they will chain
  76. 76. #pubcon Filter on bot and 301 to identify bad crawl chains & problematic parameters SPOT THE ISSUES
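Outside a log tool, the 'bot and 301' filter might look like the sketch below. It assumes each log entry has already been parsed into a dictionary with `url`, `status` and `user_agent` keys (a hypothetical structure, as are the function name and sample data), and counts which query parameters appear most often in redirected bot requests.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def redirected_parameters(entries, bot="Googlebot", status="301"):
    """Count query parameter names on URLs where the bot received the given status."""
    counts = Counter()
    for entry in entries:
        if bot in entry["user_agent"] and entry["status"] == status:
            query = urlsplit(entry["url"]).query
            for name, _value in parse_qsl(query, keep_blank_values=True):
                counts[name] += 1
    return counts


# Invented sample entries for illustration.
entries = [
    {"url": "/dresses?colour=red&size=10", "status": "301", "user_agent": "Googlebot/2.1"},
    {"url": "/dresses?colour=blue", "status": "301", "user_agent": "Googlebot/2.1"},
    {"url": "/dresses?colour=red", "status": "200", "user_agent": "Googlebot/2.1"},
    {"url": "/dresses?size=8", "status": "301", "user_agent": "Firefox"},
]
print(redirected_parameters(entries).most_common())
```

Parameters that dominate this count are good candidates for a closer look at redirect rules and internal links.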
  77. 77. #pubcon Filter on bot and 304 code served 304 – ‘NOT MODIFIED’ (NOTICE NOTHING HAS BEEN DOWNLOADED) RESPONSE TO A CONDITIONAL ‘IF-MODIFIED-SINCE’ REQUEST. HEADERS ONLY, NO BODY SENT
  78. 78. #pubcon When consolidating… check 410 progress Filter on 410 and User Agent 410 ‘GONE’
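Checking 410 progress could be scripted along these lines. The sketch assumes a list of retired URLs is available, that log entries are dictionaries with `url`, `status` and `user_agent` keys in chronological order, and that the most recent status served to the bot is the one that matters. All names and sample data are hypothetical.

```python
def gone_progress(entries, retired_urls, bot="Googlebot"):
    """Split retired URLs into those the bot last saw as 410 and those still pending."""
    latest = {}
    for entry in entries:  # later entries overwrite earlier ones
        if bot in entry["user_agent"] and entry["url"] in retired_urls:
            latest[entry["url"]] = entry["status"]
    confirmed = {url for url, status in latest.items() if status == "410"}
    pending = set(retired_urls) - confirmed
    return confirmed, pending


# Invented sample data for illustration.
entries = [
    {"url": "/old-a", "status": "200", "user_agent": "Googlebot/2.1"},
    {"url": "/old-a", "status": "410", "user_agent": "Googlebot/2.1"},
    {"url": "/old-b", "status": "301", "user_agent": "Googlebot/2.1"},
    {"url": "/old-c", "status": "410", "user_agent": "Firefox"},
]
confirmed, pending = gone_progress(entries, {"/old-a", "/old-b", "/old-c"})
print(sorted(confirmed), sorted(pending))
```

The pending set includes URLs the bot has not requested at all since the change, which is useful to know when consolidating.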
  79. 79. #pubcon Build a monitoring dashboard
  80. 80. #pubcon Check The Split of Smartphone v Desktop Googlebot
  81. 81. #pubcon Check The Split of Response Codes
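Both splits boil down to counting values and turning the counts into percentages. The sketch below is one way to do that; the user agent check is a simple heuristic based on the two organic search agents shown earlier ('Googlebot/' plus 'Mobile' for smartphone), and the function names are made up for illustration.

```python
from collections import Counter


def googlebot_type(user_agent: str) -> str:
    """Classify a user agent as smartphone or desktop Googlebot, else 'other'."""
    if "Googlebot/" not in user_agent:
        return "other"  # includes Googlebot-Image, Googlebot-News and non-Google agents
    return "smartphone" if "Mobile" in user_agent else "desktop"


def percentage_split(values):
    """Return each distinct value's share of the total, as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    if not total:
        return {}
    return {key: round(100 * count / total, 1) for key, count in counts.items()}


print(percentage_split(["200", "200", "301", "404"]))
```

The same `percentage_split` helper works for response codes, bot types, or any other column pulled from the logs.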
  82. 82. #pubcon Splunk & Deepcrawl log file filter query
  83. 83. #pubcon Explore & drill down into issues
  84. 84. #pubcon Discover anomalies / gaps between analytics & crawls from user-agent logs
  85. 85. #pubcon Find & reconnect orphaned pages
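One common way to look for orphaned pages is to compare the URLs bots request in the logs with the URLs a site crawl can reach through internal links. The sketch below assumes both lists are already available as normalised paths; the function names and sample URLs are invented for illustration.

```python
def find_orphans(log_urls, crawl_urls):
    """URLs requested in the logs that the site crawl never reached."""
    return sorted(set(log_urls) - set(crawl_urls))


def find_unvisited(log_urls, crawl_urls):
    """URLs found in the site crawl that never appear in the logs."""
    return sorted(set(crawl_urls) - set(log_urls))


# Invented sample data for illustration.
log_urls = ["/a", "/b", "/legacy-page"]
crawl_urls = ["/a", "/b", "/new-page"]
print(find_orphans(log_urls, crawl_urls), find_unvisited(log_urls, crawl_urls))
```

The comparison is only as good as the URL normalisation, so protocol, host and trailing slashes should be made consistent on both sides first.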
  86. 86. #pubcon Identify the discrepancies & weak areas to prioritise
  87. 87. #pubcon Do Regular Log Analysis & Crawling • Weekly or monthly crawls • Download logs or run them into the cloud automatically (RECOMMENDED) • Compare log file activity against crawls of the site • Compare crawls and log file activity against Google Analytics & GSC ‘active’ URLs
  88. 88. #pubcon Closing words – Pressing ‘Recrawl now’ (April Fools) will not fix your content
  89. 89. #pubcon But… Fixing your content might positively impact crawling
  90. 90. #pubcon Happy Sleuthing Thank you
  91. 91. #pubcon Appendix, References & Further Resources
  92. 92. #pubcon Pros and cons of Excel CONS • Fiddly • Mostly Manual process • Limited capacity • Need to keep updating with more data PROS • Easy to set up • Suitable for small analysis • Simple to understand • Easy to filter & sort
  93. 93. #pubcon Loggly CONS • Free version very limited • Not initially intuitive • Integrates with server • Integrates with log-shipper intermediaries like ‘Docker’ PROS • Option to upload files • Good graphical analysis • Can build nice reports • Cloud based software • Great dashboard
  94. 94. #pubcon Splunk Light CONS • Free version limited • Based on usage • Can soon add up • Not easily configured in the cloud • UI not intuitive PROS • Downloadable version • Good for medium projects • You have control (on own machine) • Easy to pick out essentials • Can integrate with Deepcrawl
  95. 95. #pubcon Screaming Frog Log Analyzer CONS • Need to increase RAM allowance almost always • Very limited free version • Again, has limits as it’s not cloud based PROS • Very easy to configure • Can compare log URIs with crawl data • Very similar to Excel • Some graphics to build reports • Overlay GA & GSC
  96. 96. #pubcon Places to find bot footprints • On server analytics & visitor analytics screens – e.g. Awstats • Google search console • Server logs (raw access logs)