These slides were presented at the SEMrush webinar "How to Leverage Insights from Your Site’s Server Logs | 5 Hours of Technical SEO". Video replay and transcript are available at https://www.semrush.com/webinars/how-to-leverage-insights-from-your-site-s-server-logs-or-5-hours-of-technical-seo/
2. All SEOs share a common goal:
To be crawled, indexed,
and ranked.
3. How can we answer all these questions?
● Which pages is Googlebot crawling?
● What user-agent is it using?
● Is Googlebot’s crawl mirroring our understanding of the site’s
structure and assets?
● How’s the site’s tech health?
4. Logs are a record
of every request
a server receives.
9. Check your CDN for data on edge node
(cached) vs server (uncached) hits
10. Internal Log Requests
Ask: Is there already a log management platform in place?
Be clear: We do not want Personally Identifiable
Information (PII); request that it be removed
Be specific: Exported as .csv, please!
18. Manually validate Googlebot IPs
Run a reverse DNS lookup on the accessing IP
address from your logs, using the host command.
jammer@Hypatia ~ % host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com
19. Bulk validate Googlebot IPs with Scripts
Source: Shell Script to Detect if the IP Address Is Googlebot, Dzone
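As a rough Python counterpart to the manual host check (not the DZone shell script itself), the sketch below runs a forward-confirmed reverse DNS lookup: reverse-resolve the IP, check the hostname suffix, then resolve that hostname forward and confirm it maps back to the same IP. The function name `is_googlebot`, the suffix list, and the injectable `reverse`/`forward` resolvers are assumptions made for illustration, and the forward lookup here covers IPv4 only.

```python
import socket

# Assumption: hostname suffixes treated as Google-owned crawler domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(ip, reverse=None, forward=None):
    """Return True if ip passes a forward-confirmed reverse DNS check.

    reverse and forward are optional resolver callables (hypothetical
    hooks, handy for testing without network access).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the original IP.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False


def verified_googlebot_ips(ips):
    """Bulk check: validate each distinct IP once."""
    return {ip for ip in set(ips) if is_googlebot(ip)}
```

Deduplicating the IPs before the lookups keeps the number of DNS queries manageable on large log exports.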
22. 216.150.168.131 [07/Mar/2018:16:11:58 -0800]
66.249.66.1 GET
/twiki/bin/view/TWiki/WikiSyntax HTTP/1.1
www.arrow.com 200 7352 616 -
Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
https://www.arrow.com/en/
indiegogo
The values captured in logs are unique to
each site.
Make a new engineering friend to learn
exactly what they mean.
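To make the sample entry above easier to query, here is a minimal parsing sketch. Every field name in the pattern (`server_ip`, `client_ip`, `bytes_sent`, `time_taken`, and so on) is a guess based on the position of each value in the sample, not a confirmed format; the real meaning of each column has to come from whoever configured the logging.

```python
import re

# Assumption: fields are space-separated in the order seen in the sample.
# All group names are guesses and should be confirmed with engineering.
LOG_PATTERN = re.compile(
    r"(?P<server_ip>\S+) \[(?P<timestamp>[^\]]+)\] (?P<client_ip>\S+) "
    r"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+) (?P<host>\S+) "
    r"(?P<status>\d{3}) (?P<bytes_sent>\S+) (?P<time_taken>\S+) "
    r"(?P<extra>\S+) (?P<user_agent>\S+) (?P<referer>\S+)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


SAMPLE = (
    "216.150.168.131 [07/Mar/2018:16:11:58 -0800] 66.249.66.1 GET "
    "/twiki/bin/view/TWiki/WikiSyntax HTTP/1.1 www.arrow.com 200 7352 616 - "
    "Mozilla/5.0+(Linux;+Android+6.0.1;+Nexus+5X+Build/MMB29P)"
    "+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/41.0.2272.96"
    "+Mobile+Safari/537.36+(compatible;+Googlebot/2.1;"
    "++http://www.google.com/bot.html) https://www.arrow.com/en/"
)

record = parse_line(SAMPLE)
```

Returning None for non-matching lines makes it easy to count how many entries the pattern fails on, which is a quick signal that the assumed field order is wrong.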
25. Use Case (Basic Query)
Legacy code being brought kicking
and screaming into mobile-only index
26. Query: Are we migrating to mobile-only index?
1. Data Source: Your aggregated logs
2. Condition: where the requester
is (verified) Googlebot
3. Group by: User-agent
4. Count: Number of hits (desc)
5. Limit: Start with ~10 results.
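The five steps above could look like this in Python, assuming the logs have already been parsed into dictionaries. The keys `client_ip` and `user_agent`, the `verified_ips` set, and the function name are hypothetical; a log management platform would express the same thing in its own query language.

```python
from collections import Counter


def top_user_agents(records, verified_ips, limit=10):
    """Count hits per user-agent for verified Googlebot requests.

    records: iterable of dicts with (assumed) keys client_ip, user_agent.
    verified_ips: set of IPs that already passed Googlebot validation.
    """
    hits = Counter(
        r["user_agent"] for r in records if r["client_ip"] in verified_ips
    )
    # most_common sorts by hit count, descending, and applies the limit.
    return hits.most_common(limit)
```

If the smartphone Googlebot user-agent dominates the top of the list, most crawling is already happening with the mobile crawler.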
28. Query: Are non-canonical hostnames being
crawled?
1. Data Source: Aggregated logs
2. Condition: where Googlebot
3. Group by: Hostname
4. Count: Number of hits (desc)
5. Limit: 10
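A small sketch of the hostname query, assuming the records are already filtered to verified Googlebot hits and carry a `host` key. The `CANONICAL_HOST` value and the function name are placeholders chosen for illustration.

```python
from collections import Counter

# Assumption: placeholder for the site's one canonical hostname.
CANONICAL_HOST = "www.example.com"


def hits_by_hostname(records, canonical=CANONICAL_HOST, limit=10):
    """Return (hostname, hits, is_canonical) tuples, most-hit first."""
    hits = Counter(r["host"].lower() for r in records)
    return [
        (host, count, host == canonical)
        for host, count in hits.most_common(limit)
    ]
```

Any row where the last value is False is a hostname Googlebot is spending requests on that is not the canonical one.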
31. Query: Which languages are being crawled?
1. Data Source: Your aggregated logs
2. Condition: where Googlebot
3. Group by: Language
4. Count: Number of hits (desc)
5. Limit: Start with ~10 results.
33. Build on-the-fly segments by parsing URL structure
/en/products/blam-o/log-12345
● en: Language
● products: App
● blam-o: Manufacturer
● log-12345: SKU
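As an illustration of parsing the example URL into segments, the sketch below splits the path and labels each position. It assumes every URL follows the same language/app/manufacturer/SKU order as the example; `SEGMENT_NAMES` and `segment_url` are hypothetical names.

```python
from urllib.parse import urlsplit

# Assumption: path positions always follow the order shown in the example.
SEGMENT_NAMES = ("language", "app", "manufacturer", "sku")


def segment_url(url):
    """Map each path segment to its assumed label.

    Shorter paths simply yield fewer keys.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return dict(zip(SEGMENT_NAMES, parts))


segments = segment_url("/en/products/blam-o/log-12345")
```

Grouping hits by `segment_url(path).get("language")` is one way to answer the "which languages are being crawled" question without a dedicated language field in the logs.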
34. Query: Which subfolders are being crawled?
1. Data Source: Your aggregated logs
2. Condition: where Googlebot
3. Parse: subfolder
4. Aggregate: by Subfolder
5. Count: Number of hits (desc)
6. Limit: Start with ~10 results.
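A possible implementation of the parse and aggregate steps, assuming "subfolder" means the first directory in the path and that the input is a list of request paths from verified Googlebot hits. Both function names are illustrative.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_subfolder(path):
    """Return the top-level folder of a path, or "/" for root-level files."""
    clean = urlsplit(path).path  # drops any query string
    segments = [s for s in clean.split("/") if s]
    # Without a trailing slash, the last segment is treated as a file name.
    folders = segments if clean.endswith("/") else segments[:-1]
    return "/" + folders[0] + "/" if folders else "/"


def top_subfolders(paths, limit=10):
    """Count hits per top-level subfolder, most-hit first."""
    return Counter(first_subfolder(p) for p in paths).most_common(limit)
```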
36. Even search engines need to CYA
Googlebot is designed to be a good citizen of the web...
For Googlebot a speedy site is a sign of healthy servers...
If the site slows down or responds with server errors, the
[crawl rate] limit goes down and Googlebot crawls less.
Official Google Webmaster Central Blog: What Crawl Budget Means for Googlebot
37. Starting query: What HTTP status codes are we returning?
1. Data Source: Your aggregated logs
2. Condition: where Googlebot
3. Aggregate: by HTTP Status
4. Count: Number of hits (desc)
5. Limit: Start with ~10 results.
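The starting query above might be sketched as follows, assuming records are already limited to verified Googlebot hits and expose a `status` key. The extra roll-up into classes such as 5XX is an addition for convenience, not something the slide prescribes.

```python
from collections import Counter


def status_breakdown(records, limit=10):
    """Return (top status codes, hits per status class).

    records: iterable of dicts with an (assumed) status key.
    """
    by_code = Counter(str(r["status"]) for r in records)
    by_class = Counter()
    for code, count in by_code.items():
        by_class[code[0] + "XX"] += count  # e.g. "503" -> "5XX"
    return by_code.most_common(limit), dict(by_class)
```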
38. Iterative query: What resources are returning 5XX?
1. Data Source: Your aggregated logs
2. Condition: where Googlebot
AND
3. Condition: where 5XX
4. Parse: subfolder
5. Count: Number of hits (desc)
6. Limit: Start with ~10 results.
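The iterative query combines both conditions before grouping. This sketch assumes pre-filtered Googlebot records with `status` and `path` keys, and groups by the first path segment as a stand-in for "subfolder"; the function name is hypothetical.

```python
from collections import Counter


def server_errors_by_section(records, limit=10):
    """Count 5XX responses per first path segment, most-hit first."""
    errors = Counter()
    for r in records:
        if 500 <= int(r["status"]) <= 599:
            path = r["path"].split("?", 1)[0]  # ignore the query string
            # "/en/products/x" -> "/en"; "/" -> "/"
            section = "/" + path.lstrip("/").split("/", 1)[0]
            errors[section] += 1
    return errors.most_common(limit)
```

Sections that dominate this list are the first places to look when checking whether server errors are dragging down the crawl rate.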
46. I'm a mentor @ United Search
Want to take the stage as an SEO speaker?
Want to stay in the audience but see more diversity in SEO events?
United Search is an SEO speaker accelerator designed to specifically aid
underrepresented groups, at no cost to students.
● Application - unitedsearch.org/apply
● Mentors - unitedsearch.org/mentors
● Mission - unitedsearch.org/about-us
For more info check out unitedsearch.org or @search_united on Twitter.