Pubcon Florida 2018: Logs Don’t Lie, Dawn Anderson

#pubcon
Logs Don’t Lie – SEO Wins &
Server Log File Analysis
Presented by:
Dawn Anderson @dawnieando

#pubcon
Dawn Anderson
• Move It Marketing
• University Lecturer – Digital Marketing
• From Manchester, UK (rains a lot)
• International SEO Consultant – 11+ yrs in SEO
• Pomeranian pooch lover – Bert & Tedward
• Fascinated by crawling (practice & academia)
• Doesn’t fare well in YouTube screen grabs ;P
• Party trick: Remembering UK postcode areas
(US Zip code equivalent)
• Search Awards Judge
• Twitter chatterer @dawnieando

#pubcon
What Server Log File Analysis is NOT

#pubcon
It’s not just about
‘crawl budget’

#pubcon
Crawl Budget
1. It’s not a real term, but 2 notions together
2. Host Load + URL Scheduling combined
3. Crawling ‘Politeness’ IS A BIG THING
4. How much can we crawl & how often?
5. What is important to crawl & how often?
6. We need to not ‘obsess’ over this if we have small sites
7. WE NEED TO TAKE A LOOK THROUGH ‘SPIDER EYES’

#pubcon
When could it be worth it?
• You have a huge site with many parameters
• You’ve spotted anomalies in Google Analytics traffic URLs
• You’re trying to consolidate site URLs
• You’ve got big legacy

#pubcon
Why Server Log File Analysis?

#pubcon
Many reasons… But… Here’s one… An
infinite loop can kill a site over time

#pubcon
Exponentially Multiplicative URLs From Faceted Navigation…
100 DRESSES
5 COLOURS
10 SIZES
2 LENGTHS
4 SUPPLIERS
100 x 5 x 10 x 2 x 4 = 40,000 URLs

#pubcon
And that’s without HTTPS, WWW/non or internationalization
100 DRESSES
5 COLOURS
10 SIZES
2 LENGTHS
4 SUPPLIERS
100 x 5 x 10 x 2 x 4 = 40,000 URLs
X 2 BECAUSE… HTTPS VERSION = 80,000 URLs
X 2 BECAUSE… WWW / NON WWW VERSION = 160,000 URLs
X 5 BECAUSE… EN / FR / ES / DE / IT (e.g.) = 800,000 URLs
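
The multiplication on these two slides can be reproduced in a few lines. This is only a sketch of the arithmetic: the counts come from the slides, while the dictionary keys and variable names are illustrative labels.

```python
from math import prod

# Facet counts taken from the slide; the keys are illustrative labels.
facets = {"dresses": 100, "colours": 5, "sizes": 10, "lengths": 2, "suppliers": 4}
base_urls = prod(facets.values())

# Each duplicate-creating dimension multiplies the running total again.
multipliers = {"http/https": 2, "www/non-www": 2, "languages": 5}

running_total = base_urls
totals = [("facets only", running_total)]
for name, factor in multipliers.items():
    running_total *= factor
    totals.append((name, running_total))

for name, total in totals:
    print(f"{name:>12}: {total:,} URLs")
```

Adding one more facet with only three values would triple every figure, which is why parameter-heavy sites are the ones where log analysis tends to pay off.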

#pubcon
Why is Server Log File Analysis Important?
Detecting orphaned URLs
Understanding URL crawl frequency
Detecting server errors
Understanding the % of ‘healthy’ crawling
IDENTIFYING WEAK AREAS IN A SITE

#pubcon
A consolidation of signals
to preferred URLs can win
with SEO
CONSISTENCY
IS KING

#pubcon
Detective
Meets
Detective

#pubcon
We are stalking Googlebots (as
detectives) and trying to walk
their paths to understand their
experiences as they traverse a
site
WHAT ‘CLUES’ ARE WE
PROVIDING?

#pubcon
Is Google (the detective) picking
up on our clues?
Canonical tags
XML sitemaps
Hreflang
Internal links
Pagination
URL parameters

#pubcon
Is Google getting your ’hints’?
ONLY
HINTS
‘NOT’
DIRECTIVES??

#pubcon
Directive or Hint?… Either way, we need
to ensure our clues are working
Likely pretty strong
hints… or maybe
’nearly’ directives ;P
;P ;P…
It depends ;P

#pubcon
Every site will have its own crawling rules
DUSTBUSTER
CRAWLING
RULES
BUILDS ‘HINTS’
ON WHAT NOT
TO CRAWL
DO NOT CRAWL
IN THE DUST

#pubcon
Popular CMSs might help with ‘rule-building’
ALL WILL HAVE SOME COMMON
CANONICALIZATION PATTERNS
WHICH CAN BE LEARNED FOR
EFFICIENCY

#pubcon
Why ‘Sampling’ in crawling for efficiency?
Is it worth it?
Should we crawl more?
Are there lots of important URLs here?
Do the URLs ‘genuinely’ change frequently?
Are the changes ‘important’ (weighted) or is it just
‘DUST’?

#pubcon
There is likely also ‘quilting’ detection
Detecting Quilted
Web Pages at Scale
(Najork, M., 2012)
Finds pages
‘stitched’ together
to make other
pages
Image credit: Najork, Marc

#pubcon
Breadth First Crawling
or Other Crawling
Strategies (OR
SOMETHING MUCH
BETTER THAN THIS
SINCE CAFFEINE??)

#pubcon
Do NOT get me started on JavaScript &
dependent files
THEY ARE ALL
NEEDED IN
INDEXING &
GATHERED IN
CRAWLING

#pubcon
If you use a ‘Builder’ theme in
WordPress this will be very evident
THE CACHES
CREATED GET
CRAWLED… a
LOT

#pubcon
Yoast and Googlebot
access…
• Yoast has unblocked access to Googlebot
in its plugin pretty much everywhere
• You might even find Googlebot trying to
access wp-admin
• Googlebot needs all dependent files to
render the page (including JavaScript and
CSS files)

#pubcon
Hunting for
Googlebots? Where
can we find them?

#pubcon
In your
quest…You
may face some
challenges
along the way

#pubcon
Google Search Console is Where It’s At… Right?

#pubcon
We Can See Some Symptoms Here
Though
•SPEED ISSUES
•‘BOREDOM’
ISSUES
•ROGUE CODE

#pubcon
We Can See
Some
Symptoms
Here Too
AFTER REMOVAL OF
CANONICAL AT SCALE

#pubcon
Signs in Google Search Console Crawl Stats
• Possible signs of ‘quality-impacted’ content
• Near duplicates & duplicates
• Not just speed… ‘boredom’ too
• Major site changes or switches to https protocol
• Signs of ‘Sampling’ visits for quality
• Getting the best yield for crawling
• Transitive nature
• Slow sites

#pubcon
GOOGLE SEARCH CONSOLE IS NOT REALLY
JUST ‘WEB PAGES’
• Includes ALL CSS, JS, Zip,
XML, PDF, AMP, HTML files
crawled
• Pages are NOT just single
webpages
https://support.google.com/webmasters/answer/35253
Not just ‘web pages’

#pubcon
VISITS BY ALL THE TYPES OF GOOGLEBOTS
ARE RECORDED TOGETHER IN GSC
• Web
• Image
• News
• Video
• Feature Phone
• Smartphone
• Mobile Adsense
• Adsense
• Adsbot
• App Crawler
ALL The Googlebot Family

#pubcon
ONLY CONTAINS STATUS CODES BETWEEN
200s & 30X (ALL PROTOCOLS THOUGH)

#pubcon
Don’t Be Fooled By Those ‘Big Success’
Screen Grabs on Twitter
CAN BE BOTH
HTTP AND HTTPS
OR SIMPLY A
MAJOR IN-SITE
REDIRECTION
EXERCISE

#pubcon
Google Search Console is Like a Visit To
The GP Before Referral
VAGUE AT BEST

#pubcon
So…We Need To Dig Deeper… Be More Curious

#pubcon
Finding
AWStats on
cPanel

#pubcon
REALITY – Server Logs & Log Analysis Is
Where It’s At
AUTOMATE SERVER LOG
RETRIEVAL VIA CRON JOB
grep Googlebot access_log > googlebot_access.txt
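
For anyone who would rather script this than use grep, here is a minimal Python sketch of the same filter. The function name `filter_bot_lines` and its parameters are hypothetical. It also reads gzip-compressed logs, on the assumption that archived logs often arrive as .gz files.

```python
import gzip
from pathlib import Path


def filter_bot_lines(log_path, out_path, needle="Googlebot"):
    """Copy every line containing `needle` from log_path to out_path.

    Returns the number of lines kept. Like grep, this is a plain
    substring match on the raw line, so it does NOT prove the visitor
    was a genuine Googlebot.
    """
    log_path = Path(log_path)
    # Archived logs are often gzip-compressed; plain logs open normally.
    opener = gzip.open if log_path.suffix == ".gz" else open
    kept = 0
    with opener(log_path, "rt", encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                kept += 1
    return kept
```

A call such as `filter_bot_lines("access_log", "googlebot_access.txt")` mirrors the grep command above and could be scheduled from the same cron job.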

#pubcon
But… what are logs? Log files? Log file
analysis… really?

#pubcon
Not just this… :) But kind of this :)

#pubcon
Not just this… :) But kind of this :)
Logs are everywhere

#pubcon
So… What’s a log file?
• A document containing one or more logs
• Usually exported as a .txt file in ‘Common Log Format’ (W3C) from the server
• Common log format contains specific fields
• May be bundled in a tar or .gz file
• Can be exported from ‘raw access’ in the server

#pubcon
Elements of Server Log File (Common
Log Format)
• IP Address of bot
• Date Accessed (and time)
• Request type (GET, HEAD, POST)
• URL Requested
• Server Response Code Returned
• HTTP Code
• Bytes Served (file size)
• User Agent
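
As a rough illustration of these fields, the sketch below parses one log line with a regular expression. It assumes the Apache-style ‘combined’ layout, which is the Common Log Format plus referrer and user agent. The sample line, the pattern name and the function name are made up for the example, so the pattern may need adjusting for a differently configured server.

```python
import re

# Assumed layout: Common Log Format, optionally followed by the quoted
# referrer and user agent that the 'combined' variant appends.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Servers log "-" when no body was sent (e.g. for a 304 response).
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Illustrative line only; not taken from a real log.
sample = (
    '66.249.66.1 - - [10/Oct/2018:13:55:36 +0000] '
    '"GET /dresses?colour=red HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
entry = parse_line(sample)
print(entry["ip"], entry["method"], entry["url"], entry["status"], entry["bytes"])
```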

#pubcon
Purposes & Types of logs on Servers
01 Error logs (database errors + server response codes, bad code warnings)
02 Suspicious activity (Security implications / monitoring) – DDOS, spammers, hackers
03 Visits to a file / page / Page URL requests (both human & bots)

#pubcon
Monitor Other Types of Logs Too
• Error logs provide great insight into where there may be issues with overloading the server etc.
• If errors (e.g. 500 codes) are sent frequently, Google will pull back and crawl less

#pubcon
A Better
Explanation of
Log File Analysis
• Interrogation of data
• Looking for patterns
• Looking for anomalies
• Looking for split messaging
about URL importance to
Google and finding ways to
consolidate consistent
signaling to single content
fingerprints - STRENGTH

#pubcon
Recap - Simple
Explanation
• Logs – simply a notation or record of
something
• Log files – simply the document where the
log is stored
• Log file analysis – simply analysing and
exploring the log files to identify areas
where optimisation is possible or wastage
occurs

#pubcon
Hunting for
Googlebots?
How will we
know we’ve
found them?

#pubcon
The Good, The Bad & The Ugly
• Good bots (polite, non-malicious, usually from respectable organisations, e.g. Search Engines & good tool providers)
• Bad bots (impolite, may be malicious, scraper bots, spammers)

#pubcon
‘Politeness’ Crawling Rules
01 Do NOT damage the server
02 Do NOT damage the server
03 Do NOT damage the server

#pubcon
There are ALSO many,
many ‘POLITE’ bots
• Yahoo Slurp
• Bing Bot
• Other Search Engine Bots
• 10 Types of Googlebot
• SEO Tool Bots (On Page, Sistrix,
Deepcrawl, etc)

#pubcon
Verifying it’s really Googlebot
• Spoofing
• Google don’t publish a list
of IPs any longer (they
change too frequently)
• Need to verify by reverse
DNS lookup on the server using
HOST to ensure visits are
really from Googlebot and
not spoof bots (Screaming
Frog)
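
The reverse DNS check can also be scripted. The sketch below is one possible approach and not a definitive implementation: it adds a forward lookup to confirm the hostname resolves back to the same IP, and it assumes genuine Googlebot hostnames end in googlebot.com or google.com. The function name and the injectable lookup parameters are hypothetical; they exist so the logic can be exercised without network access.

```python
import socket

# Assumption: genuine Googlebot hostnames end with one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # gethostbyname_ex is IPv4 only; element 2 is the address list.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips
```

A spoof bot can copy the Googlebot user agent, but it cannot make Google's DNS point back at its own IP, which is why the user agent alone is not enough.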

#pubcon
Examples of Googlebot
Organic Search Calling
Cards (user agents)
• Desktop - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
• Smartphone - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
• Googlebot-News
• Googlebot-Image/1.0
#pubcon
So what’s the server log file analysis
process?
Gather, Ship, Analyse

#pubcon
Gathering logs straight from the server
FIND ‘RAW
ACCESS’

#pubcon
Gathering logs straight from the server
These only represent a few
hours’ worth of data /
activity
GOOD FOR A QUICK
LOOK IF YOU SEE
SOME PROBLEMS
OCCURRING

#pubcon
Archived raw logs… is what you want
• A log of everything on
the server
• All of the separate
subdomains
• All of the separate
protocols
• Zipped Up

#pubcon
Shipping logs to analytical tools
• Docker
• Naked Eye
• Excel
• Text File
• Cloud based log analysis software
• GREP
• Command Line
• Downloaded log analysis software
Carrying ‘ALL’ the
logs

#pubcon
Opening log files manually
YOU’LL NEED A TEXT EDITOR OR EVEN BETTER AN IDE (INTEGRATED DEVELOPMENT ENVIRONMENT)
e.g. NOTEPAD++, BRACKETS (on MAC)
• Notepad++
• Notepad
• Komodo
• Aptana
• Netbeans
• Eclipse
• Brackets (Mac)

#pubcon
They look something like
this… SERVER LOGS
EXAMPLE IN
‘BRACKETS’ IDE
TEXT EDITOR

#pubcon
Move them all to Excel
Select all with CTRL A, copy, and
paste everything into an
Excel
spreadsheet

#pubcon
Convert text to data in Excel
Choose ‘DATA’
and convert text
to columns.
Delimit with
‘space’ (usually)
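
The same split can be sketched in Python with the standard csv module, treating the space as the delimiter and double quotes as the text qualifier. The sample line and function name are illustrative. The timestamp still lands in two columns because it contains a space but is wrapped in square brackets, not quotes, which is one reason space delimiting only ‘usually’ works cleanly.

```python
import csv


def split_log_line(line):
    """Split one log line on spaces, keeping quoted fields together."""
    return next(csv.reader([line], delimiter=" ", quotechar='"'))


# Illustrative line only; not taken from a real log.
sample_line = (
    '66.249.66.1 - - [10/Oct/2018:13:55:36 +0000] '
    '"GET /dresses?colour=red HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
columns = split_log_line(sample_line)
for index, value in enumerate(columns):
    print(index, value)
```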

#pubcon
Filter by verified user agent
FILTER USER AGENT on verified DNS lookup HOST
http://google.com/bot.html

#pubcon
Many ‘scale’ tools for log file analysis

#pubcon
SCREAMING FROG LOG ANALYZER

#pubcon
You cannot ‘emulate’ a Googlebot
crawl
It is not a ‘from start to finish’ crawl
through a site

#pubcon
You may see strange URLs
• Old .htaccess rules run in order of their placement in
the .htaccess file
• Old .htaccess rules on ‘migrated-from’ sites
• Old XML sitemaps left on server
• Old plugins removed but folders left
• Un-optimized MySQL or other database (cluttered
with legacy)
• Spammers hitting your search queries & randomly
spinning new links

#pubcon
Understanding the problems

#pubcon
Crawl prioritization & queuing is evident
MANY OF THESE FILES
WERE PREVIOUSLY
CALLING THESE NOW
REMOVED PLUGINS AND
WERE QUEUED TO CRAWL
TOGETHER

#pubcon
If you migrate or switch protocol… import &
monitor all logs… they will chain

#pubcon
What followed the 301? FILTER ON BOT &
RESPONSE CODE
INCLUDING ONLY 301
AND 200 RESPONSE
CODES TO FIND THE
NEXT PART OF THE BOT
JOURNEY
(CONSIDER MULTIPLE
CONNECTIONS)
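
One way to approximate ‘what followed the 301’ in code is to pair each 301 with the next 301 or 200 logged for the same IP. This is a simplifying assumption: Googlebot crawls from multiple connections and IPs, so the pairs are only candidates to inspect, not a proven journey. The function name and the tuple layout are hypothetical.

```python
def requests_after_redirects(entries):
    """Pair each 301 with the next 301/200 request from the same IP.

    `entries` is an iterable of (ip, url, status) tuples in log order.
    Returns a list of (redirected_url, next_url, next_status) tuples.
    """
    pending = {}  # ip -> URL that returned a 301 and awaits a follow-up
    pairs = []
    for ip, url, status in entries:
        if status not in (301, 200):
            continue  # mirror the slide's filter on 301 and 200 only
        if ip in pending:
            pairs.append((pending.pop(ip), url, status))
        if status == 301:
            pending[ip] = url
    return pairs


# Illustrative entries only.
log_entries = [
    ("66.249.66.1", "/old-page", 301),
    ("66.249.66.2", "/other", 200),
    ("66.249.66.1", "/old-page/", 301),
    ("66.249.66.1", "/new-page", 200),
]
for hop in requests_after_redirects(log_entries):
    print(hop)
```

Two consecutive 301 pairs for the same IP, as in this sample, hint at a redirect chain worth collapsing into a single hop.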

#pubcon
Filter on bot and 301 to identify bad crawl chains &
problematic parameters
SPOT THE
ISSUES

#pubcon
Filter on bot and 304 code served
304 – ‘NOT MODIFIED’, SERVED WHEN THE BOT ASKS ‘IF MODIFIED’
(NOTICE NOTHING HAS
BEEN DOWNLOADED)
HEAD CHECKED
ONLY. NO ‘GET’
REQUEST

#pubcon
When consolidating… check 410 progress
Filter on 410 and User
Agent
410 ‘GONE’

#pubcon
Build a monitoring dashboard

#pubcon
Check The Split of Smartphone v Desktop Googlebot

#pubcon
Check The
Split of
Response
Codes
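
A dashboard figure like this can be approximated from parsed log entries. The sketch below groups status codes into classes and reports each class as a percentage of all bot requests. The function name and sample codes are illustrative.

```python
from collections import Counter


def status_split(status_codes):
    """Return the percentage of requests per status class (2xx, 3xx, ...)."""
    counts = Counter(f"{code // 100}xx" for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: round(100 * n / total, 1) for cls, n in sorted(counts.items())}


# Illustrative status codes only.
codes = [200, 200, 301, 404, 500, 200, 304, 200]
for status_class, share in status_split(codes).items():
    print(f"{status_class}: {share}%")
```

The 2xx share gives a rough measure of ‘healthy’ crawling, while rising 4xx or 5xx shares point to the weak areas to investigate first.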

#pubcon
Splunk &
Deepcrawl log
file filter query

#pubcon
Explore & drill down into issues

#pubcon
Discover anomalies / gaps between
analytics & crawls from user-agent logs

#pubcon
Find & reconnect orphaned pages

#pubcon
Identify the discrepancies & weak
areas to prioritise

#pubcon
Do Regular Log Analysis & Crawling
• Weekly or monthly crawls
• Download logs or run them into the cloud
automatically (RECOMMENDED)
• Compare log file activity against crawls of the
site
• Compare crawls and log file activity against
Google Analytics & GSC ‘active’ URLs
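
The comparisons above boil down to set differences between URL lists. This is a minimal sketch, assuming each source has already been exported as a list of normalised URLs. The function name and result keys are hypothetical labels.

```python
def compare_url_sets(log_urls, crawl_urls, analytics_urls):
    """Compare URLs seen in bot logs, a site crawl and analytics."""
    log_urls = set(log_urls)
    crawl_urls = set(crawl_urls)
    analytics_urls = set(analytics_urls)
    return {
        # Requested by the bot but not reachable in the site crawl.
        "orphan_candidates": sorted(log_urls - crawl_urls),
        # Linked on the site but never requested by the bot in this period.
        "not_visited_by_bot": sorted(crawl_urls - log_urls),
        # Receiving traffic but absent from the bot logs.
        "traffic_without_bot_visits": sorted(analytics_urls - log_urls),
    }


# Illustrative URLs only.
report = compare_url_sets(
    log_urls=["/", "/dresses", "/old-sale"],
    crawl_urls=["/", "/dresses", "/shoes"],
    analytics_urls=["/", "/shoes"],
)
for label, urls in report.items():
    print(label, urls)
```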

#pubcon
Closing words –
Pressing
‘Recrawl now’
(April Fools)
will not fix your
content

#pubcon
But… Fixing your
content might
positively impact
crawling

#pubcon
Happy
Sleuthing
Thank you

#pubcon
Appendix, References & Further
Resources

#pubcon
Pros and cons of Excel
CONS
• Fiddly
• Mostly Manual process
• Limited capacity
• Need to keep updating with
more data
PROS
• Easy to set up
• Suitable for small
analysis
• Simple to understand
• Easy to filter & sort

#pubcon
Loggly
CONS
• Free version very limited
• Not initially intuitive
• Integrates with server
• Integrates with log-shipper
intermediaries like
‘Docker’
PROS
• Option to upload files
• Good graphical analysis
• Can build nice reports
• Cloud based software
• Great dashboard

#pubcon
Splunk Light
CONS
• Free version limited
• Based on usage
• Can soon add up
• Not easily configured in
the cloud
• UI not intuitive
PROS
• Downloadable version
• Good for medium projects
• You have control (on own
machine)
• Easy to pick out essentials
• Can integrate with Deepcrawl

#pubcon
Screaming Frog Log Analyzer
CONS
• Need to increase RAM
allowance almost always
• Very limited free version
• Again, has limits as it’s not
cloud based
PROS
• Very easy to configure
• Can compare log URIs
with crawl data
• Very similar to Excel
• Some graphics to build
reports
• Overlay GA & GSC

#pubcon
Places to find bot
footprints
• On-server analytics & visitor analytics
screens – e.g. AWStats
• Google search console
• Server logs (raw access logs)
