SMX Advanced Europe, June 2021 - With the advent of new technologies and the massive use of Javascript on the internet, search engines have started using Web Rendering Services to better understand the content of pages on the internet. What are the difficulties in building a WRS? Are tools we use every day replicating what search engines do? In this session, Giacomo will drive you on a discovery journey digging in some techy implementation details of a search engine like web rendering service building process, covering edge cases such as infinite scrolling, iframe, web component, and shadow DOM and how to approach them.
Unblocking The Main Thread Solving ANRs and Frozen Frames
Challenges of building a search engine like web rendering service
1. Challenges of building a search
engine like web rendering service
Giacomo Zecchini | Verve Search
@giacomozecchini
2. Hi, I’m Giacomo
Technical Director at
Technical background and
previous experiences in
development
Love: understanding how things work and Web
Performance
@giacomozecchini
9. We implemented a short term
solution* static rendering the site
and using the user-agent to serve
the right content.
*A sort of customized Rendertron script.
@giacomozecchini
10. After the migration we helped the
client to move to a medium-long
term solution (Prerender.io).
@giacomozecchini
12. My curiosity made me start to
research web rendering services.
@giacomozecchini
13. * Icons made by Freepik from www.flaticon.com
In the past the html was the most important thing to
download in order to access the content of a page
@giacomozecchini
14. Today, JavaScript is a big part of the web and
makes everything more complex
* Icons made by Freepik from www.flaticon.com @giacomozecchini
15. In the past we had a Crawling-Indexing process
Crawler Processing Index
URLs
Crawl
Queue
URL HTML
@giacomozecchini
16. Now, we’ve moved to a
crawling-rendering-indexing process
https://developers.google.com/search/docs/guides/javascript-seo-basics
Crawler Processing Index
Renderer
URLs
Crawl
Queue
URL HTML
Render Queue
@giacomozecchini
17. Google calls the rendering element WRS
Crawler Processing Index
Renderer
URLs
Crawl
Queue
URL HTML
Render Queue
WRS
@giacomozecchini
18. Martin Splitt’s TechSEO Boost 2019 talk
https://www.youtube.com/watch?v=Qxd_d9m9vzo
In his presentation, Martin
covered a lot of interesting
implementation details.
If you’re interested in
Google’s WRS, this is the
presentation to watch.
@giacomozecchini
19. These are three of the most important thing you
can get from a Web Rendering Service
DOM Tree
Render Tree +
Layout
Rendered
HTML
@giacomozecchini
20. DOM Tree & Render Tree
https://developers.google.com/web/fundamentals/performance/critical-rendering-path/render-tree-construction @giacomozecchini
22. The layout information is useful when it comes to:
- Understand the semantics of a page
- Check if a page is mobile friendly
- Find intrusive interstitials
- Understand above the fold content
https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
38. If you need JavaScript console message data you
can use the Mobile Friendly test or Search Console
Live Test but be careful!
@giacomozecchini
39. Mobile-Friendly Test, Search
Console Live Test, AMP Test, and
Rich Results Test are using the
WRS infrastructure, but bypassing
cache, using shorter timeouts,
and few other differences.
@giacomozecchini
40.
41. Hic sunt dracones / Here be Dragons
What follows is based on my own tests and
assumptions, results may be false positives.
Google can change implementation details at any
time without notice, explanation, or justification.
@giacomozecchini
42. How we’ll approach the edge cases
1. Define the edge case
2. Understand Google's WRS support and
behaviour (personal assumption)
3. Check for tools support and behaviour
4. Propose a solution
@giacomozecchini
43. Some of the tested tools
I did multiple tests for each edge case. @giacomozecchini
44. This is not an evaluation of those
tools, but just a comparison
between their results and those of
Google’s WRS.
@giacomozecchini
46. Mixed content occurs when initial
HTML is loaded over a secure
HTTPS connection, but other
resources are loaded over an
insecure HTTP connection.
https://youtu.be/WjMSfTK1_SY?t=239 @giacomozecchini
48. Chrome will automatically upgrade mixed content
from HTTP to HTTPS. If the fetch fails that asset
won’t be loaded.
@giacomozecchini
49. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
* this was not the case until recently
@giacomozecchini
Google’s WRS seems to behave like Chrome
51. Solution
When visiting an HTTPS website, upgrade the
URLs of assets from HTTP to HTTPS.
Using Chromium-based browsers you should
already have the right solution in place.
@giacomozecchini
54. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
Google starts the rendering
using a fixed viewport:
Mobile: 412 X 732
Desktop: 1024 x 1024
@giacomozecchini
55. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
PAGE
HEIGHT
Then calculate the new viewport:
Viewport = Page Height + pixels
The amount of additional pixels
depends on the page, it could be
thousands of pixels
@giacomozecchini
56. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
PAGE
HEIGHT
A bigger viewport
triggers Infinite
loading or lazy loading
events.
@giacomozecchini
57. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
10,000,000 px
This seems to be the maximum viewport height
@giacomozecchini
58. Tools support
95% of tests showed a
different result
* for very tall pages @giacomozecchini
60. Solution
Wait for an event:
onload
DOMContentLoaded
If you’re using puppeteer:
networkidle0
networkidle2
@giacomozecchini
61. Solution
VIEWPORT
PAGE
HEIGHT
Check for the page height
and compare it to the initial
viewport.
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
63. Solution
The simplest solution is then
to wait for X seconds and
stop rendering or check
viewport and Page Height
again.
VIEWPORT
PAGE
HEIGHT
* for more complex solutions you can look at ongoing requests or an
event-based approach. @giacomozecchini
67. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
Google starts the rendering
using a fixed viewport:
Mobile: 412 X 732
Desktop: 1024 x 1024
@giacomozecchini
68. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
PAGE
HEIGHT
Viewport = Page Height + pixels
When the browser starts the
rendering the Page Height is
calculated using the
contain-intrinsic-size
@giacomozecchini
69. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
VIEWPORT
PAGE
HEIGHT
A bigger viewport
makes the browser
rendering the element
affected by size
containment.
@giacomozecchini
70. Tools support
97% of tests showed a
different result
* for very tall pages @giacomozecchini
72. Solution
Wait for an event:
onload
DOMContentLoaded
If you’re using puppeteer:
networkidle0
networkidle2
@giacomozecchini
73. Solution
VIEWPORT
PAGE
HEIGHT
Check for the page height
and compare that to the
initial viewport.
https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-getLayoutMetrics @giacomozecchini
75. Solution
The simplest solution is then
to wait for X seconds and
stop rendering or check
viewport and Page Height
again.
VIEWPORT
PAGE
HEIGHT
@giacomozecchini
78. Google is able to render
and use Shadow DOM content.
N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow. @giacomozecchini
81. Solution
The solution is to get
the DOM tree, traverse it
and serialize it into HTML.
https://www.w3schools.com/js/js_htmldom_navigation.asp @giacomozecchini
86. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
Google is able to render
an Iframe inlining the <body>
content in a <div>.
@giacomozecchini
87. N.B. These are just personal assumptions based on tests.
Tests could be wrong and implementation details may change tomorrow.
If the page included through the
Iframe has a noindex, the content
is not included in the page.
@giacomozecchini
88. Iframe - Tools support
?
95% of tests showed a
different result
@giacomozecchini
89. Solution
Get the DOM tree, traverse it, and serialize it into
HTML.
N.B. When traversing the DOM you only need the
content of <body>, remove other HTML elements
and tags such as the <head>. Remember to check
for the noindex.
@giacomozecchini
94. Sometimes you should reinvent
the wheel.
It’s fun and you can learn a lot
from that!
@giacomozecchini
95. When you change the way you look
at things, the things you look at
change.
Understanding these limitations
should change, in those edge cases,
the advice that you provide.
@giacomozecchini
96. Don’t use tools blindly!
Tools are great and save us a huge
amount of time in all our tasks. The
majority of pages on the web are
not affected by those edge cases.
@giacomozecchini
97. If your website uses or is affected
by one of the mentioned edge
cases, you can open a support
ticket to check with your tool
provider if they are already covering
that.
@giacomozecchini