This is a great approach, but detecting the user-agent is the wrong way to decide if you should pre-render the page. If you include the following meta tag in the header:
<meta content="!" name="fragment">
then Google will request the page with the "_escaped_fragment_" query param. That's when you should serve the pre-rendered version of the page.
Waiting for Google to request the page with _escaped_fragment_ should also prevent you from getting penalized for slow load times or for showing Googlebot different content.
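To make that concrete, here's a rough sketch of what answering those requests could look like in an Express app - serveSnapshot is a made-up stand-in for however you store the pre-rendered HTML:

    // Sketch only: answer _escaped_fragment_ requests with pre-rendered HTML.
    var express = require('express');
    var app = express();

    function serveSnapshot(path, cb) {
      // placeholder: look up the pre-rendered HTML however you store it
      cb(null, '<html><body>snapshot for ' + path + '</body></html>');
    }

    app.use(function (req, res, next) {
      var fragment = req.query._escaped_fragment_;
      if (fragment !== undefined) {
        // Google rewrites example.com/#!/about as
        // example.com/?_escaped_fragment_=/about before crawling it.
        return serveSnapshot(fragment || '/', function (err, html) {
          if (err) return next(err);
          res.send(html); // the pre-rendered version, crawlers only
        });
      }
      next(); // everyone else gets the normal JS app
    });

    app.listen(3000);

The point is that the branch keys off the query parameter Google adds, not off the User-Agent string.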
Considering the intention behind the Google document is to enable support for existing Ajax applications, not to serve as the cornerstone of crawlability for newly built apps, probably not.
"If you're starting from scratch, one good approach is to build your site's structure and navigation using only HTML. Then, once you have the site's pages, links, and content in place, you can spice up the appearance and interface with AJAX. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your AJAX bonuses."
No, it is not. While this will certainly help your client-side app get indexed, it is not 'great'. Other commenters on this thread bring up a number of valid concerns, but in my mind it comes down to two very simple things.
One is that when you are fighting for the top spot in organic traffic, this won't cut it. Off-page SEO is more important than on-page optimizations, but on-page optimizations still have value.
The other issue is that this approach assumes that the client-side rendered view at a particular hash is exactly what should be initially rendered on the server side. While this could work in some cases, in my experience it either creates a weird user experience or you end up doing hacks on the client side to ensure PhantomJS captures the right HTML.
This is a fine solution for some use cases, but I really hope that the community doesn't think this is the future. This is a temporary hack until we get a good server/client rendering framework in place OR all search engines evolve to capture pure client side apps without any of this.
There definitely are valid concerns with this approach...but until the search engines pick up the slack, this is the solution that we're left with. It's definitely not ideal, but I prefer it over having non-DRY code just to serve incomplete HTML to crawlers.
The alternative solution is pushState, with an acceptance that you will need to either duplicate some effort on server-side rendering or create something along the lines of Airbnb's Rendr framework. I am in the process of doing the former with the plan to do the latter in a future iteration.
This is a great point. It might seem extreme, but I would advocate never using the User-Agent string to make decisions about what to serve a client. There is too much hackery and history that clouds up the User-Agent (such as every browser identifying itself as Mozilla), and it's almost always a proxy for something else that you actually want to test for.
In some rare situations, it's unavoidable, but even then I'd urge trying to rearchitect the solution to avoid it.
That makes sense, though in an ideal world there would be an abstracted header that said "hey, I'm not gonna render JS the way a regular browser will, so send me something prerendered". Then you could write something that would actually be future-proof and work with other search engines.
The approach Google suggests actually seems a little bit nefarious, as it's hard-coded to Google instead of working for any search engine.
You do realize that the link you gave is from 2006? And more recent recommendations do not include that.
[EDIT] OK as I was downvoted I will clarify my point: https://developers.google.com/webmasters/ajax-crawling/docs/... This is recommended practice for crawling javascript generated pages, no need to lookup for spiders IP address as someone mentioned.
The escaped fragment is used by the major search engines: Google, Bing, and Yandex (if you are interested in Russian markets). Not sure about Yahoo!, as it's been ages since I used Yahoo for anything.
Rendering different content based on user agent is tempting the webspam gods. Rendering nothing but a big gob of javascript to non-googlebot user agents is a recipe to get the banhammer dropped on your head.
You're either gambling that Google is smart enough to know that your particular big gob of javascript isn't cloaking keyword spam (in which case you should just depend on their JS evaluation, since you already are, implicitly), or you're gambling that they won't bust you even though your site looks like a classic keyword stuffer.
"JavaScript: Place the same content from the JavaScript in a <noscript> tag. If you use this method, ensure the contents are exactly the same as what’s contained in the JavaScript, and that this content is shown to visitors who do not have JavaScript enabled in their browser."
This was done for years with Flash sites and I never saw Google black list anyone doing it legitimately.
You can also provide different content if you want the content to be behind a pay wall, although personally I find this is a little annoying.
Placing the text content on the same page in a <noscript> tag is entirely different than rendering different content based on user agent. That's what the noscript tag is meant to do -- Google is telling you to follow best practices for fallback to non-JS browsers.
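In markup terms the guideline amounts to something like this (loadNews and the URLs are purely illustrative):

    <div id="news"></div>
    <script>
      // client-side render of the latest headlines into #news
      // (loadNews is whatever your app uses to fetch and render them)
      loadNews('#news');
    </script>
    <noscript>
      <!-- the same headlines, already in the HTML, for non-JS browsers and crawlers -->
      <ul>
        <li><a href="/news/1">Headline one</a></li>
        <li><a href="/news/2">Headline two</a></li>
      </ul>
    </noscript>

Every visitor gets the same document; which part renders just depends on whether JS runs.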
Google does have a section within their guidelines on creating "HTML Snapshots". "If a lot of your content is created in JavaScript, you may want to consider using a technology such as a headless browser to create an HTML snapshot." https://developers.google.com/webmasters/ajax-crawling/docs/...
Did you read that page, or did you just skim it? They're telling you to use a "headless browser" as one possible (clunky) way of responding to _escaped_fragment_ requests, which is a workaround wherein you put a special tag in your original page to tell the googlebot to make another request to get a static version of the page.
Using _escaped_fragment_ is not the same thing as rendering different content based on user agent.
Hiding keyword spam behind JS doesn't make any sense in this situation - the whole point is that the JS isn't being served to Google. That's who keyword spammers are trying to fool, not actual humans.
It works either way. If I'm allowed to serve a big blob of javascript to my users and not to the googlebot, I load up my googlebot page with keyword content, and use the JS blob to render my "secret" page to real users.
This will get you penalized for having a website that takes forever to load. This is what happens:
Googlebot requests page -> your webapp detects googlebot -> you call remote service and request that they crawl your website -> they request the page from you -> you return the regular page, with js that modifies its look and feel -> the remote service returns the final html and css to your webapp -> your webapp returns the final html and css to Googlebot. That's gonna be just murder on your load times.
If this must be done, for static pages, it should be done by grunt during build time, not by a remote service. For dynamic content, it's best to do the phantomjs rendering locally, and on an hourly (or so) schedule, since it doesn't really matter if googlebot has the latest version of your content.
Or perhaps I'm mistaken and the node module actually calls the service hourly or so and caches results in your app, so it doesn't actually call the service during Googlebot crawls. If that's the case, I take back my objections, but I'd recommend updating the website to say as much.
If it doesn't cache, then besides latency, someone could send fake googlebot requests and overload the prerender service, which is unlikely to be able to handle a lot of traffic.
Best case scenario you still have network trips going out to the service, so it's still not a great solution UNLESS the caching was done by your webapp - which is what I spoke about at the end of my comment above.
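For what it's worth, webapp-level caching doesn't have to be much code. A rough sketch, with renderWithPhantom and isCrawler as stand-ins for a local PhantomJS run and whatever bot detection you use:

    // Sketch: refresh snapshots on a schedule and serve crawlers only from cache,
    // so a bot request never triggers a live render or a remote round trip.
    var express = require('express');
    var app = express();

    var snapshots = {};           // url -> pre-rendered HTML
    var PAGES = ['/', '/about'];  // pages worth snapshotting

    function isCrawler(req) {
      // placeholder bot check - however you decide a request is a crawler
      return /googlebot|bingbot/i.test(req.get('User-Agent') || '');
    }
    function renderWithPhantom(url, cb) {
      // placeholder: shell out to a local phantomjs script and return its HTML
      cb(null, '<html><body>snapshot of ' + url + '</body></html>');
    }

    function refreshSnapshots() {
      PAGES.forEach(function (url) {
        renderWithPhantom(url, function (err, html) {
          if (!err) snapshots[url] = html;
        });
      });
    }
    refreshSnapshots();
    setInterval(refreshSnapshots, 60 * 60 * 1000); // hourly is plenty for crawlers

    app.use(function (req, res, next) {
      if (isCrawler(req) && snapshots[req.path]) {
        return res.send(snapshots[req.path]); // cached HTML, no extra latency
      }
      next();
    });

    app.listen(3000);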
Unless this works w/o adding network roundtrips on each request, it's not a great idea.
I think the Unix philosophy of "do one thing and do it well" applies here. There are already off-the-shelf caching solutions that do what you describe: for example, with Varnish you can serve cached pages immediately and update the cache contents in the background.
It would probably be better to use those than reimplement them in an uber-webapp.
An entire project written to simulate progressive enhancement (badly). One that only works for specified whitelisted User-Agents, instead of being based on capability.
I'm also not understanding the use-case for this project. Every time the topic of "Web Apps", "JavaScript Apps", or "Single page web apps" comes up, evangelists point out that they are applications (or skyscrapers), not just fancy decorators for website content.
So exactly what is this project delivering as fallback content? A server-generated website?
This project just seems pointlessly backwards. Simulating a feature that the JavaScript framework has already deliberately broken. One that introduces a server-side dependency on a project deliberately chosen not to have a server-side framework.
This just looks like a waste of effort, when building the JavaScript application properly the first time, with progressive enhancement, covers this exact use-case, and far, far more use-cases.
The time would have been better spent fixing these evidently broken JavaScript frameworks - Angular, Ember, Backbone. Or at least fixing the tutorial documentation to explain how to build Web things properly. (This stuff isn't difficult, it just requires discipline.)
I call hokum on people saying there's a difference between Websites and Web apps (or the plethora of terms used to obfuscate that: Single-page apps, JavaScript apps). This project proves that these are just Websites, built improperly, and this is the fudge that tries to repair that for Googlebot.
Why some developers are so against progressive enhancement mystifies me. It is an elegant solution that actually works in all cases rather than an ugly hack that should probably work in the majority of cases. How can there even be a dispute about it? It's insane!
For the vast majority of websites, this is true and I agree.
However, there are now websites that are more akin to applications than to the traditional website of yore and that are meaningless without Javascript. This is not a bad thing; it is just the opening up of the web platform to new opportunities.
Some of these emerging applications may want some of their content to be searchable and so I argue that the posted solution is solving a real problem and is of value to users.
What would you do if you required SEO enhancement AND dynamic loading of content? Are you supposed to just let that portion of the site go without indexing? Surely there are sites that have both requirements.
Progressive enhancement. It is a web development best practice.
You will find that "dynamic loading of content" doesn't automatically mean "no content served by HTML under any circumstances". This is an error perpetuated by these JavaScript-only frameworks.
For example, bustle.com - there is absolutely no customer experience reason for the website not to have the content loaded at the HTML layer and then progressively enhanced with the customer experience additions. The content here isn't tied exclusively to the behaviour layer.
"The content here isn't tied exclusively to the behaviour layer."
Can you elaborate on a situation where the content and the behaviour are tied together, and what you would do in that case?
From my understanding, Facebook's BigPipe loads content in modules to reduce user perceived latency. If I'm building X site and wanted that same behavior (since I've heard on several occasions that there is a direct correlation between page load times and user engagement), is my only option to sacrifice SEO?
Facebook's BigPipe is nothing more than a client-side hack to work around a limitation of their server-side architecture.
Both Yahoo and Amazon - that I personally know of - have an infrastructure where components on the page are rendered separately and in parallel, and are stitched together on the HTML layer. The render time is then down to the rendering time of the slowest component, or the slowest dependency chain of components.
Loading content in with JavaScript after the HTML page load is always going to be slower, and perceivably so - look at both Twitter and AirBnb, both have written about how much faster they get content to the user using progressive enhancement.
If you decide that the HTML layer isn't the right layer for content, you are working against the strengths of the Web. And of course, that leads down a path where you are sacrificing SEO, sacrificing robustness.
Your time is better spent figuring out why it takes your server too long to generate content, and put in steps to reduce the server side labour.
The JavaScript include approach isn't quicker. Bustle.com for example, takes 10 seconds to show the first page - that's horrific.
Okay, I'm sold on HTML being where the content should be on page load. My next question is are there any frameworks that can assist with this stitching together of content? I'm afraid my ignorance is showing here, but I can't off the top of my head list any.
The web paradigm that I've grown up with is the single-threaded dynamic content generation one, most recently using MVC, but with any server-side logic. The concept of parallel rendering of content and a "stitching" together of HTML is new to me.
I'm also curious as to the best practices surrounding page linking when the behavior specifies something like no screen flash. It seems like all that content (like, the whole page) would have to be loaded with AJAX.. then you're right back where you started with loading content with JS. Maybe it's forgivable as long as the initial page load returns a complete set of content?
Perhaps there is a place where I can learn more about the logistics of progressive enhancement.
Thanks for being willing to answer my questions about this. It's something I've always wanted hashed out in my head from an opponent of these frameworks.
"are there any frameworks that can assist with this stitching together of content"
Mainstream, no. These are not typical use-cases for sites until they reach a gigantic scale.
Plus, even before you get to that level, there's heaps you can do with caching at different levels, pre-calculating, pre-generating. So many optimisations at various levels of your stack, and then there's scaling across hardware. Wordpress and Wikipedia don't need parallelisation of HTML components yet.
"The concept of parallel rendering of content and a "stitching" together of HTML is new to me."
It was new to me until I joined Yahoo. The approach is to break a page down into modular/independent sections, run some sort of parallelisation process, and when all the responses are received, render the page with those generated components.
There's probably a variety of hacks in each major platform that will allow things to be parallelised. If you want to parallelise at the HTTP level, then curl multi-get is an option: http://php.net/manual/en/function.curl-multi-init.php
It's probably possible to cobble together something with node.js too. Node.js receives the HTTP request, turns that into a series of service calls that return HTML, makes those calls asynchronously, waits for them all to return (this is where Node.js excels), then renders an HTML page skeleton, replacing placeholders with the responses from the various services. With a decent promise library that waits for a number of calls to finish, this is quite a compact approach, I guess.
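Something like the following is what I have in mind - only a sketch, with made-up internal component URLs, and the finished page just logged rather than handed to a real request handler:

    // Sketch of the "stitching" idea: fetch page components in parallel and
    // drop them into an HTML skeleton once they have all arrived.
    var http = require('http');

    function fetchHtml(url) {
      return new Promise(function (resolve, reject) {
        http.get(url, function (res) {
          var body = '';
          res.on('data', function (chunk) { body += chunk; });
          res.on('end', function () { resolve(body); });
        }).on('error', reject);
      });
    }

    // Hypothetical component services, each returning an HTML fragment.
    Promise.all([
      fetchHtml('http://components.internal/header'),
      fetchHtml('http://components.internal/article/42'),
      fetchHtml('http://components.internal/sidebar')
    ]).then(function (parts) {
      var page = '<html><body>{header}{article}{sidebar}</body></html>'
        .replace('{header}', parts[0])
        .replace('{article}', parts[1])
        .replace('{sidebar}', parts[2]);
      // total time is roughly the slowest component, not the sum of all of them
      console.log(page); // hand this to your HTTP response in a real handler
    }).catch(console.error);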
Almost anything that allows asynchronous operations that uses resources outside of the current request handler can be fashioned into a component parallelisation stack.
"I'm also curious as to the best practices surrounding page linking when the behavior specifies something like no screen flash."
No screen flash is impossible, due to the nature of the Web. The browser has to receive the HTML first before it can know what dependent resources are needed. The problem you are trying to solve here is to minimise the perceived time between the HTML arriving, and enough of the CSS to load in for an incremental render to paint a close-enough-to-look-complete rendering.
Loading the content after the CSS is one way of doing that. Which replaces the screen-flash delay with a blank screen. That's the JavaScript-app approach.
I don't like that, because it delays the appearance of content.
The perception of screenflash can be minimised, mostly by decreasing the amount of traffic crossing the wire until a good enough repaint can happen. There are various tricks and hacks for minimising this, but due to the nature of the web they cannot be completely eliminated using Web technologies. They can be replaced with other issues.
Tricks I'd consider are reducing the amount of CSS needed to render the page, and breaking the CSS up into a primary rendering and a secondary, more detailed rendering. The primary rendering is just a basic layout plus main-element styling. Perhaps a careful inline style or two, and judicious display: nones and overflow: hiddens to minimise page assets moving around as incremental CSS rendering happens. Also, if you want to get serious, techniques for deferring CSS, JavaScript and images of content outside the current viewport are an option. Yahoo loved deferring the loading of avatar images in an article's comments area till after onload. I see that technique used in tech publication websites, though I can't remember off the top of my head a site that did this.
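The avatar-deferral trick, roughly - the markup and attribute names here are just illustrative:

    <!-- Images below the fold ship with a placeholder src and the real URL in data-src -->
    <img class="avatar" src="blank.gif" data-src="/avatars/123.jpg" alt="">

    <script>
      // After onload, swap in the real images so they don't compete with the
      // content and CSS needed for the first paint.
      window.addEventListener('load', function () {
        var imgs = document.querySelectorAll('img[data-src]');
        for (var i = 0; i < imgs.length; i++) {
          imgs[i].src = imgs[i].getAttribute('data-src');
        }
      });
    </script>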
"Perhaps there is a place where I can learn more about the logistics of progressive enhancement."
The process is about thinking about a site one layer at a time. Get it functional at each level: HTML with links and form posts, CSS presentational level, JavaScript enhancements and usability improvements. Like building a skyscraper, you get the foundation right first.
But before that, it takes understanding as to what are the core use-cases for the site. This is about tasks a visitor can complete. Something that's tied into key product indicators and metrics. I doubt bustle.com use page loading performance as their primary business success factor. It is more likely to be about customer activity - how long did they visit, how many articles, any social activity.
It's figuring out the primary services and functionality of the site, and building that to not rely on JavaScript, or CSS. Primary services are those that, if you didn't provide them, you'd have no business.
Secondary functionality and use-cases -- those that complement or are related to primary functionality -- can be argued on a case-by-case basis as to whether a quick Ajax solution is sufficient. But most of the time, when you get progressive enhancement right, it becomes just a natural development technique.
If you are able to "pre-render" a JavaScript app like this, then you should be serving users the pre-rendered version and then enhancing it with JavaScript after onload.
JavaScript-only apps are a blight on the web. All it takes is a bad SSL cert, or your CDN going down, and your pages become useless to the end-user.
Apologies for being vague. Regarding the SSL certificate, I was referring to modern browsers refusing to load "unsafe" assets.
When the JS can't load, JS-heavy apps tend to either be raw templates (i.e. full of {{ statements }}) or completely blank (if the templates were going to be loaded in a separate request). As Isofarro said, non-JS pages don't suffer from this because the content is there in plain HTML.
Google doesn't like it when they are shown different content than a browsing user. This is roughly the equivalent of pointing Google Agent to a copy of the page requested that happens to be in Memcached instead of spinning up the full app stack to do the render.
Not a technically different page, specifically different content. Serving different pages to Google is fine, as long as they contain the same primary content that the real pages do. That's the whole point - so you can serve prerendered pages to Google but still have a JS-based frontend for the actual users.
AJAX sites often lazy load in content later. My point is the page delivered initially is not the same as the static version content wise or technically.
>Static rendering of dynamic content? I don't think this does make sense.
Bro do you even Web 1.0? That's what CGI scripts in Perl did! Pull the data from the database, generate HTML (no JavaScript back then!) on the fly, and send to the browser.
I was under the impression that Googlebot already executes javascript on pages.
A more interesting idea would be to do this for every user - prerender the page and send them the result, so they don't have to do the first, heavy JS execution themselves. I know it sounds a bit backwards at first - you're basically using JavaScript as a server-side page renderer - but think about this: you can choose to prerender or not to prerender based on the user agent string -- do it for people on mobile phones, but not for desktop users. You can write your entire site with just client-side page generation in JavaScript and let it run client-side at first, then switch to server-side prerendering once you have better hardware.
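As a sketch of that idea - every helper here is a placeholder, not a real library:

    // Sketch: server-render for mobile user agents, ship the plain JS app shell
    // to everyone else.
    var express = require('express');
    var app = express();

    function appShell() {
      // the usual near-empty HTML plus the script tag that boots the client-side app
      return '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';
    }
    function renderAppToHtml(url, cb) {
      // placeholder for running your client-side rendering code on the server
      cb(null, '<html><body>server-rendered ' + url + '</body></html>');
    }

    app.get('*', function (req, res) {
      var ua = req.get('User-Agent') || '';
      if (/Mobile|Android|iPhone/i.test(ua)) {
        renderAppToHtml(req.url, function (err, html) {
          res.send(err ? appShell() : html);
        });
      } else {
        res.send(appShell()); // the desktop browser runs the JS itself
      }
    });

    app.listen(3000);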
Something similar to that, albeit slightly more elegant, is the work that AirBnB has done with their rendr [0] project, which serves prerendered content that's then rerendered with JS if it needs to be changed. You can do similar things with non-Backbone stacks, of course.
It's a multipage app, that uses ajax to function as a singlepage app. From the user's point of view it's a singlepage app, but it's accessible from any of the URLs that it pushStates to, so it's like the best of both worlds. It's fully crawlable because it functions as a multipage app, but it's got the speed of a singlepage app (if your browser supports ___pushState)
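The client side of that pattern is only a few lines - sketched here with a made-up data-pjax attribute, an assumed #content container, and the X-PJAX header that pjax-style servers conventionally check for:

    // Sketch: intercept internal link clicks, fetch the new content over XHR,
    // swap it into the page, and keep the address bar on a real, crawlable URL.
    document.addEventListener('click', function (e) {
      var link = e.target.closest('a[data-pjax]');
      if (!link) return;
      e.preventDefault();

      var xhr = new XMLHttpRequest();
      xhr.open('GET', link.href);
      xhr.setRequestHeader('X-PJAX', 'true'); // ask the server for a fragment only
      xhr.onload = function () {
        document.querySelector('#content').innerHTML = xhr.responseText;
        history.pushState({}, '', link.href); // URL stays shareable and crawlable
      };
      xhr.send();
    });
    // (A real version would also listen for popstate so the back button works.)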
I tried using PhantomJS in the past to server-side render a complex Backbone application for SEO, and it was taking over 15 seconds to return a response (which is bad for SEO).
Looking at prerender's source I didn't see any caching mechanism.
What kind of load times have you seen rendering your apps?
Have there been recent significant improvements in PhantomJS's performance?
You can get it faster than 15 seconds, but you can't really get it fast enough. We precache everything. I would strongly recommend against trying to process the pages in realtime.
I have been looking for something like this for a long time. Seems very straight forward.
I have not tested it yet, but I wonder if the render speed will penalize you in the Google results. Seems like a separate machine with a good CPU might be worthwhile if you are going to run this.
They execute JavaScript in a limited fashion, so you should consider using what is suggested by Google itself: https://developers.google.com/webmasters/ajax-crawling/docs/... If you are using Angular, you will otherwise get your template displayed instead of the fully rendered page, with all the {{sitename}} placeholders showing.
It's a SaaS which is much more elaborate than this project (there's a year of development in it). We serve and crawl thousands of pages every day without any issues.
If you're using Rails have a look at https://github.com/seojs/seojs-ruby, it's a gem similar to prerender but it's using our managed service at http://getseojs.com/ to get the snapshots. There are also ready to use integrations for Apache and Nginx.
Some benefits of SEO.js to other approaches are:
- it's effortless, you don't need to set up and operate your own PhantomJS server
- snapshots are created and cached in advance so the search engine crawler won't be put off by slow page loads
I recently needed to do this for Google, but I wanted the rendering and delivery of the page to be under 500ms, so I hacked up something that works with Express.
It uses PhantomJS but removes all the styles initially so the rendering time is much faster (my Ember app was averaging 70ms to render, but I prefetch the page data).
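Roughly, the style-stripping boils down to PhantomJS's resource hook - a simplified sketch, not the actual code, with a made-up local URL:

    // Abort stylesheet and image requests so the page reaches a renderable DOM
    // faster; for a snapshot we only care about the HTML.
    var page = require('webpage').create();

    page.onResourceRequested = function (requestData, networkRequest) {
      if (/\.(css|png|gif|jpe?g)(\?|$)/i.test(requestData.url)) {
        networkRequest.abort();
      }
    };

    page.open('http://localhost:3000/#!/some-route', function (status) {
      if (status !== 'success') { return phantom.exit(1); }
      // give the app a moment to finish its XHRs and render
      window.setTimeout(function () {
        console.log(page.content); // the rendered HTML, minus the styles we skipped
        phantom.exit();
      }, 300);
    });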
Looks like that's exactly what Meteor's spiderable package has done since 08/2012 [0]: look at the user-agent, run PhantomJS for up to 10s, and return a rendered page once a Google/Facebook crawler is detected.
With each user-interaction that updates a page fragment it modifies the address in the browser's address bar to correspond to the current state. If somebody were to copy and paste that URL into a new tab, your site would load the complete interface if you've structured your back-end code correctly.
You do this by building in logic to the part of the code that outputs your view to see whether the request is coming as a PJAX request, or not. If it is, you output the page-fragment, which is then added to your existing DOM. If it's not a PJAX request, your back-end outputs the entire code for your site.
There's a limitation to PJAX where you can only update one fragment at a time, though PJAXR seems to address that limitation by providing support for updating multiple-fragments simultaneously. Either way, you get the huge advantage of having a fully-crawlable site without needing to integrate pre-rendering work-arounds for search-engine compatibility.
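On the server side it's just a header check - a rough Express sketch (loadArticle and the markup are placeholders):

    // Sketch: the same route serves either a fragment or the full page,
    // so every URL stays crawlable and directly loadable.
    var express = require('express');
    var app = express();

    function loadArticle(id) {
      // placeholder: fetch from your database
      return { title: 'Article ' + id, body: '<p>...</p>' };
    }

    app.get('/articles/:id', function (req, res) {
      var article = loadArticle(req.params.id);
      var fragment = '<article><h1>' + article.title + '</h1>' + article.body + '</article>';
      if (req.get('X-PJAX')) {
        res.send(fragment); // PJAX request: just the piece that gets swapped in
      } else {
        res.send('<html><body><div id="content">' + fragment + '</div></body></html>');
      }
    });

    app.listen(3000);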
I'm confused, search indexing isn't a realtime exercise... Why would performance be an issue? Running a headless browser vs running "whatever it is they run that can execute JS" doesn't seem like a huge leap...
At Google's scale any performance drop can have massive implications. If Google's crawl rate is 100 million[1] pages a day, then a 1% drop in crawl rate means 1 million fewer pages crawled per day (which has many implications, for example having to use more compute power to regain crawl rate, which raises costs, etc.)
You are right that it is not a real-time exercise but they do have crawl targets.
You cannot be flippant about "Why don't they just do X" when scale is that big.
[1] Picked out of the air but probably in the right magnitude (or even a little small)
The point is that Google probably doesn't have a lot of cycles to spare - anything else wouldn't be good business sense.
Anything that significantly adds to the load will lose them money - whether or not the operation needs to be realtime is secondary to that.
I apologise for giving offense: I wrote the comment the same way I would have made it face-to-face, which is always a bit risky in a purely textual medium.
I don't know if you are trying to be serious at this point or not. Google has millions (literally) of machines with dozens of cores each. Search is their business that makes all the money.
Google executes JavaScript and renders the full DOM for every page internally. They generate full length screenshots of every page and have pointers to where text appears on the page so they can do highlighting of phrases within the screenshot.
It isn't even a debatable question if Google reuses the Chrome engine to do this.
This is a really great idea! I mean, now data in apps made with JS can be searched. My question is: can we add "Search with Google" to our JavaScript app then?
I surf with JavaScript turned off and I see just a blank page. If it's "crawlable" I certainly expect it to be visible to me without turning JavaScript on.
Why would you do this? Genuinely interested. Do you browse the web with JavaScript turned off the majority of the time or just in this particular example?
I keep JavaScript turned off by default. Then I turn it on for only a few sites of critical importance for me which would not function otherwise. And I don't feel I miss anything, most of the content I care about is still HTML and it should remain so. JavaScript is not needed to show me the text.
That way the chances of cross-site scripting attacks are greatly reduced and the content appears much faster.
pushState for the win. Done with hash fragments, not going back to that mess. pushState is quick and easy to implement; I don't see a reason to overcomplicate.
>Javascript apps can be fully crawlable
yes, and i think it's cool that you try to provide a solution as a service for this.
but as with every technology, there are some tradeoffs
a) serving google a different response based on the user-agent is the definition of cloaking (it's not misleading or malicious cloaking, but it's cloaking nonetheless)
b) you hardcode a dependency on a third-party server - one you have no control over - into your app (and from the sample code on the page, there is no fallback available if this server is down)
c) there are latency/web-performance issues, i.e. for a first-time request by a search engine the roundtrip would look like so:
[googlebot GET for page -> googlebot detected -> app GET to prerender.io -> prerender.io GET to page -> app delivers page -> prerender.io returns page to app -> app returns page to googlebot]
this will always be slower than
[googlebot GET for page -> app returns page to googlebot]
so basically the prerender.io approach creates some issues. that said, we don't have - yet - another "no trade-off" solution.
this is a no go if you have a big site with hundreds of thousands to millions of pages.
and there is another much, much bigger issue:
* showing JS clients
* and "other only-partially-JS clients" (google parses some JS) different responses
just does not work in the long run.
why? if there is no direct feedback, then there is no direct feedback!
non-responsive mobile sites currently offer an overall poor user experience. why? because all the guys working on the site sit in front of their fat office desktops. no feedback equals crap in the long run.
and it's worse for "for robots only" views, because people just don't have to live with the crap their server spits out, as they always just see the fancy JS versions. since the hashbang ajax-crawlable spec came out I've consulted some clients on this question, and everyone who chose the _escaped_fragment_ road anyway did regret it later on. even if the first iteration works, 1000 rollouts later it doesn't - if there is no direct feedback, then there is no direct feedback.
conclusion: if you have a big site and want to do big-scale (lots of pages) SEO, you are stuck with landing pages and delivering HTML + content via the server + progressive enhancement for functionality, until the day google gets its act together.
and for first-view web performance I recommend the progressive enhancement approach anyway, too.
Google has documentation on this here: https://developers.google.com/webmasters/ajax-crawling/docs/... and we've been using this method at https://circleci.com for the past year.