This is a great approach, but detecting the user-agent is the wrong way to decide if you should pre-render the page. If you include the following meta tag in the header:
<meta content="!" name="fragment">
then Google will request the page with the "_escaped_fragment_" query param. That's when you should serve the pre-rendered version of the page.
Waiting for Google to request the page with _escaped_fragment_ should also prevent you from getting penalized for slow load times or for showing Googlebot different content.
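To make that concrete, here's a rough sketch of what answering those requests could look like in an Express app - serveSnapshot is a made-up stand-in for however you store the pre-rendered HTML:

    // Sketch only: answer _escaped_fragment_ requests with pre-rendered HTML.
    var express = require('express');
    var app = express();

    function serveSnapshot(path, cb) {
      // placeholder: look up the pre-rendered HTML however you store it
      cb(null, '<html><body>snapshot for ' + path + '</body></html>');
    }

    app.use(function (req, res, next) {
      var fragment = req.query._escaped_fragment_;
      if (fragment !== undefined) {
        // Google rewrites example.com/#!/about as
        // example.com/?_escaped_fragment_=/about before crawling it.
        return serveSnapshot(fragment || '/', function (err, html) {
          if (err) return next(err);
          res.send(html); // the pre-rendered version, crawlers only
        });
      }
      next(); // everyone else gets the normal JS app
    });

    app.listen(3000);

The point is that the branch keys off the query parameter Google adds, not off the User-Agent string.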
Considering the intention behind the Google document is to enable support for existing Ajax applications, not to serve as the cornerstone of crawlability for newly built apps, probably not.
"If you're starting from scratch, one good approach is to build your site's structure and navigation using only HTML. Then, once you have the site's pages, links, and content in place, you can spice up the appearance and interface with AJAX. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your AJAX bonuses."
No, it is not. While this will certainly help your client-side app get indexed, it is not 'great'. Other commenters on this thread bring up a number of valid concerns, but in my mind it comes down to two very simple things.
One is that when you are fighting for the top spot in organic traffic, this won't cut it. Off-page SEO is more important than on-page optimizations, but on-page optimizations still have value.
The other issue is that this approach assumes that the client-side rendered view at a particular hash is exactly what should be initially rendered on the server side. While this could work in some cases, in my experience it either creates a weird user experience or you end up doing hacks on the client side to ensure PhantomJS captures the right HTML.
This is a fine solution for some use cases, but I really hope that the community doesn't think this is the future. This is a temporary hack until we get a good server/client rendering framework in place OR all search engines evolve to capture pure client side apps without any of this.
There definitely are valid concerns with this approach...but until the search engines pick up the slack, this is the solution that we're left with. It's definitely not ideal, but I prefer it over having non-DRY code just to serve incomplete HTML to crawlers.
The alternative solution is pushState, with an acceptance that you will need to either duplicate some effort on server-side rendering or create something along the lines of Airbnb's Rendr framework. I am in the process of doing the former with the plan to do the latter in a future iteration.
This is a great point. It might seem extreme, but I would advocate never using the User-Agent string to make decisions about what to serve a client. There is too much hackery and history that clouds up the User-Agent (such as every browser identifying itself as Mozilla), and it's almost always a proxy for something else that you actually want to test for.
In some rare situations, it's unavoidable, but even then I'd urge trying to rearchitect the solution to avoid it.
That makes sense, though in an ideal world there would be an abstracted header that said "hey, I'm not gonna render JS the way a regular browser will, so send me something prerendered". Then you could write something that would actually be future-proof and work with other search engines.
The approach Google suggests actually seems a little bit nefarious, as it's hard-coded to Google instead of working for any search engine.
You do realize that the link you gave is from 2006? And more recent recommendations do not include that.
[EDIT] OK as I was downvoted I will clarify my point: https://developers.google.com/webmasters/ajax-crawling/docs/... This is recommended practice for crawling javascript generated pages, no need to lookup for spiders IP address as someone mentioned.
The escaped fragment is used by the major search engines: Google, Bing, and Yandex (if you are interested in Russian markets). Not sure about Yahoo!, as it's been ages since I used Yahoo for anything.
Rendering different content based on user agent is tempting the webspam gods. Rendering nothing but a big gob of javascript to non-googlebot user agents is a recipe to get the banhammer dropped on your head.
You're either gambling that Google is smart enough to know that your particular big gob of javascript isn't cloaking keyword spam (in which case you should just depend on their JS evaluation, since you already are, implicitly), or you're gambling that they won't bust you even though your site looks like a classic keyword stuffer.
"JavaScript: Place the same content from the JavaScript in a <noscript> tag. If you use this method, ensure the contents are exactly the same as what’s contained in the JavaScript, and that this content is shown to visitors who do not have JavaScript enabled in their browser."
This was done for years with Flash sites and I never saw Google black list anyone doing it legitimately.
You can also provide different content if you want the content to be behind a pay wall, although personally I find this is a little annoying.
Placing the text content on the same page in a <noscript> tag is entirely different than rendering different content based on user agent. That's what the noscript tag is meant to do -- Google is telling you to follow best practices for fallback to non-JS browsers.
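In markup terms the guideline amounts to something like this (loadNews and the URLs are purely illustrative):

    <div id="news"></div>
    <script>
      // client-side render of the latest headlines into #news
      // (loadNews is whatever your app uses to fetch and render them)
      loadNews('#news');
    </script>
    <noscript>
      <!-- the same headlines, already in the HTML, for non-JS browsers and crawlers -->
      <ul>
        <li><a href="/news/1">Headline one</a></li>
        <li><a href="/news/2">Headline two</a></li>
      </ul>
    </noscript>

Every visitor gets the same document; which part renders just depends on whether JS runs.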
Google does have a section within their guidelines on creating "HTML Snapshots". "If a lot of your content is created in JavaScript, you may want to consider using a technology such as a headless browser to create an HTML snapshot." https://developers.google.com/webmasters/ajax-crawling/docs/...
Did you read that page, or did you just skim it? They're telling you to use a "headless browser" as one possible (clunky) way of responding to _escaped_fragment_ requests, which is a workaround wherein you put a special tag in your original page to tell the googlebot to make another request to get a static version of the page.
Using _escaped_fragment_ is not the same thing as rendering different content based on user agent.
Hiding keyword spam behind JS doesn't make any sense in this situation - the whole point is that the JS isn't being served to Google. That's who keyword spammers are trying to fool, not actual humans.
It works either way. If I'm allowed to serve a big blob of javascript to my users and not to the googlebot, I load up my googlebot page with keyword content, and use the JS blob to render my "secret" page to real users.
This will get you penalized for having a website that takes forever to load. This is what happens:
Googlebot requests page -> your webapp detects googlebot -> you call remote service and request that they crawl your website -> they request the page from you -> you return the regular page, with js that modifies its look and feel -> the remote service returns the final html and css to your webapp -> your webapp returns the final html and css to Googlebot. That's gonna be just murder on your load times.
If this must be done, for static pages, it should be done by grunt during build time, not by a remote service. For dynamic content, it's best to do the phantomjs rendering locally, and on an hourly (or so) schedule, since it doesn't really matter if googlebot has the latest version of your content.
Or perhaps I'm mistaken and the node module actually calls the service hourly or so and caches results in your app, so it doesn't actually call the service during Googlebot crawls. If that's the case, I take back my objections, but I'd recommend updating the website to say as much.
If it doesn't cache, then besides latency, someone could send fake googlebot requests and overload the prerender service, which is unlikely to be able to handle a lot of traffic.
Best case scenario you still have network trips going out to the service, so it's still not a great solution UNLESS the caching was done by your webapp - which is what I spoke about at the end of my comment above.
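For what it's worth, webapp-level caching doesn't have to be much code. A rough sketch, with renderWithPhantom and isCrawler as stand-ins for a local PhantomJS run and whatever bot detection you use:

    // Sketch: refresh snapshots on a schedule and serve crawlers only from cache,
    // so a bot request never triggers a live render or a remote round trip.
    var express = require('express');
    var app = express();

    var snapshots = {};           // url -> pre-rendered HTML
    var PAGES = ['/', '/about'];  // pages worth snapshotting

    function isCrawler(req) {
      // placeholder bot check - however you decide a request is a crawler
      return /googlebot|bingbot/i.test(req.get('User-Agent') || '');
    }
    function renderWithPhantom(url, cb) {
      // placeholder: shell out to a local phantomjs script and return its HTML
      cb(null, '<html><body>snapshot of ' + url + '</body></html>');
    }

    function refreshSnapshots() {
      PAGES.forEach(function (url) {
        renderWithPhantom(url, function (err, html) {
          if (!err) snapshots[url] = html;
        });
      });
    }
    refreshSnapshots();
    setInterval(refreshSnapshots, 60 * 60 * 1000); // hourly is plenty for crawlers

    app.use(function (req, res, next) {
      if (isCrawler(req) && snapshots[req.path]) {
        return res.send(snapshots[req.path]); // cached HTML, no extra latency
      }
      next();
    });

    app.listen(3000);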
Unless this works w/o adding network roundtrips on each request, it's not a great idea.
I think the Unix philosophy of "do one thing and do it well" applies here. There are already off-the-shelf caching solutions that do what you describe: for example, with Varnish you can serve cached pages immediately and update the cache contents in the background.
It would probably be better to use those than reimplement them in an uber-webapp.
An entire project written to simulate progressive enhancement (badly). One that only works for specified whitelisted User-Agents, instead of being based on capability.
I'm also not understanding the use-case for this project. Every time the topic of "Web Apps", "JavaScript Apps", or "Single page web apps" comes up, evangelists point out that they are applications (or skyscrapers), not just fancy decorators for website content.
So exactly what is this project delivering as fallback content? A server-generated website?
This project just seems pointlessly backwards. Simulating a feature that the JavaScript framework has already deliberately broken. One that introduces a server-side dependency on a project deliberately chosen not to have a server-side framework.
This just looks like a waste of effort, when building the JavaScript application properly the first time, with progressive enhancement, covers this exact use-case, and far, far more use-cases.
The time would have been better spent fixing these evidently broken JavaScript frameworks - Angular, Ember, Backbone. Or at least fixing the tutorial documentation to explain how to build Web things properly. (This stuff isn't difficult, it just requires discipline.)
I call hokum on people saying there's a difference between Websites and Web apps (or the plethora of terms used to obfuscate that: Single-page apps, JavaScript apps). This project proves that these are just Websites, built improperly, and this is the fudge that tries to repair that for Googlebot.
Why some developers are so against progressive enhancement mystifies me. It is an elegant solution that actually works in all cases rather than an ugly hack that should probably work in the majority of cases. How can there even be a dispute about it? It's insane!
For the vast majority of websites, this is true and I agree.
However, there are now websites that are more akin to applications than to the traditional website of yore and that are meaningless without Javascript. This is not a bad thing; it is just the opening up of the web platform to new opportunities.
Some of these emerging applications may want some of their content to be searchable and so I argue that the posted solution is solving a real problem and is of value to users.
What would you do if you required SEO enhancement AND dynamic loading of content? Are you supposed to just let that portion of the site go without indexing? Surely there are sites that have both requirements.
Progressive enhancement. It is a web development best practice.
You will find that "dynamic loading of content" doesn't automatically mean "no content served by HTML under any circumstances". This is an error perpetuated by these JavaScript-only frameworks.
For example, bustle.com - there is absolutely no customer experience reason for the website not to have the content loaded at the HTML layer and then progressively enhanced with the customer experience additions. The content here isn't tied exclusively to the behaviour layer.
"The content here isn't tied exclusively to the behaviour layer."
Can you elaborate on a situation where the content and the behaviour are tied together, and what you would do in that case?
From my understanding, Facebook's BigPipe loads content in modules to reduce user perceived latency. If I'm building X site and wanted that same behavior (since I've heard on several occasions that there is a direct correlation between page load times and user engagement), is my only option to sacrifice SEO?
Facebook's BigPipe is nothing more than a client-side hack to work around a limitation of their server-side architecture.
Both Yahoo and Amazon - that I personally know of - have an infrastructure where components on the page are rendered separately and in parallel, and are stitched together on the HTML layer. The render time is then down to the rendering time of the slowest component, or the slowest dependency chain of components.
Loading content in with JavaScript after the HTML page load is always going to be slower, and perceivably so - look at both Twitter and AirBnb, both have written about how much faster they get content to the user using progressive enhancement.
If you decide that the HTML layer isn't the right layer for content, you are working against the strengths of the Web. And of course, that leads down a path where you are sacrificing SEO, sacrificing robustness.
Your time is better spent figuring out why it takes your server too long to generate content, and put in steps to reduce the server side labour.
The JavaScript include approach isn't quicker. Bustle.com for example, takes 10 seconds to show the first page - that's horrific.
Okay, I'm sold on HTML being where the content should be on page load. My next question is are there any frameworks that can assist with this stitching together of content? I'm afraid my ignorance is showing here, but I can't off the top of my head list any.
The web paradigm that I've grown up with is the single-threaded dynamic content generation one, most recently using MVC, but with any server-side logic. The concept of parallel rendering of content and a "stitching" together of HTML is new to me.
I'm also curious as to the best practices surrounding page linking when the behavior specifies something like no screen flash. It seems like all that content (like, the whole page) would have to be loaded with AJAX.. then you're right back where you started with loading content with JS. Maybe it's forgivable as long as the initial page load returns a complete set of content?
Perhaps there is a place where I can learn more about the logistics of progressive enhancement.
Thanks for being willing to answer my questions about this. It's something I've always wanted hashed out in my head from an opponent of these frameworks.
"are there any frameworks that can assist with this stitching together of content"
Mainstream, no. These are not typical use-cases for sites until they reach a gigantic scale.
Plus, even before you get to that level, there's heaps you can do with caching at different levels, pre-calculating, pre-generating. So many optimisations at various levels of your stack, and then there's scaling across hardware. Wordpress and Wikipedia don't need parallelisation of HTML components yet.
"The concept of parallel rendering of content and a "stitching" together of HTML is new to me."
It was new to me until I joined Yahoo. The approach is to break a page down into modular/independent sections, run some sort of parallelisation process, and when all the responses are received, render the page with those generated components.
There's probably a variety of hacks in each major platform that will allow things to be parallelised. If you want to parallelise at the HTTP level, then curl multi-get is an option: http://php.net/manual/en/function.curl-multi-init.php
It's probably possible to cobble together something with node.js too. Node.js receives the HTTP request, turns that into a series of service calls that return HTML, makes those calls asynchronously, waits for them all to return (this is where Node.js excels), then renders an HTML page skeleton, replacing placeholders with the responses from the various services. With a decent promise library that waits for a number of calls to finish, this is quite a compact approach, I guess.
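Something like the following is what I have in mind - only a sketch, with made-up internal component URLs, and the finished page just logged rather than handed to a real request handler:

    // Sketch of the "stitching" idea: fetch page components in parallel and
    // drop them into an HTML skeleton once they have all arrived.
    var http = require('http');

    function fetchHtml(url) {
      return new Promise(function (resolve, reject) {
        http.get(url, function (res) {
          var body = '';
          res.on('data', function (chunk) { body += chunk; });
          res.on('end', function () { resolve(body); });
        }).on('error', reject);
      });
    }

    // Hypothetical component services, each returning an HTML fragment.
    Promise.all([
      fetchHtml('http://components.internal/header'),
      fetchHtml('http://components.internal/article/42'),
      fetchHtml('http://components.internal/sidebar')
    ]).then(function (parts) {
      var page = '<html><body>{header}{article}{sidebar}</body></html>'
        .replace('{header}', parts[0])
        .replace('{article}', parts[1])
        .replace('{sidebar}', parts[2]);
      // total time is roughly the slowest component, not the sum of all of them
      console.log(page); // hand this to your HTTP response in a real handler
    }).catch(console.error);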
Almost anything that allows asynchronous operations that uses resources outside of the current request handler can be fashioned into a component parallelisation stack.
"I'm also curious as to the best practices surrounding page linking when the behavior specifies something like no screen flash."
No screen flash is impossible, due to the nature of the Web. The browser has to receive the HTML first before it can know what dependent resources are needed. The problem you are trying to solve here is to minimise the perceived time between the HTML arriving, and enough of the CSS to load in for an incremental render to paint a close-enough-to-look-complete rendering.
Loading the content after the CSS is one way of doing that. Which replaces the screen-flash delay with a blank screen. That's the JavaScript-app approach.
I don't like that, because it delays the appearance of content.
The perception of screenflash can be minimised, mostly by decreasing the amount of traffic crossing the wire until a good enough repaint can happen. There are various tricks and hacks for minimising this, but due to the nature of the web they cannot be completely eliminated using Web technologies. They can be replaced with other issues.
Tricks I'd consider are reducing the amount of CSS needed to render the page, and breaking the CSS up into a primary rendering and a secondary, more detailed rendering. The primary rendering is just a basic layout plus main-element styling. Perhaps a careful inline style or two, and judicious display: nones and overflow: hiddens to minimise page assets moving around as incremental CSS rendering happens. Also, if you want to get serious, techniques for deferring CSS, JavaScript and images of content outside the current viewport are an option. Yahoo loved deferring the loading of avatar images in an article's comments area till after onload. I see that technique used in tech publication websites, though I can't remember off the top of my head a site that did this.
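The avatar-deferral trick, roughly - the markup and attribute names here are just illustrative:

    <!-- Images below the fold ship with a placeholder src and the real URL in data-src -->
    <img class="avatar" src="blank.gif" data-src="/avatars/123.jpg" alt="">

    <script>
      // After onload, swap in the real images so they don't compete with the
      // content and CSS needed for the first paint.
      window.addEventListener('load', function () {
        var imgs = document.querySelectorAll('img[data-src]');
        for (var i = 0; i < imgs.length; i++) {
          imgs[i].src = imgs[i].getAttribute('data-src');
        }
      });
    </script>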
"Perhaps there is a place where I can learn more about the logistics of progressive enhancement."
The process is about thinking about a site one layer at a time. Get it functional at each level: HTML with links and form posts, CSS presentational level, JavaScript enhancements and usability improvements. Like building a skyscraper, you get the foundation right first.
But before that, it takes understanding as to what are the core use-cases for the site. This is about tasks a visitor can complete. Something that's tied into key product indicators and metrics. I doubt bustle.com use page loading performance as their primary business success factor. It is more likely to be about customer activity - how long did they visit, how many articles, any social activity.
It's figuring out the primary services and functionality of the site, and building that to not rely on JavaScript, or CSS. Primary services are those that, if you didn't provide them, you'd have no business.
Secondary functionality and use-cases -- those that complement or are related to primary functionality -- can be argued on a case-by-case basis as to whether a quick Ajax solution is sufficient. But most of the time, when you get progressive enhancement right, it becomes just a natural development technique.
If you are able to "pre-render" a JavaScript app like this, then you should be serving users the pre-rendered version and then enhancing it with JavaScript after onload.
JavaScript-only apps are a blight on the web. All it takes is a bad SSL cert, or your CDN going down, and your pages become useless to the end-user.
Apologies for being vague. Regarding the SSL certificate, I was referring to modern browsers refusing to load "unsafe" assets.
When the JS can't load, JS-heavy apps tend to either be raw templates (i.e. full of {{ statements }}) or completely blank (if the templates were going to be loaded in a separate request). As Isofarro said, non-JS pages don't suffer from this because the content is there in plain HTML.
Google doesn't like it when they are shown different content than a browsing user. This is roughly the equivalent of pointing Google Agent to a copy of the page requested that happens to be in Memcached instead of spinning up the full app stack to do the render.
Not a technically different page, specifically different content. Serving different pages to Google is fine, as long as they contain the same primary content that the real pages do. That's the whole point - so you can serve prerendered pages to Google but still have a JS-based frontend for the actual users.
AJAX sites often lazy load in content later. My point is the page delivered initially is not the same as the static version content wise or technically.
>Static rendering of dynamic content? I don't think this does make sense.
Bro do you even Web 1.0? That's what CGI scripts in Perl did! Pull the data from the database, generate HTML (no JavaScript back then!) on the fly, and send to the browser.
I was under the impression that Googlebot already executes javascript on pages.
A more interesting idea would be to do this for every user - prerender the page and send them the result, so they don't have to do the first, heavy JS execution themselves. I know it sounds a bit backwards at first - you're basically using JavaScript as a server-side page renderer - but think about this: you can choose to prerender or not to prerender based on the user agent string -- do it for people on mobile phones, but not for desktop users. You can write your entire site with just client-side page generation in JavaScript and let it run client-side at first, then switch to server-side prerendering once you have better hardware.
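As a sketch of that idea - every helper here is a placeholder, not a real library:

    // Sketch: server-render for mobile user agents, ship the plain JS app shell
    // to everyone else.
    var express = require('express');
    var app = express();

    function appShell() {
      // the usual near-empty HTML plus the script tag that boots the client-side app
      return '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';
    }
    function renderAppToHtml(url, cb) {
      // placeholder for running your client-side rendering code on the server
      cb(null, '<html><body>server-rendered ' + url + '</body></html>');
    }

    app.get('*', function (req, res) {
      var ua = req.get('User-Agent') || '';
      if (/Mobile|Android|iPhone/i.test(ua)) {
        renderAppToHtml(req.url, function (err, html) {
          res.send(err ? appShell() : html);
        });
      } else {
        res.send(appShell()); // the desktop browser runs the JS itself
      }
    });

    app.listen(3000);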
Something similar to that, albeit slightly more elegant, is the work that AirBnB has done with their rendr [0] project, which serves prerendered content that's then rerendered with JS if it needs to be changed. You can do similar things with non-Backbone stacks, of course.
It's a multipage app, that uses ajax to function as a singlepage app. From the user's point of view it's a singlepage app, but it's accessible from any of the URLs that it pushStates to, so it's like the best of both worlds. It's fully crawlable because it functions as a multipage app, but it's got the speed of a singlepage app (if your browser supports ___pushState)
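The client side of that pattern is only a few lines - sketched here with a made-up data-pjax attribute, an assumed #content container, and the X-PJAX header that pjax-style servers conventionally check for:

    // Sketch: intercept internal link clicks, fetch the new content over XHR,
    // swap it into the page, and keep the address bar on a real, crawlable URL.
    document.addEventListener('click', function (e) {
      var link = e.target.closest('a[data-pjax]');
      if (!link) return;
      e.preventDefault();

      var xhr = new XMLHttpRequest();
      xhr.open('GET', link.href);
      xhr.setRequestHeader('X-PJAX', 'true'); // ask the server for a fragment only
      xhr.onload = function () {
        document.querySelector('#content').innerHTML = xhr.responseText;
        history.pushState({}, '', link.href); // URL stays shareable and crawlable
      };
      xhr.send();
    });
    // (A real version would also listen for popstate so the back button works.)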
I tried using PhantomJS in the past to server-side render a complex Backbone application for SEO, and it was taking over 15 seconds to return a response (which is bad for SEO).
Looking at prerender's source I didn't see any caching mechanism.
What kind of load times have you seen rendering your apps?
Have there been recent significant improvements in PhantomJS's performance?
You can get it faster than 15 seconds, but you can't really get it fast enough. We precache everything. I would strongly recommend against trying to process the pages in realtime.
I have been looking for something like this for a long time. Seems very straight forward.
I have not tested it yet, but I wonder if the render speed will penalize you in the Google results. Seems like a separate machine with a good CPU might be worthwhile if you are going to run this.
They execute JavaScript in a limited fashion, so you should consider using what is suggested by Google itself: https://developers.google.com/webmasters/ajax-crawling/docs/... If you are using Angular, you will otherwise get your template displayed instead of the fully rendered page, with all the {{sitename}} placeholders showing.
It's a SaaS which is much more elaborate than this project (there's a year of development in it). We serve and crawl thousands of pages every day without any issues.
If you're using Rails have a look at https://github.com/seojs/seojs-ruby, it's a gem similar to prerender but it's using our managed service at http://getseojs.com/ to get the snapshots. There are also ready to use integrations for Apache and Nginx.
Some benefits of SEO.js to other approaches are:
- it's effortless, you don't need to set up and operate your own PhantomJS server
- snapshots are created and cached in advance so the search engine crawler won't be put off by slow page loads
I recently needed to do this for Google, but I wanted the rendering and delivery of the page to be under 500ms, so I hacked up something that works with Express.
It uses PhantomJS but removes all the styles initially so the rendering time is much faster (my Ember app was averaging 70ms to render, but I prefetch the page data).
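Roughly, the style-stripping boils down to PhantomJS's resource hook - a simplified sketch, not the actual code, with a made-up local URL:

    // Abort stylesheet and image requests so the page reaches a renderable DOM
    // faster; for a snapshot we only care about the HTML.
    var page = require('webpage').create();

    page.onResourceRequested = function (requestData, networkRequest) {
      if (/\.(css|png|gif|jpe?g)(\?|$)/i.test(requestData.url)) {
        networkRequest.abort();
      }
    };

    page.open('http://localhost:3000/#!/some-route', function (status) {
      if (status !== 'success') { return phantom.exit(1); }
      // give the app a moment to finish its XHRs and render
      window.setTimeout(function () {
        console.log(page.content); // the rendered HTML, minus the styles we skipped
        phantom.exit();
      }, 300);
    });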
Looks like that's exactly what Meteor's spiderable package has done since 08/2012 [0]: look at the user-agent, run PhantomJS for up to 10s, and return a rendered page once a Google/Facebook crawler is detected.
With each user-interaction that updates a page fragment it modifies the address in the browser's address bar to correspond to the current state. If somebody were to copy and paste that URL into a new tab, your site would load the complete interface if you've structured your back-end code correctly.
You do this by building in logic to the part of the code that outputs your view to see whether the request is coming as a PJAX request, or not. If it is, you output the page-fragment, which is then added to your existing DOM. If it's not a PJAX request, your back-end outputs the entire code for your site.
There's a limitation to PJAX where you can only update one fragment at a time, though PJAXR seems to address that limitation by providing support for updating multiple-fragments simultaneously. Either way, you get the huge advantage of having a fully-crawlable site without needing to integrate pre-rendering work-arounds for search-engine compatibility.
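On the server side it's just a header check - a rough Express sketch (loadArticle and the markup are placeholders):

    // Sketch: the same route serves either a fragment or the full page,
    // so every URL stays crawlable and directly loadable.
    var express = require('express');
    var app = express();

    function loadArticle(id) {
      // placeholder: fetch from your database
      return { title: 'Article ' + id, body: '<p>...</p>' };
    }

    app.get('/articles/:id', function (req, res) {
      var article = loadArticle(req.params.id);
      var fragment = '<article><h1>' + article.title + '</h1>' + article.body + '</article>';
      if (req.get('X-PJAX')) {
        res.send(fragment); // PJAX request: just the piece that gets swapped in
      } else {
        res.send('<html><body><div id="content">' + fragment + '</div></body></html>');
      }
    });

    app.listen(3000);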
I'm confused, search indexing isn't a realtime exercise... Why would performance be an issue? Running a headless browser vs running "whatever it is they run that can execute JS" doesn't seem like a huge leap...
At Google's scale any performance drop can have massive implications. If Google's crawl rate is 100 million[1] pages a day, then a 1% drop in crawl rate means 1 million fewer pages crawled per day (which has many implications, for example having to use more compute power to regain crawl rate, which raises costs, etc.)
You are right that it is not a real-time exercise but they do have crawl targets.
You cannot be flippant about "Why don't they just do X" when scale is that big.
[1] Picked out of the air but probably in the right magnitude (or even a little small)
The point is that Google probably doesn't have a lot of cycles to spare - anything else wouldn't be good business sense.
Anything that significantly adds to the load will lose them money - whether or not the operation needs to be realtime is secondary to that.
I apologise for giving offense: I wrote the comment the same way I would have made it face-to-face, which is always a bit risky in a purely textual medium.
I don't know if you are trying to be serious at this point or not. Google has millions (literally) of machines with dozens of cores each. Search is their business that makes all the money.
Google executes JavaScript and renders the full DOM for every page internally. They generate full length screenshots of every page and have pointers to where text appears on the page so they can do highlighting of phrases within the screenshot.
It isn't even a debatable question if Google reuses the Chrome engine to do this.
This is a really great idea! I mean, now data in apps made with JS can be searched. My question is: can we add "Search with Google" to our JavaScript app then?
I surf with JavaScript turned off and I see just a blank page. If it's "crawlable" I certainly expect it to be visible to me without turning JavaScript on.
Why would you do this? Genuinely interested. Do you browse the web with JavaScript turned off the majority of the time or just in this particular example?
I keep JavaScript turned off by default. Then I turn it on for only a few sites of critical importance for me which would not function otherwise. And I don't feel I miss anything, most of the content I care about is still HTML and it should remain so. JavaScript is not needed to show me the text.
That way the chances of cross-site scripting attacks are greatly reduced and the content appears much faster.
pushState for the win. Done with hash fragments, not going back to that mess. pushState is quick and easy to implement; I don't see a reason to overcomplicate.
>Javascript apps can be fully crawlable
yes, and i think it's cool that you try to provide a solution as a service for this.
but as with every technology, there are some tradeoffs
a) serving google a different response based on the user-agent is the definition of cloaking (it's not misleading or malicious cloaking, but it's cloaking nonetheless)
b) you hardcode a dependency on a third-party server - one you have no control over - into your app (and from the sample code on the page, there is no fallback available if this server is down)
c) there are latency/web-performance issues, i.e. for a first-time request by a search engine the roundtrip would look like so:
[googlebot GET for page -> googlebot detected -> app GET to prerender.io -> prerender.io GET to page -> app delivers page -> prerender.io returns page to app -> app returns page to googlebot]
this will always be slower than
[googlebot GET for page -> app returns page to googlebot]
so basically the prerender.io approach creates some issues. that said, we don't have - yet - another "no trade-off" solution.
this is a no go if you have a big site with hundreds of thousands to millions of pages.
and there is another much, much bigger issue:
* showing JS clients
* and "other only-partially-JS clients" (google parses some JS) different responses
just does not work in the long run.
why? if there is no direct feedback, then there is no direct feedback!
non-responsive mobile sites currently offer an overall poor user experience. why? because all the guys working on the site sit in front of their fat office desktops. no feedback equals crap in the long run.
and it's worse for "for robots only" views, because people just don't have to live with the crap their server spits out, as they always just see the fancy JS versions. since the hashbang ajax-crawlable spec came out I've consulted some clients on this question, and everyone who chose the _escaped_fragment_ road anyway did regret it later on. even if the first iteration works, 1000 rollouts later it doesn't - if there is no direct feedback, then there is no direct feedback.
conclusion: if you have a big site and want to do big-scale (lots of pages) SEO, you are stuck with landing pages and delivering HTML + content via the server + progressive enhancement for functionality, until the day google gets its act together.
and for first-view web performance I recommend the progressive enhancement approach anyway, too.
Google has documentation on this here: https://developers.google.com/webmasters/ajax-crawling/docs/... and we've been using this method at https://circleci.com for the past year.