You will have to create an attribute for your JSON, where you'll store the JSON UTF-8 encoded. If you want to index on parts of that JSON blob you'll have to pull them out into their own separate attributes and then recombine them into a single JSON object on read.
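Something like this, as a rough sketch in Python; the field names and the "payload" attribute are invented for illustration, and the actual put/get calls against your store are left out:

    import json

    # Sketch of the pattern described above: the full document is stored UTF-8
    # encoded in one attribute, and the fields you want to query on are copied
    # out into their own attributes. The resulting dict is what you'd hand to
    # your store's put call.
    INDEXED_FIELDS = ("user_id", "created_at")

    def to_attributes(doc):
        attrs = {"payload": json.dumps(doc).encode("utf-8")}
        for field in INDEXED_FIELDS:
            if field in doc:
                attrs[field] = doc[field]   # duplicated copy, indexable by the store
        return attrs

    def from_attributes(attrs):
        # The payload is authoritative; the indexed copies exist only for querying.
        return json.loads(attrs["payload"].decode("utf-8"))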
I really wish GA tracked all the metrics exposed by the timing spec, instead of combining them into one overall value. It's great that this reports numbers from actual users' machines instead of a headless render process on a monitoring server somewhere, though. Here's hoping FF implements support for this soon.
That is significantly harder. Consider a graph (as in, graph data structure) of IM users. Communicating between an AIM and an XMPP user is a single edge, and even though the two nodes are on different services it's not that hard to make work. Almost by definition, both services have very similar semantics for what an IM is, and things like presence converged enough for interop a while ago.
Now consider a conference. Let's say AIM_A and AIM_B wish to conference with XMPP_X and XMPP_Y. Now there are six entities: the four users I just mentioned, and the two conference rooms that are supposed to be reflections of each other. Conference rooms fundamentally have complicated protocols for creating and managing them. Hooking them up to each other is very non-trivial. What does it even mean at the protocol level for XMPP_X to try to kick AIM_A out of the conference? You need a lot of infrastructure to make this work, almost none of which exists, because it requires both services to cooperate, or a lot of implementation by one side or the other to directly connect to an entirely foreign network on a foreign protocol. None of which the proprietary networks are likely to do.
Without cooperation on both sides this is basically impossible.
Right, that makes sense, thank you for the explanation. I guess interoperability between the two services implied to me that they could speak each other's protocols in their entirety (at least ideally), so I pictured GTalk talking to the AIM chat room via its protocol. Oh well.
BTW this has implications for Google Apps users, since there are no usable persistent chat rooms in GTalk. Well, I guess for any users, but I think chat rooms are more useful for businesses than for casual users.
Interesting to see JSON Schema used. I was just using it for something similar in my own API framework (although mostly for validation and serialization, not just discovery) and the public interest in the spec seemed mild at best. Good to see the idea catch on a little, though it would have been even better if they had released a full Python implementation of it instead of just hard coding around the few pieces they actually use. There actually aren't any full implementations at all right now.
I used json-schema like crazy in my last pet project. Used it to validate JavaScript I would proxy for well-known interface endpoints. Pretty damn nice.
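For anyone curious what that looks like in practice, here's a toy Python validator that only handles a couple of keywords ("type", "properties", a boolean "required"), which is roughly the hard-code-around-the-pieces-you-use situation mentioned above, not a full implementation of the spec:

    # Toy json-schema-style validator: checks "type", "properties", "required"
    # and nothing else. Purely illustrative.
    TYPES = {"object": dict, "array": list, "string": str,
             "integer": int, "number": float, "boolean": bool}

    def validate(instance, schema):
        expected = schema.get("type")
        if expected and not isinstance(instance, TYPES[expected]):
            raise ValueError("expected %s, got %r" % (expected, instance))
        for name, subschema in schema.get("properties", {}).items():
            if subschema.get("required") and name not in instance:
                raise ValueError("missing required property: %s" % name)
            if name in instance:
                validate(instance[name], subschema)

    # Example: passes silently because the instance matches the schema.
    validate({"name": "box", "size": 3},
             {"type": "object",
              "properties": {"name": {"type": "string", "required": True},
                             "size": {"type": "integer"}}})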
> Also, Moore's law is still relevant in the mobile market meaning that people replace their phones fairly frequently to get better hardware support. A mobile phone from 5 years ago is clearly inferior to most people whereas a computer from 5 years ago is mostly adequate for most users.
Agreed, and also the dominant software platform has not emerged yet, like it has with PCs. The market is still highly fragmented and applications are routinely being written for many platforms. There is still time for someone other than Apple and Google to build something there.
In the meantime, MS has quite a bit of time, I think. Those mobile devices will remain "niche" devices for a long time as these platforms mature. Right now using my phone while I'm outside the house or using a tablet on my couch is great, but that doesn't mean I can dispense with my desktop, where MS is king. That will remain true for probably a long time.
The holy grail in my opinion is a single device that I can use as a phone, and also hook up to monitors + peripherals to get a full, unabridged desktop experience. Why have two devices when you can have one? Obviously the Atrix is already a step towards that, so I'm not pointing out anything particularly new. Of all the existing players, I think MS is potentially even better positioned than Apple to get in on that game since they already have a lock on the desktop platform.
Google Instant is where Google shows you search results as you type, going beyond auto-completing your query. Some people like me find that annoying, and it's doubly annoying that Google actually resets your preferences back to what IT thinks is better for YOU.
Tying yourself to LINQ is one thing; tying yourself to MS SQL Server is completely fine if you're OK with the licensing fees. SQL Server is probably one of the best enterprise databases out there right now, I imagine right behind Oracle.
They aren't really tying themselves to LINQ, though. It seems like they're using LINQ to get the dev speed improvement when they can, and dropping down to writing their own SQL in the subset of cases where they know that LINQ isn't fast enough.
Sure, I'm just trying to speak up for some of the less popular technologies in today's dev community :). SQL Server has the downside of costing money, but I think it should be in the equation when making technology decisions today; it's a vastly superior piece of tech compared to MySQL and Postgres :).
If you are building anything more complex than a blog site and expect to take a decent amount of traffic, to the point that you may in fact care about optimizing at all, going with an ORM that writes SQL for you is a really, really bad idea. I really don't understand the fascination with ORMs today. Some sort of SQL-to-object translation layer is no doubt a great thing, but any time you write "sql" in a non-SQL language like Python or Ruby you are letting go of any ability to optimize your queries. For reasonably complicated and trafficked websites that's a disaster simply waiting to happen. This isn't just blind speculation on my part; I've heard a great many stories where very significant resources had to be dedicated to removing the ORM from the architecture, and the Twitter example should be familiar to most.
I would go so far as to say that SQL-writing ORMs are a deeply misguided engineering idea in and of themselves, not just badly implemented in their current incarnations. You can't possibly write data access logic entirely in your front end and expect some system to magically create and query a data store for you in the best, or even close to the best, way.
I think the real reason people use ORMs is because they don't have someone at the company who can actually competently operate a SQL database, and at any company of a decent size traffic-wise that's simply a fatal mistake. Unless you are going 100% NoSQL, at which point this discussion is irrelevant.
ORMs aren't a problem at all as long as you have the ability to override problematic queries with named queries, etc.
ORMs can provide very real advantages when it comes to caching, development time, etc., as long as you review what the ORM is doing and notice when it's doing it wrong.
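For a concrete (if invented) picture of that escape hatch, here's roughly what it looks like in SQLAlchemy, which comes up elsewhere in this thread; the users table and both queries are made up for the example:

    from sqlalchemy import Column, Integer, Boolean, create_engine, text
    from sqlalchemy.orm import sessionmaker
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        active = Column(Boolean)

    engine = create_engine("sqlite://")   # stand-in database for the example
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # Common case: let the ORM generate the SQL.
    active_users = session.query(User).filter(User.active == True).all()

    # A query flagged in review or load testing: override it with explicit SQL
    # while keeping the same session and transaction machinery.
    rows = session.execute(
        text("SELECT id FROM users WHERE active = :active ORDER BY id"),
        {"active": True},
    ).fetchall()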
I've just spent 2 years architecting a high-transaction global video game system using an ORM, and it worked well. In our case, the ORM provided acceptable SQL for about 85% of the queries, and we overrode the rest.
The ability to quickly and easily allow the developers to write their own SQL, to be reviewed later by a DBA, was a life saver. Combined with our stress and load testing, it was easy to see where the hot spots were and deal with them effectively.
The problem comes from people who rely on the ORM to do everything for them without truly understanding how it works.
ORMs, like anything else, are a tool, and there is a time and place for them.
Wrapping both caching logic and database access in an ORM-like system is no doubt the right thing to do. Letting front end developers write queries to be converted by an ORM and reviewed by a DBA later - in my opinion that's not the most efficient method of development. I probably would have invested in an extra DB person or two to help write the data access logic. But hey, I can't argue with results - if it worked for you, that's great. But as a general statement I think that sort of development methodology is highly conducive to errors and systematic problems that would not become evident until later, and at that point take a great deal of effort to fix.
The two big systems I architected where I made the decision to go with ORMs were the online EA Sports system (all EA Sports games on all platforms, currently running in a 7-node Oracle cluster), and most recently, the Need For Speed World Online system. We launched the EA Sports system with Madden, and went from 50 to 11 million users hitting the DB in less than an hour. Then we rolled out the other EA Sports games. Needless to say, both systems were slightly bigger than a simple blogging site.
In both cases, we had a large number of smart developers who we empowered with the use of an ORM; they understood the ___domain model, and they didn't have to worry about waiting for a "DB type" to write stored procedures, or develop a data model, etc. As a matter of fact, in both cases, I was the only DBA on the project, and it was a predominantly part-time role. We'd meet, ensure we were all on the same page with the object/data model, and then they'd go and build it. The developers were able to immediately build and run and test and integrate something that was functional and operational, when they needed it. This was HUGE, and something that most people don't properly appreciate. Timelines were already insane enough as it was; the last thing we needed to do was artificially constrain ourselves by waiting for other (db) devs before work could go on. Especially when requirements had the potential to change from one day to the next.
In both situations, we took advantage of very, very sophisticated testing procedures that would happen nightly, both functional and stress/load, and they pointed us at the bottlenecks of each nightly build that would require tuning and investigation. We intentionally set up our testing to be able to monitor and test the effectiveness of the ORM, and to point it out when it didn't work efficiently. The devs would do the majority of the heavy lifting with the initial data model, and the results would be tested, reviewed, and then modified if required. The performance modifications were not a lot of effort to fix, either. Usually it was a very slight data model change, or using a named query to take advantage of database-specific features. And CLOBs. Every database seems to handle them differently, so we had to hack some solutions.
Having done large scale database development for almost 25 years, using the classic stored procedure approach and the ORM approach, I'll say again that ORMs are a great solution for certain projects with the right staff, and aren't a crutch or some lazy choice if used properly.
As another 'Oracle guy' this is an interesting post. I have said before on here, if you pay for Oracle, and also pay for decent storage arrays, Oracle can shift a serious amount of data before it reaches its limit. In my opinion, it really seems to be an order of magnitude better than its closest open source competitor.
I think when people say "relational doesn't scale," what they mean is "MySQL often requires application-level changes to scale out."
I assume (among this crowd, anyway) that scaling out is more desirable than scaling up because the majority of the hardware costs are variable, whereas scaling up requires a step function of large cash investments that startups often can't afford.
Do those massive systems on Oracle etc scale out, or simply scale up with expensive hardware?
You can certainly scale out with Oracle RAC. At some point the bottleneck will likely be disk however, so buying a high end storage system would probably become priority.
My experience is from writing a bunch of middle tier code at MySpace in the 06-07 time frame, the MySpace heyday when they were pushing more traffic than Google (true story). Anyway, the user facing product might have sucked, but we did scale (that's why Friendster was Friendster and we were MySpace :). In an environment with 450+ million users, we had extensive caching systems and still had to use every SQL trick in the book to get our systems to scale well. I know because my job was working with the DBAs to bridge the SQL and front end worlds together. I can say with great certainty that front end developers who did not know SQL and were simply following a logical object model would not have produced code that scaled in our environment; there were way too many things that were done that were extremely non-obvious. Since MySpace I've been working at a Python/Postgres startup where we've been applying the same principles pretty successfully, at a much different scale of course. If nothing else, I think the no-ORM approach will at least give you more bang for your buck.
Separating your data access code out of the application logic also allows you to change it much more easily as data conditions change, including on the fly, without an application deployment. That's often extremely useful.
MySpace scale may be at an extreme end of the spectrum, but we had formidable hardware to throw at it too (although x86, so nothing TOO crazy). So I think the ratio of hardware to scale at other sites is comparable, and the same lessons apply. I have no experience working with Oracle, but would you say that a 7-node Oracle cluster is some pretty serious hardware? I really don't know, but it is a question I have :).
EDIT: I'm not discounting your experience, I just want to point out that I've experienced conditions where I think the ORM approach would have broken down. If others have had different experiences, the more data points the better, but I think the scale/complexity/cost(hw) ratios play into the debate as well.
EDIT #2: Oh, and I forgot to mention that the automated test suite you had is an incredible asset, and no doubt made it easier to discover problems early and deal with them effectively. But you do have to invest resources in creating one, and something like that is no small cost at a startup.
The point of my post was to say that if you take a serious look at the ORM you want to use, fully understand the issues you may have with it, and design/adapt your development process to help mitigate the issues you may run into, there are huge advantages to using it.
I was just pointing out that ORMs are indeed quite effective in online systems that are more complex than a blogging site.
If you're going to say "no, don't use it" based on a development situation that is very much an outlier (MySpace), and use that experience to discount it for anything but trivial use, then I'm not sure what to say.
They can and do offer real-world advantages with minimal downside in reasonably complex and large systems, as I've tried to demonstrate, if you treat them like any other tool and don't use them blindly.
As to your environment, the data requirements were quite different than ours. Our systems were more like online banking systems; very much an even split of fast writes and reads, transactionally bound to third-party systems (in-game payment, in-game "real time" use of consumables, etc.), real-time analytics for fraud detection, etc. We were very much high IO, and our caching opportunities were few and far between.
And in our environment, we HAD to have sophisticated testing. I ensured that the stress and load testing was done so that we could directly simulate the load of our expected user base, with realistic profiles, in order to better engineer our databases and disk IO. It also allowed us to measure the impacts of feature additions, etc. If it failed in Production, it made the news, and we had millions of gamer-freaks bitching everywhere.
In my case, the middle-tier was not an issue... we enabled minimal caching on a per-box basis, and other than that, they were stateless, and we could add/remove them at will; the application WAS the database.
And you can still abstract various parts of the database while using an ORM. We did write a few special stored procedures, and used some forced query plans, views, etc., to tweak the performance.
And yes, Oracle can scale out quite well. Cache Fusion, high speed and low latency interconnects, and shared block access provide incredible scaling without having to do anything special in the middle tier.
It's interesting to hear that this has worked well; obviously this wasn't a small project. Your point about knowing how to use your tool definitely rings true. Also interesting that you had a use case where data loss and integrity actually mattered, and in real time, unlike a social network or most startups operating today. Going with a heavy Oracle system instead of trying to roll your own creative distributed architecture definitely seems to make sense in that scenario. Just out of curiosity, was this Java/Hibernate?
On one system we used Java/Oracle/Hibernate and went with the big single cluster. The other system was a .NET stack, using NHibernate and a large number of SQLServer instances. We also worked with Microsoft on integrating their latest (at the time beta) caching servers. We did indeed have to roll our own distributed architecture in that case, but it's not like we had to drop ORM to do it.
If anyone has any questions about how I've used ORM, etc., feel free to email me at [email protected] if you like. I don't usually keep tabs on old threads, and have no problems sharing some of my experiences in this.
> any time you write "sql" in a non-sql language like python or ruby you are letting go of any ability to optimize your queries.
No, you're not. Look at ActiveRecord: it lets you drop to any level of SQL optimization you need. In ActiveRecord 3 with ARel, queries are composable, allowing lazy loading and the breaking of queries into appropriate locations according to your code architecture.
I can't speak to other ORMs, maybe they really are as bad as your opinion would indicate, but I suspect what you're really complaining about is people who don't know how SQL works being enabled to write horrible data persistence code by ORMs with a pretty facade. That's a legitimate problem, but the fact that a tool can be abused is not an argument against the tool itself. We'll never build anything great if we are driven primarily by what the ignorant will do with it; after all, every single person on the planet is ignorant of most things, and our tool development should be driven by what it enables experts to do.
Things like lazy loading are a red flag to me that you are doing something wrong, so if your framework allows you to do that, that's not necessarily something to brag about :). Random IO that is triggered by merely accessing a property, without the programmer's knowledge, is not the best approach if you want to scale; you are better off doing deliberate fetches as a result of previously fetched data. If you are breaking up and composing queries, how are they broken up and composed by the ORM, as joins or as subqueries? If as joins, does your ORM know the best columns to join on? You could replace everything with named SQL functions (dropping to the lowest level of optimization as you mention above), but at that point what is your ORM really doing for you? Anyway, sorry, I'm not sold :). Maybe if you effectively replicated the database engine in your front end framework I would come closer to being sold, but even then you don't have the same rapid in-memory access to statistics about tables to make the right optimization decisions, etc.
You are thinking at too abstract a level. First of all, IO is not "randomly" triggered; it's entirely predictable how it works. Composability allows you to define a scope that is globally applicable (e.g. "published articles") and then add additional constraints in a controller (e.g. "tagged X"), using the logic of relational algebra to construct a sane query. Could this query be slow? Sure, it's still your responsibility to make sure the schema supports that query. Writing SQL manually does not absolve you of that responsibility; it just means organizing a lot of SQL strings somehow.
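ActiveRecord/ARel is Ruby, but the same composition idea looks roughly like this in SQLAlchemy (with an invented Article model); nothing hits the database until the composed query is actually executed:

    from sqlalchemy import Column, Integer, String, Boolean, create_engine
    from sqlalchemy.orm import sessionmaker
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Article(Base):
        __tablename__ = "articles"
        id = Column(Integer, primary_key=True)
        published = Column(Boolean)
        tag = Column(String)

    engine = create_engine("sqlite://")   # stand-in database
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # A globally applicable "scope": no SQL is issued yet, just a query object.
    published = session.query(Article).filter(Article.published == True)

    # A controller adds its own constraint; both filters are combined into a
    # single SELECT when the query is finally executed.
    tagged_x = published.filter(Article.tag == "X")
    results = tagged_x.all()   # one query: WHERE published AND tag = 'X'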
Here's what you're missing, and why ActiveRecord works: if you halfway know what you're doing with an RDBMS, 95-99% of the time the query generation just saves you writing a lot of boilerplate SQL. It's true that sometimes you have to drop to a lower level to hand-craft a query, but ActiveRecord in no way prevents that. Again, I don't know what kind of ORM hell you have been put through, but I assure you that an ORM does not need to be this horrendous, performance-killing black box that you think it is.
There is one thing that I think can help with the SQL overhead you mention - if you have a rock star dedicated SQL person who can take all this work off your hands (that's not me btw, I've just worked/am working with such people). I think it affords you easier long term growth if you have expectations of making it to the medium-to-large company world, while not slowing you down when you are small, so I think it's a better strategy for both small and large companies. Are you signing on for a potential bottleneck? Yeah, it is a trade-off and it is paramount you hire well in that area, but that's the sort of problem and decision you have to make all the time at a company.
I understand where you're coming from, and what you describe may be workable in a smaller company with 7-14 devs where everyone knows what they are doing and understands well what happens under the hood. I think it's less likely to work at a company with 50+ devs, though, where you inevitably start trusting people less, or just at a company where you don't trust everyone. I've worked at both types. There is also the question of the complexity of your data and the way you need to query it. Right now we do essentially a ton of graph queries that we optimize heavily in SQL (it ends up working much faster than any graph database since the schema and the queries are optimized for the exact data we are working on). Some of the functions that I write for this would not be implementable in an ORM. I suppose that could be the case where you drop down into raw SQL, but that happens to be a fair chunk of our code.
Maybe you can make it work better than I'm expecting, but if you were starting from scratch would you really want to go down that path anyway, all things considered? My original argument was that you are better off choosing a different way. I suppose that point of view will be difficult to change for me :).
Have you actually ever used an ORM? All of your posts strike me as being of an "I imagine it would be bad" nature, without actually speaking from experience.
I get the impression that you don't really understand the complexity and capabilities of a modern, robust ORM, or how it can be used.
And you're working on a data warehouse, which isn't usually a viable candidate for ORM in the first place.
I've used ORMs before I worked at MySpace. NHibernate specifically. I've also used SQLAlchemy on the Python side. NHibernate was in a professional environment; SQLAlchemy was a bunch of stuff I did for evaluation purposes, so you can discount that if you like.
And I'm not working on a data warehouse... why do you think that?
I completely agree. I didn't even realize the "N+1 Selects Problem" was a problem because it should be referred to as "My Training Wheels Fell Off and Now My Bike Falls Over When I Sit On It".
Replace ORM in your argument with C and SQL with assembler. Just like high-level languages, it's an abstraction layer, and it can really help you put stuff like caching and escaping (think SQL injections) into one place. Also, it's much easier to change a function name in your abstraction layer than to change your database schema.
The advantage of using LINQ database querying in C#, and it's a big one in my experience, is that your queries are actually typechecked by the compiler like any other code, making it a lot easier to refactor. (In the context of Python/Ruby, which don't even have typecheckers, I have no idea what the draw is.)
The disadvantage is that due to some organizational dysfunction at MSFT there's still no really satisfactory ORM infrastructure surrounding the query engines.
(as for your "misguided engineering idea in itself" claim, I don't really see how it's fundamentally different from writing SQL in the first place to be translated by the database into query execution plans, vs. writing the query execution plans directly).
The difference at a high level is that SQL has a syntax and set of capabilities that is quite unique, and every single database vendor has its own extensions or differences driven by its particular approach. To really replicate all of this in code you would have to go beyond the basic data structures and syntax of that programming language, and at that point you might as well just have SQL. It's a paradigm and an approach expressed through its own syntax; you can't easily copy all of it in a totally different programming language.
As for type safety, I think frameworks that do SQL-to-object mapping (with type safety), and also handle caching for you, are a very useful thing. Making raw calls on database connections is definitely too far "in the other direction" :).
On the syntactic level SQL is just a poorly designed language. LINQ query expressions actually do a better job of expressing the semantics of the SQL-like set/collection operations, in a compositional manner. It's definitely true though that SQL databases currently have a lot of capabilities (like, errmm, DML) that at least the Microsoft ORMs don't support other than by dropping down to SQL. I don't think this is a problem with the LINQ IQueryable paradigm, though, but just a problem with the Microsoft ORMs being incomplete.
I don't have much experience with ORMs or mapping frameworks other than LINQ-based ones, but it seems like it would be pretty difficult to typecheck queries expressed as SQL strings, at least dynamic ones, at compile time. Do the frameworks you mention typecheck the actual query itself at compile time, or do they just check at runtime that the data returned from the query matches what you want?
Query results you will normally adapt to specific object properties, and at that point the only thing that can bite you is if your query starts returning columns of a different SQL type, which of course you can't catch at compile time anyway. If you wrap your query results with objects and maintain an interface to your update statements via method calls (which obviously have type checking for arguments), I don't see how you can run into serious trouble. In the dynamic language world you of course don't have compile time anything, but you can use pretty much the exact same techniques to ensure you don't pass something bad to your query. I guess this isn't very dynamic, but that's the idea: your data access logic lives in the database, you execute methods and get back objects. That's the no-ORM way.
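A minimal sketch of that style, using Python's sqlite3 and an invented users table; in a setup like the one described, the SQL here might instead live in stored procedures:

    import sqlite3
    from collections import namedtuple

    # Thin data-access module: hand-written, parameterized SQL lives here,
    # callers get plain objects back and never touch SQL themselves.
    User = namedtuple("User", "id name email")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

    def create_user(name, email):
        cur = conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
        conn.commit()
        return cur.lastrowid

    def get_user(user_id):
        row = conn.execute("SELECT id, name, email FROM users WHERE id = ?",
                           (user_id,)).fetchone()
        return User(*row) if row else None

    uid = create_user("alice", "alice@example.com")
    print(get_user(uid))   # User(id=1, name='alice', email='alice@example.com')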
> If you are building anything more complex than a blog site and expect to take a decent amount of traffic, to the point that you may in fact care about optimizing at all, going with an ORM that writes sql for you is a really really bad idea.
Then explain the massive success of Rails. Quite simply, you are wrong.
I don't think they're necessarily misguided. DataMapper made efforts to circumvent the N+1 problem, in most cases probably pretty effectively.
Partial updates are also pretty easy. Slamming every field into every INSERT/UPDATE is obviously a bad idea.
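For the N+1 case specifically, here's roughly what the lazy versus eager-loaded shapes look like in SQLAlchemy (models invented for the example); SQLAlchemy also only includes changed columns in its UPDATEs by default, which covers the partial-update point:

    from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
    from sqlalchemy.orm import relationship, sessionmaker, joinedload
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Author(Base):
        __tablename__ = "authors"
        id = Column(Integer, primary_key=True)
        name = Column(String)
        posts = relationship("Post", back_populates="author")

    class Post(Base):
        __tablename__ = "posts"
        id = Column(Integer, primary_key=True)
        title = Column(String)
        author_id = Column(Integer, ForeignKey("authors.id"))
        author = relationship("Author", back_populates="posts")

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # N+1 shape: one query for the authors, then one lazy query per .posts access.
    for author in session.query(Author):
        print(author.name, len(author.posts))

    # Eager-loaded shape: a single joined query up front.
    for author in session.query(Author).options(joinedload(Author.posts)):
        print(author.name, len(author.posts))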
I think the missing sauce for ORMs is funding. Getting the basics together takes time and money, and it's hard to pull off in your free time.
On the other hand, having written many an ORM, I think there's still plenty of room to advance the state of the art. One of the biggest untapped (AFAIK) opportunities is using statistics for query tuning. Statistics are the life-blood of databases, but they're noticeably lacking in ORMs. Even simple counters could allow you to tune lazy loads, JOINs, pre-fetches, etc. on the fly.
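Entirely hypothetical, but the counter idea could be as small as this: count lazy loads per relation during a request and flip the relation to a prefetch once a threshold is crossed. No real ORM is being instrumented here; the class and threshold are made up.

    from collections import Counter

    class LoadStats:
        """Hypothetical per-request stats an ORM could keep to tune itself."""

        def __init__(self, eager_threshold=10):
            self.lazy_loads = Counter()
            self.eager_threshold = eager_threshold

        def record_lazy_load(self, relation):
            self.lazy_loads[relation] += 1

        def should_eager_load(self, relation):
            # Many lazy loads of the same relation in one request is almost
            # certainly an N+1 pattern; prefetch it next time around.
            return self.lazy_loads[relation] >= self.eager_threshold

    stats = LoadStats()
    for _ in range(25):
        stats.record_lazy_load("Author.posts")
    print(stats.should_eager_load("Author.posts"))   # True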
If you went back a few decades, the exact same argument would have been made, replacing 'ORM' with 'high-level languages' and 'SQL' with 'assembly'. The trade-offs are similar.
Aren't metrics such as "seconds on an EC2 instance" not particularly meaningful, because you get highly variable performance per instance based on who else is using the actual hardware? Am I correct to assume that m2.2xlarge instances are shared like other instance types?