Remember 10 years ago when this was a solved problem? There were only two rules: "always use stored procedures" and "never build dynamic SQL". Doing stuff on relational databases was fast.
Granted, you often had to actually write stored procedures by hand back then. But if you look at the LINQ in the article, you'll notice that it's pretty much exactly the SQL you'd stick into your stored procedure, just backwards. Had they not bothered introducing LINQ in the first place, they'd have had their 100x performance boost from the get-go.
Naturally, SPs are not some magic 100x-ing cure-all. They're just a non-generated version of your SQL, which means you're guaranteed never to have your ORM go nuts and build some monstrosity like the one outlined in the article.
You still need to tune your SQL by hand, but at least you can tune the SQL rather than some not-particularly-helpful abstraction on top of it.
This is the old Microsoft Misinformation. Stored Procedures are not faster.
SQL Server Books Online actually has this information clear as day, but MS evangelists like Rob Howard (who went on to found Telligent) would go around spouting the "sprocs are faster!" line, and no one would ever think to point out the discrepancy. It just became religion for the masses of MS-platform developers.
ADO.NET (and whatever it's called these days) executes all parameterized queries through (see if I can get the casing right) the sp_executesql procedure. This means that the query plan for every parameterized query run through ADO.NET is cached. IIRC the only additional step stored procedures avoided was parsing: execution of either comes up with the same cached plan, but sprocs don't have to be parsed each time.
Who's willing to bet that the SQL parser, with decades of optimization work behind it, can parse 100% of your queries in a few hundred nanoseconds?
In practice, then, that saving doesn't matter. In side-by-side tests I did a few years ago (which anyone with a few minutes can replicate), there was effectively no performance difference between the two. The parsing savings were swallowed up as noise in the overall network-latency/query-execution cost.
At the end of the day, then, sprocs are not the solution. The solution is to avoid dynamic SQL and use parameterized queries. Generated queries that aren't parameterized don't cache their query plans (last I knew, but it's been years, so maybe that's changed) and won't perform as well as sprocs.
The problem isn't LINQ then. If you know how the gears underneath it all fit together, you can easily avoid sprocs with no negative performance repercussions.
Ah, but stored procs are faster if you're executing business logic in them, because you can avoid round trips.
I think it's a terrible idea, but I've seen multiple places where there is a very blurred boundary between stored-proc-for-data-access and stored-proc-for-data-access-after-a-bit-of-business-processing.
If you can actually measure a significant difference between sp_executesql and a stored-procedure call, that's news to me. It's definitely not my experience, and not what the actual documentation indicates. I'd go so far as to bet that unless you came up with some kind of pathological example, you're misremembering or misunderstanding what I'm trying to say.
To keep it simple: if you use the .AddParameter interface, you should see no appreciable performance advantage to stored procedures. If you simply interpolate your parameters (or your O/RM does it for you), then you pay the price of generating the execution plan each time. Probably not a significant difference for a 100ms query, but it could be quite a substantial cost for a <10ms query.
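The distinction can be sketched in a few lines. This uses Python's sqlite3 rather than ADO.NET/SQL Server, purely as a self-contained illustration; the principle is the same: keep the SQL text constant and bind values separately, so a plan-caching engine sees a single statement no matter what values flow through it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

name = "alice"

# Interpolated: the SQL text changes with every value, so the server
# sees a brand-new statement each time (and it's injectable to boot).
bad = f"SELECT id FROM users WHERE name = '{name}'"

# Parameterized: the SQL text is constant; a plan-caching engine can
# reuse the compiled plan across executions with different values.
good = "SELECT id FROM users WHERE name = ?"

row = conn.execute(good, (name,)).fetchone()
print(row[0])  # → 1
```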
I remember 10 years ago, and I'm sure as hell not going back.
Versioning of SPs was hard. Code control for SPs was hard. Automated testing of SPs was hard. Incremental deployment (e.g. rolling out the new code to half of the servers in the web farm) was hard. Parallel development against a shared dev/staging server was hard. Trivial schema updates were hard.
And even then, it was never guaranteed to be fast. If the underlying tables weren't indexed, or your stored procedures did stupid shit like use cursors or temporary tables, it was still slow.
Why is any of that hard? Stored procedures and database schema changes are all written in SQL, which goes into source control nicely. Schema changes are easy in an environment like that. Parallel development on multiple dev databases, rolling changes into a single staging server, is easy in an environment like that. Just check in your code and let the automated build do its thing.
Nothing has changed in the last 10 years to make it harder. Granted, many shops weren't doing it well back then, thus the misconception that versioning SQL databases is hard, and that schema changes are hard.
That stuff was only ever hard because people declared it to be so, and built crazy bureaucratic organizational limitations to stop developers from changing the database (because that's hard and therefore not something that developers should be doing.) Once you start thinking of it as easy and build your tools around that fact, you'll notice that it actually is easy.
I'm not laboring under a misconception. I know how to do all of those things, I was doing them, it's just easier not to.
The example that comes to mind is making a change to a stored procedure's signature. In order to change it, you have to change all of the clients that talk to it at the exact same time. As a result, incremental rollouts become impossible, so you wind up doing high-risk huge-scale deployments or doing things like creating procedures named EntityInsertV2 and EntityInsertV3 or adding optional parameters with defaults or ignoring parameters that the procedure no longer needs but older clients still want to provide.
And then there's cleanup: you really should remove EntityInsertV1 once all of the clients are using EntityInsertV2, but you can't have that in the same update SQL, so there are really two deployments for every deployment.
If there's an easy fix for this problem (other than keeping procedural logic out of the persistence layer) I never found it.
> making a change to a stored procedure's signature. In order to change it, you have to change all of the clients that talk to it at the exact same time.
Assuming all of your clients are all server-side (ie, servlets) and are also in source control, this should not be an issue.
If you have thick clients accessing SQL functions directly, you have a bigger problem (client-server... ugh) that browsers have solved.
If you have multiple servers talking to each other, you still have a version of the "thick client" problem, namely enterprise integration... for which there are intermediary APIs.
> Assuming all of your clients are all server-side (ie, servlets) and are also in source control, this should not be an issue.
Source control isn't the issue (of course everything is also in source control, this isn't the dark ages).
I'm talking about a pretty common scenario where you have one SQL server and 4 web servers acting as clients. Let's say you want to do a zero-downtime rolling deployment and upgrade the web servers one at a time while the other 3 continue to run. This is a real issue that I encountered on multiple projects that were adhering to the "everything must be in sprocs" dogma. Moving to ORMs and parameterized queries saves a lot of headaches.
This doesn't seem to have anything to do with Stored Procedures. Do a quick search & replace above:
s/stored procedure/function/
... and notice that you're simply describing a common issue with building anything. If you change the interface to something, you need to change the stuff that calls it. That's just life, and it's the same story whether you have databases in the mix or not.
The distinction is that with functions, I can have my automated refactoring tools help me change them safely, I can have my compiler tell me if I get them wrong, I can have my automated tests tell me if I get them wrong.
By keeping the procedural stuff in the procedural layer and not in the persistence layer, it's easier (for me, anyway, your experience may be different) to keep things encapsulated and make contained changes in contained places.
Yes, bad table designs and poor practice will make your code slow regardless.
Still, though: while I don't dispute that SPs (frankly, databases in general) have issues with version control and deployment management, there are options for managing that sort of issue that don't require the blunt instrument of an ORM tool generating its own set of dependencies and inappropriate SQL for the job, leading to the sort of performance issues the article talks about.
>> "...Had they not bothered introducing LINQ in the first place, they'd have had their 100x performance boost from the get go..."
Except that when StackOverflow started, it was a small startup with just three developers and a small user base. They probably made an early decision to trade premature performance optimization for the ability to quickly iterate and ship the product, by using Linq-2-Sql instead of hand-writing every single CRUD operation.
Now that their business (i.e. web traffic) has grown exponentially and they have a larger group of developers, they can go back and optimize and performance-tune the code, including doing complete rewrites as necessary.
This is an important takeaway for anyone building a startup.
They cover the "why Linq?" question in one of their podcasts... and it basically boils down to: if you're drinking the Microsoft Kool-Aid anyway, you might as well drink the whole cup.
I think there was an implicit assumption that everything Microsoft does is great, and Microsoft would never lie to you. Oh, and the next version will be even betterer! Honest!
Basically the fix boils down to replacing a slow ORM with a faster ORM, and using indexes. Well of course indexes are faster - Duh!
HOWEVER this is where if I was them, I would start to get worried, because every SQL Server horror story I've been told basically starts with "And then the indexes became randomly corrupted" and goes rapidly downhill from there.
But... every single CRUD operation is normally not a huge amount of code, and it's relatively simple to generate from the information schema if you know a bit of SQL Server. Do it that way and you've got your sprocs just sitting there for later use: primarily fine straight out of the box, but available for modification as soon as you need something different.
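As a rough illustration of that approach, here's a minimal sketch (in Python, with a hard-coded column list standing in for a real query against information_schema.columns) of generating parameterized CRUD statements from table metadata:

```python
def make_crud(table, cols, key="id"):
    """Generate basic parameterized CRUD statements from column metadata.

    In practice you'd read `cols` from the database's information schema;
    here it's passed in directly to keep the sketch self-contained.
    """
    non_key = [c for c in cols if c != key]
    placeholders = ", ".join("?" for _ in non_key)
    return {
        "insert": f"INSERT INTO {table} ({', '.join(non_key)}) VALUES ({placeholders})",
        "select": f"SELECT {', '.join(cols)} FROM {table} WHERE {key} = ?",
        "update": f"UPDATE {table} SET {', '.join(c + ' = ?' for c in non_key)} WHERE {key} = ?",
        "delete": f"DELETE FROM {table} WHERE {key} = ?",
    }

crud = make_crud("users", ["id", "name", "email"])
print(crud["insert"])  # INSERT INTO users (name, email) VALUES (?, ?)
```

The point is just that the boilerplate is mechanical: once you have the metadata, emitting the statements (or the equivalent sprocs) is a string-formatting exercise.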
Your generated procs are your home-brew ORM. What exactly do you think an ORM does, other than dynamically generate CRUD and map data into objects to speed development? The ORM is better: it generates its code at runtime, so it doesn't leave a mess behind when things change, and it removes any need for you to regenerate.
That ORM can generate update statements that only update changed fields... can your proc do that or does it blindly update all fields? That ORM effortlessly maps data from the db into objects so you don't have to pass record-sets around; I don't want record-sets and data readers littered all over my programs, I want business objects.
Programs that keep all SQL in procs are horrible to work on and a bear to maintain, I'll take an ORM any day over a mess like that.
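For what it's worth, the "only update changed fields" behavior is just change tracking, and its core is small. A hypothetical Python sketch (the names and structure are mine, not any particular ORM's):

```python
class DirtyTracker:
    """Minimal sketch of ORM-style change tracking: remember the loaded
    snapshot, then emit an UPDATE touching only the fields that changed."""

    def __init__(self, table, key, row):
        self.table, self.key = table, key
        self._snapshot = dict(row)  # values as loaded from the database
        self.row = dict(row)        # values the application mutates

    def update_sql(self):
        changed = {c: v for c, v in self.row.items()
                   if v != self._snapshot[c] and c != self.key}
        if not changed:
            return None, ()  # nothing dirty: no round trip needed at all
        sets = ", ".join(f"{c} = ?" for c in changed)
        sql = f"UPDATE {self.table} SET {sets} WHERE {self.key} = ?"
        return sql, (*changed.values(), self.row[self.key])

u = DirtyTracker("users", "id", {"id": 7, "name": "bob", "email": "b@x.com"})
u.row["email"] = "bob@x.com"
sql, params = u.update_sql()
print(sql)     # UPDATE users SET email = ? WHERE id = ?
print(params)  # ('bob@x.com', 7)
```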
All code, auto-generated or not, is a sort of liability in that you have to maintain it. You have to re-generate it each time you make changes, and there's a possibility that someone can mess with it and cause it to break.
My take on auto-generated code is this: If code is so simple/deterministic/boilerplate that it can be auto-generated, then the generation of that code should be completely automated (at runtime, if possible) and the generated code should be invisible and untouchable.
I've never had a pleasant experience tweaking auto-generated code.
But if the code is being generated that way, you're also doing a lot of identical string concatenation at runtime, when identical results with higher performance can be achieved by doing it at an earlier point - compile time, design time, whatever. You're burning CPU cycles to avoid the existence of these objects, and separately making it harder to optimise your application later when you find you need to.
Depending on your application architecture, you're also increasing the chances of your needing multiple round-trips to the database to query various bits of information for building your query, increasing the load on the DBMS (and possibly the network, depending on your architecture) and increasing the query latency.
Properly auto-generated code should be the cleanest, best-laid-out, most consistent code you work on, quite possibly well commented as well. It's obviously not a panacea that'll work for everything, and I'm not saying ORM tools never have their benefits, but I'm certainly not convinced that third-party ORM tools are right for what I'm doing at present.
Sure, ORM tools can be a tradeoff between what's fast at dev time and what's fast at runtime. What has been working for me is relying on the ORM and then hand-tuning the small subset of things that need to be hand-tuned.
And, of course, I don't know anything about what you're doing at present, so I won't even try to speak to that. But I can't imagine a weaker argument against ORMs than the first one you made. When you're doing data-intensive work, are you really worried about the CPU overhead of string concatenation in your application layer?
For a decent subset of the procedures I've worked on, yes, I am, from experience. I've had app performance killed by excessive string concatenation before, so my inclination to repeat identical jobs at runtime which I know from experience cause performance issues is limited. Precompile.
I should stress, though, that I'm not a complete "it must be a compiled sproc stored in the DBMS" guy. I have worked with systems that effectively provided their own stored-procedure implementation within the application source code, because the underlying DBMS didn't support anything else. It's not ideal for a number of reasons, but it does work, is an order of magnitude better than ad-hoc SQL scattered throughout the app, still permits some of the benefits of genuine sprocs, and still gives the dev proper control over what they're doing.
I haven't used LINQ, but I do use Rails and its ORM. Using Rails' ORM to generate SQL is a case of optimizing for development time rather than execution time; assuming I would write faster SQL by hand, which is questionable, it would still be premature optimization and would be harder to maintain.
My take would be: use a great ORM to develop quickly and maintain easily, but be aware of common pitfalls, set up sensible indices, and have some performance monitoring in place. Tune by hand if absolutely necessary.
Actually, I do remember 10 years ago, that was when I got my first job. I remember that already then I was rather surprised to learn that what they said in the books about database applications was exactly backwards: the often repeated wisdom was that the database engines were fast and highly optimized, and you should do as much processing as possible on the database engine and as little as possible in your PHP script which was bound to be less efficient. The reality was that the database engine was always the bottleneck. If the PHP script was slow, you could replicate your Apache to as many machines as needed. If the MySQL was crapping out, you could buy a bigger machine or optimize your queries but generally that was a bigger pain.
MySQL could barely do joins ten years ago. Today, it's still behind where Oracle was ten years ago. You can throw extremely complicated queries at Oracle and it generally optimizes them well. Don't think of MySQL as "fast and highly optimized". It's only fast for relatively simple queries.
When I say complicated queries, I'm talking about 50+ line queries with many joins, subqueries, aggregations, etc. You can run these against tables with millions of rows and still get an answer in seconds, often regardless of indexes.
With one exception: joining over database links is terrible. I admit that it's a hard problem to solve, but still.
Just yesterday I rewrote a view that used a database link to act as a materialized view (which makes me feel icky). It ran approximately 10,000 times faster. For a 100-row result that only joins 4 modest tables, there shouldn't be that kind of a difference IMO.
If your database engine is the bottleneck, you don't have an enterprise architecture.
What's the busiest website that you have experience with? Do you have any experience with scaling to, say, serve more than a million dynamic pages per hour to consumers?
I ask because your opinion seems extremely uninformed when it comes to high performance websites. If you've designed your application right in a traditional CRUD application, everything except the database layer scales fairly easily. The game then is about finding ways to help that remaining bottleneck. Standard techniques include optimizing queries, caching data intelligently, directing queries to read-only slaves, and in extreme cases sharding data across databases. Several of these result in moving more logic into the app, not less.
I think the fundamental mismatch is not between OO and relational, or the application layer and the relational layer, but between reads and writes.
In the database world this conflict has been studied for decades under the headings of OLAP and OLTP. All you get taught in a university course or a beginner's book is OLTP-focused. Great if you're storing thousands of credit card transactions per second. Less useful if you need to read thousands of records per second.
Lots of NoSQL looks like it's actually dimensional modelling minus some of the really clever bits.
Well, with the ORMs the objects generate the relational stuff. I enjoy modeling my entities on paper as an ERD (from the entity relationship perspective) and then writing the object code.
However, I have found it is sometimes necessary to denormalize my schema and store entities based on how they are presented, not necessarily the idealistic OO ORM'ed way.
At the end of the day, the user doesn't care whether your code is abstracted properly (by whose definition?) or enough. It just needs to work speedily and correctly and be easily maintainable.
I think a lot of people know SQL but may not understand how to model entities correctly, and if they can't model the entities correctly, whatever they denormalize is still coming from a "less-good" model.
Personally, when it comes to denormalisation, my preference is that it be done in a semi-structured way such as dimensional modelling.
And that's exactly what I was talking about: a key purpose for DM is to make the database more accessible for fast bulk queries at the cost of duplicate data.
Whereas normalisation drives out duplication in order to maximise transactional performance and space efficiency.
In your case, you start with the OLTP/normalised model, then devolve to what is actually a kind of OLAP/denormalised model.
Write-oriented vs read-oriented.
Maybe the missing link for relational databases in a web world is actually some kind of real-time ETL?
This is simply not true. Stored procedures and dynamic SQL are completely orthogonal to the problems of N+1 select and index seek vs table scan.
In fact, many ORMs will happily use CRUD sprocs to save and retrieve complex entities, which compounds the N+1 select problem by passing all columns, every time. Compared to that, raw-SQL mappers like LINQ to SQL are a positive step forward, and leave the door open for more intelligent mappers like Dapper (as mentioned in the article: http://code.google.com/p/dapper-dot-net/)
They chose to write their query in C# instead of SQL. They could have kept LINQ-to-SQL and manually written the query (even made it a SP, for the slight gain that'd provide) and then used LINQ's mapping features to get objects out of it. (I believe the DataContext has a Translate feature that'll take any data reader and pop out objects.) I do this from time to time when the LINQ-to-SQL query has a bug or isn't doing exactly what I want, or when there's a bulk operation I can do much more efficiently. It's not exactly advanced LINQ-to-SQL...
Actually, if you look at other posts from Sam mentioning Dapper, you'll find that when they profiled the site they also found bottlenecks in LINQ to SQL's translation from the data reader to objects.
Since they're going to hand-write the SQL anyway, they're going to take advantage of that really fast micro-ORM instead of sticking with L2S.
Checking the SQL that your ORM engine produces, when you deploy your app in an environment where performance matters, has always been good practice. It's as true now as it was 6-7 years ago, when modern ORM solutions like Hibernate (or similar things on .NET) started to appear.
If you are building anything more complex than a blog site and expect to take a decent amount of traffic - to the point that you may in fact care about optimizing at all - going with an ORM that writes SQL for you is a really, really bad idea. I really don't understand the fascination with ORMs today. Some sort of SQL-to-object translation layer is no doubt a great thing, but any time you write "sql" in a non-SQL language like Python or Ruby, you are letting go of any ability to optimize your queries. For reasonably complicated and trafficked websites, that's a disaster simply waiting to happen. This isn't just blind speculation on my part: I've heard a great many stories where very significant resources had to be dedicated to removing an ORM from the architecture, and the Twitter example should be familiar to most.
I would go so far as to say that sql writing ORMs are a deeply misguided engineering idea in and of itself, not just badly implemented in its current incarnations. You can't possibly write data access logic entirely in your front end and expect some system to magically create and query a data store for you in the best or even close to the best way.
I think the real reason people use ORMs is because they don't have someone at the company that can actually competently operate a sql database, and at any company of a decent size traffic-wise that's simply a fatal mistake. Unless you are going 100% nosql, at which point this discussion is irrelevant.
ORMs aren't a problem at all, as long as you have the ability to override problematic queries with named queries, etc.
ORMs can provide very real advantages when it comes to caching, development time, etc., as long as you review what the ORM is doing and notice when it's doing it wrong.
I've just spent 2 years architecting a high transaction global video game system using an ORM, and it worked well. In our case, the ORM provided acceptable SQL for about 85% of the queries, and we overrode the rest.
The ability to quickly and easily allow the developers to write their own SQL, to be reviewed later by a DBA, was a life saver. Combine that with our stress and load testing, it was easy to see where the hot spots were and deal with them effectively.
The problem comes from people who rely on the ORM to do everything for them without truly understanding how it works.
ORMs, like anything else, are a tool, and there is a time and place for them.
Wrapping both caching logic and database access in an ORM-like system is no doubt the right thing to do. Letting front-end developers write queries to be converted by an ORM and reviewed by a DBA later - in my opinion, that's not the most efficient method of development. I probably would have invested in an extra DB person or two to help write the data access logic. But hey, I can't argue with results - if it worked for you, that's great. As a general statement, though, I think that sort of development methodology is highly conducive to errors and systematic problems that would not become evident until later, and at that point take a great deal of effort to fix.
The two big systems I architected where I made the decision to go with ORM's were the online EA Sports system (all EA Sports games on all platforms, currently running in a 7 node Oracle cluster), and most recently, the Need For Speed World Online system. We launched the EA Sports system with Madden, and went from 50 to 11 million users hitting the DB in less than an hour. Then we rolled out the other EA Sports games. Needless to say, both systems were slightly bigger than a simple blogging site.
In both cases, we had a large number of smart developers who we empowered with the use of an ORM; they understood the domain model, and they didn't have to worry about waiting for a "DB type" to write stored procedures, or develop a data model, etc. As a matter of fact, in both cases, I was the only DBA on the project, and it was a predominately part-time role. We'd meet, ensure we were all on the same page with the object/data model, and then they'd go and build it. The developers were able to immediately build and run and test and integrate something that was functional and operational, when they needed it. This was HUGE, and something that most people don't properly appreciate. Timelines were already insane enough as it was, the last thing we needed to do was artificially constrain ourselves by waiting for other (db) devs before work could go on. Especially when requirements had the potential to change from one day to the next.
In both situations, we took advantage of very, very sophisticated testing procedures that would happen nightly, both functional and stress/load, and it pointed us at the bottlenecks of each nightly build that would require tuning and investigation. We intentionally set up our testing to be able to monitor and test the effectiveness of the ORM, and to point it out when it didn't work efficiently. The devs would do the majority of the heavy lifting with the initial data model, and the results would be tested, reviewed, and then modified if required. The performance modifications were not a lot of effort to fix, either. Usually it was a very slight data model change, or using a named query to take advantage of a database-specific features. And CLOBS. Every database seems to handle them differently, so we had to hack some solutions.
Having done large scale database development for almost 25 years, using the classic stored procedure approach and the ORM approach, I'll say again that ORM's are a great solution for certain projects with the right staff, and aren't a crutch or some lazy choice if used properly.
As another 'Oracle guy' this is an interesting post. I have said before on here, if you pay for Oracle, and also pay for decent storage arrays, Oracle can shift a serious amount of data before it reaches its limit. In my opinion, it really seems to be an order of magnitude better than its closest open source competitor.
I think when people say "relational doesn't scale," what they mean is "MySQL often requires application-level changes to scale out."
I assume (among this crowd, anyway) that scaling out is more desirable vs. scaling up because the majority of the hardware costs are variable, whereas scaling up requires a step function of large cash investments that startups often can't afford.
Do those massive systems on Oracle etc scale out, or simply scale up with expensive hardware?
You can certainly scale out with Oracle RAC. At some point the bottleneck will likely be disk however, so buying a high end storage system would probably become priority.
My experience is from writing a bunch of middle-tier code at MySpace in the 06-07 time frame, the MySpace heyday, when they were pushing more traffic than Google (true story). Anyway, the user-facing product might have sucked, but we did scale (that's why Friendster was Friendster and we were MySpace :). In an environment with 450+ million users, we had extensive caching systems and still had to use every SQL trick in the book to get our systems to scale well. I know because my job was working with the DBAs to bridge the SQL and front-end worlds. I can say with great certainty that front-end developers who did not know SQL and were simply following a logical object model would not have produced code that scaled in our environment; there were way too many things done that were extremely non-obvious. Since MySpace I've been working at a Python/Postgres startup where we've been applying the same principles pretty successfully, at a much different scale of course. If nothing else, I think the no-ORM approach will at least give you more bang for your buck.
Separating your data access code out of the application logic also allows you to change it much more easily as data conditions change, including on the fly, without an application deployment. That's often extremely useful.
MySpace scale may be at an extreme end of the spectrum, but we had formidable hardware to throw at it too (although x86, so nothing TOO crazy). So I think the ratio of hardware to scale at other sites is comparable, and the same lessons apply. I have no experience working with Oracle, but would you say that a 7-node Oracle cluster is some pretty serious hardware? I myself really don't know, but it is a question I have :).
EDIT: I'm not discounting your experience; I just want to point out that I've experienced conditions where I think the ORM approach would have broken down. If others have had different experiences, the more data points the better, but I think the scale/complexity/cost(hw) ratios play into the debate as well.
EDIT #2: Oh, and I forgot to mention that the automated test suite you had is an incredible asset, and it no doubt made it easier to discover problems early and deal with them effectively. But you do have to invest resources in creating one, and something like that is no small cost at a startup.
The point of my post was to say that if you take a serious look at the ORM you want to use, fully understand the issues you may have with it, design/adapt your development process to help mitigate the issues you may run into, there are huge advantages to using it.
I was just pointing out that ORM's are indeed quite effective in online systems that are more complex than a blogging site.
If you're going to say "no, don't use it", based on a development situation that is very much an outlier (MySpace), and use that experience to discount it for any but trivial use, then I'm not sure what to say.
They can and do offer real-world advantages with minimal downside if you treat them like any other tool, and not use them blindly, in reasonably complex and large systems, as I've tried to demonstrate.
As to your environment, the data requirements were quite different from ours. Our systems were more like online banking systems: very much an even split of fast writes and reads, transactionally bound to third-party systems (in-game payment, in-game "real time" use of consumables, etc.), real-time analytics for fraud detection, and so on. We were very much high-IO, and our caching opportunities were few and far between.
And in our environment, we HAD to have sophisticated testing. I ensured that the stress and load testing was done so that we could directly simulate the load of our expected user base, with realistic profiles, in order to better engineer our databases and disk IO. It also allowed us to measure the impacts of feature additions, etc. If it failed in Production, it made the news, and we had millions of gamer-freaks bitching everywhere.
In my case, the middle-tier was not an issue... we enabled minimal caching on a per-box basis, and other than that, they were stateless, and we could add/remove them at will; the application WAS the database.
And you can still abstract various parts of the database while using an ORM. We did write a few special stored procedures, and used some forced query plans, views, etc., to tweak the performance.
And yes, Oracle can scale out quite well. Cache Fusion, high speed and low latency interconnects, and shared block access provides incredible scaling without having to do anything special in the middle tier.
It's interesting to hear that it has worked well; obviously this wasn't a small project. Your point about knowing how to use your tool definitely rings true. Also interesting that you had a use case where data loss and integrity actually mattered, and in real time, unlike a social network or most start-ups operating today. Going with a heavy Oracle system instead of trying to roll your own creative distributed architecture definitely seems to make sense in that scenario. Just out of curiosity, was this Java/Hibernate?
On one system we used Java/Oracle/Hibernate and went with the big single cluster. The other system was a .NET stack, using NHibernate and a large number of SQLServer instances. We also worked with Microsoft on integrating their latest (at the time beta) caching servers. We did indeed have to roll our own distributed architecture in that case, but it's not like we had to drop ORM to do it.
If anyone has any questions about how I've used ORM, etc., feel free to email me at nettdata@gmail.com if you like. I don't usually keep tabs on old threads, and have no problems sharing some of my experiences in this.
> any time you write "sql" in a non-sql language like python or ruby you are letting go of any ability to optimize your queries.
No you're not. Look at ActiveRecord, it lets you drop to any level of SQL optimization you need. In ActiveRecord 3 with ARel queries are composable, allowing lazy loading and the breaking of queries into appropriate locations according to your code architecture.
I can't speak to other ORMs, maybe they really are as bad as your opinion would indicate, but I suspect what you're really complaining about is people who don't know how SQL works being enabled to write horrible data persistence code by ORMs with a pretty facade. That's a legitimate problem, but the fact that a tool can be abused is not an argument against the tool itself. We'll never build anything great if we are driven primarily by what the ignorant will do with it; after all, every single person on the planet is ignorant of most things. Our tool development should be driven by what tools enable experts to do.
Things like lazy loading are a red flag to me that you are doing something wrong, so if your framework allows you to do that, that's not necessarily something to brag about :). Random IO that is triggered by merely accessing a property, without the programmer's knowledge, is not the best approach if you want to scale; you are better off doing deliberate fetches as a result of previously fetched data. If you are breaking up and composing queries, how are they broken up and composed by the ORM: as joins or as subqueries? If as joins, does your ORM know the best columns to join on? You could replace everything with named SQL functions (dropping to the lowest level of optimization as you mention above), but at that point what is your ORM really doing for you? Anyway, sorry, I'm not sold :). Maybe if you effectively replicated the database engine in your front end framework I would come closer to being sold, but even then you don't have the same rapid in-memory access to statistics about tables to make the right optimization decisions, etc.
You are thinking at too abstract a level. First of all, IO is not "randomly" triggered; it's entirely predictable how it works. Composability allows you to define a scope that is globally applicable (e.g. "published articles") and then add additional constraints in a controller (e.g. "tagged X"), using the logic of relational algebra to construct a sane query. Could this query be slow? Sure, it's still your responsibility to make sure the schema supports that query. Writing SQL manually does not absolve you of that responsibility, it just means organizing a lot of SQL strings somehow.
Here's what you're missing, and why ActiveRecord works: if you halfway know what you're doing with a RDBMS, 95-99% of the time the query generation just saves you writing a lot of boilerplate SQL. It's true that sometimes you have to drop to a lower level to hand-craft a query, but ActiveRecord in no way prevents that. Again, I don't know what kind of ORM hell you have been put through, but I assure you that an ORM does not need to be this horrendous performance killing black box that you think it is.
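To make the N+1 pattern being argued about here concrete, here is a minimal sketch using Python's stdlib sqlite3 module as a stand-in (the articles/comments schema is made up for illustration; an ORM's lazy loading produces the first shape, eager loading or a hand-written JOIN produces the second):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           article_id INTEGER REFERENCES articles(id),
                           body TEXT);
""")
conn.executemany("INSERT INTO articles (id, title) VALUES (?, ?)",
                 [(i, f"Article {i}") for i in range(1, 101)])
conn.executemany("INSERT INTO comments (article_id, body) VALUES (?, ?)",
                 [(i, f"comment on {i}") for i in range(1, 101)])

# N+1 pattern: one query for the list, then one more per row for its children.
queries = 1
articles = conn.execute("SELECT id, title FROM articles").fetchall()
for art_id, _title in articles:
    conn.execute("SELECT body FROM comments WHERE article_id = ?",
                 (art_id,)).fetchall()
    queries += 1
print(queries)  # 101 round trips for a 100-row page

# Eager alternative: one JOIN brings back the same data in a single query.
rows = conn.execute("""
    SELECT a.id, a.title, c.body
    FROM articles a LEFT JOIN comments c ON c.article_id = a.id
""").fetchall()
print(len(rows))  # 100 rows, 1 query
```

The schema still has to support the JOIN (indexes on the foreign key, etc.), which is the responsibility neither approach removes.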
There is one thing that I think can help with the SQL overhead you mention: having a rock-star dedicated SQL person who can take all this work off your hands (that's not me btw, I've just worked/am working with such people). I think it affords you easier long term growth if you have expectations of making it to the medium-to-large company world, while not slowing you down when you are small, so I think it's a better strategy for both small and large companies. Are you signing on for a potential bottleneck? Yeah, it is a trade-off and it is paramount you hire well in that area, but that's the sort of problems and decisions you have to make all the time at a company.
I understand where you're coming from and what you describe may be workable in a smaller company with 7-14 devs where everyone knows what they are doing and understands well what happens under the hood. I think it's less likely to work at a company with 50+ devs though where you inevitably start trusting people less, or just at a company where you don't trust everyone. I've worked at both types. There is also the question of the complexity of your data and the way you need to query it. Right now we do essentially a ton of graph queries that we optimize highly in sql (ends up working much faster than any graph database since the schema and the queries are optimized for the exact data we are working on). Some of the functions that I write for this would not be implementable in an orm. I suppose that could be the case where you drop down into raw sql, but that happens to be a fair chunk of our code.
Maybe you can make it work better than I'm expecting, but if you were starting from scratch would you really want to go down that path anyway, all things considered? My original argument was that you are better off choosing a different way. I suppose that point of view will be difficult to change for me :).
Have you actually ever used an ORM? All of your posts strike me as being of an "I imagine it would be bad" nature, without actually speaking from experience.
I get the impression that you don't really understand the complexity and capabilities of a modern, robust ORM, or how it can be used.
And you're working on a data warehouse, which isn't usually a viable candidate for ORM in the first place.
I've used ORMs before I worked at MySpace. NHibernate specifically. I've also used SQLAlchemy on the Python side. NHibernate was in a professional environment; SQLAlchemy was a bunch of stuff I did for evaluation purposes, so you can discount that if you like.
And I'm not working on a data warehouse... why do you think that?
I completely agree. I didn't even realize the "N+1 Selects Problem" was a problem because it should be referred to as "My Training Wheels Fell Off and Now My Bike Falls Over When I Sit On It".
Replace ORM in your argument with C and SQL with assembler. Just like high-level languages, it's an abstraction layer, and it can really help you put stuff like caching and escaping (think SQL injections) into one place. Also, it's much easier to change a function name in your abstraction layer than to change your database schema.
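The "escaping in one place" point can be sketched with Python's stdlib sqlite3 as a stand-in (the table and function names here are invented for illustration): if every lookup goes through one chokepoint that binds parameters instead of concatenating strings, injection is handled once rather than at every call site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

def find_by_name(name):
    # Single data-access chokepoint: parameters are always bound,
    # never spliced into the SQL string.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()

print(find_by_name("alice"))        # [(1, 'alice')]
print(find_by_name("' OR '1'='1"))  # [] -- the payload is treated as data
```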
The advantage of using LINQ database querying in C#, and it's a big one in my experience, is that your queries are actually typechecked by the compiler like any other code, making it a lot easier to refactor. (In the context of Python/Ruby which don't even have typecheckers I have no idea what the draw is).
The disadvantage is that due to some organizational dysfunction at MSFT there's still no really satisfactory ORM infrastructure surrounding the query engines.
(as for your "misguided engineering idea in itself" claim, I don't really see how it's fundamentally different from writing SQL in the first place to be translated by the database into query execution plans, vs. writing the query execution plans directly).
The difference at a high level is that SQL has a syntax and set of capabilities that is quite unique, and every single database vendor has its own extensions or differences driven by their particular approach. To really replicate all of this in code you would have to go beyond the basic data structures and syntax of that programming language, and at that point you might as well just have SQL. It's a paradigm and an approach expressed through its own syntax; you can't easily copy all of it in a totally different programming language.
As for checking for type safety, I think frameworks that do sql-to-object mapping (with type safety), and also handle cache for you, are a very useful thing. Making raw calls on database connections is definitely too far "in the other direction" :).
On the syntactic level SQL is just a poorly designed language. LINQ query expressions actually do a better job of expressing the semantics of the SQL-like set/collection operations, in a compositional manner. It's definitely true though that SQL databases currently have a lot of capabilities (like, errmm, DML) that at least the Microsoft ORMs don't support other than by dropping down to SQL. I don't think this is a problem with the LINQ IQueryable paradigm, though, but just a problem with the Microsoft ORMs being incomplete.
I don't have much experience with ORMs or mapping frameworks other than LINQ-based ones, but it seems like it would be pretty difficult to typecheck queries expressed as SQL strings, at least dynamic ones, at compile time. Do the frameworks you mention typecheck the actual query itself at compile time, or do they just check at runtime that the data returned from the query matches what you want?
Query results you will normally adapt to specific object properties, and at that point the only thing that can bite you is if your query starts returning columns of a different sql type, which of course you can't catch compile time anyway. If you wrap your query results with objects and maintain an interface into your update statements via method calls (which obviously have type checking for arguments), I don't see how you can run into serious trouble. In the dynamic language world you of course don't have compile time anything, but you can use pretty much the exact same techniques to ensure you don't pass something bad to your query. I guess this isn't very dynamic, but that's the idea - your data access logic lives in the database, you execute methods and get back objects. That's the no orm way.
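The "wrap your query results with objects" approach described above can be sketched in Python (stdlib sqlite3; the User class and fetch_users helper are hypothetical names for illustration). There is no compile-time check, but the mapping boundary is where a shape mismatch would surface at runtime:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:  # hypothetical wrapper object for query results
    id: int
    name: str

def fetch_users(conn):
    # Adapt rows to objects at the boundary; a mismatch in the column
    # list fails here, at runtime, rather than being caught by a compiler.
    rows = conn.execute("SELECT id, name FROM users").fetchall()
    return [User(*row) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")
users = fetch_users(conn)
print(users[0].name)  # 'alice'
```

Callers then only ever see method arguments and returned objects, which is the "no ORM way" being described: data access logic lives in the database, and the application executes methods and gets back objects.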
> If you are building anything more complex than a blog site and expect to take a decent amount of traffic, to the point that you may in fact care about optimizing at all, going with an ORM that writes sql for you is a really really bad idea.
Then explain the massive success of Rails. Quite simply, you are wrong.
I don't think they're necessarily misguided. DataMapper made efforts to circumvent the N+1 problem, in most cases probably pretty effectively.
Partial Updates are also pretty easy. Slamming every field into every INSERT/UPDATE is obviously a bad idea.
I think the missing sauce for ORMs is funding. Getting the basics together takes time and money, and it's hard to pull off in your free time.
On the other hand, having written many an ORM, I think there's still plenty of room to advance the state of the art. One of the biggest untapped (AFAIK) opportunities is using Statistics for query tuning. It's the life-blood of databases, but statistics are noticeably lacking in ORMs. Even simple counters could allow you to tune lazy-loads, JOINs, pre-fetches, etc on-the-fly.
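As a rough illustration of the counter idea (entirely a sketch of my own, not any shipping ORM's mechanism): track how often an association is actually touched after its parent is loaded, and flip it from lazy to eager once the hit rate crosses a threshold.

```python
from collections import Counter

class AccessStats:
    """Count how often each association is touched after a parent load.
    Hot associations become candidates for eager loading (a JOIN or a
    batched fetch); cold ones stay lazy. Names here are hypothetical."""

    def __init__(self, threshold=0.5):
        self.loads = Counter()    # parent loads per association
        self.touches = Counter()  # lazy-load hits per association
        self.threshold = threshold

    def record_load(self, assoc):
        self.loads[assoc] += 1

    def record_touch(self, assoc):
        self.touches[assoc] += 1

    def should_eager_load(self, assoc):
        loads = self.loads[assoc]
        if loads == 0:
            return False
        return self.touches[assoc] / loads >= self.threshold

stats = AccessStats()
for _ in range(10):
    stats.record_load("article.comments")
for _ in range(8):
    stats.record_touch("article.comments")
print(stats.should_eager_load("article.comments"))  # True: hit on 80% of loads
```

A real implementation would need to decay old counts and persist them across processes, but even counters this simple would let the framework tune itself instead of relying on hand-placed eager-load hints.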
If you went back a few decades then the exact same argument would have been made by replacing 'ORM' with high-level languages and 'SQL' for assembly. The trade-offs are similar.
The term "NoSQLite" shouldn't be used, or at the very least it should be hyphenated "NoSQL-ite." The existence of a little database engine called SQLite makes this term more than a little confusing.
This seems like a fairly common issue with frameworks that convert things to SQL behind the scenes. If you aren't paying attention, it will run a query for each object, instead of one for the whole page.
SQL is quick and easy to write. I never understood why people decided it would be a good idea to add another layer of abstraction on top in various web frameworks.
I don't think it saves dev time at all. It just makes it more difficult to know what queries you're running by looking at the code.
In my experience, it saves a lot of dev time. Using my current ORM (LINQ with Entity Framework 4.0), I can get probably-fast-enough CRUD against a new entity in just a few minutes without writing any SQL at all.
When it's not fast enough (high volume, specific complex queries, whatever) I can write specifically tuned SQL.
Performance tuning on anything that isn't an actual bottleneck is waste.
Realistically, you should never be writing your own CRUD, whether you use an ORM or not.
CRUD stored procedures can be generated directly from the database schema, wrapped in C# objects using the same code generator, and compiled into your project whenever you make a schema change. That gives you all the advantages of an ORM, without any magic runtime SQL generation.
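The generator idea is straightforward to sketch. Here is a toy version in Python against SQLite's schema metadata (the real approach described above emits stored procedures plus C# wrappers for SQL Server; this just shows the "derive CRUD from the schema" step, with invented names and the assumption of an `id` primary key):

```python
import sqlite3

def generate_crud(conn, table):
    """Emit basic CRUD statements from the live schema, the way a
    design-time code generator might. A sketch, not a full generator:
    assumes a single integer primary key named 'id'."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    non_key = [c for c in cols if c != "id"]
    placeholders = ", ".join("?" for _ in non_key)
    assignments = ", ".join(f"{c} = ?" for c in non_key)
    return {
        "insert": f"INSERT INTO {table} ({', '.join(non_key)}) "
                  f"VALUES ({placeholders})",
        "select": f"SELECT {', '.join(cols)} FROM {table} WHERE id = ?",
        "update": f"UPDATE {table} SET {assignments} WHERE id = ?",
        "delete": f"DELETE FROM {table} WHERE id = ?",
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
crud = generate_crud(conn, "users")
print(crud["insert"])  # INSERT INTO users (name, email) VALUES (?, ?)
```

Run the generator on every schema change, check the output into source control, and the SQL your app executes is fixed at build time rather than assembled per request.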
Serious question: Why is having stored procedures generated at design time better than having dynamic SQL generated at run time? It's not like there has been a meaningful performance benefit for the last several years.
No, performance isn't really a reason to go with Stored Procedures.
It's all about the readability, maintainability and debugability of your code. If you have everything generated in simple classes backed by simple stored procedures, you can look at a stack trace and immediately see what's going on without having to mentally unravel what you think your ORM might have been doing at that point. That is, you'll be looking at a single line of code that does a single thing, rather than a null reference in a hashtable of hashtables that was loaded at runtime from an XML file.
It's also a lot less duplication of effort. Realistically, the SQL calls that your app makes for CRUD actions will only ever change when you change your schema. With that in mind, it seems a bit wasteful to re-generate that same SQL over and over again with every page request. In my mind I'd rather build it all once after a schema change, and drop it into source control where it's part of the record of what your code is doing.
I'd also like to hear a reasonable answer to that question.
For me, the only advantages of stored Procs over straight SQL (generate or hand coded) are:
* Security - you can revoke rights to directly access / change the tables from the app layer, making all access be through procs. However, in general you can probably create a similar security setup using views.
* Many SQL calls that need to be run as a set to give a single answer - this avoids the round trip latency of firing many SQL calls from the application.
I am genuinely interested to hear of more advantages.
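The round-trip point in the second bullet shows up even without stored procedures: answering a batch with one set-based statement beats firing one statement per item. A small sketch with Python's sqlite3 (invented schema; with a real networked database each per-id call would also pay network latency):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (id, total) VALUES (?, ?)",
                 [(i, i * 10.0) for i in range(1, 51)])

wanted = [3, 7, 19, 42]

# Chatty version: one round trip per id.
per_call = [conn.execute("SELECT total FROM orders WHERE id = ?",
                         (i,)).fetchone()[0]
            for i in wanted]

# Set version: one statement answers the whole batch.
marks = ", ".join("?" for _ in wanted)
batched = dict(conn.execute(
    f"SELECT id, total FROM orders WHERE id IN ({marks})", wanted))
print([batched[i] for i in wanted])  # same answer, one round trip
```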
>CRUD stored procedures can be generated directly from the database schema, wrapped in C# objects using the same code generator, and compiled into your project whenever you make a schema change.
You can't write it faster than an ORM. anObject.save: your SQL can't beat that in terseness, and by the time you've written CRUD SQL for one table, the guy with the ORM is far ahead of you in getting actual work done.
> I don't think it saves dev time at all.
It saves massive amounts of time, if you don't think so, then I doubt you've used one long enough to be able to have a valid opinion on the matter.
Further to other replies, I also find that the mental shift between languages is always a gear change that slows down development. Especially shifting to your SQL brain.
Think of it the other way around, their engineers didn't optimize any queries and they don't have a dba watching for table scans or nested loop joins. To the amateur crowd it looks like "wow optimization" but if you've been developing seriously for your career this should look like "you have a lot to learn".
Sometimes table scans happen even when you think you know what you're doing. If someone adds a column and doesn't think anyone will filter on that column, it can easily happen. I usually ask people:
- how big is this table getting?
- how are you querying on it?
That usually fixes most problems before they happen. Not sure it's my job to specify a join type though. A nested loop join isn't a bad thing or even a sign of a bad thing happening necessarily.
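Checking for a scan is cheap enough to make part of the routine above. A minimal sketch with SQLite's EXPLAIN QUERY PLAN (the schema is invented; SQL Server and Oracle have their own plan viewers, and the exact plan text varies by engine version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, "
             "user_id INTEGER, kind TEXT)")

def plan(sql):
    # Each plan row's last column is a human-readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT kind FROM events WHERE user_id = 42"
before = plan(q)
print(before)  # e.g. "SCAN events" -- a full table scan

conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = plan(q)
print(after)   # e.g. "SEARCH events USING INDEX idx_events_user (user_id=?)"
```

Note the plan changes as soon as the index exists, which is why the "how are you querying on it?" question matters before the table gets big, not after.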
I don't think Stack Overflow's case is a case of "see, you can scale SQL databases". I think it's more of a case of "they've forced themselves to HAVE to scale SQL databases by tying themselves to .Net and MS SQL Server".
There's nothing wrong with relational databases, it's just that many of the NoSQL databases are newer, encourage denormalization, and implement things like intelligent sharding and mapreduce out of the box.
Which one you prefer seems to depend on whether you're more systems-oriented or code-oriented (if you have someone dedicated to tuning your database, why not just let them worry about it?), or whether you're just forced to use one by virtue of the platform you're using.
Having a de-normalized, document-based / column-based / whatever-based NoSQL database doesn't mean that every query or map-reduce function you write in it is going to be super-fast.
I think we all could benefit from articles similar to this one but applied to NoSQL databases: Take a real-world example of a slow-query in a NoSQL database, show how you identified it and profiled it, show how you solved it.
I've personally been burned with performance issues in CouchDB when it came time to perf-test my queries, and I really would have benefitted from articles that detailed a real-world example like this.
I feel like there's a lot of tutorials on how to create toy applications using NoSQL, and a lot of claims on how things like sharding and mapreduce will let you scale to the moon, but comparatively little information on how to do real-world things like what this article covers.
I think it's more a case of "boring but tried and tested" or "new but potentially unreliable". Would you risk your business just to say you use latest tech buzzword?
Things like NoSQL are neat to play with in your spare time, but I'd think long and hard before I'd use them in actual production code. I seem to remember hearing about Cassandra causing woes in the past, for example.
If you're putting your database on your feature page, you probably have other problems than scaling.
It's not about "latest tech buzzword", it's about using the tool that is most appropriate to the job. Cassandra worked up to a point, but it ultimately was not the right tool for the job at Twitter and Facebook scale. Is that cause to be dismissive of all NoSQL databases?
I agree that you should give careful consideration to your technology stack before putting it into production, but to me that means considering all of the options available, not just the old stuff.
This is true, but it should be pointed out that Facebook scales very poorly. Messages disappear/reappear, friends are suddenly missing and suddenly back, your own posts suddenly vanish.
I think Facebook does a pretty decent job considering that they have hundreds of millions of users, with many of them hitting the site many times a day. They've got an absolutely tremendous amount of traffic.
For rapid development, there are a wealth of tools that will just work on SQL databases. No special query language to learn. Jeez, none at all if you use one of the nice ORMs and magic query writing they provide (which was the point of this article). You can even drag and drop your way to a semi-decent app if you know that toolset.
On top of that, you get well-known consistency, transactional updates, and so on. Stuff most developers are used to. So, there's plenty of reasons for picking a traditional SQL database, especially when starting.
Even more so, you don't need to throw SQL away (which is a nice declarative query language) in search of performance. For instance, look at VoltDB/HStore for an example of a SQL-based, table based RDBMS that provides ACID, high performance and linear scaleout. (Granted, with some tradeoffs - it's not for running reports off of, for instance.) On the non-OLTP side, look for massively parallel databases, which still allow SQL while providing amazing scaleout analysis performance. (Downside is that those analysis databases are very expensive; I'm unaware of any cheap or opensource alternative that still uses SQL, except perhaps Pig, but it doesn't really compare, performance-wise.)
Tying yourself to LINQ is one thing, tying yourself to MS SQL Server is completely fine if you're ok with the licensing fees. SQL Server is probably one of the best enterprise databases out there right now, I imagine right behind oracle.
They aren't really tying themselves to LINQ, though. It seems like they're using LINQ to get the dev speed improvement when they can, and dropping down to writing their own SQL in the subset of cases where they know that LINQ isn't fast enough.
Sure, I'm just trying to speak up for some of the less popular technologies in today's dev community :). SQL Server has the downside of costing money, but I think it should be in the equation when making technology decisions today; it's a vastly superior piece of tech compared to MySQL and Postgres :).
They turned on indexing on the table and went to a JOIN-type query. Boy, I have trouble believing steps 1 - 5 were necessary to come to such a rudimentary fix.
The N+1 problem was fixed by changing the ViewModel to use a left join which pulled all the records in one query.
This is a classic ORM problem. The usual solution: apply a Declarative Batch Query. I've even written one of those things! The reason why you might want to code a Declarative Batch Query option for your Object Relational framework? You can then apply it to similar problem queries in a few minutes total and one line of code.
The first operation I optimized went from 2500 SQL calls to just 40, which is an over 50X speedup. And yes, the entire declarative mechanism was written in about a week to satisfy a routine "Change Request," and yes, it ran in production on a heavily used trading application at a major multinational.
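The batching mechanism behind that kind of 2500-to-40 reduction can be sketched simply (my own illustration, not the framework described above): gather the parent keys, then fetch children in chunked IN-list queries instead of one query per parent.

```python
import sqlite3

def batch_fetch(conn, table, key, ids, chunk=64):
    """Fetch child rows for many parent ids in a handful of IN-list
    queries instead of one query per parent. A sketch of the batching
    idea; table/column names are interpolated, so they must be trusted."""
    out, ids = [], list(ids)
    for i in range(0, len(ids), chunk):
        part = ids[i:i + chunk]
        marks = ", ".join("?" for _ in part)
        out.extend(conn.execute(
            f"SELECT {key}, id FROM {table} WHERE {key} IN ({marks})", part))
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, account_id INTEGER)")
conn.executemany("INSERT INTO trades (account_id) VALUES (?)",
                 [(i % 500,) for i in range(2500)])

rows = batch_fetch(conn, "trades", "account_id", range(500))
# 500 parents now cost ceil(500 / 64) = 8 queries instead of 500.
print(len(rows))  # 2500 child rows
```

The "declarative" part is then just letting callers mark an association as batch-loaded in one line, with this plumbing applied underneath.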
I see a 10x speedup was given by an index. A while back I was helping a customer with some serious performance issues, and I pulled together a few procs that look at e.g. indexes and most-locked-objects over the whole DB. I've seen the trick below referenced many times but the source article seems to be here:
It gives an indication of the highest-impact indexes that are missing. Obviously there's room for experts to tweak, but for most it's an excellent quick tool. Don't create every index it suggests, obviously, just use it as a guide to look for hotspots. It's nice because often it'll surprise you with areas you had no idea were an issue. And it takes a few seconds to run.
"__A NoSQLite might counter that this is what a key-value database is for. All that data could have been retrieved in one get, no tuning, problem solved. The counter is then you lose all the benefits of a relational database, and it can be shown that the original was fast enough and could be made very fast through a simple tuning process, so there is no reason to go NoSQL.__"
Moral of Story:
Relational databases should be used in most cases. NoSQL is overrated & over-hyped.
Arguably, this particular failure case is what has been automated. It isn't actually necessary to have an ORM that all-but-requires N+1 select problems. I wrote one once that could join across an arbitrary number of tables in one shot. Alas, it is now gone in an acquisition. But I know it's possible.
Part of the problem is that I think most ORM authors know OO but don't necessarily grok SQL, so you get the same failure cases over and over again. It's not easy to write an ORM that can do arbitrary joins and it's borderline impossible to retrofit one that didn't have it built in from day one. The other big failure that you get is a failure to support group queries or arbitrary selects.
While performing a code review, I found that the code uses a LINQ-2-SQL multi-join. LINQ-2-SQL takes a high-level ORM description and generates SQL code from it. The generated code was slow and cost a 10x slowdown in production.
I believe they are using SQL Server, which keeps statistics on the longest-total-running and longest per-run queries. It's a good idea to monitor this list and take a very close look at every execution plan that comes up. Many times you end up doing RID lookups due to non-clustered indexes that don't cover the requested data.
MSSQL is easier to optimize, at least for me, than say Sybase and Oracle, which I also supported.
People can bash MS as much as they want, but their products are dirt simple to use and for the most part efficient. MSSQL comes with some query plan analysis so you can look at hit and miss ratios. From this info you can make some decisions, such as adding additional clustered or non-clustered indexes, or using stored procedures, which are saved (preprocessed) for faster execution. I forgot which of the several books I used to do this. But I was pleasantly surprised at the performance increase with my occasional tweaking.
Performance was not a priority for us that is why it was done occasionally. Not my call. My main job was make sure replications happened effectively, maintain good restores and security, and create stored procedures and triggers.
I hated the ORM in Rails. I do not want magic when I know how to do it more efficiently. For example, there are certain times when it is better to use DISTINCT versus GROUP BY for unique values.
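For readers unfamiliar with the DISTINCT/GROUP BY point: both can produce the same unique-value result, but they can get different execution plans depending on the engine, the indexes, and whether aggregates are involved, which is exactly the kind of choice an ORM's magic hides. A tiny sketch with Python's sqlite3 (invented schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT)")
conn.executemany("INSERT INTO hits (page) VALUES (?)",
                 [("/home",), ("/about",), ("/home",), ("/home",)])

# Both return the unique pages; which is cheaper depends on the engine
# and on whether an index covers the column.
distinct = sorted(r[0] for r in conn.execute("SELECT DISTINCT page FROM hits"))
grouped = sorted(r[0] for r in conn.execute("SELECT page FROM hits GROUP BY page"))
print(distinct == grouped)  # True: same answer, potentially different plans
```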
I miss MSSQL. I use MySQL and it is ok but not like my sweetie and, thank you, not like the ugly gorilla Oracle.
The article shows the SO team doing all of these things. It's about the opportunity costs of which semi-slow queries you tackle in which order. As new features are added, performance shifts around in different areas of the site.
This headline is incredibly misleading. The underlying/original blog post is all about tuning the queries behind a single page when the developer noticed some latency while looking at logs.
They looked at SQL and added an index to their DB?
When I read this I actually cringed, because they changed their SQL query in production without looking at the query plan first. Apparently it's the blind leading the blind over at stack overflow, not that it seems to be hurting them.
"In production this query was taking too long, so the next step was to look at the query plan, which showed a table scan was being used instead of an index. A new index was created which cut the page load time by a factor of 10."
I read it as: 1) noticed that's slow in production, 2) look at query plan (prod or dev, not clear) 3) add index (not clear if they did that in dev first and then prod)
my dev box runs a clone of production, I do all my tuning on dev then deploy to staging, confirm it is still good there and finally ship it to production