Yesterday I was looking at execution plans on a troublesome inner join condition and once again was moved to wonder about the “estimated subtree cost” of “243.872”. I Googled around for this and found only circular definitions of the form, “the estimated subtree cost is the estimated cost of executing the specified portion of the query”.
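For anyone else chasing that number: the cost shows up per operator if you ask SQL Server for the estimated plan as a rowset instead of running the query. A minimal sketch, with placeholder table and column names:

-- Return the estimated plan as rows instead of executing the query.
-- Each row's TotalSubtreeCost column carries the mystery number; the
-- topmost row holds the estimated subtree cost for the whole statement.
SET SHOWPLAN_ALL ON;
GO

SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
INNER JOIN dbo.Customers AS c
    ON c.CustomerID = o.CustomerID;
GO

SET SHOWPLAN_ALL OFF;
GO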
Well, duh. This is like being in a foreign country and buying a trinket from a street vendor and being told that it will cost you twelve. Twelve what?
My colleague Joe Murray finally dug up the answer here. Sort of:
The unit of measure for execution plan costs is buried in the documentation under the topic of the query governor cost limit: “Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration.” Since the basic architecture of this cost model dates back to SQL Server 7.0, with significant changes in version 2005, that reference hardware configuration might well be a Pentium Pro. It may also be implied that the execution plan IO cost covers only time spent waiting for IO, not CPU time consumed; in practice, the IO request to disk itself has non-negligible CPU cost. In other words, on modern hardware the number is best read as a unitless relative cost, not a prediction in wall-clock seconds.
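That same unit is what the query governor trades in. A sketch of both knobs, using an arbitrary threshold of 300 of those nominal “seconds”:

-- Server-wide: refuse to start any query whose estimated cost exceeds 300.
-- 'query governor cost limit' is an advanced option, hence the first step.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'query governor cost limit', 300;
RECONFIGURE;

-- Per-connection override: 0 means no limit for this session.
SET QUERY_GOVERNOR_COST_LIMIT 0;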
At any rate, the cost of 243.872 for my query bore no resemblance to reality, given that I gave up on it ever completing when it passed the 8-hour mark.
While I’m letting off steam, what’s up with an INNER JOIN whose condition exactly matches indexes on both tables involved, which used to take 3 minutes when there were over 3 million rows to deal with, and now never finishes when there are fewer than 2 million? The optimizer is selecting a hash join as if the indexes don’t exist, and if I force a loop join (see the sketch below), the estimated execution cost nearly doubles.
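For the record, forcing the loop join looks like this; dbo.BigTableA, dbo.BigTableB, and their columns are stand-ins for my real tables:

-- Join hint: force nested loops in place of the optimizer's hash join.
-- (A join hint also fixes the join order as written.)
SELECT a.KeyCol, a.Payload, b.Payload
FROM dbo.BigTableA AS a
INNER LOOP JOIN dbo.BigTableB AS b
    ON b.KeyCol = a.KeyCol;

-- The equivalent query-level hint, which leaves join order to the optimizer:
SELECT a.KeyCol, a.Payload, b.Payload
FROM dbo.BigTableA AS a
INNER JOIN dbo.BigTableB AS b
    ON b.KeyCol = a.KeyCol
OPTION (LOOP JOIN);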
This is the kind of thing that used to keep me awake nights in the early 1990s, when I first learned non-procedural, set-oriented SQL after years of relying on the procedural DML of the XBase world, where I had clear visibility into every phase of an operation rather than leaving it all to a generic black box in the bowels of some proprietary engine. Fortunately this kind of thing rarely happens in practice, but when it does, it’s always at a bloody inconvenient time, and the reason always turns out to be frighteningly obscure and, usually, hard to ferret out.
Update: I had forgotten that several tables reached via outer joins in this query had been rebuilt from scratch, and my script had left the indexes out; a quick catalog query like the one below would have caught it. But why did the query plan assign negligible cost to those joins, and the lion’s share to the hash join underlying the inner join on the two primary tables? Telling me where the actual cost was would have led me straight to the problem, not away from it. And I still don’t get why the optimizer chose the hash join in any event. Moral of the story: TMMI (Too Much Misleading Information) early in an investigation is a Bad Thing.
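For next time, the sanity check I should have run against the catalog views (SQL Server 2005 and later; the table names are placeholders):

-- List every index on the rebuilt tables. A heap appears as index_id = 0
-- with a NULL name, so a table whose only row here is the heap has no indexes.
SELECT t.name AS table_name,
       i.name AS index_name,
       i.type_desc
FROM sys.tables AS t
JOIN sys.indexes AS i
    ON i.object_id = t.object_id
WHERE t.name IN (N'RebuiltTableA', N'RebuiltTableB')
ORDER BY t.name, i.index_id;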