SQL Server: Avoiding a Sort with Merge Join Concatenation
There are cases when SQL Server doesn’t normally realize it can rely on index order, but you can make the platform realize it with some extra manipulation and creativity. This article is the first in a two-part series in which I describe such tasks and solutions. This month I’ll talk about avoiding a sort with the Merge Join (Concatenation) operator. Next month I’ll describe avoiding a sort when needing the data in descending order.
December 18, 2017
One of the advantages of B-tree-based indexes is that they organize the data ordered by some index key. SQL Server can then rely on index order to present or process data ordered (for operations such as ORDER BY, GROUP BY and JOIN), without the penalty of an actual sort operation. Pulling the data preordered from an index has linear scaling since the cost of an ordered scan is simply proportional to the number of rows involved. Puling the data unordered and then applying an actual sort operation scales in an extra-linear fashion--more specifically, in an n log n fashion. So, if you often need to rely on ordered data, it’s beneficial to have supporting indexes. However, there are cases when SQL Server doesn’t normally realize it can rely on index order, but you can make the platform realize it with some extra manipulation and creativity. This article is the first in a two-part series in which I describe such tasks and solutions. This month I’ll talk about avoiding a sort with the Merge Join (Concatenation) operator. Next month I’ll describe avoiding a sort when needing the data in descending order.
Order Data with NULLs Last
Our first example for specialized ordering needs is a classic. I’ll use the Sales.Orders table in the TSQLV4 database for this example. The task is to return orders (orderid and shippeddate columns) ordered by shippeddate ascending, orderid ascending, but with NULLs last. Normally, SQL Server orders NULLs first. A common way for people to handle the task is by using a CASE expression as the first ordering expression returning a flag with higher ordering value for NULLs, and then shippeddate and orderid, like so:
USE TSQLV4;
SELECT orderid, shippeddate
FROM Sales.Orders
ORDER BY
CASE WHEN shippeddate IS NOT NULL THEN 0 ELSE 1 END,
shippeddate, orderid;
This solution returns the desired output with NULLs ordered last, as the query output shows:
orderid shippeddate
----------- -----------
10249 2014-07-10
10252 2014-07-11
10250 2014-07-12
10251 2014-07-15
10255 2014-07-15
...
11073 NULL
11074 NULL
11075 NULL
11076 NULL
11077 NULL
There is an index called idx_nc_shippeddate defined on shippeddate and orderid (implied due to the presence of a clustered index on orderid). The problem is that since the first ordering expression manipulates columns, similar to breaking sargability of filters, it prevents SQL Server from being able to rely on index order in this example. This results in explicit sorting in the plan for this query as shown Figure 1, and therefore n log n scaling.
Figure 1 – Order data with NULLs last, with Sort
There is a trick that you can apply as a workaround. Write two queries—one returning rows where shippeddate is not NULL, with a flag sort column (call it sortcol) based on the constant 0, and another returning rows where shippeddate is NULL, with the flag sort column based on the constant 1. Apply a UNION ALL operator between them, and place the whole thing in a CTE. Have the outer query against the CTE return the desired columns only, ordering the rows by sortcol, shippeddate and orderid. Here’s the complete solution’s code:
WITH C AS
(
SELECT orderid, shippeddate, 0 AS sortcol
FROM Sales.Orders
WHERE shippeddate IS NOT NULL
UNION ALL
SELECT orderid, shippeddate, 1 AS sortcol
FROM Sales.Orders
WHERE shippeddate IS NULL
)
SELECT orderid, shippeddate
FROM C
ORDER BY sortcol, shippeddate, orderid;
The plan for this query is shown in Figure 2.
Figure 2 –
Notice in the plan that there are two index seeks used to obtain the data—one for the rows where shippeddate is not NULL (emitted ordered by shippeddate and orderid) and another where shippeddate is NULL (also emitted ordered by shippeddate and orderid). The two Compute Scalar operators produce the flag based on the constant—0 for rows where shippeddate is not NULL and 1 for rows where shippeddate is NULL. The Merge Join (Concatenation) operator than merges the two ordered inputs based on the order of: flag column, shippeddate and orderid. This operator is similar to the Merge Join operator, only instead of joining, it unifies two sorted inputs. Figure 3 illustrates how this operator works with a simplified example of two sorted integer inputs.
Figure 3 - Illustration of Merge Join (Concatenation)
The algorithm is pretty simple:
· Fetch the first row in each of the sorted inputs.
· While not reached the end of any of the inputs:
o If the left sort key is less than or equal to the right sort key, emit the left row and read the next row from the left input. Otherwise, emit the right row and read the next row from the right input.
· If any rows are left in any of the inputs, emit those.
The beauty of this algorithm is that it can be applied to preordered inputs based on index order scans (or seeks and range scans), like in the plan shown in Figure 2, meaning that in such a case the plan scales linearly.
Window function with NULLs last
The Merge Join (Concatenation) operator can be used to produce ordered unified rows for any purpose—not just for presentation ordering purposes. For example, window functions, such as ROW_NUMBER, need the data sorted by the window partitioning and window ordering columns, or just by the window ordering columns if there’s no window partition clause. For example, suppose you wanted to compute rows numbers for orders based on shippeddate ascending, orderid ascending ordering, but again, with NULLs last. Just like with the previous example, here in the window order clause you can start with a CASE expression that produces a flag with a higher ordering value for NULLs than for non-NULL values, and then continue with shippeddate and orderid, like so:
SELECT orderid, shippeddate,
ROW_NUMBER() OVER(ORDER BY
CASE WHEN shippeddate IS NOT NULL THEN 0 ELSE 1 END,
shippeddate, orderid) AS rownum
FROM Sales.Orders;
You get the following output showing row numbers starting with 1 for the non-NULL shipped dates, and with the last range of row numbers assigned to rows with a NULL shipped date:
orderid shippeddate rownum
----------- ----------- --------------------
10249 2014-07-10 1
10252 2014-07-11 2
10250 2014-07-12 3
10251 2014-07-15 4
10255 2014-07-15 5
...
11073 NULL 826
11074 NULL 827
11075 NULL 828
11076 NULL 829
11077 NULL 830
The plan for this query is shown in Figure 4.
Figure 4 – Window function with NULLs last, with Sort
As in Figure 1, due to the fact that the window order clause starts with an expression that manipulates columns, the optimizer doesn’t rely on index order and applies explicit sorting. Just like in the previous example, you can avoid explicit sorting by using unioned queries with a sort flag (sortcol), only here it’s the window order clause—not the presentation order clause—that is based on sortcol, shippeddate and orderid. Here’s the complete solution query:
WITH C AS
(
SELECT orderid, shippeddate, 0 AS sortcol
FROM Sales.Orders
WHERE shippeddate IS NOT NULL
UNION ALL
SELECT orderid, shippeddate, 1 AS sortcol
FROM Sales.Orders
WHERE shippeddate IS NULL
)
SELECT orderid, shippeddate,
ROW_NUMBER() OVER(ORDER BY sortcol, shippeddate, orderid) AS rownum
FROM C;
The plan for this query is shown in Figure 5.
Figure 5 – Window function with NULLs last, with Merge Join (Concatenation)
Observe the use of the Merge Join (Concatenation) applied to the two preordered inputs, emitting the unified rows sorted like the window function needs; hence, no explicit sorting is used.
In other articles I cover solutions to tasks involving intervals where I rely on this trick to avoid explicit sorting. You can find examples in a series I wrote about Calculating Concurrent Sessions (Part 1, Part 2, Part 3), and in an article on Packing Intervals. In those solutions I unify interval start times with interval end times to one chronologically ordered sequence of events, and, using the Merge Join (Concatenation) operator, avoid the need for explicit sorting.
Unpivoting
Another example where the Merge Join (Concatenation) operator can be used to avoid explicit sorting is in ordering of unpivoted data. To demonstrate the technique, I’ll use a table called CustSales, which you create and populate with 1,000,000 rows by running the following code:
USE tempdb;
DROP TABLE IF EXISTS dbo.CustSales;
GO
CREATE TABLE dbo.CustSales
(
custid INT NOT NULL
CONSTRAINT PK_CustSales PRIMARY KEY,
val2016 NUMERIC(12, 2) NULL,
val2017 NUMERIC(12, 2) NULL,
val2018 NUMERIC(12, 2) NULL,
);
INSERT INTO dbo.CustSales WITH (TABLOCK) (custid, val2016, val2017, val2018)
SELECT custid,
NULLIF(val2016, 0) AS val2016,
NULLIF(val2017, 0) AS val2017,
NULLIF(val2018, 0) AS val2018
FROM ( SELECT N.n AS custid,
ABS(CHECKSUM(NEWID())) % 1000 AS val2016,
ABS(CHECKSUM(NEWID())) % 1000 AS val2017,
ABS(CHECKSUM(NEWID())) % 1000 AS val2018
FROM TSQLV4.dbo.GetNums(1, 1000000) AS N ) AS D;
CREATE INDEX idx_val2016 ON dbo.CustSales(val2016, custid);
CREATE INDEX idx_val2017 ON dbo.CustSales(val2017, custid);
CREATE INDEX idx_val2018 ON dbo.CustSales(val2018, custid);
The CustSales table has a row per customer (custid) and a column per order year (valyyyy), with the total order values for the current customer and year. Observe the code creates an index on (valyyyy, custid) for each of the three order years. The task is to unpivot the data so that you get a row per customer and year, with one result column holding the year (call it orderyear) and another holding the value (call it val). In addition, you want the result to be sorted by the val column, ascending.
There are two main techniques people use to handle such unpivoting tasks—one using the UNPIVOT operator and another using the APPLY operator. Here’s the classic solution using the UNPIVOT operator:
SELECT custid, CAST(RIGHT(valyear, 4) AS INT) AS orderyear, val
FROM dbo.CustSales
UNPIVOT(val FOR valyear IN (val2016, val2017, val2018)) AS U
ORDER BY val;
The plan for this query is shown in Figure 6.
Figure 6 – Plan for UNPIVOT query
Observe the explicit Sort operator in the plan. With 1,000,000 rows in the source table, it took this query 4 seconds to complete on my machine (with results discarded to focus on processing time). Due to the need for explicit sorting, this plan scales in an n log n fashion.
Here’s the solution using the APPLY operator:
SELECT custid, orderyear, val
FROM dbo.CustSales
CROSS APPLY ( VALUES(2016, val2016),
(2017, val2017),
(2018, val2018) ) AS A(orderyear, val)
WHERE val IS NOT NULL
ORDER BY val;
The plan for this query is shown in Figure 7.
Figure 7 – Plan for APPLY query
As you can see, the plan is very similar to the one used for the solution with the UNPIVOT operator, and also here the plan applies explicit sorting. This query too took 4 seconds to complete on my machine, and has n log n scaling.
If you’re unhappy with the plans that we got for both the UNPIVOT- and the APPLY-based solutions, you’re in the same boat as me. Your intuition should tell you that there may be a way to leverage the indexes that we have on the (valyyyy, custid) combinations. Indeed, such a solution exists. You write a separate query for each of the years, returning custid, the applicable yyyy constant as orderyear, and the respective valyyyy column as val, for rows where valyyyy is not NULL. You apply UNION ALL operators between the queries, and order the result by the column val, like so:
SELECT custid, 2016 AS orderyear, val2016 AS val
FROM dbo.CustSales
WHERE val2016 IS NOT NULL
UNION ALL
SELECT custid, 2017 AS orderyear, val2017 AS val
FROM dbo.CustSales
WHERE val2017 IS NOT NULL
UNION ALL
SELECT custid, 2018 AS orderyear, val2018 AS val
FROM dbo.CustSales
WHERE val2018 IS NOT NULL
ORDER BY val;
Figure 8 shows the execution plan for this solution.
Figure 8 – Plan for UNION ALL query
This technique leverages the order-preserving Merge Join (Concatenation operator), which relies on index order, twice—once between the queries that handle the years 2016 and 2017, and another time between the result and the query that handles the year 2018. This query completed in 1 second on my machine and has linear scaling.
Finally, just for fun, suppose that you want to unpivot the data, sorting the rows by val ascending, only this time keeping rows with NULLs in the val column, and having NULLs sorted last! This can be achieved by mixing the two techniques I described in the article—the one that orders NULLs last without sorting, and the one that unpivots data without sorting. This means that you need to unify six queries—one for each of the three years times each of the two NULL | non-NULL cases producing the sort flag (sortcol), like so:
WITH C AS
(
SELECT custid, 2016 AS orderyear, val2016 AS val, 0 AS sortcol
FROM dbo.CustSales
WHERE val2016 IS NOT NULL
UNION ALL
SELECT custid, 2016 AS orderyear, val2016 AS val, 1 AS sortcol
FROM dbo.CustSales
WHERE val2016 IS NULL
UNION ALL
SELECT custid, 2017 AS orderyear, val2017 AS val, 0 AS sortcol
FROM dbo.CustSales
WHERE val2017 IS NOT NULL
UNION ALL
SELECT custid, 2017 AS orderyear, val2017 AS val, 1 AS sortcol
FROM dbo.CustSales
WHERE val2017 IS NULL
UNION ALL
SELECT custid, 2018 AS orderyear, val2018 AS val, 0 AS sortcol
FROM dbo.CustSales
WHERE val2018 IS NOT NULL
UNION ALL
SELECT custid, 2018 AS orderyear, val2018 AS val, 1 AS sortcol
FROM dbo.CustSales
WHERE val2018 IS NULL
)
SELECT custid, orderyear, val
FROM C
ORDER BY sortcol, val;
The plan for this solution (using SentryOne’s Plan Explorer) is shown in Figure 9.
Figure 9 – Plan for UNION ALL query, keeping NULLs
Amazingly, five Merge Join (Concatenation) operators are used in this plan, without the need for explicit sorting, hence this plan scales linearly.
Summary
This article demonstrates the use of query rewrites to enable the use of optimal optimization capabilities. The focus of this article was avoiding explicit sorting by using techniques that can leverage the Merge Join (Concatenation) order-reliant and order-preserving operator. Next month I’ll continue the coverage of explicit-sort avoidance by discussing queries that need to handle data in descending order.
About the Author(s)
You May Also Like