'Completing values from a sparse representation of data
I'm working on a financial system application, in Microsoft SQL Server, that
requires calculating a rate
for a given tenor
(maturity in days) from yield
curves stored in a database. The problem is that the yield curves are stored
sparsely around dates
and tenors
; i.e. not all dates, and not all tenors,
are available in the database.
The yield curves are stored as this:
CREATE TABLE Curves (
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL,
rate DOUBLE NOT NULL
);
I want to be able to get a rate for a given date
, type
, currency
, and
tenor
. With the following assumptions:
- For a
givenDate
, choose the most recent curve for the giventype
andcurrency
. If thegivenDate
is not found use the previously available date to get the rates. - For a
givenTenor
, do a linear interpolation from the availabletenors
. If thegivenTenor
is smaller than the smallest tenor, use the rate associated with the smallest tenor. If thegivenTenor
is larger than the largest tenor, use the date associated with the largest tenor. For aything in between use a linear interpolation between the two closest tenors.
So assuming that I have another table with financial instruments:
CREATE TABLE Instruments (
id INT NOT NULL PRIMARY KEY,
date DATE NOT NULL,
type VARCHAR(10) NOT NULL,
currency CHAR(3) NOT NULL,
tenor INT NOT NULL,
);
I would like to have a stored function that can produce the rates, following the assumptions above, in a query like this:
SELECT *, SOME_FUNCTION(date, type, currency, tenor) AS rate FROM Instruments;
I believe that a function can be created to do this, but I'm also sure that a naive implementation would not perform at all. Particularly with millions of records. The algorithm for the function would be something like this:
- With the
givenDate
, get theavailableDate
as the maximum date that is less than or equal to thegivenDate
, for thegivenType
andgivenCurrency
. - For the
availableDate
get all thetenors
andrates
. With this vector, go over the tenors with the asumptions as above, to calculate a rate.
Any ideas on how to do this in Microsoft SQL Server in a performant way?
-- Edit
Following up on @dalek @trenton-ftw comments, here is the 'naive' implementation of the algorithm. I've created random data to test:
- 5,000 Curve points, spanding the last 5 years, with two types and two currencies,
- 1,000,000 Instruments with random dates, types, currencies and tenors.
To test the function I just executed the following query:
SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor"))
FROM Instruments
For the random sample above this query executes in 2:30 minutes in my laptop. Nevertheless, the real application would need to work on something close to 100,000,000 instruments in a heavily loaded sql server. That would potentially put it in the 20 minute ballpark.
How can I improve the performance of this function?
Here is the naive implementation:
CREATE OR ALTER FUNCTION TenorRate(
@date DATE,
@type VARCHAR(8),
@currency VARCHAR(3),
@tenor INT
) RETURNS REAL
BEGIN
DECLARE @rate REAL
-- The available date is the first date that is less than or equal to
-- the provided one, or the first one if none are less than or equal to it.
DECLARE @availableDate DATE
SET @availableDate = (
SELECT MAX("date") FROM Curves
WHERE "date" <= @date AND "type" = @type AND "currency" = @currency
)
IF (@availableDate IS NULL)
BEGIN
SET @availableDate = (
SELECT MIN("date") FROM Curves
WHERE "type" = @type AND "currency" = @currency
)
END
-- Get the tenors and rates for the available date, type and currency.
-- Ordering by tenor to ensure that the fetch is made in order.
DECLARE @PreviousTenor INTEGER, @PreviousRate REAL
DECLARE @CurrentTenor INTEGER, @CurrentRate REAL
DECLARE Curve CURSOR FAST_FORWARD READ_ONLY FOR
SELECT "tenor", "rate"
FROM Curves
WHERE "date" = @availableDate AND "type" = @type AND "currency" = @currency
ORDER BY "tenor"
-- Open a cursor to iterate over the tenors and rates in order.
OPEN Curve
FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
IF (@Tenor < @CurrentTenor)
BEGIN
-- If the tenor is less than the first one,
-- then use the first tenor rate.
SET @rate = @CurrentRate
END
ELSE
BEGIN
WHILE @@FETCH_STATUS = 0
BEGIN
IF (@Tenor = @CurrentTenor)
BEGIN
-- If it mathces exactly return the found rate.
SET @rate = @CurrentRate
BREAK
END
IF (@Tenor < @CurrentTenor)
BEGIN
-- If the current tenor is lesss than the current
-- (but not equal) then interpolate with the
-- previous one and return the calculated rate.
SET @rate = @PreviousRate +
(@tenor - @PreviousTenor) *
(@CurrentRate - @PreviousRate) /
(@CurrentTenor - @PreviousTenor)
BREAK
END
-- Keep track ot the previous tenor and rate.
SET @PreviousTenor = @CurrentTenor
SET @PreviousRate = @CurrentRate
-- Fetch the next tenor and rate.
FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
END
IF (@Tenor > @CurrentTenor)
BEGIN
-- If we exahusted the tenors and still nothing found,
-- then use the last tenor rate.
SET @rate = @CurrentRate
END
END
CLOSE Curve
DEALLOCATE Curve
RETURN @rate
END;
Solution 1:[1]
Assuming a table dbo.Curves
exists with the following structure:
CREATE TABLE dbo.Curves(
[date] [date] NOT NULL
,[type] [varchar](10) NOT NULL
,currency [char](3) NOT NULL
,tenor [int] NOT NULL
,rate [real] NOT NULL
,UNIQUE NONCLUSTERED ([date] ASC,[type] ASC,currency ASC,tenor ASC)
,INDEX YourClusteredIndexNameHere CLUSTERED ([type] ASC,currency ASC,[date] ASC)
);
Notable differences from OP's DDL script:
- column
rate
has the data type corrected fromdouble
toreal
. Assuming this is the correct data type because the provided function storesrate
values in variables with areal
data type. The data typedouble
does not exist in SQL Server. - I have added a clustered index on (
type
,currency
,date
). The table did not have a clustered index in OP's DDL script, but should certainly have one. Considering the only knowledge I have of how this table is used is this function, I will design one that works best for this function. - OP indicated this table has a unique constraint on (
date
,type
,currency
,tenor
), so I have included.
Just to note, if the unique constraint that already exists does not require the provided order, I would recommend removing that unique constraint and the clustered index I recommended and simply creating a single clustered primary key constraint on (type
,currency
,date
,tenor
).
Differences between OP's function and set based approach
- The actual flow is very similar to OP's provided function. It is well commented, but the function does not read top to bottom. Start from the
FROM
of the inner most subquery and read 'outwards'. - This function is inline-able. You can confirm this once this function is installed to a given database using Microsoft's documentation on this topic.
- I corrected the input type for the
@i_Type
variable tovarchar(10)
to match thetype
column of thedbo.Curves
table. - I created the two tables and populated them with what I believe was reasonable randomized data. I used the same scale of 200:1 from OP's test scenario of 1m records in
dbo.Instruments
and 5k records indbo.Curves
. I tested using 80k records indbo.Instruments
and 400 records indbo.Curves
. Using the exact same test script from OP ofSELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor")) FROM Instruments;
- OP's
CURSOR
based approach had a CPU/Elapsed ms of 57774/58654. - Set based approach I wrote had a CPU/Elapsed ms of 7437/7440. So roughly 12.6% elapsed time of the original.
- OP's
- The set based approach is guaranteed to result in more reads overall because it is repeatedly reading from
dbo.Curves
as opposed to theCURSOR
option which pulls the rows fromdbo.Curves
one time and then reads from theCURSOR
. It is worth it to note that comparing execution plans on both function approaches in the test case above will yield misleading results. Because theCURSOR
based option cannot be inlined, the logical operators and query statistics are not going to be shown in the execution plan, but the query plan for the set based approach can be inlined and so the logcial operators and query statistics will be included in the execution plan. Just thought I would point that out. - I did validate the results of this compared to the original function. I noticed that there some potential areas for discrepany.
- Both this function and the OP's
CURSOR
based option are doing math usingreal
data types. I saw a few examples where it looked like the math, although using the same numbers between theCURSOR
approach and the set based approach resulted in different numbers. When I validated that the output number, when rounded to the 5th decimal, matched exactly. You might need to track that down, but considering that you said you are building a financial application and a financial application is the classic example wherereal
andfloat
data types should be avoided (because they are approximate data types) I strongly suggest you change the way these values are both stored and used in this function to be numeric. It was not worth my time to track this issue down because IMO it only exists because of very bad practice that should be resolved anyhow. - Based on my understanding of OP's approach, in scenarios where the input tenor is between the minimum and maximum tenor we would want to use the two rows from
dbo.Curves
where the tenor value is as close to the input tenor as possible, and then use those two rows (and therate
from those two rows) to perform a linear interpolation for the input tenor. In this same scenario, the OP's function finds the closest tenor available and then instead of using the row with the next closest available tenor, it uses the tenor with the previous row (sorted bytenor ASC
). Because of how theCURSOR
implementation was written, this is not guaranteed to be the next closest row to the input tenor so I fixed that in my approach. Feel free to revert if I misunderstood.
- Both this function and the OP's
Notes about my specific implementation
- This approach is not heavily optimized towards certain conditions that might exist more commonly. There are still potential optimizations that you might be able to make in your system based on the conditions that exist.
- I am using an aggregate to pivot rows to columns. I did test out
LEAD
/LAG
to do this, but found that pivoting using the aggregate was consistently faster by roughly 7% decrease in elapsed time. - There are simple warehousing strategies that could be used to GREATLY increase the performance of this approach. For example, storing the pre-calculated min/max for a given combination of a (
type
,currency
,date
) could have a tremendous impact.
Here is the function:
CREATE OR ALTER FUNCTION dbo.TenorRate(
@i_Date DATE
,@i_Type VARCHAR(10)
,@i_Currency CHAR(3)
,@i_Tenor INT
)
RETURNS REAL
BEGIN
RETURN (
SELECT
/*now pick the return value based on the scenario that was returned, as a given scenario can have a specific formula*/
CASE
WHEN pivoted_vals.Scenario = 1 THEN pivoted_vals.CurrentRate /*returned row is first row with matching tenor*/
WHEN pivoted_vals.Scenario = 2 THEN pivoted_vals.CurrentRate /*returned row is row with smallest tenor*/
WHEN pivoted_vals.Scenario = 3 THEN pivoted_vals.CurrentRate /*returned row is row with largest tenor*/
WHEN pivoted_vals.Scenario = 4 /*returned row contains pivoted values for the two rows that are two closest tenor rows*/
THEN pivoted_vals.PreviousRate
+
(@i_Tenor - pivoted_vals.PreviousTenor)
*
(pivoted_vals.CurrentRate - pivoted_vals.PreviousRate)
/
(pivoted_vals.CurrentTenor - pivoted_vals.PreviousTenor)
END AS ReturnValue
FROM
(SELECT
/*pivot rows using aggregates. aggregates are only considering one row and value at a time, so they are not truly aggregating anything. simply pivoting values.*/
MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.Scenario ELSE NULL END) AS Scenario
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentRate ELSE NULL END) AS CurrentRate
,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentTenor ELSE NULL END) AS CurrentTenor
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentRate ELSE NULL END) AS PreviousRate
,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentTenor ELSE NULL END) AS PreviousTenor
FROM
/*return a row number for our top two returned rows. done in an outer subquery so that the row_number function is not used until the final two rows have been selected*/
(SELECT TOP (2)
picked_curve.Scenario
,picked_curve.ScenarioRankValue
,picked_curve.CurrentRate
,picked_curve.CurrentTenor
/*generate row number to match order by clause*/
,ROW_NUMBER() OVER(ORDER BY picked_curve.Scenario ASC,picked_curve.ScenarioRankValue ASC) AS RowNumber
FROM
/*we need top two rows because scenario 4 requires the previous row's value*/
(SELECT TOP (2)
scenarios.Scenario
,scenarios.ScenarioRankValue
,c.rate AS CurrentRate
,c.tenor AS CurrentTenor
FROM
/*first subquery to select the date that we will use from two separate conditions*/
(SELECT TOP (1)
date_options.[date]
FROM
(/*most recent available date before or equal to input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] <= @i_Date
ORDER BY
c.[date] DESC
UNION ALL
/*first available date after input date*/
SELECT TOP (1)
c.[date]
FROM
dbo.Curves AS c
WHERE
/*match on type and currency*/
c.[type] = @i_Type
AND
c.currency = @i_Currency
AND
c.[date] > @i_Date
ORDER BY
c.[date] ASC) AS date_options
ORDER BY
/*we want to prioritize date from first query, which we know will be 'older' than date from second query. So ascending order*/
date_options.[date] ASC) AS selected_date
/*go get curve values for input type, input currency, and selected date*/
INNER JOIN dbo.Curves AS c ON
@i_Type = c.[type]
AND
@i_Currency = c.currency
AND
selected_date.[date] = c.[date]
/*go get max and min curve values for input type, input currency, and selected date*/
OUTER APPLY (SELECT TOP (1) /*TOP (1) is redundant since this is an aggregate with no grouping, but keeping for clarity*/
MAX(c_inner.tenor) AS MaxTenor
,MIN(c_inner.tenor) AS MinTenor
FROM
dbo.Curves AS c_inner
WHERE
c_inner.[type] = @i_Type
AND
c_inner.currency = @i_Currency
AND
c_inner.[date] = selected_date.[date]) AS max_min_tenor
/*for readibility in select, outer apply logic to give us a value will prioritize certain scenarios over others (and indicate to us the scenarios that was returned) and
return the minimum number of rows that are ranked in a manner in which the top returned rows contain the information needed for the given scenario*/
OUTER APPLY (SELECT
CASE /*rank the scenarios*/
WHEN @i_Tenor = c.tenor THEN 1
WHEN @i_Tenor < max_min_tenor.MinTenor THEN 2
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN 3
ELSE 4 /*input tenor is between the max/min tenor*/
END AS Scenario
,CASE /*rank value that ensures the top row will be the row we need for the returned scenario*/
WHEN @i_Tenor = c.tenor THEN c.tenor
WHEN @i_Tenor < max_min_tenor.MinTenor THEN c.tenor
WHEN @i_Tenor > max_min_tenor.MaxTenor THEN c.tenor*-1
ELSE ABS(c.tenor-@i_Tenor)
END AS ScenarioRankValue) AS scenarios
ORDER BY
/*highest priority scenario (by lowest value) and the associated value to be ranked, which is designed to be used in ascending order*/
scenarios.Scenario ASC
,scenarios.ScenarioRankValue ASC) AS picked_curve
ORDER BY
picked_curve.Scenario ASC
,picked_curve.ScenarioRankValue ASC) AS top_2_rows) AS pivoted_vals
);
END;
GO
Just one last note on usage, there are certain ways that you might use this query that could prevent it from being inline. I highly recommend you read the entirety of Microsoft's doc on Scalar UDF Inlining, but at a minimum you should at least read over the Inlineable scalar UDF requirements contained in that same doc.
Hopefully this helps you out or at least points you in the right direction.
Solution 2:[2]
In fact your function look like an aggregate function. If it is the case, you should write a SQL CLR one. We do that for financial score and the performance was fine...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | trenton-ftw |
Solution 2 | SQLpro |