'Completing values from a sparse representation of data

I'm working on a financial system application, in Microsoft SQL Server, that requires calculating a rate for a given tenor (maturity in days) from yield curves stored in a database. The problem is that the yield curves are stored sparsely around dates and tenors; i.e. not all dates, and not all tenors, are available in the database.

The yield curves are stored as this:

CREATE TABLE Curves (
    date DATE NOT NULL,
    type VARCHAR(10) NOT NULL,
    currency CHAR(3) NOT NULL,    
    tenor INT NOT NULL,
    rate DOUBLE NOT NULL
);

I want to be able to get a rate for a given date, type, currency, and tenor. With the following assumptions:

  • For a givenDate, choose the most recent curve for the given type and currency. If the givenDate is not found use the previously available date to get the rates.
  • For a givenTenor, do a linear interpolation from the available tenors. If the givenTenor is smaller than the smallest tenor, use the rate associated with the smallest tenor. If the givenTenor is larger than the largest tenor, use the date associated with the largest tenor. For aything in between use a linear interpolation between the two closest tenors.

So assuming that I have another table with financial instruments:

CREATE TABLE Instruments (
    id INT NOT NULL PRIMARY KEY,
    date DATE NOT NULL,
    type VARCHAR(10) NOT NULL,
    currency CHAR(3) NOT NULL,
    tenor INT NOT NULL,
);

I would like to have a stored function that can produce the rates, following the assumptions above, in a query like this:

SELECT *, SOME_FUNCTION(date, type, currency, tenor) AS rate FROM Instruments;

I believe that a function can be created to do this, but I'm also sure that a naive implementation would not perform at all. Particularly with millions of records. The algorithm for the function would be something like this:

  • With the givenDate, get the availableDate as the maximum date that is less than or equal to the givenDate, for the givenType and givenCurrency.
  • For the availableDate get all the tenors and rates. With this vector, go over the tenors with the asumptions as above, to calculate a rate.

Any ideas on how to do this in Microsoft SQL Server in a performant way?

-- Edit

Following up on @dalek @trenton-ftw comments, here is the 'naive' implementation of the algorithm. I've created random data to test:

  • 5,000 Curve points, spanding the last 5 years, with two types and two currencies,
  • 1,000,000 Instruments with random dates, types, currencies and tenors.

To test the function I just executed the following query:

SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor")) 
FROM Instruments

For the random sample above this query executes in 2:30 minutes in my laptop. Nevertheless, the real application would need to work on something close to 100,000,000 instruments in a heavily loaded sql server. That would potentially put it in the 20 minute ballpark.

How can I improve the performance of this function?

Here is the naive implementation:

CREATE OR ALTER FUNCTION TenorRate(
    @date DATE, 
    @type VARCHAR(8), 
    @currency VARCHAR(3), 
    @tenor INT
) RETURNS REAL
BEGIN

    DECLARE @rate REAL

    -- The available date is the first date that is less than or equal to 
    -- the provided one, or the first one if none are less than or equal to it.

    DECLARE @availableDate DATE
    SET @availableDate = (
        SELECT MAX("date") FROM Curves 
        WHERE "date" <= @date AND "type" = @type AND "currency" = @currency
    )
    IF (@availableDate IS NULL)
        BEGIN
            SET @availableDate = (
                SELECT MIN("date") FROM Curves 
                WHERE "type" = @type AND "currency" = @currency
            )
        END

    -- Get the tenors and rates for the available date, type and currency.
    -- Ordering by tenor to ensure that the fetch is made in order.

    DECLARE @PreviousTenor INTEGER, @PreviousRate REAL
    DECLARE @CurrentTenor INTEGER, @CurrentRate REAL

    DECLARE Curve CURSOR FAST_FORWARD READ_ONLY FOR
        SELECT "tenor", "rate" 
        FROM Curves 
        WHERE "date" = @availableDate AND "type" = @type AND "currency" = @currency 
        ORDER BY "tenor"

    -- Open a cursor to iterate over the tenors and rates in order.

    OPEN Curve
    FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate 

    IF (@Tenor < @CurrentTenor) 
    BEGIN
        -- If the tenor is less than the first one, 
        -- then use the first tenor rate.
        SET @rate = @CurrentRate
    END
    ELSE
    BEGIN        
        WHILE @@FETCH_STATUS = 0
            BEGIN
                IF (@Tenor = @CurrentTenor)
                BEGIN
                    -- If it mathces exactly return the found rate.
                    SET @rate = @CurrentRate
                    BREAK
                END
                IF (@Tenor < @CurrentTenor)
                BEGIN
                    -- If the current tenor is lesss than the current 
                    -- (but not equal) then interpolate with the 
                    -- previous one and return the calculated rate.
                    SET @rate = @PreviousRate + 
                        (@tenor - @PreviousTenor) * 
                        (@CurrentRate - @PreviousRate) / 
                        (@CurrentTenor - @PreviousTenor)
                    BREAK
                END
                -- Keep track ot the previous tenor and rate.
                SET @PreviousTenor = @CurrentTenor
                SET @PreviousRate = @CurrentRate
                -- Fetch the next tenor and rate.
                FETCH NEXT FROM Curve INTO @CurrentTenor, @CurrentRate
            END
        IF (@Tenor > @CurrentTenor) 
        BEGIN
            -- If we exahusted the tenors and still nothing found, 
            -- then use the last tenor rate.
            SET @rate = @CurrentRate
        END
    END

    CLOSE Curve
    DEALLOCATE Curve

    RETURN @rate

END;


Solution 1:[1]

Assuming a table dbo.Curves exists with the following structure:

CREATE TABLE dbo.Curves(
    [date] [date] NOT NULL
    ,[type] [varchar](10) NOT NULL
    ,currency [char](3) NOT NULL
    ,tenor [int] NOT NULL
    ,rate [real] NOT NULL
    ,UNIQUE NONCLUSTERED ([date] ASC,[type] ASC,currency ASC,tenor ASC)
    ,INDEX YourClusteredIndexNameHere CLUSTERED ([type] ASC,currency ASC,[date] ASC)
);

Notable differences from OP's DDL script:

  • column rate has the data type corrected from double to real. Assuming this is the correct data type because the provided function stores rate values in variables with a real data type. The data type double does not exist in SQL Server.
  • I have added a clustered index on (type,currency,date). The table did not have a clustered index in OP's DDL script, but should certainly have one. Considering the only knowledge I have of how this table is used is this function, I will design one that works best for this function.
  • OP indicated this table has a unique constraint on (date,type,currency,tenor), so I have included.

Just to note, if the unique constraint that already exists does not require the provided order, I would recommend removing that unique constraint and the clustered index I recommended and simply creating a single clustered primary key constraint on (type,currency,date,tenor).

Differences between OP's function and set based approach

  • The actual flow is very similar to OP's provided function. It is well commented, but the function does not read top to bottom. Start from the FROM of the inner most subquery and read 'outwards'.
  • This function is inline-able. You can confirm this once this function is installed to a given database using Microsoft's documentation on this topic.
  • I corrected the input type for the @i_Type variable to varchar(10) to match the type column of the dbo.Curves table.
  • I created the two tables and populated them with what I believe was reasonable randomized data. I used the same scale of 200:1 from OP's test scenario of 1m records in dbo.Instruments and 5k records in dbo.Curves. I tested using 80k records in dbo.Instruments and 400 records in dbo.Curves. Using the exact same test script from OP of SELECT AVG(dbo.TenorRate("date", "type", "currency", "tenor")) FROM Instruments;
    • OP's CURSOR based approach had a CPU/Elapsed ms of 57774/58654.
    • Set based approach I wrote had a CPU/Elapsed ms of 7437/7440. So roughly 12.6% elapsed time of the original.
  • The set based approach is guaranteed to result in more reads overall because it is repeatedly reading from dbo.Curves as opposed to the CURSOR option which pulls the rows from dbo.Curves one time and then reads from the CURSOR. It is worth it to note that comparing execution plans on both function approaches in the test case above will yield misleading results. Because the CURSOR based option cannot be inlined, the logical operators and query statistics are not going to be shown in the execution plan, but the query plan for the set based approach can be inlined and so the logcial operators and query statistics will be included in the execution plan. Just thought I would point that out.
  • I did validate the results of this compared to the original function. I noticed that there some potential areas for discrepany.
    • Both this function and the OP's CURSOR based option are doing math using real data types. I saw a few examples where it looked like the math, although using the same numbers between the CURSOR approach and the set based approach resulted in different numbers. When I validated that the output number, when rounded to the 5th decimal, matched exactly. You might need to track that down, but considering that you said you are building a financial application and a financial application is the classic example where real and float data types should be avoided (because they are approximate data types) I strongly suggest you change the way these values are both stored and used in this function to be numeric. It was not worth my time to track this issue down because IMO it only exists because of very bad practice that should be resolved anyhow.
    • Based on my understanding of OP's approach, in scenarios where the input tenor is between the minimum and maximum tenor we would want to use the two rows from dbo.Curves where the tenor value is as close to the input tenor as possible, and then use those two rows (and the rate from those two rows) to perform a linear interpolation for the input tenor. In this same scenario, the OP's function finds the closest tenor available and then instead of using the row with the next closest available tenor, it uses the tenor with the previous row (sorted by tenor ASC). Because of how the CURSOR implementation was written, this is not guaranteed to be the next closest row to the input tenor so I fixed that in my approach. Feel free to revert if I misunderstood.

Notes about my specific implementation

  • This approach is not heavily optimized towards certain conditions that might exist more commonly. There are still potential optimizations that you might be able to make in your system based on the conditions that exist.
  • I am using an aggregate to pivot rows to columns. I did test out LEAD/LAG to do this, but found that pivoting using the aggregate was consistently faster by roughly 7% decrease in elapsed time.
  • There are simple warehousing strategies that could be used to GREATLY increase the performance of this approach. For example, storing the pre-calculated min/max for a given combination of a (type,currency,date) could have a tremendous impact.

Here is the function:

CREATE OR ALTER FUNCTION dbo.TenorRate(
    @i_Date DATE
    ,@i_Type VARCHAR(10)
    ,@i_Currency CHAR(3)
    ,@i_Tenor INT
) 
RETURNS REAL
BEGIN
RETURN (
SELECT 
    /*now pick the return value based on the scenario that was returned, as a given scenario can have a specific formula*/
    CASE 
        WHEN pivoted_vals.Scenario = 1 THEN pivoted_vals.CurrentRate /*returned row is first row with matching tenor*/
        WHEN pivoted_vals.Scenario = 2 THEN pivoted_vals.CurrentRate /*returned row is row with smallest tenor*/
        WHEN pivoted_vals.Scenario = 3 THEN pivoted_vals.CurrentRate /*returned row is row with largest tenor*/
        WHEN pivoted_vals.Scenario = 4 /*returned row contains pivoted values for the two rows that are two closest tenor rows*/
            THEN pivoted_vals.PreviousRate
                 + 
                 (@i_Tenor - pivoted_vals.PreviousTenor) 
                 * 
                 (pivoted_vals.CurrentRate - pivoted_vals.PreviousRate) 
                 / 
                 (pivoted_vals.CurrentTenor - pivoted_vals.PreviousTenor)
    END AS ReturnValue 
FROM 
    (SELECT 
        /*pivot rows using aggregates. aggregates are only considering one row and value at a time, so they are not truly aggregating anything. simply pivoting values.*/
        MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.Scenario ELSE NULL END) AS Scenario
        ,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentRate ELSE NULL END) AS CurrentRate
        ,MAX(CASE WHEN top_2_rows.RowNumber = 1 THEN top_2_rows.CurrentTenor ELSE NULL END) AS CurrentTenor
        ,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentRate ELSE NULL END) AS PreviousRate
        ,MAX(CASE WHEN top_2_rows.RowNumber = 2 THEN top_2_rows.CurrentTenor ELSE NULL END) AS PreviousTenor
    FROM 
        /*return a row number for our top two returned rows. done in an outer subquery so that the row_number function is not used until the final two rows have been selected*/
        (SELECT TOP (2)  
            picked_curve.Scenario
            ,picked_curve.ScenarioRankValue
            ,picked_curve.CurrentRate
            ,picked_curve.CurrentTenor 
            /*generate row number to match order by clause*/
            ,ROW_NUMBER() OVER(ORDER BY picked_curve.Scenario ASC,picked_curve.ScenarioRankValue ASC) AS RowNumber 
        FROM 
            /*we need top two rows because scenario 4 requires the previous row's value*/
            (SELECT TOP (2) 
                scenarios.Scenario
                ,scenarios.ScenarioRankValue
                ,c.rate AS CurrentRate
                ,c.tenor AS CurrentTenor 
            FROM 
                /*first subquery to select the date that we will use from two separate conditions*/
                (SELECT TOP (1) 
                    date_options.[date]
                FROM 
                    (/*most recent available date before or equal to input date*/
                    SELECT TOP (1) 
                        c.[date]
                    FROM 
                        dbo.Curves AS c
                    WHERE 
                        /*match on type and currency*/
                        c.[type] = @i_Type 
                        AND 
                        c.currency = @i_Currency
                        AND 
                        c.[date] <= @i_Date
                    ORDER BY 
                        c.[date] DESC
                    UNION ALL 
                    /*first available date after input date*/
                    SELECT TOP (1) 
                        c.[date]
                    FROM 
                        dbo.Curves AS c
                    WHERE 
                        /*match on type and currency*/
                        c.[type] = @i_Type 
                        AND 
                        c.currency = @i_Currency
                        AND 
                        c.[date] > @i_Date
                    ORDER BY 
                        c.[date] ASC) AS date_options 
                ORDER BY 
                    /*we want to prioritize date from first query, which we know will be 'older' than date from second query. So ascending order*/
                    date_options.[date] ASC) AS selected_date 
                /*go get curve values for input type, input currency, and selected date*/
                INNER JOIN dbo.Curves AS c ON 
                    @i_Type = c.[type] 
                    AND 
                    @i_Currency = c.currency
                    AND 
                    selected_date.[date] = c.[date] 
                /*go get max and min curve values for input type, input currency, and selected date*/
                OUTER APPLY (SELECT TOP (1) /*TOP (1) is redundant since this is an aggregate with no grouping, but keeping for clarity*/
                                MAX(c_inner.tenor) AS MaxTenor
                                ,MIN(c_inner.tenor) AS MinTenor
                            FROM 
                                dbo.Curves AS c_inner 
                            WHERE 
                                c_inner.[type] = @i_Type
                                AND 
                                c_inner.currency = @i_Currency
                                AND 
                                c_inner.[date] = selected_date.[date]) AS max_min_tenor
                /*for readibility in select, outer apply logic to give us a value will prioritize certain scenarios over others (and indicate to us the scenarios that was returned) and 
                return the minimum number of rows that are ranked in a manner in which the top returned rows contain the information needed for the given scenario*/
                OUTER APPLY (SELECT
                                CASE /*rank the scenarios*/
                                    WHEN @i_Tenor = c.tenor THEN 1
                                    WHEN @i_Tenor < max_min_tenor.MinTenor THEN 2
                                    WHEN @i_Tenor > max_min_tenor.MaxTenor THEN 3
                                    ELSE 4 /*input tenor is between the max/min tenor*/
                                END AS Scenario
                                ,CASE /*rank value that ensures the top row will be the row we need for the returned scenario*/
                                    WHEN @i_Tenor = c.tenor THEN c.tenor 
                                    WHEN @i_Tenor < max_min_tenor.MinTenor THEN c.tenor 
                                    WHEN @i_Tenor > max_min_tenor.MaxTenor THEN c.tenor*-1
                                    ELSE ABS(c.tenor-@i_Tenor)
                                END AS ScenarioRankValue) AS scenarios
            ORDER BY 
                /*highest priority scenario (by lowest value) and the associated value to be ranked, which is designed to be used in ascending order*/
                scenarios.Scenario ASC
                ,scenarios.ScenarioRankValue ASC) AS picked_curve
        ORDER BY 
            picked_curve.Scenario ASC 
            ,picked_curve.ScenarioRankValue ASC) AS top_2_rows) AS pivoted_vals
); 
END; 
GO 

Just one last note on usage, there are certain ways that you might use this query that could prevent it from being inline. I highly recommend you read the entirety of Microsoft's doc on Scalar UDF Inlining, but at a minimum you should at least read over the Inlineable scalar UDF requirements contained in that same doc.

Hopefully this helps you out or at least points you in the right direction.

Solution 2:[2]

In fact your function look like an aggregate function. If it is the case, you should write a SQL CLR one. We do that for financial score and the performance was fine...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 trenton-ftw
Solution 2 SQLpro