Forum Discussion

A819A1L's avatar
A819A1L
Brass Contributor
Aug 15, 2023

MERGE INTO statement increasing previously captured "snapshot" data

Hello, I need help understanding why my snapshot logic isn't working:

 

I have a table that captures "snapshot" data for staff absences each day. On Friday 11th I did a simple select query which returned 109,541 rows. The SKEY_ROW_START and SKEY_ROW_END are the fields used to determine which records should be included in each snapshot date (D.SKEY_DATE as shown below).

 

SELECT * FROM HR.FACT_STAFF_ABSENCE_SNAPSHOT
LEFT JOIN HR.DIM_DATE AS D ON D.SKEY_DATE BETWEEN SKEY_ROW_START AND SKEY_ROW_END
WHERE D.SKEY_DATE = 20230811

 

I ran the same query today (15th), which returned 109,614 rows. This is a problem because regardless of when the query is run, it should always return the same result when querying a specific date. 

 

This is the sql that populates the snapshot table. The purpose if this statement is to compare the results from both non-snapshot and snapshot table based on ROW_HASH. Where they match update the SKEY_ROW_END to "Today". Where they don't match, insert new records and set SKEY_ROW_START and SKEY_ROW_END to "Today". I didn't write this code, it was implemented by a developer that no longer works at my organization. But the statement seemed fine to me. I believe it was designed this way because each "snapshot date" should be an accumulation of absence records in the last year.

 

 WITH S
 AS (SELECT distinct [ABSENCE_DAYS]
      ,[ABSENCE_HOURS]
      ,[ABSENCE_WEEKS]
      ,[ABSENCE_NUMBER]
      ,[APPOINTMENT_ID]
      ,[SKEY_PERSON]
      ,[START_DATE]
      ,[END_DATE]
      ,[ABSENCE_REASON_CODE]
      ,[ABSENCE_REASON_DESCRIPTION]
      ,[RETURN_TO_WORK_INTERVIEW]
      ,[SKEY_START_DATE]
      ,[SKEY_END_DATE]
      ,[SKEY_ACTIVE_CONTRACT]
      ,[SKEY_LATEST_APPOINTMENT]
      ,[SKEY_COLLEGE_OF_SCHOOLS]
      ,[SKEY_SCHOOL]
      ,[SKEY_DEPARTMENT]
      ,[SKEY_DIVISION]
      ,[SKEY_ACTUAL_GRADE]
      ,[SKEY_EMPLOYEE_STATUS]
      ,[SKEY_EMPLOYEE_SUB_STATUS]
      ,[SKEY_EMPLOYEE_LOCATION]
      ,[SKEY_HESA_DEFINED_STAFF_CONTRACT]
      ,[SKEY_CATEGORY]
      ,[SKEY_LEAVING_CODE]
      ,[SKEY_REASON_CODE]
      ,[SKEY_APPOINTMENT_STATUS]
      ,[SKEY_COST_CENTRE]
      ,[SKEY_PAY_CODE]
	  ,CAST(CONVERT(VARCHAR(8),GETDATE(),112) AS INT) SKEY_ROW_START
	  ,CAST(CONVERT(VARCHAR(8),GETDATE(),112) AS INT) SKEY_ROW_END
	  ,CONVERT(VARBINARY(32)
	  ,HASHBYTES('SHA2_256',CONCAT([ABSENCE_DAYS]	,'|'
      ,[ABSENCE_HOURS]	,'|'
      ,[ABSENCE_WEEKS]	,'|'
      ,[ABSENCE_NUMBER]	,'|'
      ,[APPOINTMENT_ID]	,'|'
      ,[SKEY_PERSON]	,'|'
      ,[START_DATE]	,'|'
      ,[END_DATE]	,'|'
      ,[ABSENCE_REASON_CODE]	,'|'
      ,[ABSENCE_REASON_DESCRIPTION]	,'|'
      ,[RETURN_TO_WORK_INTERVIEW]	,'|'
      ,[SKEY_START_DATE]	,'|'
      ,[SKEY_END_DATE]	,'|'
      ,[SKEY_ACTIVE_CONTRACT]	,'|'
      ,[SKEY_LATEST_APPOINTMENT]	,'|'
      ,[SKEY_COLLEGE_OF_SCHOOLS]	,'|'
      ,[SKEY_SCHOOL]	,'|'
      ,[SKEY_DEPARTMENT]	,'|'
      ,[SKEY_DIVISION]	,'|'
      ,[SKEY_ACTUAL_GRADE]	,'|'
      ,[SKEY_EMPLOYEE_STATUS]	,'|'
      ,[SKEY_EMPLOYEE_SUB_STATUS]	,'|'
      ,[SKEY_EMPLOYEE_LOCATION]	,'|'
      ,[SKEY_HESA_DEFINED_STAFF_CONTRACT]	,'|'
      ,[SKEY_CATEGORY]	,'|'
      ,[SKEY_LEAVING_CODE]	,'|'
      ,[SKEY_REASON_CODE]	,'|'
      ,[SKEY_APPOINTMENT_STATUS]	,'|'
      ,[SKEY_COST_CENTRE]	,'|'
      ,[SKEY_PAY_CODE]	,'|'
	  ,[SKEY_WORK_LOCATION]))) AS ROW_HASH
	  ,[SKEY_WORK_LOCATION]
	   ,SKEY_ABSENCE_TERM
FROM HR.FACT_STAFF_ABSENCE
WHERE SKEY_START_DATE >=(DATEPART(YEAR,GETDATE())-1)*10000 +801
)

MERGE INTO HR.FACT_STAFF_ABSENCE_SNAPSHOT AS T
USING S
ON T.ROW_HASH = S.ROW_HASH
WHEN MATCHED THEN
UPDATE 
SET T.SKEY_ROW_END = CAST(CONVERT(VARCHAR(8),GETDATE(),112) AS INT)
WHEN NOT MATCHED BY TARGET
THEN 
INSERT ([ABSENCE_DAYS]
      ,[ABSENCE_HOURS]
      ,[ABSENCE_WEEKS]
      ,[ABSENCE_NUMBER]
      ,[APPOINTMENT_ID]
      ,[SKEY_PERSON]
      ,[START_DATE]
      ,[END_DATE]
      ,[ABSENCE_REASON_CODE]
      ,[ABSENCE_REASON_DESCRIPTION]
      ,[RETURN_TO_WORK_INTERVIEW]
      ,[SKEY_START_DATE]
      ,[SKEY_END_DATE]
      ,[SKEY_ACTIVE_CONTRACT]
      ,[SKEY_LATEST_APPOINTMENT]
      ,[SKEY_COLLEGE_OF_SCHOOLS]
      ,[SKEY_SCHOOL]
      ,[SKEY_DEPARTMENT]
      ,[SKEY_DIVISION]
      ,[SKEY_ACTUAL_GRADE]
      ,[SKEY_EMPLOYEE_STATUS]
      ,[SKEY_EMPLOYEE_SUB_STATUS]
      ,[SKEY_EMPLOYEE_LOCATION]
      ,[SKEY_HESA_DEFINED_STAFF_CONTRACT]
      ,[SKEY_CATEGORY]
      ,[SKEY_LEAVING_CODE]
      ,[SKEY_REASON_CODE]
      ,[SKEY_APPOINTMENT_STATUS]
      ,[SKEY_COST_CENTRE]
      ,[SKEY_PAY_CODE]
      ,[SKEY_ROW_START]
      ,[SKEY_ROW_END]
      ,[ROW_HASH]
	  ,[SKEY_WORK_LOCATION]
	  ,SKEY_ABSENCE_TERM)
VALUES
(S.[ABSENCE_DAYS]
      ,S.[ABSENCE_HOURS]
      ,S.[ABSENCE_WEEKS]
      ,S.[ABSENCE_NUMBER]
      ,S.[APPOINTMENT_ID]
      ,S.[SKEY_PERSON]
      ,S.[START_DATE]
      ,S.[END_DATE]
      ,S.[ABSENCE_REASON_CODE]
      ,S.[ABSENCE_REASON_DESCRIPTION]
      ,S.[RETURN_TO_WORK_INTERVIEW]
      ,S.[SKEY_START_DATE]
      ,S.[SKEY_END_DATE]
      ,S.[SKEY_ACTIVE_CONTRACT]
      ,S.[SKEY_LATEST_APPOINTMENT]
      ,S.[SKEY_COLLEGE_OF_SCHOOLS]
      ,S.[SKEY_SCHOOL]
      ,S.[SKEY_DEPARTMENT]
      ,S.[SKEY_DIVISION]
      ,S.[SKEY_ACTUAL_GRADE]
      ,S.[SKEY_EMPLOYEE_STATUS]
      ,S.[SKEY_EMPLOYEE_SUB_STATUS]
      ,S.[SKEY_EMPLOYEE_LOCATION]
      ,S.[SKEY_HESA_DEFINED_STAFF_CONTRACT]
      ,S.[SKEY_CATEGORY]
      ,S.[SKEY_LEAVING_CODE]
      ,S.[SKEY_REASON_CODE]
      ,S.[SKEY_APPOINTMENT_STATUS]
      ,S.[SKEY_COST_CENTRE]
      ,S.[SKEY_PAY_CODE]
      ,CAST(CONVERT(VARCHAR(8),GETDATE(),112) AS INT)
      ,CAST(CONVERT(VARCHAR(8),GETDATE(),112) AS INT)
      ,S.[ROW_HASH]
	  ,S.[SKEY_WORK_LOCATION]
	  ,S.SKEY_ABSENCE_TERM)
;

 

When I compared the two result sets, I managed to identify the additional 73 records based on the ROW_HASH.  In each one the SKEY_ROW_END was greater than 20230811, as expected. But the SKEY_ROW_START of these records pre-dates 20230811, which explains why they are included in the select query. What I don't understand is, how? Surely if they are new records added after 20230811, the SKEY_ROW_START should be greater than 20230811?

 

Is anyone able to identify anything wrong with the MERGE/INSERT statement that might explain what's happened?

 

Cheers 

 

  • A819A1L 

     

    This may not be a technical issue at all.

     

    You need to consider how leave is administered, which would involve talking to departments like human resources and payroll, etc.

     

    In my experience, it's quite common for unplanned leave (such as sick or compassionate leave, etc.) to be entered retrospectively.

     

    I've also seen a good number of places enter leave batch-style, such as on a particular day of the week, or when their in-tray/inbox has enough pending request to make it worthwhile.

     

    And then there's just plain old corrections - whether that's because someone entered the dates incorrectly in the first place, or perhaps the leave dates were adjusted at the person on leave's request.

     

    The point is, there's a human behaviour issue at play here - it's not a real time technical transaction process.

     

    Given the volume of rows you're pulling, that's a very low number of variations.

     

    Your best course of actions is to pick a handful of the variations and go check with the custodians (HR, payroll, whoever it may be). Maybe there's a problem but I doubt it.

     

    I'm more inclined to think your assumption that for a given historic date that the reported figure won't change is not realistic. Certainly, the further you go back in time the more accurate that assumption will probably be, but where you're searching against recent dates - as your example is, and by the time you factor in the length of a pay cycle (often fortnightly or monthly in my experience), you can expect to see variations.

     

    Cheers,

    Lain

  • LainRobertson's avatar
    LainRobertson
    Silver Contributor

    A819A1L 

     

    This may not be a technical issue at all.

     

    You need to consider how leave is administered, which would involve talking to departments like human resources and payroll, etc.

     

    In my experience, it's quite common for unplanned leave (such as sick or compassionate leave, etc.) to be entered retrospectively.

     

    I've also seen a good number of places enter leave batch-style, such as on a particular day of the week, or when their in-tray/inbox has enough pending request to make it worthwhile.

     

    And then there's just plain old corrections - whether that's because someone entered the dates incorrectly in the first place, or perhaps the leave dates were adjusted at the person on leave's request.

     

    The point is, there's a human behaviour issue at play here - it's not a real time technical transaction process.

     

    Given the volume of rows you're pulling, that's a very low number of variations.

     

    Your best course of actions is to pick a handful of the variations and go check with the custodians (HR, payroll, whoever it may be). Maybe there's a problem but I doubt it.

     

    I'm more inclined to think your assumption that for a given historic date that the reported figure won't change is not realistic. Certainly, the further you go back in time the more accurate that assumption will probably be, but where you're searching against recent dates - as your example is, and by the time you factor in the length of a pay cycle (often fortnightly or monthly in my experience), you can expect to see variations.

     

    Cheers,

    Lain

    • A819A1L's avatar
      A819A1L
      Brass Contributor
      Hi Lian,

      Thanks for the advice. I believe you're right. I have been advised to redevelop the ETL to load all live records and assign a snapshot date rather than an accumulation of changes.

      Cheers,
      • LainRobertson's avatar
        LainRobertson
        Silver Contributor

        A819A1L 

         

        That sounds like good advice, as the merging approach can give misleading data.

         

        Consider you have a person taking leave with these initial dates:

         

        Start dateEnd date
        2023-08-172023-08-30

         

        But then later on, they have to adjust the leave they applied for above to this:

         

        Start dateEnd date
        2023-09-172023-09-30

         

        The problem with the MERGE approach is that for this person, they now have two leave windows scheduled:

         

        Start dateEnd date
        2023-08-172023-08-30
        2023-09-172023-09-30

         

        So, for this poor person, it looks like they're taking off twice as much time as they really are. It obviously also makes the organisational leave rollup inaccurate.

         

        MERGE is great for things change logging/auditing but not for this kind of scenario, where - as you say - using the live/final records is the more trustworthy approach when measuring the amount of organisational leave being taken per period.

         

        Cheers,

        Lain

Resources