Forum Discussion
SQL Query Incremental data load
- Jul 22, 2024
If you're not maintaining the reference timestamp outside of the source table, and instead, you're going to use the target table to determine a reference date, then you would leverage a subquery to pull the maximum value contained in either createdDate or updatedDate in that target table.
This would look something like this:
SELECT [s].* FROM [sourceTable] AS [s] WHERE COALESCE([s].[updatedDate], [s].[createdDate]) > ( SELECT CASE WHEN MAX([t].[updatedDate]) >= MAX([t].[createdDate]) THEN MAX([t].[updatedDate]) ELSE MAX([t].[createdDate]) END FROM [targetTable] AS [t] );
Nothing changes prior to the original COALESCE, you're just changing the WHERE statement after the relational operator.
Cheers,
Lain
Thanks Lain,
The 'Some dateTime value' you are saying in the query will be vmaxUpdatedDate or vmaxCreatedDate ? Which max date I should pick that I do not loose the incremental data.
And below are some more details
The dates are captured at the point of time, when new rec is inserted or updated. I do not have much control over source table & DB, I am only allowed to select.
The dateCreated is datetime2 and UpdatedDate datetime filed, when I query I see it has date value along with timestamp for both the fields.
I see that along with PK the DateCreated is not null and UpdatedDate can be null, UpdatedDate will only have values when the rec is updated later after being inserted. Hence I am requested to pull the incremental data based on UpdatedDate and if this col has NULL then take DateCreated
The 'some DateTime value' is the value you're keeping track of external to the table itself. You could be saving it entirely external in something like a cookie, an XML file, or even internal to a database - whether that's the source database or another one.
Ultimately, you will have to keep track of the last timestamp you queried against so that your next query uses that value as it's "from" point of reference.
You don't need any form of control over the source database itself whatsoever. Read-only access is fine for the scenario you've described.
The COALESCE statement takes care of arbitrating between updatedDate and creationDate. Have a read of it to learn more about how it works, but the short description in the context if your example is that it will first look at updatedDate and return that value if it is not null. If updatedDate is NULL, then it will return createdDate.
Cheers,
Lain
- Neils2401Jul 22, 2024Copper ContributorThanks Lain,
'some DateTime value' can be MaxDateUpdated or MaxDateCreated from target table. Or both hence I was taking the UNION approach so that both the scenarios are covered. I do not have one particular field which captures the change, I have to figure it out based on the two dates.
The business says, use the PK and DateUpdated if not null, or use the PK and CreatedDate if DateUpdated is NULL.
PK I can use when inserting the incremental load into target table, but for getting the delta from source into staging table I have build some logic that I do not miss out any incremental data.
The job is to be run on daily basis to bring in the incremental load- LainRobertsonJul 22, 2024Silver Contributor
If you're not maintaining the reference timestamp outside of the source table, and instead, you're going to use the target table to determine a reference date, then you would leverage a subquery to pull the maximum value contained in either createdDate or updatedDate in that target table.
This would look something like this:
SELECT [s].* FROM [sourceTable] AS [s] WHERE COALESCE([s].[updatedDate], [s].[createdDate]) > ( SELECT CASE WHEN MAX([t].[updatedDate]) >= MAX([t].[createdDate]) THEN MAX([t].[updatedDate]) ELSE MAX([t].[createdDate]) END FROM [targetTable] AS [t] );
Nothing changes prior to the original COALESCE, you're just changing the WHERE statement after the relational operator.
Cheers,
Lain
- Neils2401Jul 22, 2024Copper ContributorThis looks interesting, I have to incorporate this into SSIS as my source is Oracle and target (stg & target tables) are in SQL Server.
This is one of the reason I was first getting the max dates (updated and created) from SQL Server target tables storing it in variables and then doing UNION with separate queries by comparing this max values with source dates. In doing so I am I missing anything or that will also work, to keep it simple.
Thank again Lain for your time on this issue.