Forum Discussion

AzureNewbie1
Copper Contributor
Jul 04, 2024

Is the use of NVARCHAR(MAX) acceptable for this scenario?

Hi,
I'm using Azure Data Factory (ADF) Pipelines to ingest several external data sources into our domain. As you'll know, I need to create a target database table for each source table I'm looking to ingest.
Upon creating the target tables for the first time, I can look at the source tables' field properties (datatype and size) and set those same properties on the target tables. My concern is that if the source owners change the source field properties in the future and do not notify us (which is a strong possibility), the ADF Pipelines may fail, in which case I will investigate and resolve the issue. There is another scenario where source field A is 20 characters long, so I set the target field to be 20 characters long, but the source field is later amended to be 50 characters long and I'm not informed. This may not break the ADF Pipeline execution but simply truncate the data.
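
To illustrate the truncation risk I mean (the value below is hypothetical):

```sql
-- A value longer than the declared length is silently cut to 20 characters
-- when converted, which is the kind of quiet data loss I'm worried about.
SELECT CAST(N'This value is longer than twenty characters' AS NVARCHAR(20)) AS TruncatedValue;
-- Returns: 'This value is longer'
```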

An approach would be to set each target field to NVARCHAR(MAX). This would (should) avoid the situation above. I'm a one-person band on all things data at my company and am attempting to juggle many things as best I can. I feel the NVARCHAR(MAX) approach would reduce headaches and the time needed to investigate issues.
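
For illustration, a hypothetical target table under this approach might look like the sketch below (table and column names are made up):

```sql
-- Hypothetical staging table where every character column is NVARCHAR(MAX),
-- so a widened source column can never cause truncation on ingest.
CREATE TABLE dbo.Stg_Customer
(
    CustomerId   INT           NOT NULL,
    FirstName    NVARCHAR(MAX) NULL,
    LastName     NVARCHAR(MAX) NULL,
    EmailAddress NVARCHAR(MAX) NULL,
    LoadedAtUtc  DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME()
);
```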

Questions:
1) I understand it isn't good practice to use NVARCHAR(MAX) unless needed, but would you consider it acceptable given the concerns above?
2) I understand NVARCHAR(MAX) allows for up to 2 GB per field, but if we're not actually filling that 2 GB, does it cost more in Azure SQL Server? The volume of data we have isn't great, but I am still considering the costs involved.

TLDR: Would setting target fields as NVARCHAR(MAX) increase Azure SQL Server costs even if the fields only contain 20 characters?

Thanks in advance.

  • bandaruajeyudu
    Copper Contributor
    Acceptability of NVARCHAR(MAX):

    Pros:
    Flexibility: Accommodates changes in source field lengths without requiring table schema changes in the target database.
    Resilience: Reduces the risk of ADF pipeline failures or data truncation issues due to schema changes in source tables.
    Cons:
    Performance: NVARCHAR(MAX) can potentially impact performance, especially if large amounts of data are stored.
    Indexing: Indexing on NVARCHAR(MAX) columns is less efficient compared to fixed-length NVARCHAR columns.
    Best Practices: Using NVARCHAR(MAX) for all columns is generally not recommended unless necessary.
    Hybrid Approach:

    Use NVARCHAR with a reasonable length for most columns based on the current source data length.
    Reserve NVARCHAR(MAX) for columns where you expect significant changes in data length or where you cannot predict future length requirements.
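
    As a rough sketch (table and column names are hypothetical), a hybrid target table might look like this:

    ```sql
    -- Hypothetical hybrid staging table: bounded NVARCHAR(n) where the source
    -- length is stable, NVARCHAR(MAX) only where the length is unpredictable.
    CREATE TABLE dbo.Stg_Customer
    (
        CustomerId   INT           NOT NULL,
        FirstName    NVARCHAR(100) NULL,  -- stable, short source field
        LastName     NVARCHAR(100) NULL,
        EmailAddress NVARCHAR(320) NULL,
        Notes        NVARCHAR(MAX) NULL   -- free text, length unpredictable
    );
    ```
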
    Monitoring and Alerts:

    Implement monitoring and alerting for schema changes in source tables. Tools like Azure Data Factory and Azure SQL Database provide monitoring capabilities that can help you track changes and respond proactively.
    Regular Audits:

    Periodically audit the source and target schemas to ensure they remain in sync. This can help you catch any changes early and adjust the target schema as needed.
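
    For example, one low-effort audit (assuming both systems expose the standard INFORMATION_SCHEMA views) is to run the same column-metadata query against the source and the target and compare the output:

    ```sql
    -- Run against both the source and the target database, then diff the two
    -- result sets to spot columns whose declared lengths have drifted apart.
    -- CHARACTER_MAXIMUM_LENGTH is -1 for (MAX) columns.
    SELECT TABLE_SCHEMA,
           TABLE_NAME,
           COLUMN_NAME,
           DATA_TYPE,
           CHARACTER_MAXIMUM_LENGTH
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar')
    ORDER BY TABLE_SCHEMA, TABLE_NAME, ORDINAL_POSITION;
    ```
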
    Using NVARCHAR(MAX) can be acceptable in your scenario, especially for fields whose length can vary or change unexpectedly. However, consider a balanced approach to avoid potential performance impacts and maintain best practices. The cost in Azure SQL Server will primarily depend on the actual data stored, not on the maximum potential size of NVARCHAR(MAX) fields.
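
    If you want to verify this for yourself, you can check how much space a table actually uses (the table name below is hypothetical):

    ```sql
    -- Reports rows, reserved, data and index space for the table; a table of
    -- short values in NVARCHAR(MAX) columns only consumes space for the data
    -- actually stored, not the 2 GB per-value maximum.
    EXEC sp_spaceused N'dbo.Stg_Customer';
    ```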
    • AzureNewbie1
      Copper Contributor

      bandaruajeyudu - thank you for your response.
      My comments based on your feedback are:
      1) By using NVARCHAR(MAX) I would avoid the need to amend the target structure should the source change. This is a benefit, as is avoiding the risk of truncation.
      2) We won't be storing a large amount of data in the target database where I'm considering using NVARCHAR(MAX). There may be 50 tables, each around 30 columns wide and 30k rows deep.
      3) I don't expect to explicitly create indexes on the target tables. Unless Azure SQL Server implicitly creates indexes on the tables, I feel the indexing limitation of NVARCHAR(MAX) is not relevant here.
      4) The hybrid approach seems sensible. It would take a little more effort upfront but result in a more suitable design. The concern, though, is that I have no idea whether or how the source, which is maintained by external parties, will change. Using NVARCHAR(MAX) may seem lazy, but I wouldn't be using it out of laziness; rather, to future-proof against source changes.
      5) You mentioned Azure Data Factory has the ability to monitor source changes. This sounds very useful. Could you please provide a link to what you're describing? If this monitoring can alert me to a change in a source field (or even just a table), then I could rely on it and wouldn't need the fail-safe of using NVARCHAR(MAX).

      Thanks.
