Database Platform CSE – Db2
Ingesting Mainframe File System Data (EBCDIC) into SQL DB on Fabric Using OSS Cobrix
Introduction

Mainframe and midrange data is often stored in fixed-length format, where each record has a predetermined length, or in variable-length format, where each record's length may vary. The data is stored in binary form using Extended Binary Coded Decimal Interchange Code (EBCDIC) encoding, and the metadata for the EBCDIC files is stored in a copybook file. EBCDIC-encoded files store each field according to its data type, which is vital for optimal storage and performance on the Mainframe file system. However, this presents a challenge when migrating data from Mainframe or Midrange systems to distributed systems: the data, originally stored in a format specific to those systems, is not directly readable after transfer, because distributed systems only understand code pages such as American Standard Code for Information Interchange (ASCII).

To make this data readable on a distributed system, we need an EBCDIC-to-ASCII code page conversion. This conversion can be achieved in several ways, including:

- Microsoft Host Integration Server (Host File Client)
- Logic App (IBM Host File connector) - our detailed blog about it is here
- Open Source (OSS) libraries
- Third-party ISV solutions

Microsoft Host Integration Server

Microsoft Host Integration Server (HIS) has a component named Host File Client (HFC). This component helps convert Mainframe EBCDIC files to ASCII using a custom-developed C# solution. More details on this solution are provided in the HIS documentation page.

Logic App Converter

If you prefer a cloud-native solution, you can use the Host File Connector in Azure Logic Apps. The detailed process is documented in this blog post.

Fabric (with Open-Source Libraries)

Microsoft Fabric is an enterprise-ready, end-to-end analytics platform. It unifies data movement, data processing, ingestion, transformation, real-time event routing, and report building. It supports these capabilities with integrated services like Data Engineering, Data Factory, Data Science, Real-Time Intelligence, Data Warehouse, and Databases. There are many open-source solutions that can help convert mainframe data to ASCII using Fabric, Databricks, or Synapse on Azure. This blog will focus on the OSS option.

Data Ingestion Architecture Using OSS on Fabric

Multiple open-source libraries can be utilized for this data conversion. In this article we will dive deeper into one of these solutions: Cobrix. Cobrix is an open-source library built in Scala that leverages Spark's multithreaded processing framework, which makes the conversion faster than single-threaded alternatives. Because it is multithreaded, it needs a pool of compute resources to perform the conversion. Cobrix can run on Spark environments such as Azure Synapse, Databricks, and Microsoft Fabric. We will dive deeper into how to set up Cobrix on Microsoft Fabric.

Download required Cobrix packages

First, download the required Cobrix packages from the right sources. As Fabric has a particular runtime dependency, make sure you download the right Scala build for the Fabric environment that you set up. You will have to download two jars named Cobol-Parser_xx.xx.jar and Spark-cobol_xxx.xx.jar.

Setup the Fabric Environment

1. Login to fabric.microsoft.com.
2. Create a Fabric workspace.
3. Create an Environment in the workspace.
4. Open the Environment and click on Custom Libraries, then upload the two jars which were downloaded earlier. Once you have uploaded them, your custom library setup should look something like this.
5. Create a new Lakehouse.
6. Upload the COBOL copybook file as well as the Mainframe data file (in binary) to a particular location in the Lakehouse. At the end of this step your Lakehouse setup should look something like this. For both of these files, copy the Azure Blob File System Secure (ABFSS) path by right-clicking on the files; this path can be used to point to the file from the Spark notebook.
7. Create a new Fabric pipeline. This pipeline will have two components: the first is a notebook, which calls the Cobrix framework to convert the file from EBCDIC to ASCII; the second is a Copy activity to copy the contents of the output file created by the notebook to a SQL DB on Fabric.
8. Create a new notebook and attach the environment which you created earlier to it. In the notebook cell, you can use this piece of code:

//Blob access
var CopyBookName = "abfss://file1.cpy"
var DataFileName = "abfss://file1.dat"
var outputFileName = "abfss://output.txt"

//Cobrix Converter Execution
val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", CopyBookName)
  .load(DataFileName)

//Display DataFrame to view conversion results
cobolDataframe.printSchema()
cobolDataframe.show()

//Persist the converted output to the Lakehouse path defined above
cobolDataframe.write.mode("overwrite").option("header", "true").csv(outputFileName)

Once you have set the configuration properly, you are all set to run the notebook. This will convert the file from EBCDIC to ASCII and store it in the Lakehouse.
9. Add a Copy activity to the pipeline with the file as source and SQL Server as destination. At this point, your pipeline should look something like this.

Once you run this pipeline, the Mainframe EBCDIC file is converted to ASCII and then loaded into a Fabric native SQL DB table.

Third-party ISV solutions

There are many third-party ISV solutions available for EBCDIC-to-ASCII conversions. Please get in touch with us so we can help you find the right solution for your requirements.

Summary

EBCDIC-to-ASCII conversion is a critical piece of work in the data migration/modernization journey, and being able to do it with ease and accuracy drives the success of the migration. With this capability enabled in Fabric, a new set of predominantly data-warehouse-driven use cases opens up, such as Mainframe report generation.

Feedback and suggestions

If you have feedback or suggestions for improving this data migration asset, please send an email to the Database Platform Engineering Team.
Copy Data to Azure SQL Database from Mainframe Db2 using Microsoft Fabric Data pipelines

To migrate data from Db2 on a mainframe to an Azure SQL Database, Fabric Data Pipelines provide a streamlined and scalable solution. They simplify data extraction from Db2, transform it as needed, and load it efficiently into Azure SQL Database. This approach allows you to automate the entire ETL process while taking advantage of Azure's scalability and integration features, ensuring a smooth transition of legacy data to the cloud.
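Incremental copies of this kind are typically driven by a watermark. Below is a minimal sketch of such a source query, assuming a hypothetical ORDERS table with a LAST_UPDATED_TS change-timestamp column (the names and the parameter mechanism are illustrative, not from the original post):

--Source query for an incremental copy; @LastWatermark is supplied by the pipeline
--(for example, from a lookup against a watermark control table).
SELECT *
FROM   ORDERS
WHERE  LAST_UPDATED_TS > @LastWatermark;

The pipeline would persist the highest LAST_UPDATED_TS it copied and pass that value in on the next run.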
Best practices for modernizing large-scale Text-Search from legacy data systems to Azure SQL

Introduction

Text search is an essential aspect of contemporary data management, enabling users to effectively extract information from natural-language text. This technology allows for the retrieval of specific details from a text corpus, providing insights that would be unattainable through conventional search methods. Most modern databases include text-search capabilities by default, while legacy databases and data systems typically do not: users may need to configure these features separately or rely on external services to implement text-search functions. SQL Server natively supports Full-Text Search (available since SQL Server 2005), which removes the requirement for additional component installations and thereby negates the need for any external data movement.

The effectiveness of any feature is contingent upon proper design and implementation. For handling text search on large-volume tables and VLDBs (i.e., TBs or PBs of data), adhering to established techniques and best practices ensures optimal performance and manageability. This article focuses on the performance considerations and optimization techniques required for implementing Full-Text Search on Azure SQL Database, and discusses a use case that demonstrates how to optimize full-text search innovatively to deliver quick and efficient results.

Modernizing legacy workloads

This article delves into the text-search feature in Azure SQL and its detailed implementation for a specific use case. This is a common scenario that customers may encounter during migration or modernization from legacy mainframe, midrange, or on-premises x86 systems. For instance, certain legacy systems employ an OLTP database for standard searches and a separate server specifically catering to the text-search use case. This design necessitates regular data transfers between the two systems, potentially leading to situations where customers cannot access up-to-date information and must accept staleness in their data processing. There have been cases where text-search requests were submitted to backend systems as separate queries, queued to run after business hours in batch jobs, with results returned to the end user on the following business day. Such implementations were common in legacy systems and continue to impact businesses today. When modernizing such workloads, customers can adopt native features like Full-Text Search, which provides functional benefits to end users, such as real-time results, and lets them focus more on their core business activities.

The table below shows how long these searches take when implemented through different technologies. The table volume considered here is ~5 billion rows.

Technology | Preferred Execution Type | Feature utilized for Text-Search | End-to-End Timeframe
Legacy Systems | Nightly Batch | LIKE or External Text Search Server | 1 business day
Cloud Database | Micro-batch | LIKE | ~20 mins
Azure SQL DB | Real-time | Full Text Search | <1 sec

Note: The performance increase shown here is not a direct result of utilizing Full-Text Search alone. Though Full-Text Search speeds up querying such data, the overall improvement is a consequence of the system design and the use of multiple complementary SQL features that fit this use case.
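As a quick illustration of the difference (a minimal sketch using the CUSTOMERINFO table and the per-state indexed views defined later in this article; the search term is illustrative):

--LIKE with a leading wildcard cannot seek an index and scans the large table.
SELECT CUSTID FROM dbo.CUSTOMERINFO WHERE CUSTINFO LIKE '%mobile%';

--CONTAINS resolves the same search against a full-text index on the indexed view.
SELECT CUSTID FROM dbo.[v_CUSTOMERINFO_AL] WHERE CONTAINS (CUSTINFO, 'mobile');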
Scenario

Let's consider a use case where the SQL database contains a substantially large table named CustomerInfo with billions of rows and terabytes of data. This table includes columns such as Customer Name, Address, State, and CustInfo. Assume that the CustInfo field contains free text / notes entered by the customer service team for logging. Our use case involves retrieving rows from this CustInfo column based on partial search criteria provided by the end user through an application (i.e., web page, mobile app, etc.). For example, the query could be to extract all the rows where the CustInfo field has notes stating that 'mobile' or 'address' was updated. The search functionality should retrieve all corresponding records, whether the match is exact or partial, and sort them according to their proximity to the provided keywords. Implementing a standard search using the LIKE operator on such a large dataset would be inefficient, and constructing the logic for specific word searches on a free-text character column is complex to build and maintain. Full-Text Search for this use case involves applying sound design principles as well as leveraging SQL Server native features such as partitioning and indexed views, which enhance operational efficiency and manageability.

Note: In the context of this article, we are using Azure SQL DB Hyperscale, but the implementation remains mostly the same for other SQL offerings as well.

Technology Overview

Table Partitioning involves dividing a large table into smaller, more manageable segments called "partitions". Instead of storing all data in one extensive table, the data is separated into several smaller tables, each containing a portion of the data. Detailed information on Azure SQL Hyperscale table partitioning, along with best practices and recommendations, can be found in a two-part technical blog: Part 1 and Part 2.

Indexed Views, also known as Materialized Views, are a SQL Server feature that can improve the performance of queries on large tables. Indexed views store the result set of a query physically on disk, which allows for faster retrieval. Using indexed views for full-text indexing can enhance the performance and maintainability of large datasets. Details on how to create an indexed view are provided here.

Full-Text Search is a robust feature that facilitates complex search on text data stored in SQL tables. This feature supports efficient querying of extensive text fields such as documents, articles, and product descriptions through a specialized index known as a Full-Text Index. Full-Text Search uses advanced indexing and query functions to perform complex searches, such as phrase matching and linguistic analysis, with high accuracy and speed. It ranks results based on relevance, ensuring the most pertinent results appear at the top. Search results can be refined with logical operators (AND/OR/NOT) and word weights. The system supports rich query syntax for natural-language queries and is compatible with multiple languages and character sets, making it suitable for global applications. A comprehensive overview and setup instructions for Full-Text Search can be found in the document here. Key concepts of Full-Text Search are the Full-Text Catalog and the Full-Text Index.
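For example, relevance ranking can be queried with CONTAINSTABLE, which returns a KEY and a RANK column that can be joined back to the view (a minimal sketch using the per-state indexed view defined later in this article; the search terms are illustrative):

--Return matches ordered by relevance; ft.[KEY] holds the unique-index key (CUSTID).
SELECT v.CUSTID, v.CUSTNAME, ft.[RANK]
FROM   dbo.[v_CUSTOMERINFO_AL] AS v
       INNER JOIN CONTAINSTABLE(dbo.[v_CUSTOMERINFO_AL], CUSTINFO, 'mobile OR address') AS ft
               ON v.CUSTID = ft.[KEY]
ORDER BY ft.[RANK] DESC;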
Optimizing Full-Text Search in SQL Server

This blog post explains how to implement text search by using a combination of different but complementary built-in features: Full-Text Search, partitioning, and indexed views. In this scenario, we implement the following techniques:

- The base table is large and is therefore partitioned into smaller chunks of data.
- Each indexed view is built aligning to a single partition on the base table.
- Each full-text index is built aligning to a single indexed view.
- Full-text catalogs hold the relevant build information for the full-text indexes.

Key benefits of this architecture for VLDBs:

- Provides a scalable model for growing data sizes.
- Partitioning the base table provides better manageability and optimized performance.
- Indexed views aligned with the partitioning key allow quicker population of each indexed view.
- Having a full-text catalog aligned with each indexed view allows rebuilding / reorganizing the full-text catalog for a specific subset.
- A full-text index aligned with an indexed view provides manageable index sizes and quicker crawling.
- Ability to rebuild / reorganize specific full-text indexes, and improved query performance.

Note: Full-text index population is an asynchronous activity. For a given indexed view / table, only one full-text index is allowed. Within a full-text index, multiple columns can be included.

The diagram below provides a representation of how we can implement Full-Text Search on very large tables. Steps highlighted: (1) partitioning the base table; (2) indexed views aligned with the partition key; (3) full-text catalogs; (4) full-text indexes.

Partitioning

Choosing a partition key that evenly distributes data yields optimal results and is the most important parameter for partitioning. Partition splitting divides a large partition into smaller ones when it grows over time (a sketch follows the sample below). Here, the partitioning column is STATE, so the base table splits data based on STATE. Sample table definition along with partitioning:

--Partition Function Creation
CREATE PARTITION FUNCTION [PF_STATE](varchar(2))
AS RANGE FOR VALUES ('AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY')

--Partition Scheme Creation
CREATE PARTITION SCHEME [PS_STATE]
AS PARTITION [PF_STATE] ALL TO ([PRIMARY])

--Table DDL with partition key
CREATE TABLE [dbo].[CUSTOMERINFO]
(
  [CUSTID] [int] NOT NULL,
  [CUSTNAME] [varchar](40),
  [ADDRESS] [char](60),
  [CITY] [char](20),
  STATE varchar(2),
  [ZIP] [char](10),
  [COUNTRY] [char](15),
  CUSTINFO nvarchar(max),
  PRIMARY KEY CLUSTERED
  (
    [CUSTID] ASC,
    STATE ASC
  ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF)
) ON [PS_STATE](STATE);
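A minimal sketch of the splitting operation referenced above (the new boundary value 'PR' is illustrative and not part of the original design):

--Designate the filegroup that will hold the new partition, then add the boundary.
ALTER PARTITION SCHEME [PS_STATE] NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION [PF_STATE]() SPLIT RANGE ('PR');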
Indexed view

To improve efficiency, multiple indexed views are created and aligned with the partition key column. Indexed views are schema-bound and store the results of the view's query, so the results are precomputed and stored on disk. For large tables, creating an indexed view that aligns with the partitioned column helps populate the indexed view quickly and aids in updating the views when the base table changes. Multiple indexed views enable the creation of multiple full-text indexes, optimizing the search process.

Note: An indexed view must have a unique index so it can be utilized during full-text index creation.

Sample indexed views along with clustered indexes:

-- Create indexed view for STATE AL and a clustered index which is subsequently used for the full-text index.
CREATE VIEW [dbo].[v_CUSTOMERINFO_AL] WITH SCHEMABINDING
AS
SELECT CUSTID, CUSTNAME, ADDRESS, CITY, STATE, ZIP, COUNTRY, CUSTINFO
FROM dbo.CUSTOMERINFO
WHERE STATE = 'AL';
GO
CREATE UNIQUE CLUSTERED INDEX [CX_CUSTOMERINFO_AL] ON [dbo].[v_CUSTOMERINFO_AL]
(
  CUSTID ASC
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

-- Create indexed view for STATE AK and a clustered index which is subsequently used for the full-text index.
CREATE VIEW [dbo].[v_CUSTOMERINFO_AK] WITH SCHEMABINDING
AS
SELECT CUSTID, CUSTNAME, ADDRESS, CITY, STATE, ZIP, COUNTRY, CUSTINFO
FROM dbo.CUSTOMERINFO
WHERE STATE = 'AK';
GO
CREATE UNIQUE CLUSTERED INDEX [CX_CUSTOMERINFO_AK] ON [dbo].[v_CUSTOMERINFO_AK]
(
  CUSTID ASC
) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Full-Text Search

Full-Text catalog

A full-text catalog is a logical container for full-text indexes, independent of any table. One catalog can be used for indexes on different tables. For databases with large or growing text data, use multiple catalogs. Rebuild the full-text catalog after significant data changes (bulk inserts, large updates) to improve performance and reduce index fragmentation, and include this rebuild in regular database maintenance activities.

Sample full-text catalogs:

--Full-text catalogs for STATE AL and AK.
CREATE FULLTEXT CATALOG CT_CUSTOMERINFO_AL AS DEFAULT;
CREATE FULLTEXT CATALOG CT_CUSTOMERINFO_AK;

Instructions for changing the properties of a full-text catalog, such as REORGANIZE and REBUILD, are provided here.

Full-Text Index

A table or indexed view can have only one full-text index, which can cover multiple columns. These indexes are linked to a unique index of the underlying table or view and attached to a full-text catalog. They can be created on char-based data types and XML. The process of creating and maintaining full-text indexes is called population or crawl; it is asynchronous, updates with changes to the underlying table or view, and can be maintained with the CHANGE_TRACKING property. For large tables, aligning full-text indexes with indexed views allows for multiple manageable indexes, improving performance and speeding up data updates. Ensure high-I/O file groups are allocated for full-text indexes in SQL Server installations. A full-text index also comes with options to add and remove columns, as well as to enable or disable it as needed.

Note: During bulk data loads (or other unique scenarios), full-text index population (i.e., CHANGE_TRACKING) can be set to MANUAL / OFF; after the load, populate the full-text index and reset CHANGE_TRACKING to AUTO (a sketch of this sequence appears after the samples below).

Sample full-text indexes:

--Full-text index aligned to the clustered index on the indexed view and its full-text catalog.
CREATE FULLTEXT INDEX ON [dbo].[v_CUSTOMERINFO_AL](CUSTINFO)
KEY INDEX [CX_CUSTOMERINFO_AL] ON CT_CUSTOMERINFO_AL
WITH CHANGE_TRACKING = AUTO;

CREATE FULLTEXT INDEX ON [dbo].[v_CUSTOMERINFO_AK](CUSTINFO)
KEY INDEX [CX_CUSTOMERINFO_AK] ON CT_CUSTOMERINFO_AK
WITH CHANGE_TRACKING = AUTO;

Full-Text Index properties that can be changed are listed here.
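A minimal sketch of the bulk-load sequence described in the note above, together with a catalog reorganize, using the objects from the samples:

--Before the bulk load: stop automatic population.
ALTER FULLTEXT INDEX ON [dbo].[v_CUSTOMERINFO_AL] SET CHANGE_TRACKING MANUAL;

--After the load: repopulate, then resume automatic change tracking.
ALTER FULLTEXT INDEX ON [dbo].[v_CUSTOMERINFO_AL] START FULL POPULATION;
ALTER FULLTEXT INDEX ON [dbo].[v_CUSTOMERINFO_AL] SET CHANGE_TRACKING AUTO;

--Catalog maintenance after significant data changes.
ALTER FULLTEXT CATALOG CT_CUSTOMERINFO_AL REORGANIZE;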
Sample queries using Full-Text Search

Full-text samples with CONTAINS and more examples are here.

Sample 1: CONTAINS with OR

--To identify rows whose CUSTINFO column has the words 'Mobile' or 'Address'.
SELECT * FROM dbo.[v_CUSTOMERINFO_AL]
WHERE CONTAINS (CUSTINFO, 'Mobile OR Address');

Sample 2: CONTAINS with AND

--To identify rows whose CUSTINFO column has the words 'Mobile' and 'Address'.
SELECT CUSTID, CUSTNAME, CUSTINFO FROM dbo.[v_CUSTOMERINFO_AL]
WHERE CONTAINS (CUSTINFO, 'Mobile AND Address');

Details about ranking are in Full-Text Search with Ranking.

Note: While the SQL optimizer selects the best execution plan for a query, for very specific and unique scenarios hints can also be used to optimize further. In this specific use case, indexed-view matching can be enforced using the NOEXPAND table hint, which drastically improves performance. With this hint, the query optimizer uses the index on the view if the query references columns that are present both in the indexed view and the base tables, and it determines that using the indexed view is the best method for executing the query. Please perform due diligence prior to using hints.
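A minimal sketch of the hint in use (same sample view; the search terms are illustrative):

--Force the optimizer to use the indexed view (and its indexes) directly.
SELECT CUSTID, CUSTNAME
FROM dbo.[v_CUSTOMERINFO_AL] WITH (NOEXPAND)
WHERE CONTAINS (CUSTINFO, 'Mobile OR Address');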
Full-Text Index Maintenance and validation

Types of Full-Text Index population
How to Improve the Performance of Full-Text Indexes

Sample query that helps in full-text maintenance and validation:

--How to check full-text index population status.
SELECT object_name(object_id) AS tablename,
       change_tracking_state_desc, has_crawl_completed, crawl_type_desc, crawl_start_date, crawl_end_date
FROM sys.fulltext_indexes
WHERE object_name(object_id) LIKE 'v_CUSTOMERINFO%';

Additional Information

If more advanced AI-based search capabilities are needed, Azure provides solutions like Azure AI Search. These are offerings separate from Azure SQL and must be deployed and integrated appropriately. If the use case involves vector search, Azure SQL DB can work with vectors; details are here. As SQL Server Full-Text functionality keeps advancing, specific features get deprecated; these can be found here. When building new applications using Full-Text Search, we recommend considering this list to future-proof your design.

Feedback and suggestions

If you have feedback or suggestions for improving this data migration asset, please contact the Databases SQL Ninja Engineering Team (datasqlninja@microsoft.com). Thanks for your support!

Note: For additional information about migrating various source databases to Azure, see the Azure Database Migration Guide.

Resolving ADF Copy Upsert Failures in SQL DB for XML Columns

Customers modernizing their Oracle, Mainframe/Midrange Db2, Sybase, and other databases to Azure SQL often use Azure Data Factory (ADF) for data transfer. For large datasets, an initial copy is made a few days prior, followed by incremental data copies during cutover. The ADF Copy activity's Upsert method is common for these incremental transfers, but it fails for tables that contain XML columns. This blog explains how to use Stored Procedure as the sink behavior in the Copy activity to perform the upsert into Azure SQL Database. The same method can be used even when the target is Azure SQL MI or Azure SQL VM.
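A minimal sketch of that pattern, assuming a hypothetical dbo.Orders target table with an XML column (the table type, procedure, and column names are illustrative; the Copy activity is configured to pass each incoming batch as a table-valued parameter):

--Table type matching the columns the Copy activity sends.
CREATE TYPE dbo.OrdersType AS TABLE
(
    OrderId  INT PRIMARY KEY,
    OrderXml XML
);
GO
--Sink stored procedure: upsert via MERGE.
CREATE PROCEDURE dbo.usp_UpsertOrders
    @Orders dbo.OrdersType READONLY
AS
BEGIN
    MERGE dbo.Orders AS tgt
    USING @Orders AS src
        ON tgt.OrderId = src.OrderId
    WHEN MATCHED THEN
        UPDATE SET tgt.OrderXml = src.OrderXml
    WHEN NOT MATCHED THEN
        INSERT (OrderId, OrderXml)
        VALUES (src.OrderId, src.OrderXml);
END
GO

Because the comparison and assignment happen in T-SQL, the XML column is handled like any other column, sidestepping the Upsert limitation.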
Database Schema Compare Tool

Migration to Azure SQL is far easier than before: the availability of schema migration, database migration, database comparison, and schema comparison tools from Microsoft makes migration simpler and lower-risk. This blog describes how the Database Schema Compare tool can be used to perform a Db2-to-Azure-SQL schema comparison.