Accessing a data profile programmatically
Published Mar 25 2019 02:04 PM 402 Views
Copper Contributor
First published on MSDN on Mar 13, 2008

The new Data Profiling Task in 2008 generates an XML output file. The output is easy to read, and could be used to make decisions within a control flow. For example, you could check whether or not to process your most recent data set based on criteria in the profile (are any values in certain columns null? Do my values fall within expected ranges?)

While you can parse this XML file in traditional ways, those of you who aren't afraid of using undocumented, unsupported APIs can make use of the classes in the Microsoft.SqlServer.DataProfiler assembly which we use to load the XML profile internally.

Let's use a simple scenario: We want to check one of the columns (AddressLine1) in our staging table to make sure that it contains no NULL values. If it is clean, we process it in a data flow task. If NULL values are found, we want to send an email to the DBA.

Here's what the control flow looks like:

We're first running our data profile, then processing the results in a script task. If we match our criteria (i.e. no nulls in the AddresLine1 column), the script task returns success and we run the "Process" data flow. If the criteria doesn't match, we fail the task, and run the send mail task instead.

Note, we can store the results of the data profiling task in a variable (as a string) instead of saving it out to disk.

Now the interesting part - the script task.

First, add a reference to the Microsoft.SqlServer.DataProfiler DLL. It can be found under %ProgramFiles%\Microsoft SQL Server\100\DTS\Binn\Microsoft.SqlServer.DataProfiler.dll (and should also be in the GAC).

The following code loads the data profile XML from the package variable, de-serializes it into a DataProfile object, and cycles through the profiles until it finds the one its looking for.

using Microsoft.DataDebugger.DataProfiling;



const string ColumnName = "AddressLine1";

readonly long Threshold = 0;



public void Main()

{

Dts.TaskResult = (int)ScriptResults.Success;



// Retrieve the profile from the package variable

string dataProfleXML = Dts.Variables["User::DataProfile"].Value.ToString();



// Deserialize

DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();

DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));



// Cycle through the profiles to find the one we're looking for

foreach (Profile p in profile.DataProfileOutput.Profiles)

{

// Check the profile type

if (p is ColumnNullRatioProfile)

{

// Match the column name

ColumnNullRatioProfile nullProfile = p as ColumnNullRatioProfile;

if (nullProfile.Column.Name.Equals(ColumnName, StringComparison.InvariantCultureIgnoreCase))

{

// Make sure it's within our threshold

if (nullProfile.NullCount > Threshold)

{

// Fail the task

Dts.TaskResult = (int)ScriptResults.Failure;

}



break;

}

}

}

}

<br/>.csharpcode, .csharpcode pre<br/>{<br/> font-size: small;<br/> color: black;<br/> font-family: consolas, "Courier New", courier, monospace;<br/> background-color: #ffffff;<br/> /*white-space: pre;*/<br/>}<br/>.csharpcode pre { margin: 0em; }<br/>.csharpcode .rem { color: #008000; }<br/>.csharpcode .kwrd { color: #0000ff; }<br/>.csharpcode .str { color: #006080; }<br/>.csharpcode .op { color: #0000c0; }<br/>.csharpcode .preproc { color: #cc6633; }<br/>.csharpcode .asp { background-color: #ffff00; }<br/>.csharpcode .html { color: #800000; }<br/>.csharpcode .attr { color: #ff0000; }<br/>.csharpcode .alt <br/>{<br/> background-color: #f4f4f4;<br/> width: 100%;<br/> margin: 0em;<br/>}<br/>.csharpcode .lnum { color: #606060; }

Note, as I mentioned above, this API is for internal use, and is subject to change (it already has between CTP5 and CTP6, and will change again in the upcoming CTP-Refresh). I just thought some people out there might find this interesting.

Version history
Last update:
‎Mar 25 2019 02:05 PM
Updated by: