The RegEx flat file source is one of the new http://www.codeplex.com/SQLSrvIntegrationSrv for SQL Server 2008 we’ve published to Codeplex. It uses http://msdn.microsoft.com/en-us/library/hs600312.aspx to extract values from a text file. It works similar to the flat file source, except that it’s not limited to CSV-type files (I saw a demo where the developer who created the sample used it to extract class names from a source files).
Note, when I ran the installer for the sample, it didn’t put the RegExFlatFileSource.dll file under 100\DTS\PipelineComponents. It only put the source down. We’ll either update the installer, or update the docs on the site. To get the component, I opened the project in Visual Studio, and ran a build. The project has a post build step which places the DLL under the PipelineComponents directory, and runs gacutil.exe to place it in the GAC.
Once the sample is installed, you can add it to your toolbar in Visual Studio.
We’ll start off with a simple flat file example. My data looks like this:
value 1,1,2001-01-01
value 2,2,2002-02-02
value 3,3,2003-03-03
value 4,4,2004-04-04
value 5,5,2005-05-05
value 6,6,2006-06-06
value 7,7,2007-07-07
value 8,8,2008-08-08
Columns are defined using http://msdn.microsoft.com/en-us/library/bs2twtah.aspx . If I want the component to behave like a regular CSV parser, I can use a regular expression like this:
(\w+),(\w+),(\w+)
If I click on the Column Mappings tab, I can see four columns have been defined.
Like with regular expression matches, the first match (column 0) is the entire pattern (which is the whole line in this case). The next three columns are the groups I defined in my regex.
I can also provide default names for my columns by naming the groups. This regex is a little more specific, and adds names to the groups using the ?<name> syntax supported by .NET.
(?<text>.*?),(?<number>\d),(?<date>\d{4}-\d{2}-\d{2})
If I look back at the Column Mappings tab, I can see the columns now have names.
You’ll notice that the component provides two outputs – one for rows that match the pattern, and one for rows that don’t match. The non-matching row output will always have a single string column which contains the entire line.
Adding a data viewer to the Match output, I can see all of the column matches it made:
I really like this sample because it opens up a lot of data sources, like log files, that used to require custom parsing using a script component. I think it has a lot of potential!