Forum Discussion
Keyword Dictionaries in Custom SITs with PowerShell
Thanks luchete!
I think I've ended up with something along the lines we've both mentioned so far.
I'm a novice here, but I've gone through the documentation thoroughly and, with some AI assistance, put together a short PowerShell script that loops through the rule pack(s) I want to implement from local storage. It prints the first 16 lines of each XML file so I can see what I'm about to implement, creates new GUIDs for RulePackID and EntityID, fills in the static Publisher information, and then creates the actual rule package(s). Any input on the approach is welcome.
# Set the folder path where XML files are located
$folderPath = "/Files/Scripts/RulePacks/"
# Get all XML files in the folder (excluding subfolders)
$xmlFiles = Get-ChildItem -Path $folderPath -Filter "*.xml" -File
# Loop through each XML file and print the first 16 lines for inspection
$xmlFiles | ForEach-Object {
# Read the first 16 lines from the XML file
$first16Lines = Get-Content -Path $_.FullName -TotalCount 16
# Output the first 16 lines with file name
Write-Host "`n First 16 lines from $($_.Name):"
$first16Lines
}
# Iterate over each XML file
foreach ($xmlFile in $xmlFiles) {
# Read the file as UTF-8 so Norwegian characters (æ, ø, å) survive the round trip
$xmlContent = Get-Content -Path $xmlFile.FullName -Raw -Encoding UTF8
# Create new GUIDs for the RulePack and the Entity; the Publisher ID and name are static
$RulePackID = [guid]::NewGuid().ToString()
$PublisherID = "XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
$PublisherName = "MyCompany"
$EntityID = [guid]::NewGuid().ToString()
# Replace placeholders in the XML content with the actual values
$xmlContent = $xmlContent -replace '\$RulePackID', $RulePackID
$xmlContent = $xmlContent -replace '\$PublisherID', $PublisherID
$xmlContent = $xmlContent -replace '\$PublisherName', $PublisherName
$xmlContent = $xmlContent -replace '\$EntityID', $EntityID
# Write the updated content back to the same XML file, matching the utf-8 encoding declared in the XML
Set-Content -Path $xmlFile.FullName -Value $xmlContent -Encoding UTF8
Write-Host "Updated XML file: $($xmlFile.FullName)"
}
# Loop through each XML file
foreach ($xmlFile in $xmlFiles) {
try {
# Check if file exists before attempting to load
if (Test-Path $xmlFile.FullName) {
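# Import the updated XML file; -ErrorAction Stop turns failures into terminating errors so the catch block runs
New-DlpSensitiveInformationTypeRulePackage -FileData ([System.IO.File]::ReadAllBytes($xmlFile.FullName)) -ErrorAction Stop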
Write-Host "Successfully uploaded Custom SIT for file: $($xmlFile.Name)"
} else {
Write-Host "The file does not exist: $($xmlFile.FullName)"
}
}
catch {
Write-Host "Failed to upload Custom SIT for file: $($xmlFile.Name). Error: $_"
}
}
Further, I've looked at the Sensitive Information Types exported from Microsoft, and compared the structure of the actual rules with the example on Microsoft Learn. I've accepted defeat on Keyword Dictionaries, and found some upsides to using Keyword Lists instead. For one, it's possible to choose between "word match" and "string match", and I have yet to run into the size limit for a Keyword List, which I can't find specified anywhere. According to the documentation, it should be possible to accomplish what I originally wanted: https://learn.microsoft.com/en-us/purview/sit-create-a-keyword-dictionary?tabs=purview#using-keyword-dictionaries-in-custom-sensitive-information-types-and-dlp-policies
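To make the "word match" versus "string match" difference concrete, here is a rough local sketch. It is only an approximation and does not reproduce Purview's matching engine: "string" is modelled as a plain substring test and "word" as a match bounded by non-word characters. The function name Test-KeywordMatch and the case-insensitive comparison are my own assumptions for illustration.

```powershell
# Hypothetical helper for local experiments; not a Purview cmdlet.
# 'string' = term may appear anywhere, even inside a longer word.
# 'word'   = term must be delimited by non-word characters (regex \b).
function Test-KeywordMatch {
    param(
        [Parameter(Mandatory)][string]$Text,
        [Parameter(Mandatory)][string]$Term,
        [ValidateSet('word', 'string')][string]$MatchStyle = 'word'
    )
    # Escape the term so characters like '-' or '.' are taken literally
    $escaped = [regex]::Escape($Term)
    $pattern = if ($MatchStyle -eq 'word') { "\b$escaped\b" } else { $escaped }
    # Case-insensitive comparison is an assumption of this sketch
    return [regex]::IsMatch($Text, $pattern, 'IgnoreCase')
}

Test-KeywordMatch -Text 'Resept hentet fra apoteket' -Term 'Apotek' -MatchStyle string  # True
Test-KeywordMatch -Text 'Resept hentet fra apoteket' -Term 'Apotek' -MatchStyle word    # False
```

With short terms such as ADHD or Apotek, a string match will also hit longer words that merely contain the term, so it may be worth trying a few sample sentences like this before deciding on the match style per keyword group.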
The alternative I chose was switching to Keyword Lists, and I ended up packing everything into one XML file per rule package, so I can easily choose which ones to apply.
<?xml version="1.0" encoding="utf-8"?>
<RulePackage xmlns="http://schemas.microsoft.com/office/2011/mce">
<RulePack id="2bf706a0-e0a4-4858-b11d-8cf0275ea973">
<Version build="0" major="1" minor="0" revision="0"/>
<Publisher id="XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"/>
<Details defaultLangCode="nb-no">
<LocalizedDetails langcode="nb-no">
<PublisherName>MyCompany</PublisherName>
<Name>Norwegian Custom Medical</Name>
<Description>Custom sensitive information types for Norwegian medical context</Description>
</LocalizedDetails>
</Details>
</RulePack>
<Rules>
<Entity id="226109b0-0a96-4d0d-a459-d770cc412dc5" patternsProximity="300" recommendedConfidence="85">
<!-- Pattern 1: Low Confidence (60) - Matches based on the Norwegian Medical Diseases list -->
<Pattern confidenceLevel="60">
<IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
</Pattern>
<!-- Pattern 2: Medium Confidence (70) - Matches based on Norwegian Medical Diseases with support from Norwegian Medical Facilities -->
<Pattern confidenceLevel="70">
<IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
<Match idRef="Keyword_Norwegian_Medical_Facilities" />
</Pattern>
<!-- Pattern 3: High Confidence (80) - Matches multiple keyword lists including diseases, facilities, and treatment-related terms -->
<Pattern confidenceLevel="80">
<IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
<Match idRef="Keyword_Norwegian_Medical_Facilities" />
<Any minMatches="1">
<Match idRef="Keyword_Norwegian_Medical_Pharmaceuticals" />
<Match idRef="Keyword_Norwegian_Medical_Diagnosis_Codes" />
<Match idRef="Keyword_Norwegian_Medical_Treatments" />
</Any>
</Pattern>
<!-- Pattern 4: High Confidence (85) - Adds a regex for Norwegian date format to ensure more accurate detection of sensitive medical data -->
<Pattern confidenceLevel="85">
<IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
<Match idRef="Keyword_Norwegian_Medical_Facilities" />
<Any minMatches="1">
<Match idRef="Keyword_Norwegian_Medical_Pharmaceuticals" />
<Match idRef="Keyword_Norwegian_Medical_Diagnosis_Codes" />
<Match idRef="Keyword_Norwegian_Medical_Treatments" />
</Any>
<Match idRef="Regex_norwegian_date" /> <!-- Norwegian Date Regex to match medical documents that are date-stamped -->
</Pattern>
</Entity>
<!-- Keyword definitions for Norwegian Medical Diseases -->
<Keyword id="Keyword_Norwegian_Medical_Diseases">
<Group matchStyle="string">
<Term>ADHD</Term>
<Term>Zika virus</Term>
</Group>
</Keyword>
<!-- Keyword definitions for Norwegian Medical Pharmaceuticals -->
<Keyword id="Keyword_Norwegian_Medical_Pharmaceuticals">
<Group matchStyle="string">
<Term>medisin</Term>
<Term>tablett</Term>
</Group>
</Keyword>
<!-- Keyword definitions for Norwegian Medical Diagnosis Codes -->
<Keyword id="Keyword_Norwegian_Medical_Diagnosis_Codes">
<Group matchStyle="string">
<Term>ICD-10 C18</Term>
<Term>ICD-10 C34</Term>
</Group>
</Keyword>
<!-- Keyword definitions for Norwegian Medical Treatments -->
<Keyword id="Keyword_Norwegian_Medical_Treatments">
<Group matchStyle="string">
<Term>CT-scan</Term>
<Term>Dialyse</Term>
</Group>
</Keyword>
<!-- Keyword definitions for Norwegian Medical Facilities -->
<Keyword id="Keyword_Norwegian_Medical_Facilities">
<Group matchStyle="string">
<Term>Akershus universitetssykehus</Term>
<Term>Apotek</Term>
</Group>
</Keyword>
<!-- Regex definition for Norwegian date format: matches dd/mm/yy, dd.mm.yy, dd/mm/yyyy and dd.mm.yyyy (a "-" separator is not matched) -->
<Regex id="Regex_norwegian_date">\d{2}[./]\d{2}[./]\d{2,4}</Regex>
<LocalizedStrings>
<Resource idRef="226109b0-0a96-4d0d-a459-d770cc412dc5">
<Name default="true" langcode="nb-no">Norwegian Custom Medical</Name>
<Description default="true" langcode="nb-no">
Custom sensitive information types for Norwegian medical context.
</Description>
</Resource>
</LocalizedStrings>
</Rules>
</RulePackage>
Hey sonstevold,
Thanks for sharing your approach! It’s great to see you’re making progress with the PowerShell script, especially automating the GUID updates and handling the XML structure. Switching to Keyword Lists seems like a smart move for flexibility, and your use of patterns to adjust confidence levels will definitely help minimize false positives.
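One thing that might be worth adding before the upload step is a quick local check that each file still parses as XML and that no placeholders were left behind by the replace step. Below is a minimal sketch, assuming the placeholder names from your script ($RulePackID, $PublisherID, $PublisherName, $EntityID); the function name Test-RulePackXml is made up for illustration and is not a Purview cmdlet.

```powershell
# Hypothetical pre-upload check (illustrative only).
# Returns $true when the file is well-formed XML and contains none of the
# placeholder tokens used by the replace step; otherwise warns and returns $false.
function Test-RulePackXml {
    param([Parameter(Mandatory)][string]$Path)

    $content = Get-Content -Path $Path -Raw -Encoding UTF8

    # The [xml] cast throws a terminating error when the document is not well-formed
    try {
        $null = [xml]$content
    }
    catch {
        Write-Warning "Not well-formed XML: $Path ($($_.Exception.Message))"
        return $false
    }

    # Look for placeholders that the replace step should have filled in
    $leftover = [regex]::Matches($content, '\$(RulePackID|PublisherID|PublisherName|EntityID)')
    if ($leftover.Count -gt 0) {
        $names = ($leftover | ForEach-Object { $_.Value } | Sort-Object -Unique) -join ', '
        Write-Warning "Unreplaced placeholders in ${Path}: $names"
        return $false
    }

    return $true
}

# Possible usage with the $xmlFiles collection from your script:
# $xmlFiles | Where-Object { Test-RulePackXml -Path $_.FullName }
```

This only checks well-formedness and leftover placeholders locally. It does not validate the file against the rule package schema, so the service can still reject a file that passes.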
Keep me updated on how it goes from here.
Regards!