
sonstevold
Copper Contributor
Feb 21, 2025

Keyword Dictionaries in Custom SITs with PowerShell

Does anyone have experience using self-made keyword dictionaries when creating custom sensitive information types with PowerShell?

I've spent a couple of days trying to make this work, but I can't find a way to use any of the values returned by Get-DlpKeywordDictionary in either the PowerShell script or the XML file that defines the rule package.

In the Purview GUI, I know I can create the rules one by one, copying the previous one to build the structure I want and using the keyword dictionaries I've uploaded as both primary and secondary elements. That gives a rule package complex enough to avoid a flood of false positives.

My issue relates to the Scandinavian languages and the lack of suitable default SITs from Microsoft. I have built my own, but the GUI is slow, and I want to automate the creation of the dictionaries, DLP rules, sensitivity labels and so on to speed things up when helping a new organization implement this.

I have started looking at making the creation of the dictionaries part of the XML files that also create the rule packages, and then adjusting them as needed from the GUI, but being able to accomplish what I'm actually after would be preferred.

  • luchete
    Steel Contributor

    Hello sonstevold,

    I’ve worked once with custom keyword dictionaries in PowerShell when creating sensitive information types, and I understand the challenge you're facing. The issue is that you can't directly pull values from Get-DlpKeywordDictionary into the PowerShell script or XML. Unfortunately, the default cmdlets don't offer a straightforward way to integrate them into rule packages automatically.

    I mean, you can use PowerShell to upload your dictionaries and reference them in your rules, but the tricky part is automating the full process, especially for complex rules to avoid false positives. It sounds like you’re on the right track by trying to integrate dictionary creation into the XML files. You might need to manipulate the rule XML structure manually or explore more advanced APIs to streamline the process.
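
    As an illustration of manipulating the rule XML from a script, here is a minimal sketch that builds a <Keyword> element from a list of terms. New-KeywordListXml and its parameters are hypothetical names made up for this example, and the default match style is an arbitrary choice.

    ```powershell
    # Hypothetical helper: builds a <Keyword> element from a list of terms.
    function New-KeywordListXml {
        param(
            [Parameter(Mandatory)] [string]   $Id,
            [Parameter(Mandatory)] [string[]] $Terms,
            [ValidateSet('word', 'string')] [string] $MatchStyle = 'word'
        )
        $lines = @("<Keyword id=`"$Id`">", "  <Group matchStyle=`"$MatchStyle`">")
        foreach ($term in $Terms) {
            # Escape &, <, > and quotes so a term cannot break the XML
            $escaped = [System.Security.SecurityElement]::Escape($term)
            $lines += "    <Term>$escaped</Term>"
        }
        $lines += '  </Group>', '</Keyword>'
        return ($lines -join "`n")
    }

    # Example call with made-up terms
    New-KeywordListXml -Id 'Keyword_Example' -Terms @('ADHD', 'Zika virus') -MatchStyle 'string'
    ```

    The returned text could then be pasted or injected into the <Rules> section of a rule package file, with the terms read from a plain text file per list.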

    • sonstevold
      Copper Contributor

      Thanks luchete!

      I think I've ended up with something along the lines we've both mentioned so far.

      I'm a novice here, but I've had a thorough look at the documentation and used AI assistance to write a short PowerShell script that loops through the rule packs I want to implement from local storage. It prints the first 16 lines of each XML file so I can see what I'm about to implement, generates new GUIDs for RulePackID and EntityID, fills in the static Publisher information, and then creates the actual rule packages. Any input on the approach is welcome.

      # Set the folder path where XML files are located
      $folderPath = "/Files/Scripts/RulePacks/"
      
      # Get all XML files in the folder (excluding subfolders)
      $xmlFiles = Get-ChildItem -Path $folderPath -Filter "*.xml" -File
      
      # Loop through each XML file and print the first 16 lines for inspection
      $xmlFiles | ForEach-Object {
          # Read the first 16 lines from the XML file
          $first16Lines = Get-Content -Path $_.FullName -TotalCount 16
      
          # Output the first 16 lines with file name
          Write-Host "`n First 16 lines from $($_.Name):"
          $first16Lines
      }
      
      # Iterate over each XML file
      foreach ($xmlFile in $xmlFiles) {
          # Read as UTF-8 so Norwegian characters (æ, ø, å) survive the round trip
          $xmlContent = Get-Content -Path $xmlFile.FullName -Raw -Encoding UTF8
      
          # Create a unique RulePack ID and Entity ID; the Publisher ID and name are static
          $RulePackID = [guid]::NewGuid().ToString()
          $PublisherID = "XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
          $PublisherName = "MyCompany"
          $EntityID = [guid]::NewGuid().ToString()
      
          # Replace placeholders in the XML content with the actual values
          $xmlContent = $xmlContent -replace '\$RulePackID', $RulePackID
          $xmlContent = $xmlContent -replace '\$PublisherID', $PublisherID
          $xmlContent = $xmlContent -replace '\$PublisherName', $PublisherName
          $xmlContent = $xmlContent -replace '\$EntityID', $EntityID
      
          # Write the updated content back to the same XML file, again as UTF-8
          Set-Content -Path $xmlFile.FullName -Value $xmlContent -Encoding UTF8
      
          Write-Host "Updated XML file: $($xmlFile.FullName)"
      }
      
      # Loop through each XML file
      foreach ($xmlFile in $xmlFiles) {
          try {
              # Check if file exists before attempting to load
              if (Test-Path $xmlFile.FullName) {
                  # Import the updated XML file; -ErrorAction Stop makes any failure reach the catch block
                  New-DlpSensitiveInformationTypeRulePackage -FileData ([System.IO.File]::ReadAllBytes($xmlFile.FullName)) -ErrorAction Stop
                  
                  Write-Host "Successfully uploaded Custom SIT for file: $($xmlFile.Name)"
              } else {
                  Write-Host "The file does not exist: $($xmlFile.FullName)"
              }
          }
          catch {
              Write-Host "Failed to upload Custom SIT for file: $($xmlFile.Name). Error: $_"
          }
      }
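
      One possible refinement of the script above, as a sketch only: because the script overwrites the XML files in place, the placeholders are gone after the first run. The hypothetical helper below (Expand-RulePackTemplate, with made-up parameter names) fills the placeholders into a copy in a separate output folder and checks that the result is still well-formed XML before anything is uploaded.

      ```powershell
      # Hypothetical helper: fills the placeholders of one template and writes the
      # result to a separate output folder, leaving the template untouched.
      function Expand-RulePackTemplate {
          param(
              [Parameter(Mandatory)] [string]    $TemplatePath,
              [Parameter(Mandatory)] [string]    $OutputFolder,
              [Parameter(Mandatory)] [hashtable] $Values
          )
          $content = Get-Content -Path $TemplatePath -Raw -Encoding UTF8
          foreach ($name in $Values.Keys) {
              # Literal replacement, so the '$' prefix needs no regex escaping
              $content = $content.Replace('$' + $name, [string]$Values[$name])
          }
          # Casting to [xml] throws if the result is not well-formed XML
          $null = [xml]$content
          $outPath = Join-Path $OutputFolder (Split-Path $TemplatePath -Leaf)
          Set-Content -Path $outPath -Value $content -Encoding UTF8
          return $outPath
      }

      # Usage sketch (paths and values are placeholders):
      # $values = @{ RulePackID = [guid]::NewGuid(); EntityID = [guid]::NewGuid(); PublisherName = 'MyCompany' }
      # Expand-RulePackTemplate -TemplatePath './Templates/medical.xml' -OutputFolder './Rendered' -Values $values
      ```

      The upload loop would then read from the output folder, and the same template can be rendered again for the next organization.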

      Further, I've looked at the Sensitive Information Types exported from Microsoft and compared the structure of the actual rules with the example on Microsoft Learn. I've accepted defeat, and acknowledged some upsides of using keyword lists instead of keyword dictionaries. For one, it's possible to choose between "word match" and "string match", and I have yet to run into the size limit for a keyword list, which I can't find specified anywhere. It should still be possible to accomplish what I originally wanted by referencing the "Identity" value of the keyword dictionary.
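
      A minimal sketch of that Identity idea, untested against a real tenant. It assumes the rule package text contains a made-up placeholder such as $DictDiseases wherever the dictionary should be referenced, and it assumes (going by the keyword dictionary article on Microsoft Learn) that the dictionary's Identity GUID is accepted as an idRef value. Set-DictionaryReference is a hypothetical helper, not a built-in cmdlet, and the GUID below is a dummy.

      ```powershell
      # Hypothetical helper: swaps a placeholder such as $DictDiseases for a
      # keyword dictionary GUID in the rule package text.
      function Set-DictionaryReference {
          param(
              [Parameter(Mandatory)] [string] $XmlText,
              [Parameter(Mandatory)] [string] $Placeholder,
              [Parameter(Mandatory)] [guid]   $DictionaryId
          )
          return $XmlText.Replace($Placeholder, $DictionaryId.ToString())
      }

      # Example with a dummy GUID
      Set-DictionaryReference -XmlText '<IdMatch idRef="$DictDiseases" />' -Placeholder '$DictDiseases' -DictionaryId '11111111-2222-3333-4444-555555555555'

      # Usage sketch against a real tenant (needs a Security & Compliance PowerShell
      # session, so it is left commented out; 'Diseases' is a placeholder name):
      # $dict = Get-DlpKeywordDictionary -Name 'Diseases'
      # $xmlContent = Set-DictionaryReference -XmlText $xmlContent -Placeholder '$DictDiseases' -DictionaryId "$($dict.Identity)"
      ```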

      The alternative I chose was to switch to keyword lists and pack everything into one XML file per rule package, so I can easily choose which ones to apply.

       

      <?xml version="1.0" encoding="utf-8"?>
      <RulePackage xmlns="http://schemas.microsoft.com/office/2011/mce">
          <RulePack id="2bf706a0-e0a4-4858-b11d-8cf0275ea973">
              <Version build="0" major="1" minor="0" revision="0"/>
              <Publisher id="XXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"/>
              <Details defaultLangCode="nb-no">
                  <LocalizedDetails langcode="nb-no">
                  <PublisherName>MyCompany</PublisherName>
                      <Name>Norwegian Custom Medical</Name>
                      <Description>Custom sensitive information types for Norwegian medical context</Description>
                  </LocalizedDetails>
              </Details>
          </RulePack>
      
          <Rules>
              <Entity id="226109b0-0a96-4d0d-a459-d770cc412dc5" patternsProximity="300" recommendedConfidence="85">
              
              <!-- Pattern 1: Low Confidence (60) - Matches based on the Norwegian Medical Diseases list -->
              <Pattern confidenceLevel="60">
                  <IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
              </Pattern>
              
              <!-- Pattern 2: Medium Confidence (70) - Matches based on Norwegian Medical Diseases with support from Norwegian Medical Facilities -->
              <Pattern confidenceLevel="70">
                  <IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
                  <Match idRef="Keyword_Norwegian_Medical_Facilities" />
              </Pattern>
      
              <!-- Pattern 3: High Confidence (80) - Matches multiple keyword lists including diseases, facilities, and treatment-related terms -->
              <Pattern confidenceLevel="80">
                  <IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
                  <Match idRef="Keyword_Norwegian_Medical_Facilities" />
                  <Any minMatches="1">
                      <Match idRef="Keyword_Norwegian_Medical_Pharmaceuticals" />
                      <Match idRef="Keyword_Norwegian_Medical_Diagnosis_Codes" />
                      <Match idRef="Keyword_Norwegian_Medical_Treatments" />
                  </Any>
              </Pattern>
      
              <!-- Pattern 4: High Confidence (85) - Adds a regex for Norwegian date format to ensure more accurate detection of sensitive medical data -->
              <Pattern confidenceLevel="85">
                  <IdMatch idRef="Keyword_Norwegian_Medical_Diseases" />
                  <Match idRef="Keyword_Norwegian_Medical_Facilities" />
                  <Any minMatches="1">
                      <Match idRef="Keyword_Norwegian_Medical_Pharmaceuticals" />
                      <Match idRef="Keyword_Norwegian_Medical_Diagnosis_Codes" />
                      <Match idRef="Keyword_Norwegian_Medical_Treatments" />
                  </Any>
                  <Match idRef="Regex_norwegian_date" /> <!-- Norwegian Date Regex to match medical documents that are date-stamped -->
              </Pattern>
      
              </Entity>
      
              <!-- Keyword definitions for Norwegian Medical Diseases -->
              <Keyword id="Keyword_Norwegian_Medical_Diseases">
                  <Group matchStyle="string">
                    <Term>ADHD</Term>
                    <Term>Zika virus</Term>
                  </Group>
              </Keyword>
      
              <!-- Keyword definitions for Norwegian Medical Pharmaceuticals -->
              <Keyword id="Keyword_Norwegian_Medical_Pharmaceuticals">
                  <Group matchStyle="string">
                      <Term>medisin</Term>
                      <Term>tablett</Term>
                  </Group>
              </Keyword>
      
              <!-- Keyword definitions for Norwegian Medical Diagnosis Codes -->
              <Keyword id="Keyword_Norwegian_Medical_Diagnosis_Codes">
                  <Group matchStyle="string">
                      <Term>ICD-10 C18</Term>
                      <Term>ICD-10 C34</Term>
                  </Group>
              </Keyword>
      
              <!-- Keyword definitions for Norwegian Medical Treatments -->
              <Keyword id="Keyword_Norwegian_Medical_Treatments">
                  <Group matchStyle="string">
                      <Term>CT-scan</Term>
                      <Term>Dialyse</Term>
                  </Group>
              </Keyword>
      
              <!-- Keyword definitions for Norwegian Medical Facilities -->
              <Keyword id="Keyword_Norwegian_Medical_Facilities">
                  <Group matchStyle="string">
                      <Term>Akershus universitetssykehus</Term>
                      <Term>Apotek</Term>
                  </Group>
              </Keyword>
      
              <!-- Regex definition for Norwegian date format: matches dd.mm.yy, dd/mm/yy, dd.mm.yyyy and dd/mm/yyyy (a hyphen separator is not matched) -->
              <Regex id="Regex_norwegian_date">\d{2}[./]\d{2}[./]\d{2,4}</Regex>
      
              <LocalizedStrings>
                  <Resource idRef="226109b0-0a96-4d0d-a459-d770cc412dc5">
                      <Name default="true" langcode="nb-no">Norwegian Custom Medical</Name>
                      <Description default="true" langcode="nb-no">
                          Custom sensitive information types for Norwegian medical context.
                      </Description>
                  </Resource>
              </LocalizedStrings>
          </Rules>
      </RulePackage>
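
      With this many lists in one file, a mistyped idRef is easy to miss. Below is a minimal sketch of a local check to run before uploading. Get-UnresolvedIdRef is a hypothetical helper that lists every idRef used by an IdMatch or Match element that has no Keyword or Regex with the same id in the same file.

      ```powershell
      # Hypothetical helper: lists idRef values that no Keyword or Regex in the
      # same rule package defines.
      function Get-UnresolvedIdRef {
          param([Parameter(Mandatory)] [xml] $RulePackage)

          # local-name() sidesteps the default namespace on <RulePackage>
          $defined = $RulePackage.SelectNodes("//*[local-name()='Keyword' or local-name()='Regex']") |
              ForEach-Object { $_.GetAttribute('id') }
          $used = $RulePackage.SelectNodes("//*[local-name()='IdMatch' or local-name()='Match']") |
              ForEach-Object { $_.GetAttribute('idRef') }

          $used | Where-Object { $_ -notin $defined } | Sort-Object -Unique
      }

      # Usage sketch (the path is a placeholder):
      # Get-UnresolvedIdRef -RulePackage ([xml](Get-Content -Path './Rendered/medical.xml' -Raw -Encoding UTF8))
      ```

      Anything defined outside the file, such as a keyword dictionary GUID, would also be reported, so the output is a list to review and not a hard failure.
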
      • luchete
        Steel Contributor

        Hey sonstevold,

        Thanks for sharing your approach! It’s great to see you’re making progress with the PowerShell script, especially automating the GUID updates and handling the XML structure. Switching to Keyword Lists seems like a smart move for flexibility, and your use of patterns to adjust confidence levels will definitely help minimize false positives.

        Keep me updated on how it goes from here. 

        Regards!
