curious7
Copper Contributor
Feb 18, 2025
Solved

What regex can I use to detect a UPN sent in email or in a shared document in OneDrive/SharePoint?

I need to create a regex for a sensitive information type (SIT) that detects when a UPN is sent in email or shared with external users in a document.

I created a primary element with the following patterns for single-level domains (e.g. user@localhost) and two-level domains (e.g. email address removed for privacy reasons):

Single level- <?\w+?\.?\w+@\w+>?

2 level- <?\w+?\.?\w+@\w+\.\w+>?
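As a quick sanity check outside Purview, the two patterns above can be run through Python's `re` module. This is only a sketch: Python's engine stands in for whatever engine the SIT uses, and `first.last@example.com` is a made-up sample address. It shows that, because `<?` and `>?` are optional, each pattern accepts both the bare and the bracketed form of an address.

```python
import re

# The two primary-element patterns from the post, unchanged.
SINGLE_LEVEL = re.compile(r"<?\w+?\.?\w+@\w+>?")
TWO_LEVEL = re.compile(r"<?\w+?\.?\w+@\w+\.\w+>?")


def first_match(pattern, text):
    """Return the first substring the pattern matches, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Hypothetical sample strings; only user@localhost comes from the post.
samples = [
    "user@localhost",
    "<user@localhost>",
    "first.last@example.com",
    "<first.last@example.com>",
]

for s in samples:
    print(f"{s!r}: single={first_match(SINGLE_LEVEL, s)!r} "
          f"two={first_match(TWO_LEVEL, s)!r}")
```

In this sketch the bracketed samples match in full, angle brackets included, so the primary element on its own does not separate reply-style addresses from bare UPNs.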

I have added a secondary element that must match at least one domain from our domain list (a keyword list).

I then added another secondary element that must not match the following regex, because I don't want to match the bracketed form that appears when replying to an email, such as "<email address removed for privacy reasons>":

Single level- <\w+?\.?\w+@\w+>

2 level- <\w+?\.?\w+@\w+\.\w+>

I also added the following additional checks, because I don't want to catch an email address in the format "<email address removed for privacy reasons>" when replying to an email:

"not start with" - "<"
"not ends with" - ">"

But when a user replies to an external user, the SIT still catches the UPN between the less-than and greater-than signs in the string "<email address removed for privacy reasons>". That string appears in every email reply to an external user, so I don't want the SIT to catch it.

What am I doing wrong, and how can I achieve this? This SIT will be used inside a DLP policy.

1 Reply

  • luchete
    Steel Contributor

    Hello curious7,

    The regex you're using to detect email addresses is too general: the optional "<?" and ">?" let it match an address whether or not it is wrapped in "<" and ">", so it cannot tell a reply-style address apart from a bare UPN.

    To fix this, make the regex itself exclude the reply format. One way is to use negative lookbehind and negative lookahead assertions, so that the address must not be preceded by "<" or followed by ">". You can do something like this:

    (?<!<)\b\w+(\.\w+)?@\w+(\.\w+)+\b(?!>)

    This regex ensures that the email address is not directly preceded by "<" and not directly followed by ">". This should stop the unwanted matches from email replies.
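The behaviour of this pattern can be checked with a short Python sketch. Python's `re` module is used here only as a stand-in for the engine behind the SIT, and the sample strings and the `matches` helper are made up for illustration.

```python
import re

# The suggested pattern, unchanged.
UPN_NOT_IN_BRACKETS = re.compile(r"(?<!<)\b\w+(\.\w+)?@\w+(\.\w+)+\b(?!>)")


def matches(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in UPN_NOT_IN_BRACKETS.finditer(text)]


# Bare address in running text: matched.
print(matches("Please contact user@example.com today"))
# -> ['user@example.com']

# Reply-style address wrapped in angle brackets: not matched.
print(matches("From: Some User <user@example.com>"))
# -> []

# Dotted local part plus a three-label domain inside brackets:
# the engine backtracks and returns a partial match.
print(matches("<first.last@mail.example.com>"))
# -> ['last@mail.example']
```

In this sketch the simple bracketed case is rejected as intended. The last sample shows one thing worth testing against real data: when the local part contains a dot and the domain has three or more labels, the match can start after the dot and stop before the final label, so a fragment of the bracketed address still matches.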

    Additionally, ensure that your secondary element for domain checks correctly matches your intended domains without being too broad. This can help prevent false positives.
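To illustrate what "too broad" can mean for the domain check, here is a minimal Python sketch that compares the whole domain of each match against a list, rather than looking for a domain string anywhere nearby. It is not Purview configuration: `ORG_DOMAINS`, `org_upns`, and the sample domains are hypothetical.

```python
import re

# Hypothetical stand-ins for the organisation's domain keyword list.
ORG_DOMAINS = {"example.com", "corp.example.org"}

# Group 1 captures the full domain part of the address.
UPN = re.compile(r"\b\w+(?:\.\w+)?@(\w+(?:\.\w+)+)\b")


def org_upns(text):
    """Return addresses whose entire domain is in ORG_DOMAINS."""
    return [
        m.group(0)
        for m in UPN.finditer(text)
        if m.group(1).lower() in ORG_DOMAINS
    ]


text = "user@example.com, user@notexample.com, user@example.com.evil.net"
print(org_upns(text))
```

Comparing the full domain means look-alike domains such as `notexample.com` or `example.com.evil.net` are not treated as belonging to the organisation, which a plain substring check would allow.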

    Regards!
