What regex can I use to detect a UPN being sent in an email or shared document in OneDrive/SharePoint?

Question

I need to create a regex for a sensitive information type (SIT) that will detect whether a UPN is being sent in an email or shared with external users in a document.

I created a primary element with the following patterns for single-level domains (e.g. user@localhost) and two-level domains (e.g. email address removed for privacy reasons):

Single level- <?\w+?\.?\w+@\w+>?

2 level- <?\w+?\.?\w+@\w+\.\w+>?

I added a secondary element that must match at least one domain from our domain list (a keyword list).

I then added another secondary element so that the SIT does not match the following regex patterns, because I don't want to match the bracketed form that appears when replying to an email, such as "<email address removed for privacy reasons>":
Single level- <\w+?\.?\w+@\w+>

2 level- <\w+?\.?\w+@\w+\.\w+>

I also added these additional checks, because I don't want to catch an email address in the format "<email address removed for privacy reasons>" when replying to an email:

"not start with" - "<"
"not ends with" - ">"

But if a user replies to an external user, the SIT still catches the UPN between the less-than and greater-than signs in the string "<email address removed for privacy reasons>". Because that string appears in every email reply to an external user, I don't want the SIT to catch it.

What am I doing wrong and how can I achieve this? This SIT will be used inside DLP policy.

luchete · Accepted Answer

Hello curious7,

The regex you're using for detecting email addresses is too general and doesn't fully account for the context of replies with "<" and ">".

To fix this, I recommend adjusting your regex so it's more specific about excluding those reply formats. One way is to use lookahead and lookbehind assertions. Here's an approach:

Try adding a condition to ensure the email is not preceded by "<" and not followed by ">". You can do something like this:

(?<!<)\b\w+(\.\w+)?@\w+(\.\w+)+\b(?!>)

This regex ensures that the email address is not directly preceded by "<" and not directly followed by ">". This should stop the unwanted matches from email replies.

Additionally, ensure that your secondary element for domain checks correctly matches your intended domains without being too broad. This can help prevent false positives.

Regards!
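To see how the two patterns behave, here is a minimal sketch that runs them through Python's `re` module. Python is only a stand-in here: the regex engine behind a SIT may handle lookarounds differently, so the results should be rechecked in the SIT test tool. The sample addresses (`jane.doe@contoso.com` and so on) and the helper name `find_upns` are made up for illustration.

```python
import re

# Two-level pattern from the question: "<" and ">" are optional, so the
# pattern matches an address whether or not it is wrapped in brackets.
ORIGINAL_TWO_LEVEL = re.compile(r"<?\w+?\.?\w+@\w+\.\w+>?")

# Pattern from the answer: reject a match that starts right after "<"
# or ends right before ">".
SUGGESTED = re.compile(r"(?<!<)\b\w+(\.\w+)?@\w+(\.\w+)+\b(?!>)")


def find_upns(text: str) -> list[str]:
    """Return every substring of text matched by the suggested pattern."""
    return [m.group(0) for m in SUGGESTED.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the thread.
    samples = [
        "Contact jane.doe@contoso.com today",
        "From: Jane Doe <jane.doe@contoso.com>",
        "Send to user@localhost now",
        "<jane.doe@sub.contoso.com>",
    ]
    for sample in samples:
        print(repr(sample), "->", find_upns(sample))
```

In this sketch the original pattern matches the whole bracketed string `<jane.doe@contoso.com>`, which is the behaviour described in the question, while the suggested pattern returns nothing for it and still finds the bare address. Two limits show up as well, at least in Python's engine: the suggested pattern requires a dot in the domain, so `user@localhost` is not matched, and a bracketed address with an extra subdomain label (`<jane.doe@sub.contoso.com>`) still yields the partial match `doe@sub.contoso`, so the domain keyword list remains a useful second check.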