A way to filter gibberish Emails and Usernames

Copper Contributor

Hello,

 

At my job I frequently have to go through "fake" accounts and deactivate them. To investigate them I export them into an Excel Doc and while I have some Conditional Formatting to catch the frequent repeat accounts it is not enough to catch all of them. I occasionally have to manually review anywhere from 1k to 5k accounts.

I was wondering if there is a formula to identify gibberish email names or usernames. Essentially they look like this and are separated by their own columns: EMAIL: YASFGE1245 and Username: Nm8878.

I thought about using a VLOOKUP to compare to a list of known patterns but there are so many that it can cause our slow computers to lock up.

12 Replies

@KumaMatata 

Identifying gibberish or fake accounts based on patterns in email names or usernames can be challenging, as there are countless variations. However, you can use some Excel functions and formulas to create a set of rules that might help in flagging potential gibberish.

Here is a basic example to get you started:

Let's assume your email addresses are in column A, and usernames are in column B. You can add a new column (let's say column C) with a formula to check for patterns. Use the following formula in cell C2 and drag it down:

=IF(AND(ISNUMBER(SEARCH("^[A-Za-z]{3,}[0-9]{4,}$", A2)), ISNUMBER(SEARCH("^[A-Za-z]{2,}[0-9]{4,}$", B2))), "Potential Gibberish", "Normal")

This formula checks if both the email and username follow a pattern of starting with letters and ending with numbers. You might need to adjust these patterns based on the specific characteristics of the accounts you consider gibberish.

Explanation of the formula:

  • SEARCH("^[A-Za-z]{3,}[0-9]{4,}$", A2): Checks if the email starts with at least 3 letters and ends with at least 4 numbers.
  • SEARCH("^[A-Za-z]{2,}[0-9]{4,}$", B2): Checks if the username starts with at least 2 letters and ends with at least 4 numbers.
  • AND(ISNUMBER(...), ISNUMBER(...)): Checks if both conditions are true.
  • If both conditions are true, it returns "Potential Gibberish"; otherwise, it returns "Normal".

Feel free to adjust the patterns in the SEARCH functions based on your observations of gibberish accounts. This is just a starting point, and you may need to refine the criteria based on the specific characteristics of the accounts you encounter. The text was revised with the AI.

 

My answers are voluntary and without guarantee!

 

Hope this will help you.

 

Was the answer useful? Mark as best response and like it!

This will help all forum participants.

Thankyou @NikolinoDE! Is there a way to make it also check for gibberish names that dont have numbers in them like dxus or gjdfvxbtgvxcg?

@KumaMatata 

To check for gibberish names that don't have numbers, you can modify the formula to include a condition for names without numbers.

Here is an updated formula:
=IF(

    AND(

        OR(ISNUMBER(SEARCH("^[A-Za-z]{3,}[0-9]{4,}$", A2)), ISNUMBER(SEARCH("^[A-Za-z]{2,}[0-9]{4,}$", B2))),

        NOT(OR(ISNUMBER(SEARCH("[0-9]", A2)), ISNUMBER(SEARCH("[0-9]", B2))))

    ),

    "Potential Gibberish",

    "Normal"

)

 

Explanation:

  • OR(ISNUMBER(SEARCH("^[A-Za-z]{3,}[0-9]{4,}$", A2)), ISNUMBER(SEARCH("^[A-Za-z]{2,}[0-9]{4,}$", B2))): Checks if either the email or username follows the pattern of starting with letters and ending with numbers.
  • NOT(OR(ISNUMBER(SEARCH("[0-9]", A2)), ISNUMBER(SEARCH("[0-9]", B2)))): Checks if neither the email nor the username contains any numbers.
  • AND(..., ...): Checks if both conditions are true.

If both conditions are true, it returns "Potential Gibberish"; otherwise, it returns "Normal".

This formula now considers accounts with names that don't have numbers as potential gibberish. Adjustments can be made based on your specific observations and patterns of gibberish names.

Hi @NikolinoDE 

 

Liked your post as I learned something with your SEARCH("^[A-Za-z]{3,}[0-9]{4,}$", A2)

 

Quick questions if you don't mind...
#1 SEARCH being case-insensitive wouldn't the following work as well?

SEARCH("^[a-z]{3,}[0-9]{4,}$", A2)
or
SEARCH("^[A-Z]{3,}[0-9]{4,}$", A2)

 

#2 What's the meaning of the $ sign at the end? Can you provide a pointer to doc. or briefly explain?

 

Thanks Much

@Lorenzo 

  1. Case Insensitivity in SEARCH:

Yes, you're correct. The SEARCH function in Excel is case-insensitive by default. Therefore, both of the following patterns would work:

  • SEARCH("^[a-z]{3,}[0-9]{4,}$", A2): This pattern checks if the string starts with at least 3 lowercase letters and ends with at least 4 numbers.
  • SEARCH("^[A-Z]{3,}[0-9]{4,}$", A2): This pattern checks if the string starts with at least 3 uppercase letters and ends with at least 4 numbers.

You can use either of these patterns based on your specific needs and the characteristics of the data you're working with.

  1. Meaning of the $ sign:

In regular expressions (regex), the $ sign represents the end of a line or string. In the context of the pattern ^[A-Za-z]{3,}[0-9]{4,}$:

  • ^[A-Za-z]{3,}: This part ensures that the string starts (^) with at least 3 uppercase or lowercase letters.
  • [0-9]{4,}$: This part ensures that the string ends ($) with at least 4 numeric digits.

So, the $ ensures that the pattern is matched only when the specified conditions apply from the beginning to the end of the string.

For either version I cannot seem to get it to "trigger" it to mark an account as "Potential Gibberish". For example the following variations I have tried:

EMAILS (Assume there is some sort of domain after the @):
kouoi26@
cdsa@
adsvgabaw@
gregfdf651dhedr@
jsrtdgftf23@

Usernames:
Nm1234
J23w2365
random_name1

I tried brute forcing a change to "Potential Gibberish" but nothing I did changed it. Is there something I am missing?

@KumaMatata 

It looks like the pattern might need some adjustments to better match the criteria you're looking for. The key is to refine the regular expression to capture the patterns you consider as potential gibberish. Based on the examples you provided, here's a modified formula:

This formula checks for two patterns:

=IF(
    AND(
        OR(ISNUMBER(SEARCH("^[A-Za-z]{2,}[0-9]{1,}@[a-zA-Z]+\.[a-zA-Z]+$", A2)), ISNUMBER(SEARCH("^[A-Za-z]{1,}[0-9]{1,}$", B2))),
        NOT(OR(ISNUMBER(SEARCH("[0-9]", A2)), ISNUMBER(SEARCH("[0-9]", B2))))
    ),
    "Potential Gibberish",
    "Normal"
)
  1. For Emails:
    • Starts with at least 2 letters.
    • Followed by at least 1 number.
    • Contains the "@" symbol.
    • Followed by at least one letter for the domain.
    • Followed by a period "." and at least one more letter for the domain.
  2. For Usernames:
    • Starts with at least 1 letter.
    • Followed by at least 1 number.

This formula now accounts for the email structure with a domain part and should flag cases that don't match these patterns as "Potential Gibberish."

You may need to further refine the patterns based on your specific data characteristics. Feel free to experiment and adjust the regular expressions to match your requirements. Regular expressions can sometimes be a bit tricky, and trial and error is often needed to get the exact match you're looking for.

@NikolinoDE 

(THANKS for all your explanations - Liked)

I feel embarrassed to ask but after reading re. RegEx and trying even very simplified versions of your formula I keep getting #VALUE! errors:

Sample2.png


Wonder where my bad is or if I missed something???


In case this would matter I run 365 Personal/Windows v2310 Build 16924.20124 (so not Insider)
and only Add-in currently installed is DAX Studio

@Lorenzo To the best of my knowledge (and after a quick Google search), regular expressions or RegEx is not supported by any of the built-in functions in Excel. However, there are plenty of VBA and UDF solutions available online. Regarding Add-Ins, though, when I searched for RegEx in the Add-Ins Store, nothing promising came up.

 

When using "^[a-z]" in the SEARCH function, you are searching for the literal string "^[a-z]". For example, if the string in cell A2 of your screenshot were "abc^[a-z]def", the SEARCH function would return 4.

 

Unfortunately, the SEARCH function only works with basic wildcards (*?~) and does not appear to work with RegEx notation.

@djclements 

Appreciated. We obviously made the same searches and I also checked the Insider Blog. I just want(ed) to double-check I did not miss/misunderstand something as @NikolinoDE - nowhere in his 1st proposal to the OP - mentioned any specific requirement and obviously the latter had no issue implementing the various proposals (???)

 

Hopefully @NikolinoDE will clarify things...

@djclements 

You are correct, and I appreciate your clarification. I apologize for any confusion caused by my previous responses :sad:. The SEARCH function in Excel does not support full regular expressions; it primarily performs simple substring searches.

To achieve more complex pattern matching, especially using regular expressions, a more suitable approach would be to utilize VBA within Excel or a third-party tool that supports regex.

Here's an example of a VBA code snippet that you can use in Excel:

Vba code is untested, please backup your file first.

Function IsPotentialGibberish(email As String, username As String) As String
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")

    ' Email pattern: at least 2 letters, followed by at least 1 number, then @, some letters, dot, more letters
    regex.Pattern = "^[A-Za-z]{2,}[0-9]{1,}@[a-zA-Z]+\.[a-zA-Z]+$"
    If regex.Test(email) And Not (regex.Test(username) And regex.Test(username)) Then
        IsPotentialGibberish = "Potential Gibberish"
    Else
        IsPotentialGibberish = "Normal"
    End If
End Function

You can use this function in your Excel worksheet. Assuming your email is in cell A2 and the username is in cell B2, you would enter the following formula in another cell:

=IsPotentialGibberish(A2, B2)

This function uses the VBScript.RegExp object to perform regular expression-based pattern matching. It checks both the email and username against the specified pattern and returns "Potential Gibberish" if the conditions are met.

Remember to enable macros and use the VBA editor to insert this code into a module in your Excel workbook.

Again, I appreciate your understanding, and I hope this solution better fits your needs.

@NikolinoDE, Thanks for your clarification & no worries at all re. the confusion