What are Random Strings?

Data analysis often presents us with the challenge of dealing with noisy inputs. This is particularly evident when working with large datasets of user inputs. Detecting and filtering out random string inputs can prove invaluable in various scenarios, such as data validation, quality control or the development of machine learning (ML) based natural language processing (NLP) systems (check for data quality improvement steps in our previous blog post).

Random string inputs can significantly impact the accuracy and effectiveness of data analysis and ML models. By identifying and filtering out these noise elements, we can enhance the reliability and quality of our results.

Image 1 - Random Strings

Image 1 – Random Strings

What should/can be considered a ‘random string’?

The term “random string” may seem straightforward at first glance, but its interpretation can be subjective and context-dependent. What one person perceives as a random string, another may see as a meaningful pattern or representation. This relativity is particularly evident when considering visual or linguistic aspects.

Let’s delve into a few examples to highlight the importance of relativity when defining random strings. Take, for instance, the string ‘\__o__/’. At first glance, it might appear random, but one could argue that it resembles a stick figure of a man with raised hands. In this case, the interpretation of randomness becomes subjective, as different individuals may perceive different patterns or associations.

Similarly, let’s consider the word “Szczesny.” In the English language, it may appear as a random combination of letters, devoid of any meaning. However, in Polish, it happens to be the surname of a famous goalkeeper (Image 2). This example demonstrates how the interpretation of randomness can be influenced by language and cultural context. What may seem random in one language could hold significance or meaning in another.

Image 2 - Wojciech Szczesny, a Polish professional footballer

Image 2 [Source: Yahoo/Reuters] – Wojciech Szczesny, a Polish professional footballer

It’s crucial to acknowledge that randomness is not an inherent property of a string itself but rather a perception based on patterns or associations that individuals may perceive. In the context of data analysis and ML, understanding this relativity is vital when developing algorithms or models that aim to detect and filter out random strings.

When designing systems to identify random strings, it becomes essential to consider the specific domain, context, and purpose of the analysis. What may be considered random noise in one scenario could be a valuable signal in another. For instance, when analyzing social media data, a sequence of random characters might indicate spam or bot-generated content. On the other hand, in cryptography, random strings play a crucial role in generating secure encryption keys.

Moreover, the concept of randomness itself can vary depending on the intended application. Some algorithms generate strings that may appear random to human observers but possess specific statistical properties that make them suitable for cryptographic purposes. In contrast, other applications require strings that exhibit specific patterns or structures to serve their intended function.

How to detect a random string of characters in a specific locale?

If the goal is to detect random strings which consist of a specific set of characters (e.g., English alphabet characters), one possible way of detecting random strings is using bigrams. 

To be able to do that, you need to create a reference dictionary of bigrams for the considered locale. For the English language, we can use bigrams and their frequencies from Peter Norvig’s analysis. For example, in English language text, the bigram “th” is very common, while the bigram “zx” is not. Using the available list of bigrams and their frequencies, it is possible to normalize it (e.g., min-max normalization) and create a reference dictionary. An important note here is that you can easily create your own list of bigrams (extracted from word corpus) for any locale if you are able to find open-source books, newspapers or magazines.

Reference dictionary is a dictionary of all 26² = 676 combinations of English bigrams with associated frequency numbers (from 0 to 100) where higher values represent more frequent bigrams (like “th”) and lower values represent less frequent bigrams (like “zx”). Users are able to adjust the threshold value (default value is 0.1) in order to define a boundary between less common and common bigrams. For example, if the threshold is 0, all bigrams will be considered as common bigrams. Otherwise, if the threshold is 100, all bigrams will be considered as less common bigrams. You can find a reference dictionary of bigrams here.

To consider whether a word is a random string or not, here is a list of steps that are used to help detect random strings:

  1. Length and Character Validation: The first checkpoint of the algorithm verifies that the word is longer than three characters (words up to three characters may be regular abbreviations) and consists solely of English alphabetic characters. This ensures that we are working with meaningful words rather than arbitrary combinations of symbols
  2. Repeating Characters: If the entire word is made up of the same character, such as “hhhhhhh”, it is immediately labeled as random. This is because the presence of repeating characters indicates a lack of complexity and structure, suggesting a random generation
  3. Lowercasing the Word: To eliminate the case sensitivity, the algorithm converts the word to lowercase. This prevents the distinction between uppercase and lowercase characters from affecting the detection process
  4. Bigrams Analysis: A bigram refers to a sequence of two consecutive characters in a word. The algorithm generates a list of all possible bigrams within the word. For example, the word “random” would produce the following bigrams: (ra, an, nd, do, om)
  5. Common and Uncommon Bigrams: Each bigram is compared against a reference dictionary of common English bigrams. If a bigram has a frequency above a given threshold, it is considered common; otherwise, it is classified as uncommon. The choice of threshold depends on the desired sensitivity of the algorithm
  6. Counting Common and Uncommon Bigrams: The algorithm calculates the number of common and uncommon bigrams in the word. If the count of common bigrams exceeds that of uncommon bigrams, the word is classified as non-random. Conversely, if the count of uncommon bigrams surpasses that of common bigrams, the word is deemed random

Here are Python and Java codes for the proposed algorithm:

Python:

def is_random_string(word, threshold=0.1):
   # Allow only words longer than 3 characters which contain only English alphabetic characters
   if len(word) < 4 or not word.isalpha():
       return False


   # Turn word into lowercase
   word = word.lower()


   # Repeating characters
   if len(set(word)) == 1:
           return True


   # Get list of bigrams from the word
   bigrams = [word[i:i + 2] for i in range(len(word) - 1)]


   # Get number of common and uncommon bigrams
   num_common_bigrams = sum(1 for bigram in bigrams if en_bigrams_dict.get(bigram, 0) > threshold)
   num_uncommon_bigrams = len(bigrams) - num_common_bigrams


   # Higher number wins
   if num_common_bigrams > num_uncommon_bigrams:
       return False
   else:
       return True

Java:

package checker;


import java.util.HashMap;
import java.util.Map;


public class RandomStringChecker {
   private static final double DEFAULT_THRESHOLD = 0.1;
   private static final Map<String, Double> enBigramsDict = new HashMap<>();


   static {
       // Populate the dictionary of English bigrams here
       enBigramsDict.put("ab", 6.461565670356169);
       enBigramsDict.put("bc", 0.0531330714265234);
       enBigramsDict.put("cd", 0.06273467822837461);
       // ...
   }


   public static boolean isRandomString(String word) {
       return isRandomString(word, DEFAULT_THRESHOLD);
   }


   public static boolean isRandomString(String word, double threshold) {
       // Allow only words longer than 3 characters which contain only English alphabetic characters
       if (word.length() < 4 || !word.matches("[a-zA-Z]+")) {
           return false;
       }


       // Repeating characters
       if (word.chars().distinct().count() == 1) {
           return true;
       }


       // Turn word into lowercase
       word = word.toLowerCase();


       // Get list of bigrams from the word
       String[] bigrams = new String[word.length() - 1];
       for (int i = 0; i < word.length() - 1; i++) {
           bigrams[i] = word.substring(i, i + 2);
       }


       // Get number of common and uncommon bigrams
       int numCommonBigrams = 0;
       for (String bigram : bigrams) {
           if (enBigramsDict.containsKey(bigram) && enBigramsDict.get(bigram) > threshold) {
               numCommonBigrams++;
           }
       }
       int numUncommonBigrams = bigrams.length - numCommonBigrams;


       // Higher number wins
       return numCommonBigrams <= numUncommonBigrams;
   }
}

This algorithm has been released as a Python package and can easily be installed using the command:

pip install random-string-detector

Examples

Python method calls:

print(is_random_string('abcd')) # True
print(is_random_string('Thgrbh')) # False
print(is_random_string('Thgrbh', 5)) # True - threshold adjusted

Java method call:

public static void main(String[] args) {
   System.out.println(isRandomString("abcd")); // true
}


Conclusion

The interpretation of randomness depends on the specific domain, context, and purpose of the analysis. What may be considered random noise in one scenario could be valuable information or a critical signal in another. It is essential to align the definition of random strings with the objectives and requirements of the particular application to ensure the reliability and quality of results.

While the proposed algorithm is not foolproof and cannot definitively determine randomness, it serves as a useful tool for various applications that require distinguishing between random and non-random strings.

A potential algorithm improvement would be to have the bigram dictionary in a separate file that we could load into memory.

Want to discuss this in relation to your project? Get in touch:

Leave a Reply