Hashing Data to Memorable Phrases

By Max Candocia | October 02, 2020

Have you ever wanted to hash/randomize a primary key or other private information, but also wanted to be able to refer to it with a memorable phrase? Probably not, but with my new package available on CRAN, keyToEnglish, now you can!

The package primarily revolves around one function, keyToEnglish(), that does the following:

  1. Convert input to character type if needed
  2. Use a hash function (default md5) to hash the input
  3. Break the hashes into substrings of small length (default 3) and convert to integer from hexadecimal values
  4. Map each integer to a string, then combine the strings together for each element of input
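The four steps above can be sketched in a few lines of R. This is only an illustration, not the package's implementation: the function name sketch_key_to_phrase, the 8-word list, and the modulo mapping in step 4 are all assumptions, and md5() is taken from the openssl package for convenience.

```r
library(openssl)  # assumed here for md5(); keyToEnglish may hash differently

# Hypothetical 8-word list, for illustration only
words <- c("Red", "Blue", "Green", "Yellow", "Orange", "Purple", "Pink", "Brown")

# Hypothetical helper that mirrors the four steps described above
sketch_key_to_phrase <- function(x, word_list, n_words = 5, chunk = 3) {
  x <- as.character(x)              # step 1: coerce input to character
  hashes <- as.character(md5(x))    # step 2: one 32-character hex digest per element
  starts <- seq(1, by = chunk, length.out = n_words)
  vapply(hashes, function(h) {
    pieces <- substring(h, starts, starts + chunk - 1)  # step 3: split into hex chunks
    ints <- strtoi(pieces, base = 16L)                  # ...and parse them as integers
    idx <- ints %% length(word_list) + 1                # step 4 (assumed): wrap into list range
    paste(word_list[idx], collapse = "")                # combine the words into one phrase
  }, character(1), USE.NAMES = FALSE)
}

phrases <- sketch_key_to_phrase(c("a@example.com", 42), words)
print(phrases)
```

Because the phrase depends only on the hash, the same input always yields the same phrase, which is what makes it usable as a stand-in key.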

Here's an example:

# install with `install.packages("keyToEnglish")`
library(keyToEnglish)

email_addresses = c(
  'anika_harmonica@themangoblues.com',
  'billy_silly@someclowncollege.com',
  'carys_ferrous@steel-rarebit.org',
  'diarra_tiara@coffeequeens.net',
  'eri_merry@joifish.org'
)

print(keyToEnglish(email_addresses))
## [1] "GeneralPurityTunnelSpellingFeeding"        "PrintInfusionAdverseEngraveCentral"        "IndependentEffectiveFlavorConsistWall"    
## [4] "HearPigmentCouncilTeacherPressing"         "UnitComplexionElderConstitutionFellowship"

Alternatively, you can provide a list of word lists, and the output will take one word from each list, strung together in list order. Note that for best results, the least common multiple of the sizes of all of the lists should be relatively small. I usually make my list sizes all powers of 2 in order to accomplish this.
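One possible reason power-of-2 sizes behave well can be checked directly. The sketch below assumes that each 3-character hex chunk (4,096 possible values) is reduced modulo the list length; that reduction is an assumption made for illustration, not something confirmed about the package.

```r
# All values a 3-character hex chunk can take: 0x000 to 0xfff
chunk_values <- 0:4095

# How many chunk values land on each word for a list of 8 vs. 10 words
counts_8 <- table(chunk_values %% 8)
counts_10 <- table(chunk_values %% 10)

print(range(counts_8))   # 512 512: every word is equally likely
print(range(counts_10))  # 409 410: some words are slightly favored
```

With a list size that divides 4,096 evenly, no word is favored over another, so none of the hash's entropy is wasted.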

# hash to a memorable sentence
# equivalent to
# print(hash_to_sentence(email_addresses))
print(keyToEnglish(email_addresses, word_list=wml_long_sentence))
## [1] "EruditeMoltenPetalResurrectsEmbossedLingonberries" "HelplessWideChicoryBifurcatesDitsyNecks"          
## [3] "CapriciousKitchPartnerGluesShinyGrime"             "EnchantedGlassRockstarChainsLaqueredGauntlets"    
## [5] "HauntedRainbowShinerObliteratesOrangeCounts"

You can also define your own word lists:

custom_word_lists = list(
  sizes=c('infintesimal','miniscule','tiny','small','average','big','huge','astronomical'),
  colors=c('red','blue','green','yellow','orange','purple','pink','brown'),
  nouns=c('monkey','parrot','kitty','newt','fish','buffalo','wasp','octopus'),
  of='of',
  nouns2=c('doom','love','chaos','happiness','anger','sadness','swoleness','alacrity')
)

keyToEnglish(
  email_addresses,
  word_list=custom_word_lists
)
## [1] "AstronomicalYellowOctopusOfChaos"    "BigBlueKittyOfChaos"                 "MinisculeBlueKittyOfSwoleness"      
## [4] "InfintesimalYellowMonkeyOfHappiness" "MinisculeOrangeBuffaloOfSadness"
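The number of distinct phrases a set of word lists can produce is the product of the list lengths. A minimal sketch, using a hypothetical helper n_combinations() and a placeholder list with the same shape as the one above (four lists of 8 words plus the single-word 'of' list):

```r
# Hypothetical helper: size of the output space for a list of word lists
n_combinations <- function(word_lists) prod(lengths(word_lists))

# Placeholder lists with the same sizes as custom_word_lists (8, 8, 8, 1, 8)
same_shape <- list(
  sizes  = letters[1:8],
  colors = letters[1:8],
  nouns  = letters[1:8],
  of     = "of",
  nouns2 = letters[1:8]
)

total <- n_combinations(same_shape)
print(total)  # 4096
```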

Of course, these lists only allow 4,096 (8 × 8 × 8 × 1 × 8) unique combinations. If you want to estimate how many keys you can generate before the probability of a collision exceeds a given threshold (0.01 in the call below), you can use the function uniqueness_max_size(), which approximates this number:

print(uniqueness_max_size(4096, 0.01))
## [1] 9.073718

Surprisingly, it is only about 9. Using the wml_long_sentence multi-wordlist, the value is far higher:

print(uniqueness_max_size(wml_long_sentence, 0.01))
## [1] 19028965

That is about 19 million, which is more than enough for most applications. As a general rule, the number of keys you can generate before a collision becomes likely grows in proportion to the square root of the number of possible combinations.
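Both printed values are consistent with the standard birthday-problem approximation, n ≈ sqrt(-2 · N · ln(1 - p)), where N is the number of combinations and p is the tolerated collision probability. The sketch below uses a hypothetical helper, birthday_max_keys(), and is not the package's own code. The 2^54 figure for wml_long_sentence is inferred from the output above rather than taken from the package documentation.

```r
# Hypothetical helper: birthday-problem approximation of how many keys can be
# drawn from n_combinations before the collision probability reaches p
birthday_max_keys <- function(n_combinations, p) {
  sqrt(-2 * n_combinations * log(1 - p))
}

small_space <- birthday_max_keys(4096, 0.01)
large_space <- birthday_max_keys(2^54, 0.01)  # 2^54 is an inferred size

print(small_space)  # about 9.07
print(large_space)  # about 19.03 million
```

Multiplying the number of combinations by 4 only doubles the number of usable keys, which is why the word lists need to be large before collisions stop being a concern.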

Random Sentence Generator

In case you just want random strings, you can also run generate_random_sentences(). Note that this uses the openssl package to generate random numbers, but if you want to use set.seed(), or just run it faster, you can use the fast parameter.

print(generate_random_sentences(5))
## [1] "Hellish black secessionist illuminates moist chevaliers." "Harmonious nylon pus bifurcates maroon diamonds."        
## [3] "Mysterious galvanized atom manufactures pink rices."      "Calculating glossy gauge inverts oak demons."            
## [5] "Nutty drenched lime condemns pyrite dirks."

Several word lists/word multi-lists are included with this package in order to make it easier to run code out-of-the-box: