Rewriting the Anagram program

As Calvin pointed out, its a pain to do the string length and other calculations, so I went back and rewrote the anagram calculation using the prime number trick. That is, assign a prime to each letter from ‘a’ to ‘z’, then you can easily tell if two words are anagrams by multiplying all the prime factors. This will give a unique number.

There are two tricks. First, if you assign too big a prime then you can get an overflow. On a Mac running gcc, the largest integers are 64 bits long. C represents these as “unsigned long long int” which is quite a mouthful. Given there are 26 letters in the alphabet, the worst case would be the 26th prime 103 times 10 before you over flow (103^10 > 2^64). This is pretty unlikely in a real world dictionary.

To make it less likely, you can pick an encoding where the most “frequent”:http://scottbryce.com/cryptograms/stats.htm letters have the smallest prime. That is E is the most likely in english, so assign it a prime of 2, T is next, so assign it prime of 3 and so forth (for your reference, the letter order by decreasing frequency is ETAOIN SRHLDC UMFPGW YGVKXJ QZ and using the Encore scrabble dictionary, the word with the highest prime is adrenocorticotrophin with a canonical number of 18,438,608,663,595,509,046 or 1.8 x 10^19 which is just shy of 2^64 which is 1.8×10^19, but you should have a check for integer overflow.

The code is really simple then, you just multiple the encoding prime for each letter and then match it against the canonical “number” for each entry in the dictionary.