Cifras y Letras (3)


Some days ago I was searching for an old CD and, instead of finding what I was looking for, I found another old CD with a digital Spanish dictionary. I didn’t even remember I had it, but it quickly prompted me to finish the letters program for the TV show. The dictionary was one of those typical Windows applications distributed on a CD that install a minimal amount of data on your hard drive and run from the CD itself.

I immediately had a look at the CD contents and spotted a fairly large .MDB file. Kexi was unable to import the data, and opening the database directly didn’t give me good results either. I waited a few more days until I had access to a computer with MS Office installed, and opened the file with MS Access. To my surprise, the result was similar. The database appeared to contain a single table, where each row contained a word id, a definition id and the text of the definition, but the words themselves were nowhere to be found. Furthermore, the definition text appeared to be garbage. I didn’t bother to investigate, but maybe its bits were inverted, or the characters rotated like in ROT13. In any case, I lost hope and gave up on finishing the program.

However, some days later, while watching TV, I had the idea to run a recursive grep on the CD files just in case the words were stored in cleartext in a separate file. I didn’t have much hope, but the words did turn up. They were stored in a different file, using the ISO-8859-1 charset. The file was quite big and mostly filled with space characters. Each word seemed to use a fixed chunk of 65 bytes. I typed an improvised Python program directly into the Python interpreter to split the file into chunks, strip whitespace and encode the contents in UTF-8. Five minutes later I had stored the word list in a text file called palabras-utf8.txt with one word per line. That file is distributed inside the archive. I needed to hand-edit that file using VIM to obtain a final list of words. The work mostly involved removing accents, removing words that start with a bracket (which seemed to indicate "unofficial" words), splitting alternate spellings of words into different lines and making some more minor changes. The final result is stored in palabras-utf8-final.txt, also included in the archive.
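The original interpreter session was not saved, so the following is only a minimal sketch of the extraction step described above, written for Python 3. The 65-byte record size and the ISO-8859-1 charset come from the post; the function names `extract_words` and `convert` are hypothetical, and the code assumes records are padded with ordinary whitespace.

```python
CHUNK_SIZE = 65  # fixed record width observed in the dictionary file


def extract_words(raw, chunk_size=CHUNK_SIZE):
    """Split raw bytes into fixed-size records and return the non-empty words."""
    words = []
    for start in range(0, len(raw), chunk_size):
        record = raw[start:start + chunk_size]
        # Records are ISO-8859-1 text padded with spaces.
        word = record.decode("iso-8859-1").strip()
        if word:
            words.append(word)
    return words


def convert(src_path, dst_path):
    """Read the fixed-width file and write one UTF-8 word per line."""
    with open(src_path, "rb") as src:
        words = extract_words(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        for word in words:
            dst.write(word + "\n")
    return len(words)
```

Calling `convert` with the path of the source file on the CD and `"palabras-utf8.txt"` would produce a list like the one mentioned above, before any hand-editing.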

As you can see, many of the lines in the final file have alternate endings for different genders, and optional suffixes that modify the word gender. A file like that one can be processed with the script included in the archive, which builds a database of words from it. This database is created using the shelve Python module (yes, I know this was a good occasion to use sqlite3, but I wanted some backwards compatibility with Python 2.4). It uses the method I mentioned in the previous Cifras y Letras post. It takes each word, sorts its letters and associates the word, maybe along with other words, with its corresponding sorted letter combination. For example, the combination aacs corresponds to the words casa (house) and saca (bag). Given 9 letters, you form the combinations of 9 letters (there’s only 1), the combinations of 8 letters (there are 9), etc. You sort the letters of each combination and use them as a key to access the database, which returns the list of words for that key. As I explained in the previous post, this is much faster than forming every permutation and checking if it’s in the dictionary. In this case, preprocessing the database and organizing it once allows us to be much more efficient when we use it. The script creates the database this way and handles the alternative endings and word suffixes, storing both variants in the database. If you want to use a different text file and a different database name, those can be changed in the first lines of the script.
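The sorted-letters idea can be sketched in a few lines of Python 3. This is not the code from the archive: a plain dict stands in for the shelve database, the handling of alternate endings and suffixes is left out, and the names `letters_key`, `build_index` and `find_words` are made up for illustration.

```python
from itertools import combinations


def letters_key(word):
    """Return the word's letters in sorted order, e.g. 'casa' -> 'aacs'."""
    return "".join(sorted(word))


def build_index(words):
    """Map each sorted-letters key to the list of words sharing it."""
    index = {}
    for word in words:
        index.setdefault(letters_key(word), []).append(word)
    return index


def find_words(index, letters, min_length):
    """Look up every combination of the given letters, longest first."""
    found = set()
    pool = sorted(letters)  # combinations of a sorted pool are already sorted
    for size in range(len(pool), min_length - 1, -1):
        # set() removes duplicate combinations caused by repeated letters
        for combo in set(combinations(pool, size)):
            found.update(index.get("".join(combo), []))
    return sorted(found, key=lambda w: (-len(w), w))


index = build_index(["casa", "saca", "senadora", "senador", "sanador", "darsena"])
print(index["aacs"])                        # ['casa', 'saca']
print(find_words(index, "ananedros", 7))    # ['senadora', 'darsena', 'sanador', 'senador']
```

With 9 letters there are at most 511 non-empty combinations to look up, whereas the number of permutations runs into the hundreds of thousands, which is why building the index once pays off.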

Finally, once you’ve run the database-building script and have your palabras.db file, use the search script to look for words. It needs two parameters: a sequence of letters and a minimum length for the words in the result. You may also need to adjust the terminal charset inside the script; it should probably be 'iso-8859-1' if you’re under Windows, or 'utf-8' if you’re using one of the most popular Linux distributions. This charset and the database name can be changed in the first lines of the script. Example of use:

$ ./ ananedros 7
orensana (8)
senadora (8)
rondana (7)
saneado (7)
arenosa (7)
sanador (7)
endosar (7)
senador (7)
sondear (7)
asadero (7)
darsena (7)

It’s worth mentioning that the database lacks gerunds and any past participles that aren’t used as adjectives. These are allowed in the TV show, so don’t be surprised if a contestant finds a longer word that way in some particular situations. Due to the lack of verb markers (in Spanish, there are many words ending in -ar, -er and -ir that aren’t verbs) and the existence of irregular verbs, I can’t generate gerunds and past participles automatically.

Both scripts are fewer than 100 lines long, and some of those lines are comments. This gives you an idea of the game’s simplicity. The code and the word lists are released to the public domain.
