Woes of getting an Arabic vocabulary word-list/dictionary; done using Linux!

Arabic resources can be surprisingly hard to find on the internet! In my case, I just wanted an Arabic word-list/dictionary! Little did I know that this simple request was a portal to endless woes!

A quick notice: if you are here just to download a free Arabic word-list/dictionary, the link is further down the post! Otherwise, please enjoy the post!

How it all started

So while looking around and messing with stuff, I realized that I needed an Arabic word-list! I needed a text file containing Arabic vocabulary, just like Linux’s main English dictionary (/usr/share/dict/words)! Like anyone who needs anything, I googled! Nothing useful came up! I rephrased, … still nothing! I kept trying every synonym I could think of, but all I was getting was frustration! No, I wasn’t able to find any decent free Arabic dictionary as a text file!

I took a five-minute break from googling and started thinking about what might lead me to what I was seeking! … I knew it was out there, and since it involved Arabic and open source, it had to have something to do with Arab-eyes (a foundation for Arabic Unix-related projects)! I surfed their wiki and asked in their IRC channel #arabeyes on freenode, but nothing came back to me! I kept searching, but it was futile! … Until a friend from #linuxac (Linux Arabic Community) said something that made me look at it from a different point of view! It made me think of relating it to Linux, so I started checking out apps that use the Linux words dictionary text file!

There were a lot of apps, but one of the major ones was aspell, the Linux command-line spell-checker! I knew the aspell project was big, so apt-cache search aspell | grep ar is what I did! And that’s when I went like “FINALLY”; there’s an Arabic package for aspell (aspell-ar)! So I installed it and I was all happy!

“Time for file hunting”, that’s what I said! I tried all the search methods, but locate and friends did me no good; I wasn’t able to find the dictionary of the installed package! So I went to consult uncle google! This time I had more specific keywords, and it did the trick, … well, sort of! It led me to the first conversation between an Arab guy and the aspell folks, in which he provided them with a simple dictionary of his own – that guy turned out to be one of the Arab-eyes members! The provided data was old, 2005/06 old, but I downloaded it just to see what the files were! Downloaded, extracted, and exclaimed after seeing no text files! There were an ar.content file and an ar.cwl.gz archive!

However, I didn’t give up just yet! I took the names and searched my desktop for them with the handy locate, after updating its database of course! I did so because chances were that the aspell devs hadn’t changed the structure of their app, so the files should have the same names! And I did find them, in a more up-to-date version, of course! I extracted the archive (ar.cwl.gz) and found out that the content was useless, a binary file!

So I was back to where I started! However, I took another approach that time: the command line tools! I simply entered apropos word | grep list and there were a bunch of commands that matched! While going through them, some commands caught my attention:
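A side note for Debian/Ubuntu users: the package manager can list the files a package installed, which might shorten this kind of file hunt. The sketch below assumes dpkg is available and aspell-ar is installed; the function name list_wordlist_archives is made up for illustration and is not part of any package.

```shell
# List the files an installed Debian/Ubuntu package owns and keep only
# the compressed word lists. The function name is illustrative.
list_wordlist_archives() {
    pkg=${1:-aspell-ar}
    if ! command -v dpkg >/dev/null 2>&1; then
        echo "dpkg not found; this only works on Debian-based systems" >&2
        return 2
    fi
    # dpkg -L prints every path installed by the package;
    # grep returns non-zero when no .cwl file is among them.
    dpkg -L "$pkg" 2>/dev/null | grep '\.cwl'
}

list_wordlist_archives aspell-ar || echo "no .cwl files found for aspell-ar"
```

On a setup like the one described in this post, the output would be expected to include the ar.cwl.gz archive under /usr/share/aspell.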

remove-default-wordlist (8) - remove default wordlist
select-default-wordlist (8) - select default wordlist
update-default-wordlist (8) - update default wordlist
word-list-compress (1) - word list for GNU Aspell

The first three weren’t much help, but the last one was the one! man word-list-compress told me that the binary files I found earlier were actually useful; they were the word-lists themselves!! This tool compresses text files into binary files that aspell can deal with! Not only that, it also decompresses the binary files! … I was literally celebrating – alone!

So my steps were: get archive, extract, use tool!

cp /usr/share/aspell/ar.cwl.gz ./
gunzip ar.cwl.gz
word-list-compress -d < ar.cwl > ar-words.list

Decompressed

I was a speechless nut! … A long list of those wonderful questionable symbols! It seemed like a list, maybe even an Arabic one, BUT IT WAS GIBBERISH! … After a short while of frustration, it hit me! “It’s the encoding”, I presumed! So I ran file ar-words.list and I was correct: it wasn’t displayed well because it wasn’t UTF-8, it was LATIN/ISO8859! Not all apps are able to display that since it’s not UTF-8 encoded; some apps have limited support! So I had to convert it, since I was planning to share it! So I ran apropos encoding and got this wonderful command: iconv, which converts between encodings! I read the man page and came up with this command:

Encoding fail

iconv -f ISO8859-1 -t UTF-8 ar-words.list > ar-words-utf.list

and enter! … this is what I got! Gibberish! I tried latin1 instead of ISO8859-1, but no luck! I was really frustrated and I couldn’t find any solution online; everything I found told me to do what I had already done! I was about to give up, but I thought that there was still one more thing I could do: become reckless! So, as my last attempt, I started trying encodings arbitrarily!

iconv -f ISO8859-2 -t UTF-8 ar-words.list > ar-words-utf.list   # Result: gibberish
iconv -f ISO8859-3 -t UTF-8 ar-words.list > ar-words-utf.list   # Result: error
iconv -f ISO8859-4 -t UTF-8 ar-words.list > ar-words-utf.list   # Result: more gibberish
iconv -f ISO8859-5 -t UTF-8 ar-words.list > ar-words-utf.list   # Result: Bulgarian gibberish

I went like “SIGH >:(” then entered:

iconv -f ISO8859-6 -t UTF-8 ar-words.list > ar-words-utf.list

IT WAS ARABIC!! I got myself a complete Arabic vocabulary word-list successfully!! 71502 words!!
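The trial and error above can be sketched as a small loop that previews a file under several candidate encodings, so the readable one can be spotted by eye. This is only an illustration: the function name preview_encodings and the four-byte sample (the word "سلام" in ISO-8859-6) are assumptions made for the demo, standing in for the real ar-words.list.

```shell
# Build a tiny stand-in for ar-words.list: the word "سلام" encoded as
# ISO-8859-6 bytes (D3 E4 C7 E5), written with octal escapes.
sample=$(mktemp)
printf '\323\344\307\345\n' > "$sample"

# Print a short preview of a file under each candidate source encoding.
preview_encodings() {
    file_to_check=$1
    shift
    for enc in "$@"; do
        printf '%s: ' "$enc"
        # A failed conversion (unknown encoding, invalid byte) prints a marker.
        if converted=$(iconv -f "$enc" -t UTF-8 "$file_to_check" 2>/dev/null); then
            printf '%s\n' "$converted" | head -n 3
        else
            echo '(conversion error)'
        fi
    done
}

preview_encodings "$sample" ISO-8859-1 ISO-8859-2 ISO-8859-5 ISO-8859-6
rm -f "$sample"
```

Only the ISO-8859-6 line should show Arabic letters; the other lines show the same bytes read as Latin or Cyrillic characters. For the real list, pass ar-words.list instead of the sample file.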

A part of the Arabic word-list

Download a complete Arabic word-list/dictionary

You can download it from here: ClickMe!

Note: the provided list was extracted from an open-source Linux package! You might be able to get a more up-to-date version of the list by making your own!

Make yourself an up-to-date Arabic word-list/dictionary

To follow this tutorial you need Ubuntu or another Linux distribution (the commands are written for Ubuntu/Debian/Gnome)! A script is provided at the bottom!

  1. Launch your terminal (Applications –> Accessories –> Terminal)
  2. Install the Arabic extension package for aspell: sudo apt-get install aspell-ar
  3. Copy Arabic list archive: cp /usr/share/aspell/ar.cwl.gz ./
  4. Extract Arabic list archive: gunzip ar.cwl.gz
  5. Decompress the extracted file to gain the word-list: word-list-compress -d < ar.cwl > dump
  6. Change word-list encoding to be readable and presentable: iconv -f ISO8859-6 -t UTF-8 dump > words-ar

You can alternatively download a set-of-commands/lazy-script which does the steps above – Click here to download! Once downloaded to your desktop, type in the command line: bash ~/Desktop/GetArabicWordlist

Note: You will be asked for your password, which is needed to install the package!
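For reference, steps 3 to 6 can also be chained into one pipeline without intermediate files. This is a rough sketch and not the downloadable script: the function name make_arabic_wordlist and its default arguments are made up here, and the -d argument is copied from the steps above (if your version of word-list-compress rejects it, its usage message shows the accepted form).

```shell
# Sketch only: chains steps 3-6 without intermediate files.
# Function name and defaults are illustrative, not taken from the post's script.
make_arabic_wordlist() {
    src=${1:-/usr/share/aspell/ar.cwl.gz}   # archive installed by aspell-ar
    out=${2:-words-ar}                      # UTF-8 result file

    if [ ! -r "$src" ]; then
        echo "cannot read $src (is aspell-ar installed?)" >&2
        return 1
    fi
    if ! command -v word-list-compress >/dev/null 2>&1; then
        echo "word-list-compress not found (is aspell installed?)" >&2
        return 1
    fi

    # gunzip -c writes to stdout, so the installed archive is left untouched.
    gunzip -c "$src" | word-list-compress -d | iconv -f ISO-8859-6 -t UTF-8 > "$out"

    # The pipeline only reports iconv's status, so also require non-empty output.
    if [ ! -s "$out" ]; then
        echo "no words were produced" >&2
        return 1
    fi
    echo "$(wc -l < "$out") words written to $out"
}

# Usage (after installing aspell-ar):
#   make_arabic_wordlist            # writes ./words-ar
#   make_arabic_wordlist /usr/share/aspell/ar.cwl.gz my-list.txt
```

Streaming through a pipe avoids copying the archive into the current directory; the non-empty check is there because a failure early in the pipeline would otherwise go unnoticed.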

Update: A project similar to aspell, ayaspell, has a more up-to-date version of that list; yes, they share the same base! So if you’re interested, get it from here: http://ayaspell.sourceforge.net/! Just download the latest archive, extract it, and do step number 6! :D


Hopefully, this blogpost will be found whenever one searches for an Arabic word list dictionary!

~ by AnxiousNut on December 22, 2010.

12 Responses to “Woes of getting an Arabic vocabulary word-list/dictionary; done using Linux!”

  1. I saw the binary files for the Arabic dictionary, and many others, for Android available in a project on Google Code a few days back. You could’ve simply downloaded that.

    Good work nonetheless.

  2. Hmmm, interesting! Can you link me to the project you were referring to? I want to see which has a larger set of vocab!

  3. I think it was under this project: http://code.google.com/p/softkeyboard/

    I remember stumbling upon it a few days back when I was looking at a UTF8 problem with Arabic.

  4. […] If you wanted a filtered result, you’d need an Arabic .txt dictionary, you can get one here! […]

  5. […] In case you don’t know the location of the linux words (English) wordlist on your system, SHAME ON YOU! LOL, it’s under /usr/share/dict/words (Ubuntu). And if you ever forgot, it’s provided within the help message. As for the Arabic wordlist, you can get it from my old blog-post. […]

  6. […] Arabic sub Menu Arabic عربى Koran قراءنWoes of getting an Arabic vocabulry word-list/dictionary; done using …Saving Arabic words into a database – ASP / Active Server PagesLearn Arabic Words – Core […]

  7. what shitty website

  8. This is exactly what I was looking for. Thanks a lot for your hard work!

  9. Thanks. helped me a lot

  10. your awesome man thank you so much from your arab friend :)

  11. Jazak Allaah khair.

  12. This may be an old post but in case new people are looking for an Arabic Word List, here is one with 9M+ words:
    https://sourceforge.net/projects/arabic-wordlist/
