Santhosh Thottingal

My experiments with Freedom

Yahoo search bug
santhoshtr
None of the search engines handle Indian languages very well. Google removes the zero-width joiners and non-joiners that are used in many languages. Yahoo does not remove them, but a UI bug in the results page makes the results wrong.
See the image below:





The bottom half of the image shows the page source. We can clearly see that the closing bold tag is placed in the middle of the word instead of at the end. As a result, the word is rendered wrong on the page.
This happens for all languages that use ZWJ, ZWNJ, ZWS etc. The highlighter breaks the word just before the ZWJ/ZWNJ and puts the closing bold tag there to highlight the search result.
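To make the mechanism concrete, here is a toy Python sketch of a highlighter that treats the joiner as the end of a word. This is only my illustration, not Yahoo's actual code; the word and the buggy_highlight function are made up for the example.

# -*- coding: utf-8 -*-
# Toy illustration: a highlighter that stops at the first ZWJ/ZWNJ puts the
# closing </b> inside the word, leaving the joiner outside the highlight.
ZWJ = u"\u200d"

def buggy_highlight(word):
    cut = word.find(ZWJ)          # treat the joiner as the end of the word
    if cut == -1:
        cut = len(word)
    return u"<b>" + word[:cut] + u"</b>" + word[cut:]

word = u"\u0d05\u0d35\u0d28\u0d4d" + ZWJ   # അവന്‍ : chillu formed by virama + ZWJ
print buggy_highlight(word).encode("utf-8")
# prints <b>അവന്</b> followed by the orphaned ZWJ -- the chillu form is lost when rendered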

I showed this to Gopal and he told me that he has filed a bug on it.

KDE spellchecker not working for Indian Languages
santhoshtr
As I mentioned in my blog post on language detection, the Sonnet spellchecker of KDE is not working. I read the Sonnet code and found that it fails to determine the word boundaries in a sentence (or string buffer) and so passes parts of words to backend spellcheckers like aspell or hunspell. Eventually every word is reported as wrong. This is the logic used in Sonnet to recognize word boundaries:
Loop through the characters of the string until the current character is not a letter anymore.
For this it uses the QChar::isLetter() function. This function fails for the matra signs of our languages.

A screenshot from a text area in Konqueror:

For example:
#include <QtCore/QChar>
#include <cstdio>
int main(){
	QChar letter(0x0B85);	/* அ TAMIL LETTER A */
	fprintf(stdout, "%d\n", letter.isLetter());
	letter = QChar(0x0940);	/* ी DEVANAGARI VOWEL SIGN II, a matra */
	fprintf(stdout, "%d\n", letter.isLetter());
	return 0;
}
In this program, you will get 1 (true) as output for அ and 0 (false) for ी.
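To see how that boundary rule chops Indic words apart, here is a rough Python sketch of the same logic. This is my own illustration, not Sonnet's code; Python's isalpha() is close enough to QChar::isLetter() here, since matras are combining marks (categories Mn/Mc), not letters.

# -*- coding: utf-8 -*-
import unicodedata

def naive_words(text):
    word = u""
    for ch in text:
        if ch.isalpha():          # roughly what QChar::isLetter() checks
            word += ch
        else:
            if word:
                yield word
            word = u""
    if word:
        yield word

text = u"\u092d\u093e\u0930\u0924"     # भारत : BHA + AA matra + RA + TA
print list(naive_words(text))          # [u'\u092d', u'\u0930\u0924'] -- the word is split at the matra
print unicodedata.category(u"\u093e")  # 'Mc' : the matra is a combining mark, not a letter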

When I showed this to Sayamindu during foss.in, he showed me a bug in glibc. Even though the bug is about Bengali, it is applicable to all Indic languages. It is assigned to Pravin Satpute, and he told me that he has a solution and will be submitting it to glibc soon.

But I am wondering why this bug in KDE went unnoticed so far. Has nobody used spellcheck for Indian languages in KDE?!

Let me explain why this does not happen in the GNOME spellchecker if it is a glibc bug. In GNOME, the word splitting is done in the application itself using gtk_text_iter_*, and that iteration through words uses Pango's word boundary detection algorithms.

Filed a bug in KDE to track it.

Youtube to MPEG or Ogg video conversion
santhoshtr
Here is a quick method to convert a YouTube video to an Ogg Theora video.
Locate clive, ffmpeg and ffmpeg2theora in your package manager and install them.
Download the video:
$clive http://in.youtube.com/watch?v=6JeZ5oeAEyU (replace this with the YouTube address you want). It will create an flv file.
Convert it to an MPEG video file:
$ffmpeg -i AmericaAmerica.flv AmericaAmerica.mpg (replace AmericaAmerica.flv with the name of the flv file the previous command created)
Convert that to an Ogg video file:
$ffmpeg2theora AmericaAmerica.mpg
Done. You can see the .ogg file in the directory from which you executed the above commands.

Dhvani 0.94 Released
santhoshtr

A new version of Dhvani, the Indian language text-to-speech system, is available now. The new version comes with several improvements and features.

Dhvani documentation is available here. Binary packages and source code are available here
Thanks
  • Rahul Bhalerao for Marathi module and patches
  • Zabeehkhan for Pashto Module
  • Nirupama and the CDAC Chennai and CDAC Noida teams for testing and reporting bugs
  • NRCFOSS Chennai, Krishnakanth Mane and many others for feedback
  • The Amida Simputer team for patches on the Telugu module, especially the Telugu number reading logic
  • Debayan and Roshan for testing and reporting problems

There was a good amount of code change in this version. There are still many improvements to make in the language modules and the synthesizer. Some of the language modules require developers who speak that language. The synthesizer got some improvements, but making the speech more natural requires some research. So your feedback, suggestions, bug reports and patches are valuable.

PS: A note for quick usage after installing from a binary: after installing the deb or rpm, open gedit, go to Edit->Preferences->Plugins and enable External Tools. Dhvani will be available as a plugin there. Select some text in any of the supported languages and click the Dhvani menu.


Language Detection and Spellcheckers
santhoshtr
A few weeks back there was a discussion on the #indlinux IRC channel about automatic language detection. The idea is that spellcheckers or any language tools should not ask users to select a language; instead, they should detect the language automatically. The idea is not new. There is a KDE bug here, and Ubuntu has this as a brainstorm idea. It seems M$ Word already has this.

A sample use case can be this: "While preparing a document in OpenOffice, I want to write in English as well as in Hindi. For spellchecking, I need to change the language manually rather than the application detecting it automatically."

Regarding the algorithm behind automatic language detection, there are many approaches. Statistical approaches are effective for languages sharing the same script (for example, languages that use the Latin script, or Hindi and Marathi). N-gram based methods are used in the statistical approach. Here is a 'patented' idea, and this page explains a character trigram approach. Google has a language detection service (http://www.google.com/uds/samples/language/detect.html) and it seems it is still in development or in a 'learning stage'.


Here is an example of statistical language detection: languid (it did not work for me when I tried, but you can download the source code and check).

Sonnet is the spellchecker framework of KDE, written by J. Rideout. It also tries to provide the language detection feature. Here is an old article on linux.com about it. It is based on n-gram text categorization and is a port of languid. From the article:

A gram is a segment of text made of N number of characters. Sonnet uses trigrams, made from three characters. By analyzing the popularity of any given trigram within a text, one may make assumptions about the language the text is written in. Rideout gives an example: "The top trigram for our English model is '_th' and for Spanish '_de'. Therefore, if the text contains many words that start with 'th' and no words that start with 'de,' it is more likely the text is in English [than Spanish]. Additionally, there are several optimizations which include only checking the language against languages with similar scripts and some heuristics that use the language of neighboring text as a hint."
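Just to illustrate the idea from the article, here is a minimal Python sketch of trigram profiles. This is my own toy code, not Sonnet's or languid's implementation, and the tiny training sentences are only placeholders for real corpora.

# -*- coding: utf-8 -*-
# Build a ranked trigram profile per language from sample text, then score
# new text by how much its trigrams overlap with each profile.
from collections import defaultdict

def trigram_profile(text, size=300):
    text = u"_" + text.replace(u" ", u"_") + u"_"
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:size])

def detect(text, profiles):
    scores = {}
    for lang, profile in profiles.items():
        scores[lang] = len(trigram_profile(text) & profile)
    return max(scores, key=scores.get)

profiles = {
    "en": trigram_profile(u"the quick brown fox jumps over the lazy dog"),
    "es": trigram_profile(u"el rapido zorro marron salta sobre el perro perezoso"),
}
print detect(u"the dog sleeps", profiles)   # 'en'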


(I tried Sonnet and could not get it working for ml_IN. Instead of words, it was iterating through letters. Anyway, I will check this problem later.)

As far as Indian languages are concerned, Unicode code range based language detection will work for most cases. Most of the languages have their own script and Unicode code point range. For example, detecting Malayalam is a matter of checking whether the letters are in the Malayalam Unicode range. But for the Devanagari script it is not straightforward: Hindi, Marathi etc. use Devanagari. Dhvani, the text-to-speech system for Indian languages, uses a simple algorithm for language detection (http://dhvani.sourceforge.net/doc/language-detection.html). There, Hindi and Marathi are distinguished by giving priority to the LANG environment variable. But that will fail if somebody tries to use Marathi on an English desktop (users can specify the language to be used, in which case language detection will not be done).
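Here is a small sketch of what code-range based detection looks like, assuming the standard Unicode block ranges. It is my own illustration in the spirit of Dhvani's approach, not its actual code.

# -*- coding: utf-8 -*-
# Detect the script of a word from the Unicode block of its characters.
# Hindi vs Marathi cannot be separated this way, since both use Devanagari.
BLOCKS = [
    (0x0900, 0x097F, "Devanagari"),   # Hindi, Marathi, Sanskrit, ...
    (0x0980, 0x09FF, "Bengali"),
    (0x0A00, 0x0A7F, "Gurmukhi"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x0B00, 0x0B7F, "Oriya"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
    (0x0C80, 0x0CFF, "Kannada"),
    (0x0D00, 0x0D7F, "Malayalam"),
]

def detect_script(word):
    for ch in word:
        for start, end, name in BLOCKS:
            if start <= ord(ch) <= end:
                return name
    return "Latin/other"

print detect_script(u"\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d")  # Malayalam
print detect_script(u"hello")                                       # Latin/other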

In the case of spellcheckers, there are other options besides the LANG environment variable. When you type in gedit or any text editor, detecting the keyboard layout would be one way of detecting the language. But that depends on which IME the user uses; it can be xkb or SCIM, or even a copy-paste.

Anyway, it is pretty clear that the current natural language features in the free desktops require more improvements. Based on a discussion we had on #indlinux IRC, we have set up a wiki page here to discuss this.

As a proof of concept, I tried to write a spellchecker for the gedit text editor with language detection for Indian languages. Basically it uses Unicode character ranges. It is a gedit plugin written in Python, and it uses the pyenchant spellcheck wrapper library. Install python-enchant using your package manager if it is not already installed. Download the plugin and the Python module to the ~/.gnome2/gedit/plugins folder and restart gedit. Enable External Tools and the new Spellchecker plugin in Edit->Preferences->Plugins. It does not have the Pango error-style underline or suggestions in the context menu as of now; it just prints the results and suggestions in gedit's console. 'Add to Dictionary' etc. are not there yet.
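The core idea of the plugin can be sketched like this. It is a simplified illustration, not the plugin's actual code, and the locale tags ml_IN, hi_IN and en_US assume those enchant dictionaries are installed.

# -*- coding: utf-8 -*-
# Detect the script of each word from its first code point, map it to a
# dictionary locale, and let pyenchant check the spelling.
import enchant

def locale_for(word):
    cp = ord(word[0])
    if 0x0D00 <= cp <= 0x0D7F:
        return "ml_IN"            # Malayalam block
    if 0x0900 <= cp <= 0x097F:
        return "hi_IN"            # Devanagari block (Hindi/Marathi not separable here)
    return "en_US"

def spellcheck(word):
    lang = locale_for(word)
    if not enchant.dict_exists(lang):
        return
    d = enchant.Dict(lang)
    if not d.check(word):
        print word.encode("utf-8"), "->", d.suggest(word)

for w in [u"hello", u"\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d", u"\u092d\u093e\u0930\u0924"]:
    spellcheck(w)                  # each word is checked against its own dictionary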

I would like to request interested developers to come forward and make this feature ready to use in free desktops. Suggestions are welcome. We need good algorithms for detecting the language too.
A sample use case: "System locale is English and I am typing a document in Hindi and want to write some Marathi sentences in between. Without manually changing the language, system detect the language of each word and check the spelling against corresponding dictionaries."

PS: Because of the inflectional and agglutinative nature of some of the Indian languages, spell checking is not at all effective. I will write about that later.

Gedit plugin for showing unicode codepoints
santhoshtr
While working with Unicode text, it is often necessary to get the Unicode code points of the text for debugging. Using Python, it is very easy. The following examples illustrate it.

>>> "സന്തോഷ്".decode("utf-8")
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'

or

>>> str=u"സന്തോഷ്"
>>> print repr(str)
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'

Well, but then we need to open a Python console and type or paste the text. How can we make it easier? What if pressing the F12 key after selecting some text gave the code points?
So I wrote a plugin for gedit. I never knew that writing a gedit plugin is so easy. This tutorial gives all the required information.
Download the plugin file and the Python module and place them in the .gnome2/gedit/plugins folder inside your home folder. Then restart gedit and enable the plugin from the Edit->Preferences->Plugins menu. Note that you need to enable the External Tools plugin too.


Select some text and press F12. If no text is selected, the entire content of the document will be used.
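The conversion at the heart of the plugin is tiny; here is a standalone sketch of it. The real plugin just wires something like this to gedit's selection and the F12 key, and the helper name below is mine, not the plugin's.

# -*- coding: utf-8 -*-
def to_codepoints(text):
    if isinstance(text, str):              # raw UTF-8 bytes from the buffer
        text = text.decode("utf-8")
    return u" ".join(u"U+%04X" % ord(ch) for ch in text)

print to_codepoints("സന്തോഷ്")
# U+0D38 U+0D28 U+0D4D U+0D24 U+0D4B U+0D37 U+0D4D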

Screensavers in your language
santhoshtr
I had written a blog post about hacking the glmatrix screensaver with the glyphs of our languages.

Now I have those screensavers in the following languages:

Hindi : Deb Package , RPM

Gujarati : Deb Package , RPM

Bengali : Deb Package , RPM

Oriya: Deb Package , RPM

Tamil : Deb Package , RPM

Malayalam: Deb Package , RPM


Try it and enjoy !!
PS: I used the default fonts of Fedora 9 for these. If you have any specific font to be used, please let me know. I used the Dyuthi calligraphic font for Malayalam.

Swanalekha M17N based Input Method for 11 Languages
santhoshtr
Swanalekha is an input method originally designed for Malayalam. It works with SCIM as well as m17n. The input method scheme is transliteration based, and it has a unique feature: a candidate list menu (which I will explain shortly). Now I have extended it to 10 other Indian languages.

Before explaining how Swanalekha is different from other phonetic/transliteration based input methods, let me explain some characteristics of transliteration. Transliteration based input methods have been following a strict one-to-one mapping from English letters to the Indian language: for example ka = क, pa = प, ti = टि etc. When you write bharath, you will easily transliterate it to Hindi as भारत, but for a rule based transliteration system it is भरत unless the English is bhaarath. Sometimes it may be Bhaarat too. See another example: Kartik. It should be transliterated to കാര്‍ത്തിക് in Malayalam, yet some people write it as Karthik and others as Karthick. All these are based on personal preferences. But when they use transliteration based input methods, people find it difficult to follow a strict rule based writing method: there they have to write kaa for കാ or કા or கா or কা. Users would like to get what they mean without the difficulty of following the strict rules of transliteration. In an intelligent transliteration based system, when somebody writes linux they should be able to map it to लिनक्स; sometimes a choice to select लैनक्स is also preferable. This is what Google Transliteration does: no rules, no learning, just type in English.

Google's transliteration is based on machine learning and a statistical approach, and it works only when we are online and only in webpages. Now I will explain how Swanalekha tries to provide a solution for the above problem.
For each English letter or pattern, we saw that there are multiple choices: ka can be क or का; ga can be ക, ഗ, ഖ, ഘ or ഗാ in Malayalam; sa can be स or श, and so on. So Swanalekha provides all these candidates as a suggestion menu under the cursor while typing. See the image below of the Hindi Swanalekha version.





The differences between Google Transliteration and Swanalekha are:
a) Google Transliteration is web based and works in webpages when you are online. Swanalekha works in all applications on your GNU/Linux desktop, such as gedit, OpenOffice.org, KWrite, Firefox...
b) Google Transliteration gives suggestions as words, but Swanalekha works at the letter level (not exactly a single letter, but units like का, કા etc.)
c) Google Transliteration is machine learning based, but Swanalekha is rule based, with 'one to many' pattern mapping in m17n (see the sketch below).
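Here is a toy Python sketch of that 'one to many' idea, just to make it concrete. The real mapping lives in the .mim files and is interpreted by m17n, not Python, and the patterns and candidates below are only a sample.

# -*- coding: utf-8 -*-
# Each English pattern maps to an ordered list of candidates; the first one
# is the default, the rest show up in the candidate menu.
CANDIDATES = {
    u"ka": [u"\u0915", u"\u0915\u093e"],   # क, का
    u"sa": [u"\u0938", u"\u0936"],         # स, श
    u"ga": [u"\u0917", u"\u0918"],         # ग, घ
}

def candidates(pattern):
    return CANDIDATES.get(pattern, [pattern])

print candidates(u"sa")                    # the menu shown under the cursor
print candidates(u"sa")[0].encode("utf-8") # default choice: स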

The candidates are mapped to English string patterns inside the source code, the m17n input method files (.mim files).
You can download the .mim files from here. Icons for each language are also provided. You can see .mim files for Malayalam, Hindi, Telugu, Oriya, Tamil, Bengali, Assamese, Panjabi, Gujarati, Marathi and Kannada. Note that, other than Malayalam, the source files are not complete. They are generated from the Malayalam mapping file using a small Python script; they are just templates with approximate mappings and should be corrected and modified by a person who knows that language very well. The Malayalam mapping is tested, it is already packaged for Fedora, and it is present in the m17n upstream as part of the m17n-contrib package. It is widely used by GNU/Linux users in Kerala too.
Candidate selection based input methods are very common for CJK (Chinese, Japanese, Korean) languages. Swanalekha is the first implementation of a candidate list outside CJK using SCIM and m17n.

So if anybody is interested in testing and correcting the mappings for your language, please continue reading :)

How to install:
Download the tarball containing all the .mim files and icons from here. Extract it and copy the files into place:
sudo cp *.mim /usr/share/m17n
sudo cp *.png /usr/share/m17n/icons

Note that you need to install scim-m17n before doing this. Most distros will have it pre-installed.
After copying these, restart X by pressing ctrl+alt+backspace, or log out and log in again, so that SCIM picks up the new input methods.
Open gedit, select scim as the input method, select your language from the SCIM menu and start typing.

How to correct the maps?
Open the .mim file for your language using any text editor.
You will see lines in Lisp syntax. No, you need not know Lisp :)
For example, in hi-swanalekha.mim you will see a line like this:
("sa" (("स") ("श")))
This means: for 'sa', show स and श as candidates, with स as the default. If you want to add सा as a third option, just change the line like this:
("sa" (("स") ("श") ("सा")))
If a pattern is not found in the .mim file, just add one more line there following the above syntax. The only thing is, you should be careful about opening and closing parentheses since it is Lisp.

Once you are done, install it by copying it to the /usr/share/m17n folder. Restarting X is required to restart SCIM, or even a 'killall scim' will do sometimes.
Don't change any other code (the code for candidate selection using the up/down arrows and the number keys) unless you know what you are doing.

Let me know if you face any issues.

Happy Hacking and Happy Deepavali !!!

Geo-visualisation, the FOSS way
santhoshtr
My friend Jaisen Nedumpala has been developing a geo-visualisation system for Cheruvannoor Grama Panchayath (page in ml_IN) in Kerala. The system, developed using FOSS tools, is available here.
"Development of effective geo-visualisation based decision support system (DSS) involved primarily data compilation from collateral sources, setting up appropriate hardware configuration, design of database and design of a spatial DSS. "
Jaisen used software like GRASS, UMN MapServer and ka-Map. He has written detailed documentation (in English) on how he developed it and the tools he used.

UTF8Decoder
santhoshtr
zabeehkhan was trying to write a Pashto (ps_AF) module for Dhvani, and he told me that "it is not saying anything" :). So I took the code and found the problem. Dhvani has a UTF-8 decoder and UTF-16 converter. It was written by Dr. Ramesh Hariharan and was tested only with the Unicode ranges of the languages of India. It was buggy for most other languages, and thereby the language detection logic and the text parsing logic were failing. So I did some googling, went through the code tables of gucharmap and got some helpful information from here and here.
So here is my new UTF8Decoder and converter
/*
UTF8Decoder.c
This program converts a utf-8 encoded string to utf-16 hexadecimal code sequence

UTF-8 is a variable-width encoding of Unicode.
UTF-16 encodes every character of the Basic Multilingual Plane as a single two-byte unit (characters outside the BMP need surrogate pairs, which this decoder does not handle)

A UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to
encode a character. For example, the character U+000A (line feed) must be accepted from
a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:

  0xC0 0x8A
  0xE0 0x80 0x8A
  0xF0 0x80 0x80 0x8A
  0xF8 0x80 0x80 0x80 0x8A
  0xFC 0x80 0x80 0x80 0x80 0x8A

Ref: UTF-8 and Unicode FAQ for Unix/Linux http://www.cl.cam.ac.uk/~mgk25/unicode.html

Author: Santhosh Thottingal <santhosh.thottingal at gmail.com>
License: This program is licensed under GPLv3 or later version(at your choice)
*/
#include<stdlib.h>
#include<stdio.h>
#include<string.h>
unsigned short
utf8_to_utf16 (unsigned char *text, int *ptr)
{

  unsigned short c = 0;		/* resulting utf-16 code unit (BMP only) */
  int trailing = 0;		/* number of continuation bytes to read */
  if (text[*ptr] < 0x80)	/* single byte: plain ascii */
    {
      trailing = 0;
      c = text[(*ptr)++];
    }
  else if (text[*ptr] >> 7)
    {
      if (text[*ptr] < 0xE0)	/* two byte sequence: U+0080 .. U+07FF */
	{
	  c = text[*ptr] & 0x1F;
	  trailing = 1;
	}
      else if (text[*ptr] < 0xF0)	/* three byte sequence: U+0800 .. U+FFFF */
	{
	  c = text[*ptr] & 0x0F;
	  trailing = 2;
	}
      else if (text[*ptr] < 0xF8)	/* four byte sequence: beyond the BMP,
					   would need surrogate pairs (not handled) */
	{
	  c = text[*ptr] & 0x07;
	  trailing = 3;
	}

      for (; trailing; trailing--)
	{
	  if (((text[++*ptr]) & 0xC0) != 0x80)	/* not a continuation byte: stop */
	    break;
	  c <<= 6;
	  c |= text[*ptr] & 0x3F;
	}
      if (trailing == 0)
	(*ptr)++;		/* step past the last byte of this character */

    }
  return c;

}


/* for testing */
int
main ()
{
  unsigned char *instr = (unsigned char *) "സന്തോഷ് തോട്ടിങ്ങല്‍";	/* my name :) */
  int length = strlen ((char *) instr);
  int i = 0;

  for (; i < length;)
    {
      printf ("0x%.4x ", utf8_to_utf16 (instr, &i));
    }
  printf ("\n");
/* output is:
0x0d38 0x0d28 0x0d4d 0x0d24 0x0d4b 0x0d37 0x0d4d 0x0020 0x0d24 0x0d4b 0x0d1f 0x0d4d 0x0d1f 0x0d3f 0x0d19 0x0d4d 0x0d19 0x0d32 0x0d4d 0x200d 
*/

  return 0;
}

There may already be existing libraries for this, but writing a simple one ourselves is fun and a good learning experience. For example, in Python, to get the UTF-16 code sequence of a Unicode string, we can use this:
str=u"സന്തോഷ്‌"
print repr(str)

This gives the following output
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
