Santhosh Thottingal

My experiments with Freedom

Goodbye LJ!
santhoshtr
Goodbye, LiveJournal!
I am moving to my new home: http://thottingal.in/blog
Friends, please update your bookmarks, feed subscriptions, etc.
Migrating all the LiveJournal posts was easy with the new version of WordPress.

Openoffice Indic Regional Language group
We have just formed an Indic Regional Language group for OpenOffice.org. This is as per the OpenOffice.org Native Language Consortium plans. The objectives of such groups can be read here. Basically, the group is meant for better coordination among Indic languages, to make the OpenOffice.org experience in our languages better.
The announcement of this group is here.

Thanks to Charles-H. Schulz, we got a mailing list: indic@native-lang.openoffice.org. To subscribe, log in to http://native-lang.openoffice.org

We have just started, and I will soon set up a wiki page there. To start with, I will collect the list of pending issues for Indian languages from people working in the various languages, and will find a point of contact for each language. Feel free to contact me about anything related to OpenOffice.org in your language.

Update, June 3, 2009: This is our wiki page.

In solidarity

Python isalpha is buggy
This code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2
ml_string = u"സന്തോഷ്  हिन्दी"
for ch in ml_string:
    if ch.isalpha():
        print ch

gives this output
സ
ന
ത
ഷ
ह
न
द
And it fails for all the matra signs of Indian languages. This is a known bug in glibc. Does anybody know whether Python internally uses glibc functions for these basic string operations, or a separate character database like Qt does?
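As a workaround, the Unicode character database can be consulted directly through the unicodedata module; matra signs carry the combining-mark categories Mc/Mn rather than a letter category. A minimal sketch (Python 3 syntax, unlike the snippet above):

```python
import unicodedata

text = u"സന്തോഷ് हिन्दी"
# Indic matra signs are category Mc (spacing mark) or Mn (non-spacing mark),
# so treat both letters (L*) and marks (M*) as word characters.
word_chars = [ch for ch in text
              if unicodedata.category(ch)[0] in ("L", "M")]
print("".join(word_chars))
```

This keeps the viramas and matras that isalpha() drops, at the cost of also keeping marks you may not want in other contexts.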

N-gram Visualization Experiment
The following image shows a python-graphviz generated visualization of an N-gram representation of the first paragraph of this article from the Hindi Wikipedia. The image represents the possible paths through which a sentence can be constructed if we start from the word भारत.
Click to view the enlarged image
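The structure behind such a visualization can be sketched as a bigram adjacency map: each word points to the set of words observed immediately after it. A toy illustration with a made-up Hindi sentence (the rendering with python-graphviz is omitted here):

```python
from collections import defaultdict

def bigram_graph(tokens):
    """Map each word to the set of words observed immediately after it."""
    graph = defaultdict(set)
    for first, second in zip(tokens, tokens[1:]):
        graph[first].add(second)
    return graph

# A made-up sentence, just to show the shape of the graph.
sentence = u"भारत एक देश है और भारत एक गणराज्य है".split()
graph = bigram_graph(sentence)
print(sorted(graph[u"भारत"]))  # the possible continuations of भारत
```

Feeding the edges of this map to graphviz gives exactly the kind of path diagram shown in the image.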


Localization: What are we missing?
[This blog post is a kind of self-criticism, written without forgetting the valuable contributions that l10n communities make.]
Some observations on localized desktops in Indian languages:
* Not all localization team members try the application they translate at least once before working on the PO file. Result: somebody who does the localization without understanding what the application does, and without trying the en_US interface, misses the context of the strings. An example I have seen: the string "Querying" was translated into an xx_IN string meaning "Questioning" instead of the required string corresponding to "Searching". Sometimes we fail to consider how much space the string will take on the screen, and translate a short English word into a long xx_IN string to make the meaning clear. Result: an ugly interface.

Tamil gedit from Ubuntu 8.10 (click to enlarge)
* Not all localization team members *try* the application they translated after completing the PO file, or even after the application is released. This happens when one person translates many applications (sometimes as part of his/her job).
* Practically, there is no process called *testing* the localized desktop in our SDLC. L10n members translate a PO file, and sometimes they translate it as a text file rather than as a user interface. We must bring in some process to make sure that the localized desktop is tested for usability, contextually correct translation, spelling mistakes, wrong shortcut keys, fuzzy strings, untranslated strings in the main interface, and so on.
* Since the ratio of team members to the total number of applications in a desktop environment is very low, we end up with one application translated by many people. Result: inconsistent translation and no ownership of translation quality. Ramadoss from the Tamil team suggested that, ideally, for each application there should be a person from each language who is responsible for timely translation and testing. He can take responsibility for more than one application, but not more than, say, 10. Practically, this requires a big l10n community per language, and unfortunately we don't have that as of now.
* Peer review, one of the important and mandatory processes in l10n, does not happen properly when the release date is approaching. L10n communities often try to meet the percentage of completion somehow. IMHO, the new l10n tool frameworks often fail to give importance to peer review in the workflows they design. The FOSS community, being inclusive in nature, welcomes new l10n contributors, and I have seen many members improve their l10n skills after making corrections as per review comments from others. But when a new l10n workflow allows every contributor to submit a translated PO file without peer review from the community, the ultimate result is a very bad user interface. We have seen this many times with Rosetta translations in Ubuntu. Everybody going there tries out the Rosetta "features" and leaves a few strings "translated" there, and Ubuntu takes those strings for its immediate release. Upstream translations are never taken on time, or the "translated" strings are never submitted upstream. Result: a very badly localized desktop with many spelling mistakes, inconsistent translations, etc. The ml_IN team used to watch who was "contributing" through Rosetta and get them to work with the community. I hope the new translation frameworks will give sufficient attention to this problem. If we do not keep a balance between newbie translation and quality assurance, our localized desktops will not improve.


Again Tamil gedit, but from Debian Lenny. Compare it with the Ubuntu version shown above (click to enlarge)



* User feedback: the number of users who use the desktop in their mother tongue is very low, even though translation coverage is above 80% for many languages. IMHO, it is because of a 'dependency conflict' among the following:
a) a person who is not good at English
b) a person who wants to use the computer in their mother language for some "purpose"
c) a person who is capable of spending ~Rs 20K on a computer
In most cases there is a conflict between two of the above, and it ends with either a) the person using the desktop in English, or b) the person not using a computer at all. I am sure that if there were a good number of users, we would not end up with interfaces like those in the screenshots.
* One inconsistency I noticed across localized desktops concerns the shortcut/accelerator keys. Some languages use English shortcut keys and give them at the end of the word in brackets, e.g. അടയ്ക്കുക (C). As you can see in the screenshots, sometimes we have a small letter and sometimes a capital letter for that. Some languages use letters from xx_IN itself, but there is no consistency. For the Control and Alt keys, some languages translate them; others keep them in English. What is the problem with English shortcut keys? To use an English shortcut key, the user should be using an English keyboard layout; for shortcut keys in xx_IN, one should be using an xx_IN keyboard layout. For a user (assume he uses the xx_IN desktop since he is not good at English) typing in xx_IN in gedit with an xx_IN keyboard, is it possible to use the shortcut keys if we give them in English? Are we expecting him to switch his keyboard layout just to use a shortcut key while typing a document? (By the way, has anybody noticed that Apple doesn't use accelerator keys in its OS?)

bn_IN gedit in Ubuntu 8.10 (click to enlarge)

ml_IN GNOME dictionary client in Ubuntu 8.10 (click to enlarge)

Suggestions/ideas are welcome. How can we make our localized desktops more beautiful and user friendly?

Updates...

KDE Indic Screensavers

I ported all of the Matrix screensavers with Indian language glyphs to KDE4. For details about the screensavers please read:

Download the binary packages: Deb package, and RPM package

There are six screensavers in the package: for Malayalam, Hindi, Oriya, Bengali, Tamil, and Gujarati. After installation, go to KDE System Settings -> Desktop -> Screensaver and select any of them.

Screenshots (click to get the image in its original size):

KDE Screensaver configuration for Hindi:

Enjoy...!

Hyphenation of Indian Languages in Webpages
In my last blog post I explained hyphenation of Indian language text in OpenOffice.org. In this blog post I will explain how hyphenation can be done in webpages.

As I explained, the importance of hyphenation comes into the picture when we justify text. The lengths of the lines are controlled by the parent tags. Unicode defines a special character for hyphenation called the soft hyphen (U+00AD). In HTML, the plain hyphen is represented by the "-" character, while the soft hyphen is represented by the character entity reference &shy; (or the numeric reference &#173;).

User agents (browsers) can break the line wherever a soft hyphen is found. So if we have a JavaScript-based implementation which inserts soft hyphens between syllables based on language-specific rules, we can achieve hyphenation in webpages too.
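The insertion step itself is tiny: given the break positions for a word, put U+00AD before each of them and let the browser do the rest. A language-neutral sketch (the break positions here are supplied by hand, not computed by real hyphenation rules):

```python
SOFT_HYPHEN = u"\u00ad"

def insert_soft_hyphens(word, break_positions):
    """Insert U+00AD before each given character index of the word."""
    out = []
    for i, ch in enumerate(word):
        if i in break_positions:
            out.append(SOFT_HYPHEN)
        out.append(ch)
    return u"".join(out)

# Break positions supplied by hand, purely for illustration.
print(insert_soft_hyphens(u"hyphenation", {2, 5, 7}))
```

The browser then treats each U+00AD as an optional break point: invisible inside a line, rendered as a hyphen only when the line actually breaks there.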

Hyphenator is a project which does exactly the same. "Hyphenator.js brings client-side hyphenation of HTML-Documents on to every browser by inserting soft hyphens using hyphenation patterns and Frank M. Liangs hyphenation algorithm commonly known from LaTeX and Openoffice. "

Hyphenator had not been tested with any non-Latin languages so far. I tried adding support for Indian languages, and the result was satisfactory. I used the same rules I defined for OpenOffice.org. Unlike Latin languages, the number of hyphenation patterns for Indian languages is very small, and performance is good because of that.

I have added Malayalam, Tamil, Hindi, Oriya, Kannada, Telugu, Bengali, Gujarati, and Punjabi support to it. You can see a working example here. (I wanted to embed an example here, but LiveJournal does not allow JavaScript inside the blog body.) The column layout is done with CSS. Try resizing the browser window, and try a print preview too.

Don't forget to read the source code of that page; it is very simple. If you want hyphenation in your webpage, all you need to do is include the JavaScript as done in the example. We need to provide the lang attributes on nodes so that the required patterns for each language can be loaded. I have placed the new language patterns temporarily in the download area of SMC, and I will ask the author of Hyphenator to include them upstream. Code is available here.


Update (18-Dec-2008): Thanks to Mathias Nater, author of Hyphenator, the patterns were added upstream.

Hyphenation of Indian Languages and Openoffice
What is Hyphenation?

Hyphenation is the process of inserting hyphens between the syllables of a word so that when the text is justified, maximum space is utilized.

Hyphenation is an important feature that DTP software provides. For Indian languages there is no good DTP software available. XeTeX is the only choice for working with Unicode and professional-quality page layout, but XeTeX and DTP are not exactly the same thing. Inkscape can be used as a temporary solution, but only for small-scale work. There is a project going on to add a HarfBuzz backend to Scribus, the freedomware DTP package.

Hyphenation is also required in many other places. Actually, it is required wherever we 'justify' a block of text, in OpenOffice.org or any word processor. The same goes for webpages. If we justify a block of text in ml_IN, let us see what happens now:

Note the long gaps between words. This is a screenshot taken from Firefox. The default behaviour just breaks the lines at space characters, and no doubt it makes the pages ugly. The problem becomes worse when words are long and the column width is small.

So what is the solution?

Ideal solution: applications should be aware of the language and its hyphenation rules, and should do the hyphenation wherever required.

OpenOffice.org can take hyphenation dictionaries just like spell checkers. But for Indian languages, we are yet to prepare hyphenation dictionaries (more on that later). The CSS3 draft from the W3C has a provision for hyphenation, but it is still in the draft stage.

Algorithm for Hyphenation

The basis for all hyphenation algorithms is the algorithm designed by Frank Liang in 1983, which is adopted in TeX. The Wikipedia article on TeX explains it with a very simple example:

If TeX must find the acceptable hyphenation positions in the word encyclopedia, for example, it will consider all the subwords of the extended word .encyclopedia., where . is a special marker to indicate the beginning or end of the word. The list of subwords include all the subwords of length 1 (., e, n, c, y, etc), of length 2 (.e, en, nc, etc), etc, up to the subword of length 14, which is the word itself, including the markers. TeX will then look into its list of hyphenation patterns, and find subwords for which it has calculated the desirability of hyphenation at each position. In the case of our word, 11 such patterns can be matched, namely 1c4l4, 1cy, 1d4i3a, 4edi, e3dia, 2i1a, ope5d, 2p2ed, 3pedi, pedia4, y1c. For each position in the word, TeX will calculate the maximum value obtained among all matching pattern, yielding en1cy1c4l4o3p4e5d4i3a4. Finally, the acceptable positions are those indicated by an odd number, yielding the acceptable hyphenations en-cy-clo-pe-di-a. This system based on subwords allows the definition of very general patterns (such as 2i1a), with low indicative numbers (either odd or even), which can then be superseded by more specific patterns (such as 1d4i3a) if necessary. These patterns find about 90% of the hyphens in the original dictionary; more importantly, they do not insert any spurious hyphen. In addition, a list of exceptions (words for which the patterns do not predict the correct hyphenation) are included with the Plain TeX format; additional ones can be specified by the user.
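The quoted example can be reproduced in a few lines. The sketch below (assuming nothing beyond the patterns listed in the quote; TeX's additional \lefthyphenmin/\righthyphenmin limits and exception list are omitted) parses each pattern into letters plus inter-letter point values, takes the maximum value at every position, and breaks where the result is odd:

```python
def parse(pattern):
    """Split a Liang pattern like '1c4l4' into its letters and point values."""
    letters, points, pending = [], [], 0
    for ch in pattern:
        if ch.isdigit():
            pending = int(ch)
        else:
            points.append(pending)
            letters.append(ch)
            pending = 0
    points.append(pending)
    return "".join(letters), points

def hyphenate(word, patterns):
    """Insert '-' at every inter-letter position whose best score is odd."""
    w = "." + word + "."          # '.' marks the word boundaries
    scores = [0] * (len(w) + 1)   # one score per inter-character position
    for letters, points in map(parse, patterns):
        for i in range(len(w) - len(letters) + 1):
            if w[i:i + len(letters)] == letters:
                for j, p in enumerate(points):
                    scores[i + j] = max(scores[i + j], p)
    out = ""
    for k, ch in enumerate(word):
        if k and scores[k + 1] % 2 == 1:
            out += "-"
        out += ch
    return out

patterns = ["1c4l4", "1cy", "1d4i3a", "4edi", "e3dia", "2i1a",
            "ope5d", "2p2ed", "3pedi", "pedia4", "y1c"]
print(hyphenate("encyclopedia", patterns))  # en-cy-clo-pe-di-a
```

Running it reproduces the scored word en1cy1c4l4o3p4e5d4i3a4 from the quote: the odd positions give en-cy-clo-pe-di-a.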

For more details about the algorithm used in OpenOffice.org, read this paper by Nemeth Laszlo.

Hyphenation in Indian languages

Unlike in English and many other languages, hyphenation in Indian languages is not that complex. In general, the rules are the following:

  • [consonant][vowel][consonant] can be hyphenated as [consonant][vowel] - [consonant], if the vowel is not a virama (halant)
  • Don't split a word after ZWJ
  • We can split a word after ZWNJ
  • plus any language-specific rules. For example, in ml_IN a line should not start with a chillu letter.

Hyphenation Dictionaries for Indian languages

Based on the rules above, let us try to create hyphenation dictionaries for Indian languages. I will explain this with the help of a Hindi example word: अनुपल्ब्ध. We have to define the following rules in the dictionary for it:
अ1 -> 1 is an odd number, i.e. the word can be split after अ
ु1 -> 1 is an odd number, i.e. the word can be split after ु
1ल -> 1 is an odd number, i.e. the word can be split before ल
1प -> 1 is an odd number, i.e. the word can be split before प
1ब -> 1 is an odd number, i.e. the word can be split before ब
्2 -> 2 is an even number, i.e. the word can NOT be split after ्
1ध -> 1 is an odd number, i.e. the word can be split before ध
So the end result is अ+नु+प+ल्ब्ध
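The same result falls out of a direct check of the general rule stated earlier: a break is allowed before a letter unless the previous character is a virama. A minimal sketch using Unicode categories (language-specific exceptions such as the ml_IN chillu rule are not handled):

```python
import unicodedata

VIRAMA = u"\u094d"  # Devanagari virama/halant

def split_points(word):
    """Indices where a break is allowed: before an independent letter
    (category Lo) whose previous character is not a virama."""
    return [i for i in range(1, len(word))
            if unicodedata.category(word[i]) == "Lo"
            and word[i - 1] != VIRAMA]

def hyphenate(word):
    pieces, prev = [], 0
    for i in split_points(word):
        pieces.append(word[prev:i])
        prev = i
    pieces.append(word[prev:])
    return u"+".join(pieces)

print(hyphenate(u"अनुपल्ब्ध"))  # अ+नु+प+ल्ब्ध
```

The virama check is what keeps conjuncts like ल्ब्ध together, matching the dictionary rules listed above.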

In the same way, we can create hyphenation dictionaries for all the other languages. I have prepared hyphenation dictionaries for 8 Indian languages. Download them from the git repo of SMC.
How to install an xx_IN hyphenation dictionary
  • Copy the hyphenation dictionary hyph_xx_IN to the /usr/share/myspell/dicts folder.
  • Create a file in the /usr/share/myspell/infos/ooo/ folder named openoffice.org-hyphenation-xx with one line of content:
    HYPH xx IN hyph_xx_IN
  • Run the command sudo update-openoffice-dicts

Open OpenOffice.org Writer, open some file in your language or type some text, and justify the text. Set the language of the selection using the Tools -> Language menu, then hyphenate it using the Tools -> Language -> Hyphenation menu.

Hope it works :). I tested only Hindi and Malayalam; for the other languages, inform me if you see any problems or if it is not working. Here is the hyphenated Malayalam paragraph. Compare it with the image I showed at the beginning of this blog post.

OK, so once these hyphenation dictionaries are tested, provided upstream, and packaged, the hyphenation problems in OpenOffice.org are solved. :)

But... how do we solve this problem in web pages? We will discuss it in the next blog post!
PS: Thanks to Nemeth Laszlo, author of Hunspell and of the OpenOffice.org hyphenation support, for helping me prepare the hyphenation tables.


Update (Apr 16, 2009): The hyphenation dictionaries were packaged for Fedora and will be part of Fedora 11.
