
#!/usr/bin/env python
# -*- coding: utf-8 -*-
ml_string = u"സന്തോഷ് हिन्दी"
for ch in ml_string:
    # isalpha() is False for the dependent vowel signs (matras) and the virama
    if ch.isalpha():
        print(ch)
Output:
സ ന ത ഷ ह न द
The isalpha() check fails for all the matra signs of Indian languages. This is a known bug in glibc. Does anybody know whether Python internally uses glibc functions for these basic string operations, or a separate character database like Qt does?
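One way to see what is going on, sketched here in Python 3: CPython ships its own copy of the Unicode Character Database (exposed through the unicodedata module), and str.isalpha() accepts only the letter categories (Lu, Ll, Lt, Lm, Lo). The matras and the virama are in the mark categories (Mc, Mn), so they are rejected. The helper is_word_char below is a hypothetical name, not part of any library; it simply treats marks as part of a word too.

```python
import unicodedata


def is_word_char(ch):
    """Treat letters (L*) and combining marks (M*) as parts of a word.

    Hypothetical helper for illustration; not a standard library function.
    """
    return unicodedata.category(ch)[0] in ("L", "M")


text = "സന്തോഷ് हिन्दी"
for ch in text:
    if ch.isspace():
        continue
    # Show the code point, its general category and both checks side by side
    print("U+%04X %s isalpha=%s word_char=%s"
          % (ord(ch), unicodedata.category(ch), ch.isalpha(), is_word_char(ch)))
```

A word splitter that uses a check like is_word_char instead of isalpha() keeps the vowel signs attached to their consonants.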
I ported all of the Matrix screensavers with Indian language glyphs to KDE4. For details about the screensavers please read:
Download the binary packages: Deb package, and RPM package
There are 6 screensavers in that package, for Malayalam, Hindi, Oriya, Bengali, Tamil and Gujarati. After installation, go to KDE System Settings->Desktop->Screensaver and select any of them.
Screenshots (click to get the image in original size):
As I explained, the importance of hyphenation comes into the picture when we justify text. The length of the lines is controlled by the parent tags.... Unicode defines a special character called the soft hyphen (U+00AD) for hyphenation, written in HTML as &shy;. In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).
User agents (browsers) can break the line wherever a soft hyphen is found. So if we have a JavaScript-based implementation that inserts soft hyphens inside words according to language-specific rules, we can achieve hyphenation in web pages too.
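The idea of inserting soft hyphens can be sketched in a few lines of Python 3. This is only an illustration of the mechanism, not how any particular library does it: add_soft_hyphens and split_in_threes are made-up names, and the splitter here just cuts every three characters, where a real one would apply language-specific rules.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, what &shy; stands for in HTML


def split_in_threes(word):
    """Placeholder splitter: cut the word every three characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def add_soft_hyphens(text, split_word):
    """Join the pieces of every word with an invisible soft hyphen."""
    return " ".join(SOFT_HYPHEN.join(split_word(word))
                    for word in text.split(" "))


result = add_soft_hyphens("abcdef gh", split_in_threes)
print(repr(result))
```

The browser shows a hyphen only at the soft hyphen where it actually breaks the line; the others stay invisible.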
Hyphenator is a project which does exactly the same. "Hyphenator.js brings client-side hyphenation of HTML-Documents on to every browser by inserting soft hyphens using hyphenation patterns and Frank M. Liangs hyphenation algorithm commonly known from LaTeX and Openoffice. "
Hyphenator had not been tested for any non-Latin languages so far. I tried to add support for Indian languages and the result was satisfactory. I used the same rules I defined for OpenOffice. Unlike Latin-script languages, Indian languages need only a small number of hyphenation patterns, and the performance is good because of that.
I have added Malayalam, Tamil, Hindi, Oriya, Kannada, Telugu, Bengali, Gujarati and Panjabi support to it. You can see a working example here. (I wanted to embed an example in this post, but LiveJournal does not allow JavaScript inside the blog body.) The column layout is done with CSS. Try resizing the browser window, and try a print preview too.
Don't forget to read the source code of that page. It is very simple. If you want hyphenation in your web page, all you need to do is include the JavaScript as done in the example. We need to provide lang attributes on the nodes so that the patterns for that language can be loaded. I have placed the new language patterns temporarily in the download area of SMC, and I will ask the author of Hyphenator to include them upstream. The code is available here.
Hyphenation is the process of inserting hyphens between the syllables of a word so that when the text is justified, maximum space is utilized.
Hyphenation is an important feature that DTP software provides. For Indian languages there is no good DTP software available. XeTeX is the only choice for working with Unicode and professional-quality page layout, but XeTeX and DTP are not exactly the same. Inkscape can be used as a temporary solution, but only for small-scale work. There is a project going on to add a HarfBuzz backend to Scribus, the freedomware DTP package.
Hyphenation is also required in many other places. Actually it is required wherever we 'justify' a block of text in OpenOffice or any other word processor. The same is true of web pages. If we justify a block of text in ml_IN, let us see what happens now:
Note the long gaps between words. This is a screenshot taken from Firefox. By default, lines are broken only at space characters, and no doubt that makes the pages ugly. The problem becomes worse when words are long and columns are narrow.
So what is the solution?
Ideal solution: applications should be aware of the language and its hyphenation rules, and should do the hyphenation wherever required.
OpenOffice can take hyphenation dictionaries just like spell-check dictionaries. But for Indian languages, we are yet to prepare hyphenation dictionaries (more on that later). The W3C's CSS3 draft has a provision for hyphenation, but it is still at the draft stage.
Algorithm for hyphenation
The basis of all hyphenation algorithms is the one designed by Frank Liang in 1983 and adopted in TeX. The Wikipedia article on TeX explains it with a very simple example:
If TeX must find the acceptable hyphenation positions in the word encyclopedia, for example, it will consider all the subwords of the extended word .encyclopedia., where . is a special marker to indicate the beginning or end of the word. The list of subwords include all the subwords of length 1 (., e, n, c, y, etc), of length 2 (.e, en, nc, etc), etc, up to the subword of length 14, which is the word itself, including the markers. TeX will then look into its list of hyphenation patterns, and find subwords for which it has calculated the desirability of hyphenation at each position. In the case of our word, 11 such patterns can be matched, namely 1c4l4, 1cy, 1d4i3a, 4edi, e3dia, 2i1a, ope5d, 2p2ed, 3pedi, pedia4, y1c. For each position in the word, TeX will calculate the maximum value obtained among all matching pattern, yielding en1cy1c4l4o3p4e5d4i3a4. Finally, the acceptable positions are those indicated by an odd number, yielding the acceptable hyphenations en-cy-clo-pe-di-a. This system based on subwords allows the definition of very general patterns (such as 2i1a), with low indicative numbers (either odd or even), which can then be superseded by more specific patterns (such as 1d4i3a) if necessary. These patterns find about 90% of the hyphens in the original dictionary; more importantly, they do not insert any spurious hyphen. In addition, a list of exceptions (words for which the patterns do not predict the correct hyphenation) are included with the Plain TeX format; additional ones can be specified by the user.
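The procedure quoted above can be sketched in Python 3 as follows. It uses only the 11 patterns named in the quote, so it handles this one word and nothing else; the function names are my own, and the minimum prefix/suffix lengths that real TeX enforces are left out.

```python
DIGITS = "0123456789"

# The 11 patterns listed in the quoted example
PATTERNS = ["1c4l4", "1cy", "1d4i3a", "4edi", "e3dia", "2i1a",
            "ope5d", "2p2ed", "3pedi", "pedia4", "y1c"]


def parse_pattern(pattern):
    """Split a pattern such as '1d4i3a' into ('dia', [1, 4, 3, 0])."""
    letters = "".join(ch for ch in pattern if ch not in DIGITS)
    weights = [0] * (len(letters) + 1)
    position = 0
    for ch in pattern:
        if ch in DIGITS:
            weights[position] = int(ch)
        else:
            position += 1
    return letters, weights


def hyphenate(word, patterns):
    """Return the pieces of word, split wherever the highest weight is odd."""
    parsed = [parse_pattern(p) for p in patterns]
    extended = "." + word + "."  # '.' marks the word boundaries
    # points[k] is the weight of the gap just before extended[k]
    points = [0] * (len(extended) + 1)
    for start in range(len(extended)):
        for letters, weights in parsed:
            if extended.startswith(letters, start):
                for offset, weight in enumerate(weights):
                    points[start + offset] = max(points[start + offset], weight)
    pieces = [""]
    for index, ch in enumerate(word):
        # word[index] is extended[index + 1]
        if index > 0 and points[index + 1] % 2 == 1:
            pieces.append("")
        pieces[-1] += ch
    return pieces


print("-".join(hyphenate("encyclopedia", PATTERNS)))  # en-cy-clo-pe-di-a
```

Taking the maximum at each gap is what lets a specific pattern such as 1d4i3a overrule a general one such as 2i1a.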
For more details about the algorithm used in Openoffice read this paper by Nemeth Laszlo
Hyphenation in Indian languages
Unlike in English and other languages, hyphenation in Indian languages is not that complex. In general, the following are the rules:
Based on the above rules, let us try to create hyphenation dictionaries for Indian languages. I will explain this with the help of a Hindi example word: अनुपल्ब्ध.
We have to define the following rules in the dictionary for this:
अ1 -> 1 is an odd number, i.e. the word can be split after अ
ु1 -> 1 is an odd number, i.e. the word can be split after ु
1ल -> 1 is an odd number, i.e. the word can be split before ल
1प -> 1 is an odd number, i.e. the word can be split before प
1ब -> 1 is an odd number, i.e. the word can be split before ब
्2 -> 2 is an even number, i.e. the word can NOT be split after ्. Since the highest number at a position wins, this overrides 1ब and 1ध when they follow ्
1ध -> 1 is an odd number, i.e. the word can be split before ध
So the end result is अ+नु+प+ल्ब्ध
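Because every rule above involves a single character, the lookup can be sketched very compactly in Python 3. This is a simplified illustration for exactly these seven rules, not a general pattern matcher; AFTER, BEFORE and split_word are names made up for the example.

```python
# Weight of the gap AFTER a character (rules written as "X1" / "X2")
AFTER = {"अ": 1, "ु": 1, "्": 2}
# Weight of the gap BEFORE a character (rules written as "1X")
BEFORE = {"ल": 1, "प": 1, "ब": 1, "ध": 1}


def split_word(word):
    """Split word at every gap whose highest weight is odd."""
    if not word:
        return []
    pieces = [word[0]]
    for previous, following in zip(word, word[1:]):
        # Both neighbours may have an opinion about this gap; the
        # larger number wins, and only an odd winner allows a break.
        weight = max(AFTER.get(previous, 0), BEFORE.get(following, 0))
        if weight % 2 == 1:
            pieces.append(following)
        else:
            pieces[-1] += following
    return pieces


print("+".join(split_word("अनुपल्ब्ध")))
```

At the gap between ् and ब, the 2 from ्2 beats the 1 from 1ब, so the conjunct ल्ब्ध stays in one piece.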
Open OpenOffice Writer, and open some file in your language or type some text. Justify the text. Set the language of the selection using the Tools->Language menu, then hyphenate it using the Tools->Language->Hyphenation menu.
Hope it works :). I tested only Hindi and Malayalam. For other languages, let me know if you see any problems or if it is not working. Here is the hyphenated Malayalam paragraph. Compare it with the image I showed at the beginning of this blog post.
OK. So after testing these hyphenation dictionaries, if we get them submitted upstream and packaged, the hyphenation problem in OpenOffice is solved. :)
But... how do we solve this problem in web pages? We will discuss that in the next blog post!
PS: Thanks to Nemeth Laszlo , author of Hunspell and Openoffice Hyphenation for helping me to prepare the hyphenation tables.
Loop through the chars of the word until the current char is no longer a letter. For this, it uses the QChar::isLetter() function. This function fails for the matra signs of our languages.
A screenshot from a text area in Konqueror:
For example
#include <QtCore/QChar>
#include <cstdio>
int main()
{
    QChar letter;
    letter = QChar(0x0B85);  // அ, TAMIL LETTER A
    printf("%d\n", letter.isLetter());
    letter = QChar(0x0940);  // ी, DEVANAGARI VOWEL SIGN II
    printf("%d\n", letter.isLetter());
    return 0;
}
This program prints 1 (true) for அ and 0 (false) for ी.
When I showed this to Sayamindu during foss.in, he showed me a bug in glibc. Even though the bug is about Bengali, it applies to all our languages. It is assigned to Pravin Satpute, and he told me that he has a solution and will be submitting it to glibc soon.
But I am wondering why this bug in KDE has gone unnoticed so far. Has nobody used spell check for Indian languages in KDE?!
Let me explain why this does not happen in the GNOME spell checker if this is a glibc bug. In GNOME, the word splitting is done in the application itself using gtk_text_iter_*, and the iteration through words is done by Pango's word boundary detection algorithms.
Filed a bug in KDE to track it.
$ clive http://in.youtube.com/watch?v=6JeZ5oeAEyU
(replace this with the YouTube address you want)
It will create an flv file.
$ ffmpeg -i AmericaAmerica.flv AmericaAmerica.mpg
$ ffmpeg2theora AmericaAmerica.mpg
(replace it with the name of the flv file the previous command created)
A new version of Dhvani, the Indian Language Text to Speech System, is available now. The new version comes with the following improvements/features
There was a good amount of code change in this version. Still, there are many improvements to make in the language modules and the synthesizer. Some of the language modules need developers who speak that language. The synthesizer got some improvements, but needs some research to make the speech more natural. So your feedback, suggestions, bug reports and patches are valuable.
PS: A note for quick usage after installation from binary: after installing the deb or rpm, open gedit, go to Edit->Preferences->Plugins and enable External Tools. Dhvani will be available as a plugin there. Select some text in any of the supported languages and click the Dhvani menu.
A gram is a segment of text made of N number of characters. Sonnet uses trigrams, made from three characters. By analyzing the popularity of any given trigram within a text, one may make assumptions about the language the text is written in. Rideout gives an example: "The top trigram for our English model is '_th' and for Spanish '_de'. Therefore, if the text contains many words that start with 'th' and no words that start with 'de,' it is more likely the text is in English [than Spanish]. Additionally, there are several optimizations which include only checking the language against languages with similar scripts and some heuristics that use the language of neighboring text as a hint."
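The trigram idea can be illustrated with a toy Python 3 sketch. This is not Sonnet's implementation: the two "models" are built from one made-up sample sentence each, where a real model would be trained on a large corpus and would rank trigrams by frequency, and all function names are invented for the example. As in the quote, "_" marks a word boundary.

```python
from collections import Counter


def trigrams(text):
    """Count the trigrams of every word, with '_' marking word boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def score(text, model):
    """Number of trigram occurrences in text that the model knows."""
    return sum(n for gram, n in trigrams(text).items() if gram in model)


def guess_language(text, models):
    """Pick the language whose model knows most of the text's trigrams."""
    return max(models, key=lambda lang: score(text, models[lang]))


# Toy models from made-up sample sentences (an assumption for illustration)
SAMPLES = {
    "en": "the quick brown fox thinks that the other things are there",
    "es": "el perro de la casa de mi padre es de color negro",
}
MODELS = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}

print(guess_language("the weather there", MODELS))
print(guess_language("de la casa", MODELS))
```

With such tiny samples the guess is only as good as the overlap happens to be; the optimizations mentioned in the quote (script checks, hints from neighbouring text) are not modelled here.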
>>> "സന്തോഷ്".decode("utf-8")
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
>>> str=u"സന്തോഷ്"
>>> print repr(str)
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
"Development of effective geo-visualisation based decision support system (DSS) involved primarily data compilation from collateral sources, setting up appropriate hardware configuration, design of database and design of a spatial DSS. "Jaisen used softwares like GRASS, UMN MapServer and ka-Map. He has written a detailed documentation(English) on how he developed this and what are all the tools used.
/*
UTF8Decoder.c
This program converts a utf-8 encoded string to utf-16 hexadecimal code sequence
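UTF-8 is a variable-width encoding of Unicode.
UTF-16 encodes every character of the Basic Multilingual Plane (U+0000 to U+FFFF)
as one two-byte unit; characters beyond it take two units (a surrogate pair).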
A UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to
encode a character. For example, the character U+000A (line feed) must be accepted from
a UTF-8 stream only in the form 0x0A, but not in any of the following five possible overlong forms:
0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A
Ref: UTF-8 and Unicode FAQ for Unix/Linux http://www.cl.cam.ac.uk/~mgk25/unicode.html
Author: Santhosh Thottingal <santhosh.thottingal at gmail.com>
License: This program is licensed under GPLv3 or later version(at your choice)
*/
#include<stdlib.h>
#include<stdio.h>
#include<string.h>
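unsigned short
utf8_to_utf16 (unsigned char *text, int *ptr)
{
  unsigned int c;       /* decoded code point */
  unsigned int min;     /* smallest code point allowed for this length */
  int trailing;         /* number of continuation bytes expected */
  unsigned char lead = text[(*ptr)++];  /* always consume the lead byte */
  if (lead < 0x80)      /* 0xxxxxxx: plain ASCII */
    return lead;
  else if ((lead & 0xE0) == 0xC0)       /* 110xxxxx: 2-byte sequence */
    {
      c = lead & 0x1F;
      trailing = 1;
      min = 0x80;
    }
  else if ((lead & 0xF0) == 0xE0)       /* 1110xxxx: 3-byte sequence */
    {
      c = lead & 0x0F;
      trailing = 2;
      min = 0x800;
    }
  else if ((lead & 0xF8) == 0xF0)       /* 11110xxx: 4-byte sequence */
    {
      c = lead & 0x07;
      trailing = 3;
      min = 0x10000;
    }
  else
    return 0xFFFD;      /* stray continuation byte or invalid lead byte */
  for (; trailing; trailing--)
    {
      if ((text[*ptr] & 0xC0) != 0x80)
        return 0xFFFD;  /* truncated sequence; decoding resumes at this byte */
      c = (c << 6) | (text[(*ptr)++] & 0x3F);
    }
  if (c < min)
    return 0xFFFD;      /* overlong form, must not be accepted */
  /* Code points above U+FFFF need a surrogate pair, which does not fit
     in one unsigned short; they are reported as U+FFFD here. */
  if (c > 0xFFFF)
    return 0xFFFD;
  return (unsigned short) c;
}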
/* for testing */
int
main ()
{
char *instr = "സന്തോഷ് തോട്ടിങ്ങല്"; /* my name :) */
  int length = strlen (instr);
  int i = 0;
  while (i < length)
    {
      /* utf8_to_utf16 advances i past the bytes it consumes */
      printf ("0x%.4x ", utf8_to_utf16 ((unsigned char *) instr, &i));
    }
printf ("\n");
/* output is:
0x0d38 0x0d28 0x0d4d 0x0d24 0x0d4b 0x0d37 0x0d4d 0x0020 0x0d24 0x0d4b 0x0d1f 0x0d4d 0x0d1f 0x0d3f 0x0d19 0x0d4d 0x0d19 0x0d32 0x0d4d 0x200d
*/
return 0;
}
There may already be libraries for this, but writing a simple one ourselves is fun and a good learning experience.
For example, in Python, to get the code sequence for a unicode string (for these characters it is the same as the UTF-16 code units), we can use this:
name = u"സന്തോഷ്"
print repr(name)
u'\u0d38\u0d28\u0d4d\u0d24\u0d4b\u0d37\u0d4d'
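The repr() trick shows code points, which only coincide with UTF-16 for characters up to U+FFFF. A small Python 3 sketch that asks the codec for the real UTF-16 code units, surrogate pairs included, could look like this; utf16_code_units is a name made up for the example.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = struct.unpack(">%dH" % (len(data) // 2), data)
    return ["0x%04x" % unit for unit in units]


print(" ".join(utf16_code_units("സന്തോഷ്")))
# A character outside the Basic Multilingual Plane becomes two units
print(" ".join(utf16_code_units("\U0001F600")))
```

For the Malayalam string this prints the same seven values as the repr() output above; the second call shows a surrogate pair.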