
Posts

Showing posts from August, 2007

Changes in Java to support supplementary Unicode characters

Supporting supplementary characters might require changes to the Java language as well as the API. A few questions come to mind. How do we support supplementary characters at the primitive level (char is only 16 bits)? How do we support them in low-level APIs (such as the static methods of the Character class)? How do we support them in high-level APIs that deal with character sequences? How do we support them in Java literals? How do we support them in Java source files? The expert committee that worked on JSR-204 dealt with all these questions and many more (I'm sure). After deliberating and experimenting with how the changes would affect code, they came up with the following solution. The primitive char was left unchanged. It is still 16 bits, and no other type has been added to the Java language to support the supplementary range of Unicode characters. Low-level APIs, such as

It's been a while since I posted

It's been a week since I last posted. I am really sorry; this is the second time in a row that I have missed my target of posting at least thrice a week. By way of an excuse, all I have is a lame "it's been a bit crazy at work". I am messing around with a lot of client-side technologies, like AJAX and the plethora of libraries that accompany it, and all this without really understanding JavaScript well enough. One of the libraries I am checking out is DWR (Direct Web Remoting). It allows JavaScript code to invoke Java objects. This is done by creating proxy objects in JavaScript that make AJAX calls to the DWR Servlet, which in turn invokes the Java objects. I personally think it's a very nice concept, and it also supports reverse AJAX. Would you like to know more about DWR? Please comment and let me know. I will then post a series on DWR after completing the current one on Unicode characters. On a total tangent, here's a little something from

Supplementary character support in Java

In the last post I wrote that supplementary characters in the Unicode standard are in the range above U+FFFF, which means they need more than 16 bits to represent them. Since the char primitive type in Java is only 16 bits wide, we have to use two chars for each of them. I just finished reading some material on supplementary character support in Java, and, well, there are parts I understood right away and parts that are going to need further reading. I will try to share what I am learning on this blog. However, let us first clarify some terminology. Character: an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries. Character set: a collection of characters. Unicode is a coded character set that assigns a unique number to every character defined in the Unicode
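To make the "two chars per supplementary character" point concrete, here is a minimal sketch. The class name `SurrogatePairDemo`, the helper `toUtf16`, and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample character are my own picks for illustration; the `Character` methods used are the ones added in Java 5.

```java
class SurrogatePairDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF, so it is a
    // supplementary character. The choice of this character is arbitrary.
    static final int G_CLEF = 0x1D11E;

    // Hypothetical helper: returns the char (UTF-16 code unit) sequence
    // that Java uses to represent the given code point.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(G_CLEF);

        // A supplementary character needs two chars.
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // The two chars combine back into the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == G_CLEF); // true

        // A character at or below U+FFFF still fits in a single char.
        System.out.println(toUtf16('A').length);                 // 1
    }
}
```

Code points are passed around as `int` here because a single char cannot hold a value above U+FFFF.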

Catching up with Java 5

Java 5 (a.k.a. Tiger) has been around for a while. But there are still many developers (including myself) who do not know about or use all its features. So, in an effort to educate myself and help others, I have decided to spend some time every day reading Java 1.5 Tiger: A Developer's Notebook, and share my findings on this blog. Something I found out today (I know this should have happened long ago, but such is the profession of programming :-) ) is that since Java 1.5 there is support for Unicode 4, which includes supplementary characters that go beyond 16 bits. An interesting implication is that the char data type can no longer hold every character, because those in the supplementary range can take up to 21 bits. This means that a string containing such characters has to encode each of them as two char values. Such a pair of chars that represents one code point is known as a surrogate pair. Now a string with n codepoints m
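As a small illustration of the surrogate pair idea, the sketch below compares the number of chars in a string with the number of code points it holds. The class name `CodePointCountDemo`, the helper names, and the sample string are assumptions made up for this example; `String.codePointCount`, `String.codePointAt`, and `Character.charCount` are the Java 5 additions it relies on.

```java
class CodePointCountDemo {
    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Sample string: 'a', then U+1D11E (a supplementary character), then 'b'.
        String s = "a" + new String(Character.toChars(0x1D11E)) + "b";

        System.out.println(codeUnits(s));  // 4: the supplementary character takes two chars
        System.out.println(codePoints(s)); // 3

        // Walk the string one code point at a time instead of one char at a time.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advances by 1 or 2 chars
        }
    }
}
```

The loop steps by `Character.charCount(cp)` rather than by one, so a surrogate pair is never split in the middle.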