Skip to main content

Changes in Java to support supplementary Unicode characters

Support for supplementary characters might need changes in the Java language as well as the API. A few questions come to mind.

  • How do we support supplementary characters at the primitive level (char is only 16 bits)?
  • How do we support supplementary characters in low level API's (such as the static methods of the Character class) ?
  • How do we support supplementary characters in high level API's that deal with character sequences?
  • How do we support supplementary characters in Java literals?
  • How do we support supplementary characters in Java source files?

The expert commitee that worked on JSR-204 dealt with all these questions and many more (I'm sure) . After deliberating as well as experimenting with how the changes would affect code, they came up with the following solution.

The primitive char was left unchanged. It is still 16 bits and no other type has been added to the Java language to support the supplementary range of unicode characters.

 Low level API's, such as static methods of the Character class, accepted the char primitive type before support for supplementary characters was provided in Java. However, since Java 5.0, methods such as isLetter(...) of the Character class provide an overloaded method that accepts an int representing the code point, along with the earlier method that accepted a char.

 
JavaCharacterAPI.JPG 

 

High level API's will continue to work "as is" for most developers. They represent character sequences as UTF-16 sequences. Some methods in String and StringBuffer now have parrallel methods to work with code points. Some such methods are codePointAt(...) , codePointBefore(...), and codePointCount(). For example the codePointCount() method returns the number of code points in a String, which may not be the same as the number of characters in the String, if some characters are from the supplementary range and are represented as surrogate pairs.

 

JavaStringMethodsForUnicode.JPG 

 

Identifiers in Java can contain any letter or digit. Many supplementary characters are letters or digits. To allow supplementary characters to be used in identifiers, the Java compiler and other tools were modified to use different API methods (isJavaIdentifierPart(int), isJavaIdentifierStart(int)).

Since we need to support supplementary characters all the way, they also need to be supported in Java source files. I will discuss how to include unicode characters in Java source files and get them to compile using the Java compilers -encode option, in the next blog post.

While I was reading about encoding, I came accross this interesting blog post that describes a situation when an I18N enables Java program ceased to work after the build machine was moved from a Windows box to a Red Hat box. The reason of course was encoding related issues.

 



Note: This text was originally posted on my earlier blog at http://www.adaptivelearningonline.net

Comments

Popular posts from this blog

My HSQLDB schema inspection story

This is a simple story of my need to inspect the schema of an HSQLDB database for a participar FOREIGN KEY, and the interesting things I had to do to actually inspect it. I am using an HSQLDB 1.8 database in one of my web applications. The application has been developed using the Play framework , which by default uses JPA and Hibernate . A few days back, I wanted to inspect the schema which Hibernate had created for one of my model objects. I started the HSQLDB database on my local machine, and then started the database manager with the following command java -cp ./hsqldb-1.8.0.7.jar org.hsqldb.util.DatabaseManagerSwing When I tried the view the schema of my table, it showed me the columns and column types on that table, but it did not show me columns were FOREIGN KEYs. Image 1: Table schema as shown by HSQLDB's database manager I decided to search on StackOverflow and find out how I could view the full schema of the table in question. I got a few hints, and they all pointed to

Fuctional Programming Principles in Scala - Getting Started

Sometime back I registered for the Functional Programming Principles in Scala , on Coursera. I have been meaning to learn Scala from a while, but have been putting it on the back burner because of other commitments. But  when I saw this course being offered by Martin Odersky, on Coursera , I just had to enroll in it. This course is a 7 week course. I will blog my learning experience and notes here for the next seven weeks (well actually six, since the course started on Sept 18th). The first step was to install the required tools: JDK - Since this is my work machine, I already have a couple of JDK's installed SBT - SBT is the Scala Build Tool. Even though I have not looked into it in detail, it seems like a replacement for Maven. I am sure we will use it for several things, however upto now I only know about two uses for it - to submit assignments (which must be a feature added by the course team), and to start the Scala console. Installed sbt from here , and added the path

Inheritance vs. composition depending on how much is same and how much differs

I am reading the excellent Django book right now. In the 4th chapter on Django templates , there is an example of includes and inheritance in Django templates. Without going into details about Django templates, the include is very similar to composition where we can include the text of another template for evaluation. Inheritance in Django templates works in a way similar to object inheritance. Django templates can specify certain blocks which can be redefined in subtemplates. The subtemplates use the rest of the parent template as is. Now we have all learned that inheritance is used when we have a is-a relationship between classes, and composition is used when we have a contains-a relationship. This is absolutely right, but while reading about Django templates, I just realized another pattern in these relationships. This is really simple and perhaps many of you may have already have had this insight... We use inheritance when we want to allow reuse of the bulk of one object in other