Skip to main content

Supplemantary character support in Java

In the last post I wrote that supplementary characters in the Unicode standard are in the range above U+FFFF, which means they need more than 16 bits to represent them. Since the char primitive type in Java is a 16 bit character, we will have to use 2 char's for them.

I just finished reading some stuff on supplementary character support in Java, and well, there are parts I understood right away and parts that are going to need further reading. I will try to share what I am learning on this blog. However, let us first clarify some terminology.

Character: Is an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

Character Set: Is a collection of characters.

Unicode is a coded character set that assigns a unique number to every character defined in the Unicode standard.

A Code Point is the unique number assigned to every Unicode character. Valid Unicode code points are in the range of U+0000 to U+10FFFF. This range is capable of holding more than a million characters, out of which 96,382 have been assigned by the Unicode 4.0 standard.

Supplementary characters are those characters that have been assigned code points beyond U+FFFF. So essentially they lie in the range of U+10000 - U+10FFFF.

When these characters are to be stored in a computer system, they have to be stored as a sequence of bits (this is known as UTF-32 encoding). The simplest way store them is to store each character as a 4 byte sequence capable to addressing the entire unicode range. However this will waste a lot of space, because most of the time we deal with characters in the ASCII range of 00 - FF. Some other mechanism is needed to make better use of the computer's memory and storage. Other encodings that exist are UTF-8 and UTF-16, which as their names suggest, use 8-bit and 16-bit sequences.

A natural question that must have occurred to you is, how do we store characters that go beyond 8 bits or 16 bits in UTF-8 and UTF-16. This is made possible by using multiple blocks. Each block will also have to indicate whether it represents a single character or is part of a series of blocks that represent one character. UTF-8 and UTF-16 help us store characters using less space than UTF-32. The most widely used encoding standard is UTF-8.

In the next post I will discuss how Java supports the supplementary range in it's API's and in the Virtual Machine.



Note: This text was originally posted on my earlier blog at http://www.adaptivelearningonline.net

Comments

Popular posts from this blog

Running your own one person company

Recently there was a post on PuneTech on mom's re-entering the IT work force after a break. Two of the biggest concerns mentioned were : Coping with vast advances (changes) in the IT landscape Balancing work and family responsibilities Since I have been running a one person company for a good amount of time, I suggested that as an option. In this post I will discuss various aspects of running a one person company. Advantages: You have full control of your time. You can choose to spend as much or as little time as you would like. There is also a good chance that you will be able to decide when you want to spend that time. You get to work on something that you enjoy doing. Tremendous work satisfaction. You have the option of working from home. Disadvantages: It can take a little while for the work to get set, so you may not be able to see revenues for some time. It takes a huge amount of discipline to work without a boss, and without deadlines. You will not get the benefits (insuranc...

Testing Groovy domain classes

If you are trying to test Grails domain class constraints by putting your unit test cases in the 'test/unit' directory, then your tests will fail because the domain objects will not have the 'valdate' method. This can be resolved in two ways: Place the test cases inside test/integration (which will slow things down) Use the method 'mockForConstraintsTests(Trail)' to create mock method in your domain class and continue writing your test cases in 'test/unit' What follows is some example code around this finding. I am working on a Groovy on Grails project for a website to help programmers keep up and refresh their skills. I started with some domain classes and then moved on to write some unit tests. When we create a Grails project using grails create-app , it creates several directories, one of which is a directory called 'test' for holding unit tests. This directory contains two directories, 'unit', and 'integration' for unit and ...

Planning a User Guide - Part 3/5 - Co-ordinate the Team

Photo by  Helloquence  on  Unsplash This is the third post in a series of five posts on how to plan a user guide. In the first post , I wrote about how to conduct an audience analysis and the second post discussed how to define the overall scope of the manual. Once the overall scope of the user guide is defined, the next step is to coordinate the team that will work on creating the manual. A typical team will consist of the following roles. Many of these roles will be fulfilled by freelancers since they are one-off or intermittent work engagements. At the end of the article, I have provided a list of websites where you can find good freelancers. Creative Artist You'll need to work with a creative artist to design the cover page and any other images for the user guide. Most small to mid-sized companies don't have a dedicated creative artist on their rolls. But that's not a problem. There are several freelancing websites where you can work with great creative ...