Skip to main content

Supplemantary character support in Java

In the last post I wrote that supplementary characters in the Unicode standard are in the range above U+FFFF, which means they need more than 16 bits to represent them. Since the char primitive type in Java is a 16 bit character, we will have to use 2 char's for them.

I just finished reading some stuff on supplementary character support in Java, and well, there are parts I understood right away and parts that are going to need further reading. I will try to share what I am learning on this blog. However, let us first clarify some terminology.

Character: Is an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

Character Set: Is a collection of characters.

Unicode is a coded character set that assigns a unique number to every character defined in the Unicode standard.

A Code Point is the unique number assigned to every Unicode character. Valid Unicode code points are in the range of U+0000 to U+10FFFF. This range is capable of holding more than a million characters, out of which 96,382 have been assigned by the Unicode 4.0 standard.

Supplementary characters are those characters that have been assigned code points beyond U+FFFF. So essentially they lie in the range of U+10000 - U+10FFFF.

When these characters are to be stored in a computer system, they have to be stored as a sequence of bits (this is known as UTF-32 encoding). The simplest way store them is to store each character as a 4 byte sequence capable to addressing the entire unicode range. However this will waste a lot of space, because most of the time we deal with characters in the ASCII range of 00 - FF. Some other mechanism is needed to make better use of the computer's memory and storage. Other encodings that exist are UTF-8 and UTF-16, which as their names suggest, use 8-bit and 16-bit sequences.

A natural question that must have occurred to you is, how do we store characters that go beyond 8 bits or 16 bits in UTF-8 and UTF-16. This is made possible by using multiple blocks. Each block will also have to indicate whether it represents a single character or is part of a series of blocks that represent one character. UTF-8 and UTF-16 help us store characters using less space than UTF-32. The most widely used encoding standard is UTF-8.

In the next post I will discuss how Java supports the supplementary range in it's API's and in the Virtual Machine.



Note: This text was originally posted on my earlier blog at http://www.adaptivelearningonline.net

Comments

Popular posts from this blog

My HSQLDB schema inspection story

This is a simple story of my need to inspect the schema of an HSQLDB database for a participar FOREIGN KEY, and the interesting things I had to do to actually inspect it. I am using an HSQLDB 1.8 database in one of my web applications. The application has been developed using the Play framework , which by default uses JPA and Hibernate . A few days back, I wanted to inspect the schema which Hibernate had created for one of my model objects. I started the HSQLDB database on my local machine, and then started the database manager with the following command java -cp ./hsqldb-1.8.0.7.jar org.hsqldb.util.DatabaseManagerSwing When I tried the view the schema of my table, it showed me the columns and column types on that table, but it did not show me columns were FOREIGN KEYs. Image 1: Table schema as shown by HSQLDB's database manager I decided to search on StackOverflow and find out how I could view the full schema of the table in question. I got a few hints, and they all pointed to

Fuctional Programming Principles in Scala - Getting Started

Sometime back I registered for the Functional Programming Principles in Scala , on Coursera. I have been meaning to learn Scala from a while, but have been putting it on the back burner because of other commitments. But  when I saw this course being offered by Martin Odersky, on Coursera , I just had to enroll in it. This course is a 7 week course. I will blog my learning experience and notes here for the next seven weeks (well actually six, since the course started on Sept 18th). The first step was to install the required tools: JDK - Since this is my work machine, I already have a couple of JDK's installed SBT - SBT is the Scala Build Tool. Even though I have not looked into it in detail, it seems like a replacement for Maven. I am sure we will use it for several things, however upto now I only know about two uses for it - to submit assignments (which must be a feature added by the course team), and to start the Scala console. Installed sbt from here , and added the path

Five Reasons Why Your Product Needs an Awesome User Guide

Photo Credit: Peter Merholz ( Creative Commons 2.0 SA License ) A user guide is essentially a book-length document containing instructions for installing, using or troubleshooting a hardware or software product. A user guide can be very brief - for example, only 10 or 20 pages or it can be a full-length book of 200 pages or more. -- prismnet.com As engineers, we give a lot of importance to product design, architecture, code quality, and UX. However, when it comes to the user manual, we often only manage to pay lip service. This is not good. A usable manual is as important as usable software because it is the first line of help for the user and the first line of customer service for the organization. Any organization that prides itself on great customer service must have an awesome user manual for the product. In the spirit of listicles - here are at least five reasons why you should have an awesome user manual! Enhance User Satisfaction In my fourteen years as a