
Unicode is an industry standard for consistent encoding of written text. There are lots of character sets which are used by computers, but Unicode is the first of its kind to aim to support every single written language on earth (and beyond!). Its aim is to provide a unique number to identify every character for every language, on any platform.

Unicode maps every character to a specific code, called a code point. A code point takes the form U+ followed by a hexadecimal number, ranging from U+0000 to U+10FFFF. An example code point looks like this: U+004F. How a code point is stored as bytes depends on the character encoding used.

Unicode defines different character encodings, the most used ones being UTF-8, UTF-16 and UTF-32. UTF-8 is definitely the most popular encoding in the Unicode family, especially on the Web. This document is written in UTF-8, for example.

Currently there are more than 135,000 different characters implemented, with space for more than 1.1 million.

When converting between character sets, the target character set must contain valid encodings for every character in the source character set, to be sure that data will not be lost. Or, at the very least, you need to be sure that the data in any given file will contain only characters which are guaranteed to exist in the target character set - a difficult (impossible?) guarantee to enforce.

Consider this example, which is encoded using the Windows-1252 character set, and which contains Microsoft’s so-called “smart quotes”:

This sentence contains “smart quotes” not found in ISO-8859-1.

The ISO-8859-1 character set does not contain these custom double quote characters. Converting from Windows-1252 to ISO-8859-1 will result in a silent loss of data:

This sentence contains ?smart quotes? not found in ISO-8859-1.
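The silent loss described above can be reproduced in memory. The following is a minimal sketch, not the original article's code: the class and method names are hypothetical, and it assumes the runtime provides the "windows-1252" charset (common JDK builds do, but the Java specification does not guarantee it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SmartQuoteLoss {

    // Decode Windows-1252 bytes, then re-encode the text as ISO-8859-1.
    static String convert(byte[] windows1252Bytes) {
        // Assumption: the runtime provides the "windows-1252" charset.
        Charset windows1252 = Charset.forName("windows-1252");
        String text = new String(windows1252Bytes, windows1252);
        // String.getBytes replaces unmappable characters with '?' instead of failing.
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    // Check up front whether every character exists in the target charset.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the Windows-1252 bytes for the curly double quotes.
        byte[] source = {(byte) 0x93, 's', 'm', 'a', 'r', 't', (byte) 0x94};
        System.out.println(convert(source));                     // ?smart?
        System.out.println(fitsInLatin1("\u201Csmart\u201D"));   // false
    }
}
```

Checking with canEncode before converting is one way to turn the silent loss into an explicit decision, at the cost of an extra pass over the text.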

Characters outside of the Basic Multilingual Plane (BMP) range are referred to as “supplementary characters”. Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

But the underlying storage used by Java is actually a byte array. Consider the Java string String str = "A". Taking the letter A, we know that it has a Unicode value of U+0041. We can see what bytes make up that string by encoding it with a chosen charset, for example with str.getBytes(StandardCharsets.UTF_16BE).

The conversion method (private static void convertEncoding() throws IOException) reads the input file one line at a time. You cannot convert any character set to any other character set without risking data loss.
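As a rough illustration of inspecting the bytes behind a string, the sketch below encodes the letter A and, for contrast, the euro sign (U+20AC) in UTF-8 and UTF-16. The class name and the hex helper are hypothetical. Note that getBytes shows the bytes produced by an encoding, not the JVM's private storage.

```java
import java.nio.charset.StandardCharsets;

public class StringBytes {

    // Format bytes as two-digit uppercase hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "A"; // U+0041
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_8)));     // 41
        System.out.println(hex(str.getBytes(StandardCharsets.UTF_16BE)));  // 00 41

        String euro = "\u20AC"; // euro sign, U+20AC
        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_8)));    // E2 82 AC
        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_16BE))); // 20 AC
    }
}
```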

Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF, which needs 21 bits of data (1,114,112 possible values, or approximately 1.1 million).
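The size of that range can be checked with constants from java.lang.Character. This is a small sketch; the class and method names are made up for illustration.

```java
public class CodePointRange {

    // Number of code points from U+0000 to U+10FFFF inclusive.
    static int codePointCount() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    // Number of bits needed to represent the largest code point.
    static int bitsNeeded() {
        return 32 - Integer.numberOfLeadingZeros(Character.MAX_CODE_POINT);
    }

    public static void main(String[] args) {
        System.out.println(codePointCount()); // 1114112
        System.out.println(bitsNeeded());     // 21
    }
}
```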

Unicode Ranges

Early versions of Unicode defined 65,536 possible values from U+0000 to U+FFFF. These are often referred to as the Basic Multilingual Plane (BMP). This was handled by earlier versions of Java by the char primitive. A single char represents a single BMP symbol.

Java's internal string representation used to use only UTF-16. It now uses ISO-8859-1 and UTF-16.
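Because a single char can hold only a BMP symbol, a code point beyond U+FFFF has to be split across two char values. The sketch below (hypothetical class and method names) uses Character.toChars to show the UTF-16 code units for one BMP code point and one supplementary code point.

```java
public class SurrogatePairs {

    // Return the UTF-16 code units for a code point as space-separated hex.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x004F));  // 004F (one char, inside the BMP)
        System.out.println(describe(0x1F600)); // D83D DE00 (high + low surrogate)

        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 char values
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

One consequence is that String.length() counts char values, not characters, so code that walks a string one char at a time can split a supplementary character in half.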

And Java has never used UTF-8 for its internal representation of strings. But it’s worth noting that internally, from Java 9 onwards, Java uses a byte[] to store strings. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.
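The internal byte array is not exposed through the public String API, so the choice between the two encodings cannot be observed directly. The following sketch only mirrors the rule as described above: the helper names are hypothetical and the byte count is an estimate of the character payload, not a measurement of the JVM's memory use.

```java
public class CompactStringEstimate {

    // Hypothetical helper: true if every char fits in Latin-1 (U+0000 to U+00FF).
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Estimated payload: one byte per char for Latin-1, two per char for UTF-16.
    static int estimatedPayloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        System.out.println(estimatedPayloadBytes("Hello"));    // 5
        System.out.println(estimatedPayloadBytes("\u20AC5"));  // 4 (the euro sign forces UTF-16)
    }
}
```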
Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[]. In Java 9 that changed to using a more compact format by default, as presented in JEP 254: Compact Strings:

Java changed its internal representation of the String class… …from a UTF-16 char array to a byte array plus an encoding-flag field.

Note that arithmetic on a char produces an int, so the result cannot be assigned back to a char without a cast:

char ca = 'a';
char cx = ca + 1; // COMPILATION ERROR
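The compilation error arises because ca + 1 is promoted to int. A minimal sketch of three ways to write the same step so that it compiles (the class and method names are illustrative only):

```java
public class CharArithmetic {

    // ca + 1 is an int expression, so an explicit narrowing cast is required.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        char ca = 'a';

        char cx = next(ca); // explicit cast inside next()

        char cy = ca;
        cy += 1;            // compound assignment includes an implicit cast

        char cz = ca;
        cz++;               // increment operators also work directly on char

        System.out.println(cx); // b
        System.out.println(cy); // b
        System.out.println(cz); // b
    }
}
```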
