What are Unicode, UTF-8 and UTF-32?


This is a fairly simple topic, easily addressed by checking out Wikipedia or some other online resource. However, let’s clarify it. To put it simply, Unicode is a standardized character table that associates every character with a number, its character code. For example, the Latin letter a has the Unicode character code 97, or 61 in hexadecimal, and it’s also often referred to as U+000061.
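
You can check that value yourself with a minimal C program (assuming an ASCII-compatible execution character set, which is the usual case, so that 'a' in source code has the same value as the Unicode code point):

```c
#include <stdio.h>

int main(void) {
    /* On ASCII-compatible platforms, 'a' has the value 97 (0x61),
       matching its Unicode code point U+000061. */
    printf("'a' = %d decimal, %X hexadecimal\n", 'a', 'a');
    return 0;
}
```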

UTF-8 and UTF-32, on the other hand, are encoding schemes. They specify the byte (octet) sequence associated with a given sequence of Unicode character codes and, given a sequence of bytes, which sequence of Unicode character codes, if any, it represents. For example, let’s suppose you want to store Unicode character number 97, that is, the character U+000061, which is the letter a. It will be stored in different ways depending on the chosen encoding scheme. UTF-32 is quite simple and direct: it uses 4 bytes (octets) to store every Unicode character, and each 4-byte group is the result of storing the character code as a 4-byte natural number. UTF-32 actually comes in big endian and little endian flavors (UTF-32BE and UTF-32LE), but that doesn’t matter for this example, so we will suppose it’s big endian. The letter a would be stored as 00 00 00 61.
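
A minimal sketch of that big endian encoding in C (the helper name utf32be_encode is just illustrative):

```c
#include <stdint.h>
#include <stdio.h>

/* Encode one Unicode code point as 4 bytes in UTF-32BE:
   the code point is written as a 4-byte natural number, most significant byte first. */
static void utf32be_encode(uint32_t codepoint, unsigned char out[4]) {
    out[0] = (codepoint >> 24) & 0xFF;
    out[1] = (codepoint >> 16) & 0xFF;
    out[2] = (codepoint >> 8) & 0xFF;
    out[3] = codepoint & 0xFF;
}

int main(void) {
    unsigned char bytes[4];
    utf32be_encode(0x61, bytes); /* U+000061, the letter a */
    printf("%02X %02X %02X %02X\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0; /* prints: 00 00 00 61 */
}
```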

UTF-32 is the only Unicode encoding scheme that uses a fixed number of bytes for every character. UTF-8, for example, uses a variable number of bytes to store each character, depending on the character code. UTF-8 was designed with a purpose in mind: to be backwards compatible with the ASCII character set. This means the first 128 possible byte values (from 00 to 7F) are used to store the same characters present in the ASCII character set and, moreover, they are only used to store those characters. UTF-8 guarantees, for example, that if the byte 00 appears in a sequence, it represents the null character and is not part of any multibyte sequence associated with some other character. For this reason, the letter a is stored in UTF-8 as the single byte 61.
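
Here is a sketch of the standard UTF-8 encoding rules (1 to 4 bytes per code point); the helper name utf8_encode is just illustrative. Note how the ASCII range is passed through as a single, unchanged byte:

```c
#include <stdint.h>
#include <stdio.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written, or 0 for an invalid code point. */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                       /* ASCII range: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {               /* two bytes */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp <= 0xFFFF) {              /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not valid */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else if (cp <= 0x10FFFF) {            /* four bytes */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
    return 0;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x61, buf);         /* U+000061 -> 61 (one byte) */
    for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
    printf("\n");
    n = utf8_encode(0x20AC, buf);           /* U+0020AC, the euro sign -> E2 82 AC */
    for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
    printf("\n");
    return 0;
}
```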

Why is that important? Because a lot of software and many protocols don’t need to be rewritten to be compatible with UTF-8. For example, in the C programming language, every string is terminated by a null character '\0' which marks the end of the string. UTF-32 is incompatible with this convention: the letter a would be stored as 00 00 00 61, as we saw before. The null byte appears 3 times in that sequence, and none of them marks the end of the string, should the letter a appear in the middle of a string. Another example: in Linux, a file name can contain any octet except 00 and 2F (the null character and the ASCII code for the directory separator '/'). Thanks to the definition of UTF-8, the Linux kernel can handle UTF-8 file names without any changes. The byte 00 can still be used as the string terminator and the byte 2F can still be used as the directory separator. UTF-8 guarantees those two bytes won’t appear anywhere except when they represent those exact characters.
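
A small sketch of why this matters in practice: the standard C string functions keep working on UTF-8 data precisely because the bytes 00 and 2F can only ever mean the null character and '/'. The string below spells out the UTF-8 bytes of a hypothetical path explicitly (ñ is U+0000F1, encoded as C3 B1; ó is U+0000F3, encoded as C3 B3):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "cañón/x" as UTF-8 bytes: none of the multibyte sequences contains 00 or 2F. */
    const char path[] = "ca\xC3\xB1\xC3\xB3n/x";
    printf("byte length: %zu\n", strlen(path));                    /* 9 bytes */
    printf("separator at byte index: %td\n", strchr(path, '/') - path); /* 7 */
    return 0;
}
```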

To sum up, Unicode is a standard table that, among other things, associates each character with a unique character code, while UTF-8, UTF-32 and others are encoding schemes: they describe how to represent those character codes using bytes or octets. Notice how, in the encoding schemes we described, the representation of character U+000061 contains the byte 61. This is the result of trying to keep things somewhat simple. You could describe another encoding scheme yourself in which, for example, character U+000061 is encoded as the sequence 67 90 F2 if you wish. Any encoding scheme is valid as long as it lets you transform a sequence of character codes into a sequence of bytes, and the resulting sequence of bytes back into the same sequence of character codes. However, not every encoding scheme is backwards compatible with ASCII.
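
To make that round-trip requirement concrete, here is a toy encoding scheme invented for this post (not a real standard): every code point becomes 4 big endian bytes, each XORed with 0xA5. It round-trips perfectly, so it is a valid encoding scheme, but it is clearly not ASCII compatible:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy scheme: 4 bytes per code point, big endian, each byte XORed with 0xA5.
   Valid (it round-trips), but U+000061 becomes A5 A5 A5 C4, not 61. */
static void toy_encode(uint32_t cp, unsigned char out[4]) {
    out[0] = ((cp >> 24) & 0xFF) ^ 0xA5;
    out[1] = ((cp >> 16) & 0xFF) ^ 0xA5;
    out[2] = ((cp >> 8) & 0xFF) ^ 0xA5;
    out[3] = (cp & 0xFF) ^ 0xA5;
}

static uint32_t toy_decode(const unsigned char in[4]) {
    return ((uint32_t)(in[0] ^ 0xA5) << 24) | ((uint32_t)(in[1] ^ 0xA5) << 16) |
           ((uint32_t)(in[2] ^ 0xA5) << 8) | (uint32_t)(in[3] ^ 0xA5);
}

int main(void) {
    unsigned char buf[4];
    toy_encode(0x61, buf);
    printf("%02X %02X %02X %02X -> U+%06X\n",
           buf[0], buf[1], buf[2], buf[3], (unsigned)toy_decode(buf));
    return 0; /* prints: A5 A5 A5 C4 -> U+000061 */
}
```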
