What the ɊȱɁͲ Is UTF-8? A Character Encoding Primer

April 13, 2011 — Code

If you’ve ever had character encoding problems and looked up UTF-8 to find out what it really is, you’ve probably encountered definitions like “a multibyte character encoding for Unicode” or “a mapping of the Unicode character set.” These definitions, while correct, don’t really tell you much unless you already understand character encodings. Let me try to explain it simply.

ASCII

All data stored on every modern computer system looks like this:

011010000110010101101100011011000110111100100000
0111011101101111011100100110110001100100

Whether it’s an image, an audio file, or plain text, it’s all just ones and zeros (each one is called a bit). Different programs interpret the ones and zeros with the assumption that they are arranged according to the rules of a particular file format. For example, the first bits of a GIF file express the GIF format version and the image dimensions. If you try to open an audio file in an image editor, the program usually won’t let you: the sequence of ones and zeros doesn’t look like what it expects, so it doesn’t know how to interpret the data. Text editors, on the other hand, will let you open just about anything, and if you open an audio file you’ll get a lot of garbage. But it’s not ones and zeros, it’s funny-looking characters. Why? Because text editors are made for working with text, so they interpret the bits as text characters. Let’s look at how they do it.

Let’s say we have a file consisting of the ones and zeros given above. If we open this in a simple text editor, the editor will first separate the bits into groups of 8, each one called a byte:

01101000 01100101 01101100 01101100 01101111 00100000
01110111 01101111 01110010 01101100 01100100

It then reads each byte as a binary number. The lowest value an 8-digit binary number can have is zero (00000000) and the highest is 255 (11111111). So the text editor sees the data as a sequence of numbers between 0 and 255. Translating the above bytes we get:

104 101 108 108 111 32 119 111 114 108 100
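
If you want to check that translation yourself, here’s a quick Python 3 sketch (mine, not part of the article’s original examples) that splits the bit string into 8-bit groups and reads each group as a binary number:

bits = ("011010000110010101101100011011000110111100100000"
        "0111011101101111011100100110110001100100")

# Split into bytes (groups of 8 bits) and read each group as a base-2 number.
numbers = [int(bits[i:i+8], 2) for i in range(0, len(bits), 8)]
print(numbers)  # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]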

It then looks up each of these numbers in a table to find out what character each number corresponds to. Part of the table might look like this:

 32: <space>   109: m
 ..            110: n
100: d         111: o
101: e         112: p
102: f         113: q
103: g         114: r
104: h         115: s
105: i         116: t
106: j         117: u
107: k         118: v
108: l         119: w

So, translating the above numbers into characters we get:

hello world

Pretty simple, right? What we’ve been looking at so far is called ASCII encoding, which basically means that the table used to translate numbers into characters is the one defined by the ASCII standard.
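
If you’d rather let the computer do the lookup, Python’s built-in chr() uses the same numbering for values 0 through 127, so a quick sketch of the whole translation looks like this:

numbers = [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# chr() maps each number to its character; for 0-127 this matches the ASCII table.
print(''.join(chr(n) for n in numbers))  # hello world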

See It For Yourself

If you’re a healthy skeptic you may be wondering if this isn’t an oversimplification. If you’re on a Unix-like system with the xxd command (it comes with Vim), you can see for yourself that it isn’t. The xxd command creates a “hex dump” of its input; that is, it displays the raw bytes of its input, normally in hexadecimal. So let’s give it the "hello world" string and set the -b option, which displays the output in binary instead:

echo 'hello world' | xxd -b

You should get this:

0000000: 01101000 01100101 01101100 01101100 01101111 00100000 hello
0000006: 01110111 01101111 01110010 01101100 01100100 00001010 world.

Which is the same as my example above, plus an extra character on the end (10: line break) added by echo. So there.
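
If you don’t have xxd handy, a rough Python equivalent (it prints the same bits, just not in xxd’s exact layout) looks like this:

data = b'hello world\n'  # echo appends the trailing newline, byte 10

# Print each byte as an 8-digit binary number, matching the dump above.
print(' '.join(format(byte, '08b') for byte in data))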

Unicode

Unicode is basically a character lookup table that was designed to include every single character used in any language, anywhere. Needless to say, it’s huge. There are over 100,000 characters. You can buy a hard copy of Unicode v5.0, which is a 1,400-page book, and more characters were added in Unicode 6.0. Obviously 8 bits isn’t enough to represent all these characters. In fact, Unicode character numbers go up to over 1.1 million (there are gaps in the numbering) so we need at least 21 bits (2²¹ = 2,097,152) to encode them all. There have been many attempts to create character encodings (roughly synonymous with file formats for purposes of this discussion) for the entire set of Unicode characters, each with its own benefits and drawbacks.
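
If you want to check that arithmetic, the highest code point Unicode defines is 0x10FFFF, and Python will confirm that 21 bits are enough to hold it:

print(0x10FFFF)                 # 1114111, the highest Unicode code point
print((0x10FFFF).bit_length())  # 21 bits needed to represent it
print(2 ** 21)                  # 2097152, comfortably more than 1114111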

UTF-32

UTF-32 is one such encoding, which uses 32 bits (four bytes) to encode every character, so "hello world" looks like this (shown here in little-endian order, with the low byte of each character first):

01101000 00000000 00000000 00000000
01100101 00000000 00000000 00000000
01101100 00000000 00000000 00000000
01101100 00000000 00000000 00000000
01101111 00000000 00000000 00000000
00100000 00000000 00000000 00000000
01110111 00000000 00000000 00000000
01101111 00000000 00000000 00000000
01110010 00000000 00000000 00000000
01101100 00000000 00000000 00000000
01100100 00000000 00000000 00000000

UTF-32 has more than enough numbers for every Unicode character, but clearly this is not a very compact format. UTF-32 files are four times as large as ASCII files with the same text.
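
You can see the four-to-one ratio directly in Python; 'utf-32-le' is the little-endian variant used in the example above (the plain 'utf-32' codec would also prepend a byte-order mark):

text = 'hello world'
print(len(text.encode('ascii')))      # 11 bytes, one per character
print(len(text.encode('utf-32-le')))  # 44 bytes, four per character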

UTF-16

Another encoding, UTF-16, uses (you guessed it) 16 bits per character. Files are half the size of UTF-32 but with only 16 bits some of the Unicode character set is missing. The good thing is that Unicode was designed so that the most common characters have the lowest numbers, and the group of characters that form what’s known as the “Basic Multilingual Plane” are all covered by UTF-16. (There is also a way to encode additional characters using UTF-16 but that is beyond the scope of this article.)
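
Here’s a quick Python sketch of that: a character from the Basic Multilingual Plane, like the Japanese character 日 (U+65E5), still fits in the two bytes UTF-16 gives it:

print(len('hello'.encode('utf-16-le')))  # 10 bytes, two per character
print(len('日'.encode('utf-16-le')))      # 2 bytes; U+65E5 is inside the BMP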

Another very practical problem with UTF-16 and UTF-32 is that you need a text editor that knows how to read them. That is, even if you’re just writing normal English text with no characters beyond those in the standard ASCII range, if you save the file as UTF-16 and send it to a friend who opens it in an ASCII-only text editor, it will look like gibberish.
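
You can see why in Python: even plain English text picks up a zero byte after every character in UTF-16, and an ASCII-only editor has no idea what to do with those (this sketch uses the little-endian form; the plain 'utf-16' codec would also add a two-byte byte-order mark at the front):

print('hi'.encode('ascii'))      # b'hi'
print('hi'.encode('utf-16-le'))  # b'h\x00i\x00', every other byte is zero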

UTF-8

Wouldn’t it be great if there was an encoding that was compact, backwards compatible with ASCII, and included all the Unicode characters too? UTF-8 is a variable length encoding which uses 1 to 4 bytes for each character and can represent the entire Unicode character repertoire without creating huge files (or, for web pages, slow downloads) like UTF-32.

How is that possible? How does the computer know how many bytes to read for each character? Don’t feel bad if you can’t figure it out because it took someone as smart as Ken Thompson to design it.

Basically it works like this: for characters in the ASCII range, UTF-8 is identical to ASCII (and remember: every ASCII byte starts with a 0). To represent a character beyond the ASCII range, the first byte starts with as many 1s as there are bytes in the whole sequence, followed by a 0, and each successive byte starts with 10, so that every byte can quickly be identified:

if it starts with 0 it’s an ASCII character
if it starts with 10 it’s a continuation of a multi-byte character
if it starts with 110 it’s the first byte of a 2-byte character
if it starts with 1110 it’s the first byte of a 3-byte character
if it starts with 11110 it’s the first byte of a 4-byte character

All other bits are then concatenated to get the binary number which will be mapped to a character. With all these “metadata” bits we don’t have true 32-bit character numbers like in UTF-32, so can we really represent the full Unicode character set? Remember, we need 21 bits to do that, so let’s see how many actual data bits there are in the largest UTF-8 character (everything except the leading 11110 and the three 10 prefixes is a data bit):

11110111 10111111 10111111 10111111

If you count them you’ll see there are exactly the 21 required for Unicode.
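
To make the scheme concrete, here is a minimal Python sketch of those rules (my own illustration, not a production encoder; in real code you’d just call str.encode('utf-8')), checked against the built-in codec:

def utf8_bytes(code_point):
    # Apply the prefix rules above: 0xxxxxxx, 110xxxxx 10xxxxxx, and so on.
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

for ch in ('h', 'é', '日', '😀'):
    assert utf8_bytes(ord(ch)) == ch.encode('utf-8')
    print(ch, [format(b, '08b') for b in ch.encode('utf-8')])

Running it shows exactly the byte patterns described above: one byte starting with 0 for h, two bytes (110..., 10...) for é, three for 日, and four for the emoji.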

The End

Now it should be clear why in some programming languages you get strange results when counting the number of characters in a non-ASCII string: they’re assuming the string is ASCII and counting the number of bytes it takes in memory. To count the characters accurately the language has to be told what encoding the string uses and it has to know how to parse that encoding.
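
For example, in Python 3 a str is a sequence of characters and a bytes object is its encoded form, so the two lengths can disagree:

text = 'naïve'
print(len(text))                  # 5 characters
print(len(text.encode('utf-8')))  # 6 bytes, because ï takes two bytes in UTF-8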

There are a lot of other concerns that character encoding designers must deal with, like compatibility with different platforms and performance in search and manipulation operations, but this is as far as I’m going to go in this article.

Epilogue

April 19, 2011

After I posted this article there was a discussion on Reddit in which muyuu made some excellent points which I thought were worth posting here:

“UTF-32 files are four times as large as ASCII files with the same text” seems to imply UTF-32 is retarded (or UTF-16 for that matter). You should add that obviously neither was designed to store ASCII text and that you can’t represent Unicode text in ASCII at all, unless the whole text happens to fall into the very small ASCII subset.

You should also add that UTF-8 text is only compact when something like 3/4+ of your text is plain ASCII. If your text is in Japanese or Chinese, for example, then UTF-8 is ridiculously inefficient and UTF-16 is much better (or even better, their respective local encodings; they have many and most of them are variable-length). 30-40% extra size in text makes a lot of difference when the majority of your users connect from their cell phones.

It’s also worth mentioning that variable-length encodings compress a lot worse than fixed-length encodings, especially in the case of UTF-16 – because codepage grouping and character order are not random, and any trivial compressor will greatly benefit from that. Things are routinely compressed when transmitted over networks.

This is for the “UTF-8 is all we need” brigade. If you have many users in countries with different writing systems, supporting different encodings might be a good idea. Obviously it can be a complex issue, but – for instance – an additional 20% wait to the ping your users may have, it can be a deal breaker for your microblogging site in favour of a local one.
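
For what it’s worth, that size point is easy to check in Python; the exact ratio depends on the text, but for purely Japanese text the UTF-8 form comes out about 50% larger than UTF-16:

text = 'こんにちは世界'                 # "hello world" in Japanese
print(len(text.encode('utf-16-le')))  # 14 bytes, two per character
print(len(text.encode('utf-8')))      # 21 bytes, three per character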

