What you need to know about character sets and encoding

This article is part of the sequence The Basics You Won’t Learn in the Basics aimed at eager people striving to gain a deeper understanding of programming and computer science.

My last article was about different data types and some tricks with them. We talked a little about characters as well. However, working with them can be a little bit strange due to the presence of a fancy term in computing called encoding.

Today, my friend asked me to go and fix the subtitles for his movies. He had been telling me that some strange symbols appear all the time. So he tried reinstalling Windows and changing all sorts of options, but nothing seemed to work. He clearly had no idea what an encoding is. However, I guess that is normal since he doesn’t have a CS background. But there seem to be a lot of developers out there (myself included, in the old days) who don’t know what encoding means. Surely, they might have heard of UTF-8, but what is it? We have ASCII, right?

Well, I am going to address the issue of encoding in this article, as I think it is fundamental to anyone getting their hands dirty with programming and computing. It seems not many programming basics courses cover this topic in much detail.

Character sets

Last time, I mentioned that characters use a table called ASCII for mapping numbers to different letters and symbols. ASCII defines 128 values (7 bits), which is enough for the letters of the English alphabet, the digits, punctuation and some control characters. Since a byte can hold 256 values, that left a spare set of 128 other values for different needs. This seemed to work well back in the old days. But at one point, there was a need for internationalization, that is, displaying letters from different alphabets across the world. Take Farsi or Arabic, for example. For that purpose, people started using the leftover 128 values for letters and symbols from their own language. But that approach didn’t hold up for long: the same byte value meant a different character depending on which table was in use, and 128 spare values are nowhere near enough for every alphabet.
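To make the problem with the leftover 128 values concrete, here is a minimal Python sketch. The byte 0xFF and the two legacy tables (cp1251 for Cyrillic, latin-1 for Western European text) are my own picks for illustration, not something the story above depends on.

```python
# One byte taken from the upper half of the 8-bit range.
raw = bytes([0xFF])

# The same byte becomes a different character depending on the assumed table.
as_cyrillic = raw.decode("cp1251")   # Windows Cyrillic code page
as_western = raw.decode("latin-1")   # ISO 8859-1, Western European

print(as_cyrillic, as_western)
```

The byte never changes; only the table used to interpret it does.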

Unicode

Shortly after, the Unicode character set was born. It addresses this issue of representing different languages and cultures, and the basic idea is to provide a unique code point for every character out there. That means that code points in this character set can go above the 8-bit mark. The notation used for Unicode characters is U+XXXX in hexadecimal (that is, the base-16 numeral system). For example, if you want to display the Cyrillic character я in Unicode, the code point being used is U+044F. This character set provides code points for symbols from all sorts of cultures, and it even has spare code points for symbols yet to be defined. For a full list of Unicode characters, you can check out unicode-table.com.
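If you want to poke at code points yourself, a short Python sketch like the following should do. The helper name code_point_label is made up for this example; ord and chr are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

print(code_point_label("я"))  # U+044F
print(code_point_label("A"))  # U+0041
print(chr(0x044F))            # я
```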

Encoding

OK, so now we know how to deal with characters from all kinds of cultures and we are done? Well, there is one final issue we should address: how do we store those characters in memory? At first glance, such an issue seems kind of unnecessary. Since we know what the code points of the characters are, we could just store the values of those code points. That is a reasonable argument, and it is one way to store characters in memory. So if we want to store the Cyrillic letter я in memory, we can just store the values 04 4F in 2 consecutive bytes. That would work out quite well.
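Here is a rough Python sketch of that idea. I am assuming the big-endian UTF-16 codec as a stand-in for "just write the code point in 2 bytes", which gives the same bytes for a character like я.

```python
letter = "я"

# Big-endian: the most significant byte (04) comes first, then 4F.
stored = letter.encode("utf-16-be")

print(stored.hex(" "))  # 04 4f
```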

Problems with this approach

However, imagine we just want to store a regular English text without any fancy international symbols. What do we do then?

Well, as before, we can use 2 bytes again to store each letter. So, for example, the English letter A in Unicode is U+0041 (65 in decimal, the same value as in ASCII, since ASCII is a subset of Unicode) and can be stored as 00 41. But you might now notice that there is an overhead of 1 byte to store the letter, since A fits easily in 1 byte. That means that a text which was encoded with ASCII before is twice as large when stored this way. That was actually a reasonable argument, at first, not to use Unicode at all. This way of storing characters (or, as it is actually called, encoding characters) is called UCS-2.
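A quick way to see the doubling is to encode the same English text both ways. This is only a sketch: Python has no codec named UCS-2, so I am assuming utf-16-be as a stand-in, which produces the same bytes for plain English letters.

```python
text = "Hello"

as_ascii = text.encode("ascii")         # 1 byte per letter
as_two_byte = text.encode("utf-16-be")  # 2 bytes per letter, stands in for UCS-2

print(len(as_ascii), len(as_two_byte))  # 5 10
```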

But then people thought about this and said: OK, can’t we store small characters in 1 byte and bigger ones in more bytes only when needed? And they came up with another encoding called UTF-8, which uses between 1 and 4 bytes per character and is the most widely used encoding nowadays.
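The variable width of UTF-8 is easy to observe. In this small Python sketch the four sample characters are my own choice, picked so that each one needs a different number of bytes.

```python
samples = ["A", "я", "€", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
sizes = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in sizes.items():
    print(ch, size, ch.encode("utf-8").hex(" "))
```

Plain English text stays at 1 byte per letter, so the size overhead only shows up for characters that actually need it.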

But there are other issues besides this one as well. For example, in what order do we store the bytes in memory? Should we store A as 00 41 or 41 00? Answering that requires some additional header info in the text, known as a byte order mark (BOM), which specifies how exactly it is encoded.
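Both byte orders, and the header that tells them apart, can be sketched in Python as follows. The BOM constants come from the standard codecs module; the variable names are just for this example.

```python
import codecs

big = "A".encode("utf-16-be")     # 00 41
little = "A".encode("utf-16-le")  # 41 00

# Prefix each with the matching marker so a reader can tell the order.
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little

# The generic "utf-16" codec reads the marker and picks the byte order.
print(with_bom_be.decode("utf-16"), with_bom_le.decode("utf-16"))
```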

These and various other issues have caused the creation of all sorts of encodings nowadays. But as I mentioned, the most widely used one is UTF-8. So when my friend had his subtitles encoded in UTF-8 but he was trying to display them with a different incompatible encoding, all sorts of strange characters appeared on the screen.
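The subtitle problem can be reproduced in a few lines of Python. I do not know which encodings were actually involved on my friend's machine, so the sample word and the cp1251 code page below are assumptions picked for illustration.

```python
original = "Привет"

# The file on disk holds UTF-8 bytes.
data = original.encode("utf-8")

# A player that assumes a legacy Cyrillic code page shows garbage.
garbled = data.decode("cp1251")

# Reversing the wrong step and decoding correctly restores the text.
repaired = garbled.encode("cp1251").decode("utf-8")

print(garbled)
print(repaired)
```

The bytes were never damaged; they were only read with the wrong table, which is why picking the right encoding in the player fixes the problem.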

How do we address the problem of character sets and encoding in our code?

Well, it seems that most modern languages have solved this issue for us. For example, C# and Java store each character (char) as 2 bytes, using UTF-16 internally, in order to be able to represent different languages.
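One caveat with 2-byte characters is that some code points do not fit in 2 bytes and take two units instead. The sketch below uses Python's utf-16-be codec to count 2-byte units; treating that as a model of how C# and Java lay out their strings is my assumption, and utf16_units is a made-up helper name.

```python
def utf16_units(ch: str) -> int:
    """Count how many 2-byte units UTF-16 needs for a character."""
    return len(ch.encode("utf-16-be")) // 2

print(utf16_units("A"))   # 1
print(utf16_units("я"))   # 1
print(utf16_units("😀"))  # 2
```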

C++ and C, however, have an issue with dealing with Unicode strings. The main issue comes from the fact that the char data type in these languages takes up 1 byte. That means that you should use an external library or structure in C for dealing with Unicode text. In C++, you can use the wchar_t data type instead of char for starters. But you should also replace the standard string functions with Unicode-aware ones. And last but not least, you shouldn’t use str++ to iterate over a string anymore, since a Unicode character can take up a variable number of bytes.
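To show why stepping one byte at a time no longer means stepping one character at a time, here is a Python sketch of the bookkeeping involved. It assumes the data is valid UTF-8, where every byte after the first one in a character has the bit pattern 10xxxxxx; count_code_points is a hypothetical helper, not a library function.

```python
def count_code_points(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "Aя€".encode("utf-8")

# 6 bytes on disk, but only 3 characters.
print(len(data), count_code_points(data))  # 6 3
```

A byte-by-byte loop in C would visit six positions here, three of which land in the middle of a character.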

The easiest thing you can do is to just avoid using Unicode in these languages, but if you really need it, refer to this article: Unicode in C and C++.

Conclusion

I have merely scratched the surface of how character sets and encoding work. But my goal is not to provide a detailed document of how to work with them. Instead I want to provide you with enough knowledge to understand the basic concept and get you prepared for dealing with it in your code.

At the very least, you might encounter an encoding problem with subtitles for a movie. And if that is the case, now you know how to address that issue.

For a more comprehensive text on character sets and encodings, you can check out this famous article about the topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

See you next time. We will get back to dealing with binary numbers and introduce how you can use binary operations to manipulate them.