This article introduces terms such as ASCII, Unicode, and UTF-8, and explains how they relate to each other. At the bottom, a few examples show how to work with them in Python.

General terms

Even though everybody has (most likely) pretty much the same understanding of terms like a character set and encoding, their definitions differ slightly across the internet. I chose the ones that seem the clearest and most rational to me.

Character set

A character set
is a collection of characters used to represent text in a computer system[9].

The character set might include numbers, letters, punctuation marks, symbols, emojis, and control characters. While a character set in general might be considered simply as a set of characters, in the software engineering field, we usually understand that the character set defines the mapping of its characters to numbers as well.

Example

A character set might look like the following:

charset.svg
Sample character set.

Encoding

An encoding
is an unambiguous mapping between bit strings and the set of possible data[8].

That is, an encoding defines how each character from a character set is uniquely translated to a sequence of one or more bits. Note that a single character set might have more than one encoding.

Example

In the sample character set, there are 26 characters in total. Hence, 5 bits ($2^5 = 32 \geq 26$) are enough to cover all of them. The encoding of the sample character set might then be defined as:

charset.svg
Encoding of the Sample character set.

Let’s say we wanted to encode the word firefly. Based on the encoding, this string of 7 characters would be represented as a sequence of $7 \times 5 = 35$ bits, one 5-bit code per character.
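
The figure with the encoding table is not reproduced here, so the following minimal sketch assumes one possible 5-bit encoding: the sample character set is taken to be the 26 lowercase letters a to z, numbered 0 to 25 in alphabetical order. The numbering and the names `SAMPLE_CHARSET`, `encode`, and `decode` are assumptions made for illustration and may differ from the original table.

```python
import string

# Assumption: the sample character set is a-z, mapped to 0..25 alphabetically.
SAMPLE_CHARSET = {ch: number for number, ch in enumerate(string.ascii_lowercase)}
BITS_PER_CHAR = 5  # 2**5 = 32 >= 26, so 5 bits are enough


def encode(text: str) -> str:
    """Translate each character to its fixed-width 5-bit code."""
    return "".join(format(SAMPLE_CHARSET[ch], f"0{BITS_PER_CHAR}b") for ch in text)


def decode(bits: str) -> str:
    """Split the bit string into 5-bit chunks and map each back to a character."""
    by_number = {number: ch for ch, number in SAMPLE_CHARSET.items()}
    chunks = [bits[i:i + BITS_PER_CHAR] for i in range(0, len(bits), BITS_PER_CHAR)]
    return "".join(by_number[int(chunk, 2)] for chunk in chunks)


encoded = encode("firefly")
print(encoded)          # 35 bits: 7 characters, 5 bits each
print(decode(encoded))  # firefly
```

Because every code has the same width, decoding needs no separators: the receiver simply reads 5 bits at a time. A different numbering of the same 26 characters would be a different, equally valid encoding.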

ASCII

ASCII (American Standard Code for Information Interchange)
is a character encoding standard for electronic communications[5].

The statement above effectively means that ASCII defines a numerical representation of a selected set of characters. There are 128 characters, including the lowercase and capital letters of the English alphabet, digits, some special characters such as &, #, and $, and non-printable control characters. See the original table from RFC 20[1]:

|----------------------------------------------------------------------|
  B  \ b7 ------------>| 0   | 0   | 0   | 0   | 1   | 1   | 1   | 1   |
   I  \  b6 ---------->| 0   | 0   | 1   | 1   | 0   | 0   | 1   | 1   |
    T  \   b5 -------->| 0   | 1   | 0   | 1   | 0   | 1   | 0   | 1   |
     S                 |-----------------------------------------------|
               COLUMN->| 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|b4 |b3 |b2 |b1 | ROW  |     |     |     |     |     |     |     |     |
+----------------------+-----------------------------------------------+
| 0 | 0 | 0 | 0 | 0    | NUL | DLE | SP  | 0   | @   | P   |   ` |   p |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0 | 0 | 1 | 1    | SOH | DC1 | !   | 1   | A   | Q   |   a |   q |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0 | 1 | 0 | 2    | STX | DC2 | "   | 2   | B   | R   |   b |   r |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0 | 1 | 1 | 3    | ETX | DC3 | #   | 3   | C   | S   |   c |   s |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 0 | 0 | 4    | EOT | DC4 | $   | 4   | D   | T   |  d  |   t |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 0 | 1 | 5    | ENQ | NAK | %   | 5   | E   | U   |  e  |   u |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 1 | 0 | 6    | ACK | SYN | &   | 6   | F   | V   |  f  |   v |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 1 | 1 | 7    | BEL | ETB | '   | 7   | G   | W   |  g  |   w |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 0 | 0 | 0 | 8    | BS  | CAN | (   | 8   | H   | X   |  h  |   x |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 0 | 0 | 1 | 9    | HT  | EM  | )   | 9   | I   | Y   |  i  |   y |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 0 | 1 | 0 | 10   | LF  | SUB | *   | :   | J   | Z   |  j  |   z |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 0 | 1 | 1 | 11   | VT  | ESC | +   |  ;  | K   | [   |  k  |   { |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 1 | 0 | 0 | 12   | FF  | FS  | ,   | <   | L   | \   |  l  |   | |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 1 | 0 | 1 | 13   | CR  | GS  | -   | =   | M   | ]   |  m  |   } |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 1 | 1 | 0 | 14   | SO  | RS  | .   | >   | N   | ^   |  n  |   ~ |
|---|---|---|---|------|-----|-----|-----|-----|-----|-----|-----|-----|
| 1 | 1 | 1 | 1 | 15   | SI  | US  | /   | ?   | O   | _   |  o  | DEL |
+----------------------+-----------------------------------------------+

ASCII is a 7-bit code ($2^7 = 128$). In practice, each character is stored in a single byte whose most significant bit is set to 0, so the remaining 7 bits hold the information.
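
As a small illustration of the table and the 7-bit layout, the sketch below uses Python's built-in `ascii` codec. The sample string is arbitrary and chosen only for demonstration.

```python
text = "Hello, RFC 20!"

# Encoding to ASCII yields exactly one byte per character.
data = text.encode("ascii")
assert len(data) == len(text)

# Print each character with its decimal value and its 8-bit pattern.
for ch, byte in zip(text, data):
    print(f"{ch!r}: {byte:3d} = {byte:08b}")

# Every value is below 128, i.e. the most significant bit is always 0.
all_seven_bit = all(byte < 0x80 for byte in data)

# Column 4, row 1 of the table: b7 b6 b5 = 100, b4..b1 = 0001.
assert chr(0b1000001) == "A"

# Characters outside the 128 entries cannot be encoded as ASCII.
try:
    "ř".encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)
```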

ASCII fulfilled its purpose of defining a basic set of the most common characters, yet the limitation of 128 entries is obvious. Fortunately, there is the Unicode standard with the UTF-8 encoding, thanks to which I can write a beautiful Czech sentence like: Ověnčený chmýřím, neštěstí šíříš. Please, don’t ask me to translate this :grimacing:.

Unicode

The Unicode Standard
is a text encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world’s writing systems that can be digitized[6].

The standard was created to meet the internet's need to represent any character in a unified way. Unicode is a character set that includes ASCII characters, accented letters, CJK characters, emojis, etc. Its code space has room for more than 1.1 million unique characters in total.

Version 16.0 of Unicode defines 154,998 characters, so there are still a lot of free slots. The upper limit on the number of characters Unicode can describe is set by its encodings, more specifically by UTF-16, which is the most restrictive of the three formats (UTF-8, UTF-16, and UTF-32).
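
The figure of roughly 1.1 million can be derived from how UTF-16 addresses characters: one 16-bit unit covers the first 65,536 code points, and a pair of reserved "surrogate" units covers the rest. The following is a sketch of that arithmetic; the constant names are chosen for illustration only.

```python
import sys

BMP = 2**16                                # code points reachable with one 16-bit unit
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1      # 1,024 reserved values
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1       # 1,024 reserved values
SUPPLEMENTARY = HIGH_SURROGATES * LOW_SURROGATES  # reachable via surrogate pairs

# Whole code space, U+0000 .. U+10FFFF.
code_points = BMP + SUPPLEMENTARY
# Surrogates themselves cannot represent characters.
usable = code_points - HIGH_SURROGATES - LOW_SURROGATES

print(f"{code_points:,}")  # 1,114,112
print(f"{usable:,}")       # 1,112,064

# Python exposes the same upper bound.
assert sys.maxunicode == 0x10FFFF == code_points - 1
```

UTF-8 and UTF-32 could technically address larger numbers, but they are restricted to the same range so that all three encodings cover exactly the same set of characters.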

Of these three encodings, UTF-8 plays the main role in today’s world.

UTF-8

According to current statistics (2025), more than 98% of websites use this character encoding[4].

The Unicode Transformation Format - 8 bit
is a character encoding standard for electronic communication, defined by the Unicode Standard[7].

The main difference between Unicode and UTF-8 should be clear now. Unicode is a character set, i.e. a mapping of characters and symbols to numbers, whereas UTF-8 is an encoding, i.e. a way to translate those numbers to bytes, a series of 1’s and 0’s.
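
The two steps can be observed separately in Python: `ord` returns the number Unicode assigns to a character, and `str.encode` turns that number into bytes. The sketch below uses the letter ř from the Czech sentence above; the variable names are illustrative.

```python
ch = "ř"

# Step 1 (character set): character -> number (code point).
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+0159

# Step 2 (encoding): number -> bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes.hex())                          # c599
print(" ".join(f"{b:08b}" for b in utf8_bytes))  # 11000101 10011001

# The same code point yields different bytes under other encodings.
utf16_bytes = ch.encode("utf-16-be")
utf32_bytes = ch.encode("utf-32-be")
print(utf16_bytes.hex())  # 0159
print(utf32_bytes.hex())  # 00000159

# For ASCII characters, UTF-8 produces the same single byte as ASCII.
assert "A".encode("utf-8") == "A".encode("ascii")
```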

Examples
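
The following is a minimal sketch of everyday encoding and decoding in Python 3, where `str` holds Unicode characters and `bytes` holds encoded data. It uses only built-in methods; the sample sentence is the one from the ASCII section.

```python
sentence = "Ověnčený chmýřím, neštěstí šíříš."

# str -> bytes (encode) and bytes -> str (decode) are inverse operations
# as long as the same encoding is used on both sides.
data = sentence.encode("utf-8")
assert data.decode("utf-8") == sentence

# Accented letters need more than one byte in UTF-8, so the byte count
# is larger than the character count.
print(len(sentence), "characters,", len(data), "bytes")

# Decoding with the wrong encoding either fails ...
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print("cannot decode as ASCII:", err.reason)

# ... or silently produces garbage, because Latin-1 accepts any byte value.
garbled = data.decode("latin-1")
print(repr(garbled))

# The errors= parameter controls what happens to unencodable characters.
print(sentence.encode("ascii", errors="replace"))
```

The practical takeaway is to always state the encoding explicitly when crossing the boundary between text and bytes, for example `open(path, encoding="utf-8")`, rather than relying on a platform default.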

References

  1. RFC20: ASCII Format for Network Interchange
  2. RFC5198: Unicode Format for Network Interchange
  3. SO: How many characters can be mapped with Unicode?
  4. w3techs.com: Usage statistics of character encodings for websites
  5. Wikipedia: ASCII
  6. Wikipedia: Unicode
  7. Wikipedia: UTF-8
  8. MIT OCW: Encoding
  9. NordVPN: Character Set