Back when I started working with computers, understanding the nature of ASCII was exciting. In fact, just knowing how to convert binary to hex was fun.
That was many years ago, before ASCII had even reached drinking age, but character encoding standards are as important as ever now that the internet is so much a part of our business and personal lives. They're also more complex and more numerous than you might imagine. So, let's dive into some of the details of what ASCII is and some of the commands that make it easier to see encoding standards in action.
Why ASCII?
ASCII came about to circumvent the problem that different types of electronic systems were storing text in different ways. They all used some form of ones and zeroes (or ONs and OFFs), but compatibility became important when these systems needed to interact. So, ASCII was developed primarily to provide encoding consistency. It became a U.S. standard in 1963. Initially, ASCII characters used only 7 bits. Some years later, the standard was extended to use all 8 bits in each byte.
That said, it is important to understand that ASCII, the American Standard Code for Information Interchange, is not used on all computers. In fact, most Linux systems today use UTF-8, a standard closely related to ASCII but not quite identical. In UTF-8, the classic ASCII characters are encoded in a single byte, and characters with greater values use two to four bytes.
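You can see this variable-width behavior for yourself by counting bytes. Here's a quick sketch, assuming a UTF-8 locale (the default on most modern Linux systems):

```shell
# wc -c counts bytes, not characters, so it reveals how many
# bytes UTF-8 needs for each character
printf 'A' | wc -c      # ASCII letter: 1 byte
printf 'é' | wc -c      # Latin-1 range: 2 bytes
printf '€' | wc -c      # euro sign (U+20AC): 3 bytes
```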
Some of the more important encoding standards in use today include:
- ASCII — Most widely used for English before 2000
- UTF-8 — Used in Linux by default along with much of the internet
- UTF-16 — Used by Microsoft Windows, Mac OS X file systems and others
- GB 18030 — Used in China (contains all Unicode chars)
- EUC-JP (Extended Unix Code) — Used in Japan
- ISO/IEC 8859 series — Used for most European languages
According to one source that I describe below, however, there are as many as 1,173 different encoding schemes in use today.
Viewing an ASCII translation table
One of the easiest ways to display an ASCII table on Linux systems is to use the man ASCII or man ascii command. Within the body of the page displayed, you will see a table that starts like this:
Oct   Dec   Hex   Char                         Oct   Dec   Hex   Char
────────────────────────────────────────────────────────────────────────
000   0     00    NUL '\0' (null character)    100   64    40    @
001   1     01    SOH (start of heading)       101   65    41    A
002   2     02    STX (start of text)          102   66    42    B
003   3     03    ETX (end of text)            103   67    43    C
004   4     04    EOT (end of transmission)    104   68    44    D
005   5     05    ENQ (enquiry)                105   69    45    E
006   6     06    ACK (acknowledge)            106   70    46    F
007   7     07    BEL '\a' (bell)              107   71    47    G
010   8     08    BS  '\b' (backspace)         110   72    48    H
011   9     09    HT  '\t' (horizontal tab)    111   73    49    I
Notice that the table is split into two four-column displays. (The right side appears in bold font above, depending on your browser, to make this clearer.) Each side displays the octal, decimal, hexadecimal and character representations of a series of characters. The letter "I" (bottom right), which is 1001001 in binary, is 111 in octal, 73 in decimal, and 49 in hex.
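You don't have to keep the man page open to do these conversions; printf can handle them. A small sketch (the leading quote in "'I" is a POSIX printf idiom that yields a character's numeric code):

```shell
# get the decimal code of the letter I, then print it in each base
dec=$(printf '%d' "'I")
printf 'dec=%d oct=%o hex=%x\n' "$dec" "$dec" "$dec"   # dec=73 oct=111 hex=49
```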
Looking at file content
To display the content of a file in some other format than its character (ASCII) format, you could use any of a number of different commands. These include od (octal dump), hexdump, xxd and iconv.
od
The od -bc command will display a file in both octal and character format. The \012 at the end is the newline character that ends the single line of text.
$ cat testing
Testing 1 2 3
$ od -bc testing
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
          T   e   s   t   i   n   g       1       2       3  \n
0000016
To view the same file in hex, you could use the command below, though you'll probably notice that the bytes in each two-byte group are swapped. For example, T=54 and e=65, so you might expect to see "5465" instead of "6554".
$ od -hc testing
0000000    6554    7473    6e69    2067    2031    2032    0a33
          T   e   s   t   i   n   g       1       2       3  \n
0000016
Adding an "endian" specification to the od command gets around this issue:
$ od -xc --endian=big testing
0000000    5465    7374    696e    6720    3120    3220    330a
          T   e   s   t   i   n   g       1       2       3  \n
0000016
The big-endian and little-endian designations refer to whether the data values are ordered with the most significant (big-endian) or least significant (little-endian) byte first.
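You can watch the two orderings side by side by feeding od the same two bytes with each option. A quick sketch (the --endian option requires GNU od):

```shell
# "T" is 0x54 and "e" is 0x65; -x groups the input into two-byte words
# and -An suppresses the offset column to keep the output minimal
printf 'Te' | od -An -x --endian=little   # 6554 (low byte first)
printf 'Te' | od -An -x --endian=big      # 5465 (high byte first)
```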
The command below shows the same text in octal. Keep in mind that octal 124 is 1 010 100 in binary (grouped in threes for octal) and 54 in hex (0101 0100, grouped in fours): the same value, just expressed differently.
$ echo Testing 1 2 3 | od -bc
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
          T   e   s   t   i   n   g       1       2       3  \n
0000016
hexdump
Another useful command is hexdump. In the examples below, we see hexdump displaying the file in hex, character and octal format.
hex
$ hexdump testing
0000000 6554 7473 6e69 2067 2031 2032 0a33
000000e
character
$ hexdump -c testing
0000000   T   e   s   t   i   n   g       1       2       3  \n
000000e
one-byte octal
$ hexdump -b testing
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
000000e
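hexdump can also combine the hex and character views in a single display with its -C (canonical) option, which shows the offset, the hex bytes, and the printable characters side by side:

```shell
# canonical format: offset, hex bytes, and printable characters together;
# non-printable bytes such as the newline appear as a dot
printf 'Testing 1 2 3\n' | hexdump -C
```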
xxd
The xxd command creates a hex dump or converts a hex dump back into its original binary form. It displays a file in big-endian order by default.
$ xxd testing
00000000: 5465 7374 696e 6720 3120 3220 330a  Testing 1 2 3.
$ echo "Testing 1 2 3" | xxd
00000000: 5465 7374 696e 6720 3120 3220 330a  Testing 1 2 3.
In continuous hex dump style:
$ echo "Testing 1 2 3" | xxd -p
54657374696e672031203220330a
$ echo 54657374696e672031203220330a | xxd -r -p
Testing 1 2 3
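xxd can drop all the way down to bits as well. A quick sketch using its -b option, which prints each byte as eight binary digits:

```shell
# -b shows binary octets instead of hex; the first byte, 01010100,
# is the same value as the T=54 hex seen above
printf 'Testing 1 2 3\n' | xxd -b
```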
iconv
The iconv command translates content from one character encoding to another. It's also the command that, as I mentioned earlier, suggests there are 1,173 different encoding schemes in use. Let's see why.
The --list option gets the command to list all encoding schemes.
$ iconv --list | wc -l
1173
You can page through that full list or, as in this command, focus solely on the UTF* schemes:
$ iconv --list | grep UTF
ISO-10646/UTF-8/
ISO-10646/UTF8/
UTF-7//
UTF-8//
UTF-16//
UTF-16BE//
UTF-16LE//
UTF-32//
UTF-32BE//
UTF-32LE//
UTF7//
UTF8//
UTF16//
UTF16BE//
UTF16LE//
UTF32//
UTF32BE//
UTF32LE//
The iconv command converts data between different formats using the syntax:
iconv [-f from-encoding] [-t to-encoding] [inputfile]
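When the input file is omitted, iconv reads standard input, so it drops neatly into a pipeline. Here's a sketch of the size doubling we're about to see with files; because the byte order is stated explicitly (UTF-16LE), GNU iconv adds no byte order mark:

```shell
# 14 bytes of UTF-8 (13 characters plus a newline) become 28 bytes
# of UTF-16LE, since each of these characters needs two bytes
printf 'Testing 1 2 3\n' | iconv -f UTF-8 -t UTF-16LE | wc -c   # 28
```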
Here's an example of iconv in action. Our initial file is a copy of the testing file that I'm calling testing8.
$ cat testing8
Testing 1 2 3
We use the iconv command to convert the file to UTF16 format:
$ iconv -f utf8 -t utf16 testing8 > testing16
A file listing shows that the resultant file is just over twice the size of the original.
$ ls -l testing*
-rw-rw-r-- 1 shs shs 30 Dec 22 11:06 testing16
-rw-r--r-- 1 shs shs 14 Dec 22 11:05 testing8
Out of curiosity, we look at the new file and see that every other byte is "000"; the data we're displaying doesn't need the extra byte. We also see that two bytes were tacked onto the beginning of the file. This is the byte order mark (BOM), which indicates the endianness of the UTF-16 data and is not treated as part of the text.
$ od -bc testing16
0000000 377 376 124 000 145 000 163 000 164 000 151 000 156 000 147 000
        377 376   T  \0   e  \0   s  \0   t  \0   i  \0   n  \0   g  \0
0000020 040 000 061 000 040 000 062 000 040 000 063 000 012 000
             \0   1  \0      \0   2  \0      \0   3  \0  \n  \0
0000036
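Converting back to UTF-8 strips the byte order mark and restores the original file exactly. A sketch that recreates the files so it stands on its own (the file names are just the ones used above):

```shell
# recreate the sample file and convert it to UTF-16 (BOM included)
printf 'Testing 1 2 3\n' > testing8
iconv -f UTF-8 -t UTF-16 testing8 > testing16

# convert back; iconv reads and removes the BOM, so the files match
iconv -f UTF-16 -t UTF-8 testing16 > testing8.back
cmp testing8 testing8.back && echo "round trip OK"
```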
Character encoding is a much larger and more complex issue than ASCII. Fortunately, Linux offers nice tools that allow you to peer into coding schemes and see what happens when you convert one to another.