Squinting at ASCII on Linux

ASCII plays a much more important role on our systems than generating techno-art. Let's explore the commands that allow you to see how it works.

Image credit: Kyle McDonald (CC BY 2.0)

Back when I started working with computers, understanding the nature of ASCII was exciting. In fact, just knowing how to convert binary to hex was fun.

That was a lot of years ago, before ASCII had reached drinking age, but character encoding standards are as important as ever today, with the internet being so much a part of our business and personal lives. They're also more complex and more numerous than you might imagine. So, let's dive into some of the details of what ASCII is and some of the commands that make it easier to see encoding standards in action.

Why ASCII?

ASCII came about because different types of electronic systems were storing text in different ways. They all used some form of ones and zeroes (or ONs and OFFs), but compatibility became important when they needed to interact. So, ASCII was developed primarily to provide encoding consistency. Work on the standard began in 1960, and it was first published as a U.S. standard in 1963. ASCII itself uses only 7 bits per character. Later, a variety of 8-bit "extended ASCII" encodings put the eighth bit in each byte to use for additional characters.

That said, it is important to understand that ASCII, the American Standard Code for Information Interchange, is not used on all computers. In fact, most Linux systems today use UTF-8, an encoding that is backward compatible with ASCII but covers far more characters. In UTF-8, the classic ASCII characters are each stored in a single byte with the high bit set to 0, and all other characters use two to four bytes.
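
The byte counts are easy to check from the command line. The following is only a sketch, assuming iconv is installed and recognizes the UTF-16BE and UTF-8 names; utf8_len is a made-up helper, not a standard command. It feeds iconv one character at a time (written as UTF-16BE bytes in octal) and counts the UTF-8 bytes that come out:

```shell
# utf8_len (hypothetical helper): convert the UTF-16BE bytes given as
# printf octal escapes to UTF-8 and print the resulting byte count.
utf8_len() {
    printf "$1" | iconv -f UTF-16BE -t UTF-8 | wc -c | tr -d ' '
}

utf8_len '\000\101'          # "A" (U+0041)
utf8_len '\000\351'          # "é" (U+00E9)
utf8_len '\040\254'          # "€" (U+20AC)
utf8_len '\330\075\336\000'  # emoji U+1F600, a UTF-16 surrogate pair
```

The four calls should print 1, 2, 3 and 4, one count per line.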

Some of the more important encoding standards in use today include:

  • ASCII: Most widely used for English before 2000
  • UTF-8: Used in Linux by default along with much of the internet
  • UTF-16: Used by Microsoft Windows, Mac OS X file systems and others
  • GB 18030: Used in China (contains all Unicode chars)
  • EUC-JP (Extended Unix Code): Used in Japan
  • ISO/IEC 8859 series: Used for most European languages

According to one source that I describe below, however, there are as many as 1,173 different encoding schemes in use today.

Viewing an ASCII translation table

One of the easiest ways to display an ASCII table on Linux systems is to use the man ASCII or man ascii command. Within the body of the page displayed, you will see a table that starts like this:

       Oct   Dec   Hex   Char                        Oct   Dec   Hex   Char
       ────────────────────────────────────────────────────────────────────────
       000   0     00    NUL '\0' (null character)   100   64    40    @
       001   1     01    SOH (start of heading)      101   65    41    A
       002   2     02    STX (start of text)         102   66    42    B
       003   3     03    ETX (end of text)           103   67    43    C
       004   4     04    EOT (end of transmission)   104   68    44    D
       005   5     05    ENQ (enquiry)               105   69    45    E
       006   6     06    ACK (acknowledge)           106   70    46    F
       007   7     07    BEL '\a' (bell)             107   71    47    G
       010   8     08    BS  '\b' (backspace)        110   72    48    H
       011   9     09    HT  '\t' (horizontal tab)   111   73    49    I

Notice that the table is split into two 4-column displays. Each side displays the octal, decimal, hexadecimal and character representations of a series of characters. The letter "I" (bottom right), which is 1001001 in binary, is shown as 111 in octal, 73 in decimal, and 49 in hex.
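
The same three values can be produced without the man page. As a small sketch (char_codes is a hypothetical helper name), printf treats an argument that starts with a quote as a request for the numeric code of the character that follows:

```shell
# char_codes (hypothetical helper): print the octal, decimal and hex
# codes of a single character, in the same order as the man page table.
char_codes() {
    printf '%o %d %x\n' "'$1" "'$1" "'$1"
}

char_codes I
char_codes A
```

For "I" this should print 111 73 49, matching the bottom-right row of the table.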

Looking at file content

To display the content of a file in some format other than its character (ASCII) format, you could use any of a number of different commands. These include od (octal dump), hexdump, xxd and iconv.

od

The od -bc command will display a file in both octal and character format. The \012 at the end is the newline character that ends the single line of text.

$ cat testing
Testing 1 2 3
$ od -bc testing
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
          T   e   s   t   i   n   g       1       2       3  \n
0000016

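To view the same file in hex, you could use this command, though you'll probably notice that the two bytes in each four-digit group appear swapped. For example, T=54 and e=65, so you might expect to see "5465" instead of "6554". This happens because od reads the file as two-byte words in the machine's native byte order, which is little-endian on x86 systems.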

$ od -xc testing
0000000    6554    7473    6e69    2067    2031    2032    0a33
          T   e   s   t   i   n   g       1       2       3  \n
0000016

Adding an "endian" specification to the od command gets around this issue:

$ od -xc --endian=big testing
0000000    5465    7374    696e    6720    3120    3220    330a
          T   e   s   t   i   n   g       1       2       3  \n
0000016

The big-endian and little-endian designation refers to whether the data values are ordered with the most significant (big-endian) or least significant (little-endian) byte first.
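
To see the difference on your own system, compare od's one-byte and two-byte hex output for the same two characters. This sketch uses only the -A and -t options; the variable names are arbitrary:

```shell
# One byte at a time: always file order, whatever the hardware.
bytes=$(printf 'Te' | od -An -tx1 | tr -d '[:space:]')

# One two-byte word: od interprets it in the machine's native byte order.
word=$(printf 'Te' | od -An -tx2 | tr -d '[:space:]')

echo "bytes=$bytes word=$word"
```

On a little-endian machine such as x86, the word should come out as 6554. On a big-endian machine, it should match the byte view, 5465.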

The command below shows the same text in octal. Keep in mind that octal 124 is 01 010 100 in binary and 54 (0101 0100) in hex. The value is the same; only the notation differs.

$ echo Testing 1 2 3 | od -bc
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
          T   e   s   t   i   n   g       1       2       3  \n
0000016
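
printf can confirm these equivalences, since it accepts C-style constants as arguments: a leading 0 marks octal and a leading 0x marks hex. A quick sketch, with arbitrary variable names:

```shell
octal_as_dec_hex=$(printf '%d %x' 0124 0124)   # octal 124 as decimal and hex
hex_as_octal=$(printf '%o' 0x54)               # hex 54 as octal

echo "$octal_as_dec_hex"
echo "$hex_as_octal"
```

This should print 84 54 on the first line and 124 on the second.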

hexdump

Another useful command is hexdump. In the examples below, we see hexdump displaying the file in hex, character and octal format.

hex

$ hexdump testing
0000000 6554 7473 6e69 2067 2031 2032 0a33
000000e

character

$ hexdump -c testing
0000000   T   e   s   t   i   n   g       1       2       3  \n
000000e

one-byte octal

$ hexdump -b testing
0000000 124 145 163 164 151 156 147 040 061 040 062 040 063 012
000000e

xxd

The xxd command creates a hex dump or converts a hex dump back into its original binary form. It displays bytes in the order they appear in the file, which matches the big-endian view shown earlier.

$ xxd testing
00000000: 5465 7374 696e 6720 3120 3220 330a       Testing 1 2 3.
$ echo "Testing 1 2 3" | xxd
00000000: 5465 7374 696e 6720 3120 3220 330a       Testing 1 2 3.

In continuous hex dump style:

$ echo "Testing 1 2 3" | xxd -p
54657374696e672031203220330a
$ echo 54657374696e672031203220330a | xxd -r -p
Testing 1 2 3
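
On a system without xxd, the reverse conversion can be approximated in plain shell. This is a rough sketch, not a replacement for xxd; hex_to_text is a hypothetical helper that handles a single unbroken string of hex digits:

```shell
# hex_to_text (hypothetical helper): decode a continuous string of hex
# digits, two digits per byte, and write the bytes to standard output.
hex_to_text() {
    hex=$1
    # Two hex digits make one byte, so an odd count cannot be decoded.
    if [ $(( ${#hex} % 2 )) -ne 0 ]; then
        echo "hex_to_text: odd number of hex digits" >&2
        return 1
    fi
    while [ -n "$hex" ]; do
        rest=${hex#??}            # everything after the first two digits
        pair=${hex%"$rest"}       # the first two digits
        hex=$rest
        # Turn the pair into octal, then let %b expand the \0ddd escape.
        printf '%b' "\\0$(printf '%o' "0x$pair")"
    done
}

hex_to_text 54657374696e672031203220330a
```

The final 0a is the newline, so the call should print the original line of text.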

iconv

The iconv command will translate content from one character encoding to another. This is the command that, as I promised earlier, suggests that there are 1,173 different encoding schemes. Let's see why.

The --list option gets the command to list all encoding schemes.

$ iconv --list | wc -l
1173

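You can page through the full list or focus solely on the UTF* schemes with the command below. Notice that many entries are aliases (UTF8 and UTF-8, for example), so the number of distinct encodings is smaller than the line count.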

$ iconv --list | grep UTF
ISO-10646/UTF-8/
ISO-10646/UTF8/
UTF-7//
UTF-8//
UTF-16//
UTF-16BE//
UTF-16LE//
UTF-32//
UTF-32BE//
UTF-32LE//
UTF7//
UTF8//
UTF16//
UTF16BE//
UTF16LE//
UTF32//
UTF32BE//
UTF32LE//

The iconv command converts data between different formats using the syntax:

iconv [-f from-encoding] [-t to-encoding] [inputfile]

Here's an example of iconv in action. Our initial file is a copy of the testing file that I'm calling testing8.

$ cat testing8
Testing 1 2 3

We use the iconv command to convert the file to UTF16 format:

$ iconv -f utf8 -t utf16 testing8 > testing16

A file listing shows that the resultant file is just over twice the size of the original.

$ ls -l testing*
-rw-rw-r-- 1 shs shs 30 Dec 22 11:06 testing16
-rw-r--r-- 1 shs shs 14 Dec 22 11:05 testing8

Out of curiosity, we look at the new file and see that every other byte is "000". The data we're displaying isn't making use of the extra byte. We also see that two bytes were tacked onto the beginning of the file. These two bytes, known as a byte order mark (BOM), indicate the endianness of the UTF-16 data and are not treated as characters.

$ od -bc testing16
0000000 377 376 124 000 145 000 163 000 164 000 151 000 156 000 147 000
        377 376   T  \0   e  \0   s  \0   t  \0   i  \0   n  \0   g  \0
0000020 040 000 061 000 040 000 062 000 040 000 063 000 012 000
             \0   1  \0      \0   2  \0      \0   3  \0  \n  \0
0000036
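
If the byte order mark is not wanted, naming the byte order explicitly should leave it out. A minimal sketch, assuming an iconv that recognizes the UTF-16LE name (it appears in the listing above); the variable names are arbitrary:

```shell
text='Testing 1 2 3'

# 14 bytes of input (13 characters plus the newline) become 28 bytes
# of UTF-16LE: two bytes per character and no byte order mark.
size=$(printf '%s\n' "$text" | iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' ')
echo "$size"

# Converting back should reproduce the original text.
back=$(printf '%s\n' "$text" | iconv -f UTF-8 -t UTF-16LE | iconv -f UTF-16LE -t UTF-8)
echo "$back"
```

That also accounts for the 30 bytes in the file listing: 2 x 14 for the characters plus 2 for the byte order mark.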

Character encoding is a much larger and more complex issue than ASCII. Fortunately, Linux offers nice tools that allow you to peer into coding schemes and see what happens when you convert one to another.
