Encoding

One of the first data types you learn to work with as a programmer is strings. But this seemingly simple data type has very complex depths, so complex that it’s still changing, and various programming languages still handle strings in very different ways. In Bash, the contents of a string variable are “simply” stored as a sequence of bytes (values 0 through 255) with a NUL byte at the end. For scripting purposes the NUL byte is not part of the variable value, and this terminator means that if you try to store arbitrary binary data in a variable, the value will be cut off at the first occurrence of a NUL byte.
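
A minimal sketch of that truncation, assuming Bash and its ANSI-C quoting (`$'...'`) to write the NUL byte; the variable name `value` and its contents are purely illustrative:

```shell
# ANSI-C quoting ($'...') lets us write a NUL byte as \0.
value=$'foo\0bar'

# Only the bytes before the NUL end up in the variable.
printf '%s\n' "$value"      # prints "foo"
printf '%s\n' "${#value}"   # prints "3"
```

Other ways of getting a NUL byte into a variable, such as command substitution, may behave differently (for example by dropping the NUL rather than truncating), so treat this as an illustration of the storage model rather than a general rule for binary data.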

The NUL byte at the end is not considered part of the string, so it does not count toward the length of the variable value.

That takes care of Bash variables: series of bytes with no special meaning, internally terminated by a NUL byte. To get to what humans would consider a string you have to add an encoding: a mapping from byte values to code units, and in the case of multi-byte encodings another mapping from code units to code points (often called “characters” although this is a heavily overloaded word). The first step is to check which encoding the current shell is using, which the locale command reports.
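
A minimal sketch of that check, assuming a system that provides the POSIX `locale` utility. The output depends entirely on the machine, so none is shown here; the variable name `charmap` is illustrative.

```shell
# Show every effective locale category for the current shell.
locale

# Show only the codeset (character map) of the effective LC_CTYPE.
charmap="$(locale charmap)"
printf '%s\n' "$charmap"
```

On a typical modern desktop system the second command is likely to print UTF-8, but that is an assumption about the environment, not a guarantee.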

locale prints the effective settings rather than the variable assignments. So if you want to get the current collation setting in a script you should inspect the output of locale rather than the value of $LC_COLLATE (“collate” is synonymous with “sort” and “order”). Even if $LC_COLLATE is set it is overridden by $LC_ALL when that is set, and if $LC_COLLATE is unset the setting falls back to $LANG.
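
To illustrate why the output of locale is more reliable than the variable itself, here is a small sketch. The two values are chosen only because the C and POSIX locales exist everywhere; `effective` is an illustrative variable name.

```shell
# LC_COLLATE is set to POSIX, but LC_ALL takes precedence over it.
# The assignments apply only to this one invocation of locale.
effective="$(LC_COLLATE=POSIX LC_ALL=C locale | grep '^LC_COLLATE=')"

# The reported setting comes from LC_ALL, not from LC_COLLATE.
printf '%s\n' "$effective"
```

A script that only read `$LC_COLLATE` here would see POSIX, while the setting actually in effect comes from `$LC_ALL`.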

The values except for LANGUAGE are formatted as language[_territory][.codeset][@modifier] (documented in man 3 setlocale). We’re only interested in the “codeset” part of LC_CTYPE (locale character type), “UTF-8”, which tells the shell how to interpret byte sequences as code points.
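
One way to see the effect of the codeset is to measure the same bytes under two locales. This is a sketch under assumptions: the locale name C.UTF-8 may not exist on every system (check `locale -a` and substitute a UTF-8 locale that does), and the names `value` and `byte_length` are illustrative.

```shell
# Two bytes (0xC3 0xA4, written in octal) that encode the single
# code point U+00E4 in UTF-8.
value="$(printf '\303\244')"

# Under a UTF-8 LC_CTYPE, Bash counts code points, so this is expected
# to print 1. C.UTF-8 is an assumed locale name; if it is missing,
# Bash warns and the result may differ.
(LC_ALL=C.UTF-8; printf '%s\n' "${#value}")

# Under the C locale every byte is counted separately, so this prints 2.
byte_length="$(LC_ALL=C; printf '%s' "${#value}")"
printf '%s\n' "$byte_length"
```

The variable holds the same two bytes in both cases; only the interpretation of those bytes changes with the locale.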

 

This page is a preview of The newline Guide to Bash Scripting
