Bytes, Runes, and Strings: How Text Works in Go

This article assumes prior knowledge of character sets including ASCII and Unicode. If you want a no-nonsense explanation, check out: A No Nonsense Guide to Unicode

Working with text in the early days was easy.

There was English. And only English.

Fast forward to today and text has gotten a lot more complex.

Modern languages like Go natively support the Unicode character set, which allows virtually any character in any writing system to be used [1].

This added complexity leaves many programmers confused.

In this article, we’ll clear up the confusion by understanding how text really works in Go.

Fundamental Data Types

At the most basic level, computers only understand bits. These bits have no intrinsic meaning.

To represent human-readable characters, we need to use a character set. A character set is a set of characters, each of which has a defined mapping to a unique sequence of bits.

Early character sets such as ASCII traditionally used at most 1 byte, i.e. 8 bits, to encode a single character of text in a computer (ASCII itself needs only 7 of those bits). Languages like C simply had to provide a single data type (char) that was capable of holding 8 bits to store characters.

Modern character sets including Unicode, which are tasked with representing every character known to man, often have to use more than 1 byte to encode a single human-readable character of text. For example, the Unicode UTF-8 encoding uses anywhere between 1 and 4 bytes to represent a single code point [2].
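
To make the 1 to 4 byte range concrete, here is a minimal sketch that asks the standard unicode/utf8 package how many bytes UTF-8 needs for one sample code point of each width. The helper name encodedLen and the sample characters are chosen for illustration only.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (a hypothetical helper) reports how many bytes UTF-8 needs
// to encode the code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample code point for each possible UTF-8 width.
	// '\U0001F600' is the grinning face emoji, written as an escape.
	samples := []rune{'a', '\u00A9', '\u4E16', '\U0001F600'}
	for _, r := range samples {
		fmt.Printf("%#U needs %d byte(s)\n", r, encodedLen(r))
	}
}
```

The four samples need 1, 2, 3 and 4 bytes respectively.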

To work with modern text in Go, we must understand three fundamental data types.

Bytes

A byte is an unsigned 8-bit integer. In other words, the byte type is an alias for the uint8 type [3]. We use byte instead of uint8 in any situation where the numerical value represents an encoding of a character, and not merely a standard 8-bit unsigned integer. Note that a byte is roughly equivalent to the unsigned char type in C.

A byte can store any number from 0 to 255 and can be used to represent a single-byte UTF-8 code point. These are the code points 0 to 127, which are identical to the ASCII characters.

We define bytes using either a character in single quotes or the numeric encoding of the character.

For example, we can specify the letter a as a byte:

var a1 byte = 97    // valid
var a2 byte = 'a'   // valid
var a3 byte = '😀'  // invalid: the code point 128512 does not fit in a byte

However, attempting to store a multi-byte character such as 😀 results in an overflow, and the program fails to compile.

If we were using a language that only provided single-byte character support, like C, we’d either have to stop here and endure the pain of emoji-less code, or implement the functionality ourselves.

Thankfully, we can sleep easy- Go provides an additional data type for just such a situation.

Runes

A rune is a 32-bit signed integer. In other words, the rune type is an alias for the int32 type.

A rune is used to represent a single Unicode code point. Since the largest code point, U+10FFFF, fits comfortably within 32 bits (and its UTF-8 encoding never exceeds 4 bytes), a rune can be used to store any Unicode code point.

We define runes using single quotes:

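r1 := '©'
var r2 = 'k'
var r3 rune = '😀'

Note that when using single quotes, the default type is rune. Since a byte is also defined using single quotes, we must explicitly specify the type when we want to define a byte. Also note that a rune literal holds exactly one code point, so an emoji that is built from several code points, such as 🏖️, cannot be written as a single rune.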

You might be wondering when to use rune over byte or vice versa. A rune always occupies 4 bytes, which means that when storing single-byte ASCII values, 3 of those bytes are wasted. For maximum memory efficiency, we use bytes for single-byte code points (ASCII values) and runes when the data may contain multi-byte code points.

Now that we know that a byte can represent a single-byte code point, and that a rune can represent any code point, including multi-byte ones, we’re ready for the final data type.
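
The size difference can be checked directly. The sketch below uses unsafe.Sizeof to compare the storage size of one byte and one rune, and then converts the same ASCII string to both slice types. The helper name sizes is an assumption made for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sizes (a hypothetical helper) returns the storage size, in bytes,
// of a single byte value and a single rune value.
func sizes() (byteSize, runeSize uintptr) {
	var b byte
	var r rune
	return unsafe.Sizeof(b), unsafe.Sizeof(r)
}

func main() {
	b, r := sizes()
	fmt.Println(b, r) // 1 4

	s := "hello"
	fmt.Println(len([]byte(s))) // 5 elements of 1 byte each
	fmt.Println(len([]rune(s))) // 5 elements of 4 bytes each
}
```

For ASCII-only data, the []rune form therefore takes four times the element storage of the []byte form.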

Strings

In Go, there are two types of strings: string literals and string values.

String Literals

String literals are strings that are explicitly written in the source code.

There are two types of string literals. These include:

Interpreted/Double Quote string literals: strings that are delimited by double quotes. Interpreted string literals allow escaping [4].
Non-interpreted/Backtick/Raw Strings: strings that are delimited by backticks (`). Raw strings do not honour escaping.

Since the source code of any Go program must be valid UTF-8, a source file that contains invalid UTF-8 will fail to compile, and that includes the text of its string literals.

From this, we know that any string literal, whether interpreted or raw, holds valid UTF-8, with one exception: an interpreted literal can use byte-level escapes such as \xff to insert arbitrary bytes.
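
As a small illustration of the two literal forms, the sketch below writes the same characters between double quotes and between backticks and compares the results. The helper name literalLengths is hypothetical.

```go
package main

import "fmt"

// literalLengths (a hypothetical helper) returns the byte lengths of an
// interpreted and a raw literal that differ only in their delimiters.
func literalLengths() (interpreted, raw int) {
	return len("a\tb"), len(`a\tb`)
}

func main() {
	fmt.Println("a\tb") // the escape sequence becomes a single tab character
	fmt.Println(`a\tb`) // the backslash and the t are kept exactly as written

	i, r := literalLengths()
	fmt.Println(i, r) // 3 4
}
```

The interpreted literal is one byte shorter because the two-character escape collapses into a single tab byte.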

String Values

String values are strings which are not present in the source code and are often generated at run-time. For example, this may be textual data received from an external system or text output to the terminal.

Since string values are not guaranteed to originate from the program source, unlike string literals, they are not guaranteed to be valid UTF-8 encoded strings.

If you need UTF-8 encoded strings, it’s a good idea to validate any string values that originate from outside your program.
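
One possible way to do this, sketched with the standard library only: utf8.ValidString reports whether a string is valid UTF-8, and strings.ToValidUTF8 replaces invalid byte runs. The helper name sanitize and the sample input are assumptions for this example, and replacing bad bytes (rather than rejecting the input) is just one policy among several.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize (a hypothetical helper) returns s unchanged when it is valid
// UTF-8. Otherwise it replaces each run of invalid bytes with the Unicode
// replacement character U+FFFD.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// Made-up external input: "caf" followed by a lone 0xE9 byte, which is
	// how Latin-1 encodes 'é' but is not valid UTF-8 on its own.
	input := string([]byte{'c', 'a', 'f', 0xE9})

	fmt.Println(utf8.ValidString(input)) // false
	fmt.Printf("%q\n", sanitize(input))  // "caf" followed by U+FFFD
}
```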

Knowing how to define strings is the first step.

Now, we need to know how to interact with them.

Interacting with Strings in Go

At the lowest level, a string is a read-only (immutable) slice of arbitrary bytes [5].

By arbitrary, we mean that the bytes can be of any format. When a character value is stored in a string, its byte-at-a-time representation is stored. Go does not know or require that the bytes represent any particular encoding (ASCII, UTF-8, UTF-32 and so on); they are just bytes.

Since a string is essentially a sequence of bytes, we can very easily convert between the two using the string() and []byte() conversions:

var str = "hello"
var strBytes = []byte(str) // [104 101 108 108 111]
var sameStr = string(strBytes) // "hello"
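
Because strings are immutable, an assignment such as str[0] = 'j' does not compile. The usual workaround, sketched below, is to convert to a byte slice (which makes a copy), modify the copy, and convert back. The helper name replaceFirst is made up for this example.

```go
package main

import "fmt"

// replaceFirst (a hypothetical helper) returns a copy of s whose first
// byte is replaced by b. The original string is never modified.
func replaceFirst(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // the conversion copies the string's bytes
	buf[0] = b       // safe: only the copy changes
	return string(buf)
}

func main() {
	original := "hello"
	changed := replaceFirst(original, 'j')
	fmt.Println(original, changed) // hello jello
}
```

Note that this byte-level replacement is only meaningful when the first character is a single-byte code point.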

We can interact with strings at the byte-level by using Unicode-unaware operators and functions. These pay no attention to any encoding and simply interpret a byte at a time.

This includes indexing a string using the index operator, using a standard for loop and computing the length of a string using the built-in len function. For example:

var str = "hello"
var h = str[0] // 104 i.e. byte value of 'h'

for i := 0; i < len(str); i++ {
  fmt.Println(str[i]) // output numeric byte value of every char in str
}

var length = len(str) // 5 since there are 5 bytes in "hello"

Operating at the byte level is fine when using strings in which each character is represented by a single-byte code point i.e. ASCII character strings, since the number of bytes is equal to the number of characters.

But when considering strings that contain other Unicode characters, where a single character can comprise multiple code points and each code point can comprise multiple bytes, this approach can give odd results.

Take the string str = "🏖️", for example. This emoji is represented by two multi-byte code points (U+1F3D6 followed by U+FE0F). Computing the length of str gives 7, meaning that this single grapheme (human-readable character) is made up of 7 bytes.

To resolve this, we need to use Unicode-aware operators and functions, which operate at the rune level. This means that we step through the string one code point at a time rather than one byte at a time; a single byte may be only part of a character and is thus meaningless on its own. Note that a grapheme such as 🏖️ still spans more than one rune, so rune-level iteration yields code points, not graphemes.
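
The byte count can be reproduced as follows. This sketch assumes the emoji is encoded as U+1F3D6 followed by the variation selector U+FE0F, which is consistent with the length of 7, and writes both code points as escapes so that nothing depends on how an editor stores the character. The helper name counts is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// beach is the emoji written as escapes: U+1F3D6 followed by U+FE0F.
const beach = "\U0001F3D6\uFE0F"

// counts (a hypothetical helper) returns the number of bytes and the
// number of code points in s.
func counts(s string) (numBytes, numRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts(beach)
	fmt.Println(b, r) // 7 2

	// Print the raw bytes in hex: 4 bytes for U+1F3D6, then 3 for U+FE0F.
	fmt.Printf("% x\n", beach) // f0 9f 8f 96 ef b8 8f
}
```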
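
One simple way to index by code point instead of by byte is to convert the string to a []rune first. The sketch below contrasts the two; the helper name runeAt is made up for this example, and the conversion allocates a new slice, so it is best suited to short strings.

```go
package main

import "fmt"

// runeAt (a hypothetical helper) returns the n-th code point of s,
// counting from zero, and reports whether that position exists.
func runeAt(s string, n int) (rune, bool) {
	runes := []rune(s) // decodes the UTF-8 bytes into code points
	if n < 0 || n >= len(runes) {
		return 0, false
	}
	return runes[n], true
}

func main() {
	s := "h\u00e9llo" // "héllo", with é written as an escape

	fmt.Println(s[1]) // 195: only the first of the two bytes that encode é

	r, _ := runeAt(s, 1)
	fmt.Printf("%c\n", r) // é
}
```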

A common Unicode-aware structure is the for range loop. The for range loop decodes one UTF-8-encoded rune on each iteration. On each iteration, the index of the loop is the byte index of the current rune, and the value is the code point of the rune.

For example

package main

import "fmt"

func main() {
	var str = "世界"

	// range decodes one UTF-8 encoded rune per iteration
	for index, runeValue := range str {
		fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
	}
}

outputs

U+4E16 '世' starts at byte position 0
U+754C '界' starts at byte position 3

Notice how the loop does not iterate over every individual byte but cleverly interprets the string, str, as a UTF-8 encoded string that is comprised of UTF-8 code points/runes.

Whilst string literals contain valid UTF-8 encoded characters (unless they use byte-level escapes such as \xff), string values provide no such guarantee.

In cases where the for range encounters an invalid UTF-8 sequence, the rune value is set to 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
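
This behaviour can be observed with a string that contains a stray 0xFF byte, which is never valid in UTF-8. In the sketch below, the byte is inserted with a \xff escape; the helper name decodeAll is hypothetical.

```go
package main

import "fmt"

// decodeAll (a hypothetical helper) collects the runes that a for range
// loop yields for s.
func decodeAll(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "a\xffb" // 'a', one invalid byte, 'b'

	for i, r := range s {
		fmt.Printf("byte offset %d: %#U\n", i, r)
	}

	fmt.Println(len(decodeAll(s))) // 3
}
```

The loop yields U+0061 at offset 0, U+FFFD at offset 1 and U+0062 at offset 2, so the invalid byte consumes exactly one position.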

Additional Unicode-aware functions are provided by the unicode/utf8 package in the Go standard library.
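
As a brief tour, the sketch below uses three functions from that package: RuneCountInString to count code points, DecodeRuneInString to decode the first code point and its width, and EncodeRune to turn a rune back into bytes. The helper name firstRune is an assumption for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune (a hypothetical helper) decodes the first code point of s
// and reports how many bytes it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "世界"

	fmt.Println(len(s))                    // 6 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 2 code points

	r, size := firstRune(s)
	fmt.Printf("%c is %d bytes wide\n", r, size) // 世 is 3 bytes wide

	// Encode the rune back into its UTF-8 bytes.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	fmt.Println(buf[:n]) // [228 184 150]
}
```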

Summary

Here’s a quick summary of what you need to know.

In ASCII, each character is represented by a single byte. Thus, a single ASCII character can be stored in Go as a byte data type. In contrast, Unicode code points outside the ASCII range take between 2 and 4 bytes when encoded as UTF-8. The rune data type, which is 4 bytes wide, can thus represent any Unicode code point in existence.

Strings are used to represent a sequence of characters. In Go, string literals hold valid UTF-8 unless they use byte-level escapes such as \xff. String values do not have this guarantee. Internally, a string is a read-only sequence of arbitrary bytes. When a string contains multi-byte characters, we need to use Unicode-aware functions, which correctly combine the individual bytes into runes. In the special case where the string contains only single-byte characters, Unicode-unaware and Unicode-aware functions return the same results, since each character is a single rune which is a single byte.

Notes

[1] This is hardly surprising: two of Go’s three original designers, Rob Pike and Ken Thompson, are also the creators of UTF-8, the most popular character-encoding scheme for Unicode.

[2] For more information on Unicode, see A No Nonsense Guide to Unicode

[3] The reason for this alias is down to semantics: byte clearly expresses the intention that the corresponding integer value is used to represent a character and not a standard 8-bit, unsigned integer.

[4] An escape sequence is a sequence of characters that are used to represent characters that would otherwise be difficult to textually represent including newlines, whitespace characters etc. Adding one or more escape sequences to a string is known as escaping.

[5] A string is actually a byte sequence wrapper. The implementation looks something like:

type string struct {
   elements *byte //underlying bytes
   len int // number of bytes
}