Learn With Me: Elixir - Strings (#9)

As we discussed in the post on data types, a string in Elixir is actually a binary containing the bytes of a UTF-8-encoded string.

iex> is_binary("Speedy Taco")
true

However, you can still think of it as a string type. You won't know any differently unless you decide to take a really close look. All the things you can do with a string can be done without even knowing that it is really a binary. I've noticed that this way of thinking about data types is a common thing in Elixir.

Digression Time - Unicode and Encodings

I won't go into all the details of Unicode and the UTF encodings, but I will give a bit of a summary for those who are interested. Unicode consists of a huge number of characters, from Latin script to Cyrillic to Chinese characters, to more obscure scripts, and even a lot of graphical characters such as emojis.

Each Unicode character is associated with a number, which is called a code point. Possible code points range from 0 up to 0x10FFFF, although currently everything above 0x2FFFF is mostly unassigned. The first 128 code points correspond one-to-one with basic ASCII, which keeps Unicode compatible with pre-Unicode ASCII text.
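To make the idea of a code point concrete, here is a small sketch using Elixir's `?` character syntax and `String.to_charlist/1`. The variable names are only for illustration.

```elixir
# ?x evaluates to the code point of the character x
latin_a = ?a                 # 97, the same value as in ASCII
cyrillic_zhe = ?ж            # 1078 (0x0436)

# A charlist is a list of code points, one integer per character
code_points = String.to_charlist("aж")   # [97, 1078]

# <<cp::utf8>> goes the other way: from a code point to a UTF-8 string
back_again = <<1078::utf8>>  # "ж"
```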

Encoding refers to the method we use to store those code points. The easiest way to encode Unicode code points is to just use 32 bits for each character. A simple array of 32-bit characters will store any code point in the current Unicode standard. This encoding is known as UTF-32, because each code unit (the fixed-size chunk of bits an encoding uses to store code points) is 32 bits.

UTF-32 is almost never used because this encoding is extremely wasteful. Most code points in common use require far less memory to store. A simple document consisting of nothing more than standard Latin characters, like this post, would be 4x smaller in simple ASCII than in UTF-32.
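The 4x difference is easy to check from Elixir. This is a rough sketch that uses Erlang's `:unicode.characters_to_binary/3` to re-encode a string as UTF-32. The sample text is arbitrary.

```elixir
text = "Speedy Taco"

# Elixir strings are UTF-8, so plain ASCII text takes one byte per character
utf8_size = byte_size(text)            # 11

# Re-encode the same text as UTF-32 (big-endian, no byte order mark)
utf32 = :unicode.characters_to_binary(text, :utf8, :utf32)
utf32_size = byte_size(utf32)          # 44
```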

Most of the text being stored is toward the lower end of the code point range, so more efficient encodings are typically used.

UTF-16 uses 16-bit code units (a code unit being the basic building block for an encoding) to represent a code point. A single 16-bit code unit will still store almost all commonly used characters, and in the rare instance when a really high-numbered code point is used, it will be stored in two 16-bit code units. This means that the number of code units that represents a character can be 1 or 2.
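Here is a small sketch of the 1-or-2 code unit behavior, again leaning on Erlang's `:unicode.characters_to_binary/3`. The two sample characters are my own picks: a plain letter, and an emoji whose code point is above 0xFFFF.

```elixir
basic = "a"
emoji = "\u{1F600}"   # grinning face emoji, code point 0x1F600

utf16_basic = :unicode.characters_to_binary(basic, :utf8, :utf16)
utf16_emoji = :unicode.characters_to_binary(emoji, :utf8, :utf16)

# Each UTF-16 code unit is 2 bytes, so divide the byte count by 2
basic_units = div(byte_size(utf16_basic), 2)   # 1
emoji_units = div(byte_size(utf16_emoji), 2)   # 2
```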

UTF-8 uses 8-bit code units to represent a code point. This is the most common encoding (and the one used by Elixir) because it's compact for common characters and it's backwards-compatible with ASCII. Each of the first 128 code points is stored in a single byte that is identical to its ASCII value, so anything encoded in ASCII (which is the vast majority of text prior to Unicode) is also valid UTF-8. UTF-8 takes between one and four 8-bit code units (each equal to a byte) to represent any Unicode code point, so the number of bytes per character is variable.
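The variable width shows up when comparing `byte_size/1` with `String.length/1`. In this sketch the four sample characters are my own choices, one for each possible UTF-8 length.

```elixir
# Each string holds exactly one character: a, é, €, and a grinning face emoji
samples = ["a", "\u00E9", "\u20AC", "\u{1F600}"]

sizes = Enum.map(samples, &byte_size/1)          # [1, 2, 3, 4]
lengths = Enum.map(samples, &String.length/1)    # [1, 1, 1, 1]
```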

UTF-8 requires more sophisticated code to examine than a simple array of ASCII bytes, since the code has to examine the bytes to figure out how many bytes each character consists of, but it's quite memory-efficient, especially considering the huge amount of text that only needs one byte to encode. However, that's all invisible: Elixir takes care of that for us. If you really want to, you can see the UTF-8 bytes directly by examining the binary data in a string.
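For anyone who does want to peek, here is one way to look at the underlying bytes. It's a sketch using `:binary.bin_to_list/1` and the `:binaries` option of `inspect/2`. The sample word is arbitrary.

```elixir
word = "h\u00E9llo"   # "héllo"

# :binary.bin_to_list/1 returns the raw bytes of a binary as a list
bytes = :binary.bin_to_list(word)
# [104, 195, 169, 108, 108, 111]

# inspect/2 can be told to print a string as a raw binary
raw = inspect(word, binaries: :as_binaries)
# "<<104, 195, 169, 108, 108, 111>>"
```

The é takes two bytes (195 and 169), while every other letter takes one.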

I find this a particularly interesting subject and many years ago I wrote a C++ library called UtfString (available on GitHub) to work with UTF-8 and UTF-16-encoded strings. I learned a whole lot about UTF-8 and UTF-16 encodings when writing that, so it was well worth doing for that alone.

String Literals

I hear that string handling in Elixir strongly resembles that of Ruby, which is no surprise, considering that the creator of Elixir, José Valim, was very active in the Ruby ecosystem.

A standard string literal in Elixir is written with double quotes. Single quotes mean something different (they create a charlist), so the two are not interchangeable.

iex> "This is a string"
"This is a string"

String Concatenation

Strings can be concatenated using the <> operator.

iex> "Chicken" <> "Foot"
"ChickenFoot"

String Sigils

String literals can also be specified using something called a "sigil". I don't quite understand exactly what sigils are yet, but they appear to be a way of specifying some kind of data literal in Elixir. Sigils always seem to start with a tilde character (~) followed by a letter. The string sigil is ~s(), so we can specify a string like "Captain Ahab" using this sigil syntax.

iex> "Captain Ahab"
"Captain Ahab"
iex> ~s(Captain Ahab)
"Captain Ahab"

I'm not sure why this sigil would be used instead of the "" notation. Perhaps it's so that it's a bit easier to put double quote characters in a string without escaping them.

iex> ~s(This "string" contains "double quotes")
"This \"string\" contains \"double quotes\""

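Here is a short sketch of a few other things the string sigil can do. The lowercase `~s` behaves like a double-quoted string, the uppercase `~S` leaves the text untouched, and delimiters other than parentheses are allowed. The variable names are just for illustration.

```elixir
name = "Ahab"

# Lowercase ~s allows interpolation and escapes, like a double-quoted string
interpolated = ~s(Captain "#{name}")
# the text: Captain "Ahab"

# Uppercase ~S takes the text literally: no interpolation, no escapes
literal = ~S(Captain "#{name}")
# the text, unchanged: Captain "#{name}"

# Other delimiters work too, which helps when the text contains parentheses
bracketed = ~s[call me (maybe)]
```
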
String Interpolation

String interpolation is the process of embedding an expression in a string literal, which will be evaluated at runtime. The result is integrated into a string. This is a convenient way of constructing a string while maintaining high readability.

JavaScript (ES6 and higher) has string interpolation, using template literals delimited by backticks.

`Next year, my dog will be ${dogAge + 1} years old`

C# also has string interpolation, although that's one of the language's more recent additions (C# 6).

$"Next year, my dog will be {dog.Age + 1} years old"

In Elixir, string interpolation can be accomplished using "#{}" notation within the string.

iex> name = "Bob"
"Bob"
iex> "Hello, my name is #{name}"
"Hello, my name is Bob"

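Any expression can be put between the braces, and its result will be converted to a string, as long as the value is of a type that knows how to convert itself to a string (numbers, atoms, strings, and lists do).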

iex> x = 4
4
iex> y = 5
5
iex> "The sum is #{x + y}"
"The sum is 9"

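I played with this a bit to see how data structures are converted to a string with interpolation. Lists are not displayed as list literals. Instead, each integer in the list is treated as a code point, so the result here contains the characters with code points 1, 2, and 3. Those are unprintable control characters, so IEx falls back to showing the whole string as raw bytes. Other data structures behave differently: tuples and maps can't be interpolated directly and raise an error.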

iex> list = [1, 2, 3]
[1, 2, 3]
iex> "The contents of the list are #{list}"
<<84, 104, 101, 32, 99, 111, 110, 116, 101, 110, 116, 115, 32, 111, 102,
  32, 116, 104, 101, 32, 108, 105, 115, 116, 32, 97, 114, 101, 32, 1, 2,
  3>>
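
To get the list literal into the string, one option is to wrap the value in `inspect/1`, which returns the Elixir-literal form of any value as a string. This is a minimal sketch. The charlist and tuple examples are my own additions.

```elixir
list = [1, 2, 3]

# inspect/1 turns the list into its literal form before interpolation
with_inspect = "The contents of the list are #{inspect(list)}"
# "The contents of the list are [1, 2, 3]"

# A list of printable code points interpolates as readable text,
# because the integers are treated as characters
charlist = [104, 105]
as_text = "The list says #{charlist}"
# "The list says hi"

# Tuples can't be interpolated directly, but inspect/1 handles them
tuple_text = "A tuple: #{inspect({:ok, 42})}"
# "A tuple: {:ok, 42}"
```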

Elixir String Functions

The Elixir String module contains functions to manipulate strings. We'll cover modules and the Elixir standard library in further posts, but here are a few examples.

iex> string = "This is a string"
"This is a string"
iex> String.length(string)
16
iex> String.codepoints(string)
["T", "h", "i", "s", " ", "i", "s", " ", "a", " ", "s", "t", "r", "i",
 "n", "g"]
iex> String.at(string, 3)
"s"
iex> String.upcase(string)
"THIS IS A STRING"

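A few more functions from the String module, written as a plain script rather than an IEx session. This is only a small sample, and the accented word at the end is my own example to show how `String.length/1` differs from `byte_size/1`.

```elixir
string = "This is a string"

words = String.split(string)
# ["This", "is", "a", "string"]

reversed = String.reverse(string)
# "gnirts a si sihT"

has_word = String.contains?(string, "string")
# true

swapped = String.replace(string, "string", "binary")
# "This is a binary"

# String.length/1 counts characters, byte_size/1 counts UTF-8 bytes
accented = "caf\u00E9"   # "café"
counts = {String.length(accented), byte_size(accented)}
# {4, 5}
```
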
Heredocs

Heredocs are multiline strings that are typically used for code documentation outside of a function, although it's possible to use them inside a function for specifying a multiline string literal. Newline characters within the context of a heredoc literal are made part of the string value. The triple quotes that begin and end the heredoc have to go on their own line.

iex> """
...> This is a multiline string
...> that goes on for multiple lines
...> """
"This is a multiline string\nthat goes on for multiple lines\n"

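Here is a sketch of both uses of heredocs together: documentation attributes outside a function and a multiline string inside one. The module `HeredocDemo` and its function are hypothetical names made up for this example.

```elixir
defmodule HeredocDemo do
  @moduledoc """
  A made-up module that exists only to show heredocs.
  """

  @doc """
  Returns a two-line banner.
  """
  def banner do
    # Indentation up to the column of the closing quotes is stripped
    """
    Line one
    Line two
    """
  end
end

banner = HeredocDemo.banner()
# "Line one\nLine two\n"
```
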
Kevin Peter
