Learn With Me: Elixir - Data Types (#6)

Like most languages, Elixir has data types. There aren't that many of them either. We'll be able to take advantage of the simpler data types immediately, but the data types with more complexity to them will require posts of their own later on.

The Basic Types

These data types are what I refer to as the basic data types. This is my terminology, not that of the Elixir language.

Integers

Integers in Elixir are similar to integers in any other language. They're whole numbers without any decimal component. An integer literal looks just like it does in pretty much every language I've ever seen: just the number.

iex> integer1 = 300
300
iex> integer2 = 4
4
iex> integer3 = -34
-34
iex> bigger_integer = 745426991
745426991

Unlike Javascript, there is no number type in Elixir, but rather separate types for integer and floating-point values. Unlike C#, there isn't a group of integer types with one type using more memory and holding larger values than another (byte, short, int, long, etc). An integer in Elixir can hold any arbitrary integer. There is no concept of a smallest possible value or a largest possible value, so enormously large integers can be specified in Elixir. As long as Elixir has memory to store the integer, it will be able to store it.

I imagine that working with very large numbers takes more memory and processing time than working with smaller numbers, but my experiences in playing around with IEx show that simple math with very large integers doesn't produce any delay that I can perceive. I'm sure there's a difference at the time scales that processors work on, but humans are too slow to see it.

iex> big_integer = 2409045930580398034859043850983827847329987883234
2409045930580398034859043850983827847329987883234
iex> another_big_integer = 8245789239840280948509343533503938584234
8245789239840280948509343533503938584234
iex> big_integer * another_big_integer
19864485012660862539728121685875365511253137949455637945916932944539055016611316465332756
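To convince myself that integers really do grow past the 64-bit limit I'm used to from C#, here's a small sketch. The variable names are my own, and it assumes nothing beyond the standard Enum module.

```elixir
# Largest value a signed 64-bit integer (a C# long) can hold.
max_int64 = 9_223_372_036_854_775_807

# Multiply 1 * 2 * ... * 25 to get 25 factorial.
factorial_25 = Enum.reduce(1..25, 1, fn n, acc -> n * acc end)

# The result is still an ordinary integer, even though it no longer fits in 64 bits.
still_integer = is_integer(factorial_25)
exceeds_int64 = factorial_25 > max_int64

IO.inspect({still_integer, exceeds_int64})
# prints {true, true}
```

No special "big integer" type or library call is involved; the same integer type is used the whole way through.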

To make numbers easier to read, you can use underscores to group digits in an integer literal. This is only to make the integer literal more readable to humans and has no meaning to the compiler.

iex> big_integer = 34_453_234_984
34453234984

The underscores can go between any two digits and can group digits in any way you like. Elixir doesn't care.

iex> big_integer = 45_4354_234_3
4543542343

Notice that the underscores are not saved as part of the integer value. IEx echoes the value without any underscores.

Elixir includes the is_integer function that you can use to determine if a value is an integer.

iex> is_integer(445)
true
iex> is_integer("445")
false
iex> is_integer(true)
false

Floats

A floating-point number in Elixir is represented by the float data type. A float is a double precision IEEE floating point value, equivalent to a double in C#. A float literal is just a number with a decimal portion, the same way we would specify it in C# or Javascript.

iex> decimal_number = 3.141
3.141
iex> decimal_number = -0.453434
-0.453434
iex> decimal_number = 1.0
1.0

A float works just like a double-precision floating point number would in any other language where it's implemented. That means that its precision is not infinite and it cannot store an unlimited number of digits the way the integer data type can.

iex> decimal_number = 23434242354576575634523249854.453878983947548738940835
2.3434242354576578e28

The fact that it's a floating point number means that it's subject to the occasional slight inaccuracy as is common among all floating point numbers.

The following subtraction should be -0.01, but it's slightly off due to the nature of floating point numbers.

iex> 86.24 - 86.25
-0.010000000000005116

Don't ever use floating point numbers for money. I wish Elixir had a fixed-point number type like decimal in C#. Hopefully, there's an Elixir library that will give us something like that.
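One common workaround, sketched here under my own assumptions, is to store money as a whole number of cents in an integer. The variable names are made up for illustration, and the tolerance value is an arbitrary choice.

```elixir
# Prices stored as whole cents (86.24 and 86.25) instead of floating-point dollars.
price_a_cents = 86_24
price_b_cents = 86_25
difference_cents = price_a_cents - price_b_cents

# The same subtraction with floats does not land exactly on -0.01.
float_is_exact = 86.24 - 86.25 == -0.01

# When floats are unavoidable, compare with a small tolerance instead of ==.
close_enough = abs(86.24 - 86.25 + 0.01) < 1.0e-9

IO.inspect({difference_cents, float_is_exact, close_enough})
# prints {-1, false, true}
```

Integer cents stay exact because integer arithmetic in Elixir never rounds.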

There's also an is_float function that you can use to determine if a value is a float:

iex> is_float(3.14)
true
iex> is_float(1)
false
iex> is_float(1.0)
true
iex> is_float("Bob")
false

Atoms

Atoms were the most difficult data type for me to understand in Elixir, which is strange because it's not a complex concept: it's just very different from anything I've seen before. It takes some time and some examples to get a sense of how to use an atom. I hear that Ruby has something similar, so Ruby developers should feel at ease here, but I've never seen an equivalent to this in any of the languages I've used. Symbols in Javascript (ES6 and later) sometimes play a similar role to atoms, but they are definitely not the same as atoms. It took me a little time to get to the point where I feel I could give a decent explanation.

Elixir defines an atom as "a constant whose name is its own value". This definition confused me when I first saw it, and didn't tell me much, so I'll try to explain it in different words. An atom is a unique value that is used for things like keys in data structures, result codes, and anywhere else where having a unique value would be useful.

An atom literal is a colon (:) followed by some text.

iex> result = :ok
:ok
iex> result = :error
:error
iex> key = :name
:name
iex> key = :age
:age
iex> long_atom = :this_is_a_long_atom
:this_is_a_long_atom

Atoms tend to be short and they represent a unique value that no other value will be equal to. An atom, once created, gets put into some kind of atom table in Elixir and never released. This means that you don't want to go crazy with creating atoms or do something like load atoms from dynamic data.

As a unique value, an atom can only be equal to itself. So :ok is only equal to :ok and :name is only equal to :name.

Atoms are unique to the entire runtime environment. This means that the atom :message will have the same value in one Elixir process that it will have in another. This is a good thing for cross-process communication.

One of my first questions in encountering atoms was "Why atoms? Why doesn't Elixir just use strings instead?". That wasn't immediately obvious from the initial atom explanations I read, so I had to do a bit more research into why this is.

First of all, atoms are more efficient than strings. Elixir, from what I've read, will actually map atoms to integers internally, making comparison quite efficient. Secondly, Elixir wants to make sure that an atom is a different type so that strings used for other purposes are not used for atom purposes. Another advantage to atoms over strings is that it is more efficient for a distributed system. Most data in Elixir is allocated by the process where it is created, but atoms are global to the entire virtual machine. Atoms sent to other machines are globally created on those machines as well.

The Erlang VM also has a concept of atoms, which are essential to the way it does concurrency, so it also made good sense for Elixir to have them as well.

If you're familiar with C# or C++ (perhaps Java as well?), you may notice some similarities between an atom and an enum value. Those are also mapped to a primitive data type (usually int) under the surface and each individual enumeration value is unique in the code. C++ actually treats them openly like the underlying data type, whereas C# considers them to be their own data type. Atoms, however, are all of the same data type (the atom) and not an enumeration data type. They are also not namespaced in any way, which is by design. Elixir wants them to be available to all parts of a distributed system without any additional work.

Be aware that atoms are never destroyed once created, and the number of atoms is limited: by default, the table of atoms in an Erlang VM can only store up to 1,048,576 atoms. That's plenty of space for atoms defined in the source code, assuming atoms are being used as intended.

The material I've read regarding atoms strongly recommends not creating atoms dynamically based on external data being loaded into a data structure. For example, you might be creating a set of key-value entries that is loaded from a file or the network, with the key being an atom and the value being some other type of data. That's a common practice in Elixir source code, but in source code, atom creation generally happens at a very small scale. If you're loading them from an external data source, the number of allocated atoms can climb quickly.

If you're loading some JSON, for example, a lot of atoms may be created based on the keys in a JSON object. Those atoms will never be released as long as the VM is running, and you may eventually experience atom exhaustion as JSON is loaded over and over during the lifetime of the system. I'm not sure what happens when the storage for atoms is exhausted, but I suspect that would be pretty bad for the entire VM. Instead, it's recommended that you use strings for keys in this context. Strings get cleaned up by the garbage collector.
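Here's a minimal sketch of one way to stay safe with outside data, assuming the standard String.to_existing_atom/1 function, which only succeeds for atoms that already exist. The SafeKeys module, the to_known_atom function, and the allowed_keys list are hypothetical names of my own.

```elixir
defmodule SafeKeys do
  # Hypothetical helper: converts a string key to an atom only when that
  # atom already exists, so outside data cannot grow the atom table.
  def to_known_atom(key) when is_binary(key) do
    {:ok, String.to_existing_atom(key)}
  rescue
    ArgumentError -> {:error, :unknown_key}
  end
end

# Writing these atoms in source code is what creates them.
allowed_keys = [:name, :age]

# "name" matches an atom that already exists.
known = SafeKeys.to_known_atom("name")

# This string was never written as an atom, so it is rejected.
unknown = SafeKeys.to_known_atom("key_that_was_never_an_atom_7f3a9")

IO.inspect({allowed_keys, known, unknown})
```

The simpler alternative is the one recommended above: leave the keys as strings and never convert them at all.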

Atoms should only be created by being defined and used in source code.

It's common to have atoms not only represent keys in a data structure, but also serve as unique values such as result codes being returned from a function. It's a convention in Elixir for some functions (usually ones that access some kind of external resource) to return both the result of the operation as an atom (:ok or :error) and the data that resulted from the operation.
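To make that convention concrete, here's a small sketch. The ResultDemo module and parse_age function are hypothetical names I made up, and the :invalid_age reason is my own choice. It leans on Integer.parse/1 from the standard library, plus tuples and pattern matching, which get proper coverage later.

```elixir
defmodule ResultDemo do
  # Hypothetical function: returns {:ok, age} on success
  # or {:error, reason} when the text is not a non-negative whole number.
  def parse_age(text) do
    case Integer.parse(text) do
      {age, ""} when age >= 0 -> {:ok, age}
      _ -> {:error, :invalid_age}
    end
  end
end

good = ResultDemo.parse_age("42")
bad = ResultDemo.parse_age("forty-two")

IO.inspect({good, bad})
# prints {{:ok, 42}, {:error, :invalid_age}}
```

The caller can check the first element of the tuple for :ok or :error before touching the data in the second element.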

As I looked over more example code and Elixir videos trying to get a sense of atoms, I started to develop a good sense of where to use them. I still find them difficult to explain to someone who has never seen one before and has never seen them used. So if you haven't quite got a firm grasp on what atoms are used for yet, which is quite understandable, I think you'll get a better idea as we go along. I sure did, and I'm sure my understanding will continue to improve over time.

Like with the other data types we've seen, Elixir has an is_atom function that you can use to determine if a value is an atom.

iex> is_atom(:ok)  
true               
iex> is_atom("ok") 
false              
iex> is_atom(3)    
false              
iex> is_atom(:name)
true               

Functions

With Elixir being a functional language, functions are a data type as well. Functions can be created, assigned to variables, and passed to other functions just like any other data.

Functions are a big topic, so we'll go into what functions look like and how they work in a future post.

Like the other data types, there's also an is_function function to determine if something is a function. I'm not going to provide examples of it at this point because we don't yet know enough about how functions work in order to use it.

The Derived Types

These data types are what I refer to as the derived data types. This is my terminology, not that of the Elixir language. Derived data types are treated in many ways like their own data types in Elixir, but are actually specialized instances of other data types. This will become more clear as we go through them.

Strings

A string, which we've already seen from previous examples, is specified using a string literal.

iex> name = "Bob"          
"Bob"                      
iex> fruit = "Apple"       
"Apple"                    
iex> invader = "Zim"       
"Zim"                      

A string literal is the same as in C# and Javascript (although a single quote cannot be used in place of the double quote), and indeed many other languages. It's a string of text surrounded by double quotes.

Unlike most languages, where a string is its own data type and its underlying implementation is completely hidden by the language, Elixir exposes its underlying implementation and makes that part of the language.

A double-quoted Elixir string is actually encoded as UTF-8 and the bytes of the string are stored as a binary data type. So when you're working with a string, you're actually working with a binary. A binary is a data type that stores a sequence of bytes. We'll cover it some more in the Collection Types section below.

Elixir has so much string-specific functionality in its syntax and libraries that it's quite possible to use strings without ever thinking of them as binaries. However, you can use the binary syntax and related functions to work with a string as though it were a binary as well, if you decided you wanted to do that.

Many languages actually do something very similar: they encode strings as a sequence of bytes in some encoding (UTF-16 in C# and Javascript, for example) and store them, but that's completely abstracted away from the programmer. Unlike most languages, Elixir actually exposes the underlying binary data and allows you to manipulate a string as a string or as a binary.

There's much more depth to strings than this, and I haven't explored all of it yet. We'll explore strings in more detail in a future post.

Annoyingly, Elixir does not have an is_string function. There is only the is_binary function, which we can use to demonstrate that a string is also a binary.

iex> string = "This is a string"
"This is a string"
iex> is_binary(string)
true
iex> is_string(string)
** (CompileError) iex:17: undefined function is_string/1

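Not every binary is a string, though. To count as a string, the bytes in a binary have to be valid UTF-8. We can get a rough idea of the difference from how IEx displays binary values. If a binary is valid UTF-8 and contains only printable characters, IEx will display it like a string. If not, IEx will display it like a binary. (The second example below is technically valid UTF-8, but its bytes are non-printable control characters, so IEx shows it as a binary.)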

iex> binary = <<51, 52, 53, 54>>
"3456"
iex> binary = <<0, 1, 2, 3>>
<<0, 1, 2, 3>>

I have no idea how you would programmatically distinguish a string from any other binary without an is_string function. I guess it is rarely necessary to do so, or I would think Elixir would already have such a function.
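One option that looks like it fills this gap is the standard library's String.valid?/1, which checks whether a binary is valid UTF-8, along with String.printable?/1, which checks whether it contains only printable characters. This is a minimal sketch; the variable names are my own.

```elixir
text = "3456"
invalid_bytes = <<255, 254>>
control_bytes = <<0, 1, 2, 3>>

results = %{
  # Ordinary text is valid UTF-8.
  text_valid: String.valid?(text),
  # The byte 255 can never appear in valid UTF-8.
  invalid_valid: String.valid?(invalid_bytes),
  # Control characters are valid UTF-8, just not printable.
  control_valid: String.valid?(control_bytes),
  control_printable: String.printable?(control_bytes)
}

IO.inspect(results)
```

All three values still return true from is_binary, so these checks are an extra step on top of it rather than a replacement.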

Character Lists

Character lists are a special kind of list that is used to store characters. A character list literal is a string of text surrounded by single quotes. The primary use of character lists is to communicate with Erlang functions, which use character lists instead of Elixir strings. We'll show an example of a character list here, but we won't go into detail until a future post.

iex> text = 'This is a character list'
'This is a character list'

We'll introduce lists in the Collection Types section below.
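As a quick sketch of the "special kind of list" idea: a character list is a list of integer code points. I've used the ~c sigil here, which as far as I know is another way to write the same literal, along with String.to_charlist/1 and List.to_string/1 from the standard library. The variable names are my own.

```elixir
chars = ~c"abc"

# A character list is an ordinary list...
is_a_list = is_list(chars)

# ...whose elements are the code points of each character.
same_as_code_points = chars == [97, 98, 99]

# Converting between strings and character lists.
from_string = String.to_charlist("abc")
back_to_string = List.to_string(chars)

IO.inspect({is_a_list, same_as_code_points, from_string == chars, back_to_string})
# prints {true, true, true, "abc"}
```

The conversions are what you would reach for when passing text to an Erlang function and turning its result back into an Elixir string.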

Booleans

Booleans are normally their own type in most programming languages. That appears to be the case in Elixir as well at first glance. Indeed, a programmer could use booleans in Elixir without ever knowing otherwise. Yet, Elixir stores them differently under the surface.

A boolean can only have two values: true and false.

iex> value = true    
true                 
iex> value = false   
false 

There's even an is_boolean function to determine if a value is a boolean value or not.

iex> is_boolean(true)
true
iex> is_boolean(false)
true
iex> is_boolean("true")
false
iex> is_boolean(5)
false

This is all you really need to know about how to use booleans. They are treated like native booleans throughout the language just like any boolean value in C# or Javascript. You could stop reading here and be just fine.

However, Elixir actually stores boolean values as an atom. The value true corresponds to the atom :true and false corresponds to the atom :false. The boolean values are just aliases for atoms. So a boolean in Elixir is just a special type of atom where the atom is :true or :false. Since atoms are unique values, and data in Elixir is immutable, every variable with a boolean value actually points to one of two atoms in the atom table: nice and easy and efficient.

All booleans are a type of atom, but only two atoms are booleans (:true and :false).

A value of :true is an atom and a boolean:

iex> value = :true                        
true                                      
iex> is_boolean(value)                    
true                                      
iex> is_atom(value)                       
true                                      

The atom :false is a boolean value, but the atom :ok is not.

iex> is_boolean(:false)                   
true                                      
iex> is_boolean(:ok)                      
false                                     

The boolean value true is also an atom.

iex> is_atom(true)                        
true 

Ranges

A range in Elixir represents a range of integers. A range literal takes the form "[lower bound]..[upper bound]", where the bounds are always inclusive. So a range from 1 to 100 is represented as 1..100.

iex> range = 1..100
1..100
iex> range = -10..10
-10..10
iex> range = 25..343
25..343

A range isn't truly its own data type, but is represented by a data structure (a struct called Range) under the surface. As with strings, we can get at that underlying structure: the bounds are available as range.first and range.last. A range is enumerable, so any functions that deal with enumerable data can deal with a range. From what I can tell at this point, it's pretty much only used for enumeration purposes. It's a lot easier to create than a collection of all the integers in that range, and it's far more memory-efficient, since it just has to store the lower and upper bounds.

From what I understand, not all Elixir developers can agree on whether a range should be considered a data type. I consider a range a data type only because Elixir has a distinct literal syntax for a range.

Elixir does not have an is_range function.
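Here's a short sketch of what working with a range can look like, using functions from the Enum module. Checking for a range by pattern matching on %Range{} is my own workaround for the missing is_range function, not something I've seen recommended officially.

```elixir
range = 1..5

as_list = Enum.to_list(range)        # [1, 2, 3, 4, 5]
total = Enum.sum(1..100)             # 5050
contains_three = 3 in range          # true

# The bounds can be read straight off the range.
bounds = {range.first, range.last}   # {1, 5}

# A stand-in for is_range: match against the Range struct.
is_a_range = match?(%Range{}, range)      # true
not_a_range = match?(%Range{}, [1, 2, 3]) # false

IO.inspect({as_list, total, contains_three, bounds, is_a_range, not_a_range})
```

Note that Enum.sum(1..100) never needs a list of 100 integers to exist first; the range itself is enough.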

Regular Expressions

Like a range, a regular expression has its own distinct literal in the Elixir language that corresponds to a data structure (a Regex struct) under the surface. Like ranges, not all Elixir developers agree on whether a regular expression can be considered a data type. I do, but only because it has its own distinct literal syntax.

Like many developers, I use regular expressions only on occasion, and when I do, I often have to go through a refresher on the regex syntax. So I usually will just learn about regular expressions in a language at a high level, diving into greater detail only when I need to. So I'll just be giving a high-level overview here, and I probably won't go into any more detail in the future unless I need to.

There's not that much to it, really. I just need to know how to create a regular expression and invoke the regex matching functionality.

I think regular expressions are useful and very powerful, but I don't find much occasion to use them in the sorts of systems I usually work on. I bet I'd suddenly become a regex expert if I had to do a lot of text matching, but that hasn't happened.

Elixir regular expressions are Perl-compatible (PCRE). The Elixir regular expression functionality is built on top of the Erlang regular expression functionality, so regular expressions are very similar in both languages.

An Elixir regular expression literal is in the form "~r/[expression]/[modifiers]", with the "~r" indicating that what follows is a regular expression.

This regular expression looks for one or more digits followed by a "." followed by zero or more digits. There are no begin or end anchors in the expression, so this pattern can occur in the middle of the text as well.

iex> regex = ~r/[0-9]+\.[0-9]*/
~r/[0-9]+\.[0-9]*/
iex> Regex.match?(regex, "0")
false
iex> Regex.match?(regex, "0A")
false
iex> Regex.match?(regex, "0.")
true
iex> Regex.match?(regex, "0.4")
true
iex> Regex.match?(regex, "1230.4653")
true
iex> Regex.match?(regex, "1230.4e")
true
iex> Regex.match?(regex, "1230.e4")
true
iex> Regex.match?(regex, "a1230.e4")
true
iex> Regex.match?(regex, "a1230a.e4")
false

I won't be going into the modifiers here or all the possible regex functions, but if you're interested, you can take a look at the possible modifiers in the Elixir regex documentation.
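As a small taste, here's a sketch that reuses the pattern from above, adds an anchored version of it, and tries the i (case-insensitive) modifier. It also uses Regex.run/2, which returns the matched text instead of just true or false. The variable names are my own.

```elixir
regex = ~r/[0-9]+\.[0-9]*/
anchored = ~r/^[0-9]+\.[0-9]*$/

# Regex.run returns a list with the matched text, or nil when nothing matches.
first_match = Regex.run(regex, "a1230.e4")          # ["1230."]
no_match = Regex.run(regex, "0")                    # nil

# With begin and end anchors, the whole text has to fit the pattern.
unanchored_hit = Regex.match?(regex, "a1230.e4")    # true
anchored_hit = Regex.match?(anchored, "a1230.e4")   # false
anchored_ok = Regex.match?(anchored, "0.4")         # true

# The i modifier makes the match case-insensitive.
case_insensitive = Regex.match?(~r/bob/i, "BOB")    # true

IO.inspect({first_match, no_match, unanchored_hit, anchored_hit, anchored_ok, case_insensitive})
```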

Elixir does not have an is_regex function.

Collection Types

Collection types are data types that are containers for collections of data.

  • Tuples: store an ordered, randomly-accessible collection of data, not suited for iteration. A tuple literal uses curly braces and looks like {4, "Jim", -3.2}
  • Lists: store an ordered, sequentially-accessible collection of data, well-suited for iteration. A list literal uses brackets and looks like [1, 2, 3, "Bob", :square]
  • Maps: store an unordered collection of key-value pairs like a Dictionary<TKey, TValue> in C# or an object in Javascript. A map literal uses a percent sign and curly braces containing key-value pairs and looks like %{name: "Bob", age: 8}
  • Structs: allow predefined data structures with properties and values to be created, which can consist of a combination of any data type
  • Sets: store a unique set of elements, meaning that no element can be repeated in the set. We can apply functions from set theory such as unions, intersections, and differences.
  • Keyword Lists: can store multiple key-value pairs for the same key. Keyword lists are actually derived from the list data type, but are accessed like a map. So they have an interface like a map, but performance characteristics of a list.
  • Binaries/Bitstrings: store a blob of bits or bytes, useful for binary data. A binary/bit string literal uses double angle brackets and looks like <<13, 4, 128, 41>>
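To give a feel for a few of these before the detailed posts, here's a minimal sketch that uses the literals from the list above. I'm assuming MapSet is the set type the list refers to; the variable names and the color example are my own.

```elixir
# Tuple: read an element by its zero-based position.
tuple = {4, "Jim", -3.2}
second = elem(tuple, 1)                 # "Jim"

# List: take the head and count the elements.
list = [1, 2, 3, "Bob", :square]
first = hd(list)                        # 1
count = length(list)                    # 5

# Map: look up values by key.
map = %{name: "Bob", age: 8}
name = map.name                         # "Bob"
age = map[:age]                         # 8

# Keyword list: the same key can appear more than once,
# and underneath it is a list of two-element tuples.
keywords = [color: "red", color: "blue"]
all_colors = Keyword.get_values(keywords, :color)   # ["red", "blue"]
same_as_tuples = keywords == [{:color, "red"}, {:color, "blue"}]

# Set: duplicates are dropped, and set operations are available.
set_size = MapSet.size(MapSet.new([1, 2, 2, 3]))    # 3
union = MapSet.union(MapSet.new([1, 2]), MapSet.new([2, 3]))
union_ok = union == MapSet.new([1, 2, 3])

IO.inspect({second, first, count, name, age, all_colors, same_as_tuples, set_size, union_ok})
```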

I suspect that binaries and bitstrings are stored the same way in the internals of Elixir, but I don't know that for sure. A binary provides access to binary data at the byte level and a bitstring provides access to binary data at the bit level. They're so similar in nature that I lumped them together.
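A quick experiment with the built-in is_binary and is_bitstring functions hints at how the two relate: every binary passes the bitstring check, but a bitstring whose length is not a whole number of bytes fails the binary check. The variable names are my own.

```elixir
bytes = <<13, 4, 128, 41>>

# A bitstring that is only 3 bits long (the value 5 stored in 3 bits).
three_bits = <<5::size(3)>>

byte_checks = {is_binary(bytes), is_bitstring(bytes)}             # {true, true}
bit_checks = {is_binary(three_bits), is_bitstring(three_bits)}    # {false, true}
sizes = {byte_size(bytes), bit_size(bytes), bit_size(three_bits)} # {4, 32, 3}

IO.inspect({byte_checks, bit_checks, sizes})
```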

We'll be covering these collection types in much more detail in future posts, so I'm just providing a summary of them here so you'll know what they are and what they look like.

Special Data Types

There are some special data types that Elixir uses when dealing with processes.

  • Port
  • Reference
  • Pid

I have no idea at this point in time what these data types are all about, but judging by their names, I'm guessing they have something to do with interprocess communication. That's all advanced stuff, so we'll probably learn about them way down the road.
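Even without understanding them yet, it's easy to get hold of two of these values and check their types. This sketch assumes only the built-in self, make_ref, is_pid, and is_reference functions; the variable names are my own, and I'm leaving ports alone for now.

```elixir
# self() returns the pid of the process running this code.
current = self()

# make_ref() creates a new reference value.
ref = make_ref()

# Each call to make_ref() produces a different reference.
checks = {is_pid(current), is_reference(ref), make_ref() == ref}

IO.inspect(checks)
# prints {true, true, false}
```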