Types as Bits

There are all sorts of types in Elm:

  • Bool
  • Int
  • String
  • (Int, Int)
  • Maybe Int
  • ...

We have a conceptual understanding of them by now, but how are they understood by a computer? How is Maybe Int stored on a hard disk?

Bits

A bit is little box that has two states. Zero or a one. On or off. Computer memory is one super long sequence of bits.

Okay, so all we have is a bunch of bits. Now we need to represent everything with that!

Bool

A Bool value can be either True or False. This corresponds exactly to a bit!

Int

An Int value is some whole number like 0, 1, 2, etc. You cannot fit that in a single bit, so the only other option is to use multiple bits. So normally, an Int would be a sequence of bits, like these:

00000000
00000001
00000010
00000011
...

We can arbitrarily assign meaning to each of these sequences. So maybe 00000000 is zero and 00000001 is one. Great! We can just start assigning numbers to bit sequences in ascending order. But eventually we will run out of bits...

By some quick math, eight bits only allow (2^8 = 256) numbers. What about perfectly reasonable numbers like 9000 and 8004?

The answer is to just add more bits. For a long time, people used 32 bits. That allowed for (2^32 = 4,294,967,296) numbers which covers the kinds of numbers humans typically think about. Computers these days support 64-bit integers, allowing for (2^64 = 18,446,744,073,709,551,616) numbers. That is a lot!

Note: If you are curious how addition works, learn about two’s complement. It reveals that numbers are not assigned to bit sequences arbitrarily. For the sake of making addition as fast as possible, this particular way of assigning numbers works really well.

String

The string "abc" is the sequence of characters a b c, so we will start by trying to represent characters as bits.

One of the early ways of encoding characters is called ASCII. Just like with integers, they decided to list out a bunch of bit sequences and start assigning values arbitrarily:

00000000
00000001
00000010
00000011
...

So every character needed to fit in eight bits, meaning only 256 characters could be represented! But if you only care about English, this actually works out pretty well. You need to cover 26 lower-case letters, 26 upper-case letters, and 10 numbers. That is 62. There is a bunch of room left for symbols and other weird stuff. You can see what they ended up with here.

We have an idea for characters now, but how will the computer know where the String ends and the next piece of data begins? It is all just bits. Characters look just like Int values really! So we need some way to mark the end.

These days, languages tend to do this by storing the length of the string. So a string like "hello" might look something like 5 h e l l o in memory. So you know a String always starts with 32-bits representing the length. And whether the length is 0 or 9000, you know exactly where the characters end.

Note: At some point, folks wanted to cover languages besides English. This effort eventually resulted in the UTF-8 encoding. It is quite brilliant really, and I encourage you to learn about it. It turns out that “get the 5th character” is harder than it sounds!

(Int, Int)

What about tuples? Well, (Int, Int) is two Int values, and each one is a sequence of bits. Let’s just put those two sequences next to each other in memory and call it a day!

Custom Types

A custom type is all about combining different types. Those different types may have all sorts of different shapes. We will start with the Color type:

type Color = Red | Yellow | Green

We can assign each case a number: Red = 0, Yellow = 1, and Green = 2. Now we can use the Int representation. Here we only need two bits to cover all the possible cases, so 00 is red, 01 is yellow, 10 is green, and 11 is unused.

But what about custom types that hold additional data? Like Maybe Int? The typical approach is to set aside some bits to “tag” the data, so we can decide that Nothing = 0 and Just = 1. Here are some examples:

  • Nothing = 0
  • Just 12 = 1 00001100
  • Just 16 = 1 00010000

A case expression always looks at that “tag” before deciding what to do next. If it sees a 0 it knows there is no more data. If it sees a 1 it knows it is followed by a sequence of bits representing the data.

This “tag” idea is similar to putting the length at the beginning of String values. The values may be different sizes, but the code can always figure out where they start and end.

Summary

Eventually, all values need to be represented in bits. This page gives a rough overview of how that actually works.

Normally there is no real reason to think about this, but I found it to be helpful in deepening my understanding of custom types and case expressions. I hope it is helpful to you as well!

Note: If you think this is interesting, it may be fun to learn more about garbage collection. I have found The Garbage Collection Handbook to be an excellent resource on the topic!

results matching ""

    No results matching ""