Implementing Huffman algorithm in Rust 🤔



Well, in the beginning I have to confess that I thought that this task will be easy enough to complete it within few hours max. It turned out that I spent a way more. Really a lot. Enough to say that I discovered what is all this hassle with "fighting borrow checker" about.

What was my goal?

My goal was to implement Huffman coding algorithm as in this example: https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html

The idea behind the Huffman algorithm is that most common characters will be represented with lower number of bits and those less frequent will need more bits. This feature makes this algorithm a useful tool for compression of data.

I wanted to encode text composed of just four letters: "A", "B", "C", "D".

The code

Let's see the probabilities of occurrence of each letter. I decided to hardcode the values, because I didn't want to complicate things more than needed.

1 Node {symbol: "A", prob: 0.4, left: None, right: None},
2 Node {symbol: "B", prob: 0.3, left: None, right: None},
3 Node {symbol: "C", prob: 0.2, left: None, right: None},
4 Node {symbol: "D", prob: 0.1, left: None, right: None},

The sum of all probabilities (prob in my code) should add up to 1.0, because no other letter is expected in the text.

Before I explain what is the Node in the Rust code, let's look how the algorithm works. The algorithm uses the binary tree data structure. We have some data - pair of symbol and probability of its occurence, so let's build a tree.

Here's my Node struct:

1 pub struct Node<'z> {
2 pub symbol: &'z str,
3 pub prob: f32,
4 pub left: Option<Box<Node<'z>>>
5 pub right: Option<Box<Node<'z>>>
6 }
7

There's a lot of stuff happening here, but fortunately it will help us understand some of Rust's core features.


  • My struct is public, so it can be reachable in other closures of the code. Also all of its fields are public, so they can be accessed from outside. Actually I didn't know that you can do something like this.

  • Left and right fields represent the children of the node. When you design a graph, tree, linked list, you willingly use some cyclic data structure so: pub left: Node.
    Unfortunately, Rust won't allow you to do this and produce an error like this:

    1 "recursive type `Node` has infinite size"

    Well, what does it mean?


    Memory management in Rust is kind of local. In contrary to most languages, Rust doesn't have a garbage collector mechanism - it keeps the data in the structure called stack, (it's very fast!) and removes allocated memory immediately after a function or other closure is executed. Putting a Node as a field inside itself would create a recursive relation, which would use infinite amount of memory. We don't want to do this.
    That's why I used a Box, which places the field in the heap memory structure, which may be slower, but will allow recursive relations.


  • Option marks a field optional, so it can be either Some or None. However, this will force us in the future check if there is something inside using match or unwrap() function.

  • Last but not least - 'z thing. It's called a lifetime parameter and it's quite common to see a lot of those in Rust source code. I have chosen "z" letter, but it can be any other: a, b, c and so on. Lifetime parameter tells the compiler how long the variable will live inside the memory. Two variables marked with the same lifetime parameter will have the same life span. In the context of my code, the lifetime parameter have to be the same for the struct Node, its children and implementations. I won't tell more, because this is not perfectly clear for me why this is required in the case of recursive structures.


Let's go step by step through the algorithm


1) First, we sort nodes by the probability


1 nodes.sort_by(|a, b| b.prob.partial_cmp(&a.prob;).unwrap());

Why partial_cmp()? Can't we just sort_by? NO, because basic cmp() on float is implemented in such fashion, that it will not handle a NaN in the value when the NaN may occur. Why creators of other programming languages didn't bother to cover this? In Python it's straightforward:

1 sorted([3, 4, 1, 2, float('nan')])

This above will just work, but may lead to some unexpected results. In Rust however the NaN should be taken into consideration, so the different method must be used. Yeah, there is a lot of MUST in Rust. Just see all the flame here: https://news.ycombinator.com/item?id=9089112 😲

2) Build the tree

We pop the two nodes with the least probability and create a new node, which will hold an empty ("*") letter and sum of its children probability.

 1 pub fn create_branch(nodes: &mut; Vec<node>) {
2
3 /// Create new branch whit two lowest nodes as (left and right) children.
4
5 /// Sort by probability and pop two lowest.
6 nodes.sort_by(|a, b| b.prob.partial_cmp(&a.prob;).unwrap());
7
8 let first = match nodes.pop() {
9 Some(i) => i,
10 None => return,
11 };
12
13 let second: Node = match nodes.pop() {
14 Some(i) => i,
15 None => return,
16 };
17
18 /// Create new node and push it to the vector.
19 let new_node = Node {
20 symbol: "*",
21 prob: first.prob + second.prob,
22 left: Some(Box::new(first)),
23 right: Some(Box::new(second)),
24 };
25
26 nodes.push(new_node);
27
28 }
29

We repeat this process until there is only one node left. When we finish, our tree looks like this:

huffman tree

Yay.

3) Traverse the tree

The node that was left is the *root* of the tree. We pick it and start traversing the tree using the simple queue algorithm.
  • push pair (root, empty vector) in the queue.
  • take item from queue and check if it has left or right child, then push it in the queue with following rules:
    • if it was right then push (left child, parent vector + 1)
    • if it was left then push (right child, parent vector + 0)
  • repeat until there is nothing left in the queue

 1 /// Loop until you reach all nodes that don't have left or right children.
2 while !stack.is_empty() {
3
4 let (node, codes) = stack.pop().unwrap();
5 let copied_node = node.clone();
6 let copied_codes = codes.clone();
7
8 /// Push left or right child if node has reference to it.
9 /// Add copied vector of codes with 1 or 0 at the end for right/left.
10
11 result.push((copied_node, copied_codes));
12
13 if let Some(box ref nod) = node.right {
14 let mut new_codes = codes.clone();
15 new_codes.push(1);
16 stack.push((nod, new_codes));
17 }
18
19 if let Some(box ref nod) = node.left {
20 let mut new_codes = codes.clone();
21 new_codes.push(0);
22 stack.push((nod, new_codes));
23 }
24
25 }

Note that I'm using if let expression for unwrapping the node from Option and then from Box. If let does the same thing as match, but in my case I don't need to cover all cases. If child node (left or right) is present (Some), then do something, otherwise don't bother.

I'm using the box notation, which is available in my version of Rust (rustc 1.19.0-nightly), but won't compile on the stable release.
The box ref means that we are taking the immutable reference to that what is contained in the Box. The reference is then pushed in the queue.

4) Encode the text!

1 let result: Vec<String> = text_vec.iter()
2 .map(|elem| encode_char(elem, &dictionary;))
3 .collect();

The text_vec is the list of characters from input and &dictionary; is a HashMap that we created using data from previous step.
I encode characters by iterating the vector. For each element I get the corresponding byte array and then collect() (which means that the result will be coerced to the given type).

That's it. The result of my struggle with borrow checker is following:

1 [sir@Utnapisztim huffman_coding]$ ./target/debug/huffman_coding
2 Type your text to encode.
3 AABBCCDD
4 Your text: AABBCCDD
5
6 Result: ["0", "0", "11", "11", "101", "101", "100", "100"]


Failed attempts

In fact I had a lot of problems with borrow checker complaining about moved values. The problem was: "How to traverse the tree from the root to the last leaf, marking the side which we turned each time?" Here's my failed attempt:

 1 fn go_down(node: Box&ltNode;>, direction: &str;) {
2
3 println!("{}, {}", node.symbol, node.prob);
4 match node.left {
5 Some(left) => go_down(left, "left"),
6 None => println!("nothin' on left found"),
7 }
8
9 match node.right {
10 Some(right) => go_down(right, "right"),
11 None => println!("nothin' on right found"),
12 }
13 }

I thought it was a decent idea to use recursion, because it seemed so suitable in this situation. Unfortunately:

 1 error[E0382]: use of collaterally moved value: `node.right`
2 --> 2huffman.rs:58:11
3 |
4 54 | Some(left) => go_down(left, "left"),
5 | ---- value moved here
6 ...
7 58 | match node.right {
8 | ^^^^^^^^^^ value used here after move
9 |
10 = note: move occurs because `(node.left:std::prelude::v1::Some).0` has type `std::boxed::Box&ltNode;<'_>>`, which does not implement the `Copy` trait
11
12 error[E0382]: use of moved value: `(node.right:std::prelude::v1::Some).0`
13 --> 2huffman.rs:59:14
14 |
15 54 | Some(left) => go_down(left, "left"),
16 | ---- value moved here
17 ...
18 59 | Some(right) => go_down(right, "right"),
19 | ^^^^^ value used here after move
20 |
21 = note: move occurs because `(node.left:std::prelude::v1::Some).0` has type `std::boxed::Box&ltNode;<'_>>`, which does not implement the `Copy` trait
22
23 error: aborting due to previous error(s)

I didn't understand where that move occured and spent a lot of time on trying different solutions. Normally, in most programming languages (at least imperative ones) you'd suspect, that traversing the binary tree using the recursion would somehow work, just by passing the node.left or node.right each time you go_down(). In Rust you cannot do this for some reason. I guess because when you go down a level using match, you are having the same value on two levels at the same time: in the match closure on the top and in the called function on the bottom. Or the problem may stem from this issue: Issue 16223. In fact you need a lot of imagination to get the idea about the recursion stack and how borrow checker is handling them. I gave up on using recursion to traverse the tree and decided to go on with just queue solution.

My source code:

https://github.com/sireliah/huffman_encoding

Sources and useful materials:

https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
https://rustbyexample.com/std/box.html
https://doc.rust-lang.org/beta/book/first-edition/the-stack-and-the-heap.html

Check this article, because it's explaining stack and heap in excellent way. (my seal of approval)
https://doc.rust-lang.org/beta/book/first-edition/the-stack-and-the-heap.html
Also, for recursive structs: https://stackoverflow.com/questions/25296195/why-are-recursive-struct-types-illegal-in-rust

Aaand a lot of Stack Overflow issues about "value x moved" and numerous fights against the borrow checker. 😎