WIP: Substitution cipher cryptoanalysis in Rust

Overview

In the Implementing echo in Rust post, I mentioned that my wife and I are learning Rust and cryptography. This post is the first cryptography post and it introduces substitution ciphers. Along the way, we’ll talk about more Rust.

WIP status: this is still in an early draft, but I needed to publish it in order to make it easier for us to read through together. If you’re stumbling across this from the World Wide Web, know that there might be some unintended falsehoods lurking below.

Substitution ciphers

Substitution ciphers are fairly straightforward: you substitute one character for another. Here’s an example:

A => Z
B => F
C => I
...

To decrypt the ciphertext, you reverse the substitution:

Z => A
F => B
I => C
...

Problem statement: ciphertext to analyze

We want to decrypt this ciphertext.

"LRVMNIR BPR SUMVBWVR JX BPR LMIWV YJERYRKBI JX QMBM WI BPR XJVNI MKD YMIBRUT JX IRHX
WI BPR RIIRKVR JX YMBINLMTMIPW UTN QMUMBR DJ W IPMHH BUT BJ RHNVWDMBR BPR YJERYRKBI JX
BPR QMBM MVVJUDWKO BJ YT WKBRUSURBMBWJK LMIRD JK XJUBT TRMUI JX IBNDT WB WI KJB MK RMIT
BMIQ BJ RASHMWK RMVP YJERYRKB MKD WBI IWOKWXWVMKVR MKD IJYR YNIB URYMWK NKRASHMWKRD BJ
OWER M VJYSHRBR RASHMKMBWJK JKR CJNHD PMER BJ LR FNMHWXWRD MKD WKISWURD BJ INVP MK RABRKB
BPMB PR VJNHD URMVP BPR IBMBR JX RKHWOPBRKRD YWKD VMSMLHR JX URVJOKWGWKO IJNKDHRII IJNKD
MKD IPMSRHRII IPMSR W DJ KJB DRRY YTIRHX BPR XWKMH MNBPJUWBT LNB YT RASRUWRKVR CWBP QMBM
PMI HRXB KJ DJNLB BPMB BPR XJHHJCWKO WI BPR SUJSRU MSSHWVMBWJK MKD WKBRUSURBMBWJK W JXXRU
YT BPRJUWRI WK BPR PJSR BPMB BPR RIIRKVR JX JQWKMCMK QMUMBR CWHH URYMWK WKBMVB

Note the newlines. When we copy/paste this text, we’ll need a single line. (If you want to be a cool kid, open it in vim and type ggVGJ.)

Analyzing the ciphertext

We could try to brute force an analysis. To brute force it, we’d have to sort through every possible combination of all 26 letters of the alphabet. Take a moment and try to compute that. We have 26 letters to choose from for A’s replacement. Let’s choose B. We now have 25 more letters to choose from to replace our next letter with. So on and so forth, until we have 26! (that’s shorthand for 26 * 25 * 24 …). And 26! is approximately equal to 2^28 (a huge effing number).

So, brute forcing our analysis isn’t practical. Even with a bunch of shiny new CPUs, it’d take us decades in the worst case.

There’s a smarter way. Some characters are used more frequently than others. For example, the space character, , is the most common character in English text. We can use the frequency of the ciphertext’s letters to make an educated guess about what their decrypted values are.

Starting a new Rust project

Before trying a frequency analysis, let’s set up a new Rust project.

cargo new substitution-cipher

For an overview of what this does, see the post on implementing echo in Rust.

Setting up the frequency attack

Replace the hello world in main.rs with the ciphertext from above:

fn main() {
    let ciphertext = String::from("LRVMNIR BPR SUMVBWVR JX BPR LMIWV YJERYRKBI JX QMBM WI BPR XJVNI MKD YMIBRUT JX IRHX WI BPR RIIRKVR JX YMBINLMTMIPW UTN QMUMBR DJ W IPMHH BUT BJ RHNVWDMBR BPR YJERYRKBI JX BPR QMBM MVVJUDWKO BJ YT WKBRUSURBMBWJK LMIRD JK XJUBT TRMUI JX IBNDT WB WI KJB MK RMIT BMIQ BJ RASHMWK RMVP YJERYRKB MKD WBI IWOKWXWVMKVR MKD IJYR YNIB URYMWK NKRASHMWKRD BJ OWER M VJYSHRBR RASHMKMBWJK JKR CJNHD PMER BJ LR FNMHWXWRD MKD WKISWURD BJ INVP MK RABRKB BPMB PR VJNHD URMVP BPR IBMBR JX RKHWOPBRKRD YWKD VMSMLHR JX URVJOKWGWKO IJNKDHRII IJNKD MKD IPMSRHRII IPMSR W DJ KJB DRRY YTIRHX BPR XWKMH MNBPJUWBT LNB YT RASRUWRKVR CWBP QMBM PMI HRXB KJ DJNLB BPMB BPR XJHHJCWKO WI BPR SUJSRU MSSHWVMBWJK MKD WKBRUSURBMBWJK W JXXRU YT BPRJUWRI WK BPR PJSR BPMB BPR RIIRKVR JX JQWKMCMK QMUMBR CWHH URYMWK WKBMVB");
}

Put the frequency of each letter, along with that letter, in a Vector:

fn main() {
    <<snip>>

    let mut char_frequency: Vec<(char, f32)> = vec![
        ('A', 0.0817),
        ('B', 0.0150),
        ('C', 0.0278),
        ('D', 0.0425),
        ('F', 0.0223),
        ('G', 0.0202),
        ('H', 0.0609),
        ('I', 0.0697),
        ('J', 0.0015),
        ('K', 0.0077),
        ('L', 0.0403),
        ('M', 0.0241),
        ('N', 0.0675),
        ('O', 0.0751),
        ('P', 0.0193),
        ('Q', 0.0010),
        ('R', 0.0599),
        ('S', 0.0633),
        ('T', 0.0906),
        ('U', 0.0276),
        ('V', 0.0098),
        ('W', 0.0236),
        ('X', 0.0015),
        ('Y', 0.0197),
        ('Z', 0.0007),
        ('E', 0.1270),
        (' ', 0.1988), // a guess based on a random article:
                       // https://www.researchgate.net/figure/Probability-of-characters-in-English-The-SPACE-character-represented-by-has-the_fig2_47518347
    ];
}

(The <<snip>> just means we’ve removed a chunk of code to make things easier to read on a screen.)

There’s a bit more going on. We have a mutable variable that uses a macro (we’ll talk about macros in another blog post) to initialize a Vector. Recall that Vectors are growable lists of values, and in this case, we’re populating the Vector with tuples. Tuples are ordered entries of values, but they can be of different types. We have a Character type (denoted by the single quotes, ') and a 32-bit Float.

We have a Vector of letter frequencies found in the wild, but we need to figure out how frequently letters show up in our ciphertext to match against the Vector of ‘wild’ frequencies.

We’ll use a HashMap to store our ciphertext frequencies.

<<snip>>
let mut ciphertext_letter_frequency: HashMap<char, i32> = [
    ('A', 0),
    ('B', 0),
    ('C', 0),
    ('D', 0),
    ('E', 0),
    ('F', 0),
    ('G', 0),
    ('H', 0),
    ('I', 0),
    ('J', 0),
    ('K', 0),
    ('L', 0),
    ('M', 0),
    ('N', 0),
    ('O', 0),
    ('P', 0),
    ('Q', 0),
    ('R', 0),
    ('S', 0),
    ('T', 0),
    ('U', 0),
    ('V', 0),
    ('W', 0),
    ('X', 0),
    ('Y', 0),
    ('Z', 0),
    (' ', 0),
]
.iter()
.cloned()
.collect();

This particular HashMap is initialized with an Array because we have a stable list of tuples representing the keys and values we want added to it.

We tell the compiler what we’re creating, HashMap<char, i32>, and then we create an iterable list that gets cloned and collected into the HashMap we’re creating. Iterators, well… make something iterable. They have some interesting behavior (e.g., they don’t do anything until you use them, potentially reducing the time complexity of what you’re writing) that we might touch on in a later post, but for now, just think of them as making something iterable.

We’re cloning the iterable because iter() returns a reference to the iterated values. (And before you push up your tape-covered glasses and say, “well, akshually, from_iter() is the right choice because it iterates over the value, not the reference,” know that I choose iter() because into_iter() currently iterates over references for arrays and iter() makes that nuance clearer.) A future post will talk about ownership, references, and so on in more detail.

Counting the frequency of letters in the ciphertext

We have a HashMap with 0 as placeholder values for each letter’s frequency. Let’s count the letters’ use to get real values.

To count each letter, we’ll need to split apart the ciphertext, look at each letter in turn, and increment its entry in ciphertext_letter_frequency.

Here’s how we’ll do that:

<<snip>>

for ciphertext_letter in ciphertext.chars() {
    *ciphertext_letter_frequency
        .get_mut(&ciphertext_letter)
        .unwrap() += 1;
}

Another code block with a lot going on. This increments each letter’s HashMap value. It ranges over the list of ciphertext’s characters, created by .chars(), one character at a time. And for each character, we’re getting a mutable reference to its value (they all start at 0), unwrapping that value (.get_mut() returns a Result, a type introduced in the echo post), and then incrementing what it refers to (* dereferences it, which we’ll cover in the ownership/borrowing blog post).

So far, we’ve counted each letter’s frequency. Let’s take a first pass at analyzing the ciphertext.

Analyzing the ciphertext

We have a HashMap of the ciphertext’s letter frequencies and a Vector of common letter frequencies in English text.

We need to replace the letters in the ciphertext with their complement frequencies that generally appear in English. There are a bunch of ways to do that, but here’s one of the simpler ways (at least to me): list and sort ciphertext_letter_frequency and char_frequency by magnitude and then use the indices of each sorterd list to map the letters from the general frequencies to the ciphertext’s frequencies.

To have an easy way to list and sort the ciphertext_letter_frequency, let’s turn it into a Vector.

<<snip>>

let mut ciphertext_letter_frequency = Vec::from_iter(ciphertext_letter_frequency.iter());

We’re replacing the original ciphertext_letter_frequency with a shadowed variable of the same name. It has a different type, though: a Vector. This keeps us from having variables like ciphertext_letter_frequency and and its Vector representation ciphertext_letter_frequency_vec littered throughout the code. Note the re-use of let. Without it, we’re merely reassigning a value to the already declared ciphertext_letter_frequency.

Let’s sort these Vectors by magnitude of frequency:

<<snip>>

ciphertext_letter_frequency.sort_by_key(|pair| pair.1);
char_frequency.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Less));

We’re doing something interesting with those |s. They’re how you declare closures. Closures are like functions, but they differ in what they scope over. They capture the context where they’re declared. Functions don’t do that. For examples and to understand closures, see The Book’s entry on them. For our purposes here, think of them as merely anonymous functions. We’ll talk about what else you can do with them in a different post.

Both ciphertext_letter_frequency and char_frequency are a list of tuples, but they differ in the types for their frequency values: i32 and f32 respectively. That’s important because sorting Floats and sorting Integers behaves differently. Explaing why is going to take us on a short diversion, which can be skipped if you only care about practically sorting Floats.

Sorting Floats in Rust

People complain about how Floats are sorted in Rust, but it shows the care of the Rust language developers. The IEEE 754 standard deals with floating-point arithmetic. It requires handling the absence of values represented by NaN. Because NaN is allowed in comparisons of Floats, any sorting method must handle it. Yet, any sorting method that handles NaN won’t be a complete sorting method. Why? Well, you can’t sort the absence of a value because it’s conceptually empty to sort on absences without just taking a stand on where they should go! So, any sorting method that ranges over NaN can only partially sort unless the developers take a stand on where absences of values, represented by NaN, are sorted.

That’s why sorting Floats looks like this:

char_frequency.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Less));

The closure is .sort_by()’s comparison function, and .sort_by() expects it to return an Ordering:

pub enum Ordering {
    Less,
    Equal,
    Greater,
}

By using .unwrap_or(Less) we’re using the Less variant of the Ordering enum as the default unwrapping. That triggers when a wild NaN appears. We could have used a simple .unwrap(), but this shows how you can be opinionated using .unwrap_or().

Combining the frequencies

Let’s take our two Vectors, sorted by magnitude, and combine them into a HashMap to get a key from ciphertext letter to cleartext letter. Remember that since we’ve sorted both Vectors by magnitude of frequency, we can use the indices to make our mapping of ciphertext letter to cleartext letter.

<<snip>>

let mut analytical_attack_key: HashMap<char, char> = HashMap::new();
// _val forces a destructuring of the tuple; without it, (key) would represent the whole tuple
for (idx, (key, _val)) in ciphertext_letter_frequency.iter().enumerate() {
    analytical_attack_key.insert(**key, char_frequency[idx].0);
}

First pass at analyzing the ciphertext

Let’s take a stab at generating cleartext from the ciphertext.

<<snip>>
let mut parsed_text: String = ciphertext
    .chars()
    .map(|letter| analytical_attack_key.get(&letter).unwrap())
    .collect();

println!("parsed_text: {}", parsed_text);

This prints out what looks like a good first stab at breaking the ciphertext, but it still unusable by us to understand the cleartext:

parsed_text: YECAFSE THE WRACTNCE IU THE YASNC MIXEMEOTS IU BATA NS THE UICFS AOD MASTERG IU SELU NS THE ESSEOCE IU MATSFYAGASHN RGF BARATE DI N SHALL TRG TI ELFCNDATE THE MIXEMEOTS IU THE BATA ACCIRDNOP TI MG NOTERWRETATNIO YASED IO UIRTG GEARS IU STFDG NT NS OIT AO EASG TASB TI EKWLANO EACH MIXEMEOT AOD NTS SNPONUNCAOCE AOD SIME MFST REMANO FOEKWLANOED TI PNXE A CIMWLETE EKWLAOATNIO IOE VIFLD HAXE TI YE JFALNUNED AOD NOSWNRED TI SFCH AO EKTEOT THAT HE CIFLD REACH THE STATE IU EOLNPHTEOED MNOD CAWAYLE IU RECIPONQNOP SIFODLESS SIFOD AOD SHAWELESS SHAWE N DI OIT DEEM MGSELU THE UNOAL AFTHIRNTG YFT MG EKWERNEOCE VNTH BATA HAS LEUT OI DIFYT THAT THE UILLIVNOP NS THE WRIWER AWWLNCATNIO AOD NOTERWRETATNIO N IUUER MG THEIRNES NO THE HIWE THAT THE ESSEOCE IU IBNOAVAO BARATE VNLL REMANO NOTACT

It’s clear that we don’t have nearly enough ciphertext letters to get a trustworthy set of frequencies to map from ciphertext letters to cleartext letters. It’s an excellent first step, though, and gives us enough room to make some educated guesses about what the cleartext says.

Time to make some educated guesses on breaking the ciphertext

Our brains are quite good at recognizing patterns, including what a word might be based on its length, first letter, last letter, position in a sentence, and so on. Let’s use that ability to make good guesses about what a word is, and then feed our decisions back into our mapping of ciphertext letters to cleartext letters. That is, we’ll update our translation key based on the words we can identify, eventually (hopefully) making the ciphertext into meaningful cleartext.

<<snip>>

let mut split_parsed = parsed_text.split(" ");
let mut modified = false;

loop {
    let mut response = String::new();
    response.trim();

    let word = split_parsed.next().unwrap();
    println!("{:?}", word);

    io::stdin().read_line(&mut response);

    match response[..].trim() {
        "" => {
            continue;
        }
        "text" => {
            println!("{:?}", parsed_text)
        }
        "exit" => {
            if modified {
                parsed_text = parsed_text
                    .chars()
                    .map(|letter| analytical_attack_key.get(&letter).unwrap())
                    .collect::<String>();
            }

            break;
        }
        _ => {
            modified = true;

            for (idx, letter) in word.chars().enumerate() {
                let test = response.chars().nth(idx).unwrap();
                analytical_attack_key.insert(letter, test);
            }
        }
    };
}

And after this loop body, place println!("parsed_text: {}", parsed_text); before the main body’s closing bracket to see what you get!

Additional reading

HashMaps in The Book

PartialOrd and Ord for ordering comparisons

TODO: don’t forget to talk about imports


Aaron Arinder is a software engineer trying to die well. GitHub