using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 16)

Increasing the level of de-obfuscation

Lesson 15: Building a list of several candidates

Today's lesson will be short. I've been swamped at work, but I told myself that at a minimum I wanted a post each week. As a result, we'll talk about a lot of basic things, but in reality we're just going to build on what we did in lesson 15. We took the entire Unicode confusables map in; now we're going to add some of our own.

Let's revisit some history

When the internet started taking shape, and technology became more prevalent, we started coming up with shorter ways to type things. The advent of chatrooms, instant messengers, and text messaging all influenced us to design a form of text that is similar to, but not the same as, normal, standard text. We began finding ways to replace letters with numbers (such as 0 for o), and finding ways to abbreviate words, phrases, and terms to make them easier to type.

One of these many changes to the way we write text was the advent of something commonly known as "leet speak", often written "l33t." This is the idea of replacing letters with numbers that look similar, often seen in online-gaming communities. For example, we would use a 3 instead of e, a 4 for a, a 5 for s, a 7 for t. We could even use a 1 for either i or l, since the reader is usually capable of distinguishing which was intended. With all that said, we can come up with 1337, which when properly replaced would be our term "leet".

This adds a tricky challenge to our spam-filtering: robots can now replace letters with similar-looking numbers. The reader will still interpret the text properly, but it now bypasses the filters (because adv3nt is not the same as advent). We need a way to detect when this is happening.
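To make the problem concrete, here is a minimal sketch of a filter that only does exact matching, and how a simple digit-to-letter pass changes the result. All of the names here (`blockedTerms`, `isBlockedNaive`, `leetToLetter`, `normalize`) are made up for this illustration and are not part of the code from the earlier lessons. It also picks exactly one letter per digit, which is simpler than the candidate-list approach this series actually uses.

```fsharp
// Hypothetical exact-match filter, for illustration only.
let blockedTerms = set ["advent"]

let isBlockedNaive (word : string) =
    blockedTerms.Contains (word.ToLowerInvariant())

// Assumed table: just enough digit-to-letter pairs for this example.
let leetToLetter = Map.ofList [('3', 'e'); ('4', 'a'); ('0', 'o')]

// Swap each known digit for its letter; leave every other character alone.
let normalize (word : string) =
    word |> String.map (fun c -> defaultArg (Map.tryFind c leetToLetter) c)

let isBlockedNormalized (word : string) =
    word |> normalize |> isBlockedNaive

printfn "%b" (isBlockedNaive "adv3nt")      // false: the raw text slips past
printfn "%b" (isBlockedNormalized "adv3nt") // true: caught after normalizing
```

A one-to-one table like this cannot handle a 1 that might be an i or an l, which is why we keep several candidates per character instead.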

Build on the existing foundation

Well, we already have the foundation for that. We have the confusables map, and we can easily add new items to it. In fact, we saw an example of this previously:

let confusables2 =
    confusablesText // assumed name: the raw confusables text loaded in the previous lesson
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

let confusables =
    [|{ Original = ("ℕ", 0) ||> Char.toCodePoint; Replacement = "m" |> String.toCodePoints |> Seq.toArray; Field3 = "" }|]
    |> Array.append confusables2

Now, confusables2 is not a good name, and we really should make adding new ones a bit easier, so let's define a new obsfucationConfusables that has our new confusables:

let obsfucationConfusables =
    let itemToConfusable (orig, repl) =
        { Original = (orig, 0) ||> Char.toCodePoint
          Replacement = repl |> String.toCodePoints |> Seq.toArray
          Field3 = "OBS" }
    [|("1", "i"); ("1", "l")
      ("2", "z"); ("2", "s")
      ("3", "e")
      ("4", "a")
      ("5", "s"); ("5", "z")
      ("6", "g"); ("6", "b")
      ("7", "t")
      ("8", "b")
      ("9", "g")
      ("0", "o")
      ("\\", "i"); ("\\", "l")
      ("/", "i"); ("/", "l")
      ("|", "i"); ("|", "l")
      ("!", "i"); ("!", "l")
      ("+", "t")
      ("@", "a")
      ("$", "s")
      ("&", "b")
      ("(", "c")
      ("[", "c")|]
    |> Array.map itemToConfusable

This was actually really easy, and we could even load these pairs from another text file if we'd like. Because we built with foresight, we don't have to worry too much about how to inject these new values. We can map them directly to our previous type, and build out our confusables as a simple array append (as done before):

let confusables =
    obsfucationConfusables
    |> Array.append unicodeConfusables
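As for loading the pairs from a text file instead of hard-coding them, a rough sketch might look like the following. The file format (one "symbol letter" pair per line, separated by a space, with `#` comment lines), the `parsePairs` name, and the `leet.txt` file name are all assumptions made for this example. It only uses plain .NET string methods rather than the helper modules from earlier lessons.

```fsharp
open System

// Hypothetical format: "3 e" on each line; blank lines and '#' comments are skipped.
let parsePairs (lines : seq<string>) =
    lines
    |> Seq.map (fun line -> line.Trim())
    |> Seq.filter (fun line -> line <> "" && not (line.StartsWith("#")))
    |> Seq.choose (fun line ->
        match line.Split([|' '|], StringSplitOptions.RemoveEmptyEntries) with
        | [|orig; repl|] -> Some (orig, repl)
        | _ -> None) // ignore malformed lines rather than failing
    |> Seq.toArray

// In practice the lines might come from System.IO.File.ReadAllLines "leet.txt" (assumed file name).
let sampleLines = ["# leet substitutions"; "3 e"; "1 i"; "1 l"; ""; "broken-line"]
let pairs = parsePairs sampleLines

printfn "%A" pairs
```

The resulting array of (symbol, letter) tuples has the same shape as the literal array inside obsfucationConfusables, so it could be piped through the same itemToConfusable mapping.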

We're appending obsfucationConfusables after our unicodeConfusables, so we'll be adding them to the replacements. This means that adding a new test is easy:

let filters = ["nope"; "fail"; "leet"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝒆"; "l33t"; "1337"] |> listToCodePoints

Done. That's all we need to do. Now l33t and 1337 will be matched to leet. If we map out all the combinations of terms, F# will happily give us the following:

possibilitiesForL33t = [|"l33t"; "le3t"; "l3et"; "leet"|]
possibilitiesFor1337 = [|"1337"; "l337"; "i337"; "l337"; "1e37"; "le37"; "ie37"; "le37"; "13e7"; "l3e7"; "i3e7"; "l3e7"; "1ee7"; "lee7"; "iee7"; "lee7"; "133t"; "l33t"; "i33t"; "l33t"; "1e3t"; "le3t"; "ie3t"; "le3t"; "13et"; "l3et"; "i3et"; "l3et"; "1eet"; "leet"; "ieet"; "leet"|]
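If you're curious how a list like that can be produced, here is a small, self-contained sketch of the idea. It is not the implementation from the earlier lessons: the `replacements` table is a tiny stand-in for the real confusables array, and `optionsFor`, `extend`, and `possibilities` are names invented for this example. Each character keeps itself as an option, and every partial word is extended with every option for the next character.

```fsharp
// Assumed, tiny replacement table; the real one comes from the confusables array.
let replacements =
    Map.ofList [('3', ['e']); ('7', ['t']); ('1', ['l'; 'i'])]

// Every character keeps itself as an option, followed by its replacements.
let optionsFor (c : char) =
    c :: (Map.tryFind c replacements |> Option.defaultValue [])

// Extend each partial word with each option for the next character.
let extend (prefixes : string list) (options : char list) =
    [ for prefix in prefixes do
        for c in options -> prefix + string c ]

// Fold over the characters, starting from a single empty prefix.
let possibilities (word : string) =
    word
    |> Seq.map optionsFor
    |> Seq.fold extend [""]

printfn "%A" (possibilities "l33t") // ["l33t"; "l3et"; "le3t"; "leet"]
```

The ordering differs from the listing above because this sketch varies the last character fastest, and its table has fewer entries for 1, so "1337" produces 24 candidates here rather than 32. The set of candidates for "l33t" is the same.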

Right, so we've established that this worked properly before, and it still works now that we've added a whole new feature and capability. The next step will be to take an input string and process each word in it.

I know this was short and boring, but it's about the best I could do this week. I've been seriously swamped at work, and it's been eating into my personal time just a bit. With that said, this should get us moving along the path to success, and we'll get to some really fun data-analysis as we progress further. I'm going into 'hardware mode' this weekend, so I should be right-properly refreshed by Monday.