using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 17)

How close are two words?

Lesson 16: Increasing the level of de-obsfucation

Today's lesson is going to continue on our "spam filter" adventure, but the next step it to start supporting similar-but-not-quite-identical words.

Anyone who's ever typed or written text has most assuredly spelled a word or two wrong, probably more than once. (In fact, I spell 'appropriate' wrong every day — for some reason I always end up with 'apporpriate', I think it's the speed at which I type.) These may be mistakes, or they may be intentional. Sometimes we spell something wrong for a reason.

Often times "spammers" will spell words wrong on-purpose — instead of just swapping letters out for other, similar letters, they'll actually mis-spell something knowing the filter is searching for the correctly spelled term. Of course, a good spam-filter (which is what we plan to build) can detect some of this and make decisions based on that fact itself.

As of right now we're testing several cases of our filters against the following terms: "nope"; "fail"; "leet", today we're going to add some more cases, such as "noope", and "failing". These are going to be two more words we match against, but we're also now going to return how effectively the filter matched the term. That is, "nope" would match against "nope" for a 100% match, but "noope" would not, so we'll devise a manner of indicating what the success-rate is.

We're also going to try to prevent some false-alarms. A word like "true" doesn't really match any of our filters, but it's fair to say that it shares letters with two of them: it shares one letter with "nope", and two letters with "leet". We know that it's not even close, but we'll have to tell the computer that before we continue.

This can also be the basis of a 'spell-checker'

At this moment we're devising a piece of code that could serve as a trivial spell-checker, we're going to have to design it that way because we have to support mis-spellings, and the easiest way to do so is to design it as a spell-checker.

We're also going to do some TDD here — for those unfamiliar, TDD is simply "test-driven development", or the idea that you build tests that have an input and expected output, and then start writing the code to get there. We would write a test such as:

let expected = 1.0
let inputTerm = "nope"
let inputFilter = ["nope"]
let actual = inputTerm |> matchFilters inputFilter |> Seq.tryHead |> Option.map snd |> (function | None -> 0.0 | Some v -> v)
printfn "%s" (if actual = expected then "Pass" else "Fail")

This uses the AAA pattern, or "Arrange, Act, Assert". That is:

  • Arrange: prepare the conditions for the test, this is the input and output. What are we looking for? What are we given?
  • Act: perform the operation for the test, this maps the input(s) to the output(s). What do we get?
  • Assert: determine whether the expected output and actual output match. What really happened?

When I do TDD I always write a test, then write the function being tested, and repeat. I want to start with the main "framework" that we'll use (though this is a basic testing framework, there are others that are far better than this, I recommend using one of them):

module Tests =
    module Assert =
        let private fn fn msg expected actual =
            if fn expected actual then None
            else (msg expected actual) |> Some
        let equal expected actual = fn (=) (sprintf "expected: %A; actual: %A") expected actual
        let notEqual expected actual = fn (<>) (sprintf "expected not: %A; actual: %A") expected actual
        let largerThan expected actual = fn (>) (sprintf "expected greater than: %A; actual %A") expected actual
        let largerEqual expected actual = fn (>=) (sprintf "expected greater than / equal to: %A; actual %A") expected actual
        let smallerThan expected actual = fn (<) (sprintf "expected smaller than: %A; actual %A") expected actual
        let smallerEqual expected actual = fn (<=) (sprintf "expected smaller than / equal to: %A; actual %A") expected actual
        let print assertion =
            match assertion with
            | None -> printfn "Pass"
            | Some msg -> printfn "Fail: %s" msg

    // Define tests

    let runTests () =
        let tests =
            [ // List tests
              ]
        tests |> List.iter ((|>) () >> Assert.print)

This is obviously not the easiest solution to use, but it works. Drop all your test methods in the tests list, and define them before runTests, and life should be good. As an example, our first test would look like:

let ``A term that exactly matches a word in the filter should return 1.0 as match`` () =
    let expected = 1.0
    let inputTerm = "nope"
    let inputFilter = ["nope"]
    let actual = inputTerm |> matchFilters inputFilter |> Seq.tryHead |> Option.map snd |> (function | None -> 0.0 | Some v -> v)
    Assert.equal expected actual

let runTests =
    let tests = [``A term that exactly matches a word in the filter should return 1.0 as match``]
    tests |> List.iter ((|>) () >> Assert.print)

Pretty easy, right?

Write the matchFilters function

Now we start writing the matchFilters function. We know what our inputs should be: a list of filters and a term to test, and we expect that it returns some sort of sequenced tuple. Specifically, we're going to return every term matched, and the "accuracy" of that match.

Initially, we can obviously define this pretty easily to pass our test:

let matchFilters filters term = filters |> List.choose (fun filter -> if term = filter then (filter, 1.0) |> Some else None)

If we run our tests now we should get Pass printed out. Pretty easy, right? Now we could have actually written let matchFilters filters term = ("", 1.0), and that would have been completely valid and satisfied our tests. It's not wrong. We still made all our tests pass. The version I wrote just happens to satisfy what the next test we'll write is as well.

Next we should probably test that a word like "abcd" doesn't accidentally return 1.0. If we write matchFilters as the version in the previous paragraph, it would do so, so we want a test for it:

let ``A term that does not match any words in the filter should return no matches`` () =
    let expected = 0.0
    let inputTerm = "abcd"
    let inputFilter = ["nope"]
    let actual = inputTerm |> matchFilters inputFilter |> Seq.tryHead |> Option.map snd |> (function | None -> 0.0 | Some v -> v)
    Assert.equal expected actual

Now we are forced to write a version that will at the very least return 1.0 for the first test, and 0.0 for this second one. You should now get the idea of TDD: write a test, build the method to make the test pass, repeat. Ideally, you don't trust a test until you've proven that the current version of the method made it change. That is: the test should fail first, then build a version of the method that passes. Think of it as a red light turning green: you can truly trust the traffic light once you've seen that happen, because you know the basic "is it just stuck on green?" test completed, and told you "nope, it's cycling like normal."

Now, interestingly, our second test isn't actually quite right, but I'm going to leave it as is. We said it should return no matches, but we tested it in a manner that loses that idea. It doesn't matter in the end, because no matches is synonymous with 0% match, but it's a good point to keep in mind.

Before we continue, let's make life a little easier on us. We should remember from previous lessons that our goal is always to write clear, concise, informative code. We can make matchFilters slightly better with that in mind:

let matchFilters filters term =
    let matchFilter filter =
        if term = filter then (filter, 1.0) |> Some
        else None
    filters
    |> Seq.ofList
    |> Seq.choose matchFilter

Now when we're ready to build better criteria, we just have to swap out matchFilter. Much easier than dealing with that silly lambda.

Build our spell-checking algorithm (sort-of)

Alright, on to the hard part. For the first iteration of the algorithm, we're simply going to test what letters are in both terms, then divide that by the total letters from both terms for the percentage. This will return 100% for "nope"/"nope", and 0% for "nope"/"abcd". This will also return a partial match for "nope"/"true" and "nope"/"leet".

Doing this is actually quite simple, we have a few options, the one I'll use is to create a concatenated array of both the strings, and then group them, and divide each group size by 2. As an example, consider we have "nope"/"true", when we concat we'll get [|'n'; 'o'; 'p'; 'e'; 't'; 'r'; 'u'; 'e'|], when we group the characters and get the counts we'll have [|('n', 1); ('o', 1); ('p', 1); ('e', 2); ('t', 1); ('r', 1); ('u', 1)|], if we sum floor(group.Count / 2) we get 1, which we then double and find that 2 of the total 8 characters were in each string, or 25%.

let matchFilters filters (term : string) =
    let matchFilter (filter : string) =
        let termChars = term.ToCharArray()
        let filterChars = filter.ToCharArray()
        let concated = termChars |> Array.append filterChars
        let grouped = concated |> Array.groupBy id |> Array.map (fun (c, a) -> (c, a |> Array.length))
        let sum = grouped |> Array.fold (fun acc (c, i) -> acc + i / 2) 0 |> float |> (*) 2.0
        if sum <= 0.0 then None
        else (filter, sum / (term.Length + filter.Length |> float)) |> Some
    filters
    |> Seq.ofList
    |> Seq.choose matchFilter
    |> Seq.sortByDescending snd

So that's what we end up with, for now. (We'll probably improve this in a later lesson.) We added the Seq.sortByDescending snd to order them by best match first. If we run our two existing tests we should get "Pass" and "Pass". We didn't create tests for this yet (well, I did, you probably haven't) so I recommend that before dumping this in your code you build a couple basic tests.

let testFn filters = matchFilters filters >> Seq.tryHead >> Option.map snd >> (function | None -> 0.0 | Some v -> v)

let ``A term that partially matches a word in the filter should return 0.* as match`` () =
    let expected = 0.25
    let inputTerm = "true"
    let inputFilter = ["nope"]
    let actual = (inputFilter, inputTerm) ||> testFn
    Assert.equal expected actual

let ``A term that mostly matches a word in the filter should return 0.* as match`` () =
    let expected = 0.75
    let inputTerm = "mope"
    let inputFilter = ["nope"]
    let actual = (inputFilter, inputTerm) ||> testFn
    Assert.equal expected actual

This method, of course, assumes that we only care about what characters are present and that the order is irrelevant. We'll (at some point) want to redefine it to support matching terms such that terms that have the same letters but not in the same position are not exactly a 1.0, etc. For now, this will do.

Finally, come up with a "threshold" for acceptance

When we do a "heuristic" match like this, we need to define a threshold of tolerance. We don't want "true" to match the "nope" filter because it only shares an e, we wnat to define a minimum-level-of-acceptability. For this, I usually start with a somewhat middle-ground value, knowing that 0.5 means 50% of our words were a match, I go with 0.75, or 75%. This gives us an initial feeling for where we stand, and we can tune that later.

Now I'm going to redefine things a little, don't fear, just to keep it a little cleaner:

let matchFilters filters (term : string) =
    let matchFilter (filter : string) =
        let termChars = term.ToCharArray()
        let filterChars = filter.ToCharArray()
        let concated = termChars |> Array.append filterChars
        let grouped = concated |> Array.groupBy id |> Array.map (fun (c, a) -> (c, a |> Array.length))
        let sum = grouped |> Array.fold (fun acc (c, i) -> acc + i / 2) 0 |> float |> (*) 2.0
        if sum <= 0.0 then None
        else (filter, sum / (term.Length + filter.Length |> float)) |> Some
    filters
    |> Seq.ofList
    |> Seq.choose matchFilter
let bestFilter filters =
    matchFilters filters >> Seq.sortByDescending snd >> Seq.tryHead >> Option.map snd

Pretty simple, we just moved our Seq.sortByDescending out, and now we'll go ahead and add a function for meetsThreshold which takes our "filter" and threshold:

let meetsThreshold threshold (filter, percent) = percent > threshold

Time to modify our previous matchedFilters

Finally, the fun part. We'll modify the matchedFilters to use our new algorithm. I'm only going to pick the best-matched filter, and we're only going to pick which transformation of the term it was that best-matched the filter, so we'll end up with the following:

let matchedFilters term =
    term
    |> transformToLowerTerm
    |> Array.map (bestFilter filters >> (function | None -> ([||], 0.0) | Some f -> f))
    |> Array.sortByDescending snd
    |> Array.filter (meetsThreshold 0.75)
    |> Array.tryHead
    |> (function | None -> [] | Some a -> [a])

I did make one modification to matchFIlters: the inner-function was defined to use strings, we want to use our int-arrays, so we'll change:

let matchFilters filters (term : string) =
    let matchFilter (filter : string) =
        let termChars = term.ToCharArray()
        let filterChars = filter.ToCharArray()
        let concated = termChars |> Array.append filterChars

To:

let matchFilters filters (term : int[]) =
    let matchFilter (filter : int[]) =
        let concated = term |> Array.append filter

And we should have the new algorithm. We should get some matched terms:

val matchedTerms : (string * string list) list =
  [("ℕope", ["nope"]); ("𝑵ope", ["nope"]); ("ռope", ["nope"]);
   ("nope", ["nope"]); ("𝕱ail", ["fail"]); ("𝓕ail", ["fail"]);
   ("l33t", ["leet"]); ("1337", ["leet"]); ("noope", ["nope"])]

We notice that "failing" is not in there - we can add it by tuning that 0.75 threshold from before. And, finally, if you want to include the percent that it matched the filter in your matchedTerms, replace it with the following:

let matchedTerms =
    terms
    |> List.map (fun t -> (t, t |> matchedFilters))
    |> List.filter (snd >> List.length >> any)
    |> List.map (fun (t, f) -> (t |> String.fromCodePoints, f |> List.map (fun (f, s) -> (f |> String.fromCodePoints, s))))

We just change the fst >> String.fromCodePoints to a lambda, which maps the first item to the string, and the second item as the value still.


Alright, so that sums up todays lesson. I know it got a bit rambly towards the end but that's what happens when you have absolutely no sleep and write this over a 5-day period. I actually have a plan for the next one, which involves more spreadsheet stuff, so tune-in next time for another fabulous adventure in learning how to program (but the hard way). Good luck and enjoy!

Getting started with programming and getting absolutely nowhere (Part 16)

Increasing the level of de-obsfucation

Lesson 15: Building a list of several candidates

Today the lesson will be short, I've been swamped at work but I told myself that at a minimum I wanted a post each week. As a result, we'll talk about a lot of basic things, but in reality we're just going to build on what we did in lesson 15. We took the entire Unicode confusables map in, now we're going to add some of our own.

Let's revisit some history

When the internet started taking shape, and technology became more prevalent, we started coming up with shorter ways to type things. The advent of chatrooms, instant messengers, and text messaging all influenced us to design a form of text that is similar but not the same as normal, standard text. We began finding ways to replace letters with numbers, (such as 0 for o), and finding ways to abbreviate words/phrases/terms to make them easier to type.

One of these many changes to our general text-processing was the advent of something known commonly as "leet speak", often written "l33t." This is the idea of replacing letters with numbers that look similar, often seen in online-gaming communities. For example, we would use a 3 instead of e, a 4 for an a, a 5 for s, a 7 for T. We could even use a 1 for an i or l — the reader is usually capable of distinguishing which was intended. With all that said, we can come up with 1337, which when properly replaced would be our term "leet".

This adds a precarious challenge to our spam-filtering: robots can now replace the letters with a similar number, the reader will still interpret it properly, but now it bypasses the filters (because adv3nt is not the same as advent). We need a way to distinguish when this is happening.

Build on the existing foundation

Well, we already have the foundation for that. We have the confusables map, and we can easily-enough add new items to this map. In fact, we saw an example of it previously:

let confusables2 =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

let confusables =
    [|{ Original = ("ℕ", 0) ||> Char.toCodePoint; Replacement = "m" |> String.toCodePoints |> Seq.toArray; Field3 = "" }|]
    |> Array.append confusables2

Now, confusables2 is not a good name, and we really should make adding new ones a bit easier, so let's define a new obsfucationConfusables that has our new confusables:

let obsfucationConfusables =
    let itemToConfusable (orig, repl) =
        { Original = (orig, 0) ||> Char.toCodePoint
          Replacement = repl |> String.toCodePoints |> Seq.toArray
          Field3 = "OBS" }
    [|("1", "i"); ("1", "l")
      ("2", "z"); ("2", "s")
      ("3", "e")
      ("4", "a")
      ("5", "s"); ("5", "z")
      ("6", "g"); ("6", "b")
      ("7", "t")
      ("8", "b")
      ("9", "g")
      ("0", "o")
      ("\\", "i"); ("\\", "l")
      ("/", "i"); ("/", "l")
      ("|", "i"); ("|", "l")
      ("!", "i"); ("!", "l")
      ("+", "t")
      ("@", "a")
      ("$", "s")
      ("&", "b")
      ("(", "c")
      ("[", "c")|]
    |> Array.map itemToConfusable

This was actually really easy, we could even load this from another text-file if we'd like. Because we built with foresight, we don't have to worry too much about how to inject these new values. We can map them directly to our previous type, and build out our confusables as a simple array append (as done before):

let confusables =
    obsfucationConfusables
    |> Array.append unicodeConfusables

We're appending obsfucationConfusables after our unicodeConfusables, so we'll be adding them to the replacements. This means that adding a new test is easy:

let filters = ["nope"; "fail"; "leet"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝒆"; "l33t"; "1337"] |> listToCodePoints

Done. That's all we need to do, now l33t and 1337 will be matched to leet. If we map out all the combinations of terms, F# will happily give us the following:

possibilitiesForL33t = [|"l33t"; "le3t"; "l3et"; "leet"|]
possibilitiesFor1337 = [|"1337"; "l337"; "i337"; "l337"; "1e37"; "le37"; "ie37"; "le37"; "13e7"; "l3e7"; "i3e7"; "l3e7"; "1ee7"; "lee7"; "iee7"; "lee7"; "133t"; "l33t"; "i33t"; "l33t"; "1e3t"; "le3t"; "ie3t"; "le3t"; "13et"; "l3et"; "i3et"; "l3et"; "1eet"; "leet"; "ieet"; "leet"|]]

Right, so we've established that this worked properly before, now it works still but we added a whole new featur and capability. The next step will be to take an input string, and process each word in it.


I know this was short and boring, but it's about the best I could do this week. I've been seriously swamped at work, it's been eating into my personal time just a bit. With that said, this should get us moving towards the path to success, and we'll get to some really fun data-analysis as we progress further. I'm going in to 'hardware mode' this weekend, so I should be right-properly refreshed by Monday.

Getting started with programming and getting absolutely nowhere (Part 15)

Building a list of several candidates

Lesson 14: Starting our next problem

In the previous lesson, I mentioned the following:

One of the biggest issues, that might not be immediately apparent, with how we do this is: what happens if we have multiple confusables for a character? We only transform the first one, which means we could be missing out on some possible conversions. We'll go over fixing this in the next lesson, but I want you to start thinking about it now. Essentially, our goal will be to swap out the transformTerm with a version that returns every possible option (which for each of our samples is still only 1).

If it's not clear, what I mean by this is that if a character we're given could be confused with two different characters, then we want to consider both possibilities. We want to be able to test all variants, to ensure we don't miss any.

For this, we need a way to gather all the variants. Here we'll go back to a quick REPL integration, to begin our development. We want a function, called getCombinations, that allows us to provide an array of arrays, and returns another array of arrays of each possible combination of those positions. So, if we provided [|[|"g"|]; [|"r"|]; [|"e"; "a"|]; [|"y"|]|], we want to get back [|[|"g"; "r"; "e"; "y"|]; [|"g"; "r"; "a"; "y"|]|], thus, we should have the following:

// This should be [|"grey"; "gray"|]
let combinationTest = [|[|'g'|]; [|'r'|]; [|'e'; 'a'|]; [|'y'|]|] |> getCombinations |> Array.map String.implode

Make sense? So given an array of options for a position, return all possible combinations of those options. If we provide [|[|'w'|]; [|'o'|]; [|'r'|]; [|'d'|]|], get [|[|'w'; 'o'; 'r'; 'd'|]|].

We're going to do this in a fairly simple manner, using recursion (but not tail-call recursion, in our example). Now I've done the hard part, and written the getCombinations function, in this lesson we'll focus on modifying our current program to use it.

Getting all the combinations

This function is somewhat simple, and lacks tail-call recursion, so a good exercise for you might be to rewrite it for tail-call recursion. (It can be done, and I have done it with this just to prove it, we'll discuss that in a different non-lesson.)

let rec getCombinations<'a> (input : 'a array array) : 'a array array =
    let currentArray = input |> Array.head
    let tail = input |> Array.tail
    if tail |> Array.length |> any then
        tail
        |> getCombinations
        |> Array.map (fun ia ->
            currentArray
            |> Array.map (fun ca ->
                [|[|ca|]; ia|]
                |> Array.flatten))
        |> Array.flatten
    else currentArray |> Array.map Array.initOne

So, we basically get the sub-combinations, and combine them with the combinations of the option we currently have. Pretty simple, right?

Now we can test this with the previous tests, and prove it to ourselves, but I leave that up to you. (I already proved it to myself, it's your turn to prove that it works like we want.) We use the 'generic type' specification here, as it should really work on anything. So, your signature should be 'a array array -> 'a array array.

Next, we need to pull this to our transform

Previously, we defined transformTerm as follows:

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|codePoint|]
        | candidates -> candidates.[0].Replacement)
    >> Array.flatten

We're going to slightly update this defnition. The first step is to modify candidates to return all the replacements: candidates |> Array.map (fun x -> x.Replacement)). Next, we need to modify the [||] (empty) match-case to return [|[|codePoint|]|]. Then, we need to add a call to getCombinations at the end. Finally, move Array.flatten to the end, in an Array.map.

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|[|codePoint|]|]
        | candidates -> candidates |> Array.map (fun x -> x.Replacement))
    >> getCombinations
    >> Array.map Array.flatten

This, of course, breaks our transformToLowerTerm and matchedFilters. For transformToLowerTerm we just need to Array.map lowerCaseTerm instead, and the matchedFilters will be a bit more difficult. Our previous version was pretty simple:

let matchedFilters term = filters |> List.filter (term |> transformToLowerTerm |> (=))

Here, we have to think about what needs done now. We have multiple terms being matched to multiple filters, as such we have a significant rewrite to do, but only on the List.filter portion. We'll modify the filter to filter each term against the current filter, and then test that we got any matches. If we did, any form of the term matched our filter.

let matchedFilters term =
    filters
    |> List.filter (fun filter ->
        let transformations = term |> transformToLowerTerm
        let anyMatch = transformations |> Array.filter ((=) filter)
        anyMatch |> Array.length |> any)

That should be all we need to modify, we can test our transformToLowerTerm with the following:

terms |> List.map (transformToLowerTerm >> Array.map String.fromCodePoints)

Small bugs begin to appear

We do happen to have a bug in this, and I'll show you what it is quick.

In the case that we add a confusable, we always ignore the original character. This can be problematic if we add confusables for the regular ASCII characters, such as o, p, or e. To fix this, we simply add Array.append [|[|codePoint|]|] after our Array.map (fun x -> x.Replacement). This will include the original character as well.

We can test all this out on our real data-set, by changing our confusables and filters:

let confusables2 =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

let confusables =
    [|{ Original = ("ℕ", 0) ||> Char.toCodePoint; Replacement = "m" |> String.toCodePoints |> Seq.toArray; Field3 = "" }|]
    |> Array.append confusables2

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)

let filters = ["mope"; "fail"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝒆"] |> listToCodePoints

This should result in matchedTerms being val matchedTerms : string list = ["ℕope"; "𝕱ail"; "𝓕ail"] instead of the whole list, if we change mope to nope we still get the first "ℕope", which proves everything works properly.

Great work, team!

Overall, we should have learned that if we structure our code well, we can add new functionality without having to do too much modification. We can easily add a whole new feature without creating much of a problem at all.


The entire goal of this lesson is to show how we can add substantial functionality to our software without doing major rewrites. It also helps demonstrate how we should be thinking about problems as we solve them. Work from a small issue up to the larger one, and build independent stages to allow easy troubleshooting, testing and integration of them. This wasn't a long lesson, but it's just as important as any other. (I've been swamped at work all week thanks to the ball being dropped by an API we consume, so haven't had a lot of time to get a good post together. Hopefully this meets everyone's expectations.)

Getting started with programming and getting absolutely nowhere (Part 14)

Starting our next problem

Lesson 13: Currying (and not the 'Steph' kind)

As expected, I want to continue on to the next issue we're going to solve. One of the many things I deal with on a daily basis is Spam Filtering, which we use a third-party provider for, as well as the built-in spam filtering of our Email system. The problem, of course, ends up being that spammers have gotten really good at getting around these filters.

Now, being who I am, I'm going to share some information with you on ways we can help alleviate this issue. One of the first things we'll do is replace all the invalid / unicode code-points that are out of our normal processing scope. What do I mean by this? Well let's see what we're talking about.

All 'text' has an encoding

Let's discuss this for a moment. All text we see has an "encoding" in a computer system — a way to map a visual character with a number. There are many encodings out there, but some of the most popular are detailed below:

  • ASCII: this was one of the first encodings used on text. In this encoding we map 128 different characters to the numbers 0 through 127, using 7 bits of space. We have since built an encoding known as "Extended ASCII", which maps an additional 128 characters to the 128 - 255 space using a full 'byte' (8-bits).
  • EBCDIC: this was an encoding invented by IBM for it's main-frame systems. This maps characters in a similar manner as ASCII, but it does so with a different 'table'. (A 'table' simply being a matrix of bit combinations to the appropriate character/signal.)
  • UTF-32: this maps all the unicode code-points to a 32-bit number, which indicates what code-point should be rendered.
  • UTF-16: this maps all the unicode code-points to a 16-bit number, which may also have a second 16-bit number after it. The .NET framework internally uses this format, so we're going to spend a lot of interaction with it, specifically.
  • UTF-8: this maps all the unicode code-points to an 8-bit number, which may also have additional 8-bit numbers after it. This is one of the most common encodings, and is backwords compatible with ASCII (0-127). You'll find this is used a lot on the web, and in cross-platform solutions because it has a somewhat low memory footprint (requires a minimum of one byte per character, up to 6 bytes per character if I recall correctly), and it's easy to work with.

I'm not going to go into detail on this subject, but you need to understand that there are many encodings, and they all mean stuff. This is a hugely immense topic, and there is a good explanation here. We just need to know that this means there are many characters to be represented, and we want to account for that.

Now Email can use any of these encodings (it's usually specified in the body somewhere), but we don't actually need to know that part. We'll actually allow the .NET API to pass us a string to a function, so knowing the encoding is less important because .NET will convert it to UTF-16. Here's what we really need to know:

  • Most characters that we see are a single UTF-16 value, something between 0 and 65535;
  • Some characters are two UTF-16 values, with each being a "high surrogate" or a "low surrogate";
  • If we convert the string to strictly UTF-32, each character is a single integer value;

I've actually done some of the hard part of mapping the characters to code-points (actually, I did all of the hard part), which is evidenced at this repository, under the String.fs file, notably toCodePoints and fromCodePoints. There is also a toCodePoint and fromCodePoint in the Char.fs file. The basics of these functions are noted below:

/// Returns the character(s) represented by the unicode code-point.
let fromCodePoint = System.Char.ConvertFromUtf32
/// Returns the unicode code-point represented by the string at the i'th position.
let toCodePoint (s : string) (i : int) = (s, i) |> System.Char.ConvertToUtf32

/// Converts the string to an array of Int32 code-points (the actual Unicode Code Point number).
let toCodePoints (str : string) : int seq =
    let mapper i c =
        // Ignore the low-surrogate because it's already been converted
        if System.Char.IsLowSurrogate(c) then None
        else i |> Char.toCodePoint str |> Some
    str
    |> Seq.mapi mapper
    |> Seq.choose id

/// Converts the array of Int32 code-points (the actual Unicode Code Point number) to a string.
let fromCodePoints =
    Array.map (Char.fromCodePoint >> explode)
    >> Array.flatten
    >> implode

This also uses my Array.flatten function, which we talked about in a previous blog post.

Ok, so what's our goal?

So the first step in this whole thing is to define the problem. Let's consider we want to flag emails which have certain terms in them, for ease of use we'll define a sample of basic words:

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)
let filters = ["nope"; "fail"] |> listToCodePoints

We want to flag the words nope and fail. Now, ordinarily this would be easy, because we can match the word with the filter. The problem here is that the word can have 'obsfucated' or 'confusable' characters, such as ℕope, 𝑵ope, ռope, etc. We can also have 𝕱ail, and 𝓕ail, etc. So, the initial issue is that each of these strings looks like what we want to match, but the computer has no way of knowing that. So we'll define some tests:

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)
let filters = ["nope"; "fail"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝓮"] |> listToCodePoints

We want a process to find what "filters" each term matched. The easiest way is to build a function:

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

And then define let transformToLowerTerm = id. Right? We don't have a function for that yet, but we know it'll take a string and return one, so return the same thing (this let's us start building our function chain). Now, if we define a few more helpers, we get the actual transformation:

let transformToLowerTerm = id

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

let any = (<) 0

let matchedTerms = terms |> List.filter (matchedFilters >> List.length >> any) |> List.map String.fromCodePoints

We should only get val matchedTerms : string list = ["nope"], just one so far. Now we need to define the transformation to let us compare our strings. For this, the Unicode Consortium gives us a list of "confusable" characters, under the Security section, which has a confusables.txt file. This file has a pretty simple CSV format, any # indicates a comment for the remainder of the line (ignore it, basically), and each field is separated by a semicolon (;). There are also tabs included, but we can strip those with String.Trim.

Loading the data from the Unicode Consortium

This file we're given by the Unicode Consortium is actually really easy to read with F#, about 20 lines or so:

type UnicodeConfusables = { Original : int; Replacement : int array; Field3 : string }

let mapToConfusable =
    function
    | [|a; b; c|] ->
        { Original = a |> String.trim |> Int.fromHex
          Replacement = b |> String.trim |> String.split ' ' |> Array.map Int.fromHex
          Field3 = c |> String.trim } |> Some
    | _ -> None

let file = "http://www.unicode.org/Public/security/10.0.0/confusables.txt"
let data =
    use wc = new WebClient()
    wc.DownloadString(file)

let confusables =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

So now we're down to building a proper transformToLowerTerm function. I've already given away what it should do by the name, so you should try your hand at building it first. Here's a hint: the transform term is simply going to map any of the confusable characters (as deemed by having a presence in the confusables array with the same Original code-point value), and the lower-case term is simply going to take anything A-Z and change it to a-z.

Once you're done, you should have two functions which we can compose into the transformToLowerTerm:

let transformToLowerTerm = transformTerm >> lowerCaseTerm

So if you've done it right, then the entire block here should compile:

let transformToLowerTerm = transformTerm >> lowerCaseTerm

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

Answer time

Here are the functions I've defined, you don't have to use them but you should be able to reproduce the same thing with them, thus for any given input, my version of lowerCaseTerm should return the same as yours, and any given input to transformTerm should return the same as yours:

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|codePoint|]
        | candidates -> candidates.[0].Replacement)
    >> Array.flatten

let A = 'A' |> int
let Z = 'Z' |> int
let a = 'a' |> int

let lowerCaseTerm = Array.map (function | c when c >= A && c <= Z -> c + (a - A) | c -> c)

You'll notice that everything is pretty simple, possible with the exception of the math in lowerCaseTerm, but in the ASCII encoding (and, conveniently, the UTF-16 encoding) the characters A-Z are sequential, starting at value 65 (0x41). The characters a-z are also sequential, starting at 97 (0x61). Since our characters are already integers, we could have compared directly to >= 0x41 and <= 0x5A, but it's easier to follow if we define constants that hold the values there. The int function will return the integer value of the input character, so when we do int 'A', we get 65. (You can try this in F# Interactive to see what I mean.)

So, for our math, we just need to test that the character is in the range A-Z (defined by c >= A && c <= Z, and if it is return c + (a - A), which will take c and add the difference between the lower-case a to the upper-case A. This should always be 32 or 0x20. (Since each lower-case character is exactly 32-positions ahead of the upper-case one, c + 32 is always the lower-case character if we're an upper-case character.)

At this point we've done everything required to map the confusables for a word, so we'll look at the entire source and see what can or can't be done to improve it quick, then call it a day.

type UnicodeConfusables = { Original : int; Replacement : int array; Field3 : string }

let mapToConfusable =
    function
    | [|a; b; c|] ->
        { Original = a |> String.trim |> Int.fromHex
          Replacement = b |> String.trim |> String.split ' ' |> Array.map Int.fromHex
          Field3 = c |> String.trim } |> Some
    | _ -> None

let file = "http://www.unicode.org/Public/security/10.0.0/confusables.txt"
let data =
    use wc = new WebClient()
    wc.DownloadString(file)

let confusables =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)

let filters = ["nope"; "fail"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝓮"] |> listToCodePoints

let findCandidates codePoint = confusables |> Array.filter (fun x -> x.Original = codePoint)

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|codePoint|]
        | candidates -> candidates.[0].Replacement)
    >> Array.flatten

let A = 'A' |> int
let Z = 'Z' |> int
let a = 'a' |> int

let lowerCaseTerm = Array.map (function | c when c >= A && c <= Z -> c + (a - A) | c -> c)

let transformToLowerTerm = transformTerm >> lowerCaseTerm

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

let any = (<) 0

let matchedTerms = terms |> List.filter (matchedFilters >> List.length >> any) |> List.map String.fromCodePoints
terms |> List.map (transformToLowerTerm >> String.fromCodePoints)

Right off the bat I had an idea to eliminate some of the complexities of lowerCaseTerm, let's see where it goes.

Let's look at what we want to do again: if it's a character between 'A' and 'Z' we want to replace it with the lower-case (add 32) version.

let lowerCaseTerm =
    let uppers = ['A'..'Z'] |> List.map int
    Array.map (fun c ->
        if uppers |> List.contains c then c + 32
        else c)

Now this may not be shorter, but there's no more complicated math (since we've hardcoded 32, we could replace that with our a - A again, but instead we'll just add the 32. Of course, we can also take advantage of the framework here, and use System.Char.IsUpper, but that works specifically on the char. We could convert our int to char, but then there is a possibility of converting an invalid value. For this reason, we'll stick with one of the first two options. We could also rewrite the Array.map as:

Array.map (function | c when uppers |> List.contains c -> c + 32 | c -> c)

Which does the same thing, but with a quick (pointless) pattern-match. Again, what you desire to use here is your choice, and I am simply offering different alternatives that do the same thing. This is to demonstrat that you can often do the same thing in many ways with F#. As always, I recommend writing it in the way that is most clear and concise. Don't be clever, write it in a way that you (and others) can understand it.

Once we're done our output should look something like the following:

val matchedTerms : string list =
  ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"]
val it : string list =
  ["nope"; "nope"; "nope"; "nope"; "fail"; "fail"; "pass"; "true"; "true"]

I've been running this in F# Interactive, so it automatically assigns terms |> List.map (transformToLowerTerm >> String.fromCodePoints) to it. Otherwise, everything ought to make sense: the first list is just terms that look like nope or fail, and the second is all terms transformed to lower-case.

Where we (could) fail

One of the biggest issues, that might not be immediately apparent, with how we do this is: what happens if we have multiple confusables for a character? We only transform the first one, which means we could be missing out on some possible conversions. We'll go over fixing this in the next lesson, but I want you to start thinking about it now. Essentially, our goal will be to swap out the transformTerm with a version that returns every possible option (which for each of our samples is still only 1).


Overall we should be heading down a new track this time, we won't be using a lot of the old stuff we learned, because we now need to think about a new type of problem. This next section of this series will be bringing us into another domain, which we will use to further define our knowledge. I want you to continue to explore these ideas before I bring the series to them, that way you can start testing your knowledge.

Getting started with programming and getting absolutely nowhere (Part 13)

Currying (and not the 'Steph' kind)

Lesson 12: How do we become experts?

In this lessson we're going to talk about currying (and I do mean neither the food nor the basketball player), which is one of the most complex topics in F#. The F# designers (shout out to Don Syme, Phillip Carter, and many others) have done a wonderful job of implementing this very difficult and complex idea, and it's important that we, as F# developers, understand how it works.

Currying is the idea that a function can rewrite itself to support having only one parameter. This is hugely important, and I haven't yet clarified this in our adventure, so I want to do so here. My friend Chris asked for this discussion on Twitter, and I feel absolutely obliged to make it happen.

I want to start this chat off with a quick analogy: you're standing at the checkout line in the grocery store, you start putting all your items on the belt, but the person running the register can only scan one item at a time. Since they want to keep things going quickly, they scan your items in the same order you put them on in. This is similar to currying, where the person who is running the register would be the 'main function' you are calling, but due to the ability to only handle one thing at a time, they take one, handle it, then move to the next one. Now part of the way through you decided you want a 'subtotal', they stop taking items, and deliver the result at that moment to you. This is the partial application of functions.

This analogy is what I want to liken to currying, this is a very difficult concept to understand, and the designers and engineers of F# have done a wonderful job making it happen, they really have. I really want you to have a clear understanding of how currying works, and why we do it. F# makes this extremely trivial: we're going to use the example of summing a list of integers with the List.sumBy method, rather than List.sum. We'll use this to demonstrate what exactly happens with currying.

To start, we'll define a sumBy1 function, which takes a (a list), and computes List.sumBy with a map (projection function). This is a simple task, simply pass map to the List.sumBy function, then pass a.

let sumBy1 map a =
    a |> List.sumBy map

Next, we want to 'curry' this function. As I mentioned before, 'currying' is the process of rewriting the function so that it has one parameter, not two. Here, we do this by pushing a to an inner function, and then returning that inner function. We effectively move the a parameter out of the main function, and into an inner function. If you look at the type signature of this, you'll see it's an ('a -> int) -> 'a list -> int, which our previous function was as well. This should help introduce the 'partial application' process I mentioned before, where we can run just one function and get a result of a partial applied function. (Here, we could call sumBy2 with a map, and not an a, and get the 'partially applied' version, we can also do that with the previous version: sumBy1.)

Another good point to discuss here is that signature. The a -> b -> c signature that we're seeing on these functions tells us how F# curried it. Each of the arrows is a new result, it's a return. So a -> b -> c takes an 'a', and returns a b -> c, which is a function. So it takes the first input, then returns a function which can take the second input, and then returns the final result. This should begin to make it clear why we are discussing currying and partial application together: they're directly related.

We can write a pre-curried version of our sumBy1 function, which we'll call sumBy2, and have it posess the same ('a -> int) -> 'a list -> int signature.

let sumBy2 map =
    let inner a =
        a |> List.sumBy map
    inner    

Here's an interesting observation: we magically created a new variable within our function, we never got passed an a, but we managed to create it out of nowhere. It just appeared. That's how currying works. It doesn't care about the parent function at all, it's concerned about the individual function currently in scope. We can create variables at will, and gather a whole new featureset, we don't have to even worry of the parent function.

Now with this sumBy2, it should be obvious that inner can be rewritten again to eliminate the inner function altogether, and we can make an actual function that is the inner function, as seen below:

let sumBy3Inner map = map |> List.sumBy

What this introduces us to is composition, realistically, we could write let sumFn = sumBy3Inner id, for example, to build the partially applied version of our function. What should be interesting here is that we're piping the map, even though we're not executing the function. We're simply building it up. We could call a |> sumBy3Inner id, but we're going to build a sumBy3 to further compose the chain.

let sumBy3 a = a |> sumBy3Inner

Now we get into some complex currying. We want to write our function such that the arguments are reversed: we take the list first, then the function second. This leaves us with the sumBy4 example, but it's more difficult than that, we need to find a way to curry it out.

let sumBy4 a map =
    a |> List.sumBy map

Curiously, the solution to this is actually the exact same solutions as we did for sumBy2, but instead of inner a, we'll build inner map. The same partial application aspects apply, and we've built a whole new function inside the previous one. The issue here, is that we can't abstract the inner function out, it's just not possible.

let sumBy5 a =
    let inner map =
        a |> List.sumBy map
    inner

Now if we collect the sum of each of these functions, we find that each of the results is a single int - this tells me our functions probably work, because we're working with very different types. We can also, of course, run this in F# interactive and prove it to ourselves, which we'll do.

let sum1 = [1..100] |> sumBy1 id
let sum2 = [1..100] |> sumBy2 id
let sum3 = [1..100] |> sumBy3 id
let sum4 = id |> sumBy4 [1..100]
let sum5 = id |> sumBy5 [1..100]

So all five of our values here result in the same thing, an int of the value 5050. This demonstrates that currying has no net effect on the result. The next question you might have is how does this apply to more than two parameters? The same as it applies to just two:

let fnWith3Params a b c =
    a + b + c

let curriedFnWith3Params a =
    let innerFn1 b =
        let innerFn2 c =
            a + b + c
        innerFn2
    innerFn1

We just recursively apply the currying. The most inner function returns the full result, and each outer function returns it's inner. This is all the magic, which can hardly be classified as magic. Now when we talk about partial application, there's no magic here either. Now that the function is curried, we can pretty cleanly explain partial application.

When we partially apply we don't do anything particularly abstract, we actually do some pretty standard stuff. With our curriedFnWith3Params example, when we let tempFn = curriedFnWith3Params 1, we simply get innerFn1 with the innerFn2 value a pre-applied. When we let tempFn = tempFn 2, we get innerFn2 with a and b pre-applied (or baked-in), then when we call tempFn 3, we get the result of that which is 6. That's it! No magic!

let add1 = 1 + 2 + 3
let add2 = fnWith3Params 1 2 3
let add3 = curriedFnWith3Params 1 2 3

As previously demonstrated, our three-parameter function results are all the same. We've proven that we get the same result, regardless of which form we use to write it, we have also proven that currying can be done, and it can easily be done, and we can even abstract the curried functions out of the main function and reuse them.

By now you should have a basic idea of how this comes in to play with composition. It should be somewhat clear that when we build a composition chain, we're essentially asking the compiler to build a new function that is a very straightforward chain of the others. These can even be curried as well.

At this point, you may be asking yourself: "why is this important, why do I need to understand this?" There's a very long, drawn out and contrived answer to this, so instead, I'm going to be realistic: you probably don't need to know the drawn-out details of currying, but here's what you do need to know:

  • A function has exactly one parameter, so if it looks like it has more than one, it really doesn't. Each parameter produces a new function with exactly one parameter, in the case of multi-parameter functions.
  • When we call one of these functions, we are actually calling into a tree of functions, which means we can stop at any branch and get the current "prepared" chain.
  • A comma-separated list of arguments, and a space-separated (with no comma) list of arguments are different: one builds a set of tupled parameters, the other builds a set of functions.

For a developer coming from C#, C++, VB.NET, PHP, Java, or JavaScript (or any other language where you separate parameters/arguments by commas), this might seem weird, but when you do that in F# it's considered one argument. A function defined let fn (p1, p2, p3) is not the same as a function defined let fn p1 p2 p3: the first function takes a single parameter that has three parts, and the second function is three curried functions each taking a single parameter for it's part.

Interestingly, we can apply the aspect of currying to our own design patterns, instead of needing all the parameters up-front, we can build functions that take some of the parameters and then return a new function to take the rest. This let's us maintain SRP (Single-Responsibility Principle) in F#, but in a totally new fashion. We don't need to take a parameter that's currently useless, we can leave that to the function it's being handled by. We can return a new function, which is natively built-in to the language, so our users don't have any idea that we didn't need the new parameter initially.

I actually demonstrated this the other day, in the example of bad code I wrote. When I have the match p |> getSpecificity with, I return one of two functions. Both of those functions have the same signature, so I could Have just embedded them in the match. Instead, I took advantage of function currying, so the API only has what is needed, and comes up with new parameters on the fly.

Now I could go on explaining away how this works, I really could. I've been working with this for some time now (48 hours, in fact), and I could tell you another 2-3 pages of storytime. Instead, I want to direct you to one of Scott Wlaschin's references, F# For Fun and Profit, particularly the currying section, as well as the partial application section.


I hope this was useful for you, and I hope that even though this was long winded (100+ lines of explanation for 30 lines of code) you still learned something. My first (and only, really) goal with this series is to help guide you to a new path of learning, I want to share what I know with you, with the hopes that one day you'll share what you know with the rest of us.