using Programming;

A blog about some of the intricacies of programming, and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 14)

Starting our next problem

Lesson 13: Currying (and not the 'Steph' kind)

As expected, I want to continue on to the next issue we're going to solve. One of the many things I deal with on a daily basis is Spam Filtering, which we use a third-party provider for, as well as the built-in spam filtering of our Email system. The problem, of course, ends up being that spammers have gotten really good at getting around these filters.

Now, being who I am, I'm going to share some information with you on ways we can help alleviate this issue. One of the first things we'll do is replace all the invalid or unusual Unicode code-points that are outside our normal processing scope. What do I mean by this? Well, let's see what we're talking about.

All 'text' has an encoding

Let's discuss this for a moment. All text we see has an "encoding" in a computer system — a way to map a visual character with a number. There are many encodings out there, but some of the most popular are detailed below:

  • ASCII: this was one of the first encodings used for text. It maps 128 different characters to the numbers 0 through 127, using 7 bits of space. We have since built an encoding known as "Extended ASCII", which maps an additional 128 characters to the 128-255 range using a full 'byte' (8 bits).
  • EBCDIC: this was an encoding invented by IBM for its mainframe systems. It maps characters in a similar manner to ASCII, but it does so with a different 'table'. (A 'table' simply being a matrix mapping bit combinations to the appropriate character/signal.)
  • UTF-32: this maps all the unicode code-points to a 32-bit number, which indicates what code-point should be rendered.
  • UTF-16: this maps all the unicode code-points to a 16-bit number, which may also have a second 16-bit number after it. The .NET framework uses this format internally, so we're going to interact with it a lot.
  • UTF-8: this maps all the unicode code-points to an 8-bit number, which may also have additional 8-bit numbers after it. This is one of the most common encodings, and is backwards compatible with ASCII (0-127). You'll find it used a lot on the web, and in cross-platform solutions, because it has a somewhat low memory footprint (a minimum of one byte per character, and up to four bytes per character under the current standard; the original design allowed up to six), and it's easy to work with.
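To make those size trade-offs concrete, the .NET Encoding classes can report how many bytes each encoding needs for the same string. A small sketch (System.Text ships with the framework, nothing extra needed):

```fsharp
open System.Text

// 'é' is a single code-point (U+00E9), but its byte cost differs by encoding.
let sample = "Héllo"

let utf8Bytes  = Encoding.UTF8.GetByteCount(sample)    // 6: 'é' needs two bytes in UTF-8
let utf16Bytes = Encoding.Unicode.GetByteCount(sample) // 10: two bytes per UTF-16 unit
let utf32Bytes = Encoding.UTF32.GetByteCount(sample)   // 20: four bytes per code-point

printfn "UTF-8: %d, UTF-16: %d, UTF-32: %d" utf8Bytes utf16Bytes utf32Bytes
```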

I'm not going to go into detail on this subject, but you need to understand that there are many encodings, and each one defines its own mapping between numbers and characters. This is an immense topic, and there is a good explanation here. We just need to know that there are many characters to be represented, and we want to account for that.

Now Email can use any of these encodings (it's usually specified in the message somewhere), but we don't actually need to know that part. We'll let the .NET API hand our function a string, so knowing the original encoding is less important: .NET will have already converted it to UTF-16. Here's what we really need to know:

  • Most characters that we see are a single UTF-16 value, something between 0 and 65535;
  • Some characters are two UTF-16 values, with each being a "high surrogate" or a "low surrogate";
  • If we convert the string to strictly UTF-32, each character is a single integer value;
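A quick F# Interactive check makes the surrogate-pair point concrete: an emoji outside the Basic Multilingual Plane occupies two UTF-16 units, yet converts to a single UTF-32 integer:

```fsharp
// U+1F600 (😀) is above U+FFFF, so .NET stores it as two UTF-16 units.
let smiley = "\U0001F600"

printfn "%d" smiley.Length                              // 2 units, one visible character
printfn "%b" (System.Char.IsHighSurrogate smiley.[0])   // true
printfn "%b" (System.Char.IsLowSurrogate smiley.[1])    // true
printfn "%d" (System.Char.ConvertToUtf32(smiley, 0))    // 128512 = 0x1F600
```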

I've actually done some of the hard part of mapping the characters to code-points (actually, I did all of the hard part), which you can see at this repository, in the String.fs file, notably toCodePoints and fromCodePoints. There is also a toCodePoint and fromCodePoint in the Char.fs file. The basics of these functions are noted below:

/// Returns the character(s) represented by the unicode code-point.
let fromCodePoint = System.Char.ConvertFromUtf32
/// Returns the unicode code-point represented by the string at the i'th position.
let toCodePoint (s : string) (i : int) = (s, i) |> System.Char.ConvertToUtf32

/// Converts the string to an array of Int32 code-points (the actual Unicode Code Point number).
let toCodePoints (str : string) : int seq =
    let mapper i c =
        // Ignore the low-surrogate because it's already been converted
        if System.Char.IsLowSurrogate(c) then None
        else i |> Char.toCodePoint str |> Some
    str
    |> Seq.mapi mapper
    |> Seq.choose id

/// Converts the array of Int32 code-points (the actual Unicode Code Point number) to a string.
let fromCodePoints =
    Array.map (Char.fromCodePoint >> explode)
    >> Array.flatten
    >> implode

This also uses my Array.flatten function, which we talked about in a previous blog post.
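If you don't have that post handy, minimal stand-ins for those helpers might look like the following. These definitions are my assumptions based on how the functions are used here; the real ones live in the blog's repository:

```fsharp
module Array =
    /// Flattens an array of arrays into a single array.
    let flatten (xs : 'a array array) : 'a array = Array.concat xs

/// Explodes a string into an array of its chars.
let explode (s : string) : char array = s.ToCharArray()

/// Implodes an array of chars back into a string.
let implode (cs : char array) : string = System.String(cs)
```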

Ok, so what's our goal?

So the first step in this whole thing is to define the problem. Let's consider we want to flag emails which have certain terms in them, for ease of use we'll define a sample of basic words:

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)
let filters = ["nope"; "fail"] |> listToCodePoints

We want to flag the words nope and fail. Now, ordinarily this would be easy, because we can match the word with the filter. The problem here is that the word can have 'obfuscated' or 'confusable' characters, such as ℕope, 𝑵ope, ռope, etc. We can also have 𝕱ail, and 𝓕ail, etc. So the initial issue is that each of these strings looks like what we want to match, but the computer has no way of knowing that. So we'll define some tests:

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)
let filters = ["nope"; "fail"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝓮"] |> listToCodePoints

We want a process to find what "filters" each term matched. The easiest way is to build a function:

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

And then define let transformToLowerTerm = id. Right? We don't have a function for that yet, but we know it'll take a code-point array and return one, so for now it just returns its input (this lets us start building our function chain). Now, if we define a few more helpers, we get the actual transformation:

let transformToLowerTerm = id

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

let any = (<) 0 // partially applied: any n = 0 < n

let matchedTerms = terms |> List.filter (matchedFilters >> List.length >> any) |> List.map String.fromCodePoints

We should only get val matchedTerms : string list = ["nope"], just one so far. Now we need to define the transformation that lets us compare our strings. For this, the Unicode Consortium gives us a list of "confusable" characters, under the Security section, which has a confusables.txt file. This file has a pretty simple delimited format: any # indicates a comment for the remainder of the line (ignore it, basically), and each field is separated by a semicolon (;). There are also tabs included, but we can strip those with String.Trim.
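As an illustration of that format (this sample line is reconstructed from the file's documented shape; check the real file for exact entries), parsing a single entry needs only a comment strip, a split on ';', and a trim:

```fsharp
// A line in the shape confusables.txt uses: source ; replacement(s) ; type # comment
let line = "1D41A ;\t0061 ;\tMA\t# MATHEMATICAL BOLD SMALL A -> LATIN SMALL LETTER A"

let fields =
    line.Split('#').[0].Split(';')     // drop the comment, then split the fields
    |> Array.map (fun f -> f.Trim())   // strip the spaces and tabs

// fields = [| "1D41A"; "0061"; "MA" |]
let original    = System.Convert.ToInt32(fields.[0], 16)  // 0x1D41A
let replacement = System.Convert.ToInt32(fields.[1], 16)  // 0x61, i.e. 'a'
```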

Loading the data from the Unicode Consortium

This file we're given by the Unicode Consortium is actually really easy to read with F#, about 20 lines or so:

type UnicodeConfusables = { Original : int; Replacement : int array; Field3 : string }

let mapToConfusable =
    function
    | [|a; b; c|] ->
        { Original = a |> String.trim |> Int.fromHex
          Replacement = b |> String.trim |> String.split ' ' |> Array.map Int.fromHex
          Field3 = c |> String.trim } |> Some
    | _ -> None

let file = "http://www.unicode.org/Public/security/10.0.0/confusables.txt"
let data =
    // WebClient lives in System.Net, so we need an `open System.Net` in scope
    use wc = new WebClient()
    wc.DownloadString(file)

let confusables =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

So now we're down to building a proper transformToLowerTerm function. I've already given away what it should do by the name, so you should try your hand at building it first. Here's a hint: the transform step is simply going to replace any of the confusable characters (those with an entry in the confusables array whose Original matches the code-point), and the lower-case step is simply going to take anything A-Z and change it to a-z.

Once you're done, you should have two functions which we can compose into the transformToLowerTerm:

let transformToLowerTerm = transformTerm >> lowerCaseTerm

So if you've done it right, then the entire block here should compile:

let transformToLowerTerm = transformTerm >> lowerCaseTerm

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

Answer time

Here are the functions I've defined. You don't have to use them, but you should be able to reproduce the same behavior: for any given input, my version of lowerCaseTerm should return the same as yours, and likewise for transformTerm:

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|codePoint|]
        | candidates -> candidates.[0].Replacement)
    >> Array.flatten

let A = 'A' |> int
let Z = 'Z' |> int
let a = 'a' |> int

let lowerCaseTerm = Array.map (function | c when c >= A && c <= Z -> c + (a - A) | c -> c)

You'll notice that everything is pretty simple, possibly with the exception of the math in lowerCaseTerm, but in the ASCII encoding (and, conveniently, in UTF-16) the characters A-Z are sequential, starting at value 65 (0x41). The characters a-z are also sequential, starting at 97 (0x61). Since our characters are already integers, we could have compared directly to >= 0x41 and <= 0x5A, but it's easier to follow if we define constants that hold the values. The int function returns the integer value of the input character, so int 'A' gives us 65. (You can try this in F# Interactive to see what I mean.)

So, for our math, we just need to test that the character is in the range A-Z (defined by c >= A && c <= Z), and if it is, return c + (a - A), which adds to c the difference between the lower-case a and the upper-case A. This is always 32, or 0x20. (Since each lower-case character is exactly 32 positions ahead of the upper-case one, c + 32 is always the lower-case character when c is an upper-case character.)
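Spelled out as a tiny F# Interactive sanity check (nothing new here, just the arithmetic above):

```fsharp
// 'a' - 'A' = 97 - 65 = 32, the fixed distance between the cases.
let offset = int 'a' - int 'A'   // 32

// 'N' is 78; 78 + 32 = 110, which is 'n'.
let lowered = char (int 'N' + offset)
printfn "%c" lowered             // n
```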

At this point we've done everything required to map the confusables for a word, so we'll look at the entire source, see what can or can't be done to improve it quickly, then call it a day.

open System.Net

type UnicodeConfusables = { Original : int; Replacement : int array; Field3 : string }

let mapToConfusable =
    function
    | [|a; b; c|] ->
        { Original = a |> String.trim |> Int.fromHex
          Replacement = b |> String.trim |> String.split ' ' |> Array.map Int.fromHex
          Field3 = c |> String.trim } |> Some
    | _ -> None

let file = "http://www.unicode.org/Public/security/10.0.0/confusables.txt"
let data =
    use wc = new WebClient()
    wc.DownloadString(file)

let confusables =
    data
    |> String.split '\n'
    |> Seq.filter (String.startsWith "#" >> not)
    |> Seq.map (Seq.takeWhile ((<>) '#') >> Seq.toArray >> String.implode >> String.split ';')
    |> Seq.choose mapToConfusable
    |> Seq.toArray

let listToCodePoints = List.map (String.toCodePoints >> Seq.toArray)

let filters = ["nope"; "fail"] |> listToCodePoints
let terms = ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"; "pass"; "𝕿rue"; "𝓽𝓻𝓾𝓮"] |> listToCodePoints

let findCandidates codePoint = confusables |> Array.filter (fun x -> x.Original = codePoint)

let transformTerm =
    Array.map (fun codePoint ->
        match codePoint |> findCandidates with
        | [||] -> [|codePoint|]
        | candidates -> candidates.[0].Replacement)
    >> Array.flatten

let A = 'A' |> int
let Z = 'Z' |> int
let a = 'a' |> int

let lowerCaseTerm = Array.map (function | c when c >= A && c <= Z -> c + (a - A) | c -> c)

let transformToLowerTerm = transformTerm >> lowerCaseTerm

let matchedFilters term =
    let fn = term |> transformToLowerTerm |> (=)
    filters |> List.filter fn

let any = (<) 0 // partially applied: any n = 0 < n

let matchedTerms = terms |> List.filter (matchedFilters >> List.length >> any) |> List.map String.fromCodePoints
terms |> List.map (transformToLowerTerm >> String.fromCodePoints)

Right off the bat I had an idea to eliminate some of the complexities of lowerCaseTerm, let's see where it goes.

Let's look at what we want to do again: if it's a character between 'A' and 'Z' we want to replace it with the lower-case (add 32) version.

let lowerCaseTerm =
    let uppers = ['A'..'Z'] |> List.map int
    Array.map (fun c ->
        if uppers |> List.contains c then c + 32
        else c)

Now this may not be shorter, but there's no more complicated math. (Since we've hardcoded 32, we could replace that with our a - A again, but instead we'll just add the 32.) Of course, we can also take advantage of the framework here and use System.Char.IsUpper, but that works specifically on char. We could convert our int to char, but then there is a possibility of converting an invalid value. For this reason, we'll stick with one of the first two options. We could also rewrite the Array.map as:

Array.map (function | c when uppers |> List.contains c -> c + 32 | c -> c)

Which does the same thing, but with a quick (pointless) pattern-match. Again, which one you use here is your choice; I am simply offering different alternatives that do the same thing, to demonstrate that you can often do the same thing in many ways with F#. As always, I recommend writing it in the way that is most clear and concise. Don't be clever; write it in a way that you (and others) can understand.

Once we're done our output should look something like the following:

val matchedTerms : string list =
  ["ℕope"; "𝑵ope"; "ռope"; "nope"; "𝕱ail"; "𝓕ail"]
val it : string list =
  ["nope"; "nope"; "nope"; "nope"; "fail"; "fail"; "pass"; "true"; "true"]

I've been running this in F# Interactive, so it automatically assigns terms |> List.map (transformToLowerTerm >> String.fromCodePoints) to it. Otherwise, everything ought to make sense: the first list is just terms that look like nope or fail, and the second is all terms transformed to lower-case.

Where we (could) fail

One of the biggest issues with how we do this, and it might not be immediately apparent, is: what happens if we have multiple confusables for a character? We only apply the first one, which means we could be missing out on some possible conversions. We'll go over fixing this in the next lesson, but I want you to start thinking about it now. Essentially, our goal will be to swap out transformTerm for a version that returns every possible option (which, for each of our samples, is still only one).


Overall we should be heading down a new track this time; we won't be using a lot of the old stuff we learned, because we now need to think about a new type of problem. The next section of this series will bring us into another domain, which we will use to further expand our knowledge. I want you to continue to explore these ideas before the series gets to them, so that you can start testing your knowledge.
