using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 10)

Building some 'fuzzy logic' for our names

Lesson 9: Knowing the terminology is important

If you recall, in lesson 7 I mentioned the following:

We'll then (at a later date) explore how we can modify it to account for multi-part last names, such as LA ROUSE, or MAC GREGOR. This is a lot harder than it looks, and we'll only worry about that for Sheet 2, because in Sheet 1 it's always separated out.

Today, we're going to explore this part of our topic - we'll build some 'fuzzy logic' that doesn't seem fuzzy, but happens to be. Of course, first we need to define 'fuzzy logic', so let's do that.

What is 'fuzzy logic'?

Fuzzy logic is used fairly often in programming, it's essentially building a set of rules that sacrifice one of the following to make up for the other two: predictability, accuracy, and speed. Ours will sacrifice accuracy for speed and predictability. We're going to build a function that is mostly correct, but sometimes wrong, to allow us to solve the 95% scenario.

We need some "rules" for our fuzzy logic, so let's define them:

  • If a word is <= 3 characters, it's part of a "compound word";
    • A compound word should be grouped with the next word if possible;
  • If a word is a single character followed by a single period, in any quantity, it's an "initial";

In [lesson 8], we built splitNames:

let splitNames (name : string) =
    match ' ' |> name.Split with
    | [|first; last|] -> (Some first, None, Some last)
    | [|first; middle; last|] -> (Some first, Some middle, Some last)
    | _ -> (None, None, None)

This isn't fuzzy logic so much as bad logic: we just assume that two parts = first, last name, and three parts = first, middle, last name. We make absolutely no guarantee that there's even an apporpriate value there, for the name of John Mac Gregor, we get (Some John, Some Mac, Some Gregor). That's not right at all, for Rhonda La Rouse, we get (Some Rhonda, Some La, Some Rouse), again wrong.

So let's define portions of our 'fuzzy logic':

let isInitial : char array -> bool = (function | [|_; '.'|] -> true | _ -> false)

Good start, we test a char array for two elements of anything, '.'. We can test this easily: "a." |> Seq.toArray |> isInitial, which passes.

Now we need a function, isOnlyInitials which tells us if a string only consists of initials, i.e. D.W. should be true:

let isOnlyInitials : string -> bool =
    Seq.toArray
    >> Array.mapi (fun i x -> (x, i))
    >> Array.groupBy (fun (x, i) -> i / 2)
    >> Array.map snd
    >> Array.map (Array.map fst)
    >> Array.forall isInitial

This is an interesting function, because we use mapi to give us the index and element, then groupBy to put them into pairs, then map to only get the second element (x, i), then map (map fst) to give us just the first element from each sub-array. We're left with just forall isInitial, which is equivalent to Array.filter isInitial |> Array.length = 0. Simple, right?

Let's appy this to our existing function

So how do we apply this? We need to modify splitNames to account for the initial, and for the compound name. To do so, we'll get rid of the match, because we're now just going to work our fuzzy logic. First we need to get the parts:

let nameParts = ' ' |> name.Split

That's a good start, and it gives us an array of string objects representing each part. Previously this was all we needed, but now we need to put them into groupings based on our new rules.

Now we need to "fold" over our names, and collect them into an array of names that are based on our rules. For this we'll build a folder that takes a string list and then a string and returns a string list, with the first element being the last element tested.

The basic idea here is to match acc with prev::rem, and if prev is <= 3 characters, and it's not only initials, then our new prev should be prev part.

nameParts
|> Array.fold (fun acc pt ->
        match acc with
        | prev::rem
            when prev |> Seq.length <= 3
                && prev |> (isOnlyInitials >> not) ->
            (sprintf "%s %s" prev pt)::rem
        | _ -> pt::acc
    ) []

Make it work for Ron, darnit!

Pretty simple, right? So if we test it with a basic data-set: ["John Mac Gregor"; "Rhonda La Rouse"; "John C. Mac Gregor"; "Rhonda A. La Rouse"; "La Ronda Doe"] |> List.map splitNames;;, we can see that it returns what we expected. Now, we lose one particular case I ran into: "Do Ray Doe", which turns into (Some "Do Ray", None, Some "Doe"), when it should have been (Some "Do", Some "Ray", Some "Doe"). We also lost support for Ron Doe, which returns all None because Ron is paired with Doe. To fix this we'll alter our when guard clause:

let splitNames (name : string) =
    let isInitial : char array -> bool = (function | [|_; '.'|] -> true | _ -> false)
    let isOnlyInitials : string -> bool =
        Seq.toArray
        >> Array.mapi (fun i x -> (x, i))
        >> Array.groupBy (fun (x, i) -> i / 2)
        >> Array.map snd
        >> Array.map (Array.map fst)
        >> Array.forall isInitial
    let nameParts = ' ' |> name.Split
    let nameParts =
        nameParts
        |> Array.fold (fun acc pt ->
                match acc with
                | prev::rem
                    when prev |> Seq.length <= 3
                        && prev |> (isOnlyInitials >> not)
                        && rem |> List.length >= 1 ->
                    (sprintf "%s %s" prev pt)::rem
                | prev::rem
                    when prev |> Seq.length <= 2
                        && prev |> (isOnlyInitials >> not) ->
                    (sprintf "%s %s" prev pt)::rem
                | _ -> pt::acc
            ) []
        |> Array.ofList
        |> Array.rev
    match nameParts with
    | [|first; middle; last|] -> (Some first, Some middle, Some last)
    | [|first; last|] -> (Some first, None, Some last)
    | _ -> (None, None, None)

We also added a second prev::rem case, the first matches if there's already more than one other word with a length of <= 3 (so Ron Doe is fixed), the second matches any if the first word is <= 2 characters (working for La Ronda, but not properly for Bo David). Now we could add a filter for Bo to not match on the <= 2 guard clause, but that adds a maintainability issue. Instead, we just document that it's 99% accurate, and what cases we know won't work properly.


Today's lesson is somewhat short, but introduces a very complex topic. I recommend that you make sure you fully understand what is happening, as it's important to know. Also, if anyone has a better idea on the isOnlyInitials function, I'd be very interested in hearing about it. The grouping by index / 2 works, but I'm sure there's a better way to do it.

Loading