using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 7)

Working out our Step 2 process

Lesson 6: It's Time to Accept Arguments
Lesson 8: Let's Normalize Our Data!

We've finished working out Step 3, which is a good start: we needed to know what format would be best for Step 3, and as it turns out the best format is the same as the output format, so we're going to try to get our data there.

Step 2 is actually a painfully brutal process. We end up having to fight several discrepancies, including the following:

Sheet 1

  • Some users have the middle initial of their middle name in the same cell as their first name, instead of in a new cell;
  • Some users have multiple email addresses (separated by semicolons) in the Email column;
  • Some users have multiple Magic ID values, in multiple columns;
  • Some users lack a Magic ID completely;

Sheet 2

  • Some users have a middle name, some a middle initial, and some neither, all in the same cell;
  • Some users have multiple email addresses (separated by semicolons) in the Email column;
  • Some users have multiple Magic ID values, in the same column;
  • Some users lack a Magic ID completely;

These are all cases we want to account for, except the email addresses. We won't actually do anything about multiple email addresses, since we don't care about them. For multiple Magic ID values, we'll split them into separate rows that share the same values in the other columns. Thus, our transformation should take the sample data and end up with something like:

First    Middle    Last    Phone           Date Added    Email                                          Magic ID
John     C.        Doe     111-555-1234    29-Aug-17     email@example.com                              287
Jane     L.        Doe     123-555-4321    29-Aug-17     otheremail@example.com                         331
Jane     L.        Doe     123-555-4321    29-Aug-17     otheremail@example.com                         334
Ron                Doe     321-555-5678    29-Aug-17                                                    872
Edna     Viola     Doe     8515559637      28-Aug-17     viola.doe@example.com; edna.doe@example.com    975

So by the time we finish, we should be able to transform both sheets into that exact data-set. We'll then (at a later date) explore how we can modify it to account for multi-part last names, such as LA ROUSE, or MAC GREGOR. This is a lot harder than it looks, and we'll only worry about that for Sheet 2, because in Sheet 1 it's always separated out.

Build a step-by-step process

Software development isn't just writing code all day long; in fact, in my workplace it's 60% problem solving, 20% tech-support, and 20% writing code. If we ignore the tech-support bit for our adventure, that leaves us focusing on 75% problem solving and 25% writing code. This lesson will be entirely problem solving - we won't write much code today - but we will implement the solutions to these problems in the next lesson.

Alright, so breaking down the problem is more an "art form" than anything else - you'll learn how to do it best by doing it, and you'll develop your own "style". We're going to walk through these problems in my "style" of solving them, but I do want you to try to solve them on your own as you get time. It's important to explore these in your own way, as how I solve problems may not work best for you.

The first thing I want to do is define the broadest possible definition of the problem. That is, what is our main, end goal? In our case, we want to normalize a specification sheet. Great, that's a good start. We now need to determine what individual parts are in this broader problem, and solve them. We could do this through a tree-diagram of the situation, breaking it down into nodes from top-to-bottom. I recommend you try doing this, as it really does help focus your thinking.

To "normalize our specification" we have some sub problems to solve first. (After all, there's no "normalize specification" API for us to use yet.) These might be:

  1. Load a specification file;
  2. Determine what each column means;
  3. Process all the rows of the specification;
  4. Save each row of the specification;

That's more reasonable, but some of the steps still need to be broken down further - specifically 2, 3, and 4. They each have sub-parts, so let's get to them.

  1. Load / read a specification file;
  2. (2.1) Read each sheet and do the following;
  3. (2.2) Read the first row of the sheet;
  4. (2.3) Analyze each cell from the first row;
  5. (2.4) Determine what each cell from the first row indicates;
  6. (3.1) Read each line of data from the sheet;
  7. (3.2) Read each column of data from the sheet;
  8. (3.3) Split data from each line as appropriate (combined names, magic ID's, etc.);
  9. (3.4) Filter out rows without requisite data;
  10. (3.5) Process each column of data into the appropriate type;
  11. (4.1) Create a new "post-normalization" sheet from the source sheet;
  12. (4.2) Write each record to the new sheet;
  13. (4.3) Repeat to 2 (2.1);
  14. (4.4) Write the spreadsheet back out;

This is non-exhaustive - we still have more to cover - but we'll start thinking about each step and how it relates to the overall goal. We're going to design the flow of our program, which is usually somewhat fun. We'll walk through each step and think of it in human terms, so that we know what our process looks like.

The first step is easy, and I said we were mostly going to solve the problem we have, which is true, but we pretty much covered this in a previous lesson so I'll give you the code to work with:

let workbook =
    use fs = new FileStream(parameters.Filename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

Pretty basic, and sets us up with our workbook. Next, we want to read each sheet to perform some "task", right? This is not as easy as it could be with NPOI, but it's still not hard. We're going to iterate from 0 to workbook.NumberOfSheets - 1, and get each sheet with workbook.GetSheetAt, which takes an integer index. So this is also somewhat easy, and could be written like:

[0..workbook.NumberOfSheets - 1] |> List.iter (fun si ->
    let sheet = si |> workbook.GetSheetAt |> Option.ofObj
    ...
)

Again, pretty basic, so we'll go with it today, which means we're done with these steps. GetSheetAt can return null if there's no sheet at that index, so we convert the result to an Option right out of the gate to work better with F#'s pattern matching. We'll move on to 2.2, which is simple: read the first row, which we could write as 0 |> sheet.GetRow. At this point, we're wondering how anything could actually be difficult - until we get to the next step: "Analyze each cell from the first row."
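To make the Option payoff concrete, here's a hedged sketch of the sheet loop with the pattern match written out. The processSheet function is a placeholder (it stands in for whatever per-sheet work comes next), passed in as a parameter so the sketch compiles on its own:

```fsharp
open NPOI.XSSF.UserModel

// Iterate every sheet in the workbook; Option.ofObj turns NPOI's
// possible null into None so we can pattern match instead of null-check.
let forEachSheet processSheet (workbook : XSSFWorkbook) =
    [0 .. workbook.NumberOfSheets - 1] |> List.iter (fun si ->
        match si |> workbook.GetSheetAt |> Option.ofObj with
        | Some sheet ->
            let header = 0 |> sheet.GetRow   // 2.2: the first row
            processSheet header sheet
        | None -> ()                         // no sheet at this index; skip
    )
```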

Analyzing our header

Alright, so I said this was going to be hard, and by nature it is, but we're going to make it easy by making a few assumptions. Let's digress and explore our imagination for a moment. First off: how many different ways can we write the headers of our sheets? "FirstName, First Name, fName, First, F NAME," and we could go on for a while, even ignoring case changes. (Those all just have whitespace and word differences.) One of the more lucrative things to get into with programming is machine learning: teaching a machine how to think like a human. (For those who aren't aware: the human mind is the single greatest pattern- and system-recognition engine ever created.) In this type of situation it's easy for a person to work out what the column represents given any of those titles, but for the computer it's not so simple. We have to tell it what "pattern" to recognize.

Now if we were really clever, we could write some sort of regex that would indicate whether a "word" represents the idea of a First Name or not: (f(irst)?\s?(name))|(first). Of course this ignores the term "given name" (often used for the First Name), so we're SoL on that front, but we could make another regex ((g(iven)?\s?(name))|(given)) and put them together (((f(irst)?\s?(name))|(first))|((g(iven)?\s?(name))|(given))) and feel pretty damn good about ourselves, because now this should always detect a phrase indicating First Name, right? What about Name, First? Damn, add another regex. Oh, and 1st Name, 1st, Name 1, etc., and add _ and - support. By the time we're done we might have the following (and we haven't even started on spelling correction):

(((f(irst)?)|(g(iven)?)|[1f]st)[\s_-]?(name)|(first|given|[1f]st))|(name(,?[\s_-]?(first|given|[1f](st)?)))

I don't know about you, but I don't want to be the person dealing with that. This is where machine learning might come in: teach the machine how to recognize terms that would indicate the First Name, i.e. the name we call someone by informally. Or, we can make the blind assumption that our data will only ever use "First Name" as the title of the column for First Name - I think that's what we'll do.

So, if we find a single Name column, then we'll assume it contains all three names put together. If we find a First Name and Last Name column, then we'll assume the First Name column contains the first and middle names, with the full Last Name on its own. And if we find First Name, Middle Name, and Last Name columns, then we'll assume each name is in its own column, with the possibility that First Name holds the first name plus a middle initial.

We can then process the remaining columns until we get to Magic ID, for which we have two possibilities: either we have multiple Magic ID columns, or we have a single column. This is not nearly as difficult to solve as the previous situation. We're just going to read in any header with the name Magic ID: if we find exactly one column named "Magic ID", it means the values could be joined in that column, but if we find multiple columns, it means they're separated.
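Under those assumptions, the header analysis collapses to simple string matching. A minimal sketch - the ColumnKind type and the exact header strings are my assumptions here, not a fixed API:

```fsharp
// The kinds of column we expect to find in a header row.
type ColumnKind =
    | Name | FirstName | MiddleName | LastName
    | Phone | DateAdded | Email | MagicId
    | Unknown of string

// Map one header cell's text to the column it indicates. We blindly
// assume the exact titles discussed above; anything else is Unknown.
let classifyHeader (text : string) =
    match text.Trim() with
    | "Name"        -> Name
    | "First Name"  -> FirstName
    | "Middle Name" -> MiddleName
    | "Last Name"   -> LastName
    | "Phone"       -> Phone
    | "Date Added"  -> DateAdded
    | "Email"       -> Email
    | "Magic ID"    -> MagicId
    | other         -> Unknown other
```

Running this over the whole first row gives us a list of ColumnKind values; counting the MagicId entries then tells us which of the two Magic ID layouts we're dealing with.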

Reading the data

Alright, hardest part over (we're not into the easy waters yet, but we're close): now we need to actually interpret our data. This is not so hard. What we're going to do here is essentially load row-by-row into a type, which we've yet to define but which is rather trivial:

type SpecUser = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicID : int
}

So what we'll do here is load a row, parse the columns (based on what we detected from the header), and then continue to the next one. If a row doesn't make a valid SpecUser, we'll drop it. (So Seq.choose will be useful here.)
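A hedged sketch of that parse-and-drop step, assuming the SpecUser type above and fixed column positions (the real version will use whatever the header analysis detected; the positions here are illustrative only):

```fsharp
open NPOI.SS.UserModel

// Hypothetical row parser assuming columns: First, Middle, Last,
// Email, Magic ID. Returns None when a required cell is missing or
// blank, so rows |> Seq.choose tryParseRow drops the bad rows for us.
let tryParseRow (row : IRow) =
    let text i =
        row.GetCell i
        |> Option.ofObj
        |> Option.map (fun c -> c.ToString().Trim())
        |> Option.filter (fun s -> s <> "")
    match text 0, text 2, text 4 with
    | Some first, Some last, Some magic ->
        Some { FirstName = first
               MiddleName = text 1
               LastName = last
               Email = defaultArg (text 3) ""
               MagicID = int magic }
    | _ -> None
```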

The hardest part of this will be splitting on multiple MagicID values, which is really not all that hard. If we Seq.map (fun x -> [|...|]) |> Seq.flatten we'll get what we need. (Remember Seq.flatten from my earlier lessons?)
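To be concrete about the split: we map each parsed record to a collection of one-Magic-ID copies, then flatten, giving one row per ID. A sketch assuming the SpecUser type above; the pairing of a user with its extra IDs is my assumption about what the parse step hands us:

```fsharp
// Expand one parsed record (with possibly several Magic IDs) into one
// SpecUser per ID, copying every other field unchanged.
let expandMagicIds (user : SpecUser, extraIds : int list) =
    user :: (extraIds |> List.map (fun id -> { user with MagicID = id }))

// Usage sketch: Seq.collect is the map-then-flatten step in one call,
// where parsedPairs is a hypothetical (SpecUser * int list) sequence.
// let allUsers = parsedPairs |> Seq.collect expandMagicIds
```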

Once we've read the data, creating the new sheet is trivial. We'll actually create a whole new workbook, because our sheet name should be Process_Spec, and we want to leave every input sheet intact. This is quite trivial, and it won't take long to handle.

Finally, we'll write the appropriate spreadsheets back out. These will contain the single Process_Spec sheet built from the input spreadsheets. Again, we covered this before, and it's not hard to write this particular code.
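For reference, the NPOI side of 4.1 through 4.4 looks roughly like this, assuming the SpecUser type above. The column order is illustrative, not final:

```fsharp
open System.IO
open NPOI.XSSF.UserModel

// Create the output workbook with its single Process_Spec sheet,
// write a header plus one row per SpecUser, then save to disk.
let writeOutput (users : SpecUser seq) (path : string) =
    let workbook = new XSSFWorkbook()
    let sheet = workbook.CreateSheet "Process_Spec"
    let header = sheet.CreateRow 0
    [ "First"; "Middle"; "Last"; "Email"; "Magic ID" ]
    |> List.iteri (fun i title -> (header.CreateCell i).SetCellValue title)
    users
    |> Seq.iteri (fun i user ->
        let row = sheet.CreateRow (i + 1)
        (row.CreateCell 0).SetCellValue user.FirstName
        (row.CreateCell 1).SetCellValue (defaultArg user.MiddleName "")
        (row.CreateCell 2).SetCellValue user.LastName
        (row.CreateCell 3).SetCellValue user.Email
        (row.CreateCell 4).SetCellValue (float user.MagicID))
    use fs = new FileStream(path, FileMode.Create, FileAccess.Write)
    workbook.Write fs
```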

Tidying up

I'm going to leave off here for today - the next lesson will get into the code for all this, but it's important for us to go over, in people-speak, what needs to happen, because as we get into code it'll start to become more and more clear why I went this route.


Bonus day: Finding a Gap in an arbitrary Sequence

Recently I had to write this code for a personal project, and since we didn't do much code above, I suppose I'll share it, because that's the type of kind-hearted person I am.

The goal here is bewilderingly simple: given a formula for the expected item at an index, F(x), and a manner of determining the current item at an index, G(x), we want to find the first item where F(x) <> G(x). Simple, right? A basic version might be:

let firstNonEqualItem = [iFirst..iLast] |> Seq.filter (fun i -> (i |> f) <> (i |> g)) |> Seq.head

Yeah, this works, an example:

let items = [|0;1;2;3;4;5;7;8;9|]
let firstUnequalItem = [0..8] |> Seq.filter (fun i -> items.[i] <> i) |> Seq.head

But what obvious problem does this have? It does a linear search, which means that if there are many millions of values, it scans from the beginning each and every time. If the first missing value is 10-million elements in, it searches through the first 10-million elements before returning. Now what if I told you we could search the same 10-million elements in 24 iterations? What if I told you we could find any missing element in that 10-million item array in 24 or fewer iterations, each and every time, no matter where it was?

This is, in fact, not hard. The biggest problem is trying to define, in plain English, an algorithm that does it, so let's take an example. We have an array:

0 1 2 3 4 5 7 8 9

We want to find what element is missing, well we all know it's 6, but how do we teach a computer to find it?

The easiest method is to define two variables which represent the left and right bounds of the portion of the array we're in; we'll call them l and r. (If your fonts suck, those are lower-case L and R, respectively.) We'll define them to start at the first and last element, respectively.

0 1 2 3 4 5 7 8 9
l               r

Now, we'll find the element in the middle of l and r; since l = 0 and r = 8, that's 4. ((l + r) / 2) We'll evaluate this element, called m, and determine whether f(m) = g(m).

0 1 2 3 4 5 7 8 9
l       m       r

In our case, f(4) = g(4) = 4, so the test passes and we set l to m + 1.

0 1 2 3 4 5 7 8 9
          l     r

Next, we'll repeat. This time m = 6, and g(6) is 7 while f(6) is 6, so the test fails and we set r to m.

0 1 2 3 4 5 7 8 9
          l m   r
0 1 2 3 4 5 7 8 9
          l r
0 1 2 3 4 5 7 8 9
          l r
          m

Next we'll test the new m, which is 5, and f(5) = g(5) = 5, so the test passes and we move l to m + 1.

0 1 2 3 4 5 7 8 9
            l
            r

Once l and r are in the same position, we return f(l), which is 6 in this case. Do note we only computed m three times, which means it took just four passes through the loop to find the missing value in this array. You can estimate this as the base-2 logarithm of n, where n is the array count. (So, log(n) / log(2) if you have a cheap calculator that only does base-10 or base-e logarithms, like mine.) The code, for a basic F# version of this algorithm, is as follows:

let Zero = bigint 0
let One = bigint 1
let NegOne = bigint -1
let searchForOpening equals formula getItem count =
    let expected v = (v |> getItem, v |> formula) ||> equals
    let rec alg l r =
        match l = r with
        | true -> formula l
        | false ->
            let mid = (l + r) >>> 1
            match mid |> expected with
            | true -> (mid + One, r) ||> alg
            | false -> (l, mid) ||> alg
    match count with
    | r when r = 0I -> r |> formula
    | r ->
        match r |> expected with
        | true -> (r + One) |> formula
        | false -> (Zero, r) ||> alg 
let items = [|0I; 1I; 2I; 3I; 4I; 5I; 7I; 8I; 9I;|]
searchForOpening (fun a b -> a = b) id (fun a -> items.[int a]) (items |> Array.length |> bigint |> (+) NegOne)

Pretty simple, no? Do note that this algorithm assumes f(x) and g(x) are pure functions that can work off an arbitrary index. Also, the >>> may be unfamiliar to you - look up the F# operators; you're looking for the "bitwise operators". It essentially shifts everything right n bits (1, in this case). A >>> 1 is a fancy way of saying / 2, but often faster, and with no confusion as to whether it rounds, and if so in what direction.

And if you're doubting the efficiency of the algorithm, the following test should do you some good:

searchForOpening
    (=)
    id
    (function | a when a <= 1000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    2000000000000000000000000000000000000000000000000000000000000I

We find the missing value of an enormous series of elements by cheating, basically. This runs about 200 iterations to find the missing value. Two hundred - that's it. We search through a sequence whose length has 60 zeros in it so fast that its efficiency is undeniable. This can be run in F# Interactive with great ease and speed - you won't notice a delay. The caveat is that we need a manner of comparing equality, which isn't always possible, and we need pure functions that can evaluate what the current and expected elements are. In fact, we'll up the ante even more:

searchForOpening
    (=)
    id
    (function | a when a <= 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I

This searches through two googol values - even more enormous - and it does it in a very quick time. (About 333 iterations.) So if anyone tells you "F# can't be quick" or "you can't search through a set of numbers that big", tell them they're very wrong. We just did it, and in about 20 lines of code.


This was another relatively short lesson, but no less important than any other: it's absolutely, inarguably imperative that you know how to solve problems. This is the single biggest thing I want to stress through this whole adventure: knowing a language and an API is great, but if you don't know how to solve problems you're about as useful as a piece of burnt toast.

Step2_Testing.xlsx (11.5KB)