# Getting started with programming and getting absolutely nowhere (Part 7)

## Working out our Step 2 process

We've already finished working out Step 3, and that's a good place to have started: we needed to know what format would work best for Step 3, and as it turns out the best format is the same as the output format, so that's where we're going to try to get our data.

Step 2 is actually a painfully brutal process. We end up having to fight several discrepancies, including the following:

### Sheet 1

- Some users have the initial of their middle name in the same cell as their first name, instead of in a cell of its own;
- Some users have multiple email addresses (separated by semicolons) in the `Email` column;
- Some users have multiple `Magic ID` values, in multiple columns;
- Some users lack a `Magic ID` completely.

### Sheet 2

- Some users have a middle name, some a middle initial, and some neither, all in the same cell;
- Some users have multiple email addresses (separated by semicolons) in the `Email` column;
- Some users have multiple `Magic ID` values, in the same column;
- Some users lack a `Magic ID` completely.

These are all cases we want to account for, excepting the `Email` addresses. We won't actually do anything about multiple email addresses, since we don't care about them. For multiple `Magic ID` values we *will* make them separate rows, with the same information in the other columns. Thus, our transformation should take the sample data and end up with something like:

```
First  Middle  Last  Phone         Date Added  Email                                        Magic ID
John   C.      Doe   111-555-1234  29-Aug-17   email@example.com                            287
Jane   L.      Doe   123-555-4321  29-Aug-17   otheremail@example.com                       331
Jane   L.      Doe   123-555-4321  29-Aug-17   otheremail@example.com                       334
Ron            Doe   321-555-5678  29-Aug-17                                                872
Edna   Viola   Doe   8515559637    28-Aug-17   viola.doe@example.com; edna.doe@example.com  975
```

So by the time we finish, we should be able to transform both sheets into that exact data-set. We'll then (at a later date) explore how we can modify it to account for multi-part last names, such as `LA ROUSE` or `MAC GREGOR`. This is a *lot* harder than it looks, and we'll only worry about it for `Sheet 2`, because in `Sheet 1` it's *always* separated out.
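To make the name discrepancies concrete, here's a minimal sketch of pulling a trailing middle part out of a combined first-name cell. The function name and the exact splitting rules are my own assumptions, not code from this series:

```fsharp
// Hypothetical helper (not from this series): split a "First [Middle]" cell.
// Assumes the cell never contains the last name, as in Sheet 1.
let splitFirstAndMiddle (cell : string) =
    match cell.Trim().Split(' ') |> Array.filter (fun p -> p <> "") with
    | [||] -> "", None
    | [| first |] -> first, None
    | parts -> parts.[0], Some (String.concat " " parts.[1..])
```

So `splitFirstAndMiddle "John C."` yields `("John", Some "C.")`, while `splitFirstAndMiddle "Ron"` yields `("Ron", None)`.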

## Build a step-by-step process

Software development isn't just writing code all day long; in fact, in my workplace it's 60% problem solving, 20% tech support, and 20% writing code. We can ignore the tech-support bit for our adventure, so of the time that's left, we're going to focus on the 75% problem solving and 25% writing code. This lesson will be *entirely* problem solving - we won't write much code today - but we will implement the solutions to these problems in the next lesson.

Alright, so breaking down a problem is more an "art form" than anything else - you'll learn how to do it best *by* doing it, and you'll develop your own "style". We're going to walk through these in my "style" of solving them, but I do want you to try to solve them on your own as you get time. It's important to explore these in your own way, as how I solve problems may not work best for you.

The first thing I want to do is define the *broadest possible definition of the problem*. That is, what is our *main, end goal*? In our case, we want to normalize a specification sheet. Great, that's a good start. We now need to determine what individual parts are in this broader problem, and solve them. We could do this through a tree-diagram of the situation, breaking it down into nodes from top-to-bottom. I recommend you try doing this, as it really does help focus your thinking.

To "normalize our specification" we have some sub-problems to solve first. (After all, there's no "normalize specification" API for us to use **yet**.) These might be:

1. Load a specification file;
2. Determine what each column means;
3. Process all the rows of the specification;
4. Save each row of the specification.

That's more reasonable, but some of them still need to be broken down further, most specifically 2, 3, and 4. They each have sub-parts, so let's get to them.

- (1) Load / read a specification file;
- (2.1) Read each sheet and do the following;
- (2.2) Read the first row of the sheet;
- (2.3) Analyze each cell from the first row;
- (2.4) Determine what each cell from the first row indicates;
- (3.1) Read each line of data from the sheet;
- (3.2) Read each column of data from the sheet;
- (3.3) Split data from each line as appropriate (combined names, Magic IDs, etc.);
- (3.4) Filter out rows without the requisite data;
- (3.5) Process each column of data into the appropriate type;
- (4.1) Create a new "post-normalization" sheet from the source sheet;
- (4.2) Write each record to the new sheet;
- (4.3) Repeat from (2.1) for the next sheet;
- (4.4) Write the spreadsheet back out.

This is non-exhaustive. We still have *more* to cover yet, but we'll start thinking about each step and how it relates to the overall goal. We're going to design the flow of our program, which is usually somewhat fun. We'll walk through each step and think of it in human terms, so that we know what our process looks like.

The first step is easy, and I said we were mostly going to solve the problem ourselves, which is true, but we pretty much covered this in a previous lesson, so I'll give you the code to work with:

```
let workbook =
    use fs = new FileStream(parameters.Filename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)
```

Pretty basic, and sets us up with our workbook. Next, we want to read each sheet to perform some "task", right? This is not as easy as it *could* be with NPOI, but it's still not hard. We're going to iterate from `0` to `workbook.NumberOfSheets - 1`, and get each sheet with `workbook.GetSheetAt`, which takes an integer index. So this is also somewhat easy, and could be written like:

```
[0..workbook.NumberOfSheets - 1] |> List.iter (fun si ->
    let sheet = si |> workbook.GetSheetAt |> Option.ofObj
    ...
)
```

Again, pretty basic, so we'll go with it today, because this means *we're done with these steps.* `GetSheetAt` can return `null` if there's no sheet, so we convert it to an `Option` out of the gate to work better with F#'s pattern matching. We'll move on to 2.2, which is simple: read the first row, which we could call as `0 |> sheet.GetRow`. At this point, we're wondering how anything could actually be difficult, until we get to the next step: "Analyze each cell from the first row."

## Analyzing our header

Alright, so I said this was going to be hard, and it *is* by nature, but we're going to make it easy with a few assumptions. Let's digress and explore our imagination for a moment. First off: how many different ways can we write the headers of our sheets? `FirstName`, `First Name`, `fName`, `First`, `F NAME`, and we could go on for a while, ignoring case changes even. (Those all just have whitespace and word differences.) One of the more *lucrative* things to get into with programming is *machine learning*: teaching a machine how to think like a *human*. (For those who aren't aware: the human mind is the *single greatest* pattern and system recognition engine *ever created*.) In this type of situation it's easy for a person to work out what the column represents given any of those titles, but for the computer it's not so simple. We have to tell it what "pattern" to recognize.

Now if we were really clever, we could write some sort of regex that would indicate whether a "word" represents the idea of a `First Name` or not: `(f(irst)?\s?(name))|(first)`. Of course this ignores the term "`given name`" (often used for the `First Name`), so we're SoL on that front, but we could make *another* regex (`(g(iven)?\s?(name))|(given)`), put them together (`((f(irst)?\s?(name))|(first))|((g(iven)?\s?(name))|(given))`), and feel *pretty damn good* about ourselves, because now this should always detect a phrase indicating `First Name`, right? What about `Name, First`? Damn, add *another* regex. Oh, and `1st Name`, `1st`, `Name 1`, etc., and add `_` and `-` support. By the time we're done we might have the following (and we haven't even started on spelling correction):

```
(((f(irst)?)|(g(iven)?)|[1f]st)[\s_-]?(name)|(first|given|[1f]st))|(name(,?[\s_-]?(first|given|[1f](st)?)))
```

I don't know about you, but I don't want to be the person dealing with that. This is where machine learning *might* come in: teach the machine how to recognize terms that would indicate the `First Name`, i.e. the name we call someone by informally. Or, we can make the *blind assumption* that our data will only ever use `"First Name"` as the title of the column for `First Name`. I think that's what we'll do.
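For what it's worth, the monster regex above does hold up against those variants. A quick hedged sketch with .NET's `Regex` - note I've anchored it with `^`/`$` so it must span the whole header, which the original pattern doesn't do:

```fsharp
open System.Text.RegularExpressions

// The mega-pattern from above, anchored and made case-insensitive.
let firstNameHeader =
    Regex(@"^((((f(irst)?)|(g(iven)?)|[1f]st)[\s_-]?(name)|(first|given|[1f]st))|(name(,?[\s_-]?(first|given|[1f](st)?))))$",
          RegexOptions.IgnoreCase)

let looksLikeFirstName (header : string) =
    header.Trim() |> firstNameHeader.IsMatch
```

With this, `"First Name"`, `"fName"`, `"given_name"`, and `"Name, First"` all match, while `"Last Name"` does not - which is exactly why we'd rather not maintain it.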

So, if we find a single `Name` column, then we'll assume it contains all three names put together. If we find a `First Name` and a `Last Name` column, then we'll assume the first and middle name live in the `First Name` column, with the full `Last Name` on its own. And if we find `First Name`, `Middle Name`, and `Last Name`, then we assume each name is in its own column, with the possibility that `First Name` holds the first name plus a middle initial.

We can then process the remaining columns, until we get to `Magic ID`, which we have two possibilities for: we either have multiple `Magic ID` columns, or we have a single column. This is not *nearly* as difficult to solve as the previous situation. We're just going to read in any header with the name `Magic ID`, and if we find one named *exactly* `"Magic ID"`, then it means the values *could* be joined in that column, but if we find multiple columns, then it means they're separated.

## Reading the data

Alright, hardest part over. (We're not into the easy waters yet, but we're close.) Now we need to actually *interpret* our data. This is not so hard: what we're going to do here is essentially load row-by-row into a `type`, which we've yet to define, but it's rather trivial:

```
type SpecUser = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicID : int
}
```

So what we'll do here is load a row, parse the columns (based on what we detected from the header) and then continue to the next one. If a row doesn't make a valid `SpecUser`, then remove it. (So `Seq.choose` will be useful here.)
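To sketch that filtering step (repeating the type so the snippet stands alone; the helper name and the exact validation rules are my assumptions):

```fsharp
type SpecUser = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicID : int
}

// Build a SpecUser from already-split cell strings; None drops the row,
// which is exactly the shape Seq.choose wants.
let tryMakeUser first middle last email (magicId : string) =
    match System.Int32.TryParse magicId with
    | true, id when first <> "" && last <> "" ->
        Some { FirstName = first; MiddleName = middle
               LastName = last; Email = email; MagicID = id }
    | _ -> None
```

A row with a garbage `Magic ID` or a missing name simply comes back as `None` and vanishes from the output.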

The *hardest* part with this will be the splitting on multiple `MagicID` values, which is really not all that hard. If we `Seq.map (fun x -> [|...|]) |> Seq.concat` - or use `Seq.collect`, which does the map and the flatten in one step - we'll get what we need. (Remember flattening sequences from my earlier lessons? `Seq.concat` is the standard library's name for it.)
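A tiny illustration of that expansion, using sample data that mirrors Jane's two IDs (the function name and tuple shape are mine, purely for demonstration):

```fsharp
// Expand each (name, ids) pair into one row per Magic ID.
let expandIds rows =
    rows
    |> Seq.collect (fun (name, ids) -> ids |> List.map (fun id -> name, id))
    |> List.ofSeq

// expandIds [ "Jane Doe", [331; 334]; "Ron Doe", [872] ]
// yields [("Jane Doe", 331); ("Jane Doe", 334); ("Ron Doe", 872)]
```

Every other column's value just gets repeated, which is exactly the duplication we saw in the target data-set above.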

Once we've read the data, creating the new sheet is trivial. We'll actually create a *whole new workbook*, because our sheet name should be `Process_Spec`, and we want to save *every* input sheet. This is quite trivial, and it won't take long to handle.

Finally, we'll write the appropriate spreadsheets back out. These will each hold the single `Process_Spec` sheet loaded from the input spreadsheets. Again, we covered this before, and it's not hard to write this particular code.

## Tidying up

I'm going to leave off here for today - the next lesson will get into the code for all this, but it's important for us to go over, in people-speak, what *needs* to happen, because as we get into code it'll start to become more and more clear why I went this route.

## Bonus day: Finding a Gap in an arbitrary Sequence

Recently I had to write this code for a personal project, and since we didn't do much code above I suppose I'll share because that's the type of kind-hearted person I am.

The goal here is bewilderingly simple: given a formula for the expected item at an index, `F(x)`, and a manner of determining the current item at an index, `G(x)`, we want to detect the *first* item where `F(x) <> G(x)`. Simple, right? A basic version might be:

```
let firstNonEqualItem = [iFirst..iLast] |> Seq.filter (fun i -> f i <> g i) |> Seq.head
```

Yeah, this works - an example:

```
let items = [|0;1;2;3;4;5;7;8;9|]
let firstUnequalItem = [0..8] |> Seq.filter (fun i -> items.[i] <> i) |> Seq.head
```

But what *obvious* problem does this have? It does a linear search, which means that if there are *many millions* of values, it scans from the beginning each and every time. If the first missing value is 10 million elements in, it searches through the first 10 million elements before returning. Now what if I told you we could search those same 10 million elements in 24 iterations? What if I told you *we could find any missing element in that 10-million item array* in 24 or fewer iterations, *each and every time*, no matter where it was?
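Where does 24 come from? Halving the search range each step means the step count is the base-2 logarithm of the element count, rounded up. A quick sanity check (my own arithmetic helper, not part of the lesson's code):

```fsharp
// Binary search halves the range each step, so steps = ceil(log2 n).
let stepsNeeded (n : float) = ceil (log n / log 2.0)

stepsNeeded 10_000_000.0   // 24.0
```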

This is, in fact, not hard. The biggest problem is trying to describe, in plain English, an algorithm that does it, but let's take an example. We have an array:

```
0 1 2 3 4 5 7 8 9
```

We want to find which element is missing. We all know it's `6`, but how do we teach a computer to find it?

The easiest method is to define two variables representing the left and right bounds of the portion of the array we're in; we'll call them `l` and `r`. (If your fonts suck, those are lower-case `L` and `R`, respectively.) By default we'll define them as the first and last element, respectively.

```
0 1 2 3 4 5 7 8 9
l               r
```

Now, we'll find the element in the middle of `l` and `r`; since `l = 0` and `r = 8`, that's `4` (`(l + r) / 2`). We'll evaluate this element, called `m`, and determine whether `f(m) = g(m)`.

```
0 1 2 3 4 5 7 8 9
l       m       r
```

In our case, `f(4) = g(4)` holds (both are `4`), so now we'll set `l` to `m + 1`.

```
0 1 2 3 4 5 7 8 9
          l     r
```

Next, we'll repeat. This time `m = (5 + 8) / 2 = 6`, and `g(6)` is `7` (while `f(6)` is `6`), so the test is *not* true, and we set `r` to `m`.

```
0 1 2 3 4 5 7 8 9
          l m   r

0 1 2 3 4 5 7 8 9
          l r
            (m)
```

Next we'll compute `m` again, which is `(5 + 6) / 2 = 5`, and `f(5) = g(5)` is true, so we move `l` to `m + 1`.

```
0 1 2 3 4 5 7 8 9
            l
            r
```

Once `l` and `r` are in the same position, we return `f(l)`, which is `6` in this case. Do note we compute `m` *three* times, which means we only have to iterate *4* times to find a missing value in this array. You can calculate this as the base-2 logarithm of `n`, where `n` is the array count. (So, `log(n) / log(2)` if you have a cheap calculator that only does base-10 or base-e logarithms, like mine.) The code, for a basic F# version of this algorithm, is as follows:

```
let Zero = bigint 0
let One = bigint 1
let NegOne = bigint -1

let searchForOpening equals formula getItem count =
    let expected v = (v |> getItem, v |> formula) ||> equals
    let rec alg l r =
        match l = r with
        | true -> formula l
        | false ->
            let mid = (l + r) >>> 1
            match mid |> expected with
            | true -> (mid + One, r) ||> alg
            | false -> (l, mid) ||> alg
    match count with
    | r when r = 0I -> r |> formula
    | r ->
        match r |> expected with
        | true -> (r + One) |> formula
        | false -> (Zero, r) ||> alg

let items = [|0I; 1I; 2I; 3I; 4I; 5I; 7I; 8I; 9I|]
searchForOpening (fun a b -> a = b) id (fun a -> items.[int a]) (items |> Array.length |> bigint |> (+) NegOne)
```

Pretty simple, no? Do note that this algorithm assumes `f(x)` and `g(x)` are *pure functions* that can work off an *arbitrary index*. Also, the `>>>` may be unfamiliar to you - look up the F# operators, under "bitwise operators". It shifts everything right by `n` bits (`1` in this case). An `x >>> 1` is a fancy way of saying `x / 2`, but much faster (in many cases), and with no confusion as to whether it rounds, and if so in which direction.

And if you're doubting the efficiency of the algorithm, the following test should do you some good:

```
searchForOpening
    (=)
    id
    (function | a when a <= 1000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    2000000000000000000000000000000000000000000000000000000000000I
```

We find the missing value of an *enormous* series of elements - by cheating, basically. This runs about *200* iterations to find the missing value. **Two hundred**, that's it. We search through a sequence of elements with *60* zeros in it so fast that its efficiency is undeniable. This can be run in F# Interactive with *great* ease and speed - you won't notice a delay. The caveats are that we *need* a manner of comparing equality, which isn't always possible, and we *need* pure functions that can evaluate what the *current* and *expected* elements are. In fact, we'll up the ante even more:

```
searchForOpening
    (=)
    id
    (function | a when a <= 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I
```

This searches through *two googol* values - *even more enormous* - and it does it *very* quickly (about 333 iterations). So if anyone tells you "F# can't be quick" or "you can't search through a set of numbers that big", tell them they're *very* wrong. We just did it, and in 20 lines of code.

This was another relatively short lesson, but no less important than any other: it's *absolutely inarguably imperative* that you know how to solve problems. This is the single biggest thing I want to stress through this whole adventure: knowing a language and API is great, but if you don't know how to *solve problems*, you're about as useful as a piece of burnt toast.