# Why do we compose?

If you've been following along in this series (which I recommend that you do, it starts way back here), you'll notice that I've done a lot of `a |> (g >> f)`, and not as much `a |> g |> f` or `f(g(a))`, if you're a curious person (which you should be, curiosity is the only way we learn something new), you're probably wondering why. Why do I write things as a |> (g >> f)` rather than the other two options? Well, today's lesson is going to go into that and help us learn what the differences are, so that hopefully things are more clear to you in the future.

## First thing first, what is the purpose of these three things?

Before we begin our adventure, I want to refresh our memory back to lesson 1, when I said:

The pipe-right (`|>`) operator takes the value on the left side, and transforms it to be the last argument of the function on the right side.

So we need to define an even more basic idea first, because pipe-right works on that. So let's do that.

In programming, we often work with something called a function or method, now these terms are sometimes interchangeable, and sometimes not, but we'll define a function first:

A function is a symbol that takes one parameter and returns one value.

## So what's a symbol?

In software-engineering, programming, whatever you call it, a "symbol" is a name mapped to a memory location. This is an important concept, because we often thing of symbols to a value (`let x = 5`) as being the value themselves: they're not. The symbol `x` is a memory-pointer to a portion of memory that holds the value of `5`.

So, think of it this way: if you look in your phone 'contacts' you see a "name", but that name often has a "number" associated with it. When you call or text someone you don't call their name, you call the number. The name is a "symbol" that points to the "number". Hopefully that helps clear it up.

## So if a function is a symbol, what does it point to?

That depends, and I'm not nearly educated enough to go into full details on that, but I'll give you the "skinny":

• When you declare a function `let f = (+) 5`, you declare a "symbol" (`f` in this case) that maps to a set of instructions (`(+) 5` in this case).
• When you call "`f`" (such as `f(3)`), you are actually asking the software to load the instructions, and then execute them with the value "`3`" being passed as parameter. Of course, it's not obvious here because I used composition, so let's redefine `f`: `let f x = x + 5`. Now is it clear? We have a memory location that is a set of instructions (there are three, we'll get into that shortly) that performs `x + 5`.
• When you pass `x` to `f`, you are passing it via a "stack", which is an internal memory component that can be read to and written to, but only the top value. This is another important concept. We only ever see the "top" value of the stack, or the last one to be pushed in.

So, here's a more clear example:

• `let x = 5`: declares a symbol "`x`" that ponts to an address of value `5`;
• `let f x = x + 5`: declares a symbol "`f`" that points to a function that expects a single parameter bound to symbol `x` (a different `x`) then exectues the instruction `x + 5` (plus at least two others, again I'll get to that later);
• `let y = f x`: declares a symbol `y` that is the symbol `x` passed as the first parameter to the function of symbol `f`, then places the result of that into an address bound to symbol `y`;

Alright, so does that make sense? Now we understand that `f` is just a symbol, that points to a memory location with some instructions. So this introduces composition, which is the idea that we can build a function out of smaller parts. Let's define `g`: `let g x = x * 2`.

So the symbol `g` points to a memory location that says "load and execute the three instructions", but this means we can build the composite function `g of f`: `let gf = g >> f`. Think of the `>>` operator as being "and then", we say "do `g`, and then do `f` with the result." We can also `let fg = f >> g`, which is "do `f`, then do `g`". The interesting thing is that the symbol `gf` or `fg` is just a symbol holding the two functions, and saying "when you get done with the first, give that result to the second, then give me the result."

So now let's go through our example:

``````let x = 5 // Bind `5` to the symbol `x`, `x` is a value of `int`
let f = (+) 5 // Bind `+ 5` to the symbol `f`, `f` is a function of `int -> int`
let g = (*) 2 // Bind `* 2` to the symbol `g`, `g` is a function of `int -> int`
let gf = g >> f // Bind `g and then f` to the symbol `gf`, `gf` is a function of `int -> int`
let y = x |> gf // Bind the result of `gf x` to the symbol `y`, `y` is a value of `int`
``````

Now at the moment we define `y`, `gf` gets executed with the value `x`, which in tern means `g` gets executed with `x`, then `f` gets executed with `g(x)`. Interestingly, we can model this entire operation more clearly:

``````let gf x =
let x_g = g x
f x_g
``````

Now the problem here is that it takes more time and consideration to realize what happened (after you get used to composition, obviously): we bound an intermediate symbol to the result of `g`, then returned the result of `f` with that intermediate symbol. Interestingly `gf` will have the same `int -> int` signature at the root level, but it is now defined as `let gf (x : int) : int =`, rather than `let gf : int -> int =`, which is a subtle but important difference: `gf` now directly relies upon the top value of the stack rather than indirectly.

Now I'm noting this because I want to show you the pipe-right option of `gf`: `let gf x = x |> g |> f`, which is equivalent to `let gf x = f (g x)`. So we now have three options for writing `gf` (all being an equivalent result):

``````let gf x = f (g x) // Write them as parameterized calls
let gf x = x |> g |> f // Write them as piped calls
let gf = g >> f // Write them as a composition
``````

I would choose the last result each and every time and I want you to understand why: when we see `let gf = g >> f`, we should immediately think to ourselves that `gf` is a single step, composed of sub-steps `g` and then `f`. The important part there is that `gf` is a single step, it's a single operation, it stands for one thing to do, whereas `x |> g |> f` stands for multiple things to do. It may not imply it directly, but it indirectly says "take `x`, do `g` with it, then do `f` with that result", whereas `g >> f` says "do `g` and then do `f` immediately." Yes, all three mean the same thing, and yes, all three result in the same value, but it's important for us to write our code readably, we want people to understand it. When I see `f (g x)` I think "do `f`, then do `g` with `x`, and give that to the `f` being done." It's much harder to see the immediate idea that `f` is just the next step in the pipeline, especially once you get to curried or "multi-parameter" functions.

We can also note that `let gf = g >> f` is shorter than the other two results, in number of characters and time to type. (I'm faster at `>` than `(` or `)`, for example, and I'd bet you probably are too.)

## Bottom line, pick the clearest option

I want to boil this entire discussion down to one idea: write the code that's most obvious. Write the code that's most clear. Write the code that's most meaningful. We've had a trope in the software engineering world for a long time:

Write your code as if the person that has to maintain it is a violent sociopath who knows where you live.

Don't write "clever" things (unless theres a very good reason to), write things that are direct, obvious and say what they mean.

## What are the three instructions I talked about earlier?

Right, so I want to clarify this one, when I said we define `let f x = x + 5` being three instructions, what do I mean by that?

Well, in the software engineering world we don't get anything for free (except coffee sometimes, but only if we have a good employer). We pay for everything. When we define `let f x = x + 5` we have to do at least three things here:

1. Load the top value in the stack to `x`;
2. Perform the `(+)` operation with `x` and `5`;
3. Return the result;

We have to do all three, and even if we write-off `+` as two function calls (because it is, it takes two arguments so it's two calls), we have to do at least those three things. We have to load the top value on the stack, we have to perform our `+ 5` (we'll call that one instruction), and we have to return that value.

I've slowed down on these posts for the moment, largely because a lot of my free time has been lost, but I'm going to continue to do at least one a week, and when I get more time I'll be doing them more than that. I have a few different projects I'm working on that will all be described here in time, so we'll be going into some pretty advanced topics, so it would be prudent to try to study a few of these things in your free time, as we have a lot to cover yet.

# Building some 'fuzzy logic' for our names

Lesson 9: Knowing the terminology is important

If you recall, in lesson 7 I mentioned the following:

We'll then (at a later date) explore how we can modify it to account for multi-part last names, such as `LA ROUSE`, or `MAC GREGOR`. This is a lot harder than it looks, and we'll only worry about that for `Sheet 2`, because in `Sheet 1` it's always separated out.

Today, we're going to explore this part of our topic - we'll build some 'fuzzy logic' that doesn't seem fuzzy, but happens to be. Of course, first we need to define 'fuzzy logic', so let's do that.

## What is 'fuzzy logic'?

Fuzzy logic is used fairly often in programming, it's essentially building a set of rules that sacrifice one of the following to make up for the other two: predictability, accuracy, and speed. Ours will sacrifice accuracy for speed and predictability. We're going to build a function that is mostly correct, but sometimes wrong, to allow us to solve the 95% scenario.

We need some "rules" for our fuzzy logic, so let's define them:

• If a word is <= 3 characters, it's part of a "compound word";
• A compound word should be grouped with the next word if possible;
• If a word is a single character followed by a single period, in any quantity, it's an "initial";

In [lesson 8], we built `splitNames`:

``````let splitNames (name : string) =
match ' ' |> name.Split with
| [|first; last|] -> (Some first, None, Some last)
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| _ -> (None, None, None)
``````

This isn't fuzzy logic so much as bad logic: we just assume that two parts = first, last name, and three parts = first, middle, last name. We make absolutely no guarantee that there's even an apporpriate value there, for the name of `John Mac Gregor`, we get `(Some John, Some Mac, Some Gregor)`. That's not right at all, for `Rhonda La Rouse`, we get `(Some Rhonda, Some La, Some Rouse)`, again wrong.

So let's define portions of our 'fuzzy logic':

``````let isInitial : char array -> bool = (function | [|_; '.'|] -> true | _ -> false)
``````

Good start, we test a `char array` for two elements of `anything, '.'`. We can test this easily: `"a." |> Seq.toArray |> isInitial`, which passes.

Now we need a function, `isOnlyInitials` which tells us if a `string` only consists of `initials`, i.e. `D.W.` should be `true`:

``````let isOnlyInitials : string -> bool =
Seq.toArray
>> Array.mapi (fun i x -> (x, i))
>> Array.groupBy (fun (x, i) -> i / 2)
>> Array.map snd
>> Array.map (Array.map fst)
>> Array.forall isInitial
``````

This is an interesting function, because we use `mapi` to give us the index and element, then `groupBy` to put them into pairs, then `map` to only get the second element (`x, i`), then `map (map fst)` to give us just the first element from each sub-array. We're left with just `forall isInitial`, which is equivalent to `Array.filter isInitial |> Array.length = 0`. Simple, right?

## Let's appy this to our existing function

So how do we apply this? We need to modify `splitNames` to account for the initial, and for the compound name. To do so, we'll get rid of the `match`, because we're now just going to work our fuzzy logic. First we need to get the parts:

``````let nameParts = ' ' |> name.Split
``````

That's a good start, and it gives us an array of `string` objects representing each part. Previously this was all we needed, but now we need to put them into groupings based on our new rules.

Now we need to "fold" over our names, and collect them into an array of names that are based on our rules. For this we'll build a folder that takes a `string list` and then a `string` and returns a `string list`, with the first element being the last element tested.

The basic idea here is to match `acc` with `prev::rem`, and if `prev` is <= 3 characters, and it's not only initials, then our new `prev` should be `prev part`.

``````nameParts
|> Array.fold (fun acc pt ->
match acc with
| prev::rem
when prev |> Seq.length <= 3
&& prev |> (isOnlyInitials >> not) ->
(sprintf "%s %s" prev pt)::rem
| _ -> pt::acc
) []
``````

## Make it work for Ron, darnit!

Pretty simple, right? So if we test it with a basic data-set: `["John Mac Gregor"; "Rhonda La Rouse"; "John C. Mac Gregor"; "Rhonda A. La Rouse"; "La Ronda Doe"] |> List.map splitNames;;`, we can see that it returns what we expected. Now, we lose one particular case I ran into: `"Do Ray Doe"`, which turns into `(Some "Do Ray", None, Some "Doe")`, when it should have been `(Some "Do", Some "Ray", Some "Doe")`. We also lost support for `Ron Doe`, which returns all `None` because `Ron` is paired with `Doe`. To fix this we'll alter our `when` guard clause:

``````let splitNames (name : string) =
let isInitial : char array -> bool = (function | [|_; '.'|] -> true | _ -> false)
let isOnlyInitials : string -> bool =
Seq.toArray
>> Array.mapi (fun i x -> (x, i))
>> Array.groupBy (fun (x, i) -> i / 2)
>> Array.map snd
>> Array.map (Array.map fst)
>> Array.forall isInitial
let nameParts = ' ' |> name.Split
let nameParts =
nameParts
|> Array.fold (fun acc pt ->
match acc with
| prev::rem
when prev |> Seq.length <= 3
&& prev |> (isOnlyInitials >> not)
&& rem |> List.length >= 1 ->
(sprintf "%s %s" prev pt)::rem
| prev::rem
when prev |> Seq.length <= 2
&& prev |> (isOnlyInitials >> not) ->
(sprintf "%s %s" prev pt)::rem
| _ -> pt::acc
) []
|> Array.ofList
|> Array.rev
match nameParts with
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| [|first; last|] -> (Some first, None, Some last)
| _ -> (None, None, None)
``````

We also added a second `prev::rem` case, the first matches if there's already more than one other word with a length of <= 3 (so `Ron Doe` is fixed), the second matches any if the first word is <= 2 characters (working for `La Ronda`, but not properly for `Bo David`). Now we could add a filter for `Bo` to not match on the `<= 2` guard clause, but that adds a maintainability issue. Instead, we just document that it's 99% accurate, and what cases we know won't work properly.

Today's lesson is somewhat short, but introduces a very complex topic. I recommend that you make sure you fully understand what is happening, as it's important to know. Also, if anyone has a better idea on the `isOnlyInitials` function, I'd be very interested in hearing about it. The grouping by `index / 2` works, but I'm sure there's a better way to do it.

My friend Chris asked on Twitter, if it was possible to load an F# module to PowerShell. For anyone who's heavily familiar with PowerShell, we know that it's very possible to load a .NET assembly to PowerShell, and run methods and operations in it.

To run an F# module in PowerShell, we simply need to add an extra step: load `FSharp.Core.dll` before we try to call our method.

To test this, I built a quick F# project: `PowerShell Test Library`, and added the following code:

``````namespace PowerShell_Test_Library

type TestClass() =
member this.run () = printfn "This is an F# function"
static member runStatic () = printfn "This is a static F# function"
``````

Basic, right? This is just demonstrative, but let's take a peek and see how we can load it. The first step to loading a .NET library into PowerShell is to call `System.Reflection.Assembly.LoadFile`, which is done as follows:

``````[System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\PowerShell_Test_Library.dll")
``````

Yes, you need the full path when using it like this. Next, we load `FSharp.Core.dll`:

``````[System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\FSharp.Core.dll")
``````

We can call our methods as follows:

``````(New-Object PowerShell_Test_Library.TestClass).run()
[PowerShell_Test_Library.TestClass]::runStatic()
``````

So, altogether, our PowerShell is:

``````[System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\PowerShell_Test_Library.dll")
[System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\FSharp.Core.dll")
(New-Object PowerShell_Test_Library.TestClass).run()
[PowerShell_Test_Library.TestClass]::runStatic()
``````

If we run this, we get the following:

``````PS C:\Users\ebrown\Desktop> [System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\PowerShell_Test_Library.dll")

GAC    Version        Location
---    -------        --------
False  v4.0.30319     C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin...

PS C:\Users\ebrown\Desktop> [System.Reflection.Assembly]::LoadFile("C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug\FSharp.Core.dll")

GAC    Version        Location
---    -------        --------
False  v4.0.30319     C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin...

PS C:\Users\ebrown\Desktop> (New-Object PowerShell_Test_Library.TestClass).run()
This is an F# function
PS C:\Users\ebrown\Desktop> [PowerShell_Test_Library.TestClass]::runStatic()
This is a static F# function
PS C:\Users\ebrown\Desktop>
``````

And it's that easy.

For anyone familiar with PowerShell, if your `ExecutionPolicy` does not permit running a script, you can use the following function to run it instead:

``````function Run-Script ([string]\$script)
{
\$policy = Get-ExecutionPolicy
Set-ExecutionPolicy -Force -Scope CurrentUser -ExecutionPolicy Bypass
& ".\\$script"
Set-ExecutionPolicy -Force -Scope CurrentUser -ExecutionPolicy \$policy
}
``````

Drop that in your PowerShell console, and then call `Run-Script ScriptFile.ps1`, which for me is `Run-Script Script.ps1`. This will set and restore your `ExecutionPolicy` to run the script:

``````PS C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug> function Run-Script ([string]\$script)
>> {
>> \$policy = Get-ExecutionPolicy
>> Set-ExecutionPolicy -Force -Scope CurrentUser -ExecutionPolicy Bypass
>> & ".\\$script"
>> Set-ExecutionPolicy -Force -Scope CurrentUser -ExecutionPolicy \$policy
>> }
PS C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug> Run-Script Script.ps1

GAC    Version        Location
---    -------        --------
False  v4.0.30319     C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin...
False  v4.0.30319     C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin...
This is an F# function
This is a static F# function

PS C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug>
``````

And this concludes our main lesson for today. :)

## Make it a little better

Of course, we can make this just a little better.

First, we don't want to have to manually specify the path. So we'll get the current path:

``````\$path = (Get-Item -Path "." -Verbose).FullName
``````

Second, we don't really need to see the `GAC` lines from the libraries being added, we don't really care about them. The easiest way to do this is to redirect the output to `\$null`, via `> \$null`:

``````[System.Reflection.Assembly]::LoadFile("\$path\PowerShell_Test_Library.dll") > \$null
``````

This will make our new script less verbose.

``````PS C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug> Run-Script Script.ps1
This is an F# function
This is a static F# function
PS C:\Users\ebrown\Documents\Visual Studio 2017\Projects\FSharp Tests\PowerShell Test Library\bin\Debug>
``````

It also builds a reusable framework for future scripts: we can replace the first `LoadFile` with our apporpriate file, so we may actually want to abstract that:

``````function Init-Script (\$dll)
{
\$path = (Get-Item -Path "." -Verbose).FullName
}
``````

Then the actual script code is:

``````Init-Script "PowerShell_Test_Library" > \$null
(New-Object PowerShell_Test_Library.TestClass).run()
[PowerShell_Test_Library.TestClass]::runStatic()
``````

And life is much better.

# Knowing the terminology is important

Lesson 8: Let's Normalize Our Data!

Recently I was talking with a colleague of mine, and we were discussing certain syntax differences between two of the major .NET languages: C#, and VB.NET. (Sorry, he's not a fan of F# so there's no love for it there.) One of the issues we came across was the following syntax:

#### VB.NET:

``````Public Property Items As List(Of String)
``````

#### C#:

``````public List<string> Items { get; set; }
``````

In particular, we were concerned with the `(Of String)` and `<string>` bits - if you remember back to lesson 4, I talked about the `<'a>` bit of `let flatten<'a> : 'a array array -> 'a array = Array.collect id`, which was an F# generic-type parameter, or a placeholder for a type. It means you can call essentially replace `'a` with any type (in this case you actually can, because there are no type constraints) and the function should still work. The `(Of String)` and `<string>` bits of the examples above are the same principle: each one is a type substituted for a generic type parameter, that is, the "generic type" of `List` is now a `string`. In VB.NET this is accomplished by the verbose `Of T`, in C# and F# it's accomplished by `<T>`, which hilariously, as my friend Chris points out on Twitter, is pronounced as `Of T` by pretty much any seasoned developer. (This should make things a little easier to understand: a `'a array` is just an `array of 'a`.)

But this brought up an even larger issue - it's important for people to understand the terminology of the system they work in. I can't stress this enough, and if you remember back to that lesson, I even had the following to say:

The `<'a>` is an F# generic type-parameter. Feel free to look those up to learn more.

I really hope you looked it up, because it's important to understand what it means. Often times in any field we get into the habit of using field-specific terms, in this case "generic type parameter", but other terms may be things like "monad" (often found in Haskell documentation), "access(ibility) modifier" (`public/protected/internal/private`, etc.), or even something as simple as a "mouse". (When talking to a computer group, "mouse" often refers to the physical device used to move a cursor on a GUI, but if you are speaking to a verternarian or someone from animal-control, I assure you it's not as clear-cut.)

## The Electronics Example

In fact, recently I was looking for advice on a specific motor to use with a PLC, and I mentioned to a person that I needed a "DC brushless motor" - now if you don't know how "DC" or "brushless" apply to "motor", that sentence can be confusing, misleading, or altogether unable to be understood, but in the context it meant "a device that converts direct-current electric energy into rotary motion, without the use of any brushes." Again, not too confusing, but if you don't understand or incorrectly understand the term, it can be very difficult to rectify the situation. (The particular person I was talking to kept telling me "you can use a car starter" - this is problematic because a car starter is usually a brushed DC motor - I specifically needed brushless.)

So, I want to take a moment today to, instead of writing code, go over some of the common (and hopefully some of the less common) terms you might run into in programming. Some of which I'll describe, and some of which I'll tell you the term, but expect you to research it.

## Some Terms

The first set of terms is pretty basic, these are usually basic types in a programming language, or can be trivially represented.

• Method: a section of programming code with an identifier and optionally parameters that may be called by other code, and may or may not return a value;
• Function: a method that always returns a value;
• Subroutine: a method that never returns a value, or returns a `void`-type;
• Boolean or Bool: a value that has two states: true or false;
• String: a sequence of characters that represent a section of text;
• Integer: a whole number (that is, a number with no fractional portions);
• Single-Precision Floating-point Value, Single or Float: a number with a fractional part, often represented as a 32-bit value (Do note that some languages treat a `float` as a `double`, research the language ahead of time to make sure you understand which is the case);
• Double-Precision Floating-point Value or Double: a number with a fractional part, often represented as a 64-bit value;
• Char: a single character, often represented as an 8-bit, 16-bit or 32-bit value (Different languages and systems use different endian-ness, bit-sizes and encodings for `char`, your specific details may vary);
• Byte: an 8-bit value either signed or unsigned depending on language;
• Endian-ness: the order in which bits and bytes appear, this may be most-significant value first, or least-significant value first (see big-endian and little-endian);
• Bit: see Boolean, this is a single value that is represented by either a `0` or `1`;
• Signed/Unsigned: whether a number has the ability to represent values below 0 (`signed`) or may only represent values above 0 (`unsigned`);
• Short or Int16: typically a 16-bit integer, either signed or unsigned;
• Long or Int64: typically a 64-bit integer, either signed or unsigned;
• Word: this is not a "word" in the sense of a sentence, but is typically an alias for a "short" or "Int16" - it is two bytes, always (I separated this and the next two from the number types closest to them because they have a literal translation: they are always the specified number of bytes - often times platforms and languages will indicate that a "numeric" type is actually one of these, such as "int" typically being a "DWord");
• DWord or Double Word: two words (4 bytes);
• QWord or Quad Word: four words (8 bytes);

Regarding numbers: often times languages provide number types in a pair, such as `sbyte` and `byte` for "signed byte" and "unsigned byte", or `short` and `ushort` for "signed short" and "unsigned short", respectively. This often creates confusion between languages, and to make matters worse C and C++ don't always use the same bit-length for each data-type. (On some platforms `int` in C may be 8, 16, or 32 bits.) The terminology I'm using is referring to the .NET types, for the most part, and it also refers in general on a broader level. (An `Int16` is the same 16-bits no matter what, and most people in the software development industry understand `short` is an alias for `Int16`.)

Some more types and type-modifiers:

• Class: often refers to a reference-type that is a collection of related fields, properties, methods and events;
• Struct: often refers to a value-type that is a collection of related fields, properties, methods and events;
• Reference-Type: a type that is, by default, treated as a reference rather than an actual value, this is similar but not the same as a pointer - the value of the reference type is a memory-address that points to the actual values;
• Value-Type: a type that is, by default, treated as an actual value, that is, a single continuous grouping of memory that represents a single result (things like "numbers" - one would never expect `int value = 0` to be a pointer to a memory location holding an "object" with the value `0` - it's just expected that `value` is now the literal 0);
• Interface, Protocol or Contract: usually refers to a collection of properties, methods and events which a class or struct intends to fully implement;
• List: may either refer to a "linked-list" or a "non-linked list" of values, usually of the same type - the list often has a "count" of the number of items contained, and often methods like "add", "remove" and "contains" to indicate if an element is in the list;
• Sequence or Enumerable: any sequence of values, usually of the same type, that are in order but may not always be known ahead-of-time - these are often "lazily" processed, in that only the current element of the sequence is known, the next element may or may not actually have been found already - usually a `List` can be treated like a `Sequence`, as a `List` is inherently a sequence of values;
• Array: a set of values, usually of the same type, that are directly indexable and statically referenced in memory - a string, for example, can be seen as an "array of characters" - these may or may not expose a "count";
• Access(ibility) Modifier: a keyword that indicates what "access level" a class, interface, enumeration, property, field, event or method allows;
• Base Class: in the concept of OOP and inheritance, the class that is the next level up the heirarchy;
• Abstract Class: a class that is not concrete, but may provide some portions of an implementation of - may only be inherited by other classes;

• OOP or Object-Oriented Programming: designing programs around the model of data-flow being done through inheritance, and polymorphism (often statement-based);
• Functional Programming: designing programs around the model of data-flow being done through functions and functional composition, rather than classes or inheritence (often expression-based);
• Expression: a series of instructions that evaluate to a value (such as `1+1`);
• Statement: typically an "expression" that doesn't return a result (a call to a method, for example, with a `void` return type);
• Polymorphism - look this one up on your own, far too long to describe properly here;
• Instruction: typically refers to a basic operation that may be performed on a lower-level than usual programming;

Some basic algorithms you should research:

• Binary Search
• Binary Search Tree
• Heap Sort
• Quicksort
• Insertion Sort

I'll probably be updating this list as time goes on, but at the very least these are some terms you should get mostly acquainted with.

Todays post is (obviously) more about the higher-level aspects of programming, and less about doing something. I should have convered this information previously, but I didn't. This will become a reference document for future ideas as well (anything I explain here I'll not explain in the future), so keep updated on it. As we've progressed quickly, we'll continue to do so. I'm largely out of content for this series, we'll do a few updates to our existing program to make it slightly more effective and easier to follow, but I hope to have a new adventure worked out by Monday.

# Let's normalize our data!

Lesson 7: Working out our Step 2 Process
Lesson 9: Knowing the Terminology is Important

Finally, time to write out our step 2 code. This has been a bit of a lengthy journey so far, and most of that is my fault, but it's finally time to write some code to do what we want. I'm going to throw a lot at you very quickly, but it's important to not lose focus. If you have questions, do some web-searches while you're thinking of them, it'll help you stay in the mindset I hope.

The first step here is to define a type, or a few types. We need 4 to boot.

``````type NameMap =
| Combined of c : int
| Split of firstName : int * middleName : int option * lastName : int

type MagicIdMap =
| Combined of c : int
| Split of ids : int array

Sheet : ISheet
Name : NameMap
Phone : int
Email : int
MagicIds : MagicIdMap
}

type User = {
FirstName : string
MiddleName : string option
LastName : string
Email : string
MagicId : int
}
``````

You should understand what those are by now, so I won't go into any detail. We're still using NPOI, we're still using F#, and we're still tyring to be as idiomatic as possible.

## Some boilerplate

You should also remember this little guy:

``````let getVal (cell : ICell) : obj option =
match cell with
| null -> None
| _ ->
match cell.CellType with
| CellType.Boolean -> cell.BooleanCellValue |> box |> Some
| CellType.Numeric -> cell.NumericCellValue |> box |> Some
| CellType.String -> cell.StringCellValue |> box |> Some
| _ -> None
``````

And we're also going to add a few other helpers:

``````let startsWith v (x : string) = v |> x.StartsWith

let atLeastOne items =
if items |> Seq.isEmpty then [None] |> Seq.ofList
else items

let colIndex (cell : ICell) = cell.ColumnIndex
``````

None of which should be difficult. The `atLeastone` will always make sure a `Seq` is not empty, so that we can simplify tests later. We need to be able to filter out cells based on what value they have, so we have this function:

``````let cellWhere pred (row : IRow) =
row.Cells
|> (Seq.map (fun c -> (c, c |> (getVal >> Option.map Object.toStr)))
>> Seq.filter (fun (c, v) -> v |> (function | None -> false | Some n -> n |> pred))
>> Seq.map (fun (c, v) -> v |> Option.map (fun y -> c))
>> atLeastOne)
``````

Interestingly, this will give us an `ICell option sequence`, which means we can then `Seq.choose` on it if we need, or in the case of only one result we can `Seq.head`. We need to filter our header out first, so we want something like the following:

``````let mapHeader (sheet, row) =
match row |> getName, row |> getPhone, row |> getDateAdded, row |> getEmail, row |> getMagicIds with
| Some name, Some phone, Some dateAdded, Some email, Some magicIds ->
{ Sheet = sheet
Name = name
Phone = phone
Email = email
MagicIds = magicIds }
|> Some
| _ -> None
``````

This is pretty simple: match the result of the `get___` function for each record property with `option`, and if we have `Some` of everything then return it. This doesn't care what type anything is for the most part, the `get___` function will decide how the row maps to the type.

## Grab some values

For `Phone`, `DateAdded`, and `Email` it's simple:

``````let getPhone row =
match row |> (cellWhere ((=) "Phone Number") >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None

| Some cell -> cell |> (colIndex >> Some)
| _ -> None

let getEmail row =
match row |> (cellWhere ((=) "Email") >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None
``````

In fact, we could replace all these with the same method:

``````let getCol col row =
match row |> (cellWhere ((=) col) >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None

let getPhone : IRow -> int option = getCol "Phone Number"
let getEmail : IRow -> int option = getCol "Email"
``````

So that's what we'll do. Awesome, simple and efficient. Avoids repetiveness, and keeps us in check. The next step is either `Magic ID` or `Name`. Since the `Magic ID` is easier, we'll do the name first.

``````let getName row =
match row |> (cellWhere ((=) "Name") >> Seq.head),
row |> (cellWhere ((=) "First Name") >> Seq.head),
row |> (cellWhere ((=) "Middle Name") >> Seq.head),
row |> (cellWhere ((=) "Last Name") >> Seq.head) with
| Some cell, _, _, _ -> cell |> (colIndex >> NameMap.Combined >> Some)
| _, Some f, m, Some l ->
NameMap.Split (f |> colIndex, m |> Option.map colIndex, l |> colIndex)
|> Some
| _ -> None
``````

Oh, that's not so bad. If this was the harder of the two, that can only mean that `Magic ID` must be pretty simple.

``````let getMagicIds row =
match row |> (cellWhere ((=) "Magic ID") >> Seq.head), row |> cellWhere (startsWith "Magic ID") with
| Some cell, _ -> cell |> (colIndex >> Combined >> Some)
| _, cells when cells |> Seq.isEmpty = false -> cells |> (Seq.choose id >> Seq.map colIndex >> Seq.toArray >> Split >> Some)
| _ -> None
``````

Dang, is that it? Do note that we're actually applying the `str1 = str2` function as a partial application, which gives us:

• Shorter, cleaner syntax;
• The potential for higher-level optimizations;
• An easier process to follow;

You'll also notice I've been doing `x |> (f >> g)` instead of `x |> f |> g`, this is a slightly longer syntax (two parenthesis added), but it makes things clearer to me: the operation is one, continuous operation, there's no partial flow here. The `f >> g` composition is a single operation, and it makes it easy to abstract that out to a new function if necessary.

Alright, so we've built the header, the next step is to map them:

``````let range uBound = [0..uBound - 1]
let getRow i (x : ISheet) = i |> x.GetRow
let testFileName = "C:\Users\ebrown\Desktop\Step2_Testing.xlsx"
let workbook =
use fs = new FileStream(testFileName, FileMode.Open, FileAccess.Read)
XSSFWorkbook(fs)
workbook.NumberOfSheets
|> (range
>> Seq.choose (workbook.GetSheetAt >> Option.ofObj)
>> Seq.map (fun sheet -> (sheet, sheet |> (getRow 0 >> Option.ofObj)))
>> Seq.choose (fun (sheet, r1) -> r1 |> Option.map(fun r -> (sheet, r)))
``````

Now our `headers` value will be a `Header sequence`, which allows us to do all sorts of stuff with them, including `headers |> Seq.iter (fun header -> printfn "%s: %A" header.Sheet.SheetName header)`. A debug print, pretty simple, no?

## Get some data from our spreadsheet

Now we get to a slightly more difficult task: how do we get the users? Well first, I want to define the syntax I expect from our API:

``````let users =
|> Seq.map (getUsers workbook)
``````

Ok, that's reasonable. We expect to call a `getUsers` function which is an `IWorkbook -> Header -> User array`. We actually have most of the structure built, so we'll go back and define our `getUsers` function.

As I was defining it, I realized the addition of `workbook` to the `getUsers` call is superfluous: our `header` already has a reference to `ISheet`, which is all we need here, so we'll drop that call and it becomes a basic one-liner:

``````let users = header |> Seq.map getUsers
``````

Rightey-o then, let's move on. For `getUsers` we need to do a couple things:

1. We need to collect each row from the `ISheet`;
2. We need to split each row into the relevant user-array;
3. We need to flatten the result;

This is somewhat easy, since we know the steps:

``````let getUsers header =
[1..sheet.LastRowNum]
|> (Seq.choose (sheet.GetRow >> Option.ofObj)
>> Seq.toArray
>> Array.flatten
>> Array.map separateUsers
>> Array.flatten)
``````

Easy enough, nothing too complex here. Let's build the `getUserRows` function first. We'll actually build a new `type` here, to hold our intermediate row. (This helps keep each function short and clean.)

``````type UserRow = {
FirstName : string
MiddleName : string option
LastName : string
Phone : string option
Email : string option
MagicId : string
}
``````

Keep in mind that this is an intermediate record type - we don't care as much about our values, we're going to transform this to the final type in the `separateUsers` function. Now `getUserRows` is responsible for doing our name transformation, which we agreed upon before, and it's responsible for splitting a `MagicIdMap.Split` into multiple rows.

## Get the Cell values

From here we know we need a couple other things, one of which is the ability to get a value of a cell:

``````let getColVal c (row : IRow) = c |> row.GetCell |> getVal
``````

And we need to `separateUsers`, which is the easy part:

``````let separateUsers user =
user.MagicId
|> String.splitOpt (StringSplitOptions.RemoveEmptyEntries) ' '
|> Seq.map(fun id ->
{ User.FirstName = user.FirstName
MiddleName = user.MiddleName
LastName = user.LastName
Email = user.Email |> (function | None -> String.Empty | Some v -> v)
MagicId = id |> int })
``````

Pretty basic, iterate our `MagicId` split string and work out what they are.

## Build the massive, ugly function

Then we want to write `getUserRows`, which is long and quite possibly disobeys some best-practices, but it does exactly what we want, and well:

``````let getUserRows header (row : IRow) : UserRow seq option =
let splitNames (name : string) =
match ' ' |> name.Split with
| [|first; last|] -> (Some first, None, Some last)
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| _ -> (None, None, None)

let buildUserVals first middle last phone dateAdded email magicId =
{ UserRow.FirstName = first
MiddleName = middle
LastName = last
Phone = phone |> Option.map Object.toStr
Email = email |> Option.map Object.toStr
MagicId = magicId |> Object.toStr }

let finalResult first middle last =
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
|> Some
| _ -> None)
|> Some

| NameMap.Split (f, m, l) ->
match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
row |> getColVal f |> Option.map (Object.toStr),
m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
row |> getColVal l |> Option.map (Object.toStr) with
| Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
| _, Some f, m, Some l -> finalResult f m l
| _ -> None
| NameMap.Combined n ->
match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
| Some (Some f, m, Some l) -> finalResult f m l
| _ -> None
``````

So what's happening here? Well, the first function inside here is `splitNames`, which is trivial:

``````let splitNames (name : string) =
match ' ' |> name.Split with
| [|first; last|] -> (Some first, None, Some last)
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| _ -> (None, None, None)
``````

If we split the name into two pieces, then return the two pieces as the first and third tuple elements. If we split it into three pieces, then return `first, middle, last`. We'll assume that names are fairly standard in the manner they're non-standard.

Next' we have this:

``````let buildUserVals first middle last phone dateAdded email magicId =
{ UserRow.FirstName = first
MiddleName = middle
LastName = last
Phone = phone |> Option.map Object.toStr
Email = email |> Option.map Object.toStr
MagicId = magicId |> Object.toStr }
``````

This is basically a function which takes our relevant paraneters, and builds the final `UserRow` value from them. Pretty easy, nothing complex.

The `finalResult` starts getting hinky:

``````let finalResult first middle last =
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
|> Some
| _ -> None)
|> Some
``````

So this tests the `MagicIds` against the `MagicIdMap`, if it's `Split` then we're returning the array of users immediately. If it's `Combined` then we're returning a one-element array, built from the remaining values.

In hindsight, we really should clean some of this up to make it simpler:

``````let finalResult first middle last =
let phone = row |> getColVal header.Phone
let email = row |> getColVal header.Email
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
[|buildUserVals first middle last phone dateAdded email magicId|]
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
buildUserVals first middle last phone dateAdded email magicId
|> Some
| _ -> None)
|> Some
``````

That's better, we don't need to inline our `phone`, `dateAdded` and `email` data.

Finally, the `match` which is executed:

``````match userRow.Name with
| NameMap.Split (f, m, l) ->
match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
row |> getColVal f |> Option.map (Object.toStr),
m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
row |> getColVal l |> Option.map (Object.toStr) with
| Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
| _, Some f, m, Some l -> finalResult f m l
| _ -> None
| NameMap.Combined n ->
match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
| Some (Some f, m, Some l) -> finalResult f m l
| _ -> None
``````

This just tests the name information, and does the relevant transformations. You'll notice the first sub-`match` tests four values, that's because I'm lazy and refuse to put yet another sub-`match` expression in. Finally, we can run and test our code:

``````let users = headers |> Seq.map getUsers
users |> Seq.iter (printfn "%A")
``````

This should write some output like the following:

``````seq
[{FirstName = "John";
MiddleName = Some "C.";
LastName = "2";
Email = "email@example.com";
MagicId = 287;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "2";
Email = "otheremail@example.com";
MagicId = 331;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "2";
Email = "otheremail@example.com";
MagicId = 334;}; {FirstName = "Ron";
MiddleName = null;
LastName = "Doe";
Email = "";
MagicId = 872;}; ...]
seq
[{FirstName = "John";
MiddleName = Some "C.";
LastName = "Doe";
Email = "email@example.com";
MagicId = 287;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "Doe";
Email = "otheremail@example.com";
MagicId = 331;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "Doe";
Email = "otheremail@example.com";
MagicId = 334;}; {FirstName = "Ron";
MiddleName = null;
LastName = "Doe";
Email = "";
MagicId = 872;}; ...]
``````

## Finally, save our data

Finally, we need to save our data. This is a pretty easy step, we'll modify `getUsers` just a little, and then user our `users` sequence to finish things off:

``````let getUsers header =
[1..sheet.LastRowNum]
|> (Seq.choose (sheet.GetRow >> Option.ofObj)
>> Seq.flatten
>> Seq.map separateUsers
>> Seq.flatten))

let users = headers |> Seq.map getUsers
``````

As you can see, the only modification is to have it map users and a header together, then we'll drop it all to a new workbook and sheet, and save.

Saving things is actually quite simple, we define a few helpers to set values, then we build a `save` function that takes our `header` and `users` and saves them to the new workbook:

``````let setCellValueOptStr (cell : ICell) : string option -> unit =
Option.map cell.SetCellValue >> ignore
let setCellValueStr (cell : ICell) : string -> unit =
cell.SetCellValue >> ignore
let saveSingle (row : IRow) (user : User) =
(0 |> row.CreateCell, user.FirstName) ||> setCellValueStr
(1 |> row.CreateCell, user.MiddleName) ||> setCellValueOptStr
(2 |> row.CreateCell, user.LastName) ||> setCellValueStr
(3 |> row.CreateCell, user.Email) ||> setCellValueStr
(4 |> row.CreateCell, user.MagicId |> string) ||> setCellValueStr
let saveHeader (row : IRow) =
(0 |> row.CreateCell, "First Name") ||> setCellValueStr
(1 |> row.CreateCell, "Middle Name") ||> setCellValueStr
(2 |> row.CreateCell, "Last Name") ||> setCellValueStr
(3 |> row.CreateCell, "Email") ||> setCellValueStr
(4 |> row.CreateCell, "Magic ID") ||> setCellValueStr
``````

The helpers make it easy to keep items separated so that concerns are only on their own scope.

``````let save header users =
let newBook = XSSFWorkbook()
let newSheetName = "Process_Spec"
let newSheet = newSheetName |> newBook.CreateSheet
let savedRows =
users
|> Seq.fold (fun row user ->
let newRow = row |> newSheet.CreateRow
(newRow, user) ||> saveSingle
row + 1) 1
let datePart = System.DateTime.Now.ToString("yyyy-MM-dd")
let newBookFile =
sprintf "%s\%s-%s%s" fileInfo.DirectoryName newFile datePart fileInfo.Extension
use fw = File.Create(newBookFile)
newBook.Write(fw)
printfn "Saved to %s" newBookFile
savedRows
``````

The `sprintf` is similar to `printf` except that it returns the format string instead of printing directly.

We can save our data with a single line: `users |> Seq.iter (fun value -> value ||> save |> ignore)`. This will run through the entire pricess at this point and then save it all. (The `let users = headers |> Seq.map getUsers` doesn't do anything other than build a chain of `Seq` functions to create our result.)

And bam, we're done. Overally we have a hundred or so lines of code, and they completely normalize our example data. The next lesson will try to normalize things a bit better, and take some other oddities into account (such as names like "La Rouse", "Mac Gregor", etc.).

As we conclude this lesson (which, let's face it, is mostly a code-dump) I have been reflecting on my experiences with the IT world, software development, F#, C#, PHP, and other languages and experiences, and reminding myself that everything I've been working towards, building this lesson series, developing products for my company, building software to test my own knowledge, and even sharing some experiences with others have been strongly defined to push myself towards teaching and sharing knowledge with others. I realized (not so long ago) that the only thing I truly enjoy doing is helping others succeed, so I've been working towards that day-by-day for years. This series itself is a testament to that: I'm not expecting to get anything out of it other than hoping at least one person finds something here userful. That's all I want, and all I ask. If you find something useful, reach out on Twitter or email me. One of my (many) emails is elliott [dot] brown or ebrown (either work), at this very domain. (Robots should have a hard time with that, but I have faith in the people reading this to be able to parse it out.)