using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 9)

Knowing the terminology is important

Lesson 8: Let's Normalize Our Data!

Recently I was talking with a colleague of mine, and we were discussing certain syntax differences between two of the major .NET languages: C#, and VB.NET. (Sorry, he's not a fan of F# so there's no love for it there.) One of the issues we came across was the following syntax:

VB.NET:

Public Property Items As List(Of String)

C#:

public List<string> Items { get; set; }

In particular, we were concerned with the (Of String) and <string> bits - if you remember back to lesson 4, I talked about the <'a> bit of let flatten<'a> : 'a array array -> 'a array = Array.collect id, which was an F# generic-type parameter, or a placeholder for a type. It means you can call essentially replace 'a with any type (in this case you actually can, because there are no type constraints) and the function should still work. The (Of String) and <string> bits of the examples above are the same principle: each one is a type substituted for a generic type parameter, that is, the "generic type" of List is now a string. In VB.NET this is accomplished by the verbose Of T, in C# and F# it's accomplished by <T>, which hilariously, as my friend Chris points out on Twitter, is pronounced as Of T by pretty much any seasoned developer. (This should make things a little easier to understand: a 'a array is just an array of 'a.)

But this brought up an even larger issue - it's important for people to understand the terminology of the system they work in. I can't stress this enough, and if you remember back to that lesson, I even had the following to say:

The <'a> is an F# generic type-parameter. Feel free to look those up to learn more.

I really hope you looked it up, because it's important to understand what it means. Often times in any field we get into the habit of using field-specific terms, in this case "generic type parameter", but other terms may be things like "monad" (often found in Haskell documentation), "access(ibility) modifier" (public/protected/internal/private, etc.), or even something as simple as a "mouse". (When talking to a computer group, "mouse" often refers to the physical device used to move a cursor on a GUI, but if you are speaking to a verternarian or someone from animal-control, I assure you it's not as clear-cut.)

The Electronics Example

In fact, recently I was looking for advice on a specific motor to use with a PLC, and I mentioned to a person that I needed a "DC brushless motor" - now if you don't know how "DC" or "brushless" apply to "motor", that sentence can be confusing, misleading, or altogether unable to be understood, but in the context it meant "a device that converts direct-current electric energy into rotary motion, without the use of any brushes." Again, not too confusing, but if you don't understand or incorrectly understand the term, it can be very difficult to rectify the situation. (The particular person I was talking to kept telling me "you can use a car starter" - this is problematic because a car starter is usually a brushed DC motor - I specifically needed brushless.)

So, I want to take a moment today to, instead of writing code, go over some of the common (and hopefully some of the less common) terms you might run into in programming. Some of which I'll describe, and some of which I'll tell you the term, but expect you to research it.

Some Terms

The first set of terms is pretty basic, these are usually basic types in a programming language, or can be trivially represented.

  • Method: a section of programming code with an identifier and optionally parameters that may be called by other code, and may or may not return a value;
  • Function: a method that always returns a value;
  • Subroutine: a method that never returns a value, or returns a void-type;
  • Boolean or Bool: a value that has two states: true or false;
  • String: a sequence of characters that represent a section of text;
  • Integer: a whole number (that is, a number with no fractional portions);
  • Single-Precision Floating-point Value, Single or Float: a number with a fractional part, often represented as a 32-bit value (Do note that some languages treat a float as a double, research the language ahead of time to make sure you understand which is the case);
  • Double-Precision Floating-point Value or Double: a number with a fractional part, often represented as a 64-bit value;
  • Char: a single character, often represented as an 8-bit, 16-bit or 32-bit value (Different languages and systems use different endian-ness, bit-sizes and encodings for char, your specific details may vary);
  • Byte: an 8-bit value either signed or unsigned depending on language;
  • Endian-ness: the order in which bits and bytes appear, this may be most-significant value first, or least-significant value first (see big-endian and little-endian);
  • Bit: see Boolean, this is a single value that is represented by either a 0 or 1;
  • Signed/Unsigned: whether a number has the ability to represent values below 0 (signed) or may only represent values above 0 (unsigned);
  • Short or Int16: typically a 16-bit integer, either signed or unsigned;
  • Long or Int64: typically a 64-bit integer, either signed or unsigned;
  • Word: this is not a "word" in the sense of a sentence, but is typically an alias for a "short" or "Int16" - it is two bytes, always (I separated this and the next two from the number types closest to them because they have a literal translation: they are always the specified number of bytes - often times platforms and languages will indicate that a "numeric" type is actually one of these, such as "int" typically being a "DWord");
  • DWord or Double Word: two words (4 bytes);
  • QWord or Quad Word: four words (8 bytes);

Regarding numbers: often times languages provide number types in a pair, such as sbyte and byte for "signed byte" and "unsigned byte", or short and ushort for "signed short" and "unsigned short", respectively. This often creates confusion between languages, and to make matters worse C and C++ don't always use the same bit-length for each data-type. (On some platforms int in C may be 8, 16, or 32 bits.) The terminology I'm using is referring to the .NET types, for the most part, and it also refers in general on a broader level. (An Int16 is the same 16-bits no matter what, and most people in the software development industry understand short is an alias for Int16.)

Some more types and type-modifiers:

  • Class: often refers to a reference-type that is a collection of related fields, properties, methods and events;
  • Struct: often refers to a value-type that is a collection of related fields, properties, methods and events;
  • Reference-Type: a type that is, by default, treated as a reference rather than an actual value, this is similar but not the same as a pointer - the value of the reference type is a memory-address that points to the actual values;
  • Value-Type: a type that is, by default, treated as an actual value, that is, a single continuous grouping of memory that represents a single result (things like "numbers" - one would never expect int value = 0 to be a pointer to a memory location holding an "object" with the value 0 - it's just expected that value is now the literal 0);
  • Interface, Protocol or Contract: usually refers to a collection of properties, methods and events which a class or struct intends to fully implement;
  • List: may either refer to a "linked-list" or a "non-linked list" of values, usually of the same type - the list often has a "count" of the number of items contained, and often methods like "add", "remove" and "contains" to indicate if an element is in the list;
  • Sequence or Enumerable: any sequence of values, usually of the same type, that are in order but may not always be known ahead-of-time - these are often "lazily" processed, in that only the current element of the sequence is known, the next element may or may not actually have been found already - usually a List can be treated like a Sequence, as a List is inherently a sequence of values;
  • Array: a set of values, usually of the same type, that are directly indexable and statically referenced in memory - a string, for example, can be seen as an "array of characters" - these may or may not expose a "count";
  • Access(ibility) Modifier: a keyword that indicates what "access level" a class, interface, enumeration, property, field, event or method allows;
  • Base Class: in the concept of OOP and inheritance, the class that is the next level up the heirarchy;
  • Abstract Class: a class that is not concrete, but may provide some portions of an implementation of - may only be inherited by other classes;

Paradigms and paradigm-related terms:

  • OOP or Object-Oriented Programming: designing programs around the model of data-flow being done through inheritance, and polymorphism (often statement-based);
  • Functional Programming: designing programs around the model of data-flow being done through functions and functional composition, rather than classes or inheritence (often expression-based);
  • Expression: a series of instructions that evaluate to a value (such as 1+1);
  • Statement: typically an "expression" that doesn't return a result (a call to a method, for example, with a void return type);
  • Polymorphism - look this one up on your own, far too long to describe properly here;
  • Instruction: typically refers to a basic operation that may be performed on a lower-level than usual programming;

Some basic algorithms you should research:

  • Binary Search
  • Binary Search Tree
  • Heap Sort
  • Quicksort
  • Insertion Sort

I'll probably be updating this list as time goes on, but at the very least these are some terms you should get mostly acquainted with.


Todays post is (obviously) more about the higher-level aspects of programming, and less about doing something. I should have convered this information previously, but I didn't. This will become a reference document for future ideas as well (anything I explain here I'll not explain in the future), so keep updated on it. As we've progressed quickly, we'll continue to do so. I'm largely out of content for this series, we'll do a few updates to our existing program to make it slightly more effective and easier to follow, but I hope to have a new adventure worked out by Monday.

Getting started with programming and getting absolutely nowhere (Part 8)

Let's normalize our data!

Lesson 7: Working out our Step 2 Process
Lesson 9: Knowing the Terminology is Important

Finally, time to write out our step 2 code. This has been a bit of a lengthy journey so far, and most of that is my fault, but it's finally time to write some code to do what we want. I'm going to throw a lot at you very quickly, but it's important to not lose focus. If you have questions, do some web-searches while you're thinking of them, it'll help you stay in the mindset I hope.

The first step here is to define a type, or a few types. We need 4 to boot.

type NameMap =
    | Combined of c : int
    | Split of firstName : int * middleName : int option * lastName : int

type MagicIdMap =
    | Combined of c : int
    | Split of ids : int array

type Header = {
    Sheet : ISheet
    Name : NameMap
    Phone : int
    DateAdded : int
    Email : int
    MagicIds : MagicIdMap
}

type User = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : int
}

You should understand what those are by now, so I won't go into any detail. We're still using NPOI, we're still using F#, and we're still tyring to be as idiomatic as possible.

Some boilerplate

You should also remember this little guy:

let getVal (cell : ICell) : obj option = 
    match cell with
    | null -> None
    | _ ->
        match cell.CellType with
        | CellType.Boolean -> cell.BooleanCellValue |> box |> Some
        | CellType.Numeric -> cell.NumericCellValue |> box |> Some
        | CellType.String -> cell.StringCellValue |> box |> Some
        | _ -> None

And we're also going to add a few other helpers:

let startsWith v (x : string) = v |> x.StartsWith

let atLeastOne items =
    if items |> Seq.isEmpty then [None] |> Seq.ofList
    else items

let colIndex (cell : ICell) = cell.ColumnIndex

None of which should be difficult. The atLeastone will always make sure a Seq is not empty, so that we can simplify tests later. We need to be able to filter out cells based on what value they have, so we have this function:

let cellWhere pred (row : IRow) =
    row.Cells
    |> (Seq.map (fun c -> (c, c |> (getVal >> Option.map Object.toStr)))
    >> Seq.filter (fun (c, v) -> v |> (function | None -> false | Some n -> n |> pred))
    >> Seq.map (fun (c, v) -> v |> Option.map (fun y -> c))
    >> atLeastOne)

Interestingly, this will give us an ICell option sequence, which means we can then Seq.choose on it if we need, or in the case of only one result we can Seq.head. We need to filter our header out first, so we want something like the following:

let mapHeader (sheet, row) =
    match row |> getName, row |> getPhone, row |> getDateAdded, row |> getEmail, row |> getMagicIds with
    | Some name, Some phone, Some dateAdded, Some email, Some magicIds ->
        { Sheet = sheet
          Name = name
          Phone = phone
          DateAdded = dateAdded
          Email = email
          MagicIds = magicIds }
        |> Some
    | _ -> None

This is pretty simple: match the result of the get___ function for each record property with option, and if we have Some of everything then return it. This doesn't care what type anything is for the most part, the get___ function will decide how the row maps to the type.

Grab some values

For Phone, DateAdded, and Email it's simple:

let getPhone row =
    match row |> (cellWhere ((=) "Phone Number") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getDateAdded row =
    match row |> (cellWhere ((=) "Date Added") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getEmail row =
    match row |> (cellWhere ((=) "Email") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

In fact, we could replace all these with the same method:

let getCol col row =
    match row |> (cellWhere ((=) col) >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getPhone : IRow -> int option = getCol "Phone Number"
let getDateAdded : IRow -> int option = getCol "Date Added"        
let getEmail : IRow -> int option = getCol "Email"

So that's what we'll do. Awesome, simple and efficient. Avoids repetiveness, and keeps us in check. The next step is either Magic ID or Name. Since the Magic ID is easier, we'll do the name first.

let getName row =
    match row |> (cellWhere ((=) "Name") >> Seq.head),
          row |> (cellWhere ((=) "First Name") >> Seq.head),
          row |> (cellWhere ((=) "Middle Name") >> Seq.head),
          row |> (cellWhere ((=) "Last Name") >> Seq.head) with
    | Some cell, _, _, _ -> cell |> (colIndex >> NameMap.Combined >> Some)
    | _, Some f, m, Some l ->
        NameMap.Split (f |> colIndex, m |> Option.map colIndex, l |> colIndex)
        |> Some
    | _ -> None

Oh, that's not so bad. If this was the harder of the two, that can only mean that Magic ID must be pretty simple.

let getMagicIds row =
    match row |> (cellWhere ((=) "Magic ID") >> Seq.head), row |> cellWhere (startsWith "Magic ID") with
    | Some cell, _ -> cell |> (colIndex >> Combined >> Some)
    | _, cells when cells |> Seq.isEmpty = false -> cells |> (Seq.choose id >> Seq.map colIndex >> Seq.toArray >> Split >> Some)
    | _ -> None

Dang, is that it? Do note that we're actually applying the str1 = str2 function as a partial application, which gives us:

  • Shorter, cleaner syntax;
  • The potential for higher-level optimizations;
  • An easier process to follow;

You'll also notice I've been doing x |> (f >> g) instead of x |> f |> g, this is a slightly longer syntax (two parenthesis added), but it makes things clearer to me: the operation is one, continuous operation, there's no partial flow here. The f >> g composition is a single operation, and it makes it easy to abstract that out to a new function if necessary.

Map the headers

Alright, so we've built the header, the next step is to map them:

let range uBound = [0..uBound - 1]
let getRow i (x : ISheet) = i |> x.GetRow
let testFileName = "C:\Users\ebrown\Desktop\Step2_Testing.xlsx"
let workbook =
    use fs = new FileStream(testFileName, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)
let headers =
    workbook.NumberOfSheets
    |> (range
    >> Seq.choose (workbook.GetSheetAt >> Option.ofObj)
    >> Seq.map (fun sheet -> (sheet, sheet |> (getRow 0 >> Option.ofObj)))
    >> Seq.choose (fun (sheet, r1) -> r1 |> Option.map(fun r -> (sheet, r)))
    >> Seq.choose mapHeader)

Now our headers value will be a Header sequence, which allows us to do all sorts of stuff with them, including headers |> Seq.iter (fun header -> printfn "%s: %A" header.Sheet.SheetName header). A debug print, pretty simple, no?

Get some data from our spreadsheet

Now we get to a slightly more difficult task: how do we get the users? Well first, I want to define the syntax I expect from our API:

let users =
    headers
    |> Seq.map (getUsers workbook)

Ok, that's reasonable. We expect to call a getUsers function which is an IWorkbook -> Header -> User array. We actually have most of the structure built, so we'll go back and define our getUsers function.

As I was defining it, I realized the addition of workbook to the getUsers call is superfluous: our header already has a reference to ISheet, which is all we need here, so we'll drop that call and it becomes a basic one-liner:

let users = header |> Seq.map getUsers

Rightey-o then, let's move on. For getUsers we need to do a couple things:

  1. We need to collect each row from the ISheet;
  2. We need to split each row into the relevant user-array;
  3. We need to flatten the result;

This is somewhat easy, since we know the steps:

let getUsers header =
    let sheet = header.Sheet
    [1..sheet.LastRowNum]
    |> (Seq.choose (sheet.GetRow >> Option.ofObj)
    >> Seq.choose (getUserRows header)
    >> Seq.toArray
    >> Array.flatten
    >> Array.map separateUsers
    >> Array.flatten)

Easy enough, nothing too complex here. Let's build the getUserRows function first. We'll actually build a new type here, to hold our intermediate row. (This helps keep each function short and clean.)

type UserRow = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Phone : string option
    DateAdded : string option
    Email : string option
    MagicId : string
}

Keep in mind that this is an intermediate record type - we don't care as much about our values, we're going to transform this to the final type in the separateUsers function. Now getUserRows is responsible for doing our name transformation, which we agreed upon before, and it's responsible for splitting a MagicIdMap.Split into multiple rows.

Get the Cell values

From here we know we need a couple other things, one of which is the ability to get a value of a cell:

let getColVal c (row : IRow) = c |> row.GetCell |> getVal

And we need to separateUsers, which is the easy part:

let separateUsers user =
    user.MagicId
    |> String.splitOpt (StringSplitOptions.RemoveEmptyEntries) ' '
    |> Seq.map(fun id ->
        { User.FirstName = user.FirstName
          MiddleName = user.MiddleName
          LastName = user.LastName
          Email = user.Email |> (function | None -> String.Empty | Some v -> v)
          MagicId = id |> int })

Pretty basic, iterate our MagicId split string and work out what they are.

Build the massive, ugly function

Then we want to write getUserRows, which is long and quite possibly disobeys some best-practices, but it does exactly what we want, and well:

let getUserRows header (row : IRow) : UserRow seq option =
    let splitNames (name : string) =
        match ' ' |> name.Split with
        | [|first; last|] -> (Some first, None, Some last)
        | [|first; middle; last|] -> (Some first, Some middle, Some last)
        | _ -> (None, None, None)

    let buildUserVals first middle last phone dateAdded email magicId =
        { UserRow.FirstName = first
          MiddleName = middle
          LastName = last
          Phone = phone |> Option.map Object.toStr
          DateAdded = dateAdded |> Option.map Object.toStr
          Email = email |> Option.map Object.toStr
          MagicId = magicId |> Object.toStr }

    let finalResult first middle last =
        match header.MagicIds with
        | MagicIdMap.Combined magicIdCol ->
            match row |> getColVal magicIdCol with
            | Some magicId ->
                [|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
                |> Seq.ofArray
                |> Some
            | _ -> None
        | MagicIdMap.Split ids ->
            ids
            |> Seq.choose (fun id ->
                match row |> getColVal id with
                | Some magicId ->
                    buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
                    |> Some
                | _ -> None)
            |> Some

    match header.Name with
    | NameMap.Split (f, m, l) ->
        match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
              row |> getColVal f |> Option.map (Object.toStr),
              m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
              row |> getColVal l |> Option.map (Object.toStr) with
        | Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
        | _, Some f, m, Some l -> finalResult f m l
        | _ -> None
    | NameMap.Combined n -> 
        match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
        | Some (Some f, m, Some l) -> finalResult f m l
        | _ -> None

So what's happening here? Well, the first function inside here is splitNames, which is trivial:

let splitNames (name : string) =
    match ' ' |> name.Split with
    | [|first; last|] -> (Some first, None, Some last)
    | [|first; middle; last|] -> (Some first, Some middle, Some last)
    | _ -> (None, None, None)

If we split the name into two pieces, then return the two pieces as the first and third tuple elements. If we split it into three pieces, then return first, middle, last. We'll assume that names are fairly standard in the manner they're non-standard.

Next' we have this:

let buildUserVals first middle last phone dateAdded email magicId =
    { UserRow.FirstName = first
      MiddleName = middle
      LastName = last
      Phone = phone |> Option.map Object.toStr
      DateAdded = dateAdded |> Option.map Object.toStr
      Email = email |> Option.map Object.toStr
      MagicId = magicId |> Object.toStr }

This is basically a function which takes our relevant paraneters, and builds the final UserRow value from them. Pretty easy, nothing complex.

The finalResult starts getting hinky:

let finalResult first middle last =
    match header.MagicIds with
    | MagicIdMap.Combined magicIdCol ->
        match row |> getColVal magicIdCol with
        | Some magicId ->
            [|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
            |> Seq.ofArray
            |> Some
        | _ -> None
    | MagicIdMap.Split ids ->
        ids
        |> Seq.choose (fun id ->
            match row |> getColVal id with
            | Some magicId ->
                buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
                |> Some
            | _ -> None)
        |> Some

So this tests the MagicIds against the MagicIdMap, if it's Split then we're returning the array of users immediately. If it's Combined then we're returning a one-element array, built from the remaining values.

In hindsight, we really should clean some of this up to make it simpler:

let finalResult first middle last =
    let phone = row |> getColVal header.Phone
    let dateAdded = row |> getColVal header.DateAdded
    let email = row |> getColVal header.Email
    match header.MagicIds with
    | MagicIdMap.Combined magicIdCol ->
        match row |> getColVal magicIdCol with
        | Some magicId ->
            [|buildUserVals first middle last phone dateAdded email magicId|]
            |> Seq.ofArray
            |> Some
        | _ -> None
    | MagicIdMap.Split ids ->
        ids
        |> Seq.choose (fun id ->
            match row |> getColVal id with
            | Some magicId ->
                buildUserVals first middle last phone dateAdded email magicId
                |> Some
            | _ -> None)
        |> Some

That's better, we don't need to inline our phone, dateAdded and email data.

Finally, the match which is executed:

match userRow.Name with
| NameMap.Split (f, m, l) ->
    match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
          row |> getColVal f |> Option.map (Object.toStr),
          m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
          row |> getColVal l |> Option.map (Object.toStr) with
    | Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
    | _, Some f, m, Some l -> finalResult f m l
    | _ -> None
| NameMap.Combined n -> 
    match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
    | Some (Some f, m, Some l) -> finalResult f m l
    | _ -> None

This just tests the name information, and does the relevant transformations. You'll notice the first sub-match tests four values, that's because I'm lazy and refuse to put yet another sub-match expression in. Finally, we can run and test our code:

let users = headers |> Seq.map getUsers
users |> Seq.iter (printfn "%A")

This should write some output like the following:

seq
  [{FirstName = "John";
    MiddleName = Some "C.";
    LastName = "2";
    Email = "email@example.com";
    MagicId = 287;}; {FirstName = "Jane";
                      MiddleName = Some "L.";
                      LastName = "2";
                      Email = "otheremail@example.com";
                      MagicId = 331;}; {FirstName = "Jane";
                                        MiddleName = Some "L.";
                                        LastName = "2";
                                        Email = "otheremail@example.com";
                                        MagicId = 334;}; {FirstName = "Ron";
                                                          MiddleName = null;
                                                          LastName = "Doe";
                                                          Email = "";
                                                          MagicId = 872;}; ...]
seq
  [{FirstName = "John";
    MiddleName = Some "C.";
    LastName = "Doe";
    Email = "email@example.com";
    MagicId = 287;}; {FirstName = "Jane";
                      MiddleName = Some "L.";
                      LastName = "Doe";
                      Email = "otheremail@example.com";
                      MagicId = 331;}; {FirstName = "Jane";
                                        MiddleName = Some "L.";
                                        LastName = "Doe";
                                        Email = "otheremail@example.com";
                                        MagicId = 334;}; {FirstName = "Ron";
                                                          MiddleName = null;
                                                          LastName = "Doe";
                                                          Email = "";
                                                          MagicId = 872;}; ...]

Finally, save our data

Finally, we need to save our data. This is a pretty easy step, we'll modify getUsers just a little, and then user our users sequence to finish things off:

let getUsers header =
    let sheet = header.Sheet
    (header,
     [1..sheet.LastRowNum]
     |> (Seq.choose (sheet.GetRow >> Option.ofObj)
     >> Seq.choose (getUserRows header)
     >> Seq.flatten
     >> Seq.map separateUsers
     >> Seq.flatten))

let users = headers |> Seq.map getUsers

As you can see, the only modification is to have it map users and a header together, then we'll drop it all to a new workbook and sheet, and save.

Saving things is actually quite simple, we define a few helpers to set values, then we build a save function that takes our header and users and saves them to the new workbook:

let setCellValueOptStr (cell : ICell) : string option -> unit =
    Option.map cell.SetCellValue >> ignore
let setCellValueStr (cell : ICell) : string -> unit =
    cell.SetCellValue >> ignore
let saveSingle (row : IRow) (user : User) =
    (0 |> row.CreateCell, user.FirstName) ||> setCellValueStr
    (1 |> row.CreateCell, user.MiddleName) ||> setCellValueOptStr
    (2 |> row.CreateCell, user.LastName) ||> setCellValueStr
    (3 |> row.CreateCell, user.Email) ||> setCellValueStr
    (4 |> row.CreateCell, user.MagicId |> string) ||> setCellValueStr
let saveHeader (row : IRow) =
    (0 |> row.CreateCell, "First Name") ||> setCellValueStr
    (1 |> row.CreateCell, "Middle Name") ||> setCellValueStr 
    (2 |> row.CreateCell, "Last Name") ||> setCellValueStr
    (3 |> row.CreateCell, "Email") ||> setCellValueStr
    (4 |> row.CreateCell, "Magic ID") ||> setCellValueStr

The helpers make it easy to keep items separated so that concerns are only on their own scope.

let save header users =
    let newFile = header.Sheet.SheetName
    let newBook = XSSFWorkbook()
    let newSheetName = "Process_Spec"
    let newSheet = newSheetName |> newBook.CreateSheet
    0 |> newSheet.CreateRow |> saveHeader
    let savedRows =
        users
        |> Seq.fold (fun row user ->
            let newRow = row |> newSheet.CreateRow
            (newRow, user) ||> saveSingle
            row + 1) 1
    let datePart = System.DateTime.Now.ToString("yyyy-MM-dd")
    let newBookFile =
        sprintf "%s\%s-%s%s" fileInfo.DirectoryName newFile datePart fileInfo.Extension
    use fw = File.Create(newBookFile)
    newBook.Write(fw)
    printfn "Saved to %s" newBookFile
    savedRows

The sprintf is similar to printf except that it returns the format string instead of printing directly.

We can save our data with a single line: users |> Seq.iter (fun value -> value ||> save |> ignore). This will run through the entire pricess at this point and then save it all. (The let users = headers |> Seq.map getUsers doesn't do anything other than build a chain of Seq functions to create our result.)

And bam, we're done. Overally we have a hundred or so lines of code, and they completely normalize our example data. The next lesson will try to normalize things a bit better, and take some other oddities into account (such as names like "La Rouse", "Mac Gregor", etc.).


As we conclude this lesson (which, let's face it, is mostly a code-dump) I have been reflecting on my experiences with the IT world, software development, F#, C#, PHP, and other languages and experiences, and reminding myself that everything I've been working towards, building this lesson series, developing products for my company, building software to test my own knowledge, and even sharing some experiences with others have been strongly defined to push myself towards teaching and sharing knowledge with others. I realized (not so long ago) that the only thing I truly enjoy doing is helping others succeed, so I've been working towards that day-by-day for years. This series itself is a testament to that: I'm not expecting to get anything out of it other than hoping at least one person finds something here userful. That's all I want, and all I ask. If you find something useful, reach out on Twitter or email me. One of my (many) emails is elliott [dot] brown or ebrown (either work), at this very domain. (Robots should have a hard time with that, but I have faith in the people reading this to be able to parse it out.)

Getting started with programming and getting absolutely nowhere (Part 7)

Working out our Step 2 process

Lesson 6: It's Time to Accept Arguments
Lesson 8: Let's Normalize Our Data!

We've finished working out Step 3, this is a good start as we needed to know what format would be best for Step 3, and as it turns out the best format is the same as the output format, so we're going to try to get our data there.

Step 2 is actually a painfully brutal process. We end up having to fight several discrepencies, including the following:

Sheet 1

  • Some users have the middle initial of their middle name in the same cell as their first name, instead of in a new cell;
  • Some users have multiple email addresses (separated by semicolons) in the Email column;
  • Some users have multiple Magic ID values, in multiple columns;
  • Some users lack a Magic ID completely;

Sheet 2

  • Some users have a middle name, some a middle initial, and some neither, all in the same cell;
  • Some users have multiple email addresses (separated by semicolons) in the Email column;
  • Some users have multiple Magic ID values, in the same column;
  • Some users lack a Magic ID completely;

These are all cases we want to account for, excepting Email addresses. We won't actually do anything about multiple email addresses, since we don't care about them. For multiple Magic ID values we will make them separate rows, with the same information for the other columns. Thus, our transformation should take the sample data and end up with something like:

First    Middle    Last    Phone           Date Added    Email                                          Magic ID
John     C.        Doe     111-555-1234    29-Aug-17     email@example.com                              287
Jane     L.        Doe     123-555-4321    29-Aug-17     otheremail@example.com                         331
Jane     L.        Doe     123-555-4321    29-Aug-17     otheremail@example.com                         334
Ron                Doe     321-555-5678    29-Aug-17                                                    872
Edna     Viola     Doe     8515559637      28-Aug-17     viola.doe@example.com; edna.doe@example.com    975

So by the time we finish, we should be able to transform both sheets into that exact data-set. We'll then (at a later date) explore how we can modify it to account for multi-part last names, such as LA ROUSE, or MAC GREGOR. This is a lot harder than it looks, and we'll only worry about that for Sheet 2, because in Sheet 1 it's always separated out.

Build a step-by-step process

Software development isn't just writing code all day long, and in fact, in my workplace it's 60% problem solving, 20% tech-support, and then 20% writing code. We can ignore the tech-support bit from our adventure, we're only going to focus on the 75% problem solving and 25% writing code. This lesson will be entirely problem solving, we won't write much code today, but we will implement the solutions to these problems in the next lesson.

Alright, so breaking down the problem is more an "art form" than anything else - you'll learn how to do it best from doing it, you'll develop your own "style". We're going to walk through these in my "style" of solving them, but I do want you to try and solve them on your own as you get time. It's important to explore these in your own way, as how I solve problems may not work best for you.

The first thing I want to do is define the broadest possible definition of the problem. That is, what is our main, end goal? In our case, we want to normalize a specification sheet. Great, that's a good start. We now need to determine what individual parts are in this broader problem, and solve them. We could do this through a tree-diagram of the situation, breaking it down into nodes from top-to-bottom. I recommend you try doing this, as it really does help focus your thinking.

To "normalize our specification" we have some sub problems to solve first. (After all, there's no "normalize specification" API for us to use yet.) These might be:

  1. Load a specification file;
  2. Determine what each column means;
  3. Process all the rows of the specification;
  4. Save each row of the specification;

That's more reasonable, but some of them still need broken down further, most specifically 2, 3, and 4. They each have sub-parts, so let's get to them.

  1. Load / read a specification file;
  2. (2.1) Read each sheet and do the following;
  3. (2.2) Read the first row of the sheet;
  4. (2.3) Analyze each cell from the first row;
  5. (2.4) Determine what each cell from the first row indicates;
  6. (3.1) Read each line of data from the sheet;
  7. (3.2) Read each column of data from the sheet;
  8. (3.3) Split data from each line as appropriate (combined names, magic ID's, etc.);
  9. (3.4) Filter out rows without requisite data;
  10. (3.5) Process each column of data into the apporpriate type;
  11. (4.1) Create a new "post-normalization" sheet from the source sheet;
  12. (4.2) Write each record to the new sheet;
  13. (4.3) Repeat to 2 (2.1);
  14. (4.4) Write the spreadsheet back out;

This is non-exhaustive. We still have more to cover yet, but we'll start thinking about each step and how it relates to the overall goal. We're going to design the flow of our program, which is usually somewhat fun. We'll walk through each step and think of it in human terms, so that we know what our process looks like.

The first step is easy, and I said we were mostly going to solve the problem we have, which is true, but we pretty much covered this in a previous lesson so I'll give you the code to work with:

let workbook =
    use fs = new FileStream(parameters.Filename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

Pretty basic, and sets us up with our workbook. Next, we want to read each sheet to perform some "task", right? This is not as easy as it could be with NPOI, but it's still not hard. We're going to iterate from 0 to workbook.NumberOfSheets, and get the sheet with workbook.GetSheetAt, which takes an integer index. So this is also somewhat easy, and could be written like:

[1..workbook.NumberOfSheets] |> List.iter (fun si ->
    let sheet = si |> workbook.GetSheetAt |> Option.ofObj
    ...
)

Again, pretty basic so we'll go with it today because this means we're done with these steps. The GetSheetAt can return null if there's no sheet, so we convert it to an Option out of the gate to work F#'s pattern matching better. We'll move on to 2.2, which is simple. Read the first row, which we could call as 0 |> sheet.GetRow. At this point, we're wondering how anything could actually be difficult, until we get to this next step. "Analyze each cell from the first row."

Analyzing our header

Alright, so I said this was going to be hard, and it is going to be by nature, but we're going to make it easy because we will make a few assumptions. Let's digress and explore our imagination for a moment. First off: how many different ways can we write the headers of our sheets? "FirstName, First Name, fName, First, F NAME," and we could go on for a while, ignoring case changes even. (Those all just have whitespace and word differences.) One of the more lucrative things to get into with programming is machine learning: teaching a machine how to think like a human. (For those who aren't aware: the human mind is the single greatest pattern and system recognition engine ever created.) In this type of situation, as a person, it's easy to come up with what the column represents igven any of those titles, but for the computer it's not so simple. We have to tell it what "pattern" to recognize.

Now if we were really clever, we could write some sort of regex that would indicate if a "word" represents the idea of a First Name or not: (f(irst)?\s?(name))|(first), of course this ignores the term "given name" (often used as the First Name), so we're SoL on that front, but we could make another regex ((g(iven)?\s?(name))|(given)) and put them together (((f(irst)?\s?(name))|(first))|((g(iven)?\s?(name))|(given))) and feel pretty damn good about ourselves, because now this should always detect a phrase indicating First Name, right? What about Name, First? Damn, add another regex. Oh, and 1st Name, 1st, Name 1, etc., add _ and - support. By the time we're done we might have the following (and we don't even get started on spelling correction):

(((f(irst)?)|(g(iven)?)|[1f]st)[\s_-]?(name)|(first|given|[1f]st))|(name(,?[\s_-]?(first|given|[1f](st)?)))

I don't know about you, but I don't want to be the person dealing with that. This is where machine learning might come in, teach the machine how to recognize terms that would indicate the First Name, I.e. the name we call them by informally. Or, we can make the blind assumption that our data will only ever use "First Name" as the title of the column for First Name, I think that's what we'll do.

So, if we find a single Name column, then we'll assume it contains all three names put together. If we find a First Name and Last Name column, then we'll assume it contains the first and middle name in the First Name column, and the full Last Name on it's own, and if we find First Name, Middle Name, and Last Name, then we assume that it contains each name in it's own column, with the possibility that First Name is the first name and middle initial.

We can then process the remaining columns, until we get to Magic ID, which we have two possibilities for: we either have multiple Magic ID columns, or we have a single column. This is not nearly as difficult to solve as the previous situation. We're just going to read in any header with the name Magic ID, and if we find one named exactly "Magic ID", then it means they could be joined in that column, but if we find multiple columns, then it means that they're separated.

Reading the data

Alright, hardest part over (we're not into the easy waters yet, but we're close): now we need to actually interpret our data. This is not so hard, what we're going to do here is essentially load row-by-row into a type, which we've yet to define but it's rather trivial:

type SpecUser = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicID : int
}

So what we'll do here is load a row, parse the columns (based on what we detected from the header) and then continue to the next one. If a row doesn't make a valid SpecUser, then remove it. (So Seq.choose will be useful here.)

The hardest part with this will be the splitting on multiple MagicID values, which is really not all that hard. If we Seq.choose (fun x -> [|...|]) |> Seq.flatten we'll get what we need. (Remember Seq.flatten from my earlier lessons?)

Once we've read the data, creating the new sheet is trivial. We'll actually create a whole new workbook because our sheet name should be Process_Spec, and we want to save every input sheet. This is quite trivial, and it won't take long to handle.

Finally, we'll write the apporpriate spreadsheets back out. These will be the single Process_Spec sheet which is loaded from the input spreadsheets. Again, we coveredthese before, and it's not hard to write this particular code.

Tidying up

I'm going to leave off here for today - the next lesson will get into the code for all this, but it's important for us to go over, in people-speak, what needs to happen, because as we get into code it'll start to become more and more clear why I went this route.


Bonus day: Finding a Gap in an aribtrary Sequence

Recently I had to write this code for a personal project, and since we didn't do much code above I suppose I'll share because that's the type of kind-hearted person I am.

The goal here is bewilderingly simple: given a formula for the expected item at an index F(x), and a manner of determining the current item at an index G(x), we want to detect what is the first item where F(x) <> G(x). Simple, right? A basic version might be to:

let firstNonEqualItem = [iFirst..iLast] |> Seq.filter (fun i -> i |> f <> i |> g) |> Seq.head

Yeah, this works, an example:

let items = [|0;1;2;3;4;5;7;8;9|]
let firstUnequalItem = [0..8] |> Seq.filter (fun i -> items.[i] <> i) |> Seq.head

But what obvious problem does this have? It does a linear search, which means if there are many millions of values, it finds the first value starting at the beginning, each and every time. If the first missing value is 10-million elements in, well it searches through the first 10-million elements before returning. Now what if I told you we could search the same 10-million elements with 24 iterations? What if I told you we could find any missing element in that 10-million item array in 24 or fewer iterations, each and every time, no matter where it was?

This is, in fact, not hard. The biggest problem is trying to define the plain-english for an algorithm that does it, but let's take an example. We have an array:

0 1 2 3 4 5 7 8 9

We want to find what element is missing, well we all know it's 6, but how do we teach a computer to find it?

The easiest method is to define two variables, which represent the left and right portions of the array we're in, we'll call them l and r. (If your fonts suck, those are lower-case L and R, respectively.) We'll define them as by default the first and last element, respectively.

0 1 2 3 4 5 7 8 9
l               r

Now, we'll find the element in the middle of l and r, since l = 0 and r = 8, that's 4. (l + r / 2) We'll evaluate this element, called m, and determine if f(m) = g(m).

0 1 2 3 4 5 7 8 9
l       m       r

In our case, f(4) = 4 is true, so now we'll set l to m + 1.

0 1 2 3 4 5 7 8 9
          l     r

Next, we'll repeat. This time we look at f(6), which is 7, and it's not true, so we set r to m.

0 1 2 3 4 5 7 8 9
          l m   r
0 1 2 3 4 5 7 8 9
          l r
0 1 2 3 4 5 7 8 9
          l r
          m

Next we'll test m again, which is 5, and f(5) = 5 is true, so we move l to m + 1.

0 1 2 3 4 5 7 8 9
            l
            r

Once l and r are in the same position, we return f(l), which is 6 in this case. Do note we create m three times, which means we only have to iterate 4 times to find a missing value in this array. You can calculate this as the base-2 logarithm of n, where n is the array count. (So, log(n) / log(2) if you have a cheap calculator that only does base-10 or base-e logarithms like mine.) The code, for a basic F# algorithm of this, is as follows:

let Zero = bigint 0
let One = bigint 1
let NegOne = bigint -1
let searchForOpening equals formula getItem count =
    let expected v = (v |> getItem, v |> formula) ||> equals
    let rec alg l r =
        match l = r with
        | true -> formula l
        | false ->
            let mid = (l + r) >>> 1
            match mid |> expected with
            | true -> (mid + One, r) ||> alg
            | false -> (l, mid) ||> alg
    match count with
    | r when r = 0I -> r |> formula
    | r ->
        match r |> expected with
        | true -> (r + One) |> formula
        | false -> (Zero, r) ||> alg 
let items = [|0I; 1I; 2I; 3I; 4I; 5I; 7I; 8I; 9I;|]
searchForOpening (fun a b -> a = b) id (fun a -> items.[int a]) (items |> Array.length |> bigint |> (+) NegOne)

Pretty simple, no? Do note that this algorithm assumes f(x) and g(x) are pure functions that can work off an arbitrary index. Also, the >>> may be unfamiliar to you - look up the F# operators, and you're looking for the "bitwise operators". It essentially shifts everything right n bits (1 in this case). A >>> 1 is a fancy way of saying / 2, but much faster (in many cases) and no likelihood for confusion as to whether it rounds, and if so what direction.

And if you're doubting the efficiency of the algorithm, the following test should do you some good:

searchForOpening
    (=)
    id
    (function | a when a <= 1000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    2000000000000000000000000000000000000000000000000000000000000I

We find the missing value of an enormous series of elements, by cheating, basically. This runs about 200 iterations to find the missing value. Two-hundred, that's it. We search through a sequence of elements with 60 zeros in it so fast that it's undeniable as to it's efficiency. This can be run in F# Interactive with great ease and speed - you won't notice a delay, which means it saves us many CPU cycles. The caveat is that we need a manner of comparing equality, which isn't always possible, and we need a pure-function that can evaluate what the current and expected elements are. In fact, we'll up the anty even more:

searchForOpening
    (=)
    id
    (function | a when a <= 10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I -> a | a -> a + One + One)
    20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000I

This searches through two googol values, that's even more enormous, and it does it in a very quick time. (About 333 iterations.) So if anyone tells you "F# can't be quick" or "you can't search through a set of numbers that big", tell them they're very wrong. We just did it, and in 20 lines of code.


This was another relatively short lesson, but no less important than any other: it's absolutely inarguably imperative that you know how to solve problems. This is the single biggest thing I want to stress through this whole adventure: knowing a language and API is great, but if you don't know how to solve the problems you're about as useful as a piece of burnt toast.

Step2_Testing.xlsx (11.5KB)

Getting started with programming and getting absolutely nowhere (Part 6)

It's time to accept arguments

Lesson 5: Now We Begin Our Adventure
Lesson 7: Working out our Step 2 Process

One of the many (very necessary) core features of a program is to take some sort of argument. You want an easy user experience, and you also want to be able to script it. You want to be able to automate your automation, and for that we need to accept arguments. We'll do this through the less appropriate method, but only because it's super easy and allows us to be lazy. We'll allow arguments to our application to be passed in via the CLI, which means we can do some pretty nifty things.

What is an argument?

If you've ever worked with the Windows Command Line Interface (we'll call it the CLI from now on) you hace probably used some command that takes arguments, such as any of the following:

notepad <FILENAME>
ipconfig [/allcompartnets] [/? | /all | ...]

All three of these commands have the same thing in common: they take arguments. It may not have occurred to you (I know it certainly didn't occur immediately to me) but each of these commands is just part of a program that does something. When you type notepad test.txt you're calling the notepad program with the first argument being test.txt. But when you call ipconfig, each command argument is preceeded by a name which starts with a /, such as ipconfig /all, that calls the ipconfig program and passes /all as an argument. From there, the ipconfig program decides how to interpret that; however, for this command the first argument can be something that has sub-properties, like /renew [adapter], which is an argument with an additional property called adapter, completely optional.

This means that ipconfig has no default parameter, like notepad does. This means that the command syntax has meaning, when you notepad you don't want to have to type notepad -file test.txt all the time, you want it to just understand that you're always going to be referencing a file. Usually, we'd like to have the caller specify arguments by name, but sometimes there's a "default" argument, which for us might be the "new list filename".

Now here's a pretty nifty bit of information: when we drag-and-drop a file onto any application, Windows calls that application with the first argument being the filename. So we might want to support automatically filling the newListFilename with whatever the first argument is. Likewise, we might also want to support doing the same with the oldListFilename for the second argument.

Ok, so what are you getting at here?

Right, so I rambled a bit in the previous section, but what I'm getting at is that we're going to prepare our program for CLI calling, and even a drag-and-drop form of calling. We'll do all this in a quick swoop of functions, a module, and a new type.

The first thing we need to know is how to get the arguments. In an F# program you have a main function defined as follows (usually):

[<EntryPoint>]
let main argv = ...

The argv is what we're going to concern ourselves with. This is basically an array, of an arbitrary length, that contains each argument that was fed to the call of the program. The first argument in the array is the first argument after the program name, so in our cd SomeDirectory example, SomeDirectory will be the first argument in argv. If the argument is surrounded by quotes ("argument because it has spaces") then the quotes are stripped before it's delivered to us. So doing cd "E:\My Directory\" will put E:\My Directory\ as the first string in argv. It's worth noting that argv is always a string array, so if an argument should be an integer we have to parse that out ourselves.

Now I'm not going to go into the details of a full-class argument parser, because we're dealing with simple arguments. We want to define argv.[0] as the newListFilename, we'll define argv.[1] as oldListFilename, and argv.[2] as the sourceSheetName. We'll do this through a type that defines our parameters:

type Parameters = {
    NewListFilename : string
    OldListFilename : string
    SourceSheetName : string
}

We'll also define a fromDefault function:

let fromDefault defaults args =
    match args with
    | [|newListFilename; oldListFilename; sourceSheetName|] -> Some { defaults with NewListFilename = newListFilename; OldListFilename = oldListFilename; SourceSheetName = sourceSheetName }
    | [|newListFilename; oldListFilename|] -> Some { defaults with NewListFilename = newListFilename; OldListFilename = oldListFilename }
    | [|newListFilename|] -> Some { defaults with NewListFilename = newListFilename }
    | _ -> None

This should be pretty obvious as to what is happening. If args matches an array of three elements, then newListFilename is the first, oldListFilename is the second, and SourceSheetName is the last. If it matches 2, then SourceSheetName is left as default, if it matches 1 then only newListFilename is set, and if it's none then None is returned. This means we can indicate if we even had any CLI arguments, because we may want to show a help page (like cd) or do some default stuff (like ipconfig).

Pull the parameters in

Now ideally we would set the values not-set to None as options, but we're going to do things slightly different, which may not be the most idiomatic method, but it works and it works well. We'll define some default parameters:

let defaults = { NewListFilename = ""; OldListFilename = ""; SourceSheetName = "Process_Spec" }

And then we'll collect them as the first line of our main function:

let parameters =
    match argv |> Parameters.fromDefault defaults with
    | Some p when p.NewListFilename <> "" && p.OldListFilename <> "" -> p
    | Some p when p.NewListFilename <> "" ->
        printf "Enter the old file name to load: "
        { p with OldListFilename = Console.ReadLine() |> String.stripDQuotes }
    | _ ->
        printf "Enter the new file name to load: "
        let temp = { defaults with NewListFilename = Console.ReadLine() |> String.stripDQuotes }
        printf "Enter the old file name to load: "
        { temp with OldListFilename = Console.ReadLine() |> String.stripDQuotes }

So this is pretty self explanatory, and I leave it to you to understand. Our String.stripDQuotes is only present on the Console.ReadLine() calls because dragging-and-dropping a file to our window will actually embed the name with quotes if it has spaces.

We'll then drop the rest of our code (which we write in the previous section) in to our program. The last bit of our main function should be something like the following:

match argv |> Parameters.fromDefault defaults with
| Some p when p.NewListFilename <> "" && p.OldListFilename <> "" ->
    printfn "Done."
| _ ->
    printfn "Done. Press enter to exit."
    Console.ReadLine() |> ignore
0

We're re-matching the arguments, only to determine how to end processing. If it was called from the CLI with arguments, then we did not prompt the user for input, and as such we will not make them press "enter" to quit. As I'm finished with the lesson for today, I recommend you try to parse arguments with names, that is, /sourceSheetName="Process_Spec", or /sourceSheetName "Process_Spec", such that we could call it from the CLI as step3program.exe /SourceSheetName "Process_Spec" and have it prompt us for the filenames.


The lesson for today was a lot shorter than I had hoped it would be, but it's no less important than any others. I haven't yet decided if I'll be posting a less on Monday as it's a holday here in the U.S., but I have a sneaking suspicion that I'll have on ready by then. Until next time, Brown-out.

Getting started with programming and getting absolutely nowhere (Part 5)

Now we begin our adventure

Lesson 4: Moving on to Solving Business Problems
Lesson 6: It's Time to Accept Arguments

Finally, we've been working on this for a little more than a week now, it's time we get to actually processing a spreadsheet. We're going to solve one of the many business problems that I've had, though it's going to be significantly anonymized. The same principles will hold true, and I've specifically structured this data so that we have to fight some of the same issues I've had to fight. While this is technically an example project, this is entirely based on a real-world problem, and everything we'll use here holds true. (In fact, with minimal modification this example project can be used to solve the same issue I am at work.)

Now, we're going to explore this over a few posts, we are going to jump right in and start developing, but in stages. I want to do things one part at a time, so that you can see how and why it's important to formulate a "thought process" while developing software. As I mentioned before we're going to use real problems, and I won't give you all the answers, but that doesn't mean I won't give you some of the answers. We're going to begin today with loading and reading our Excel spreadsheet, we won't do much else because I want you to think about what the next step might be.

Defining the problem

The business problem we're going to solve first is based upon a list of items I receive on a periodic basis, that has tens of thousands of record, which I have to parse and insert into a database. Previously, this was done with a few simple steps in Excel, though sometimes it took a lot more than just a few steps. Some spreadsheets are not even a little bit normalized, and some have very bad data in them. One of them that I receive (which we'll actually solve last) comes in a very weird format, and I have to do some clever tricks to get it in the correct format. Once I've normalized the spreadsheet, I compare it to the most recently received spreadsheet of the same category, and determine what records are "new" to the database based on what is new to this spreadsheet.

So, our process will have three major steps, one of which is optional:

  1. (Optional) Preprocess the really messed up spreadsheet
  2. Normalize the Spreadsheet, splitting data appropriately.
  3. Compare the sheet to the previous version, determine what records are new and indicate them.

These are actually three separate applications that I have built at work, they all support "drag and drop" functionality, which means you can just drag a spreadsheet on to the "step" represented by the application. Between steps there is a short manual process, but that's only because I haven't yet automated it. (One day, I will.)

We're going to start with step 3, the reason for this being that if we start with the last step, we can work from what our expected output is to the input. It doesn't matter which order you go in, but for something as unguided as this I like to work backwards: figure out what our expected output is, then design a function (or set of functions) that can transform our input to the output. Since steps 1 and 2 only put the data in an "intermediate form" we don't care what the output is, we can make step 1 turn our output into whatever is appropriate for step 2, and step 2 can prepare it for step 3. The actual outputs there are irrelevant, we need to figure out what the output of step 3 is, then manipulate our input to fit it. You can work these problems in either direction, and in the future we'll work some problems from input » output, but today we'll stick to the output « input manner.

What output do we need?

This is the easiest part -- we simply need an excel sheet that has the following columns from our input data in this order: First Name, Middle Name, Last Name, Email, Magic ID. If a "person" has multiple Magic ID values, then we want multiple records per value. So if Elliott Brown has 1 and 2 for Magic ID, then there should be two records: Elliott V. Brown ... 1 and Elliott V. Brown ... 2. Simple, right?

So let's look at our input and try to determine how to format it to our output. This is actually going to be somewhat challenging, it may look easy, but it really isn't. The first issue we note is that there are three name columns on the first sheet, but for some records the middle initial is part of the first name. Well that's not good. We also note that there are two Magic ID columns, and some records have neither, some have one, and some have both. Alright, that's not as bad to deal with. And of course, we have two different email formats: some have none, some have one, and one has email1; email2. Looking at Sheet 2, we get even less consistent: if a user had multiple Magic ID values then they're seperated by a number of spaces, and the first, middle and last names are all in the same column. Oh boy.

Every single one of these problems is somethign that my real data has: some records have a combination of the problems, some have one or two, some have none. We're going to build our application on the assumption that every row may have any combination of problems, but some of the problems in the sheet will be consistent.

The Magic ID format is going to be consistent across the whole sheet: if the sheet has one record with multiple columns for the Magic ID, then all records on the sheet will have that same structure. If the sheet has one name with all the names combined, then all records on the sheet will be like that. However, this only applies to first and last name, the first and middle may be arbitrarily combined or separated, so we still have to test for that.

Let's define a type for our output

This is a good start to getting our output in the right format - if we know what business rules apply to all output records then we can build a type that meets them. In our case, the following type is applicable:

type ListPerson = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : string
}

So now we have a type which enforces our business rules: for our output we can only have a type that meets all those criteria, thus it must have a FirstName, LastName, Email and MagicId, but it might have a MiddleName.

Building an input to create our output

We cannot forget that we're currently thinking about the matching bit, not the normalization. So right now we only care about how to form some data for our output, and since we know that the output is the same as the input, just skinnied down (because we're removing items in this list that are on the previous one), then we can decide that our input must be the same format as our output. Thus, the ListPerson record should also be used to enforce the input.

This said, we know we need to take a list of ListPerson in from the current list, and a ListPerson list from the previous list, and find the mutually exclusive values. This is actually really Easy with F#, and we can define a function that does this:

let listCompare other el =
    other |> List.filter (compare el) |> List.length = 0

This takes an "other" list an element from the current list to search for. This uses a custom compare function, for reasons we'll discuss shortly, filters the list down to only elements where compare returns true, and returns true if there are no elements in the list. This is pretty simple, and let's us call it by:

List.filter (listCompare oldListMembers)

Now we could, and should, modify listCompare to be more generic. Instead of using our forced compare function (defined elsewhere) we should let it take a function, but I'm going to leave that up to you.

Why did we use a custom comparer, and am I going to show it to you?

Of course, this custom compare function is defined so that we can make an optional comparison. By default F# build a full equality comparer for a record type, but that means that if MiddleName is a string option, and in one list it's Some value, but in another list it's None, we'll treat those as unique records. In our particular case we don't want to do that: we want these to compare based on the data that both records have present. So if MiddleName is Some v for both, then they can compare MiddleName, but if it's None for either, then don't compare it.

The custom compare function is basically the following:

let compare (m1 : ListPerson) (m2 : ListPerson) =
    // Compare mandatory values first
    m1.FirstName = m2.FirstName &&
    m1.LastName = m2.LastName &&
    m1.Email = m2.Email &&
    m1.MagicId = m2.MagicId &&
    // Default the match to `true` if either is `None` so that the previous conditions are the only influence
    match m1.MiddleName, m2.MiddleName with
    | Some a, Some b -> a = b
    | _ -> true

As you can see it just checks everything, and doesn't count a difference in the type of option stored as a false result.

We'll also want to do a List.distinctBy after our filter - in my real data I often get duplicated records within the same list but with or without a MiddleName, for example. We'll actually cheat a bit, and do the following:

List.distinctBy (fun u -> { u with MiddleName = None })

By setting MiddleName to None, then using that for the distinctBy, we'll only return one of the two records (and since the MiddleName is tertiary data in my situation, I don't care which) so that we don't have to deal with duplicates on that front either.

So in the end, we'll end up with some code that looks as follows:

let listCompare other el =
    other |> List.filter (compare el) |> List.length = 0

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> List.filter (listCompare oldListMembers)
    |> List.distinctBy (fun u -> { u with MiddleName = None })
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> List.filter (listCompare newListMembers)
    |> List.distinctBy (fun u -> { u with MiddleName = None })

Well that's pretty basic. We should extract that fun u to a new function, which we may as well do now:

let distinction u = { u with MiddleName = None }

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> List.filter (listCompare oldListMembers)
    |> List.distinctBy distinction
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> List.filter (listCompare newListMembers)
    |> List.distinctBy distinction

And hell, we could even make that a total composition:

let listCompare other el = other |> List.filter (compare el) |> List.length = 0
let distinction u = { u with MiddleName = None }
let filterBy other = List.filter (listCompare other) >> List.distinctBy distinction

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> filterBy oldListMembers
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> filterBy newListMembers

Easy enough, it also helps us see what's happening more clearly, since the steps are clearly isolated.

Pulling some data in

Now that we know what our filter is, what our data is, how it all interacts, let's pull in some good data. We're going to pull in data that already fits our model, which is included in the Step 3 Preparation workbook that's attached. There are just a few rows, like the rest of our data, but there's enough that we can do something with it. NPOI, and F# in general, make it really easy to load data. In fact, way easier than it should be. The F# compiler contains a lot of tricks up it's sleeve to make our lives easier, and we'll explore a couple of those pretty easily.

Today we're going to actually pull all the data in, filter it, and write it out. To pull the data in is pretty easy:

let newBookFilename = "C:\Users\Elliott\Desktop\Step3_Testing_New.xlsx"
let newBook =
    use fs = new FileStream(newBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let oldBookFilename = "C:\Users\Elliott\Desktop\Step3_Testing_Old.xlsx"
let oldBook =
    use fs = new FileStream(oldBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

To do the XSSFWorkbook bit (which is just a construction of the XSSFWorkbook class from our FileStream) you need to open NPOI.XSSF.UserModel beforehand. I like to arrange these at the top of my file when they're large opens like this.

The next step is to pull our Process_Spec sheet in:

let newSheet = "Process_Spec" |> newBook.GetSheet
let oldSheet = "Process_Spec" |> oldBook.GetSheet

Well that was simple. We're already about 1/4 the way done. Let's pull in a list of all our specification members:

let newListMembers =
    [1..newSheet.LastRowNum]
    |> Seq.choose (newSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

let oldListMembers =
    [1..oldSheet.LastRowNum]
    |> Seq.choose (oldSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

Well this is really easy. And look, we've set ourselves up to just drop in the filtering we wrote in the previous section, damn F# really makes this simple. We start at 1 instead of 0 as we want to skip the header row. We're actually almost done.

We want to save three new sheets on our newWorkbook now: the list of all members on the newBook not on the oldBook, the list of all members on the oldBook not on the newBook, and then we'll format the list of members not on the oldBook in a specific format. This is used at work to drop into a separate Excel sheet that I format my SQL queries with, so some day this will be automated.

let newSheetName = "Process_Spec_New_Add"
let oldSheetName = "Process_Spec_Old_Add"
let templateSheetName = "Process_Spec_Template"

let header = { FirstName = "First Name"; MiddleName = Some "Middle Name"; LastName = "Last Name"; Email = "Email"; MagicId = "Magic ID" }
let addSheet func (workbook : IWorkbook) name list =
    match name |> workbook.GetSheet |> Option.ofObj with
    | Some _ -> name |> workbook.GetSheetIndex |> workbook.RemoveSheetAt
    | None -> ()

    let sheet = name |> workbook.CreateSheet
    (header::list) |> List.iteri (fun i el -> i |> sheet.CreateRow |> func el)

addSheet fillRow newBook newSheetName newListUnique
addSheet fillRow newBook oldSheetName oldListUnique
addSheet fillTemplateRow newBook templateSheetName newListUnique

let write (book : IWorkbook) =
    use fs = File.Create("Spec3_Testing_Finished.xlsx")
    book.Write(fs)

write newBook

And we're done. Boy, that was easy. I left the implemenation of getSpecMember, fillRow and fillTemplateRow out, so let me include them quick for you. They're quite easy to define, and they really do give us more control over what's going on. We end up needing two functions first, which are helpers to set values on a cell:

let setValue (s : string) (cell:ICell) =
    s |> cell.SetCellValue
let setValueOpt (s : string option) (cell:ICell) =
    s |> Option.map (fun s -> cell |> setValue s) |> ignore

After that, we need another helper to get values, this is pretty easy and I excluded formulas because I expect our previous process to strip them out entirely.

let getVal (cell : ICell) : string option = 
    cell
    |> Option.ofObj
    |> Option.map (fun cell ->
        cell.CellType
        |> (function | CellType.Boolean -> cell.BooleanCellValue |> Object.toStr | CellType.Numeric -> cell.NumericCellValue |> Object.toStr | _ -> cell.StringCellValue))

You remember our Object.toStr method, right? If not, I recommend trying to implement it from scratch, based on the idea that it returns o.ToString(). We then have a getSpecMember method:

let getSpecMember (row : IRow) =
    let get = row.GetCell >> getVal
    match 0 |> get, 1 |> get, 2 |> get, 3 |> get, 4 |> get with
    | Some fName, mName, Some lName, Some email, Some magicId ->
        Some { FirstName = fName; MiddleName = mName; LastName = lName; Email = email; MagicId = magicId }
    | _ -> None

We use the options to tell us if we have a valid spec or not, when matched with Seq.choose it will filter away all the None values., Next are the basic fillRow and fillTemplateRow, which should be self-explanatory:

let fillRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId

let fillTemplateRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId
    5 |> row.CreateCell |> setValue (el.MagicId |> String.digitsOnly)

The difference in them is that fillTemplateRow includes an additional column for the MagicId, but with only digit characters in it. In the real data I often get other characters in that column, which may or may not be valid, and for the processing I have to do I only want to consider the numeric characters at times. To use the IRow, ICell and IWorkbook you'll need to open NPOI.SS.UserModel, but then that's it. We have all the pieces to complete our task, in fact, we're pretty much done with it.

Put it together and ship it

At this point, we're done with the third step, and now we know what format our output from Step 2 should be in. This is important to establish as it allows us to work the two ends together - we'll actually work from the start and finish towards each other when we start with Step 2, which I think will end up being covered over more than one post, because it's a long and painful process. Hopefully today's post showed you how to actually start working on projects like a software engineer, and started to allow you to build up your knowledge and thought process. It's important to know how to solve problems, that's probably the biggest thing I see new software engineers struggle with. You need to break it down, and start thinking about what steps need to be done. Build things one step at a time - don't mind of it goes slowly at first, you'll eventually get to the point where you just start breaking problems down without even thinking about them.

By the time we finish, working it as an F# script should look something like:

open System
open System.IO
open NPOI.XSSF.UserModel
open NPOI.SS.UserModel

type ListPerson = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : string
}

let compare (m1 : ListPerson) (m2 : ListPerson) =
    // Compare mandatory values first
    m1.FirstName = m2.FirstName &&
    m1.LastName = m2.LastName &&
    m1.Email = m2.Email &&
    m1.MagicId = m2.MagicId &&
    // Default the match to `true` if either is `None` so that the previous conditions are the only influence
    match m1.MiddleName, m2.MiddleName with
    | Some a, Some b -> a = b
    | _ -> true

let setValue (s : string) (cell:ICell) =
    s |> cell.SetCellValue
let setValueOpt (s : string option) (cell:ICell) =
    s |> Option.map (fun s -> cell |> setValue s) |> ignore

let getVal (cell : ICell) : string option = 
    cell
    |> Option.ofObj
    |> Option.map (fun cell ->
        cell.CellType
        |> (function | CellType.Boolean -> cell.BooleanCellValue |> Object.toStr | CellType.Numeric -> cell.NumericCellValue |> Object.toStr | _ -> cell.StringCellValue))

let listCompare other el = other |> List.filter (compare el) |> List.length = 0
let distinction u = { u with MiddleName = None }
let filterBy other = List.filter (listCompare other) >> List.distinctBy distinction

let getSpecMember (row : IRow) =
    let get = row.GetCell >> getVal
    match 0 |> get, 1 |> get, 2 |> get, 3 |> get, 4 |> get with
    | Some fName, mName, Some lName, Some email, Some magicId ->
        Some { FirstName = fName; MiddleName = mName; LastName = lName; Email = email; MagicId = magicId }
    | _ -> None

let fillRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId

let fillTemplateRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId
    5 |> row.CreateCell |> setValue (el.MagicId |> String.digitsOnly)

let newBookFilename = "C:\Users\EBrown\Desktop\Step3_Testing_New.xlsx"
let newBook =
    use fs = new FileStream(newBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let oldBookFilename = "C:\Users\EBrown\Desktop\Step3_Testing_Old.xlsx"
let oldBook =
    use fs = new FileStream(oldBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let newSheet = "Process_Spec" |> newBook.GetSheet
let oldSheet = "Process_Spec" |> oldBook.GetSheet

let newListMembers =
    [1..newSheet.LastRowNum]
    |> Seq.choose (newSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

let oldListMembers =
    [1..oldSheet.LastRowNum]
    |> Seq.choose (oldSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

printfn "Getting members not on old list."
let newListUnique = newListMembers |> filterBy oldListMembers
printfn "Getting members not on new list."
let oldListUnique = oldListMembers |> filterBy newListMembers

let newSheetName = "Process_Spec_New_Add"
let oldSheetName = "Process_Spec_Old_Add"
let templateSheetName = "Process_Spec_Template"

let header = { FirstName = "First Name"; MiddleName = Some "Middle Name"; LastName = "Last Name"; Email = "Email"; MagicId = "Magic ID" }
let addSheet func (workbook : IWorkbook) name list =
    match name |> workbook.GetSheet |> Option.ofObj with
    | Some _ -> name |> workbook.GetSheetIndex |> workbook.RemoveSheetAt
    | None -> ()

    let sheet = name |> workbook.CreateSheet
    (header::list) |> List.iteri (fun i el -> i |> sheet.CreateRow |> func el)

addSheet fillRow newBook newSheetName newListUnique
addSheet fillRow newBook oldSheetName oldListUnique
addSheet fillTemplateRow newBook templateSheetName newListUnique

let write (book : IWorkbook) =
    use fs = File.Create("C:\Users\EBrown\Desktop\Step3_Testing_Finished.xlsx")
    book.Write(fs)

write newBook

This is roughly 120 lines of code. We read two spreadsheets, filtered the members from each down to only unique values, and saved them back out in 120 lines of code. That's phenomenal. In the past five posts, we went from doing basic function work, to processing a full-on spreadsheet and filtering data in it. You're going to see that we're going to continue this trend of build and improve over the next few posts. We'll actually work out how to replace the variables we use with new and improved parameters, allowing us to send parameters to our application via arguments, which means we will allow the user to have a more dynamic interaction.

As we continue to explore (and I recommend you explore some on your own) we'll get more in-depth looks at how we can process these spreadsheets more effectively. We'll get in to doing the actual, raw-level morphing coming up soon, and actually translate our bad sheets into decent sheets, and decent sheets into good sheets.

This is all actually fairly straightforward, and you'll see once we get started that it's not nearly as hard as I make it out to be. We'll go through things rather quickly, but once it all clicks you should be able to go back to the previous posts and review them to make certain you understand what's happening. Most of what I do is pretty simple, and fairly reasonable, but there may be things that confuse you a bit or make you think a little harde than normal - that's intended. I want you to have to think and understand, I want you to think critically about what's going on, becuase that is the only way you become a better programmer.


I started this whole blog series for a friend of mine (currently a U.S. Marine deployed to Turkey) who has asked me, quote possibly fifty times over the past three years or so, to teach him how to become a programmer. (I mentioned this in the first of these articles.) We both haven't had schedules well enough to connect on it, so this series should allow a person to work at their own pace while still allowing me to work at mine. If anyone has any suggestions for material to cover, or how, please tweet at me (@EBrown8534) or comment, I really want to make this a good series for as many people as possible. And feel free to let me know if you like anything in particular as well, I would really love to hear any input you might have.

Second, if you're using VBA in Excel (or, really, any other VBA environment), this GoFundMe could really use your support! These guys and gals have been building a wonderful add-on for the Excel VBE (and they've even spent a good amount of time making it work in all VBE's, such as AutoCAD, SolidWorks, CorelDRAW and Sage 300 ERP), for a while now, all on their own free time. For the amount of work required, that's incredible. Even a few bucks will help them - they're a really great group.

Spreadsheets: Step3_Testing_Old.xlsx (9KB) Step3_Testing_New.xlsx (9.1KB)