using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 8)

Let's normalize our data!

Lesson 7: Working out our Step 2 Process
Lesson 9: Knowing the Terminology is Important

Finally, time to write out our step 2 code. This has been a bit of a lengthy journey so far, and most of that is my fault, but it's finally time to write some code to do what we want. I'm going to throw a lot at you very quickly, but it's important to not lose focus. If you have questions, do some web-searches while you're thinking of them, it'll help you stay in the mindset I hope.

The first step here is to define a type, or a few types. We need 4 to boot.

type NameMap =
    | Combined of c : int
    | Split of firstName : int * middleName : int option * lastName : int

type MagicIdMap =
    | Combined of c : int
    | Split of ids : int array

type Header = {
    Sheet : ISheet
    Name : NameMap
    Phone : int
    DateAdded : int
    Email : int
    MagicIds : MagicIdMap
}

type User = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : int
}

You should understand what those are by now, so I won't go into any detail. We're still using NPOI, we're still using F#, and we're still tyring to be as idiomatic as possible.

Some boilerplate

You should also remember this little guy:

let getVal (cell : ICell) : obj option = 
    match cell with
    | null -> None
    | _ ->
        match cell.CellType with
        | CellType.Boolean -> cell.BooleanCellValue |> box |> Some
        | CellType.Numeric -> cell.NumericCellValue |> box |> Some
        | CellType.String -> cell.StringCellValue |> box |> Some
        | _ -> None

And we're also going to add a few other helpers:

let startsWith v (x : string) = v |> x.StartsWith

let atLeastOne items =
    if items |> Seq.isEmpty then [None] |> Seq.ofList
    else items

let colIndex (cell : ICell) = cell.ColumnIndex

None of which should be difficult. The atLeastone will always make sure a Seq is not empty, so that we can simplify tests later. We need to be able to filter out cells based on what value they have, so we have this function:

let cellWhere pred (row : IRow) =
    row.Cells
    |> (Seq.map (fun c -> (c, c |> (getVal >> Option.map Object.toStr)))
    >> Seq.filter (fun (c, v) -> v |> (function | None -> false | Some n -> n |> pred))
    >> Seq.map (fun (c, v) -> v |> Option.map (fun y -> c))
    >> atLeastOne)

Interestingly, this will give us an ICell option sequence, which means we can then Seq.choose on it if we need, or in the case of only one result we can Seq.head. We need to filter our header out first, so we want something like the following:

let mapHeader (sheet, row) =
    match row |> getName, row |> getPhone, row |> getDateAdded, row |> getEmail, row |> getMagicIds with
    | Some name, Some phone, Some dateAdded, Some email, Some magicIds ->
        { Sheet = sheet
          Name = name
          Phone = phone
          DateAdded = dateAdded
          Email = email
          MagicIds = magicIds }
        |> Some
    | _ -> None

This is pretty simple: match the result of the get___ function for each record property with option, and if we have Some of everything then return it. This doesn't care what type anything is for the most part, the get___ function will decide how the row maps to the type.

Grab some values

For Phone, DateAdded, and Email it's simple:

let getPhone row =
    match row |> (cellWhere ((=) "Phone Number") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getDateAdded row =
    match row |> (cellWhere ((=) "Date Added") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getEmail row =
    match row |> (cellWhere ((=) "Email") >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

In fact, we could replace all these with the same method:

let getCol col row =
    match row |> (cellWhere ((=) col) >> Seq.head) with
    | Some cell -> cell |> (colIndex >> Some)
    | _ -> None

let getPhone : IRow -> int option = getCol "Phone Number"
let getDateAdded : IRow -> int option = getCol "Date Added"        
let getEmail : IRow -> int option = getCol "Email"

So that's what we'll do. Awesome, simple and efficient. Avoids repetiveness, and keeps us in check. The next step is either Magic ID or Name. Since the Magic ID is easier, we'll do the name first.

let getName row =
    match row |> (cellWhere ((=) "Name") >> Seq.head),
          row |> (cellWhere ((=) "First Name") >> Seq.head),
          row |> (cellWhere ((=) "Middle Name") >> Seq.head),
          row |> (cellWhere ((=) "Last Name") >> Seq.head) with
    | Some cell, _, _, _ -> cell |> (colIndex >> NameMap.Combined >> Some)
    | _, Some f, m, Some l ->
        NameMap.Split (f |> colIndex, m |> Option.map colIndex, l |> colIndex)
        |> Some
    | _ -> None

Oh, that's not so bad. If this was the harder of the two, that can only mean that Magic ID must be pretty simple.

let getMagicIds row =
    match row |> (cellWhere ((=) "Magic ID") >> Seq.head), row |> cellWhere (startsWith "Magic ID") with
    | Some cell, _ -> cell |> (colIndex >> Combined >> Some)
    | _, cells when cells |> Seq.isEmpty = false -> cells |> (Seq.choose id >> Seq.map colIndex >> Seq.toArray >> Split >> Some)
    | _ -> None

Dang, is that it? Do note that we're actually applying the str1 = str2 function as a partial application, which gives us:

  • Shorter, cleaner syntax;
  • The potential for higher-level optimizations;
  • An easier process to follow;

You'll also notice I've been doing x |> (f >> g) instead of x |> f |> g, this is a slightly longer syntax (two parenthesis added), but it makes things clearer to me: the operation is one, continuous operation, there's no partial flow here. The f >> g composition is a single operation, and it makes it easy to abstract that out to a new function if necessary.

Map the headers

Alright, so we've built the header, the next step is to map them:

let range uBound = [0..uBound - 1]
let getRow i (x : ISheet) = i |> x.GetRow
let testFileName = "C:\Users\ebrown\Desktop\Step2_Testing.xlsx"
let workbook =
    use fs = new FileStream(testFileName, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)
let headers =
    workbook.NumberOfSheets
    |> (range
    >> Seq.choose (workbook.GetSheetAt >> Option.ofObj)
    >> Seq.map (fun sheet -> (sheet, sheet |> (getRow 0 >> Option.ofObj)))
    >> Seq.choose (fun (sheet, r1) -> r1 |> Option.map(fun r -> (sheet, r)))
    >> Seq.choose mapHeader)

Now our headers value will be a Header sequence, which allows us to do all sorts of stuff with them, including headers |> Seq.iter (fun header -> printfn "%s: %A" header.Sheet.SheetName header). A debug print, pretty simple, no?

Get some data from our spreadsheet

Now we get to a slightly more difficult task: how do we get the users? Well first, I want to define the syntax I expect from our API:

let users =
    headers
    |> Seq.map (getUsers workbook)

Ok, that's reasonable. We expect to call a getUsers function which is an IWorkbook -> Header -> User array. We actually have most of the structure built, so we'll go back and define our getUsers function.

As I was defining it, I realized the addition of workbook to the getUsers call is superfluous: our header already has a reference to ISheet, which is all we need here, so we'll drop that call and it becomes a basic one-liner:

let users = header |> Seq.map getUsers

Rightey-o then, let's move on. For getUsers we need to do a couple things:

  1. We need to collect each row from the ISheet;
  2. We need to split each row into the relevant user-array;
  3. We need to flatten the result;

This is somewhat easy, since we know the steps:

let getUsers header =
    let sheet = header.Sheet
    [1..sheet.LastRowNum]
    |> (Seq.choose (sheet.GetRow >> Option.ofObj)
    >> Seq.choose (getUserRows header)
    >> Seq.toArray
    >> Array.flatten
    >> Array.map separateUsers
    >> Array.flatten)

Easy enough, nothing too complex here. Let's build the getUserRows function first. We'll actually build a new type here, to hold our intermediate row. (This helps keep each function short and clean.)

type UserRow = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Phone : string option
    DateAdded : string option
    Email : string option
    MagicId : string
}

Keep in mind that this is an intermediate record type - we don't care as much about our values, we're going to transform this to the final type in the separateUsers function. Now getUserRows is responsible for doing our name transformation, which we agreed upon before, and it's responsible for splitting a MagicIdMap.Split into multiple rows.

Get the Cell values

From here we know we need a couple other things, one of which is the ability to get a value of a cell:

let getColVal c (row : IRow) = c |> row.GetCell |> getVal

And we need to separateUsers, which is the easy part:

let separateUsers user =
    user.MagicId
    |> String.splitOpt (StringSplitOptions.RemoveEmptyEntries) ' '
    |> Seq.map(fun id ->
        { User.FirstName = user.FirstName
          MiddleName = user.MiddleName
          LastName = user.LastName
          Email = user.Email |> (function | None -> String.Empty | Some v -> v)
          MagicId = id |> int })

Pretty basic, iterate our MagicId split string and work out what they are.

Build the massive, ugly function

Then we want to write getUserRows, which is long and quite possibly disobeys some best-practices, but it does exactly what we want, and well:

let getUserRows header (row : IRow) : UserRow seq option =
    let splitNames (name : string) =
        match ' ' |> name.Split with
        | [|first; last|] -> (Some first, None, Some last)
        | [|first; middle; last|] -> (Some first, Some middle, Some last)
        | _ -> (None, None, None)

    let buildUserVals first middle last phone dateAdded email magicId =
        { UserRow.FirstName = first
          MiddleName = middle
          LastName = last
          Phone = phone |> Option.map Object.toStr
          DateAdded = dateAdded |> Option.map Object.toStr
          Email = email |> Option.map Object.toStr
          MagicId = magicId |> Object.toStr }

    let finalResult first middle last =
        match header.MagicIds with
        | MagicIdMap.Combined magicIdCol ->
            match row |> getColVal magicIdCol with
            | Some magicId ->
                [|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
                |> Seq.ofArray
                |> Some
            | _ -> None
        | MagicIdMap.Split ids ->
            ids
            |> Seq.choose (fun id ->
                match row |> getColVal id with
                | Some magicId ->
                    buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
                    |> Some
                | _ -> None)
            |> Some

    match header.Name with
    | NameMap.Split (f, m, l) ->
        match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
              row |> getColVal f |> Option.map (Object.toStr),
              m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
              row |> getColVal l |> Option.map (Object.toStr) with
        | Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
        | _, Some f, m, Some l -> finalResult f m l
        | _ -> None
    | NameMap.Combined n -> 
        match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
        | Some (Some f, m, Some l) -> finalResult f m l
        | _ -> None

So what's happening here? Well, the first function inside here is splitNames, which is trivial:

let splitNames (name : string) =
    match ' ' |> name.Split with
    | [|first; last|] -> (Some first, None, Some last)
    | [|first; middle; last|] -> (Some first, Some middle, Some last)
    | _ -> (None, None, None)

If we split the name into two pieces, then return the two pieces as the first and third tuple elements. If we split it into three pieces, then return first, middle, last. We'll assume that names are fairly standard in the manner they're non-standard.

Next' we have this:

let buildUserVals first middle last phone dateAdded email magicId =
    { UserRow.FirstName = first
      MiddleName = middle
      LastName = last
      Phone = phone |> Option.map Object.toStr
      DateAdded = dateAdded |> Option.map Object.toStr
      Email = email |> Option.map Object.toStr
      MagicId = magicId |> Object.toStr }

This is basically a function which takes our relevant paraneters, and builds the final UserRow value from them. Pretty easy, nothing complex.

The finalResult starts getting hinky:

let finalResult first middle last =
    match header.MagicIds with
    | MagicIdMap.Combined magicIdCol ->
        match row |> getColVal magicIdCol with
        | Some magicId ->
            [|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
            |> Seq.ofArray
            |> Some
        | _ -> None
    | MagicIdMap.Split ids ->
        ids
        |> Seq.choose (fun id ->
            match row |> getColVal id with
            | Some magicId ->
                buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
                |> Some
            | _ -> None)
        |> Some

So this tests the MagicIds against the MagicIdMap, if it's Split then we're returning the array of users immediately. If it's Combined then we're returning a one-element array, built from the remaining values.

In hindsight, we really should clean some of this up to make it simpler:

let finalResult first middle last =
    let phone = row |> getColVal header.Phone
    let dateAdded = row |> getColVal header.DateAdded
    let email = row |> getColVal header.Email
    match header.MagicIds with
    | MagicIdMap.Combined magicIdCol ->
        match row |> getColVal magicIdCol with
        | Some magicId ->
            [|buildUserVals first middle last phone dateAdded email magicId|]
            |> Seq.ofArray
            |> Some
        | _ -> None
    | MagicIdMap.Split ids ->
        ids
        |> Seq.choose (fun id ->
            match row |> getColVal id with
            | Some magicId ->
                buildUserVals first middle last phone dateAdded email magicId
                |> Some
            | _ -> None)
        |> Some

That's better, we don't need to inline our phone, dateAdded and email data.

Finally, the match which is executed:

match userRow.Name with
| NameMap.Split (f, m, l) ->
    match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
          row |> getColVal f |> Option.map (Object.toStr),
          m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
          row |> getColVal l |> Option.map (Object.toStr) with
    | Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
    | _, Some f, m, Some l -> finalResult f m l
    | _ -> None
| NameMap.Combined n -> 
    match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
    | Some (Some f, m, Some l) -> finalResult f m l
    | _ -> None

This just tests the name information, and does the relevant transformations. You'll notice the first sub-match tests four values, that's because I'm lazy and refuse to put yet another sub-match expression in. Finally, we can run and test our code:

let users = headers |> Seq.map getUsers
users |> Seq.iter (printfn "%A")

This should write some output like the following:

seq
  [{FirstName = "John";
    MiddleName = Some "C.";
    LastName = "2";
    Email = "email@example.com";
    MagicId = 287;}; {FirstName = "Jane";
                      MiddleName = Some "L.";
                      LastName = "2";
                      Email = "otheremail@example.com";
                      MagicId = 331;}; {FirstName = "Jane";
                                        MiddleName = Some "L.";
                                        LastName = "2";
                                        Email = "otheremail@example.com";
                                        MagicId = 334;}; {FirstName = "Ron";
                                                          MiddleName = null;
                                                          LastName = "Doe";
                                                          Email = "";
                                                          MagicId = 872;}; ...]
seq
  [{FirstName = "John";
    MiddleName = Some "C.";
    LastName = "Doe";
    Email = "email@example.com";
    MagicId = 287;}; {FirstName = "Jane";
                      MiddleName = Some "L.";
                      LastName = "Doe";
                      Email = "otheremail@example.com";
                      MagicId = 331;}; {FirstName = "Jane";
                                        MiddleName = Some "L.";
                                        LastName = "Doe";
                                        Email = "otheremail@example.com";
                                        MagicId = 334;}; {FirstName = "Ron";
                                                          MiddleName = null;
                                                          LastName = "Doe";
                                                          Email = "";
                                                          MagicId = 872;}; ...]

Finally, save our data

Finally, we need to save our data. This is a pretty easy step, we'll modify getUsers just a little, and then user our users sequence to finish things off:

let getUsers header =
    let sheet = header.Sheet
    (header,
     [1..sheet.LastRowNum]
     |> (Seq.choose (sheet.GetRow >> Option.ofObj)
     >> Seq.choose (getUserRows header)
     >> Seq.flatten
     >> Seq.map separateUsers
     >> Seq.flatten))

let users = headers |> Seq.map getUsers

As you can see, the only modification is to have it map users and a header together, then we'll drop it all to a new workbook and sheet, and save.

Saving things is actually quite simple, we define a few helpers to set values, then we build a save function that takes our header and users and saves them to the new workbook:

let setCellValueOptStr (cell : ICell) : string option -> unit =
    Option.map cell.SetCellValue >> ignore
let setCellValueStr (cell : ICell) : string -> unit =
    cell.SetCellValue >> ignore
let saveSingle (row : IRow) (user : User) =
    (0 |> row.CreateCell, user.FirstName) ||> setCellValueStr
    (1 |> row.CreateCell, user.MiddleName) ||> setCellValueOptStr
    (2 |> row.CreateCell, user.LastName) ||> setCellValueStr
    (3 |> row.CreateCell, user.Email) ||> setCellValueStr
    (4 |> row.CreateCell, user.MagicId |> string) ||> setCellValueStr
let saveHeader (row : IRow) =
    (0 |> row.CreateCell, "First Name") ||> setCellValueStr
    (1 |> row.CreateCell, "Middle Name") ||> setCellValueStr 
    (2 |> row.CreateCell, "Last Name") ||> setCellValueStr
    (3 |> row.CreateCell, "Email") ||> setCellValueStr
    (4 |> row.CreateCell, "Magic ID") ||> setCellValueStr

The helpers make it easy to keep items separated so that concerns are only on their own scope.

let save header users =
    let newFile = header.Sheet.SheetName
    let newBook = XSSFWorkbook()
    let newSheetName = "Process_Spec"
    let newSheet = newSheetName |> newBook.CreateSheet
    0 |> newSheet.CreateRow |> saveHeader
    let savedRows =
        users
        |> Seq.fold (fun row user ->
            let newRow = row |> newSheet.CreateRow
            (newRow, user) ||> saveSingle
            row + 1) 1
    let datePart = System.DateTime.Now.ToString("yyyy-MM-dd")
    let newBookFile =
        sprintf "%s\%s-%s%s" fileInfo.DirectoryName newFile datePart fileInfo.Extension
    use fw = File.Create(newBookFile)
    newBook.Write(fw)
    printfn "Saved to %s" newBookFile
    savedRows

The sprintf is similar to printf except that it returns the format string instead of printing directly.

We can save our data with a single line: users |> Seq.iter (fun value -> value ||> save |> ignore). This will run through the entire pricess at this point and then save it all. (The let users = headers |> Seq.map getUsers doesn't do anything other than build a chain of Seq functions to create our result.)

And bam, we're done. Overally we have a hundred or so lines of code, and they completely normalize our example data. The next lesson will try to normalize things a bit better, and take some other oddities into account (such as names like "La Rouse", "Mac Gregor", etc.).


As we conclude this lesson (which, let's face it, is mostly a code-dump) I have been reflecting on my experiences with the IT world, software development, F#, C#, PHP, and other languages and experiences, and reminding myself that everything I've been working towards, building this lesson series, developing products for my company, building software to test my own knowledge, and even sharing some experiences with others have been strongly defined to push myself towards teaching and sharing knowledge with others. I realized (not so long ago) that the only thing I truly enjoy doing is helping others succeed, so I've been working towards that day-by-day for years. This series itself is a testament to that: I'm not expecting to get anything out of it other than hoping at least one person finds something here userful. That's all I want, and all I ask. If you find something useful, reach out on Twitter or email me. One of my (many) emails is elliott [dot] brown or ebrown (either work), at this very domain. (Robots should have a hard time with that, but I have faith in the people reading this to be able to parse it out.)