using Programming;

A Blog about some of the intrinsics related to programming and how one can get the best out of various languages.

Getting started with programming and getting absolutely nowhere (Part 5)

Now we begin our adventure

Lesson 4: Moving on to Solving Business Problems
Lesson 6: It's Time to Accept Arguments

Finally, we've been working on this for a little more than a week now, it's time we get to actually processing a spreadsheet. We're going to solve one of the many business problems that I've had, though it's going to be significantly anonymized. The same principles will hold true, and I've specifically structured this data so that we have to fight some of the same issues I've had to fight. While this is technically an example project, this is entirely based on a real-world problem, and everything we'll use here holds true. (In fact, with minimal modification this example project can be used to solve the same issue I am at work.)

Now, we're going to explore this over a few posts, we are going to jump right in and start developing, but in stages. I want to do things one part at a time, so that you can see how and why it's important to formulate a "thought process" while developing software. As I mentioned before we're going to use real problems, and I won't give you all the answers, but that doesn't mean I won't give you some of the answers. We're going to begin today with loading and reading our Excel spreadsheet, we won't do much else because I want you to think about what the next step might be.

Defining the problem

The business problem we're going to solve first is based upon a list of items I receive on a periodic basis, that has tens of thousands of record, which I have to parse and insert into a database. Previously, this was done with a few simple steps in Excel, though sometimes it took a lot more than just a few steps. Some spreadsheets are not even a little bit normalized, and some have very bad data in them. One of them that I receive (which we'll actually solve last) comes in a very weird format, and I have to do some clever tricks to get it in the correct format. Once I've normalized the spreadsheet, I compare it to the most recently received spreadsheet of the same category, and determine what records are "new" to the database based on what is new to this spreadsheet.

So, our process will have three major steps, one of which is optional:

  1. (Optional) Preprocess the really messed up spreadsheet
  2. Normalize the Spreadsheet, splitting data appropriately.
  3. Compare the sheet to the previous version, determine what records are new and indicate them.

These are actually three separate applications that I have built at work, they all support "drag and drop" functionality, which means you can just drag a spreadsheet on to the "step" represented by the application. Between steps there is a short manual process, but that's only because I haven't yet automated it. (One day, I will.)

We're going to start with step 3, the reason for this being that if we start with the last step, we can work from what our expected output is to the input. It doesn't matter which order you go in, but for something as unguided as this I like to work backwards: figure out what our expected output is, then design a function (or set of functions) that can transform our input to the output. Since steps 1 and 2 only put the data in an "intermediate form" we don't care what the output is, we can make step 1 turn our output into whatever is appropriate for step 2, and step 2 can prepare it for step 3. The actual outputs there are irrelevant, we need to figure out what the output of step 3 is, then manipulate our input to fit it. You can work these problems in either direction, and in the future we'll work some problems from input » output, but today we'll stick to the output « input manner.

What output do we need?

This is the easiest part -- we simply need an excel sheet that has the following columns from our input data in this order: First Name, Middle Name, Last Name, Email, Magic ID. If a "person" has multiple Magic ID values, then we want multiple records per value. So if Elliott Brown has 1 and 2 for Magic ID, then there should be two records: Elliott V. Brown ... 1 and Elliott V. Brown ... 2. Simple, right?

So let's look at our input and try to determine how to format it to our output. This is actually going to be somewhat challenging, it may look easy, but it really isn't. The first issue we note is that there are three name columns on the first sheet, but for some records the middle initial is part of the first name. Well that's not good. We also note that there are two Magic ID columns, and some records have neither, some have one, and some have both. Alright, that's not as bad to deal with. And of course, we have two different email formats: some have none, some have one, and one has email1; email2. Looking at Sheet 2, we get even less consistent: if a user had multiple Magic ID values then they're seperated by a number of spaces, and the first, middle and last names are all in the same column. Oh boy.

Every single one of these problems is somethign that my real data has: some records have a combination of the problems, some have one or two, some have none. We're going to build our application on the assumption that every row may have any combination of problems, but some of the problems in the sheet will be consistent.

The Magic ID format is going to be consistent across the whole sheet: if the sheet has one record with multiple columns for the Magic ID, then all records on the sheet will have that same structure. If the sheet has one name with all the names combined, then all records on the sheet will be like that. However, this only applies to first and last name, the first and middle may be arbitrarily combined or separated, so we still have to test for that.

Let's define a type for our output

This is a good start to getting our output in the right format - if we know what business rules apply to all output records then we can build a type that meets them. In our case, the following type is applicable:

type ListPerson = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : string
}

So now we have a type which enforces our business rules: for our output we can only have a type that meets all those criteria, thus it must have a FirstName, LastName, Email and MagicId, but it might have a MiddleName.

Building an input to create our output

We cannot forget that we're currently thinking about the matching bit, not the normalization. So right now we only care about how to form some data for our output, and since we know that the output is the same as the input, just skinnied down (because we're removing items in this list that are on the previous one), then we can decide that our input must be the same format as our output. Thus, the ListPerson record should also be used to enforce the input.

This said, we know we need to take a list of ListPerson in from the current list, and a ListPerson list from the previous list, and find the mutually exclusive values. This is actually really Easy with F#, and we can define a function that does this:

let listCompare other el =
    other |> List.filter (compare el) |> List.length = 0

This takes an "other" list an element from the current list to search for. This uses a custom compare function, for reasons we'll discuss shortly, filters the list down to only elements where compare returns true, and returns true if there are no elements in the list. This is pretty simple, and let's us call it by:

List.filter (listCompare oldListMembers)

Now we could, and should, modify listCompare to be more generic. Instead of using our forced compare function (defined elsewhere) we should let it take a function, but I'm going to leave that up to you.

Why did we use a custom comparer, and am I going to show it to you?

Of course, this custom compare function is defined so that we can make an optional comparison. By default F# build a full equality comparer for a record type, but that means that if MiddleName is a string option, and in one list it's Some value, but in another list it's None, we'll treat those as unique records. In our particular case we don't want to do that: we want these to compare based on the data that both records have present. So if MiddleName is Some v for both, then they can compare MiddleName, but if it's None for either, then don't compare it.

The custom compare function is basically the following:

let compare (m1 : ListPerson) (m2 : ListPerson) =
    // Compare mandatory values first
    m1.FirstName = m2.FirstName &&
    m1.LastName = m2.LastName &&
    m1.Email = m2.Email &&
    m1.MagicId = m2.MagicId &&
    // Default the match to `true` if either is `None` so that the previous conditions are the only influence
    match m1.MiddleName, m2.MiddleName with
    | Some a, Some b -> a = b
    | _ -> true

As you can see it just checks everything, and doesn't count a difference in the type of option stored as a false result.

We'll also want to do a List.distinctBy after our filter - in my real data I often get duplicated records within the same list but with or without a MiddleName, for example. We'll actually cheat a bit, and do the following:

List.distinctBy (fun u -> { u with MiddleName = None })

By setting MiddleName to None, then using that for the distinctBy, we'll only return one of the two records (and since the MiddleName is tertiary data in my situation, I don't care which) so that we don't have to deal with duplicates on that front either.

So in the end, we'll end up with some code that looks as follows:

let listCompare other el =
    other |> List.filter (compare el) |> List.length = 0

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> List.filter (listCompare oldListMembers)
    |> List.distinctBy (fun u -> { u with MiddleName = None })
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> List.filter (listCompare newListMembers)
    |> List.distinctBy (fun u -> { u with MiddleName = None })

Well that's pretty basic. We should extract that fun u to a new function, which we may as well do now:

let distinction u = { u with MiddleName = None }

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> List.filter (listCompare oldListMembers)
    |> List.distinctBy distinction
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> List.filter (listCompare newListMembers)
    |> List.distinctBy distinction

And hell, we could even make that a total composition:

let listCompare other el = other |> List.filter (compare el) |> List.length = 0
let distinction u = { u with MiddleName = None }
let filterBy other = List.filter (listCompare other) >> List.distinctBy distinction

printfn "Getting members not on old list."
let newListUnique =
    newListMembers
    |> filterBy oldListMembers
printfn "Getting members not on new list."
let oldListUnique =
    oldListMembers
    |> filterBy newListMembers

Easy enough, it also helps us see what's happening more clearly, since the steps are clearly isolated.

Pulling some data in

Now that we know what our filter is, what our data is, how it all interacts, let's pull in some good data. We're going to pull in data that already fits our model, which is included in the Step 3 Preparation workbook that's attached. There are just a few rows, like the rest of our data, but there's enough that we can do something with it. NPOI, and F# in general, make it really easy to load data. In fact, way easier than it should be. The F# compiler contains a lot of tricks up it's sleeve to make our lives easier, and we'll explore a couple of those pretty easily.

Today we're going to actually pull all the data in, filter it, and write it out. To pull the data in is pretty easy:

let newBookFilename = "C:\Users\Elliott\Desktop\Step3_Testing_New.xlsx"
let newBook =
    use fs = new FileStream(newBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let oldBookFilename = "C:\Users\Elliott\Desktop\Step3_Testing_Old.xlsx"
let oldBook =
    use fs = new FileStream(oldBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

To do the XSSFWorkbook bit (which is just a construction of the XSSFWorkbook class from our FileStream) you need to open NPOI.XSSF.UserModel beforehand. I like to arrange these at the top of my file when they're large opens like this.

The next step is to pull our Process_Spec sheet in:

let newSheet = "Process_Spec" |> newBook.GetSheet
let oldSheet = "Process_Spec" |> oldBook.GetSheet

Well that was simple. We're already about 1/4 the way done. Let's pull in a list of all our specification members:

let newListMembers =
    [1..newSheet.LastRowNum]
    |> Seq.choose (newSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

let oldListMembers =
    [1..oldSheet.LastRowNum]
    |> Seq.choose (oldSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

Well this is really easy. And look, we've set ourselves up to just drop in the filtering we wrote in the previous section, damn F# really makes this simple. We start at 1 instead of 0 as we want to skip the header row. We're actually almost done.

We want to save three new sheets on our newWorkbook now: the list of all members on the newBook not on the oldBook, the list of all members on the oldBook not on the newBook, and then we'll format the list of members not on the oldBook in a specific format. This is used at work to drop into a separate Excel sheet that I format my SQL queries with, so some day this will be automated.

let newSheetName = "Process_Spec_New_Add"
let oldSheetName = "Process_Spec_Old_Add"
let templateSheetName = "Process_Spec_Template"

let header = { FirstName = "First Name"; MiddleName = Some "Middle Name"; LastName = "Last Name"; Email = "Email"; MagicId = "Magic ID" }
let addSheet func (workbook : IWorkbook) name list =
    match name |> workbook.GetSheet |> Option.ofObj with
    | Some _ -> name |> workbook.GetSheetIndex |> workbook.RemoveSheetAt
    | None -> ()

    let sheet = name |> workbook.CreateSheet
    (header::list) |> List.iteri (fun i el -> i |> sheet.CreateRow |> func el)

addSheet fillRow newBook newSheetName newListUnique
addSheet fillRow newBook oldSheetName oldListUnique
addSheet fillTemplateRow newBook templateSheetName newListUnique

let write (book : IWorkbook) =
    use fs = File.Create("Spec3_Testing_Finished.xlsx")
    book.Write(fs)

write newBook

And we're done. Boy, that was easy. I left the implemenation of getSpecMember, fillRow and fillTemplateRow out, so let me include them quick for you. They're quite easy to define, and they really do give us more control over what's going on. We end up needing two functions first, which are helpers to set values on a cell:

let setValue (s : string) (cell:ICell) =
    s |> cell.SetCellValue
let setValueOpt (s : string option) (cell:ICell) =
    s |> Option.map (fun s -> cell |> setValue s) |> ignore

After that, we need another helper to get values, this is pretty easy and I excluded formulas because I expect our previous process to strip them out entirely.

let getVal (cell : ICell) : string option = 
    cell
    |> Option.ofObj
    |> Option.map (fun cell ->
        cell.CellType
        |> (function | CellType.Boolean -> cell.BooleanCellValue |> Object.toStr | CellType.Numeric -> cell.NumericCellValue |> Object.toStr | _ -> cell.StringCellValue))

You remember our Object.toStr method, right? If not, I recommend trying to implement it from scratch, based on the idea that it returns o.ToString(). We then have a getSpecMember method:

let getSpecMember (row : IRow) =
    let get = row.GetCell >> getVal
    match 0 |> get, 1 |> get, 2 |> get, 3 |> get, 4 |> get with
    | Some fName, mName, Some lName, Some email, Some magicId ->
        Some { FirstName = fName; MiddleName = mName; LastName = lName; Email = email; MagicId = magicId }
    | _ -> None

We use the options to tell us if we have a valid spec or not, when matched with Seq.choose it will filter away all the None values., Next are the basic fillRow and fillTemplateRow, which should be self-explanatory:

let fillRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId

let fillTemplateRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId
    5 |> row.CreateCell |> setValue (el.MagicId |> String.digitsOnly)

The difference in them is that fillTemplateRow includes an additional column for the MagicId, but with only digit characters in it. In the real data I often get other characters in that column, which may or may not be valid, and for the processing I have to do I only want to consider the numeric characters at times. To use the IRow, ICell and IWorkbook you'll need to open NPOI.SS.UserModel, but then that's it. We have all the pieces to complete our task, in fact, we're pretty much done with it.

Put it together and ship it

At this point, we're done with the third step, and now we know what format our output from Step 2 should be in. This is important to establish as it allows us to work the two ends together - we'll actually work from the start and finish towards each other when we start with Step 2, which I think will end up being covered over more than one post, because it's a long and painful process. Hopefully today's post showed you how to actually start working on projects like a software engineer, and started to allow you to build up your knowledge and thought process. It's important to know how to solve problems, that's probably the biggest thing I see new software engineers struggle with. You need to break it down, and start thinking about what steps need to be done. Build things one step at a time - don't mind of it goes slowly at first, you'll eventually get to the point where you just start breaking problems down without even thinking about them.

By the time we finish, working it as an F# script should look something like:

open System
open System.IO
open NPOI.XSSF.UserModel
open NPOI.SS.UserModel

type ListPerson = {
    FirstName : string
    MiddleName : string option
    LastName : string
    Email : string
    MagicId : string
}

let compare (m1 : ListPerson) (m2 : ListPerson) =
    // Compare mandatory values first
    m1.FirstName = m2.FirstName &&
    m1.LastName = m2.LastName &&
    m1.Email = m2.Email &&
    m1.MagicId = m2.MagicId &&
    // Default the match to `true` if either is `None` so that the previous conditions are the only influence
    match m1.MiddleName, m2.MiddleName with
    | Some a, Some b -> a = b
    | _ -> true

let setValue (s : string) (cell:ICell) =
    s |> cell.SetCellValue
let setValueOpt (s : string option) (cell:ICell) =
    s |> Option.map (fun s -> cell |> setValue s) |> ignore

let getVal (cell : ICell) : string option = 
    cell
    |> Option.ofObj
    |> Option.map (fun cell ->
        cell.CellType
        |> (function | CellType.Boolean -> cell.BooleanCellValue |> Object.toStr | CellType.Numeric -> cell.NumericCellValue |> Object.toStr | _ -> cell.StringCellValue))

let listCompare other el = other |> List.filter (compare el) |> List.length = 0
let distinction u = { u with MiddleName = None }
let filterBy other = List.filter (listCompare other) >> List.distinctBy distinction

let getSpecMember (row : IRow) =
    let get = row.GetCell >> getVal
    match 0 |> get, 1 |> get, 2 |> get, 3 |> get, 4 |> get with
    | Some fName, mName, Some lName, Some email, Some magicId ->
        Some { FirstName = fName; MiddleName = mName; LastName = lName; Email = email; MagicId = magicId }
    | _ -> None

let fillRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId

let fillTemplateRow el (row : IRow) =
    0 |> row.CreateCell |> setValue el.FirstName
    1 |> row.CreateCell |> setValueOpt el.MiddleName
    2 |> row.CreateCell |> setValue el.LastName
    3 |> row.CreateCell |> setValue el.Email
    4 |> row.CreateCell |> setValue el.MagicId
    5 |> row.CreateCell |> setValue (el.MagicId |> String.digitsOnly)

let newBookFilename = "C:\Users\EBrown\Desktop\Step3_Testing_New.xlsx"
let newBook =
    use fs = new FileStream(newBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let oldBookFilename = "C:\Users\EBrown\Desktop\Step3_Testing_Old.xlsx"
let oldBook =
    use fs = new FileStream(oldBookFilename, FileMode.Open, FileAccess.Read)
    XSSFWorkbook(fs)

let newSheet = "Process_Spec" |> newBook.GetSheet
let oldSheet = "Process_Spec" |> oldBook.GetSheet

let newListMembers =
    [1..newSheet.LastRowNum]
    |> Seq.choose (newSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

let oldListMembers =
    [1..oldSheet.LastRowNum]
    |> Seq.choose (oldSheet.GetRow >> Option.ofObj)
    |> Seq.choose getSpecMember
    |> Seq.toList

printfn "Getting members not on old list."
let newListUnique = newListMembers |> filterBy oldListMembers
printfn "Getting members not on new list."
let oldListUnique = oldListMembers |> filterBy newListMembers

let newSheetName = "Process_Spec_New_Add"
let oldSheetName = "Process_Spec_Old_Add"
let templateSheetName = "Process_Spec_Template"

let header = { FirstName = "First Name"; MiddleName = Some "Middle Name"; LastName = "Last Name"; Email = "Email"; MagicId = "Magic ID" }
let addSheet func (workbook : IWorkbook) name list =
    match name |> workbook.GetSheet |> Option.ofObj with
    | Some _ -> name |> workbook.GetSheetIndex |> workbook.RemoveSheetAt
    | None -> ()

    let sheet = name |> workbook.CreateSheet
    (header::list) |> List.iteri (fun i el -> i |> sheet.CreateRow |> func el)

addSheet fillRow newBook newSheetName newListUnique
addSheet fillRow newBook oldSheetName oldListUnique
addSheet fillTemplateRow newBook templateSheetName newListUnique

let write (book : IWorkbook) =
    use fs = File.Create("C:\Users\EBrown\Desktop\Step3_Testing_Finished.xlsx")
    book.Write(fs)

write newBook

This is roughly 120 lines of code. We read two spreadsheets, filtered the members from each down to only unique values, and saved them back out in 120 lines of code. That's phenomenal. In the past five posts, we went from doing basic function work, to processing a full-on spreadsheet and filtering data in it. You're going to see that we're going to continue this trend of build and improve over the next few posts. We'll actually work out how to replace the variables we use with new and improved parameters, allowing us to send parameters to our application via arguments, which means we will allow the user to have a more dynamic interaction.

As we continue to explore (and I recommend you explore some on your own) we'll get more in-depth looks at how we can process these spreadsheets more effectively. We'll get in to doing the actual, raw-level morphing coming up soon, and actually translate our bad sheets into decent sheets, and decent sheets into good sheets.

This is all actually fairly straightforward, and you'll see once we get started that it's not nearly as hard as I make it out to be. We'll go through things rather quickly, but once it all clicks you should be able to go back to the previous posts and review them to make certain you understand what's happening. Most of what I do is pretty simple, and fairly reasonable, but there may be things that confuse you a bit or make you think a little harde than normal - that's intended. I want you to have to think and understand, I want you to think critically about what's going on, becuase that is the only way you become a better programmer.


I started this whole blog series for a friend of mine (currently a U.S. Marine deployed to Turkey) who has asked me, quote possibly fifty times over the past three years or so, to teach him how to become a programmer. (I mentioned this in the first of these articles.) We both haven't had schedules well enough to connect on it, so this series should allow a person to work at their own pace while still allowing me to work at mine. If anyone has any suggestions for material to cover, or how, please tweet at me (@EBrown8534) or comment, I really want to make this a good series for as many people as possible. And feel free to let me know if you like anything in particular as well, I would really love to hear any input you might have.

Second, if you're using VBA in Excel (or, really, any other VBA environment), this GoFundMe could really use your support! These guys and gals have been building a wonderful add-on for the Excel VBE (and they've even spent a good amount of time making it work in all VBE's, such as AutoCAD, SolidWorks, CorelDRAW and Sage 300 ERP), for a while now, all on their own free time. For the amount of work required, that's incredible. Even a few bucks will help them - they're a really great group.

Spreadsheets: Step3_Testing_Old.xlsx (9KB) Step3_Testing_New.xlsx (9.1KB)