Let's normalize our data!
Lesson 7: Working out our Step 2 Process
Lesson 9: Knowing the Terminology is Important
Finally, time to write out our step 2 code. This has been a bit of a lengthy journey so far, and most of that is my fault, but it's finally time to write some code to do what we want. I'm going to throw a lot at you very quickly, but it's important to not lose focus. If you have questions, do some web-searches while you're thinking of them, it'll help you stay in the mindset I hope.
The first step here is to define a type, or a few types. We need 4 to boot.
type NameMap =
| Combined of c : int
| Split of firstName : int * middleName : int option * lastName : int
type MagicIdMap =
| Combined of c : int
| Split of ids : int array
type Header = {
Sheet : ISheet
Name : NameMap
Phone : int
DateAdded : int
Email : int
MagicIds : MagicIdMap
}
type User = {
FirstName : string
MiddleName : string option
LastName : string
Email : string
MagicId : int
}
You should understand what those are by now, so I won't go into any detail. We're still using NPOI, we're still using F#, and we're still tyring to be as idiomatic as possible.
Some boilerplate
You should also remember this little guy:
let getVal (cell : ICell) : obj option =
match cell with
| null -> None
| _ ->
match cell.CellType with
| CellType.Boolean -> cell.BooleanCellValue |> box |> Some
| CellType.Numeric -> cell.NumericCellValue |> box |> Some
| CellType.String -> cell.StringCellValue |> box |> Some
| _ -> None
And we're also going to add a few other helpers:
let startsWith v (x : string) = v |> x.StartsWith
let atLeastOne items =
if items |> Seq.isEmpty then [None] |> Seq.ofList
else items
let colIndex (cell : ICell) = cell.ColumnIndex
None of which should be difficult. The atLeastone
will always make sure a Seq
is not empty, so that we can simplify tests later. We need to be able to filter out cells based on what value they have, so we have this function:
let cellWhere pred (row : IRow) =
row.Cells
|> (Seq.map (fun c -> (c, c |> (getVal >> Option.map Object.toStr)))
>> Seq.filter (fun (c, v) -> v |> (function | None -> false | Some n -> n |> pred))
>> Seq.map (fun (c, v) -> v |> Option.map (fun y -> c))
>> atLeastOne)
Interestingly, this will give us an ICell option sequence
, which means we can then Seq.choose
on it if we need, or in the case of only one result we can Seq.head
. We need to filter our header out first, so we want something like the following:
let mapHeader (sheet, row) =
match row |> getName, row |> getPhone, row |> getDateAdded, row |> getEmail, row |> getMagicIds with
| Some name, Some phone, Some dateAdded, Some email, Some magicIds ->
{ Sheet = sheet
Name = name
Phone = phone
DateAdded = dateAdded
Email = email
MagicIds = magicIds }
|> Some
| _ -> None
This is pretty simple: match the result of the get___
function for each record property with option
, and if we have Some
of everything then return it. This doesn't care what type anything is for the most part, the get___
function will decide how the row maps to the type.
Grab some values
For Phone
, DateAdded
, and Email
it's simple:
let getPhone row =
match row |> (cellWhere ((=) "Phone Number") >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None
let getDateAdded row =
match row |> (cellWhere ((=) "Date Added") >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None
let getEmail row =
match row |> (cellWhere ((=) "Email") >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None
In fact, we could replace all these with the same method:
let getCol col row =
match row |> (cellWhere ((=) col) >> Seq.head) with
| Some cell -> cell |> (colIndex >> Some)
| _ -> None
let getPhone : IRow -> int option = getCol "Phone Number"
let getDateAdded : IRow -> int option = getCol "Date Added"
let getEmail : IRow -> int option = getCol "Email"
So that's what we'll do. Awesome, simple and efficient. Avoids repetiveness, and keeps us in check. The next step is either Magic ID
or Name
. Since the Magic ID
is easier, we'll do the name first.
let getName row =
match row |> (cellWhere ((=) "Name") >> Seq.head),
row |> (cellWhere ((=) "First Name") >> Seq.head),
row |> (cellWhere ((=) "Middle Name") >> Seq.head),
row |> (cellWhere ((=) "Last Name") >> Seq.head) with
| Some cell, _, _, _ -> cell |> (colIndex >> NameMap.Combined >> Some)
| _, Some f, m, Some l ->
NameMap.Split (f |> colIndex, m |> Option.map colIndex, l |> colIndex)
|> Some
| _ -> None
Oh, that's not so bad. If this was the harder of the two, that can only mean that Magic ID
must be pretty simple.
let getMagicIds row =
match row |> (cellWhere ((=) "Magic ID") >> Seq.head), row |> cellWhere (startsWith "Magic ID") with
| Some cell, _ -> cell |> (colIndex >> Combined >> Some)
| _, cells when cells |> Seq.isEmpty = false -> cells |> (Seq.choose id >> Seq.map colIndex >> Seq.toArray >> Split >> Some)
| _ -> None
Dang, is that it? Do note that we're actually applying the str1 = str2
function as a partial application, which gives us:
- Shorter, cleaner syntax;
- The potential for higher-level optimizations;
- An easier process to follow;
You'll also notice I've been doing x |> (f >> g)
instead of x |> f |> g
, this is a slightly longer syntax (two parenthesis added), but it makes things clearer to me: the operation is one, continuous operation, there's no partial flow here. The f >> g
composition is a single operation, and it makes it easy to abstract that out to a new function if necessary.
Map the headers
Alright, so we've built the header, the next step is to map them:
let range uBound = [0..uBound - 1]
let getRow i (x : ISheet) = i |> x.GetRow
let testFileName = "C:\Users\ebrown\Desktop\Step2_Testing.xlsx"
let workbook =
use fs = new FileStream(testFileName, FileMode.Open, FileAccess.Read)
XSSFWorkbook(fs)
let headers =
workbook.NumberOfSheets
|> (range
>> Seq.choose (workbook.GetSheetAt >> Option.ofObj)
>> Seq.map (fun sheet -> (sheet, sheet |> (getRow 0 >> Option.ofObj)))
>> Seq.choose (fun (sheet, r1) -> r1 |> Option.map(fun r -> (sheet, r)))
>> Seq.choose mapHeader)
Now our headers
value will be a Header sequence
, which allows us to do all sorts of stuff with them, including headers |> Seq.iter (fun header -> printfn "%s: %A" header.Sheet.SheetName header)
. A debug print, pretty simple, no?
Get some data from our spreadsheet
Now we get to a slightly more difficult task: how do we get the users? Well first, I want to define the syntax I expect from our API:
let users =
headers
|> Seq.map (getUsers workbook)
Ok, that's reasonable. We expect to call a getUsers
function which is an IWorkbook -> Header -> User array
. We actually have most of the structure built, so we'll go back and define our getUsers
function.
As I was defining it, I realized the addition of workbook
to the getUsers
call is superfluous: our header
already has a reference to ISheet
, which is all we need here, so we'll drop that call and it becomes a basic one-liner:
let users = header |> Seq.map getUsers
Rightey-o then, let's move on. For getUsers
we need to do a couple things:
- We need to collect each row from the
ISheet
;
- We need to split each row into the relevant user-array;
- We need to flatten the result;
This is somewhat easy, since we know the steps:
let getUsers header =
let sheet = header.Sheet
[1..sheet.LastRowNum]
|> (Seq.choose (sheet.GetRow >> Option.ofObj)
>> Seq.choose (getUserRows header)
>> Seq.toArray
>> Array.flatten
>> Array.map separateUsers
>> Array.flatten)
Easy enough, nothing too complex here. Let's build the getUserRows
function first. We'll actually build a new type
here, to hold our intermediate row. (This helps keep each function short and clean.)
type UserRow = {
FirstName : string
MiddleName : string option
LastName : string
Phone : string option
DateAdded : string option
Email : string option
MagicId : string
}
Keep in mind that this is an intermediate record type - we don't care as much about our values, we're going to transform this to the final type in the separateUsers
function. Now getUserRows
is responsible for doing our name transformation, which we agreed upon before, and it's responsible for splitting a MagicIdMap.Split
into multiple rows.
Get the Cell values
From here we know we need a couple other things, one of which is the ability to get a value of a cell:
let getColVal c (row : IRow) = c |> row.GetCell |> getVal
And we need to separateUsers
, which is the easy part:
let separateUsers user =
user.MagicId
|> String.splitOpt (StringSplitOptions.RemoveEmptyEntries) ' '
|> Seq.map(fun id ->
{ User.FirstName = user.FirstName
MiddleName = user.MiddleName
LastName = user.LastName
Email = user.Email |> (function | None -> String.Empty | Some v -> v)
MagicId = id |> int })
Pretty basic, iterate our MagicId
split string and work out what they are.
Build the massive, ugly function
Then we want to write getUserRows
, which is long and quite possibly disobeys some best-practices, but it does exactly what we want, and well:
let getUserRows header (row : IRow) : UserRow seq option =
let splitNames (name : string) =
match ' ' |> name.Split with
| [|first; last|] -> (Some first, None, Some last)
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| _ -> (None, None, None)
let buildUserVals first middle last phone dateAdded email magicId =
{ UserRow.FirstName = first
MiddleName = middle
LastName = last
Phone = phone |> Option.map Object.toStr
DateAdded = dateAdded |> Option.map Object.toStr
Email = email |> Option.map Object.toStr
MagicId = magicId |> Object.toStr }
let finalResult first middle last =
match header.MagicIds with
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
[|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
|> Some
| _ -> None)
|> Some
match header.Name with
| NameMap.Split (f, m, l) ->
match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
row |> getColVal f |> Option.map (Object.toStr),
m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
row |> getColVal l |> Option.map (Object.toStr) with
| Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
| _, Some f, m, Some l -> finalResult f m l
| _ -> None
| NameMap.Combined n ->
match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
| Some (Some f, m, Some l) -> finalResult f m l
| _ -> None
So what's happening here? Well, the first function inside here is splitNames
, which is trivial:
let splitNames (name : string) =
match ' ' |> name.Split with
| [|first; last|] -> (Some first, None, Some last)
| [|first; middle; last|] -> (Some first, Some middle, Some last)
| _ -> (None, None, None)
If we split the name into two pieces, then return the two pieces as the first and third tuple elements. If we split it into three pieces, then return first, middle, last
. We'll assume that names are fairly standard in the manner they're non-standard.
Next' we have this:
let buildUserVals first middle last phone dateAdded email magicId =
{ UserRow.FirstName = first
MiddleName = middle
LastName = last
Phone = phone |> Option.map Object.toStr
DateAdded = dateAdded |> Option.map Object.toStr
Email = email |> Option.map Object.toStr
MagicId = magicId |> Object.toStr }
This is basically a function which takes our relevant paraneters, and builds the final UserRow
value from them. Pretty easy, nothing complex.
The finalResult
starts getting hinky:
let finalResult first middle last =
match header.MagicIds with
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
[|buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId|]
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
buildUserVals first middle last (row |> getColVal header.Phone) (row |> getColVal header.DateAdded) (row |> getColVal header.Email) magicId
|> Some
| _ -> None)
|> Some
So this tests the MagicIds
against the MagicIdMap
, if it's Split
then we're returning the array of users immediately. If it's Combined
then we're returning a one-element array, built from the remaining values.
In hindsight, we really should clean some of this up to make it simpler:
let finalResult first middle last =
let phone = row |> getColVal header.Phone
let dateAdded = row |> getColVal header.DateAdded
let email = row |> getColVal header.Email
match header.MagicIds with
| MagicIdMap.Combined magicIdCol ->
match row |> getColVal magicIdCol with
| Some magicId ->
[|buildUserVals first middle last phone dateAdded email magicId|]
|> Seq.ofArray
|> Some
| _ -> None
| MagicIdMap.Split ids ->
ids
|> Seq.choose (fun id ->
match row |> getColVal id with
| Some magicId ->
buildUserVals first middle last phone dateAdded email magicId
|> Some
| _ -> None)
|> Some
That's better, we don't need to inline our phone
, dateAdded
and email
data.
Finally, the match
which is executed:
match userRow.Name with
| NameMap.Split (f, m, l) ->
match row |> getColVal f |> Option.map (Object.toStr >> splitNames),
row |> getColVal f |> Option.map (Object.toStr),
m |> Option.bind (fun m -> row |> getColVal m |> Option.map Object.toStr),
row |> getColVal l |> Option.map (Object.toStr) with
| Some (Some f, None, Some m), _, _, Some l -> finalResult f (m |> Some) l
| _, Some f, m, Some l -> finalResult f m l
| _ -> None
| NameMap.Combined n ->
match row |> getColVal n |> Option.map (Object.toStr >> splitNames) with
| Some (Some f, m, Some l) -> finalResult f m l
| _ -> None
This just tests the name information, and does the relevant transformations. You'll notice the first sub-match
tests four values, that's because I'm lazy and refuse to put yet another sub-match
expression in. Finally, we can run and test our code:
let users = headers |> Seq.map getUsers
users |> Seq.iter (printfn "%A")
This should write some output like the following:
seq
[{FirstName = "John";
MiddleName = Some "C.";
LastName = "2";
Email = "email@example.com";
MagicId = 287;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "2";
Email = "otheremail@example.com";
MagicId = 331;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "2";
Email = "otheremail@example.com";
MagicId = 334;}; {FirstName = "Ron";
MiddleName = null;
LastName = "Doe";
Email = "";
MagicId = 872;}; ...]
seq
[{FirstName = "John";
MiddleName = Some "C.";
LastName = "Doe";
Email = "email@example.com";
MagicId = 287;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "Doe";
Email = "otheremail@example.com";
MagicId = 331;}; {FirstName = "Jane";
MiddleName = Some "L.";
LastName = "Doe";
Email = "otheremail@example.com";
MagicId = 334;}; {FirstName = "Ron";
MiddleName = null;
LastName = "Doe";
Email = "";
MagicId = 872;}; ...]
Finally, save our data
Finally, we need to save our data. This is a pretty easy step, we'll modify getUsers
just a little, and then user our users
sequence to finish things off:
let getUsers header =
let sheet = header.Sheet
(header,
[1..sheet.LastRowNum]
|> (Seq.choose (sheet.GetRow >> Option.ofObj)
>> Seq.choose (getUserRows header)
>> Seq.flatten
>> Seq.map separateUsers
>> Seq.flatten))
let users = headers |> Seq.map getUsers
As you can see, the only modification is to have it map users and a header together, then we'll drop it all to a new workbook and sheet, and save.
Saving things is actually quite simple, we define a few helpers to set values, then we build a save
function that takes our header
and users
and saves them to the new workbook:
let setCellValueOptStr (cell : ICell) : string option -> unit =
Option.map cell.SetCellValue >> ignore
let setCellValueStr (cell : ICell) : string -> unit =
cell.SetCellValue >> ignore
let saveSingle (row : IRow) (user : User) =
(0 |> row.CreateCell, user.FirstName) ||> setCellValueStr
(1 |> row.CreateCell, user.MiddleName) ||> setCellValueOptStr
(2 |> row.CreateCell, user.LastName) ||> setCellValueStr
(3 |> row.CreateCell, user.Email) ||> setCellValueStr
(4 |> row.CreateCell, user.MagicId |> string) ||> setCellValueStr
let saveHeader (row : IRow) =
(0 |> row.CreateCell, "First Name") ||> setCellValueStr
(1 |> row.CreateCell, "Middle Name") ||> setCellValueStr
(2 |> row.CreateCell, "Last Name") ||> setCellValueStr
(3 |> row.CreateCell, "Email") ||> setCellValueStr
(4 |> row.CreateCell, "Magic ID") ||> setCellValueStr
The helpers make it easy to keep items separated so that concerns are only on their own scope.
let save header users =
let newFile = header.Sheet.SheetName
let newBook = XSSFWorkbook()
let newSheetName = "Process_Spec"
let newSheet = newSheetName |> newBook.CreateSheet
0 |> newSheet.CreateRow |> saveHeader
let savedRows =
users
|> Seq.fold (fun row user ->
let newRow = row |> newSheet.CreateRow
(newRow, user) ||> saveSingle
row + 1) 1
let datePart = System.DateTime.Now.ToString("yyyy-MM-dd")
let newBookFile =
sprintf "%s\%s-%s%s" fileInfo.DirectoryName newFile datePart fileInfo.Extension
use fw = File.Create(newBookFile)
newBook.Write(fw)
printfn "Saved to %s" newBookFile
savedRows
The sprintf
is similar to printf
except that it returns the format string instead of printing directly.
We can save our data with a single line: users |> Seq.iter (fun value -> value ||> save |> ignore)
. This will run through the entire pricess at this point and then save it all. (The let users = headers |> Seq.map getUsers
doesn't do anything other than build a chain of Seq
functions to create our result.)
And bam, we're done. Overally we have a hundred or so lines of code, and they completely normalize our example data. The next lesson will try to normalize things a bit better, and take some other oddities into account (such as names like "La Rouse", "Mac Gregor", etc.).
As we conclude this lesson (which, let's face it, is mostly a code-dump) I have been reflecting on my experiences with the IT world, software development, F#, C#, PHP, and other languages and experiences, and reminding myself that everything I've been working towards, building this lesson series, developing products for my company, building software to test my own knowledge, and even sharing some experiences with others have been strongly defined to push myself towards teaching and sharing knowledge with others. I realized (not so long ago) that the only thing I truly enjoy doing is helping others succeed, so I've been working towards that day-by-day for years. This series itself is a testament to that: I'm not expecting to get anything out of it other than hoping at least one person finds something here userful. That's all I want, and all I ask. If you find something useful, reach out on Twitter or email me. One of my (many) emails is elliott [dot] brown or ebrown (either work), at this very domain. (Robots should have a hard time with that, but I have faith in the people reading this to be able to parse it out.)