Dyota's blog

PowerShell: data profiling formulas using .NET

This is a companion or a continuation of this post on Excel data profiling formulas.

In Excel, we have Excel table objects that we are comparing between, and we write in columns with the data checks in them.

With PowerShell, our "tables" will be text files. Here, I will assume that we're working with .csvs. We will write in columns for the data checks, and produce a new file with the data checks in them (leaving the source file intact). The advantage PowerShell has over Excel is that we can generate new files. This means that for each type of check, we can generate a file that only contains the faults, so that we can work on them in isolation.

In Excel, the main technique was using the COUNTIFS() function - that appeared everywhere. Here, the main technique is using this class:

[System.Collections.Generic.Hashset][1]

For the rest of this post, I will simply refer to it as [HashSet].

The main table that we are checking is called $data. All of the reference tables will be called $ref1, $ref2, and so on.

Setup

Getting the data

I will assume that the data is an array of PowerShell objects that is created by doing Import-Csv or something equivalent

$filename = ''

$data = Import-Csv $filename -Encoding utf8BOM

Constructing keys

In Excel, you can check against multiple properties simply by referring to those properties either in the main table or the reference tables.

The way of working with [HashSet] is that we will construct a key which is a concatenated string of the properties that you are interested in.

This function wil generate a key based on an array of property names.

function generateKey ([array] $properties) {
    $key = ($properties | % { $row.$_ }) -join "|"
    return $key
}

Methods

We can take advantage of two methods in [HashSet]:

Because HashSets are unique sets, if we attempt to .Add() another element into the set that is already there, the method will return 'false`. We can use this to our advantage when filtering for uniqueness.

Applications

Check if this record exists in another table

using namespace System.Collections.Generic

# extract data from files

$filename = '.\NM_ResourceAssignment.csv'

$referenceFilename = '.\NM_ProjectTasks.csv'

$baseName = Split-Path $filename -LeafBase

$ext = Split-Path $filename -Extension

$outFilename = "$($baseName)_changed$ext"

$data = Import-Csv $filename -Encoding utf8BOM |
    select -first 10

$ref = Import-Csv $referenceFilename -Encoding utf8BOM 


function generateKey ([array] $properties) {
    $key = ($properties | % { $row.$_ }) -join "|"
    return $key
}

$keys = [HashSet[string]]::new([StringComparer]::OrdinalIgnoreCase)

# reference table
foreach ($row in $ref) {
    $properties = 'Project', 'TaskID'
    $key = generateKey $properties
    # $key

    [void] $keys.Add($key)
}

# main table
foreach ($row in $data) {
    $properties = 'Project', 'Task'
    $key = generateKey $properties
    
    $checkName = 'Result'
    
    $row | Add-Member -NotePropertyName $checkName -NotePropertyValue $null

    $condition = $keys.Contains($key)

    $row.$checkName = if ($condition) { $true } else { $false }
}

$data

Check if all records in this table are unique

Check if this value exists in another table, in two different columns

Split by delimiter and pick the first element

Checking for date overlaps

Contains text

[1] What is a hash set?