Dyota's blog

PowerShell: Data Processing

Sometimes, instead of using Power Query to process data, I need to use PowerShell. These are my learnings, from the perspective of a mainly Power Query user, translating those skills and techniques into PowerShell.

Why?

Long refresh times

Sometimes, a refresh in Power Query takes so long for complex algorithms that it becomes a long-term bottleneck. I would rather move these bottlenecks out of Power BI so that I don't have to endure the long refresh times every time I need to update the data model.

The case that motivated me to do this was a project where the data that I receive is in the form of Excel spreadsheets. These spreadsheets are downloaded from an ERP system, and are therefore merely aspects of the underlying database.

Instead of having direct access to the database, I have just facets of it, which I then use to "recreate" the database's data model within Power BI. Doing this necessitates a few "round trips" with the data, where I'm pulling the transactional data in, and then "distinct"-ing the attributes to normalise them into their own tables, and so on, for multiple tables.
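
In PowerShell terms, each of those round trips ends up looking roughly like the sketch below; the file name and column names here are hypothetical.

# pull the transactional data in
$transactions = Import-Csv '.\transactions.csv'

# "distinct" a couple of attributes out into their own dimension table
$customers = $transactions | Select-Object CustomerId, CustomerName -Unique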

Lessons

The lessons here are not sorted or categorised in any way. They are just the ones that I found notable.

[List]

We really need a [List] data type here. Bring the namespace in right at the top of the script:

using namespace System.Collections.Generic; # for List
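
With that namespace in scope, here is a minimal sketch of how I use it (the property name Index is just illustrative):

# a growable list of rows; Add() appends in place,
# unlike += on an array, which copies the whole array each time
$rows = [List[object]]::new()

foreach ($i in 1..3) {
    $rows.Add([pscustomobject] @{ Index = $i })
}

$rows.Count    # 3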

[ordered]

PowerShell does not preserve the order of keys in a plain hashtable (the nearest equivalent of a Power Query record). I like them kept in a particular, specified order, just like in Power Query. So, prepend all hashtables with [ordered].
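
For example (the values are just illustrative):

# without [ordered], a plain @{ } makes no promises about key order
$car = [ordered] @{
    make  = 'Mitsubishi'
    model = 'Lancer'
    year  = 2008
}

$car.Keys    # make, model, year, in that order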

Text encoding

One of the datasets that I am working with contains people's names, which include some characters outside the English Latin alphabet. The file format is .csv.

Now, it is not enough to call Import-Csv and then Export-Csv and call it a day. It took me some time to realise that the special characters were not coming out correctly. The file needs to be imported with the right encoding, and then exported with (possibly) a different encoding.

In my case, I had to import as UTF7, check that the special characters came through in the shell, and then export as Unicode, and check that the special characters came through in the final output file.
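
In code, that looks something like this (the file names here are placeholders):

# import with the encoding the source file was actually written in
$people = Import-Csv '.\people.csv' -Encoding UTF7

# then export with an encoding that preserves the special characters
$people | Export-Csv '.\people-fixed.csv' -NoTypeInformation -Encoding Unicode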

How did I know what text encoding to import as? I did it by trial and error. I isolated the record that had the special characters (with Select-Object), and imported it multiple times, with different text encodings each time, to see which one worked correctly.
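
A rough sketch of that loop (the file name and row index are illustrative):

# show the suspect record under each candidate encoding
foreach ($encoding in 'UTF7', 'UTF8', 'Unicode') {
    Write-Host "--- $encoding ---"
    Import-Csv '.\people.csv' -Encoding $encoding |
        Select-Object -Index 41 |    # the row that contains the special characters
        Out-Host
}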

Unicode text encoding will bloat the file by a lot. There is no need to use it if the final output file does not contain special characters.

Table.Combine()

Apparently, there is no need for a Table.Combine() step in PowerShell.

If you do the following, $data will contain all of the rows from all of the CSV files, already combined: each Import-Csv call streams its rows into the pipeline, and the pipeline flattens them into a single collection.

$data = dir '*.csv' |
    % {
        # each Import-Csv emits its rows straight into the pipeline
        Import-Csv $_.FullName
    }

null values in hashtables

If you have a hashtable where one of the values is $null, and you try to Export-Csv it, the export will fail with an error that says something like

Object reference not set to an instance of an object.

Instead of $null, use an empty string (''). For example, don't do this:

[ordered] @{
    make  = 'Mitsubishi'
    model = 'Lancer'
    year  = $null
} | Export-Csv '.\this.csv'

Instead, do this:

[ordered] @{
    make  = 'Mitsubishi'
    model = 'Lancer'
    year  = ''
} | Export-Csv '.\this.csv'