PowerShell: data profiling using duckdb
I've been on a real journey the last few days.
This is a continuation of an earlier post I wrote about using PowerShell and .NET to do data profiling.
The main motivator is this: the data that I want to profile is too large to handle in Excel. The formula refresh times become a real limiting factor, and because I'm Power Querying a whole bunch of tables into the same file, the file size gets prohibitive as well. So I want to work directly with .csv files, and use more powerful programs to do it.
Turns out, using raw PowerShell and .NET is a stupid way of going about things, actually.
The code is really unreadable and hard to follow, not to mention laborious to write.
This post outlines some ways of using duckdb, a program that can query CSV files (and other data files) using SQL. It handles CSVs natively, so let's use duckdb instead.
Things to know about duckdb:
- Uses SQL to query CSV files (and JSON)
- Can write back to filesystem
- Much more convenient to use than raw PowerShell commandlets
- Supports a few different file encodings
In all of the examples below, I will be invoking a custom function ddb, which runs a query through duckdb and converts the CSV output into PowerShell objects:
function ddb ($query) { return duckdb -csv -nullvalue "" -c $query | ConvertFrom-Csv }
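A quick way to sanity-check the function is a literal query that doesn't touch any file. One caveat worth knowing: because the results round-trip through CSV, every column comes back as a string, so cast in SQL if you need real numbers. A minimal sketch:

```powershell
# Smoke test for ddb: no CSV file involved, just a literal select.
$result = ddb "select 1 as id, 'hello' as greeting"

# Each row is a PowerShell object with one property per column.
# Note: values are strings, since they round-trip through CSV output.
$result.id
$result.greeting
```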
Reading Files
Full documentation for reading/writing CSV files.
Duckdb can conveniently read from file paths that are passed in. The majority of the time I use read_csv(). Here is an example:
$file1 = 'something.csv'
$file2 = 'somethingelse.csv'
$data = ddb @"
select
*
from read_csv( [ '$file1', '$file2' ], filename = true, dateformat = '%Y-%m-%d' )
"@
Features:
- Reading multiple files at once
- For each row, indicate which file it came from using the filename = true parameter
- Specify the date format (full list of specifiers)
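As mentioned above, duckdb also supports a few file encodings. read_csv takes an encoding parameter for this; a sketch (check your duckdb version's CSV documentation for the exact list of supported values):

```powershell
# Read a file that isn't UTF-8; 'latin-1' is one of the encodings
# duckdb's CSV reader understands (alongside utf-8 and utf-16).
$data = ddb @"
select *
from read_csv( '$file1', encoding = 'latin-1' )
"@
```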
Writing files
If I need to write back out to a CSV, I use PowerShell for that, not duckdb.
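A minimal sketch of that round trip, with 'output.csv' as a placeholder path. Export-Csv handles quoting for you; -NoTypeInformation suppresses the type header on Windows PowerShell 5.1 (it's already the default in PowerShell 7+):

```powershell
# Query with duckdb, write back out with PowerShell.
$data = ddb "select * from read_csv('$file1')"
$data | Export-Csv -Path 'output.csv' -NoTypeInformation -Encoding UTF8
```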
Expressions
String splitting
The main function to split strings is string_split(). Choose elements from the resulting array using square brackets for array access (indexing starts at 1, not 0).
string_split(Value, '|')[1]
Patterns
Replacing values, via changes file
The pattern here is to have the original file ($original), a changes file ($changes), and join them by a common key. If the record exists in $changes, then use the value in $changes.
select
coalesce(changes.Value, original.Value) as Value
from read_csv('$original') as original
left join read_csv('$changes') as changes on original.key = changes.key
Get duplicates
select
o.*
, (
select count(*)
from read_csv('$original') as o2
where
o.key1 = o2.key1
and o.key2 = o2.key2
) as duplicate_count
from read_csv('$original') as o
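As written, the query above reports a count for every row, duplicated or not. To keep only the actual duplicates, one option is to wrap it in a subquery and filter on the count (a sketch; recent duckdb versions also support a qualify clause for this kind of thing):

```sql
select *
from (
    -- same correlated count per (key1, key2) as above
    select
        o.*
        , (
            select count(*)
            from read_csv('$original') as o2
            where o.key1 = o2.key1 and o.key2 = o2.key2
        ) as duplicate_count
    from read_csv('$original') as o
)
where duplicate_count > 1
```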