Dyota's blog

PowerShell: data profiling using duckdb

I've been on a real journey the last few days.

This is a continuation of an earlier post I wrote about using PowerShell and .NET to do data profiling.

The main motivator is this: the data that I want to profile is too large to handle in Excel. Formula refresh times become a real limiting factor, and because I'm Power Querying a whole bunch of tables into the same file, the file size gets prohibitive as well. So I want to work directly with .csv files, using more powerful programs to do it.

Turns out, using raw PowerShell and .NET is a stupid way of going about things, actually.

The code is really unreadable and hard to follow, not to mention laborious to write.

This post outlines some ways of using duckdb, a program that can query CSV files (and other data files) using SQL. It handles CSVs natively, so let's use duckdb instead.

Things to know about duckdb:

In all of the examples below, I will be invoking a custom function ddb, which runs a SQL query through the duckdb CLI and converts its CSV output into PowerShell objects:

function ddb ($query) { return duckdb -csv -nullvalue "" -c $query | ConvertFrom-Csv }
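For example, a throwaway query with no file involved (just to show the wrapper in action):

```powershell
# Run a literal query through the ddb wrapper.
# duckdb prints CSV (header + rows); ConvertFrom-Csv turns that
# into PowerShell objects, so the result here is one object
# with a single 'answer' property.
ddb "select 42 as answer"
```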

Reading files

Full documentation for reading/writing CSV files.

Duckdb can conveniently read from file paths that are passed in. Most of the time I use read_csv(). Here is an example:

$file1 = 'something.csv'
$file2 = 'somethingelse.csv'

$data = ddb @"

select
* 
from read_csv(['$file1', '$file2'], filename = true, dateformat = '%Y-%m-%d')

"@

Features shown above:

- Passing a list of file paths reads them all as one table.
- filename = true adds a column recording which file each row came from.
- dateformat tells the parser how to interpret date columns.

Writing files

If I need to write back out to a CSV, I use PowerShell for that, not duckdb.
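A minimal sketch of that round trip (the file names here are hypothetical): run the query with ddb, then write the resulting objects back out with Export-Csv.

```powershell
# Query with duckdb, write back out with PowerShell.
# 'input.csv' and 'output.csv' are placeholder file names.
$data = ddb "select * from read_csv('input.csv')"

# Export-Csv writes a header row and quotes fields;
# -NoTypeInformation suppresses the #TYPE line on Windows PowerShell 5.1
# (it is already the default on PowerShell 7+).
$data | Export-Csv -Path 'output.csv' -NoTypeInformation
```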

Expressions

String splitting

The main function for splitting strings is string_split(). Choose elements from the resulting array using square brackets for array access (indexing starts at 1, not 0).

string_split(Value, '|')[1]
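Inside a full query, that expression looks like this (the column name Value and the output alias are just placeholders); it keeps the text before the first '|' in each row:

```powershell
# Hypothetical example: extract the first pipe-delimited segment.
$data = ddb @"
select
    string_split(Value, '|')[1] as FirstPart
from read_csv('$file1')
"@
```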

Patterns

Replacing values, via changes file

The pattern here is to have the original file ($original) and a changes file ($changes), and join them on a common key. If a record exists in $changes, use its value; otherwise keep the original.

select
    coalesce(changes.Value, original.Value) as Value
from read_csv('$original') as original
    left join read_csv('$changes') as changes on original.key = changes.key

Get duplicates

select
    o.*
    , (
        select count(*)
        from read_csv('$original') as o2
        where 
            o.key1 = o2.key1
            and o.key2 = o2.key2
    ) as duplicate_count
from read_csv('$original') as o
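The correlated subquery above rescans the file once per row. Duckdb also supports window functions, so the same count can be computed in one pass; a sketch (same placeholder key columns), using qualify to keep only the rows that actually have duplicates:

```powershell
# One-pass alternative: count duplicates with a window function,
# then filter on that count with qualify.
$duplicates = ddb @"
select
    *
    , count(*) over (partition by key1, key2) as duplicate_count
from read_csv('$original')
qualify count(*) over (partition by key1, key2) > 1
"@
```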