ETL tactics

In the absence of a real and robust data pipeline, here are some lessons from recent experience.

Move some things out of Power Query

Power Query is powerful, but it is not always performant. I had a few recursive functions running over masses of data, and they blew my refresh times out to the order of half an hour.

I rewrote the whole thing in PowerShell and it ran much faster. The trade-off is that I now have to trigger the PowerShell script manually, but I will accept that in exchange for reasonable refresh times in Power BI.
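To give a sense of the shape of this, here is a minimal PowerShell sketch of the kind of work that can be moved out of Power Query. The file paths, the column names (Id, ParentId) and the parent-walking logic are assumptions for illustration, not my actual script; the point is that the recursion happens once, outside Power BI, and Power BI just loads the flattened output.

```powershell
# Minimal sketch: assumes a transactions.csv with Id and ParentId columns.
# The recursion happens here, once, and Power BI only loads flattened.csv.
$rows = Import-Csv -Path 'C:\data\transactions.csv'

# Index rows by Id so each parent lookup is a hash lookup, not a table scan.
$byId = @{}
foreach ($row in $rows) { $byId[$row.Id] = $row }

# Walk up the ParentId chain to find each row's top-level ancestor.
function Get-RootId {
    param($row)
    while ($row.ParentId -and $byId.ContainsKey($row.ParentId)) {
        $row = $byId[$row.ParentId]
    }
    return $row.Id
}

$flattened = foreach ($row in $rows) {
    [pscustomobject]@{
        Id     = $row.Id
        RootId = Get-RootId $row
    }
}

$flattened | Export-Csv -Path 'C:\data\flattened.csv' -NoTypeInformation
```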

Some things don't need to be refreshed every time

Some tables do not need to be refreshed every time. The faster-changing, faster-updating ones may need a weekly or even a daily refresh, but others will only change about once a month.

This is where having an external script (e.g. in PowerShell) is handy. It is not necessary to burn computing power on something that won't change very much.
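As a sketch of what that can look like (the paths and the 30-day cadence here are assumptions, not a prescription), the script can simply skip the expensive rebuild when the output file is recent enough:

```powershell
# Minimal sketch: only rebuild the slow-changing table if its output file
# is missing or more than ~30 days old.
$output = 'C:\data\slow_dimension.csv'

$needsRefresh = $true
if (Test-Path $output) {
    $age = (Get-Date) - (Get-Item $output).LastWriteTime
    $needsRefresh = $age.TotalDays -ge 30
}

if ($needsRefresh) {
    # Whatever expensive extraction produces the table goes here
    # (Build-SlowDimension.ps1 is a hypothetical script name).
    & 'C:\scripts\Build-SlowDimension.ps1'
}
```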

Don't "normalise" data in Power Query

By "normalise" I mean recreating dimension tables from a fully loaded fact table. Doing that in Power Query comes at a big performance cost: the PQ engine has to suck in all of your transactions and only then derive the dimension tables from them. Do this work outside instead.
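Here is a hedged sketch of doing it outside, again in PowerShell. The file names and columns (CustomerId, CustomerName and so on) are made up for illustration; the idea is that the dimension table is split out of the fact table before Power BI ever sees it:

```powershell
# Minimal sketch: split a customer dimension out of a denormalised fact table
# so Power Query never has to derive it from the full transaction load.
$facts = Import-Csv -Path 'C:\data\transactions.csv'

# One row per customer becomes the dimension table.
$facts |
    Select-Object CustomerId, CustomerName -Unique |
    Export-Csv -Path 'C:\data\dim_customer.csv' -NoTypeInformation

# The fact table keeps only the key and the measures.
$facts |
    Select-Object TransactionId, CustomerId, Amount |
    Export-Csv -Path 'C:\data\fact_transactions.csv' -NoTypeInformation
```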

Check for uniqueness

Sometimes, a table that you expect to be distinct (i.e. one you expect to use as a primary key) turns out not to be. Checking for uniqueness upfront saves a lot of heartache compared to finding out when your data model errors out.
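One way to do that check in the same external script (the file and column names are assumptions for illustration):

```powershell
# Minimal sketch: flag any value of the would-be key column that appears
# more than once, before the file goes anywhere near the data model.
$rows = Import-Csv -Path 'C:\data\dim_customer.csv'

$duplicates = @($rows |
    Group-Object -Property CustomerId |
    Where-Object { $_.Count -gt 1 })

if ($duplicates.Count -gt 0) {
    Write-Warning "CustomerId is not unique ($($duplicates.Count) duplicated value(s))."
    $duplicates | ForEach-Object { Write-Warning "Duplicate key: $($_.Name)" }
}
```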

In Power Query itself, Table.Profile is a quick check: comparing a column's Count with its DistinctCount tells you whether it can really serve as a key, and whether you need to cut your data down before going any further.