Power Query: Regex-ish II
Note: I just did a quick Google on "Power Query regex" and some techniques appeared in the results that is pretty much this method here. However, I swear that I came up with this independently, and if anything I feel legitimised that the technique is sound.
I have previously written on how to simulate "regex" in Power Query. The solution I present here in this post is way cooler.
The problem
Observe the following list of project names. There is a "project number" in there, that is the form of XXXXXX-XXXXX (six numbers, a dash, then five numbers).
| Project Display Names |
|---|
| AUS-GGG-311012-01663_UA23013 |
| PER-GGG-311012-01440-Mining |
| PER-GGG-B&P: 412078-48529-Mining Company 2 |
| BNE->-311001-00032-Mining-Contract - Offsite |
| ADL-TES-BHP-SRE SPS |
You will see that it's not a simple matter to extract the project number from this. You can't split by a single delimiter and pick out a particular position. The project numbers are bounded by different characters - sometimes dashes, sometimes underscores, and so on. Sometimes there is no project number. You can "land on" the project number by specifying a position in the string.
The solution here is to break up each project display name into a list of chunks. Imagine that a chunking machine goes along the string left to right, taking out strings of a certain width, and storing them in an array. In that array, there will be a chunk that matches the XXXXXX-XXXXX signature. This then must be our project number, and we take it.
convertToSignature
This function takes a string, and converts it to a "signature". In this scenario, a "signature" is the pattern of numbers and non-numbers of the string, where a 1 is a number, and a 0 is a non-number.
Code
convertToSignature = (_text as text) as text =>
let
// character codes for numbers 0 - 9
numberCharacters = {48..57},
// break up the string into a list of characters
characterList = Text.ToList(_text),
// tag each number as a 1, and non-number as a 0
converted = List.Transform(
characterList,
each
let
thisCharacter = Character.ToNumber(_),
isNumber = List.Contains(numberCharacters, thisCharacter)
in
if isNumber then "1" else "0"
),
// this is a string of 1s and 0s that marks which characters are numbers and which aren't
signature = Text.Combine(converted)
in
signature
For example, if we convert all of the project display names to these "signatures", this is what we get.
| Project Display Name Signatures |
|---|
0000000011111101111100011111 |
000000001111110111110000000 |
000000000000011111101111100000000000000001 |
0000000011111101111100000000000000000000000000 |
0000000000000000000 |
Scanning through one string
Let's take the first string in our list: AUS-GGG-311012-01663_UA23013.
To go through the scanning algorithm, we will need two things: the displayName, and the targetSignature. The targetSignature in this situation is 111111011111.
Our scan length is the length of the targetSignature. We are going to break up the displayName into segments of this length, going from head to tail.
targetSignatureLength = Text.Length(targetSignature)
The following passage makes sure the "scanner" starts at the head of the string, and stops right at the tail.
length = Text.Length(displayName),
numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
scanSegments = List.Transform(
{0..numberOfSegments},
each Text.Middle(displayName, _, targetSignatureLength)
)
The result is the following list (truncated):
| Segments |
|---|
| AUS-GGG-3110 |
| US-GGG-31101 |
| ... |
| -311012-0166 |
| 311012-01663 |
| 11012-01663_ |
| ... |
As a check step (this won't be in the final algorithm), if we convert each of those segments into a "signature", this is what we get:
signatures = List.Transform(
scanSegments,
each convertToSignature(_)
)
| Signatures |
|---|
| 000000001111 |
| 000000011111 |
| ... |
| 011111101111 |
| 111111011111 |
| 111110111110 |
| ... |
Knowing this, we can now filter for the segment that matches the target signature.
pickSegment = List.Select(
scanSegments,
each convertToSignature(_) = targetSignature
)
The last part is to deal with the scenario where we don't have a match at all.
projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")
We can now package this up as a function:
extractStringMatchingSignature = (displayName as text, targetSignature as text) =>
let
targetSignatureLength = Text.Length(targetSignature),
length = Text.Length(displayName),
numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
scanSegments = List.Transform(
{0..numberOfSegments},
each Text.Middle(displayName, _, targetSignatureLength)
),
pickSegment = List.Select(
scanSegments,
each convertToSignature(_) = targetSignature
),
projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")
in
projectNumber
Result
The final result looks like this.
| Project Number |
|---|
| 311012-01663 |
| 311012-01440 |
| 412078-48529 |
| 311001-00032 |
| null |