Power Query: Regex-ish II
Note: I just did a quick Google on "Power Query regex" and some techniques appeared in the results that is pretty much this method here. However, I swear that I came up with this independently, and if anything I feel legitimised that the technique is sound.
I have previously written on how to simulate "regex" in Power Query. The solution I present here in this post is way cooler.
The problem
Observe the following list of project names. There is a "project number" in there, that is the form of XXXXXX-XXXXX (six numbers, a dash, then five numbers).
Project Display Names |
---|
AUS-GGG-311012-01663_UA23013 |
PER-GGG-311012-01440-Mining |
PER-GGG-B&P: 412078-48529-Mining Company 2 |
BNE--311001-00032-Mining-Contract - Offsite |
ADL-TES-BHP-SRE SPS |
You will see that it's not a simple matter to extract the project number from this. You can't split by a single delimiter and pick out a particular position. The project numbers are bounded by different characters - sometimes dashes, sometimes underscores, and so on. Sometimes there is no project number. You can "land on" the project number by specifying a position in the string.
The solution here is to break up each project display name into a list of chunks. Imagine that a chunking machine goes along the string left to right, taking out strings of a certain width, and storing them in an array. In that array, there will be a chunk that matches the XXXXXX-XXXXX signature. This then must be our project number, and we take it.
convertToSignature
This function takes a string, and converts it to a "signature". In this scenario, a "signature" is the pattern of numbers and non-numbers of the string, where a 1
is a number, and a 0
is a non-number.
Code
convertToSignature = (_text as text) as text =>
let
// character codes for numbers 0 - 9
numberCharacters = {48..57},
// break up the string into a list of characters
characterList = Text.ToList(_text),
// tag each number as a 1, and non-number as a 0
converted = List.Transform(
characterList,
each
let
thisCharacter = Character.ToNumber(_),
isNumber = List.Contains(numberCharacters, thisCharacter)
in
if isNumber then "1" else "0"
),
// this is a string of 1s and 0s that marks which characters are numbers and which aren't
signature = Text.Combine(converted)
in
signature
For example, if we convert all of the project display names to these "signatures", this is what we get.
Project Display Name Signatures |
---|
0000000011111101111100011111 |
000000001111110111110000000 |
000000000000011111101111100000000000000001 |
0000000011111101111100000000000000000000000000 |
0000000000000000000 |
Scanning through one string
Let's take the first string in our list: AUS-GGG-311012-01663_UA23013.
To go through the scanning algorithm, we will need two things: the displayName
, and the targetSignature
. The targetSignature
in this situation is 111111011111
.
Our scan length is the length of the targetSignature
. We are going to break up the displayName
into segments of this length, going from head to tail.
targetSignatureLength = Text.Length(targetSignature)
The following passage makes sure the "scanner" starts at the head of the string, and stops right at the tail.
length = Text.Length(displayName),
numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
scanSegments = List.Transform(
{0..numberOfSegments},
each Text.Middle(displayName, _, targetSignatureLength)
)
The result is the following list (truncated):
Segments |
---|
AUS-GGG-3110 |
US-GGG-31101 |
... |
-311012-0166 |
311012-01663 |
11012-01663_ |
... |
As a check step (this won't be in the final algorithm), if we convert each of those segments into a "signature", this is what we get:
signatures = List.Transform(
scanSegments,
each convertToSignature(_)
)
Signatures |
---|
000000001111 |
000000011111 |
... |
011111101111 |
111111011111 |
111110111110 |
... |
Knowing this, we can now filter for the segment that matches the target signature.
pickSegment = List.Select(
scanSegments,
each convertToSignature(_) = targetSignature
)
The last part is to deal with the scenario where we don't have a match at all.
projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")
We can now package this up as a function:
extractStringMatchingSignature = (displayName as text, targetSignature as text) =>
let
targetSignatureLength = Text.Length(targetSignature),
length = Text.Length(displayName),
numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
scanSegments = List.Transform(
{0..numberOfSegments},
each Text.Middle(displayName, _, targetSignatureLength)
),
pickSegment = List.Select(
scanSegments,
each convertToSignature(_) = targetSignature
),
projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")
in
projectNumber
Result
The final result looks like this.
Project Number |
---|
311012-01663 |
311012-01440 |
412078-48529 |
311001-00032 |
null |