Dyota's blog

Power Query: Regex-ish II

Note: I just did a quick Google on "Power Query regex" and some techniques appeared in the results that is pretty much this method here. However, I swear that I came up with this independently, and if anything I feel legitimised that the technique is sound.

I have previously written on how to simulate "regex" in Power Query. The solution I present here in this post is way cooler.

The problem

Observe the following list of project names. There is a "project number" in there, that is the form of XXXXXX-XXXXX (six numbers, a dash, then five numbers).

Project Display Names
AUS-GGG-311012-01663_UA23013
PER-GGG-311012-01440-Mining
PER-GGG-B&P: 412078-48529-Mining Company 2
BNE--311001-00032-Mining-Contract - Offsite
ADL-TES-BHP-SRE SPS

You will see that it's not a simple matter to extract the project number from this. You can't split by a single delimiter and pick out a particular position. The project numbers are bounded by different characters - sometimes dashes, sometimes underscores, and so on. Sometimes there is no project number. You can "land on" the project number by specifying a position in the string.

The solution here is to break up each project display name into a list of chunks. Imagine that a chunking machine goes along the string left to right, taking out strings of a certain width, and storing them in an array. In that array, there will be a chunk that matches the XXXXXX-XXXXX signature. This then must be our project number, and we take it.

convertToSignature

This function takes a string, and converts it to a "signature". In this scenario, a "signature" is the pattern of numbers and non-numbers of the string, where a 1 is a number, and a 0 is a non-number.

Code

convertToSignature = (_text as text) as text => 
    let
        // character codes for numbers 0 - 9
        numberCharacters = {48..57},
        // break up the string into a list of characters
        characterList = Text.ToList(_text),
        // tag each number as a 1, and non-number as a 0
        converted = List.Transform(
            characterList, 
            each 
                let
                    thisCharacter = Character.ToNumber(_),
                    isNumber = List.Contains(numberCharacters, thisCharacter)
                in
                    if isNumber then "1" else "0"
        ),
        // this is a string of 1s and 0s that marks which characters are numbers and which aren't
        signature = Text.Combine(converted)
    in
        signature

For example, if we convert all of the project display names to these "signatures", this is what we get.

Project Display Name Signatures
0000000011111101111100011111
000000001111110111110000000
000000000000011111101111100000000000000001
0000000011111101111100000000000000000000000000
0000000000000000000

Scanning through one string

Let's take the first string in our list: AUS-GGG-311012-01663_UA23013.

To go through the scanning algorithm, we will need two things: the displayName, and the targetSignature. The targetSignature in this situation is 111111011111.

Our scan length is the length of the targetSignature. We are going to break up the displayName into segments of this length, going from head to tail.

targetSignatureLength = Text.Length(targetSignature)

The following passage makes sure the "scanner" starts at the head of the string, and stops right at the tail.

length = Text.Length(displayName),
numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
scanSegments = List.Transform(
    {0..numberOfSegments},
    each Text.Middle(displayName, _, targetSignatureLength)
)

The result is the following list (truncated):

Segments
AUS-GGG-3110
US-GGG-31101
...
-311012-0166
311012-01663
11012-01663_
...

As a check step (this won't be in the final algorithm), if we convert each of those segments into a "signature", this is what we get:

signatures = List.Transform(
    scanSegments, 
    each convertToSignature(_)
)
Signatures
000000001111
000000011111
...
011111101111
111111011111
111110111110
...

Knowing this, we can now filter for the segment that matches the target signature.

pickSegment = List.Select(
    scanSegments, 
    each convertToSignature(_) =  targetSignature
)

The last part is to deal with the scenario where we don't have a match at all.

projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")

We can now package this up as a function:

extractStringMatchingSignature = (displayName as text, targetSignature as text) =>
    let
        targetSignatureLength = Text.Length(targetSignature),
        length = Text.Length(displayName),
        numberOfSegments = if (length-targetSignatureLength) < 0 then 0 else length-targetSignatureLength,
        scanSegments = List.Transform(
            {0..numberOfSegments},
            each Text.Middle(displayName, _, targetSignatureLength)
        ),
        pickSegment = List.Select(
            scanSegments, 
            each convertToSignature(_) = targetSignature
        ),
        projectNumber = if List.Count(pickSegment) = 0 then null else Text.Combine(pickSegment, "")
    in
        projectNumber

Result

The final result looks like this.

Project Number
311012-01663
311012-01440
412078-48529
311001-00032
null

#patternmatching #powerquery #regex #strings