Dyota's blog

Excel: data profiling formulas

It has been so long since I've had to use lots of Excel. Claude has had to help me a lot! There are a lot of new formulas that I've missed out on in the last 10 years as well.

Here are some techniques for some common data-profiling tasks. I needed to use all of these to make data integrity checks across several different tables and data sets.

Check if this record exists in another table

Using COUNTIF:

=COUNTIFS(
    table2[Column1], [@Column1],
    table2[Column2], [@[Column2]],
    ...
) > 0

Check if all records in this table are unique

This is similar to checking if a record exists on a table, except that the target table is itself, and the equality is 1.

Using COUNTIF:

=COUNTIFS(
    [Column1], [@Column1],
    [Column2], [@[Column2]],
    ...
) = 1

Using SUMPRODUCT:

=SUMPRODUCT(
     1 
     * ([Column1] = [@Column1]) 
     * ([Column2] = [@Column2])
     * ...
) = 1

Check if this value exists in another table, in two different columns

Here, the + operator acts as or.

= SUMPRODUCT(
    ( table2[Column1] = [@[This Value]] )
    + ( table2[Column2] = [@[This Value]] )
) > 0

Check if string has special characters

=LET(
    text, [@text],
    chars, MID(text, SEQUENCE(LEN(text)), 1),
    codes, CODE(chars),
    isDigit, (codes >= 48) * (codes <= 57),
    isUpperLetter, (codes >= 65) * (codes <= 90),
    isLowerLetter, (codes >= 97) * (codes <= 122),
    isAlphanumeric, isDigit + isUpperLetter + isLowerLetter,
    SUM(isAlphanumeric) = LEN(text)
)

Split by delimiter and pick the first element

Much better than the old way of combining FIND and LEFT.

=CHOOSECOLS(TEXTSPLIT([@text], " - "), 1)

Checking for date overlaps

This is the context of a project. Imagine you have projects, and within those projects, you have tasks. The main assumption is that tasks should be sequential, never overlap in time ranges, and that it spans the project from beginning to end. You want to find out if there are any projects that have tasks that are overlapping.

The idea is that, if you take the durations of all of the tasks, they should be less than or equal to the duration of the total project. If the total duration of the tasks is more than the project duration, then there must be an overlap somewhere.

= LET(
    combinedDuration, SUMPRODUCT(( [Project] = [@Project]   )   *  ( [enddate] - [startdate] + 1) ),
    minToMax, MAXIFS([enddate], [Project], [@Project]) - MINIFS([startdate], [Project], [@Project]) + 1,
    IF(
        COUNTIFS([Project], [@Project]) > 1,
        combinedDuration <= minToMax,
        TRUE
    )
)

Contains text

The equivalent to Text.Contains() is this:

=ISNUMBER(
    SEARCH(
        "searchtext", 
        Table1[Column1]
    )
)

#dataanalysis #excel