Dyota's blog

PowerShell: job listing web scraping

The big question: how relevant am I, and how many people might want me?

I have a particular set of skills that I'd like to use in my day-to-day work, but I don't know how well it matches up with the job market at the moment. I'm in a phase of my career where I'm taking lots of short-term contract jobs, which means I have to find a new job every 4-6 months. So the question is: can I match myself up with popular keywords and buzzwords that are in demand today?

The strategy I took was to scrape a job listings website and analyse the data. I would search for a term (e.g. "Power BI") and download all of the job listings I could get from the search results. There are a few of these search terms.

I also have a separate list of keywords that I want to count. For every job listing downloaded, I would count how many times each keyword appears. How many times does "Power BI", "agile", or "data analysis" appear? I'd get that as a summary, which I can then visualise in an Excel table.

Seek.com.au is a very popular job seeker's website, so I'll take job listings from there as a representation of the job market. Also, behind the scenes, it's fairly easy to scrape.

Acknowledgements

Before I start, a shoutout to Tony Lu, who built What's the Salary?. I use his API in this flow - we'll get to it in the Analyse section.

Technical approach

PowerShell

In contrast to my previous scraping exercise, where I used Node.js and JavaScript libraries, this time I used PowerShell.

Why? Well, I've been using PowerShell a lot recently, and I wanted to stretch my capabilities. I had a little taste of what it was like to "parse" HTML using regex, and I wanted to take it a little further this time.

But why PowerShell? PowerShell comes with all Windows machines and is a fairly powerful programming language, which means I didn't need to install any extra software. It has built-in cmdlets to download data from the internet and write data to local files, so I didn't even need any extra PowerShell libraries/modules for this. The syntax can be similar to JavaScript (which I'm familiar with), and it can take advantage of the .NET framework right out of the box. Being good at PowerShell means I can use it both at home and at work.
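For example, downloading a page and saving it to disk is a one-liner with nothing but built-ins (example.com as a stand-in here):

# download a page and write it straight to disk - no extra modules needed
(Invoke-WebRequest 'https://example.com').content > example.html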

Regex

I only recently learned how to write regular expressions and how to take advantage of the PowerShell cmdlets and methods that make use of them. I used to turn away from regex because of how arcane it looked, and I didn't have a use case for picking out patterns of text.

Turns out, picking out text in between HTML tags is a good use case for regex, and it also means that I can get on with some web scraping without being dependent on JavaScript libraries.
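As a tiny example, pulling the text out from between a pair of tags looks like this:

# grab the text between <title> tags using lookarounds
[void]('<title>Data Analyst | SEEK</title>' -match '(?<=<title>).*?(?=</title>)')
$matches[0]   # => Data Analyst | SEEK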

Specifics

Search page

The first step is to compose the URL for the search results page.

The structure of the URL looks like this (spaces included for clarity):

    $url = "https://www.seek.com.au/ $searchTerm -jobs/in-perth?page= $i &sortmode=ListedDate"
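For example, the search term "Power BI" becomes power-bi (lowercased, spaces swapped for hyphens), so page 2 of its results lives at:

    https://www.seek.com.au/power-bi-jobs/in-perth?page=2&sortmode=ListedDate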

I then take my list of search terms and hit that URL for as many pages as I want (in this case, 5).

For each target, I hit the URL, download the contents as an HTML file, and then extract the data.

# These are the types of jobs that I want to look for
$searchTerms = @(
    'Power BI'
    'Power Apps'
    'Power Automate'
    'Power Platform'
)

$searchTerms |
    ForEach-Object {
        $searchTerm = $_.replace(' ', '-').tolower()

        for ($i = 1; $i -le 5; $i++) {

            Write-Host Retrieving `"$searchTerm`" page $i ...

            $url = "https://www.seek.com.au/$searchTerm-jobs/in-perth?page=$i&sortmode=ListedDate"

            $filename = "search-$searchTerm-$i.html"

            # download HTML
            (Invoke-WebRequest $url).content > "html/searches/$filename"

            # download JSON
            getReduxData 'searches' $filename
        }
    }

Why do I save it as an HTML file first? Why don't I do the processing straight away, in-memory, on the fly?

I did this because it was much easier to debug. If I did transformations on the fly and something went wrong, I might never know about it. If I download the data into a staging folder first, I can do the transformations later and check back to see whether what I did was correct.

This also saves me from having to download all of this data over and over again - I can download it once and work off the locally saved files for subsequent steps.

You'll notice that there's a function call right at the end called getReduxData. What's that about?

I don't know much about React and Redux at the moment, so I'm not going to guess at the details, but what I do know is that the page embeds basically all the information we're looking for in a single JavaScript object! Hooray! This means that we barely have to do any HTML parsing at all!

All we have to do is isolate that object, pipe through ConvertFrom-Json, and from there, we have a PowerShell object ready for us to manipulate.
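A tiny demonstration of what ConvertFrom-Json gives us:

# ConvertFrom-Json turns a JSON string into an object with dot-notation access
$obj = '{"results": {"jobIds": ["123", "456"]}}' | ConvertFrom-Json
$obj.results.jobIds   # => 123 456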

SEEK_REDUX_DATA

From inspecting the HTML, all I know is that all of the data we're looking for is on a single line. So after storing the HTML into the string array variable $html...

# this is an array
[string[]]$html = Get-Content "./html/$foldername/$filename"

... I'm going to look for a single line that has the text "window.SEEK_REDUX_DATA = {" in it.

# this is an array, not a string
[string[]]$reduxData = ($html -match "window.SEEK_REDUX_DATA = {")

Aside: why did I have to add comments saying that these are arrays, and typecast them as arrays? Because I kept forgetting that they were actually arrays and not strings, and was repeatedly getting confused by the results I was getting.
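The confusion comes from the fact that -match behaves differently on the two types: on a string it returns a boolean (and fills $matches), but on an array it acts as a filter and returns the matching elements.

'abc' -match 'b'                   # string: returns True (and fills $matches)
@('abc', 'xyz', 'bcd') -match 'b'  # array: returns the matching elements, abc and bcd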

The next thing to do is capture only the part that sits inside the curly braces. Note the regex here:

[void]($reduxData[0] -match "(?<={)(?<json>.*)(?=})")

The result of this regex lands in the automatic variable $matches. In this case, I made it so that our match is accessible as $matches['json']. That's what the (?<json>...) does - it allows you to explicitly name the capture group you're looking for. This is functionality that the .NET regex engine (which PowerShell uses) gives you!
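A quick demo of a named capture group in action:

# match everything between the outermost braces, under the name "json"
[void]('window.SEEK_REDUX_DATA = {"a": 1};' -match '(?<={)(?<json>.*)(?=})')
$matches['json']   # => "a": 1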

I then save all of these json objects in a folder:

$reduxDataJson = ("{$($matches['json'])}" | ConvertFrom-Json)

$filename = popOffExtension $filename.split('/')[-1]

($reduxDataJson | ConvertTo-Json -Depth 6) > "./json/$foldername/$filename.json"

The full function for getReduxData is here:

function getReduxData($foldername, $filename) {
    # this is an array of lines
    [string[]]$html = Get-Content "./html/$foldername/$filename"

    # this is an array, not a string - keep only the line(s) containing the redux data
    [string[]]$reduxData = ($html -match "window.SEEK_REDUX_DATA = {")

    # capture everything between the outermost curly braces into $matches['json']
    [void]($reduxData[0] -match "(?<={)(?<json>.*)(?=})")

    # re-wrap in braces and parse into a PowerShell object
    $reduxDataJson = ("{$($matches['json'])}" | ConvertFrom-Json)

    $filename = popOffExtension $filename.split('/')[-1]

    ($reduxDataJson | ConvertTo-Json -Depth 6) > "./json/$foldername/$filename.json"

    return $reduxDataJson
}
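One helper I haven't shown is popOffExtension, which just strips the extension off a filename. Something like this does the job (using .NET's Path class, which PowerShell can call directly):

# strip the trailing extension (e.g. ".html") from a filename
# (one possible implementation - .NET does the work for us)
function popOffExtension($filename) {
    return [System.IO.Path]::GetFileNameWithoutExtension($filename)
}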

Downloading each job listing

At this juncture, we only have the search results pages - we haven't dug into each individual job listing just yet.

In the search page json, there is something called jobIds. We need to collect all of these jobIds, so that we can navigate to each individual job listing page and download the HTML.

Here is how we do that:

function collectAllJobIds() {
    [System.Collections.ArrayList]$allJobIds = @()
    
    # get all of the job Ids
    Get-ChildItem "./json/searches" |
        ForEach-Object {
            # this is an array
            $jobIds = (Get-Content $_ | ConvertFrom-Json).results.jobIds
            $jobIds |
                ForEach-Object {
                    [void]$allJobIds.add($_)
                }
        }

    # for all unique job IDs, download the HTML
    $allJobIds |
        Select-Object -Unique |
        ForEach-Object {
            downloadJobAdHtml "https://www.seek.com.au/job/$_"
        }
}

That function at the end, downloadJobAdHtml, is the one that actually hits the internet and downloads the page.

function downloadJobAdHtml($target) {
    Write-Host Hitting target $target ...
    $html = (Invoke-WebRequest $target).content
    # pull the page title out from between the <title> tags
    [void]($html -match '(?<=<title>).*?(?=</title>)')
    # strip characters that can't go in a filename
    $title = ($matches[0].replace("/", ' ')).replace('|', ' ')
    Write-Host Downloading $title ...
    $html > "html/jobs/$title.html"
}

SEEK_REDUX_DATA for each job listing

Now, each job listing page also has its own SEEK_REDUX_DATA. In this part, we go through all of the HTML pages downloaded for each individual job, and scrape out the json data within.

function pullOutJson() {
    Get-ChildItem "./html/jobs" |
        ForEach-Object {
            getReduxData 'jobs' $_.Name
        }
}

Analyse

The final step is to actually analyse all of the job listings that we've just downloaded.

The keywords that I want to analyse for are set out in this array of hashtables (truncated for brevity):

$keywords = @(
    @{ name = 'Power Platform'; pattern = 'Power Platform' }
    @{ name = 'Power Apps'; pattern = 'Power(\s?)App(s?)' }
    @{ name = 'Power BI'; pattern = 'Power(\s?)BI' }
    @{ name = 'Power Automate'; pattern = 'Power(\s?)Automate|Flow' }
    @{ name = 'Dynamics 365'; pattern = 'D(ynamics?)(\s?)365' }
    @{ name = 'Power Query'; pattern = 'Power(\s?)Query|M code' }
    @{ name = 'Certification'; pattern = 'Certifi(cation|ed)' }
    @{ name = 'PowerShell'; pattern = 'Power(\s?)Shell' }
    @{ name = 'SPFx'; pattern = 'SPFx' }
    @{ name = 'PnP'; pattern = 'PnP ' }
    @{ name = 'DevOps'; pattern = 'Dev(\s?)Ops' }
    @{ name = 'Azure'; pattern = 'Azure' }
)
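A quick way to sanity-check a pattern is to count its matches against a sample string - the same mechanism the analysis script uses below:

# count how many times a pattern appears in some sample text
$sample = 'We build PowerBI and Power BI dashboards.'
($sample | Select-String -Pattern 'Power(\s?)BI' -AllMatches).Matches.Count   # => 2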

For the actual script, I first set up some arrays and variables.

[System.Collections.ArrayList]$allJobs = @();
[System.Collections.ArrayList]$allDescriptions = @();

# This is so that we can display progress on-screen
[int]$total = (Get-ChildItem "./json/jobs").count
[int]$progress = 0

For all of the jobs that I've got...

Get-ChildItem "./json/jobs" |
    ForEach-Object {
        ...
    }

We're looking to make up an object that reads like this:

{
    "title": "xxx",
    "url": "www.seek.com.au/job/xxx", 
    "listingDate": "xx/xx/xx",
    "advertiser": "xxx", 
    "salary": "xxx",
    "salaryType": "xxx", 
    "workType": "xxx"
}

This is how I do that:

$thisJob['title'] = $jobAd.title
$thisJob['url'] = "https://www.seek.com.au/job/$($jobAd.id)"
$thisJob['listingDate'] = $jobAd.listingDate
$thisJob['advertiser'] = $jobAd.advertiser.description
$thisJob['salary'] = if ($null -eq $jobAd.salary) {
    Write-Host "Fetching salary for $($jobAd.title)... "
    $res = (Invoke-WebRequest "https://qdjrmx4vb1.execute-api.ap-southeast-2.amazonaws.com/chickennuggets?jobId=$($jobAd.id)")
    "?$(($res.content | ConvertFrom-Json).jobinfoarray.upperlimit[0])"
}
else {
    $jobAd.salary
}

$thisJob['salaryType'] = $jobAd.salaryType
$thisJob['workType'] = $jobAd.workType

Note that with salary, if there is no salary listed on Seek.com.au, I hit the What's the Salary API to get the salary range (and prefix the result with a "?" to mark it as an estimate).

The next step is to analyse the keywords:

$jobAdDetails = $jobAd.jobAdDetails

$keywords |
    ForEach-Object {
        $count = ($jobAdDetails | Select-String -Pattern $_.pattern -AllMatches).matches.length
        # add a property to the hashtable for every keyword
        $thisJob[$_.name] = $count
    }

[void]$allJobs.Add($thisJob)
[void]$allDescriptions.Add(($jobAdDetails -replace '<[^>]+>', ''))

# Update progress to display on-screen
$progress++
Write-Host $(($progress / $total).ToString("P"))
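As an aside, if you'd rather have a real progress bar than printed percentages, PowerShell's built-in Write-Progress cmdlet can do it:

# a real progress bar, instead of printing percentages
Write-Progress -Activity 'Analysing jobs' -PercentComplete (100 * $progress / $total)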

Finally, save the results!

($allJobs | ConvertTo-Json) > analysis.json
($allDescriptions -join "`n") > descriptions.txt

analysis.json can finally be read into Excel, and I can get a snapshot of which keywords are popular!
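Excel can import JSON through Power Query, but if you'd rather have something Excel opens directly, the same data exports to CSV just as easily:

# the same summary as a CSV, which Excel opens directly
$allJobs | ForEach-Object { [pscustomobject]$_ } | Export-Csv analysis.csv -NoTypeInformation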

Conclusion

This is a long post, so I don't want to extend it any more than I need to. I won't provide an actual analysis of which keywords turned out to be popular, especially since that would only give you a snapshot of one particular day.

However, this exercise really demonstrated to me that PowerShell can do a lot of the really cool stuff that some people might reach for Python or JavaScript to do.

#html #powershell #web