Dyota's blog

House price analysis in Node.js and Excel

What

A friend of mine in the market for houses, and had asked me to do an analysis of sold house prices in a select list of suburbs around Perth, Western Australia.

The question was: what is the average price per square meter, per year, per suburb? I already had some Node.js (JavaScript) scripts set up for scraping real estate websites, and so with some tweaks, I was able to run this.

How

Outline

  1. I scrape a real estate website using a Node script
  2. I download the data into JSON
  3. I transform the data, again using a Node script
  4. Analyse the data in Power Query and Pivot Tables in Excel <3

Target website

I chose domain.com.au for this over (realestate.com.au)[realestate.com.au]. At the time (late 2020) when I wrote these scripts, Domain's website was so much friendlier for scraping - all the tags and classes were easier to identify and the HTML node tree was easier to make sense of.

Download the data into JSON

I used two libraries to do this:

I used the Node standard filesystem API to write and save the results into JSON files.

URL targets

The URL to hit is the results page of Domain.com.au's sold listings. the URL structure is as follows (spaces inserted for clarity):

    let url = `https://www.domain.com.au/sold-listings/?suburb= ${suburb} &excludepricewithheld=0&page= ${i} &ssubs=0`

I would loop through all of the target suburbs, and for each suburb, loop through pages 1 to 30, to get all of the results I wanted.

The list of suburbs has to follow Domain's convention for them:

const suburbs = [
    'victoria-park-wa-6100',
	'east-victoria-park-wa-6101',
	'st-james-wa-6102',
	...
]

Postcodes

I didn't have a clever way of looking up postcodes. Because there were only a few suburbs to scrape for, I just used Australia Post's (Postcode Finder)[https://auspost.com.au/postcode] and looked up the postcodes one by one.

Australia Post does provide postcode (data for business use)[https://auspost.com.au/business/marketing-and-communications/access-data-and-insights/address-data/postcode-data], but they are expensive. There was a PDF that you can use, but trying to parse out the PDF into structured data was way more trouble than it was worth for this short project.

CSS Selectors

The following were the CSS selectors that identifies each piece of data on the page.

CSS Selector Data
p[data-testid=listing-card-price] Sold price of the property
span[data-testid=address-line1] First line of the address (property number and street name)
span[data-testid=address-line2] Second line of the address: suburb ("locality"), state, and postcode
a.address URL address to the listing
span.css-693528 Type of property (house, apartment, vacant lot, etc.)
[data-testid=property-features] Number of beds, bathrooms, parking spaces, and, if available, area in square meters. This is a div, and needs to be traversed further to pull out each individual piece of data.
span.css-1nj9ymt Sell method and date (e.g. sold by private treaty 31 May 2021). The sell method and the date needs to be parsed out; I did this using regex (/\d{2}\s\w{3}\s\d{4}/ will parse out the date for you).

Data cleanup

All of the data obtained by the scrape will be strings. They will need to be converted to the appropriate data format.

The price needs the "$" sign removed, comma separators remove (if necessary) and the rest of the numbers converted to type number.

The data needs to be converted to a ISO date for interoperability.

Each feature of the property will be a string, e.g. "1 Bed". These will need to be separated out into two components: the quantity, and the type (e.g. into something like { quantity: 1, type: 'beds' }). Additionally, the quantity for Area will be something like "1,203m2", which will also need the number extracted from it.

Analysis

Once I got the data in JSON all cleaned up, I imported it into an Excel sheet using Power Query to clean it up and make a column for price per square meter, and did the actual analysis with Pivot Tables.

Pivot Tables can group the data by suburb by year, and then do averages for the price per area. I thought of doing this in PowerShell or in JavaScript itself using d3 or something but, why go through all the pain when Excel can save the day in a few button clicks?

Results

House prices per square meter

[
  { "Year": "2011", "EAST PERTH": "", "EAST VICTORIA PARK": "", "GLENDALOUGH": "", "KENSINGTON": "", "LATHLAIN": "", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": "", "SOUTH PERTH": "", "ST JAMES": "", "SUBIACO": "", "VICTORIA PARK": " $1,790.34 ", "WEMBLEY": "", "WEST PERTH": null },
  { "Year": "2012", "EAST PERTH": "", "EAST VICTORIA PARK": " $1,048.06 ", "GLENDALOUGH": "", "KENSINGTON": "", "LATHLAIN": "", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": "", "SOUTH PERTH": "", "ST JAMES": " $1,168.40 ", "SUBIACO": "", "VICTORIA PARK": "", "WEMBLEY": "", "WEST PERTH": null },
  { "Year": "2013", "EAST PERTH": "", "EAST VICTORIA PARK": " $2,614.02 ", "GLENDALOUGH": " $2,964.29 ", "KENSINGTON": " $1,455.91 ", "LATHLAIN": " $1,363.64 ", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": " $1,107.25 ", "SOUTH PERTH": "", "ST JAMES": " $1,114.47 ", "SUBIACO": " $3,162.15 ", "VICTORIA PARK": " $2,864.31 ", "WEMBLEY": " $3,427.02 ", "WEST PERTH": " $6,764.71 " },
  { "Year": "2014", "EAST PERTH": "", "EAST VICTORIA PARK": " $2,334.17 ", "GLENDALOUGH": "", "KENSINGTON": " $1,488.50 ", "LATHLAIN": "", "LEEDERVILLE": " $2,901.43 ", "NORTH PERTH": "", "RIVERVALE": " $1,921.41 ", "SOUTH PERTH": " $2,220.56 ", "ST JAMES": " $1,348.88 ", "SUBIACO": " $3,044.76 ", "VICTORIA PARK": " $1,506.50 ", "WEMBLEY": " $3,045.13 ", "WEST PERTH": " $3,220.34 " },
  { "Year": "2015", "EAST PERTH": "", "EAST VICTORIA PARK": " $1,564.91 ", "GLENDALOUGH": " $3,362.45 ", "KENSINGTON": " $1,888.19 ", "LATHLAIN": " $1,711.80 ", "LEEDERVILLE": " $3,566.30 ", "NORTH PERTH": " $2,124.74 ", "RIVERVALE": " $1,659.24 ", "SOUTH PERTH": " $2,230.48 ", "ST JAMES": "$991.05 ", "SUBIACO": " $3,841.10 ", "VICTORIA PARK": " $2,574.76 ", "WEMBLEY": " $2,251.85 ", "WEST PERTH": " $3,295.00 " },
  { "Year": "2016", "EAST PERTH": "", "EAST VICTORIA PARK": " $1,561.32 ", "GLENDALOUGH": " $3,734.29 ", "KENSINGTON": " $1,762.98 ", "LATHLAIN": " $1,493.02 ", "LEEDERVILLE": " $3,815.71 ", "NORTH PERTH": " $2,112.28 ", "RIVERVALE": " $1,616.57 ", "SOUTH PERTH": " $3,202.61 ", "ST JAMES": " $1,370.11 ", "SUBIACO": " $3,722.80 ", "VICTORIA PARK": " $2,243.49 ", "WEMBLEY": " $1,669.21 ", "WEST PERTH": " $4,264.29 " },
  { "Year": "2017", "EAST PERTH": "", "EAST VICTORIA PARK": " $1,543.88 ", "GLENDALOUGH": " $2,505.16 ", "KENSINGTON": " $1,265.86 ", "LATHLAIN": " $1,619.38 ", "LEEDERVILLE": " $2,449.40 ", "NORTH PERTH": " $2,187.24 ", "RIVERVALE": " $1,134.16 ", "SOUTH PERTH": " $2,981.80 ", "ST JAMES": " $1,244.86 ", "SUBIACO": " $3,345.53 ", "VICTORIA PARK": " $2,304.39 ", "WEMBLEY": " $1,955.12 ", "WEST PERTH": " $4,061.17 " },
  { "Year": "2018", "EAST PERTH": "", "EAST VICTORIA PARK": " $1,572.40 ", "GLENDALOUGH": " $2,773.72 ", "KENSINGTON": " $1,810.68 ", "LATHLAIN": " $1,349.13 ", "LEEDERVILLE": " $3,359.73 ", "NORTH PERTH": " $2,225.28 ", "RIVERVALE": " $1,535.59 ", "SOUTH PERTH": " $2,933.27 ", "ST JAMES": " $1,046.09 ", "SUBIACO": " $4,103.18 ", "VICTORIA PARK": " $1,909.06 ", "WEMBLEY": " $2,145.56 ", "WEST PERTH": " $4,174.23 " },
  { "Year": "2019", "EAST PERTH": " $3,428.38 ", "EAST VICTORIA PARK": " $1,572.47 ", "GLENDALOUGH": " $1,898.91 ", "KENSINGTON": " $1,903.99 ", "LATHLAIN": " $1,258.85 ", "LEEDERVILLE": " $3,276.35 ", "NORTH PERTH": " $2,105.98 ", "RIVERVALE": " $1,369.21 ", "SOUTH PERTH": " $2,669.41 ", "ST JAMES": " $1,023.81 ", "SUBIACO": " $3,719.04 ", "VICTORIA PARK": " $2,512.98 ", "WEMBLEY": " $5,805.87 ", "WEST PERTH": " $4,176.97 " },
  { "Year": "2020", "EAST PERTH": " $4,205.14 ", "EAST VICTORIA PARK": " $1,487.31 ", "GLENDALOUGH": " $1,023.53 ", "KENSINGTON": " $1,617.71 ", "LATHLAIN": " $1,266.13 ", "LEEDERVILLE": " $2,976.00 ", "NORTH PERTH": " $2,260.60 ", "RIVERVALE": " $1,554.27 ", "SOUTH PERTH": " $3,360.86 ", "ST JAMES": " $1,220.54 ", "SUBIACO": " $3,980.94 ", "VICTORIA PARK": " $2,181.37 ", "WEMBLEY": " $2,321.73 ", "WEST PERTH": " $3,526.38 " },
  { "Year": "2021", "EAST PERTH": " $4,234.82 ", "EAST VICTORIA PARK": " $1,711.41 ", "GLENDALOUGH": " $3,185.93 ", "KENSINGTON": " $2,089.63 ", "LATHLAIN": " $1,876.13 ", "LEEDERVILLE": " $3,601.04 ", "NORTH PERTH": " $2,893.21 ", "RIVERVALE": " $2,191.05 ", "SOUTH PERTH": " $3,597.56 ", "ST JAMES": " $1,462.97 ", "SUBIACO": " $4,781.45 ", "VICTORIA PARK": " $2,738.60 ", "WEMBLEY": " $2,748.25 ", "WEST PERTH": " $3,888.88 " },
  { "Year": "2022", "EAST PERTH": " $4,676.63 ", "EAST VICTORIA PARK": " $1,928.95 ", "GLENDALOUGH": " $1,725.89 ", "KENSINGTON": " $1,972.98 ", "LATHLAIN": " $1,643.46 ", "LEEDERVILLE": " $3,277.61 ", "NORTH PERTH": " $2,992.78 ", "RIVERVALE": " $2,393.63 ", "SOUTH PERTH": " $3,388.19 ", "ST JAMES": " $1,362.88 ", "SUBIACO": " $4,611.89 ", "VICTORIA PARK": " $3,104.88 ", "WEMBLEY": " $10,466.53 ", "WEST PERTH": " $3,998.30 " }
]

Number of data points used

[
  { "Year": "2011", "EAST PERTH": "", "EAST VICTORIA PARK": "", "GLENDALOUGH": "", "KENSINGTON": "", "LATHLAIN": "", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": "", "SOUTH PERTH": "", "ST JAMES": "", "SUBIACO": "", "VICTORIA PARK": "2", "WEMBLEY": "", "WEST PERTH": null },
  { "Year": "2012", "EAST PERTH": "", "EAST VICTORIA PARK": "1", "GLENDALOUGH": "", "KENSINGTON": "", "LATHLAIN": "", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": "", "SOUTH PERTH": "", "ST JAMES": "2", "SUBIACO": "", "VICTORIA PARK": "", "WEMBLEY": "", "WEST PERTH": null },
  { "Year": "2013", "EAST PERTH": "", "EAST VICTORIA PARK": "5", "GLENDALOUGH": "1", "KENSINGTON": "10", "LATHLAIN": "1", "LEEDERVILLE": "", "NORTH PERTH": "", "RIVERVALE": "2", "SOUTH PERTH": "", "ST JAMES": "3", "SUBIACO": "5", "VICTORIA PARK": "3", "WEMBLEY": "4", "WEST PERTH": "1" },
  { "Year": "2014", "EAST PERTH": "", "EAST VICTORIA PARK": "8", "GLENDALOUGH": "", "KENSINGTON": "9", "LATHLAIN": "", "LEEDERVILLE": "2", "NORTH PERTH": "", "RIVERVALE": "8", "SOUTH PERTH": "7", "ST JAMES": "3", "SUBIACO": "7", "VICTORIA PARK": "3", "WEMBLEY": "9", "WEST PERTH": "3" },
  { "Year": "2015", "EAST PERTH": "", "EAST VICTORIA PARK": "12", "GLENDALOUGH": "4", "KENSINGTON": "5", "LATHLAIN": "4", "LEEDERVILLE": "10", "NORTH PERTH": "5", "RIVERVALE": "11", "SOUTH PERTH": "6", "ST JAMES": "12", "SUBIACO": "24", "VICTORIA PARK": "8", "WEMBLEY": "17", "WEST PERTH": "10" },
  { "Year": "2016", "EAST PERTH": "", "EAST VICTORIA PARK": "23", "GLENDALOUGH": "4", "KENSINGTON": "17", "LATHLAIN": "4", "LEEDERVILLE": "2", "NORTH PERTH": "25", "RIVERVALE": "20", "SOUTH PERTH": "12", "ST JAMES": "14", "SUBIACO": "38", "VICTORIA PARK": "12", "WEMBLEY": "11", "WEST PERTH": "7" },
  { "Year": "2017", "EAST PERTH": "", "EAST VICTORIA PARK": "21", "GLENDALOUGH": "2", "KENSINGTON": "8", "LATHLAIN": "4", "LEEDERVILLE": "10", "NORTH PERTH": "47", "RIVERVALE": "17", "SOUTH PERTH": "26", "ST JAMES": "18", "SUBIACO": "32", "VICTORIA PARK": "13", "WEMBLEY": "34", "WEST PERTH": "7" },
  { "Year": "2018", "EAST PERTH": "", "EAST VICTORIA PARK": "21", "GLENDALOUGH": "4", "KENSINGTON": "6", "LATHLAIN": "4", "LEEDERVILLE": "18", "NORTH PERTH": "42", "RIVERVALE": "22", "SOUTH PERTH": "23", "ST JAMES": "18", "SUBIACO": "42", "VICTORIA PARK": "14", "WEMBLEY": "33", "WEST PERTH": "11" },
  { "Year": "2019", "EAST PERTH": "4", "EAST VICTORIA PARK": "28", "GLENDALOUGH": "1", "KENSINGTON": "5", "LATHLAIN": "6", "LEEDERVILLE": "10", "NORTH PERTH": "50", "RIVERVALE": "29", "SOUTH PERTH": "30", "ST JAMES": "23", "SUBIACO": "44", "VICTORIA PARK": "19", "WEMBLEY": "38", "WEST PERTH": "24" },
  { "Year": "2020", "EAST PERTH": "52", "EAST VICTORIA PARK": "63", "GLENDALOUGH": "1", "KENSINGTON": "19", "LATHLAIN": "14", "LEEDERVILLE": "17", "NORTH PERTH": "79", "RIVERVALE": "61", "SOUTH PERTH": "45", "ST JAMES": "47", "SUBIACO": "28", "VICTORIA PARK": "25", "WEMBLEY": "57", "WEST PERTH": "24" },
  { "Year": "2021", "EAST PERTH": "101", "EAST VICTORIA PARK": "103", "GLENDALOUGH": "10", "KENSINGTON": "27", "LATHLAIN": "20", "LEEDERVILLE": "28", "NORTH PERTH": "106", "RIVERVALE": "113", "SOUTH PERTH": "75", "ST JAMES": "55", "SUBIACO": "65", "VICTORIA PARK": "51", "WEMBLEY": "68", "WEST PERTH": "56" },
  { "Year": "2022", "EAST PERTH": "50", "EAST VICTORIA PARK": "34", "GLENDALOUGH": "1", "KENSINGTON": "11", "LATHLAIN": "10", "LEEDERVILLE": "11", "NORTH PERTH": "38", "RIVERVALE": "52", "SOUTH PERTH": "21", "ST JAMES": "23", "SUBIACO": "30", "VICTORIA PARK": "17", "WEMBLEY": "28", "WEST PERTH": "26" }
]

#excel #javascript #powerquery