Daily updated dashboard that aggregates and shows current and past property listing metrics.

Demo

[!CAUTION] [31 Jul 2025] Data for this dashboard is no longer updated due to maintenance costs.

Try it out here!

See the statistics and trends of property listings:

demo1

And changes per region per property type:

demo2 demo3

And historical trends as well:

demo4

Purpose

Real estate is a hot topic in Singapore. There have been countless research papers, data analyses, and free platforms that analyse past transaction data from public datasets such as data.gov.sg and URA. However, these analyses all miss a major piece of data: listings from marketplaces where sellers put up their asking prices, which are the driver of transaction prices.

Listing data is not easy to obtain because there are no public datasets for it. Data has to be frequently retrieved from marketplaces guarded by heavy anti-bot mechanisms and anti-scrape UI/UX website design. Hence, I created a pipeline which retrieves this data daily and makes it available for visualisation and download.

I hope that this will make real estate listing analysis more accessible to all, rather than this ability being reserved for analysts at the major marketplaces only.

You can download today’s listings here

Happy exploring!


Writeup

Design

The entire app is made of 3 parts:

  1. Webscraper - a physical PC running scrape jobs daily
  2. Data processing pipeline - some functions in the cloud that load cleaned data into a database
  3. Dashboard frontend - a Streamlit-hosted app which retrieves data and visualises it

Components (1) and (2) are managed in my public project repo.

Webscraper

By far the toughest part of the project, because of the amount of uncertainty and duct-taping that had to be done to make this work.

Each scrape produces 2000+ page HTMLs (500MB zipped) which are saved to S3, partitioned by date.

The local PC only keeps the past 7 days of raw scrapes, in case the upload to S3 fails and needs to be re-run.
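
That retention window can be enforced with a small cleanup step. Here is a sketch of the date logic only; the ISO-date folder names and the 7-day window are assumptions based on the setup above:

```python
from datetime import date, timedelta

def stale_partitions(partition_names, today, keep_days=7):
    """Return the date-named scrape folders that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return [name for name in partition_names
            if date.fromisoformat(name) < cutoff]
```

Anything this returns is safe to delete locally, since it has already had a week's worth of chances to reach S3.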

I won’t go into the details of the scraper here, as I want to keep it from getting blocked for as long as possible, but the gist of it is:

  • At a given time each day, run a .bat script that will launch a legitimate Chrome browser to the listings site.
  • Use PyAutoGUI to control the screen and bypass the CAPTCHA (“click to verify you are not a robot”)
  • Once CAPTCHA is bypassed, the browser headers now hold a cookie that will validate any request for a fixed amount of time.
  • Download these headers and pass them along with any HTTP request to the site to show that you are not a robot.
  • Run a Python script that goes through all 2000+ pages on the site using requests.Session + the headers we collected.
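
The download loop can be sketched roughly as below. The URL scheme, the header-file format, and the page count are placeholders, not the real site's details:

```python
import json
import pathlib
import requests

def build_session(header_path):
    """Load the captured browser headers (incl. the validation cookie) into a session."""
    session = requests.Session()
    with open(header_path) as f:
        session.headers.update(json.load(f))
    return session

def page_urls(base_url, n_pages):
    """Assumes listing pages are reachable as plain ?page=N URLs."""
    return [f"{base_url}?page={i}" for i in range(1, n_pages + 1)]

def scrape_all(session, base_url, n_pages, out_dir):
    """Fetch every page and save the raw HTML, one file per page."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(page_urls(base_url, n_pages), start=1):
        resp = session.get(url, timeout=30)
        resp.raise_for_status()  # a block or cookie expiry surfaces immediately
        (out / f"page_{i:04d}.html").write_text(resp.text, encoding="utf-8")
```

Saving the raw HTML rather than parsed data is deliberate, as explained in the next section.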

Decoupling scrape from cleaning

While it would be simpler to clean and compile the HTML without saving the actual HTML file of each page, I chose to decouple the two processes and incur extra raw-storage and infrastructure costs, because of the fickleness of the scraping process.

Completing the scrape as fast as possible without errors was the most critical part of the pipeline. If a job errored out due to some cleaning-logic failure, the entire scrape would be lost. Unlike regular API scraping with error handling and retries, the anti-bot system forces you to start from scratch and get new validation cookies. At best, the job is re-run a few hours later; at worst, the entire day’s data is missing.

There is also a time constraint to the scrape, which is the expiry time of the bot-validation cookie. Having to balance between sending requests fast enough to complete the scrape within this duration, managing local disk write speeds, and not getting rate-limited was the hardest part of this project.
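That balancing act boils down to a simple budget: the cookie's time-to-live, spread across all the pages. A toy calculation (the TTL and overhead numbers are illustrative, not the site's actual values):

```python
def per_page_budget(cookie_ttl_s, n_pages, overhead_s=0.0):
    """Average seconds available per page before the validation cookie expires."""
    usable = cookie_ttl_s - overhead_s
    if usable <= 0:
        raise ValueError("overhead exceeds the cookie lifetime")
    return usable / n_pages

# e.g. a hypothetical 1-hour cookie across 2000 pages leaves ~1.8 s per page,
# which must cover the HTTP round trip, the local disk write, and any
# polite delay added to avoid rate-limiting.
```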

Data Processing

This part is straightforward, as the daily data volume is tiny compared to the workloads I have dealt with before.

I have a single Lambda function that can run both the transform and analysis steps for a given date t, based on the arguments supplied to it.

  1. Partitioning/ staging

    The function is designed to run on a per date partition basis. This means that a date ‘2024-09-01’ can be specified and the function will only process data from that S3 prefix. This allows me to rerun processing for specific partitions, and also loop through them in parallel if a complete refresh is needed.
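A minimal sketch of that entry point (the event field names and step names here are assumptions, not the actual schema):

```python
def handler(event, context=None):
    """Run one step of the pipeline for a single date partition."""
    run_date = event["date"]               # e.g. "2024-09-01" -> only that S3 prefix
    step = event.get("step", "transform")  # "transform" or "analysis"
    if step == "transform":
        return {"step": "transform", "partition": run_date}  # placeholder for the real work
    if step == "analysis":
        return {"step": "analysis", "partition": run_date}
    raise ValueError(f"unknown step: {step!r}")
```

A complete refresh then becomes a loop (or parallel fan-out) over the list of date partitions, invoking the function once per date.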

  2. Cleaning

    The first step of cleaning is to take the HTMLs, extract the relevant fields, and compile them into a single dataframe, which is held in the function’s memory. This part was quite compute-intensive because the script has to read 2000+ HTML files from disk and run regexes over them.
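The extraction step looks roughly like this; the field names and regex patterns are invented for illustration, since the real listing markup isn't shown here:

```python
import re

# Illustrative patterns only -- the real listing markup differs.
PRICE_RE = re.compile(r'data-price="(\d+)"')
AREA_RE = re.compile(r'data-area-sqft="(\d+)"')

def extract_listing(html):
    """Pull the relevant fields out of one listing page's HTML."""
    price = PRICE_RE.search(html)
    area = AREA_RE.search(html)
    return {
        "price": int(price.group(1)) if price else None,
        "area_sqft": int(area.group(1)) if area else None,
    }

def compile_records(htmls):
    """One record per page; in the real pipeline these become a single dataframe."""
    return [extract_listing(h) for h in htmls]
```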

  3. Transformations

    I then run some simple transformation logic such as string formatting, renaming fields, relabelling values, price calculations, etc. This part was designed to be modular, so that any number of dataframe transformations can be defined and chained in sequence to get the desired output.
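That modular chaining can be as simple as a list of dataframe-to-dataframe functions folded in sequence. The specific steps below are examples, not the real transform set:

```python
from functools import reduce

import pandas as pd

def cast_price(df):
    """String formatting example: parse the scraped price into a number."""
    df = df.copy()
    df["price"] = df["price"].astype(float)
    return df

def add_psf(df):
    """Price calculation example: derive price per square foot."""
    df = df.copy()
    df["psf"] = df["price"] / df["area_sqft"]
    return df

TRANSFORMS = [cast_price, add_psf]  # any number of steps, applied in order

def run_transforms(df, steps=TRANSFORMS):
    return reduce(lambda acc, step: step(acc), steps, df)
```

Adding a new transformation is then just appending another function to the list.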

  4. Storage

    The cleaned dataframe is uploaded to S3 for storage. I also push a copy to Google Cloud Storage and use that to COPY over to Google BigQuery, since I will need a data warehouse to run SQL for the next step of analytical transforms. (PS: BigQuery and Google Compute both have free tiers!)

  5. Analytics transforms

    The function can then be triggered to do analysis which runs a set of templated SQL queries with the dynamic parameters filled in. So far, the only parameters are the run dates. The SQL returns a dataframe which is then re-inserted into the data warehouse, under an analytics-specific table.

    The current SQL is a simple GROUP BY of different property types and then an aggregation of different metrics such as price, PSF, and volume.

    In the future, if different analyses are required, the function can be extended by adding more SQL queries to the analysis configuration.
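A rough idea of what one templated query in that configuration might look like (the table and column names are assumptions):

```python
# Table/column names are assumptions; real code should prefer BigQuery
# query parameters over string formatting to avoid injection issues.
DAILY_AGG_SQL = """
SELECT
  DATE('{run_date}')       AS run_date,
  property_type,
  COUNT(*)                 AS volume,
  AVG(price)               AS avg_price,
  AVG(price / area_sqft)   AS avg_psf
FROM listings
WHERE scrape_date = DATE('{run_date}')
GROUP BY property_type
"""

def render_query(template, **params):
    """Fill in the dynamic parameters (so far, just the run dates)."""
    return template.format(**params)
```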

Dashboard

Finally, I read from the data warehouse to deliver the data into Streamlit dataframes. The dashboard app uses both dataframes (tables) and simple Streamlit components such as charts and metrics blocks.

  1. Hosting

    I am hosting the app on Streamlit Cloud because it’s really simple and I didn’t want to bother with EC2 networking/ ECS costs for hosting a mostly idle web-app.

  2. Data Caching

    Rather than have the app query the data warehouse every time the user refreshes the page, the app first checks whether the metadata of the files it has in the cache matches the latest date in the warehouse. It only re-downloads the files if they do not match. Hence, the warehouse data is only pulled at most once a day to reduce load.
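The freshness check reduces to comparing the cached partition date against the warehouse's latest. A sketch of that logic (in Streamlit the download itself would typically sit behind st.cache_data):

```python
def needs_refresh(cached_latest_date, warehouse_latest_date):
    """Re-download only when the warehouse holds a newer partition than the cache.

    Dates are ISO "YYYY-MM-DD" strings, so plain string comparison
    orders them correctly.
    """
    return cached_latest_date is None or cached_latest_date < warehouse_latest_date
```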

Improvements

There are lots of things you can do with the listings data, although I am not sure how much of it is valuable to display on a real-time/ daily basis.

One of the more interesting use cases is the ability to see how listing trends change over time, for specific regions and property types. As more data is collected, I will be able to display the long-term trends. Maybe some of you will find this useful for your real estate research.

Let me know if there is any specific feature you would like to see on this dashboard!