Project overview

In this project, I developed a system for collecting and analyzing real estate offers. Its primary goal is to estimate the market value of various properties, including houses, lands, and apartments. It consists of four stages:

  • data scraping
  • data visualization
  • machine learning model for price estimation
  • model API (in progress)

All data is collected exclusively for educational purposes and is not utilized commercially. Personal data is not stored.

Data scraping

Data used in this project is obtained from two sources:

  • Otodom
  • Domiporta

For each of them, three property types are considered:

  • houses
  • lands
  • apartments

The web scraping process is implemented using abstract classes and inheritance, according to the following scheme (also sketched in code below the diagram):

flowchart LR
    classDef abstractClass fill:#4444ffee, color:#fffc;
    
    PropertyScraper --> OtodomScraper
    PropertyScraper --> DomiportaScraper
    OtodomScraper --> OtodomHouseScraper
    OtodomScraper --> OtodomLandScraper
    OtodomScraper --> OtodomApartmentScraper
    DomiportaScraper --> DomiportaHouseScraper
    DomiportaScraper --> DomiportaLandScraper
    DomiportaScraper --> DomiportaApartmentScraper
    
    class PropertyScraper,OtodomScraper,DomiportaScraper abstractClass;
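
The same hierarchy can be sketched in Python as follows. This is a minimal illustration; the method names (search_offers, scrape_offers) are hypothetical stand-ins for whatever the repository actually defines:

from abc import ABC, abstractmethod

class PropertyScraper(ABC):
    # Logic shared by every scraper lives here.

    @abstractmethod
    def search_offers(self) -> list[str]:
        # Collect offer URLs matching the assumed filters.
        ...

    @abstractmethod
    def scrape_offers(self) -> None:
        # Fetch and parse the offers behind previously saved URLs.
        ...

class OtodomScraper(PropertyScraper, ABC):
    # Otodom-specific logic: page structure, pagination, etc.
    BASE_URL = "https://www.otodom.pl"

class OtodomApartmentScraper(OtodomScraper):
    # Concrete class; maps to the Otodom apartments table.

    def search_offers(self) -> list[str]:
        return []  # placeholder: build search queries, parse result pages

    def scrape_offers(self) -> None:
        pass  # placeholder: parse each saved offer page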
     

Each concrete (non-abstract) class corresponds to one table in the relational database where data is stored. Moreover, the scraping process is divided into two separate parts. For each combination of data source and property type (for example, Otodom apartments), the data acquisition process consists of:

  • searching for offers based on the assumed filters and saving their URLs into a Redis database
  • scraping the offers behind those URLs and saving them to a PostgreSQL database (see the sketch below)
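
Reduced to its essentials, the Redis hand-off between the two parts might look like this (a sketch assuming the redis-py client and a hypothetical key naming scheme):

import redis

r = redis.Redis(host="localhost", port=6379)

# Part 1 (search): push every matching offer URL onto a Redis list.
def save_url(url: str) -> None:
    r.lpush("otodom:apartments:urls", url)  # hypothetical key name

# Part 2 (scrape): pop URLs one by one until the queue is empty.
def next_url() -> str | None:
    raw = r.rpop("otodom:apartments:urls")
    return raw.decode() if raw else None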

All scraping classes are orchestrated by a CLI (Command Line Interface) tool, which enables convenient execution of the appropriate scraper. For instance, you can search for URLs of Otodom apartments with:

python orchestrator.py search otodom apartments

Or scrape land offers on Domiporta with:

python orchestrator.py scrape domiporta lands
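
A minimal sketch of how orchestrator.py could dispatch these commands with argparse (the real tool may be structured differently):

import argparse

# Concrete scraper classes as sketched earlier (one per source/property pair).
from scrapers import DomiportaLandScraper, OtodomApartmentScraper  # hypothetical module

SCRAPERS = {
    ("otodom", "apartments"): OtodomApartmentScraper,
    ("domiporta", "lands"): DomiportaLandScraper,
    # ...remaining source/property combinations
}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a property scraper")
    parser.add_argument("action", choices=["search", "scrape"])
    parser.add_argument("source", choices=["otodom", "domiporta"])
    parser.add_argument("property", choices=["houses", "lands", "apartments"])
    args = parser.parse_args()

    scraper = SCRAPERS[(args.source, args.property)]()
    if args.action == "search":
        scraper.search_offers()
    else:
        scraper.scrape_offers()

if __name__ == "__main__":
    main()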

For each concrete scraping class, the ETL process works according to the flowchart below. For example purposes, I'm considering OtodomApartmentScraper:

flowchart TB
    subgraph SEARCH
      Crontab1[Crontab job]  --> |RUN| Python1[python orchestrator.py search otodom apartments]
      Python1 --> |USE| OAS1[OtodomApartmentScraper class]
      OAS1 --> |SAVE| Redis[URLs on Redis]
    end
    
    subgraph SCRAPE
      Crontab2[Crontab job] --> |RUN| Python2[python orchestrator.py scrape otodom apartments]
      Python2 --> |LOAD| Redis
      Redis --> |USE| OAS2[OtodomApartmentScraper class]
      OAS2 --> |SAVE| Postgres[PostgreSQL database]
    end
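
The save step of the SCRAPE stage, sketched with psycopg2 (the connection string, table, and columns are hypothetical; in the project, each concrete class has its own table):

import psycopg2

conn = psycopg2.connect("dbname=real_estate user=scraper")  # hypothetical DSN

def save_offer(offer: dict) -> None:
    # Using conn as a context manager commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO otodom_apartments (url, price, area_m2) VALUES (%s, %s, %s)",
            (offer["url"], offer["price"], offer["area_m2"]),
        )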

Data visualization

After the data acquisition process, we move on to the visualization part. For this purpose, I created a web dashboard which contains interactive charts.

On the dashboard you will find:

  1. Distributions of:
    • house/land/apartment area
    • price and price per square meter
    • number of offers in various regions
  2. Time-related changes in:
    • the number of properties offered
    • the average price
  3. A map featuring marked locations of properties

and even more. Below, you can find example figures for houses, lands, and apartments.
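
For illustration, a distribution chart of this kind takes only a few lines with Plotly (the dashboard's actual stack is not shown here; the toy data stands in for scraped offers):

import pandas as pd
import plotly.express as px

# Toy data standing in for the scraped offers.
df = pd.DataFrame({"price": [450_000, 520_000, 610_000, 480_000, 700_000]})
fig = px.histogram(df, x="price", nbins=20, title="House price distribution")
fig.show()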

Model training

In the initial approach, machine learning models were trained using only Otodom data, as this service provides more information about the properties. A dedicated Random Forest Regressor model was developed for each property type, employing a scikit-learn Pipeline. Prior to model training, a feature engineering process was applied to prepare the data.

The preprocessing steps, shown here for the houses data, include the following (sketched in code after the list):

  • Transformation of the advert type (agency or private) into a boolean value
  • Transformation of the market type (primary or secondary) into a boolean value
  • Label encoding of the weekday and season corresponding to when the offer was posted
  • Calculating time difference between the offer and an arbitrarily chosen timestamp (2023-01-01) to reflect offer’s position on a timeline
  • Label encoding of the house location (country/suburban/city)
  • One-hot encoding of the property's province and subregion
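
Sketched in pandas (the column names are hypothetical; the season encoding is analogous to the weekday one):

import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    posted = pd.to_datetime(out["posted_at"])
    # Boolean flags for advert type and market type.
    out["is_private"] = out["advert_type"].eq("private")
    out["is_primary_market"] = out["market"].eq("primary")
    # Label-encode the posting weekday and the location type.
    out["weekday"] = posted.dt.dayofweek
    out["location_code"] = out["location"].map({"country": 0, "suburban": 1, "city": 2})
    # Days relative to the arbitrary reference point (2023-01-01).
    out["days_since_ref"] = (posted - pd.Timestamp("2023-01-01")).dt.days
    # One-hot encode province and subregion.
    return pd.get_dummies(out, columns=["province", "subregion"])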

For both further preprocessing (feature scaling) and model training, a grid search was applied using the following parameters (see the sketch after the list):

  • StandardScaler(), MinMaxScaler() for feature scaling
  • 400, 500, 600 for n_estimators
  • 70, 80, 90 for max_depth
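
In scikit-learn terms, the search can be sketched like this (X and y are assumed to hold the engineered features and prices; the fold count is an assumption):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "model__n_estimators": [400, 500, 600],
    "model__max_depth": [70, 80, 90],
}

search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5)
# search.fit(X, y); search.best_estimator_ is then the final pipeline.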

Additionally, I experimented with feature extraction; however, this approach resulted in a decrease in model performance.

The final pipeline, configured for optimal performance in terms of mean absolute error (with the use of cross-validation), combines the selected scaler with the tuned Random Forest Regressor.

Metrics calculated for the whole houses dataset (computed as sketched below):

  • Mean absolute error [PLN]: 165599.68
  • Mean absolute percentage error [%]: 25.7
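
Both metrics come straight from scikit-learn (y and y_pred are assumed to hold the true and predicted prices):

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# y, y_pred: true and predicted prices (assumed to be defined).
mae = mean_absolute_error(y, y_pred)                    # in PLN
mape = 100 * mean_absolute_percentage_error(y, y_pred)  # in %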

The model's performance is far from perfect, as it is influenced by various factors, including the subjective nature of house valuations. It handles key aspects such as location and size effectively, but the diversity of interior details presents a challenge. Additionally, the model does not use all available property features, which further limits its accuracy. The current version is an initial step in exploring the possibilities of price modeling, with potential for future enhancements. Its primary goal is to estimate market value rather than to establish a universal formula for price evaluation.

API for the model

The API utilizes pickled models stored on Google Cloud Storage. It exposes two GET endpoints for each property type:

  • estimate_price_from_json
  • estimate_price_otodom_offer

You can find the code in the linked repository. The API is not hosted due to the high memory usage caused by loading the serialized models.
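
For illustration, the skeleton of such an endpoint could look like this with FastAPI and google-cloud-storage (the framework choice, bucket and blob names, and feature handling are assumptions; only the endpoint name comes from the list above):

import pickle

from fastapi import FastAPI
from google.cloud import storage

app = FastAPI()

def load_model(blob_name: str):
    # Download a pickled pipeline from a GCS bucket (hypothetical names).
    bucket = storage.Client().bucket("real-estate-models")
    return pickle.loads(bucket.blob(blob_name).download_as_bytes())

model = load_model("houses_pipeline.pkl")

@app.get("/houses/estimate_price_from_json")
def estimate_price_from_json(area_m2: float):
    # Map query parameters onto the feature vector the pipeline expects
    # (heavily simplified; the real pipeline needs all engineered features).
    return {"estimated_price": float(model.predict([[area_m2]])[0])}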