Overture Maps ETL Pipeline

⚡ Key Features

📦 Apache Iceberg

ACID transactions, time travel, schema evolution on S3 with AWS Glue catalog

🗺️ Geosquare Grid

Advanced spatial indexing (level 12 = ~60m² grid cells) used in place of geohash

📂 Dual Aggregation

Direct grid aggregation + catchment area (buffer/isochrone) methods

🏷️ Category Mapping

2000+ raw Overture categories → standardized 3-level hierarchy

🔄 Monthly Updates

Incremental ETL from each Overture release (e.g., 2026-04-15.0)

📊 Dashboard

Streamlit + PyDeck for interactive visualization

🔄 Data Pipeline Architecture

📡 Overture Maps (S3)

GeoParquet format, global coverage

🔄 ETL Pipeline (9 Steps)

Python + DuckDB + PyArrow

🗄️ AWS S3 + Glue Catalog

Iceberg tables (poi_raw, poi_master, poi_clean)

📊 Streamlit Dashboard

Interactive POI visualization
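
As a rough illustration of the first stage, the sketch below builds a DuckDB query that reads Overture places for a bounding box. It is not the pipeline's actual extraction code: the function name `build_places_query` is hypothetical, the bucket path follows Overture's publicly documented S3 layout, and the selected columns (`id`, `bbox`) are assumed to be present in the release being read.

```python
from typing import Optional, Tuple


def build_places_query(
    release: str,
    bbox: Tuple[float, float, float, float],
    limit: Optional[int] = None,
) -> str:
    """Return DuckDB SQL selecting Overture places inside (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    # Assumed path layout: Overture's public bucket, places theme.
    path = f"s3://overturemaps-us-west-2/release/{release}/theme=places/type=place/*"
    sql = (
        "SELECT id, bbox.xmin AS lon, bbox.ymin AS lat\n"
        f"FROM read_parquet('{path}', hive_partitioning=1)\n"
        f"WHERE bbox.xmin BETWEEN {xmin} AND {xmax}\n"
        f"  AND bbox.ymin BETWEEN {ymin} AND {ymax}"
    )
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


# Approximate bounding box around Jakarta (illustrative values only).
query = build_places_query("2026-04-15.0", (106.6, -6.4, 107.0, -6.0), limit=1000)
print(query)
```

To execute the query, DuckDB would also need the `httpfs` extension loaded and the S3 region set to `us-west-2`; those steps are omitted here.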

⚙️ Configuration (AWS)

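# AWS Glue + S3 + Iceberg configuration

release: "2026-04-15.0"          # Overture release to ingest

catalog:
  type: "glue"                   # AWS Glue as the Iceberg catalog

warehouse: "s3://geosquare-warehouse"
db_name: "geosquare_poi"

aws:
  region: "ap-southeast-3"

projects:
  - name: "geosquare"
    gid_level: 12                # Geosquare grid level
    aggregation:
      precisions: [12]           # grid levels for direct aggregation
    advanced_aggregation:
      mode: "buffer"             # fixed-radius catchment
      buffer:
        radius_meters: 500
      catchment:
        provider: "valhalla"     # Valhalla routing engine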
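
The sketch below shows one way a project entry like the one above might be checked and turned into output table names. It is only an illustration: the helper names are hypothetical, `"isochrone"` as a second `mode` value is an assumption, and the naming pattern is inferred from the table names `poi_stats_geosquare12` and `poi_adv_stats_geosquare12` that appear on this page.

```python
# The project entry from the configuration, written as a plain dict.
project = {
    "name": "geosquare",
    "gid_level": 12,
    "aggregation": {"precisions": [12]},
    "advanced_aggregation": {
        "mode": "buffer",
        "buffer": {"radius_meters": 500},
        "catchment": {"provider": "valhalla"},
    },
}

# Assumption: the pipeline accepts these two modes.
ALLOWED_MODES = {"buffer", "isochrone"}


def validate_project(proj: dict) -> None:
    """Raise ValueError if the advanced aggregation settings look inconsistent."""
    adv = proj["advanced_aggregation"]
    if adv["mode"] not in ALLOWED_MODES:
        raise ValueError(f"unknown aggregation mode: {adv['mode']!r}")
    if adv["mode"] == "buffer" and adv["buffer"]["radius_meters"] <= 0:
        raise ValueError("buffer radius must be positive")


def stats_tables(proj: dict) -> list:
    """Derive table names per precision (naming pattern is an assumption)."""
    names = []
    for precision in proj["aggregation"]["precisions"]:
        names.append(f"poi_stats_geosquare{precision}")
        names.append(f"poi_adv_stats_geosquare{precision}")
    return names


validate_project(project)
print(stats_tables(project))
```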
            

📈 Processing Distribution

🔬 Two Aggregation Methods

Method 1: Direct Grid Aggregation

Counts POIs per Geosquare grid cell at the configured precision level.

Level 12: ~60m² cells
Fast computation: O(n) in the number of POIs
Use case: Overall POI density
Table: poi_stats_geosquare12
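
A minimal sketch of the direct method follows. The Geosquare encoding itself is not reproduced here: `cell_id` is a hypothetical stand-in that bins coordinates into a plain latitude/longitude lattice, and `CELL_DEG` is an illustrative cell size. The point is only that each POI is assigned to one cell in a single pass, which is where the O(n) cost comes from.

```python
import math
from collections import Counter

# Stand-in cell size in degrees (roughly 55 m at the equator).
# This is NOT the Geosquare grid, only a simple lattice for illustration.
CELL_DEG = 0.0005


def cell_id(lon: float, lat: float, cell_deg: float = CELL_DEG) -> tuple:
    """Map a coordinate to the integer index of the lattice cell containing it."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


def count_per_cell(pois) -> Counter:
    """Count POIs per cell in one pass over the input."""
    return Counter(cell_id(lon, lat) for lon, lat in pois)


# Illustrative (lon, lat) points: the first two fall in the same cell.
pois = [
    (106.82710, -6.17540),
    (106.82712, -6.17541),
    (106.90010, -6.20010),
]
counts = count_per_cell(pois)
print(counts.most_common())
```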

Method 2: Catchment/Isochrone

Computes a service area for each POI (buffer or isochrone), then intersects it with the grid.

Buffer: Fixed radius (e.g., 500m)
Isochrone: Travel time (e.g., 5 min)
Use case: Accessibility analysis
Table: poi_adv_stats_geosquare12
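
The buffer variant can be sketched as follows, under simplifying assumptions: the grid is a plain latitude/longitude lattice rather than the Geosquare grid, a cell counts as covered when its center lies within the radius, and distances use the haversine formula. All function names are hypothetical. The isochrone variant would replace the circle with a polygon returned by a routing engine and is not shown.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0
CELL_DEG = 0.0005  # stand-in lattice cell size, not the Geosquare grid


def haversine_m(lon1, lat1, lon2, lat2) -> float:
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cells_in_buffer(lon, lat, radius_m, cell_deg=CELL_DEG) -> list:
    """Return lattice cells whose center lies within radius_m of the POI."""
    # Bounding window in degrees that fully contains the buffer circle.
    dlat = math.degrees(radius_m / EARTH_RADIUS_M)
    dlon = dlat / max(math.cos(math.radians(lat)), 1e-6)
    ix_min, ix_max = math.floor((lon - dlon) / cell_deg), math.floor((lon + dlon) / cell_deg)
    iy_min, iy_max = math.floor((lat - dlat) / cell_deg), math.floor((lat + dlat) / cell_deg)
    cells = []
    for ix in range(ix_min, ix_max + 1):
        for iy in range(iy_min, iy_max + 1):
            c_lon, c_lat = (ix + 0.5) * cell_deg, (iy + 0.5) * cell_deg
            if haversine_m(lon, lat, c_lon, c_lat) <= radius_m:
                cells.append((ix, iy))
    return cells


def catchment_counts(pois, radius_m) -> Counter:
    """For each cell, count how many POIs have a buffer covering it."""
    counts = Counter()
    for lon, lat in pois:
        counts.update(cells_in_buffer(lon, lat, radius_m))
    return counts


# Two illustrative POIs about 100 m apart, so their 500 m buffers overlap.
pois = [(106.8271, -6.1754), (106.8280, -6.1754)]
counts = catchment_counts(pois, radius_m=500)
print(len(counts), max(counts.values()))
```

Unlike the direct method, a single POI here contributes to every cell its buffer reaches, so the resulting counts read as "POIs reachable from this cell" rather than "POIs located in this cell".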

📋 Sample ETL Execution

# Run full pipeline (9 steps)
$ python run_etl.py

# Run specific steps
$ python run_etl.py --steps 1,2,3,4

# Run aggregation only
$ python run_etl.py --steps 7

# Run advanced aggregation (buffer or isochrone, per config)
$ python run_etl.py --steps 8

# Override config via CLI
$ python run_etl.py --city jakarta --limit 1000

# With custom config
$ python run_etl.py --config config.aws.yaml
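
For readers curious how flags like these could be parsed, here is a minimal `argparse` sketch. It is not the actual `run_etl.py`: the flag names come from the commands above, while the defaults (all nine steps, `config.yaml`) and the helper names are assumptions.

```python
import argparse

ALL_STEPS = list(range(1, 10))  # steps 1..9, matching the 9-step pipeline


def parse_steps(value: str) -> list:
    """Parse a comma-separated step list such as '1,2,3,4'."""
    steps = sorted({int(part) for part in value.split(",") if part.strip()})
    unknown = [s for s in steps if s not in ALL_STEPS]
    if not steps or unknown:
        raise argparse.ArgumentTypeError(f"invalid step list: {value!r}")
    return steps


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_etl.py")
    parser.add_argument("--steps", type=parse_steps, default=ALL_STEPS,
                        help="comma-separated steps to run (default: all)")
    parser.add_argument("--city", help="restrict processing to one city")
    parser.add_argument("--limit", type=int, help="maximum number of POIs to process")
    # Default config file name is an assumption.
    parser.add_argument("--config", default="config.yaml", help="path to a YAML config file")
    return parser


args = build_parser().parse_args(["--steps", "1,2,3,4", "--city", "jakarta", "--limit", "1000"])
print(args.steps, args.city, args.limit, args.config)
```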