How to Extract OSM Road Networks with Osmium

To extract OSM road networks with Osmium, clip the .pbf file with the osmium extract CLI using a spatial filter (bounding box or GeoJSON polygon), then apply tag-based filtering with osmium tags-filter to keep routing-relevant highway classes. For automated pipelines, use the pyosmium Python bindings to implement a SimpleHandler that filters ways, extracts directional attributes (oneway, maxspeed, lanes), and exports to GeoJSON or routing-ready PBF without loading the full dataset into RAM. This streaming architecture is critical when Building Directed Graphs from OSM PBF Files for logistics routing, fleet optimization, or urban mobility simulations.

Environment & Compatibility Requirements

Osmium is a C++ library with Python bindings that rely on compiled extensions. Compatibility varies by platform. When no prebuilt wheel matches your platform and Python version, pip install osmium falls back to compiling from source, and that build fails if system dependencies are missing.

Component   | Minimum Version                                 | Notes
OS          | Linux (glibc 2.17+), macOS 11+, WSL2 (Windows)  | Native Windows builds are deprecated; use WSL2 or conda-forge
Python      | 3.9–3.12                                        | 3.8 lacks required C-API features for pyosmium
libosmium   | 3.0.0+                                          | Required for PBF streaming and geometry factories
System Deps | zlib, protobuf, bzip2, expat                    | Install via package manager before pip install

Recommended installation:

conda install -c conda-forge pyosmium osmium-tools

If using pip, install headers first:

# Ubuntu/Debian
sudo apt install libosmium-dev zlib1g-dev libprotobuf-dev libbz2-dev libexpat1-dev
pip install osmium

Refer to the official Osmium Tool documentation for platform-specific troubleshooting and advanced build flags.

CLI Extraction Workflow

The fastest path for regional extraction combines two commands: osmium extract clips the source file to a bounding box, and osmium tags-filter keeps only the ways that carry a highway tag. osmium extract has no tag-filtering option of its own, so the tag filter runs as a second pass over the much smaller clipped file. Neither command loads the source file into memory as a whole.

osmium extract \
  -b -122.5,37.7,-122.3,37.8 \
  north-america-latest.osm.pbf \
  -o sf_clip.pbf

osmium tags-filter \
  sf_clip.pbf \
  w/highway \
  -o sf_roads.pbf

Key parameters:

  • -b min_lon,min_lat,max_lon,max_lat: Defines the bounding box in WGS84 coordinates, longitude before latitude.
  • w/highway: A tags-filter expression that keeps every way with a highway tag, together with the nodes those ways reference. Matching ways keep all of their tags. To restrict the output to specific classes, list the values, for example w/highway=motorway,trunk,primary.
  • -o: Output path. The format follows the file suffix, for example .pbf or .osm (XML).

For polygon-based clipping, replace -b with --polygon region.geojson (short form -p). The file can hold a GeoJSON Polygon or MultiPolygon. With the default complete_ways strategy, ways that cross the boundary are kept whole rather than cut at the edge, so road topology stays intact. The resulting .pbf can be ingested directly into graph builders or converted to GeoJSON using osmium export.
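For pipelines that call the CLI from Python, the two-step workflow can be assembled as argument lists before handing them to subprocess.run. The following is a minimal sketch, assuming tag filtering is done with a separate osmium tags-filter pass; the function name build_extract_commands and the file names are hypothetical, and the sketch only builds the commands without running them.

```python
import shlex


def build_extract_commands(source, clipped, roads, bbox=None, polygon=None,
                           highway_values=("motorway", "trunk", "primary")):
    """Return two argv lists: a spatial clip, then a highway tag filter."""
    if (bbox is None) == (polygon is None):
        raise ValueError("pass exactly one of bbox or polygon")

    if bbox is not None:
        min_lon, min_lat, max_lon, max_lat = bbox
        if not (min_lon < max_lon and min_lat < max_lat):
            raise ValueError("bbox must be (min_lon, min_lat, max_lon, max_lat)")
        region = ["-b", ",".join(str(v) for v in bbox)]
    else:
        region = ["-p", str(polygon)]

    extract = ["osmium", "extract", *region, str(source), "-o", str(clipped)]

    # tags-filter expression: ways (w/) whose highway value is in the list
    expression = "w/highway=" + ",".join(highway_values)
    tags_filter = ["osmium", "tags-filter", str(clipped), expression,
                   "-o", str(roads)]
    return extract, tags_filter


if __name__ == "__main__":
    commands = build_extract_commands(
        "north-america-latest.osm.pbf", "sf_clip.pbf", "sf_roads.pbf",
        bbox=(-122.5, 37.7, -122.3, 37.8),
    )
    for argv in commands:
        print(shlex.join(argv))
```

Passing each list to subprocess.run(argv, check=True) would execute the step and raise if osmium exits with an error. Building lists instead of shell strings avoids quoting problems with paths that contain spaces.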

Python Automation Script

Backend developers and GIS engineers typically require programmatic extraction to integrate with CI/CD pipelines or custom graph builders. The following script uses osmium.SimpleHandler to filter roads, reconstruct geometries, and export to GeoJSON:

import json
import sys

import osmium


class RoadNetworkHandler(osmium.SimpleHandler):
    # Routing-relevant highway classes per OSM tagging conventions
    HIGHWAY_FILTER = {
        'motorway', 'trunk', 'primary', 'secondary', 'tertiary',
        'residential', 'unclassified', 'service', 'motorway_link',
        'trunk_link', 'primary_link', 'secondary_link', 'tertiary_link'
    }

    def __init__(self, output_path):
        super().__init__()
        self.output_path = output_path
        self.features = []

    def way(self, w):
        highway = w.tags.get('highway')
        if highway not in self.HIGHWAY_FILTER:
            return

        # With locations=True, each node reference already carries its
        # coordinates, so no separate node cache is needed.
        coords = []
        for nd in w.nodes:
            if not nd.location.valid():
                return  # Skip ways whose nodes are missing from the input
            coords.append((nd.lon, nd.lat))
        if len(coords) < 2:
            return  # A LineString needs at least two positions

        feature = {
            "type": "Feature",
            "properties": {
                "highway": highway,
                "oneway": w.tags.get("oneway", "no"),
                "maxspeed": w.tags.get("maxspeed", ""),
                "lanes": w.tags.get("lanes", ""),
                "surface": w.tags.get("surface", ""),
                "osm_id": w.id
            },
            "geometry": {
                "type": "LineString",
                "coordinates": coords
            }
        }
        self.features.append(feature)

    def write(self):
        # pyosmium does not call an end-of-file method on its own,
        # so this must be invoked explicitly after apply_file().
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump({"type": "FeatureCollection", "features": self.features}, f, indent=2)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python extract_roads.py <input.pbf> <output.geojson>")
        sys.exit(1)

    handler = RoadNetworkHandler(sys.argv[2])
    handler.apply_file(sys.argv[1], locations=True)
    handler.write()

Execution:

python extract_roads.py sf_roads.pbf sf_roads.geojson

The handler filters ways against a strict highway allowlist, reads coordinates from the node location index that locations=True enables, and extracts routing attributes. Features accumulate in a list until write() runs, and the location index grows with the number of nodes in the input, so point the script at a regional extract rather than a continental file. For a complete breakdown of tag semantics and edge-case handling, consult the OSM Wiki highway key documentation.

Routing & Graph Integration

Extracted road networks rarely feed directly into production routing engines. Before ingestion, you must normalize directional constraints, resolve disconnected components, and assign traversal costs. This preprocessing stage is foundational to OSM Graph Architecture & Network Modeling, where raw geometries are transformed into adjacency structures optimized for Dijkstra, A*, or contraction hierarchies.
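As a rough illustration of that transformation, the sketch below turns GeoJSON LineString features into a directed adjacency structure with segment lengths as edge costs. It assumes features shaped like those produced by the script above, and it keys nodes by their (lon, lat) tuple, which is a simplification: a production graph builder would normally key on OSM node IDs. The names haversine_m and build_adjacency are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371008.8  # mean Earth radius


def haversine_m(a, b):
    """Great-circle distance in metres between two (lon, lat) pairs."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def build_adjacency(features):
    """Map each (lon, lat) node to a list of (neighbor, length_m) edges."""
    graph = defaultdict(list)
    for feature in features:
        coords = [tuple(c) for c in feature["geometry"]["coordinates"]]
        oneway = feature["properties"].get("oneway", "no")
        for a, b in zip(coords, coords[1:]):
            length = haversine_m(a, b)
            # Forward edge, unless the way is one-way against its drawing order
            if oneway != "-1":
                graph[a].append((b, length))
            # Backward edge, unless the way is one-way along its drawing order
            if oneway not in ("yes", "1", "true"):
                graph[b].append((a, length))
    return dict(graph)
```

A shortest-path routine such as Dijkstra can then iterate over graph[node] directly. Swapping the length for a travel time (length divided by speed) is the usual next step once maxspeed values have been parsed.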

When preparing data for graph builders:

  1. Normalize oneway values: Convert yes, 1, -1, and reversible into boolean or directional flags.
  2. Parse maxspeed: Strip units (km/h, mph) and convert to a consistent numeric baseline.
  3. Handle lanes: Split multi-lane highways into parallel edges when modeling capacity or toll routing.
  4. Validate geometry: Remove self-intersecting ways and snap endpoints to ensure graph connectivity.
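The first two steps can be sketched with the standard library alone. This is a minimal example, not a complete parser: it assumes km/h when no unit is given (the OSM convention for maxspeed), returns None for non-numeric values such as "signals" or "walk", and reports "reversible" as its own category so the caller can decide how to treat it. The function names and return values are hypothetical choices for illustration.

```python
import re

MPH_TO_KMH = 1.609344

# A number, optionally followed by a unit; bare numbers are treated as km/h
_MAXSPEED_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(km/h|kmh|kph|mph)?\s*$", re.IGNORECASE
)


def normalize_oneway(value):
    """Return 'forward', 'backward', 'reversible', or 'both'."""
    v = (value or "").strip().lower()
    if v in {"yes", "true", "1"}:
        return "forward"
    if v == "-1":
        return "backward"
    if v == "reversible":
        return "reversible"
    return "both"  # "no", missing, or unrecognized values


def parse_maxspeed(value):
    """Return the speed in km/h as a float, or None if it is not numeric."""
    match = _MAXSPEED_RE.match(value or "")
    if not match:
        return None
    speed = float(match.group(1))
    unit = (match.group(2) or "km/h").lower()
    return speed * MPH_TO_KMH if unit == "mph" else speed
```

Returning None instead of guessing a speed keeps the decision about fallbacks (for example, a default per highway class) in the cost function rather than in the parser.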

Performance & Best Practices

  • Stream over load: Never parse a full .pbf into memory. Use CLI extraction for regional cuts, and reserve pyosmium for attribute transformation or custom filtering.
  • Pre-filter at source: Run osmium tags-filter on the clipped extract before any Python processing to reduce downstream I/O and parsing overhead.
  • Use locations=True: In apply_file(), this flag attaches coordinates to the node references of each way. Without it, those references carry no valid location, and reading their lon or lat raises an error, which breaks geometry reconstruction.
  • Batch writes: For large extracts, buffer features and flush to disk in chunks rather than appending to a single JSON array.
  • Validate output: Run osmium fileinfo -e post-extraction to check the spatial bounds and object counts, and osmium tags-count to verify which tags were retained.
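The batch-write advice can be sketched as follows. Instead of one large FeatureCollection array, this example writes newline-delimited GeoJSON (one feature per line) and flushes the buffer every chunk_size features, so memory use stays bounded by the chunk rather than the whole extract. The function name write_features_in_chunks and the default chunk size are assumptions for illustration; downstream tools must accept line-delimited input for this layout to be useful.

```python
import json


def write_features_in_chunks(features, output_path, chunk_size=1000):
    """Write one GeoJSON feature per line, flushing every chunk_size features.

    `features` can be any iterable, including a generator.
    Returns the number of features written.
    """
    buffer = []
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for feature in features:
            buffer.append(json.dumps(feature, separators=(",", ":")))
            if len(buffer) >= chunk_size:
                out.write("\n".join(buffer) + "\n")
                written += len(buffer)
                buffer.clear()
        # Write whatever is left after the last full chunk
        if buffer:
            out.write("\n".join(buffer) + "\n")
            written += len(buffer)
    return written
```

Because the function accepts a generator, a handler can yield features as it encounters them rather than collecting them in a list first.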

By combining Osmium’s streaming parser with strict tag filtering, you can reliably produce lightweight, routing-ready datasets that scale from municipal boundaries to continental networks.