Osmium’s streaming parser is a low-memory path to a clean, routing-ready road dataset from a raw OSM PBF file. This page covers the variant where you need both a spatial clip and a tag filter before building directed graphs from OSM PBF files: a two-command CLI sequence followed by a pyosmium SimpleHandler that reconstructs geometries without loading the full dataset into RAM. Both the CLI workflow and the Python path sit inside the broader OSM Graph Architecture & Network Modeling pipeline, and the output feeds directly into edge construction for logistics routing, fleet optimization, and urban mobility simulation.

The key constraint this page solves is the ordering problem: osmium extract clips by geometry but ignores tag content, while osmium tags-filter selects by tag but ignores geometry. Running them out of order — or trying to merge them — either bloats the intermediate file or silently drops valid features. The pattern below runs extract first, then filter, and uses pyosmium with locations=True to safely reconstruct node coordinates on way members.

When to use this approach

Use the osmium extract → tags-filter → pyosmium SimpleHandler pipeline when any of the following conditions apply:

  • Regional scope, continental source: You are working with a planet or continent PBF (multi-GB) but only need one metro area, corridor, or country. Running pyosmium directly on a 60 GB planet file means parsing every object in it, so run time scales with the size of the source rather than with the region you need; clip first.
  • CI/CD graph rebuilds: Automated pipelines that rebuild a routing graph weekly or nightly need a deterministic, scriptable sequence. The CLI steps produce reproducible intermediate files that can be cached between runs when the source PBF has not changed.
  • Custom attribute filtering: You need oneway, maxspeed, lanes, and surface on every feature rather than accepting the defaults from a pre-packaged routing engine. pyosmium gives you direct tag access at the way level.
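  • Memory-constrained environments: pyosmium streams the file object by object, so memory use is driven by the node location index that locations=True builds and by whatever the handler accumulates, not by the size of the source PBF. Clipping and filtering first keeps both small enough that a 512 MB VM can process a regional road extract without swapping.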
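
The caching idea in the CI/CD bullet above can be sketched as a digest check: hash the source PBF and skip the CLI steps when the digest matches the one recorded after the previous successful build. The function names and the sidecar stamp file are hypothetical conventions for illustration, not part of osmium or pyosmium.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_rebuild(source_pbf: Path, stamp_file: Path) -> bool:
    """True when no stamp exists or the source digest differs from the stamp."""
    if not stamp_file.exists():
        return True
    return stamp_file.read_text().strip() != file_digest(source_pbf)


def record_build(source_pbf: Path, stamp_file: Path) -> None:
    """Store the source digest; call this only after a successful build."""
    stamp_file.write_text(file_digest(source_pbf))
```

Hashing a multi-GB file takes noticeable time; comparing file size and modification time is a cheaper, weaker alternative if the source is only ever replaced by a fresh download.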

This pattern is less appropriate when you need turn restrictions — those require parsing OSM relation members, which demands a multi-pass handler. For that case, see handling turn restrictions in routing graphs.

Environment and installation

Osmium is a C++ library; pyosmium requires compiled native extensions. Missing system headers are the most common installation failure.

Component | Minimum version | Notes
OS | Linux (glibc 2.17+), macOS 11+, WSL2 | Native Windows unsupported; use WSL2 or conda
Python | 3.9–3.12 | 3.8 lacks required C-API features
libosmium | 2.20.0+ | Required for PBF streaming and geometry factories
zlib | any current | PBF decompression
protobuf | 3.x | OSM PBF format encoding
bzip2, expat | any current | Legacy OSM XML and compression support

Recommended — conda-forge (pre-compiled binaries, no header deps):

conda install -c conda-forge pyosmium osmium-tools

Alternative — pip with system headers (Ubuntu/Debian):

sudo apt install libosmium-dev zlib1g-dev libprotobuf-dev libbz2-dev libexpat1-dev
pip install osmium

Verify the install:

osmium version          # e.g. osmium 1.16.0
python -c "import osmium; print(osmium.__version__)"

Implementation

Step 1 — Clip the region by bounding box

Spatial clipping must precede tag filtering. An unfiltered continent PBF passed to tags-filter still processes every non-road feature before discarding it.

osmium extract \
  --bbox -122.5,37.7,-122.3,37.8 \
  north-america-latest.osm.pbf \
  --output sf_bbox.pbf

--bbox accepts min_lon,min_lat,max_lon,max_lat in WGS84 degrees. Longitude comes before latitude, which is the reverse of the lat,lon order that many mapping tools and APIs use. For polygon-based clipping, replace --bbox with --polygon region.geojson.
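
Because a swapped bounding box fails silently, it can help to check the string before handing it to osmium extract. The helper below is a minimal sketch; parse_bbox is a hypothetical name, and the check only catches a swap when a longitude value falls outside the valid latitude range of [-90, 90].

```python
def parse_bbox(bbox: str) -> tuple[float, float, float, float]:
    """Parse 'min_lon,min_lat,max_lon,max_lat' and reject implausible values."""
    parts = [float(p) for p in bbox.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {len(parts)}")
    min_lon, min_lat, max_lon, max_lat = parts
    # A latitude outside [-90, 90] is the typical symptom of lat/lon swapping.
    if not (-90.0 <= min_lat <= 90.0 and -90.0 <= max_lat <= 90.0):
        raise ValueError("latitude outside [-90, 90]; are lon and lat swapped?")
    if not (-180.0 <= min_lon <= 180.0 and -180.0 <= max_lon <= 180.0):
        raise ValueError("longitude outside [-180, 180]")
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("min values must be smaller than max values")
    return min_lon, min_lat, max_lon, max_lat


print(parse_bbox("-122.5,37.7,-122.3,37.8"))
```

For regions where both coordinates lie within [-90, 90] (most of Europe and Africa, for example) a swap passes this check, so comparing the result against the bounds reported by osmium fileinfo -e remains necessary.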

Step 2 — Filter by highway tag

osmium tags-filter \
  sf_bbox.pbf \
  w/highway=motorway,trunk,primary,secondary,tertiary,\
residential,unclassified,service,\
motorway_link,trunk_link,primary_link,secondary_link,tertiary_link \
  --output sf_roads.pbf

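The w/ prefix restricts matching to way entities. Without a type prefix the expression is applied to nodes, ways, and relations alike, so the output can pick up tagged nodes and relations you did not intend to keep. Comma-separated values are treated as OR: every listed highway=* class is kept. By default tags-filter also writes the nodes referenced by matching ways, which is what lets the pyosmium step resolve coordinates later.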
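
One way to keep the CLI filter and the Python handler in step is to generate the filter expression from the same allowlist. The sketch below assumes a hypothetical helper named tags_filter_expression; it only builds the string and does not call osmium.

```python
ROUTABLE_HIGHWAYS = (
    "motorway", "trunk", "primary", "secondary", "tertiary",
    "residential", "unclassified", "service",
    "motorway_link", "trunk_link", "primary_link",
    "secondary_link", "tertiary_link",
)


def tags_filter_expression(values, key: str = "highway", object_types: str = "w") -> str:
    """Build an 'osmium tags-filter' expression such as 'w/highway=a,b,c'."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        raise ValueError("at least one tag value is required")
    # No whitespace anywhere, so the shell passes the expression as one argument.
    return f"{object_types}/{key}=" + ",".join(cleaned)


print(tags_filter_expression(ROUTABLE_HIGHWAYS))
```

The same tuple can then back the handler's allowlist, so adding a class in one place cannot leave the other stage out of date.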

Step 3 — Stream with pyosmium and export GeoJSON

The handler below reconstructs way geometries and emits GeoJSON features with the routing-relevant tags. The critical parameter is locations=True in apply_file(): it instructs pyosmium to populate nd.location on every way member node. Omitting it means nd.location.valid() always returns False and the geometry loop produces nothing.

# requires: osmium (the pyosmium package); json, sys and pathlib are stdlib
import osmium
import json
import sys
from pathlib import Path


class RoadNetworkHandler(osmium.SimpleHandler):
    """Filter highway ways and reconstruct geometries from resolved node locations."""

    # Routing-relevant highway classes per OSM tagging conventions
    HIGHWAY_FILTER: frozenset[str] = frozenset({
        "motorway", "trunk", "primary", "secondary", "tertiary",
        "residential", "unclassified", "service",
        "motorway_link", "trunk_link", "primary_link",
        "secondary_link", "tertiary_link",
    })

    def __init__(self) -> None:
        super().__init__()
        self.features: list[dict] = []

    def way(self, w: osmium.osm.Way) -> None:  # type: ignore[name-defined]
        highway = w.tags.get("highway")
        if highway not in self.HIGHWAY_FILTER:
            return

        coords: list[tuple[float, float]] = []
        for nd in w.nodes:
            if not nd.location.valid():
                return  # discard way if any node coordinate is missing
            coords.append((nd.location.lon, nd.location.lat))

        if len(coords) < 2:
            return

        self.features.append({
            "type": "Feature",
            "properties": {
                "osm_id": w.id,
                "highway": highway,
                "oneway": w.tags.get("oneway", "no"),
                "maxspeed": w.tags.get("maxspeed", ""),
                "lanes": w.tags.get("lanes", ""),
                "surface": w.tags.get("surface", ""),
                "name": w.tags.get("name", ""),
            },
            "geometry": {"type": "LineString", "coordinates": coords},
        })


def extract_roads(input_pbf: str, output_geojson: str) -> int:
    handler = RoadNetworkHandler()
    # locations=True resolves node coordinates on way member nodes
    handler.apply_file(input_pbf, locations=True)
    geojson = {"type": "FeatureCollection", "features": handler.features}
    Path(output_geojson).write_text(json.dumps(geojson, separators=(",", ":")))
    return len(handler.features)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python extract_roads.py <input.pbf> <output.geojson>")
        sys.exit(1)
    n = extract_roads(sys.argv[1], sys.argv[2])
    print(f"Exported {n} road features to {sys.argv[2]}")

Run:

python extract_roads.py sf_roads.pbf sf_roads.geojson
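
For scripted rebuilds, the two CLI steps can be driven from Python as well. The sketch below uses only the osmium flags shown in the steps above; pipeline_commands and run_pipeline are hypothetical names, and the osmium binary is assumed to be on PATH.

```python
import subprocess


def pipeline_commands(
    source_pbf: str,
    bbox: str,
    clipped_pbf: str,
    roads_pbf: str,
    filter_expression: str,
) -> list[list[str]]:
    """Return the argv lists for the clip step followed by the filter step."""
    return [
        ["osmium", "extract", "--bbox", bbox, source_pbf, "--output", clipped_pbf],
        ["osmium", "tags-filter", clipped_pbf, filter_expression, "--output", roads_pbf],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    cmds = pipeline_commands(
        "north-america-latest.osm.pbf",
        "-122.5,37.7,-122.3,37.8",
        "sf_bbox.pbf",
        "sf_roads.pbf",
        "w/highway=motorway,trunk,primary",
    )
    for argv in cmds:
        print(" ".join(argv))  # dry run: print instead of executing
```

Passing argument lists rather than a shell string avoids quoting problems with the comma-separated filter expression.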

Osmium extraction pipeline — data flow

The diagram below shows how data moves through the three-stage pipeline: spatial clip, tag filter, and streaming geometry handler.

[Figure: Osmium extraction pipeline. SOURCE: planet / continent .osm.pbf (multi-GB) → STAGE 1: osmium extract (--bbox / --polygon), spatial clip to a regional PBF → STAGE 2: osmium tags-filter w/highway=…, roads-only PBF → OUTPUT: pyosmium handler, GeoJSON.]

Key parameters and tuning

Parameter | Scope | Recommended value | Sensitivity
--bbox lon/lat order | osmium extract | min_lon,min_lat,max_lon,max_lat | Swapping lat/lon silently clips the wrong area
w/ entity prefix | osmium tags-filter | Always include for way filtering | Omitting it matches nodes and relations as well as ways
highway=* allowlist | both stages | Include *_link classes | Missing link roads fragment arterial networks
locations=True | apply_file() | Always set when accessing way node coordinates | Omitting causes nd.location.valid() to always return False
Batch write threshold | pyosmium handler | Flush to disk every 100 000 features | Unbounded in-memory list causes OOM on large extracts
Output format | osmium export | .geojson for inspection; .pbf for pipeline pass-through | GeoJSON is ~5× larger than PBF for the same feature set

The highway allowlist deserves particular care. Excluding service roads drops parking lot connectors and private access roads that are essential for last-mile delivery modelling. Excluding *_link classes (such as motorway_link and primary_link) severs ramp connections and disconnects the motorway network from arterials, which causes graph fragmentation — a failure mode described in detail in graph fragmentation prevention in OSM data.
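
The batch write threshold from the table is not implemented in the handler shown earlier, which keeps every feature in a list. A minimal sketch of one way to bound memory is shown here; BatchedFeatureWriter is a hypothetical class, and it writes newline-delimited GeoJSON (one Feature per line) instead of a single FeatureCollection, which is an assumption about what downstream tools accept.

```python
import json
from typing import IO


class BatchedFeatureWriter:
    """Buffer features and write them out every batch_size additions."""

    def __init__(self, out: IO[str], batch_size: int = 100_000) -> None:
        self._out = out
        self._batch_size = batch_size
        self._buffer: list[dict] = []
        self.written = 0  # number of features already written to `out`

    def add(self, feature: dict) -> None:
        self._buffer.append(feature)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write buffered features as one JSON object per line, then clear."""
        for feature in self._buffer:
            self._out.write(json.dumps(feature, separators=(",", ":")) + "\n")
        self.written += len(self._buffer)
        self._buffer.clear()
```

In the handler, self.features.append(...) would become a call to add(...), followed by one final flush() after apply_file() returns so the last partial batch is not lost.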

Integration points

The GeoJSON produced by the pyosmium handler is not yet a graph. It is a feature collection of LineString geometries with routing-relevant tags. The next processing steps depend on the target engine:

NetworkX / igraph: Load the GeoJSON with GeoPandas, split ways at the points they share with other ways (intersections, not only endpoints) to create edge-node pairs, and build a DiGraph by applying the oneway tag. Edges that carry oneway=yes or oneway=1 become directed arcs; oneway=-1 reverses the direction. For normalized cost assignment, follow the configuring edge weights for freight logistics patterns: maxspeed needs unit stripping and default imputation before it is usable as a travel-time weight.
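
The oneway and maxspeed handling described above can be sketched with two small helpers. Both names are hypothetical, and two choices are assumptions rather than rules from this page: any oneway value other than yes, 1, true, or -1 is treated as bidirectional, and a maxspeed that does not parse as a number with an optional unit falls back to a caller-supplied default.

```python
import re

_MAXSPEED = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(mph|km/h|kmh|kph)?\s*$", re.IGNORECASE)


def directed_arcs(coords: list, oneway: str) -> list[list]:
    """Return one coordinate sequence per travel direction allowed on the way."""
    value = (oneway or "no").strip().lower()
    if value in {"yes", "1", "true"}:
        return [list(coords)]
    if value == "-1":
        return [list(reversed(coords))]
    # Assumption: everything else (no, reversible, unknown) is two-way.
    return [list(coords), list(reversed(coords))]


def maxspeed_kmh(raw: str, default_kmh: float) -> float:
    """Strip units from a maxspeed tag; impute default_kmh when unparseable."""
    match = _MAXSPEED.match(raw or "")
    if match is None:
        return default_kmh
    speed = float(match.group(1))
    if (match.group(2) or "").lower() == "mph":
        speed *= 1.609344  # miles per hour to kilometres per hour
    return speed
```

Note that the handler above defaults a missing oneway tag to "no". OSM tagging conventions treat some ways, such as motorways and roundabouts, as implicitly one-way, so a production build would likely need an extra rule for those cases.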

OSRM: Pass the filtered PBF (not the GeoJSON) to osrm-extract. The tag filter step ensures OSRM processes only routable ways, cutting profile pre-processing time significantly on large datasets. See deploying OSRM with Docker for local routing for the full Docker-based setup.

Valhalla: Feed the filtered PBF into valhalla_build_admins and valhalla_build_tiles. The reduced file size shortens tile-build time. Valhalla’s costing model still reads maxspeed and surface from the PBF — preserving those tags in the filter is essential.

GeoPackage / PostGIS staging: Use ogr2ogr to load the GeoJSON into a spatially indexed staging table before graph construction. A spatial index on the geometry column replaces the O(n) scan per nearest-node query with an index lookup, which matters once snapping runs for every stop in a fleet.

Validation checklist

Run these checks after each extraction to catch problems before they propagate into graph construction:

  1. Feature count is non-zero: python -c "import json; d=json.load(open('sf_roads.geojson')); print(len(d['features']))". An empty collection usually points to a swapped bbox coordinate order, a missing locations=True, or a tags-filter expression that matched nothing.
  2. Bounding box matches intent: osmium fileinfo -e sf_roads.pbf prints spatial bounds; verify they match the target area.
  3. All expected highway classes are present: Group features by properties.highway and confirm that motorway_link and primary_link appear — their absence indicates a truncated allowlist.
  4. No features with empty coordinate arrays: Filter with [f for f in features if len(f['geometry']['coordinates']) < 2] — non-empty results indicate locations=True was not effective or the PBF had corrupt node references.
  5. oneway values are within the expected set: Count distinct values; values outside {yes, no, 1, -1, reversible, alternating} indicate upstream data quality issues that will cause silent directional errors in graph construction.
  6. File size is plausible: A filtered metro-area GeoJSON for a city like San Francisco should be 5–30 MB. A 200 KB file almost certainly has an overly restrictive filter; a 2 GB file suggests the clip step was skipped.
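
Checks 1, 3, 4, and 5 above can be combined into one pass over the feature list. The function below is a sketch under the assumption that features follow the property layout produced by the handler on this page; validate_features is a hypothetical name.

```python
from collections import Counter

EXPECTED_ONEWAY = {"yes", "no", "1", "-1", "reversible", "alternating"}
EXPECTED_LINKS = {"motorway_link", "primary_link"}


def validate_features(features: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all passed."""
    if not features:
        return ["feature collection is empty"]
    problems: list[str] = []
    # Check 3: link classes should be present in a complete allowlist.
    classes = Counter(f["properties"].get("highway") for f in features)
    for cls in sorted(EXPECTED_LINKS - set(classes)):
        problems.append(f"no {cls} features found")
    # Check 4: a LineString needs at least two coordinates.
    short = sum(1 for f in features if len(f["geometry"]["coordinates"]) < 2)
    if short:
        problems.append(f"{short} feature(s) with fewer than 2 coordinates")
    # Check 5: flag oneway values outside the expected set.
    seen = {f["properties"].get("oneway", "no") for f in features}
    unexpected = sorted(seen - EXPECTED_ONEWAY)
    if unexpected:
        problems.append(f"unexpected oneway values: {unexpected}")
    return problems
```

Load the file with json.load and pass the "features" list; a non-empty return value can fail a CI job before graph construction starts.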

Troubleshooting: nd.location.valid() always returns False

You omitted locations=True in apply_file(). Without this flag, pyosmium does not populate node coordinates on way member nodes. Every nd.location.valid() call returns False, the geometry loop returns early on every way, and the handler emits zero features. Fix: handler.apply_file(input_pbf, locations=True).

Troubleshooting: osmium tags-filter keeps zero features

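A missing w/ prefix is not the cause: without a type prefix the expression matches nodes, ways, and relations, so the output grows rather than shrinks. Check instead that the expression reaches osmium as a single argument (no spaces after the commas, no leading whitespace on the continuation lines) and that the key and values are spelled as they appear in OSM. Use w/highway=motorway,trunk,... explicitly. Also check that the input PBF is not empty: run osmium fileinfo -e sf_bbox.pbf to confirm it contains way objects.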

Troubleshooting: pip install osmium fails during wheel compilation

pip compiles pyosmium only when it cannot use a prebuilt wheel for your platform and Python version, and that source build needs the C++ development headers. Install them first: sudo apt install libosmium-dev zlib1g-dev libprotobuf-dev libbz2-dev libexpat1-dev. On macOS: brew install libosmium protobuf. If headers are unavailable, use conda install -c conda-forge pyosmium osmium-tools instead, which ships pre-compiled binaries.

Troubleshooting: osmium extract --bbox silently clips the wrong area

The --bbox argument uses longitude-first order (min_lon,min_lat,max_lon,max_lat), which is the reverse of the latitude-first convention used by many GIS tools. If the clipped PBF contains zero or unexpected features, swap the coordinate pairs. For example, the San Francisco bbox is -122.5,37.7,-122.3,37.8, not 37.7,-122.5,37.8,-122.3.

