Osmium’s streaming parser is a low-memory path from a raw OSM PBF file to a clean, routing-ready road dataset. This page covers the variant where you need both a spatial clip and a tag filter before feeding ways into directed graph construction (see building directed graphs from OSM PBF files): a two-command CLI sequence followed by a pyosmium SimpleHandler that reconstructs geometries without loading the full dataset into RAM. Both the CLI workflow and the Python path sit inside the broader OSM Graph Architecture & Network Modeling pipeline, and the output feeds directly into edge construction for logistics routing, fleet optimization, and urban mobility simulation.
The key constraint this page solves is the ordering problem: osmium extract clips by geometry but ignores tag content, while osmium tags-filter selects by tag but ignores geometry. Running them out of order — or trying to merge them — either bloats the intermediate file or silently drops valid features. The pattern below runs extract first, then filter, and uses pyosmium with locations=True to safely reconstruct node coordinates on way members.
When to use this approach
Use the osmium extract → tags-filter → pyosmium SimpleHandler pipeline when any of the following conditions apply:
- Regional scope, continental source: You are working with a planet or continent PBF (multi-GB) but only need one metro area, corridor, or country. Running pyosmium directly on a 60 GB planet file multiplies processing time by a factor proportional to the unwanted data; clipping first is mandatory.
- CI/CD graph rebuilds: Automated pipelines that rebuild a routing graph weekly or nightly need a deterministic, scriptable sequence. The CLI steps produce reproducible intermediate files that can be cached between runs when the source PBF has not changed.
- Custom attribute filtering: You need oneway, maxspeed, lanes, and surface on every feature rather than accepting the defaults from a pre-packaged routing engine. pyosmium gives you direct tag access at the way level.
- Memory-constrained environments: The handler sees one way at a time rather than the whole dataset. What stays resident is the node location index built for locations=True and the list of emitted features, and both shrink with the clip and filter steps. A 512 MB VM can process a multi-GB regional extract without swapping.
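The caching idea in the CI/CD bullet can be sketched as follows. This is a minimal sketch, assuming a content hash of the source PBF combined with the clip and filter parameters is an acceptable cache key; the function name extract_cache_key and the key layout are hypothetical and not part of osmium.

```python
import hashlib
from pathlib import Path


def extract_cache_key(source_pbf: str, bbox: str, filter_expr: str) -> str:
    """Return a hex digest that changes when the source file or parameters change.

    Hypothetical helper: the file is hashed in 1 MiB chunks so that a
    multi-GB PBF never has to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(source_pbf).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    # Mix in the parameters so a changed bbox or allowlist invalidates the cache.
    digest.update(b"\0" + bbox.encode("utf-8"))
    digest.update(b"\0" + filter_expr.encode("utf-8"))
    return digest.hexdigest()
```

A pipeline could name its intermediate files after this key and skip the extract and filter steps when a file with that name already exists. Hashing a multi-GB file takes time of its own, so a cheaper key (for example file size plus modification time) may be a reasonable trade-off.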
This pattern is less appropriate when you need turn restrictions — those require parsing OSM relation members, which demands a multi-pass handler. For that case, see handling turn restrictions in routing graphs.
Environment and installation
Osmium is a C++ library; pyosmium requires compiled native extensions. Missing system headers are the most common installation failure.
| Component | Minimum version | Notes |
|---|---|---|
| OS | Linux (glibc 2.17+), macOS 11+, WSL2 | Native Windows unsupported; use WSL2 or conda |
| Python | 3.9–3.12 | 3.8 lacks required C-API features |
| libosmium | 2.20.0+ | Required for PBF streaming and geometry factories |
| zlib | any current | PBF decompression |
| protobuf | 3.x | OSM PBF format encoding |
| bzip2, expat | any current | Legacy OSM XML and compression support |
Recommended — conda-forge (pre-compiled binaries, no header deps):
conda install -c conda-forge pyosmium osmium-tools
Alternative — pip with system headers (Ubuntu/Debian):
sudo apt install libosmium-dev zlib1g-dev libprotobuf-dev libbz2-dev libexpat1-dev
pip install osmium
Verify the install:
osmium version # e.g. osmium 1.16.0
python -c "import osmium; print(osmium.__version__)"
Implementation
Step 1 — Clip the region by bounding box
Spatial clipping must precede tag filtering. An unfiltered continent PBF passed to tags-filter still processes every non-road feature before discarding it.
osmium extract \
--bbox -122.5,37.7,-122.3,37.8 \
north-america-latest.osm.pbf \
--output sf_bbox.pbf
--bbox accepts min_lon,min_lat,max_lon,max_lat in WGS84 degrees. Longitude comes before latitude, the reverse of the latitude-first order used by many GIS tools. For polygon-based clipping, replace --bbox with --polygon region.geojson.
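Because a swapped bounding box fails silently, it can help to build the argument in code and reject obviously wrong values first. The helper below is a hypothetical sketch (bbox_argument is not an osmium function); it only checks coordinate ranges and ordering.

```python
def bbox_argument(min_lon: float, min_lat: float, max_lon: float, max_lat: float) -> str:
    """Format a bounding box for `osmium extract --bbox`, rejecting obvious mistakes."""
    if not (-180.0 <= min_lon <= 180.0 and -180.0 <= max_lon <= 180.0):
        raise ValueError("longitude out of range; were lat and lon swapped?")
    if not (-90.0 <= min_lat <= 90.0 and -90.0 <= max_lat <= 90.0):
        raise ValueError("latitude out of range; were lat and lon swapped?")
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("min values must be smaller than max values")
    # Longitude first, matching the order osmium extract expects.
    return f"{min_lon},{min_lat},{max_lon},{max_lat}"


# The San Francisco box from the command above.
print(bbox_argument(-122.5, 37.7, -122.3, 37.8))
```

The range check catches a swap only when a longitude lies outside ±90 degrees, as it does for San Francisco. For regions where both values fit within the latitude range, a swapped box passes this check and still needs a visual inspection.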
Step 2 — Filter by highway tag
osmium tags-filter \
sf_bbox.pbf \
w/highway=motorway,trunk,primary,secondary,tertiary,\
residential,unclassified,service,\
motorway_link,trunk_link,primary_link,secondary_link,tertiary_link \
--output sf_roads.pbf
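The w/ prefix restricts matching to way entities. Without a type prefix, the expression matches nodes, ways, and relations alike, so tagged nodes and relations end up in the output alongside the ways. Comma-separated values are treated as OR: all listed highway=* classes are kept. Keep the value list free of whitespace when you split it across continuation lines; a leading space on a continuation line breaks the expression into separate arguments.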
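If the pipeline is driven from Python, the filter expression can be assembled from the same allowlist the handler uses, which keeps the two stages in sync and keeps the value list in one argument. This is a sketch; HIGHWAY_CLASSES, highway_filter_expression, and tags_filter_command are hypothetical names, and the command list simply mirrors the CLI call shown above.

```python
HIGHWAY_CLASSES = (
    "motorway", "trunk", "primary", "secondary", "tertiary",
    "residential", "unclassified", "service",
    "motorway_link", "trunk_link", "primary_link",
    "secondary_link", "tertiary_link",
)


def highway_filter_expression(classes=HIGHWAY_CLASSES) -> str:
    """Build a way-only tags-filter expression with no embedded whitespace."""
    return "w/highway=" + ",".join(classes)


def tags_filter_command(input_pbf: str, output_pbf: str, classes=HIGHWAY_CLASSES) -> list[str]:
    """Return the argument list for the tags-filter step (not executed here)."""
    return [
        "osmium", "tags-filter", input_pbf,
        highway_filter_expression(classes),
        "--output", output_pbf,
    ]


print(" ".join(tags_filter_command("sf_bbox.pbf", "sf_roads.pbf")))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting altogether, since each element reaches osmium as a single argument.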
Step 3 — Stream with pyosmium and export GeoJSON
The handler below reconstructs way geometries and emits GeoJSON features with the routing-relevant tags. The critical parameter is locations=True in apply_file(): it instructs pyosmium to populate nd.location on every way member node. Omitting it means nd.location.valid() always returns False and the geometry loop produces nothing.
# requires: osmium (pyosmium); json, sys, and pathlib are stdlib
import osmium
import json
import sys
from pathlib import Path


class RoadNetworkHandler(osmium.SimpleHandler):
    """Filter highway ways and reconstruct geometries from resolved node locations."""

    # Routing-relevant highway classes per OSM tagging conventions
    HIGHWAY_FILTER: frozenset[str] = frozenset({
        "motorway", "trunk", "primary", "secondary", "tertiary",
        "residential", "unclassified", "service",
        "motorway_link", "trunk_link", "primary_link",
        "secondary_link", "tertiary_link",
    })

    def __init__(self) -> None:
        super().__init__()
        self.features: list[dict] = []

    def way(self, w: osmium.osm.Way) -> None:  # type: ignore[name-defined]
        highway = w.tags.get("highway")
        if highway not in self.HIGHWAY_FILTER:
            return
        coords: list[tuple[float, float]] = []
        for nd in w.nodes:
            if not nd.location.valid():
                return  # discard way if any node coordinate is missing
            coords.append((nd.location.lon, nd.location.lat))
        if len(coords) < 2:
            return
        self.features.append({
            "type": "Feature",
            "properties": {
                "osm_id": w.id,
                "highway": highway,
                "oneway": w.tags.get("oneway", "no"),
                "maxspeed": w.tags.get("maxspeed", ""),
                "lanes": w.tags.get("lanes", ""),
                "surface": w.tags.get("surface", ""),
                "name": w.tags.get("name", ""),
            },
            "geometry": {"type": "LineString", "coordinates": coords},
        })


def extract_roads(input_pbf: str, output_geojson: str) -> int:
    handler = RoadNetworkHandler()
    # locations=True resolves node coordinates on way member nodes
    handler.apply_file(input_pbf, locations=True)
    geojson = {"type": "FeatureCollection", "features": handler.features}
    Path(output_geojson).write_text(json.dumps(geojson, separators=(",", ":")))
    return len(handler.features)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python extract_roads.py <input.pbf> <output.geojson>")
        sys.exit(1)
    n = extract_roads(sys.argv[1], sys.argv[2])
    print(f"Exported {n} road features to {sys.argv[2]}")
Run:
python extract_roads.py sf_roads.pbf sf_roads.geojson
Osmium extraction pipeline: data flow
Data moves through the three-stage pipeline in order: spatial clip (north-america-latest.osm.pbf → sf_bbox.pbf), tag filter (sf_bbox.pbf → sf_roads.pbf), and streaming geometry handler (sf_roads.pbf → sf_roads.geojson).
Key parameters and tuning
| Parameter | Scope | Recommended value | Sensitivity |
|---|---|---|---|
| --bbox lon/lat order | osmium extract | min_lon,min_lat,max_lon,max_lat | Swapping lat/lon silently clips the wrong area |
| w/ entity prefix | osmium tags-filter | Always include for way filtering | Omitting it also matches nodes and relations that carry the tag |
| highway=* allowlist | both stages | Include *_link classes | Missing link roads fragment arterial networks |
| locations=True | apply_file() | Always set when accessing way node coordinates | Omitting causes nd.location.valid() to always return False |
| Batch write threshold | pyosmium handler | Flush to disk every 100 000 features | Unbounded in-memory list causes OOM on large extracts |
| Output format | osmium export | .geojson for inspection; .pbf for pipeline pass-through | GeoJSON is ~5× larger than PBF for the same feature set |
The highway allowlist deserves particular care. Excluding service roads drops parking lot connectors and private access roads that are essential for last-mile delivery modelling. Excluding *_link classes (such as motorway_link and primary_link) severs ramp connections and disconnects the motorway network from arterials, which causes graph fragmentation — a failure mode described in detail in graph fragmentation prevention in OSM data.
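The batch write threshold in the table above is not implemented by the handler shown earlier, which keeps every feature in one list. One way to bound memory is sketched below, assuming newline-delimited GeoJSON (one feature per line) is an acceptable output instead of a single FeatureCollection. FeatureBatchWriter is a hypothetical helper, not part of pyosmium.

```python
import json
from typing import IO


class FeatureBatchWriter:
    """Buffer GeoJSON features and write them as newline-delimited JSON in batches."""

    def __init__(self, out: IO[str], batch_size: int = 100_000) -> None:
        self._out = out
        self._batch_size = batch_size
        self._buffer: list[dict] = []
        self.written = 0  # number of features already written to `out`

    def add(self, feature: dict) -> None:
        """Queue one feature; write the buffer out once it reaches batch_size."""
        self._buffer.append(feature)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write all buffered features, one JSON document per line."""
        for feature in self._buffer:
            self._out.write(json.dumps(feature, separators=(",", ":")) + "\n")
        self.written += len(self._buffer)
        self._buffer.clear()
```

In the handler, self.features.append(...) would become a call to writer.add(...), followed by one final writer.flush() after apply_file() returns. The line-per-feature layout is the trade-off: it can be appended to incrementally, but consumers that expect a FeatureCollection need a conversion step.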
Integration points
The GeoJSON produced by the pyosmium handler is not yet a graph. It is a feature collection of LineString geometries with routing-relevant tags. The next processing steps depend on the target engine:
NetworkX / igraph: Load the GeoJSON with GeoPandas, split ways at shared endpoints to create edge-node pairs, and build a DiGraph by applying the oneway tag. Edges that carry oneway=yes or oneway=1 become directed arcs; oneway=-1 reverses the direction. For normalized cost assignment, follow the configuring edge weights for freight logistics patterns — maxspeed needs unit stripping and default imputation before it is usable as a travel-time weight.
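The oneway and maxspeed handling described for NetworkX / igraph can be sketched without any graph library. The code below is a simplified illustration under stated assumptions: it treats every consecutive vertex pair as an arc instead of splitting ways at shared nodes, it recognizes only the oneway values named above, and the function names arcs_for_way and maxspeed_kmh are hypothetical. The default speed is supplied by the caller and is not an OSM convention.

```python
import re

FORWARD_VALUES = {"yes", "1"}   # arcs follow the digitized direction
REVERSE_VALUES = {"-1"}         # arcs run against the digitized direction
MPH_TO_KMH = 1.609344


def arcs_for_way(coords, oneway):
    """Return directed (start, end) coordinate pairs for consecutive vertices."""
    forward = list(zip(coords, coords[1:]))
    backward = [(b, a) for a, b in reversed(forward)]
    if oneway in FORWARD_VALUES:
        return forward
    if oneway in REVERSE_VALUES:
        return backward
    # Any other value (including "no") is treated as two-way in this sketch.
    return forward + backward


def maxspeed_kmh(raw: str, default: float) -> float:
    """Parse values such as "50", "50 km/h", or "30 mph"; fall back to default."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mph|km/h|kmh)?\s*", raw)
    if match is None:
        return default  # empty or non-numeric values such as "signals"
    value = float(match.group(1))
    return value * MPH_TO_KMH if match.group(2) == "mph" else value
```

Each arc can then be added to a directed graph with the parsed speed as an attribute. Values such as reversible and alternating are treated as two-way here, which is a simplification that a production graph builder would likely need to revisit.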
OSRM: Pass the filtered PBF (not the GeoJSON) to osrm-extract. The tag filter step ensures OSRM processes only routable ways, cutting profile pre-processing time significantly on large datasets. See deploying OSRM with Docker for local routing for the full Docker-based setup.
Valhalla: Feed the filtered PBF into valhalla_build_admins and valhalla_build_tiles. The reduced file size shortens tile-build time. Valhalla’s costing model still reads maxspeed and surface from the PBF — preserving those tags in the filter is essential.
GeoPackage / PostGIS staging: Use ogr2ogr to load the GeoJSON into a spatially indexed staging table before graph construction. A spatial index on the geometry column replaces the linear scan, O(n) per query, that nearest-node snapping otherwise needs with an index lookup that grows roughly logarithmically with the number of features.
Validation checklist
Run these checks after each extraction to catch problems before they propagate into graph construction:
- Feature count is non-zero: python -c "import json; d=json.load(open('sf_roads.geojson')); print(len(d['features']))" prints the count. An empty collection usually points to a wrong bbox coordinate order or a filter expression that matched no ways.
- Bounding box matches intent: osmium fileinfo -e sf_roads.pbf prints spatial bounds; verify they match the target area.
- All expected highway classes are present: Group features by properties.highway and confirm that motorway_link and primary_link appear; their absence indicates a truncated allowlist.
- No features with empty coordinate arrays: Filter with [f for f in features if len(f['geometry']['coordinates']) < 2]. Non-empty results indicate locations=True was not effective or the PBF had corrupt node references.
- oneway values are within the expected set: Count distinct values; values outside {yes, no, 1, -1, reversible, alternating} indicate upstream data quality issues that will cause silent directional errors in graph construction.
- File size is plausible: A filtered metro-area GeoJSON for a city like San Francisco should be 5–30 MB. A 200 KB file almost certainly has an overly restrictive filter; a 2 GB file suggests the clip step was skipped.
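Several of the checks above can be combined into one function that runs on the loaded feature list. This is a sketch: validate_features, EXPECTED_ONEWAY, and REQUIRED_CLASSES are hypothetical names, and the required classes are just the two link types the checklist mentions.

```python
from collections import Counter

EXPECTED_ONEWAY = {"yes", "no", "1", "-1", "reversible", "alternating"}
REQUIRED_CLASSES = {"motorway_link", "primary_link"}


def validate_features(features: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    if not features:
        return ["feature collection is empty"]
    problems: list[str] = []

    # Check that the link classes named in the checklist are present.
    classes = Counter(f["properties"].get("highway") for f in features)
    missing = REQUIRED_CLASSES - set(classes)
    if missing:
        problems.append(f"missing highway classes: {sorted(missing)}")

    # A LineString needs at least two coordinates.
    short = [f for f in features if len(f["geometry"]["coordinates"]) < 2]
    if short:
        problems.append(f"{len(short)} features have fewer than 2 coordinates")

    # Flag oneway values outside the expected set.
    odd = {str(f["properties"].get("oneway")) for f in features} - EXPECTED_ONEWAY
    if odd:
        problems.append(f"unexpected oneway values: {sorted(odd)}")
    return problems
```

Loading the file with json.load and passing d["features"] to this function gives a list that a CI job can print and fail on when it is non-empty. The bounding-box and file-size checks are left out because they depend on the target area.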
Troubleshooting: nd.location.valid() always returns False
You omitted locations=True in apply_file(). Without this flag, pyosmium does not populate node coordinates on way member nodes. Every nd.location.valid() call returns False, the geometry loop returns early on every way, and the handler emits zero features. Fix: handler.apply_file(input_pbf, locations=True).
Troubleshooting: osmium tags-filter keeps zero features
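Start with the filter expression. An expression without a type prefix, such as highway=motorway, matches nodes, ways, and relations that carry the tag, so a missing w/ prefix adds unwanted objects rather than removing ways; use w/highway=motorway,trunk,... to keep only ways. An empty result usually means the expression matched nothing (check for typos and for whitespace inside the comma-separated value list) or the input was already empty. Run osmium fileinfo -e sf_bbox.pbf to confirm it contains way objects.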
Troubleshooting: pip install osmium fails during wheel compilation
When pip builds pyosmium from source, the build needs the libosmium C++ development headers. Install them first: sudo apt install libosmium-dev zlib1g-dev libprotobuf-dev libbz2-dev libexpat1-dev. On macOS: brew install libosmium protobuf. If headers are unavailable, use conda install -c conda-forge pyosmium osmium-tools instead, which ships pre-compiled binaries.
Troubleshooting: osmium extract --bbox silently clips the wrong area
The --bbox argument uses longitude-first order (min_lon,min_lat,max_lon,max_lat), which is the reverse of the latitude-first convention used by many GIS tools. If the clipped PBF contains zero or unexpected features, swap the coordinate pairs. For example, the San Francisco bbox is -122.5,37.7,-122.3,37.8, not 37.7,-122.5,37.8,-122.3.
Related
- pyosmium streaming handler architecture for directed graph construction: parent cluster covering the full PBF-to-graph pipeline
- graph fragmentation prevention in OSM data: why missing link roads and disconnected components break routing
- configuring edge weights for freight logistics: normalizing maxspeed, oneway, and surface into traversal costs
- handling turn restrictions in routing graphs: multi-pass relation parsing for restriction enforcement