Transforming raw OpenStreetMap Protocolbuffer Binary Format (PBF) extracts into computationally efficient directed graphs is the foundational step in any production routing pipeline. Unlike undirected representations, directed graphs explicitly encode traversal constraints — one-way streets, junction=roundabout implicit directionality, and access-class asymmetries — so that downstream algorithms operate on topologically sound networks from the start. This page is part of OSM Graph Architecture & Network Modeling, the broader reference for graph topology, attribute mapping, and traversal logic in Python-based routing systems.
Prerequisites
Before implementing the extraction pipeline, ensure your environment meets these baseline requirements:
- Python 3.9+ with
piporuv - Core libraries:
pyosmium(streaming PBF parser),networkx(graph construction),shapely(geometry validation),numpy(vectorized coordinate operations) - System resources: 8 GB RAM minimum for state-level extracts; 16 GB or more for national datasets
- Data source: Geofabrik regional extracts or official OSM PBF dumps (
.osm.pbf)
# Install all required dependencies
pip install pyosmium networkx shapely numpy scipy
Conceptual Architecture
The pipeline follows a strict sequence: PBF ingest → way filtering → directionality resolution → graph assembly → validation. Each stage feeds the next without backtracking, which keeps peak memory proportional to the number of routable ways rather than total file size.
The diagram below shows how OSM primitive types flow through the handler into a directed graph structure.
Step-by-Step Implementation
1. Streaming PBF Ingestion
PBF files are compressed and structured for sequential reads. Loading them fully into memory is impractical at regional scale. pyosmium implements a callback-based architecture: define a handler class inheriting from osmium.SimpleHandler, override way(), and call apply_file() with locations=True so node coordinates are resolved on way member nodes. Without that flag, nd.location.valid() returns False for every node and geometry reconstruction is impossible.
# Requires: pyosmium, networkx
import osmium
import networkx as nx
class DirectedGraphBuilder(osmium.SimpleHandler):
"""Stream-parse a PBF file and build a directed routing graph."""
ROUTABLE = {
"motorway", "trunk", "primary", "secondary", "tertiary",
"residential", "unclassified", "living_street", "service",
"motorway_link", "trunk_link", "primary_link",
"secondary_link", "tertiary_link",
}
def __init__(self) -> None:
super().__init__()
self.graph: nx.DiGraph = nx.DiGraph()
def way(self, w: osmium.osm.Way) -> None:
if w.tags.get("highway") not in self.ROUTABLE:
return
node_refs: list[int] = []
for nd in w.nodes:
if not nd.location.valid():
return # incomplete way — drop the whole way
self.graph.add_node(nd.ref, lat=nd.location.lat, lon=nd.location.lon)
node_refs.append(nd.ref)
if len(node_refs) < 2:
return
self._add_edges(w, node_refs)
def _add_edges(self, w: osmium.osm.Way, node_refs: list[int]) -> None:
oneway = w.tags.get("oneway", "no").lower()
is_roundabout = w.tags.get("junction", "").lower() == "roundabout"
forward_only = oneway in ("yes", "true", "1") or is_roundabout
backward_only = oneway in ("-1", "reverse")
attrs = {
"highway": w.tags.get("highway"),
"maxspeed": w.tags.get("maxspeed"),
"name": w.tags.get("name", ""),
"surface": w.tags.get("surface", ""),
"length_m": 0.0, # populate with Haversine after build
}
pairs = list(zip(node_refs, node_refs[1:]))
if backward_only:
pairs = [(v, u) for u, v in pairs]
for u, v in pairs:
if not self.graph.has_edge(u, v):
self.graph.add_edge(u, v, **attrs)
if not forward_only and not backward_only:
for u, v in list(zip(node_refs, node_refs[1:])):
if not self.graph.has_edge(v, u):
self.graph.add_edge(v, u, **attrs)
def build(self, pbf_path: str) -> nx.DiGraph:
self.apply_file(pbf_path, locations=True)
return self.graph
builder = DirectedGraphBuilder()
G = builder.build("region.osm.pbf")
print(f"Nodes: {G.number_of_nodes():,} Edges: {G.number_of_edges():,}")
2. Topological Directionality Resolution
OSM ways are sequences of node references with no inherent direction. Directionality must be inferred from tag semantics applied during ingestion:
| Tag value | Traversal |
|---|---|
oneway=yes / oneway=true / oneway=1 |
Forward only (node sequence order) |
oneway=-1 / oneway=reverse |
Backward only (reversed node sequence) |
oneway=no or absent |
Bidirectional — two directed edges |
junction=roundabout |
Implicit forward directionality |
highway=motorway |
One-way by OSM convention; explicit oneway tag takes precedence |
oneway=reversible |
Model as bidirectional with a time-conditional cost attribute |
When a way is bidirectional, insert both (u, v) and (v, u) as separate edges. This is the correct semantic — do not try to model bidirectionality as a single undirected edge, which would lose the separate attribute sets needed for lane-count asymmetries.
3. Post-Build Edge Length Computation
Edge length is not available from the PBF stream without pre-loading coordinates. After build() returns, compute Haversine distances using vectorized NumPy operations and write them back into graph edge attributes.
# Requires: numpy, networkx
import numpy as np
def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
"""Return the great-circle distance in metres between two WGS84 points."""
R = 6_371_000.0
phi1, phi2 = np.radians(lat1), np.radians(lat2)
dphi = np.radians(lat2 - lat1)
dlam = np.radians(lon2 - lon1)
a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
return float(2 * R * np.arcsin(np.sqrt(a)))
def populate_edge_lengths(G: nx.DiGraph) -> None:
"""Write length_m into every edge in-place using vectorised NumPy."""
# Build coordinate arrays for all source and target nodes at once
u_arr = np.array([(G.nodes[u]["lat"], G.nodes[u]["lon"]) for u, v in G.edges()])
v_arr = np.array([(G.nodes[v]["lat"], G.nodes[v]["lon"]) for u, v in G.edges()])
R = 6_371_000.0
phi1, phi2 = np.radians(u_arr[:, 0]), np.radians(v_arr[:, 0])
dphi = np.radians(v_arr[:, 0] - u_arr[:, 0])
dlam = np.radians(v_arr[:, 1] - u_arr[:, 1])
a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
lengths = 2 * R * np.arcsin(np.sqrt(a))
for (u, v), length in zip(G.edges(), lengths):
G[u][v]["length_m"] = float(length)
populate_edge_lengths(G)
4. Validation and Topological Sanitization
Raw OSM data contains inconsistencies that fragment routing networks or create impossible traversals. Run these checks immediately after graph assembly:
# Requires: networkx
import networkx as nx
def validate_routing_graph(G: nx.DiGraph) -> dict:
"""Return a dict of validation metrics; raise if critical thresholds are breached."""
wcc = list(nx.weakly_connected_components(G))
largest_wcc = max(wcc, key=len)
coverage = len(largest_wcc) / G.number_of_nodes()
dangling = [n for n in G.nodes() if G.degree(n) == 1]
zero_length = [(u, v) for u, v, d in G.edges(data=True) if d.get("length_m", 1) <= 0]
missing_highway = [(u, v) for u, v, d in G.edges(data=True) if not d.get("highway")]
metrics = {
"total_nodes": G.number_of_nodes(),
"total_edges": G.number_of_edges(),
"weakly_connected_components": len(wcc),
"largest_component_coverage": round(coverage, 4),
"dangling_nodes": len(dangling),
"zero_length_edges": len(zero_length),
"missing_highway_edges": len(missing_highway),
}
if coverage < 0.90:
raise ValueError(
f"Largest WCC covers only {coverage:.1%} of nodes — "
"graph is heavily fragmented. Check boundary stitching or tag filters."
)
return metrics
report = validate_routing_graph(G)
for k, v in report.items():
print(f" {k}: {v}")
A largest_component_coverage below 0.90 almost always indicates either missing highway link classes (e.g. motorway_link omitted from ROUTABLE) or a PBF extract that was clipped mid-way through connecting ways. See graph fragmentation prevention in OSM data for spatial partitioning and boundary stitching techniques that address this systematically.
Configuration Reference
The parameters below have the highest impact on graph fidelity and memory use. Tune them before committing a pipeline to production.
| Parameter | Location | Default | Notes |
|---|---|---|---|
ROUTABLE highway classes |
DirectedGraphBuilder |
13 classes listed above | Add track, path, or cycleway for multi-modal networks — covered in implementing multi-modal transit layers |
locations=True |
apply_file() |
Must be set | Omitting it makes all node locations invalid |
oneway resolution logic |
_add_edges() |
Handles yes / -1 / roundabout | Extend to handle oneway:bicycle, oneway:bus for modal restrictions |
Edge surface attribute |
attrs dict |
Stored as raw string | Required by configuring edge weights for freight logistics to apply road-surface cost multipliers |
maxspeed storage |
attrs dict |
Raw OSM string (e.g. "50 mph") |
Parse with a regex before use — unit varies by country |
| WCC coverage threshold | validate_routing_graph() |
0.90 | Lower to 0.85 for island/peninsula extracts with expected structural gaps |
Production Optimization and Scaling
Sparse matrix conversion. For routing algorithms that do not rely on NetworkX path functions, convert the DiGraph to a SciPy CSR matrix immediately after validation. This cuts memory by roughly 60% for dense urban graphs and enables vectorized shortest-path computation.
# Requires: networkx, scipy, numpy
import numpy as np
import scipy.sparse as sp
A = nx.to_scipy_sparse_array(G, weight="length_m", format="csr")
print(f"Sparse matrix: {A.shape}, {A.nnz:,} non-zeros, {A.data.nbytes / 1e6:.1f} MB")
Batch tile processing. For national or continental extracts, pre-split the PBF with osmium extract using administrative boundary polygons, then merge subgraphs over spatial overlap zones. Maintain a 500 m buffer at each tile boundary to avoid severing ways that cross the split line.
Attribute precomputation. Compute travel_time_s at build time rather than per-query. Parse maxspeed strings, apply road-class fallback speeds, and write a float attribute to each edge. This eliminates per-query string parsing and aligns with the weight configuration patterns in configuring edge weights for freight logistics.
Spatial indexing for origin/destination snapping. Build an R-tree over node coordinates to accelerate nearest-node lookups. With shapely.STRtree, a 2-million-node graph resolves arbitrary lat/lon pairs in under 1 ms.
# Requires: shapely, numpy, networkx
from shapely.geometry import MultiPoint, Point
from shapely import STRtree
import numpy as np
node_ids = np.array(list(G.nodes()))
coords = np.array([(G.nodes[n]["lon"], G.nodes[n]["lat"]) for n in node_ids])
tree = STRtree([Point(x, y) for x, y in coords])
def snap_to_graph(lon: float, lat: float) -> int:
"""Return the OSM node ID nearest to the given WGS84 coordinate."""
idx = tree.nearest(Point(lon, lat))
return int(node_ids[idx])
Compiled-tool alternatives. For pure routing throughput at planetary scale, osrm-extract and valhalla_build_extract implement optimized C++ pipelines that outperform the Python handler by one to two orders of magnitude. The Python handler remains the right tool when you need custom attribute extraction, experimental cost functions, or integration with pandas/numpy analytical workflows.
Validation and Testing
After building and optimizing, run these checks in CI before any graph is promoted to production:
# Requires: networkx
def routing_sanity_checks(G: nx.DiGraph) -> None:
"""Raise on any structural property that would cause incorrect routing."""
# 1. Graph must be directed
assert isinstance(G, nx.DiGraph), "Graph must be nx.DiGraph, not undirected Graph"
# 2. Every node must have lat/lon attributes
missing_coords = [n for n in G.nodes() if "lat" not in G.nodes[n]]
assert not missing_coords, f"{len(missing_coords)} nodes missing coordinates"
# 3. Every edge must have a positive length
bad_lengths = [(u, v) for u, v, d in G.edges(data=True) if d.get("length_m", 0) <= 0]
assert not bad_lengths, f"{len(bad_lengths)} edges with non-positive length"
# 4. No self-loops — they cause infinite cost in Dijkstra implementations
loops = list(nx.selfloop_edges(G))
assert not loops, f"{len(loops)} self-loop edges found"
# 5. Largest WCC must cover at least 90% of nodes
wcc = list(nx.weakly_connected_components(G))
coverage = max(len(c) for c in wcc) / G.number_of_nodes()
assert coverage >= 0.90, f"Largest WCC only {coverage:.1%} — graph fragmented"
print("All routing sanity checks passed.")
routing_sanity_checks(G)
Pair these structural checks with a regression suite that replays known origin/destination pairs and asserts that computed path lengths stay within ±2% of a reference value across extract updates. The handling turn restrictions in routing graphs page extends this test pattern to restriction-aware path validation.
Troubleshooting
All node locations are invalid — way() drops every node
Root cause: apply_file() was called without locations=True. The internal node cache is never populated, so nd.location.valid() returns False for every member node.
Fix: Change the call to self.apply_file(pbf_path, locations=True). If you need to process the same file in multiple passes, use osmium.SimpleHandler with osmium.io.Reader and an explicit NodeLocationsForWays handler chain.
Motorway interchanges appear as dead-ends in the graph
Root cause: highway=motorway_link was omitted from the ROUTABLE set. Link roads connect motorway segments to each other and to the surface network; without them, every interchange is a dangling terminal node.
Fix: Ensure ROUTABLE includes motorway_link, trunk_link, primary_link, secondary_link, and tertiary_link. Re-run validation — weakly_connected_components count should drop significantly.
Memory spikes to 40+ GB on a regional PBF
Root cause: pyosmium’s internal location cache allocates a dense array indexed by OSM node ID. For large or ID-sparse files (e.g. merged extracts), the array can grow well beyond the node count of routable ways.
Fix: Pre-filter the PBF with osmium tags-filter to retain only routable highway ways and their referenced nodes before passing it to your handler:
osmium tags-filter region.osm.pbf w/highway -o routable.osm.pbf
This typically reduces the input to 5–20% of its original size and brings the location cache to a manageable footprint.
DiGraph has duplicate edges with conflicting attributes
Root cause: Two OSM ways share the same pair of consecutive node IDs (common at complex intersections and dual-carriageway segments modelled as parallel ways). The has_edge() guard in _add_edges() silently drops the second way’s attributes.
Fix: Switch to nx.MultiDiGraph if you need to preserve parallel edges, or implement a merge strategy that picks the lower-cost attribute (e.g. higher speed, lower restriction class) when a duplicate is detected.
Largest WCC coverage is below 90% even with complete link classes
Root cause: The PBF extract was clipped along a straight bounding box that severs ways crossing the boundary mid-segment. Clipped ways contain member nodes with no recorded coordinates, which triggers the return guard and drops the entire way.
Fix: Use osmium extract with the --strategy complete_ways flag to ensure all ways that intersect the extract boundary are written in full, including their out-of-bound node coordinates. For grid-tiled processing, see graph fragmentation prevention in OSM data for a boundary-stitching workflow.
Related
- Graph fragmentation prevention in OSM data — spatial partitioning and boundary stitching to keep routing continuity across tile edges
- Configuring edge weights for freight logistics — vehicle-class cost functions built on the
highway,surface, andmaxspeedattributes assembled here - Mapping node attributes for urban delivery zones — delivery constraints and access rules layered onto the directed graph topology
- Handling turn restrictions in routing graphs — OSM
restrictionrelations parsed and enforced as penalty edges on top of the directed base graph - Extracting OSM road networks with osmium CLI — command-line pre-filtering and clipping techniques that reduce PBF size before Python ingestion