Why does my largest SCC keep shrinking after each OSM extract update?

OSM editors occasionally remove or reclassify ways, converting previously routable segments to `access=private` or `highway=path`. Run a week-over-week component diff and log which node IDs left the SCC — usually 2-5 changed ways account for the whole shrinkage.

Snapping is connecting nodes on different road levels — how do I prevent this?

Filter on the `layer`, `bridge`, and `tunnel` tags before running the KD-tree query. Nodes with differing layer values should never be candidates for snapping regardless of spatial proximity.

What tolerance should I use for node snapping?

For geographic coordinates (EPSG:4326), work in a projected CRS (EPSG:3857 or a local UTM zone) and use 0.5–1.5 metres. Values above 2 metres risk merging nodes that represent distinct physical intersections, particularly in dense urban grids.

Graph Fragmentation Prevention in OSM Data

Q: My stress test failure rate is 0% inside the SCC but routing still fails in production — why?

Production failures usually come from vehicle-class constraints applied at query time: `maxweight`, `maxheight`, or `access=delivery` tags that filter out edges post-SCC construction. Rebuild the SCC using the same access filter your routing engine applies at query time.

Q: How do I handle ferry routes and other non-road segments without fragmenting the graph?

Include `route=ferry` relations as valid edges with a `ferry` highway class and a conservative speed (typically 15 km/h). Excluding them severs island and coastal networks entirely. If ferry travel is undesirable for a vehicle class, add a high penalty weight rather than removing the edge.

Graph fragmentation occurs when a routing network contains disconnected components, isolated nodes, or artificially severed edges that prevent continuous pathfinding. As part of the broader OSM Graph Architecture & Network Modeling pipeline, fragmentation prevention sits between raw OSM ingestion and engine handoff — it is the validation layer that determines whether the topology you built can actually answer routing queries. In production logistics and urban mobility systems, even minor topological gaps trigger routing failures, inflate delivery ETAs, or force fallback to suboptimal heuristic paths.

This guide documents a production-ready workflow for detecting and resolving fragmentation in OpenStreetMap-derived routing graphs. It targets logistics engineers, GIS developers, urban planners, and Python backend developers who need deterministic, connected networks for fleet dispatch, last-mile optimization, and spatial analytics.

Prerequisites

Install the required libraries into a Python 3.9+ environment before starting:

# pip install osmnx>=1.8 networkx>=3.0 shapely>=2.0 pandas>=2.0 geopandas>=0.14 scipy pyproj
import osmnx as ox
import networkx as nx
from shapely.geometry import box
from shapely.ops import transform
import pyproj
import numpy as np
from scipy.spatial import cKDTree
import pandas as pd

You also need:

An OSM PBF extract (regional or metropolitan) from Geofabrik, BBBike, or a local OSM API mirror
Understanding of directed vs. undirected graph semantics and how OSM tags (oneway, access, highway) translate to routing topology
Sufficient RAM for your target region — metropolitan networks with 500k+ edges will require 4–8 GB

Conceptual Architecture

The diagram below shows where fragmentation enters the pipeline and which repair steps address each source.

Each fragmentation source has a distinct remedy. Hard-clipping at administrative boundaries is addressed during ingestion; over-aggressive tag filtering is addressed during graph construction; micro-gap topology errors are addressed by spatial node snapping. The component audit step determines which remedies are actually needed before you apply them.

Step-by-Step Implementation

1. Boundary-Aware Data Ingestion

OSM extracts are often clipped to administrative boundaries. Hard clipping severs roads that cross municipal or regional borders, creating artificial dead-ends. Always apply a spatial buffer — typically 1–3 km — to your area of interest before extraction to capture connecting infrastructure.

# pip install osmnx>=1.8 shapely>=2.0 pyproj
import osmnx as ox
from shapely.geometry import box
from shapely.ops import transform
import pyproj

# Define bounding box for region of interest
minx, miny, maxx, maxy = -74.05, 40.65, -73.90, 40.80
bbox = box(minx, miny, maxx, maxy)

# Reproject to metric CRS, buffer 2 km, reproject back
project_to = pyproj.Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True).transform
project_from = pyproj.Transformer.from_crs("EPSG:3857", "EPSG:4326", always_xy=True).transform

buffered_polygon = transform(project_from, transform(project_to, bbox).buffer(2000))

# Download drivable network with buffer applied
G = ox.graph_from_polygon(buffered_polygon, network_type="drive", retain_all=False)
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")

Using a buffered polygon instead of a strict bounding box ensures that inter-regional highways and arterial connectors remain intact during the initial fetch.

2. Directed Graph Construction & Access Filtering

Raw OSM data must be parsed into a routing-compatible structure. This involves converting ways to directed edges, applying oneway logic, and filtering non-routable features — pedestrian-only paths, service roads tagged access=private, and ways with highway=construction. Building directed graphs from OSM PBF files establishes this baseline topology; fragmentation checks validate it.

When constructing directed graphs, explicitly handle access restrictions and tag-based filtering using vectorized Pandas operations rather than a Python loop over edges:

# pip install networkx>=3.0 pandas>=2.0
import networkx as nx
import pandas as pd

VALID_HIGHWAYS = {
    "motorway", "trunk", "primary", "secondary", "tertiary",
    "residential", "living_street", "service"
}
BLOCKED_ACCESS = {"private", "no", "customers", "permit"}

# Materialise edge attributes into a DataFrame for vectorised filtering
edge_rows = [
    {"u": u, "v": v, "k": k, **data}
    for u, v, k, data in G.edges(keys=True, data=True)
]
df = pd.DataFrame(edge_rows)

# Identify edges to drop
mask_bad_highway = ~df["highway"].isin(VALID_HIGHWAYS)
mask_bad_access  = df["access"].isin(BLOCKED_ACCESS)
bad_edges = df[mask_bad_highway | mask_bad_access][["u", "v", "k"]].values.tolist()

G_filtered = G.copy()
G_filtered.remove_edges_from((u, v, k) for u, v, k in bad_edges)
isolates = list(nx.isolates(G_filtered))
G_filtered.remove_nodes_from(isolates)
print(f"After filtering — Nodes: {G_filtered.number_of_nodes()}, Edges: {G_filtered.number_of_edges()}")

The critical risk here is over-filtering: removing service roads severs access to warehouse loading docks; removing living_street ways breaks last-100-metre delivery connectivity. Calibrate VALID_HIGHWAYS against your operational vehicle class before committing to a filter set.

3. Connected Component Analysis & SCC Validation

Once the graph is constructed, run a connected components algorithm to identify isolated subgraphs. In routing contexts, you expect one giant strongly connected component (SCC) representing the navigable core, plus minor peripheral fragments representing cul-de-sacs, parking lots, or temporary closures.

# networkx>=3.0 — uses optimised C-backed Tarjan SCC
import networkx as nx

# Weakly connected components (direction-agnostic)
wcc = list(nx.weakly_connected_components(G_filtered))
largest_wcc = max(wcc, key=len)

# Strongly connected components (direction-respecting)
scc = list(nx.strongly_connected_components(G_filtered))
largest_scc = max(scc, key=len)

total_nodes = G_filtered.number_of_nodes()
print(f"WCCs: {len(wcc)} | Largest WCC: {len(largest_wcc)} nodes")
print(f"SCCs: {len(scc)} | Largest SCC: {len(largest_scc)} nodes")
print(f"SCC coverage: {len(largest_scc) / total_nodes:.1%}")

# Surface the size distribution of small SCCs for triage
scc_sizes = sorted([len(c) for c in scc], reverse=True)
print(f"SCC size distribution (top 10): {scc_sizes[:10]}")

A healthy metropolitan drive network maintains a largest SCC ratio above 95% of total nodes. If the ratio drops below 90%, investigate bridge/tunnel connectivity, missing ferry relations, or aggressive tag filtering. The SCC size distribution tells you what you’re dealing with: dozens of 1-node SCCs typically indicate isolated parking nodes; a second SCC containing thousands of nodes usually signals a severed major arterial.

4. Topological Repair via Node Snapping

Fragmentation often stems from micro-gaps: nodes separated by less than 1 metre due to GPS drift, survey inaccuracies, or digitization errors. Spatial reconciliation bridges these gaps without distorting geometry. Project to a metric CRS before snapping so your tolerance is in metres, not decimal degrees.

# pip install numpy scipy pyproj
import numpy as np
from scipy.spatial import cKDTree
import pyproj
from functools import partial

def snap_isolated_nodes(
    G: nx.MultiDiGraph,
    tolerance_m: float = 1.0,
    source_crs: str = "EPSG:4326",
    metric_crs: str = "EPSG:3857"
) -> nx.MultiDiGraph:
    """
    Merge nodes within tolerance_m metres to close micro-gaps.
    Projects to metric_crs for distance computation, merges in place.
    """
    transformer = pyproj.Transformer.from_crs(source_crs, metric_crs, always_xy=True)

    # Build arrays in projected coordinates
    node_ids = []
    coords_proj = []
    for n, data in G.nodes(data=True):
        if "x" in data and "y" in data:
            x_m, y_m = transformer.transform(data["x"], data["y"])
            node_ids.append(n)
            coords_proj.append((x_m, y_m))

    node_ids = np.array(node_ids)
    coords_arr = np.array(coords_proj)

    tree = cKDTree(coords_arr)
    pairs = tree.query_pairs(tolerance_m)  # pairs of array indices

    merged = 0
    for i, j in pairs:
        n1, n2 = int(node_ids[i]), int(node_ids[j])
        if not G.has_node(n1) or not G.has_node(n2) or n1 == n2:
            continue
        # Skip nodes on different vertical levels (bridge/tunnel/layer tags)
        lvl1 = G.nodes[n1].get("layer", 0) or 0
        lvl2 = G.nodes[n2].get("layer", 0) or 0
        if lvl1 != lvl2:
            continue
        nx.contracted_nodes(G, n1, n2, self_loops=False, copy=False)
        merged += 1

    print(f"Snapped {merged} node pairs within {tolerance_m} m")
    return G

G_repaired = snap_isolated_nodes(G_filtered, tolerance_m=1.0)

The layer guard is essential for overpasses and grade-separated intersections. Without it, a node on a bridge deck will snap to the surface road running beneath it, generating a physically impossible connection that OSRM and Valhalla will silently accept but route incorrectly.

5. Attribute Normalization & Weight Calibration

A topologically connected graph is still functionally fragmented if edge weights misrepresent travel cost. Missing maxspeed values, inconsistent surface tags, and uncalibrated turn penalties create situations where paths exist but are computationally avoided — a form of implicit fragmentation that surface-level SCC analysis will not catch.

Standardize speed profiles using highway class defaults where maxspeed is absent, then convert to travel time in seconds. For heavy vehicle routing, configuring edge weights for freight logistics covers maxweight, maxheight, and axle load constraints that must be mapped to edge attributes as part of the same normalization pass.

# numpy vectorised over edge DataFrame — avoids Python loop over edges
import pandas as pd
import numpy as np

SPEED_DEFAULTS_KPH = {
    "motorway": 100, "trunk": 80, "primary": 60, "secondary": 50,
    "tertiary": 40, "residential": 30, "living_street": 20, "service": 15,
}

def normalize_edge_weights(G: nx.MultiDiGraph) -> nx.MultiDiGraph:
    rows = [
        {"u": u, "v": v, "k": k, "highway": data.get("highway", "residential"),
         "maxspeed": data.get("maxspeed"), "length": data.get("length", 0)}
        for u, v, k, data in G.edges(keys=True, data=True)
    ]
    df = pd.DataFrame(rows)

    # Parse maxspeed strings like "50 mph" or "80" — take first numeric token
    df["speed_raw"] = df["maxspeed"].astype(str).str.extract(r"(\d+\.?\d*)")[0].astype(float)
    df["speed_default"] = df["highway"].map(SPEED_DEFAULTS_KPH).fillna(30)
    df["speed_kph"] = df["speed_raw"].where(df["speed_raw"].notna(), df["speed_default"])

    # Travel time in seconds
    df["travel_time"] = (df["length"] / 1000.0) / df["speed_kph"] * 3600.0

    # Write back to graph in a single pass
    tt_map = df.set_index(["u", "v", "k"])["travel_time"].to_dict()
    for (u, v, k), tt in tt_map.items():
        G[u][v][k]["travel_time"] = tt

    return G

G_calibrated = normalize_edge_weights(G_repaired)

6. Pre-Deployment Stress Testing

Before exporting to a routing engine, run a synthetic pathfinding stress test. Generate random origin-destination pairs across the largest SCC and verify that fewer than 0.5% fail to resolve within an acceptable compute time. Sample from the SCC only — not the full node set — so you isolate routing logic from known disconnected periphery.

# networkx>=3.0
import random

def stress_test_routing(G: nx.MultiDiGraph, samples: int = 500, weight: str = "travel_time") -> dict:
    """Returns a dict with failure_rate, avg_path_length, and a list of failing OD pairs."""
    largest_scc = max(nx.strongly_connected_components(G), key=len)
    nodes = list(largest_scc)
    if len(nodes) < 2:
        return {"failure_rate": 1.0, "avg_path_length": None, "failures": []}

    failures = []
    path_lengths = []
    for _ in range(samples):
        src, dst = random.sample(nodes, 2)
        try:
            path = nx.shortest_path(G, src, dst, weight=weight)
            path_lengths.append(len(path))
        except nx.NetworkXNoPath:
            failures.append((src, dst))

    result = {
        "failure_rate": len(failures) / samples,
        "avg_path_length": np.mean(path_lengths) if path_lengths else None,
        "failures": failures[:10],  # keep a sample for diagnosis
    }
    print(f"Routing failure rate: {result['failure_rate']:.4%}")
    print(f"Avg path length (nodes): {result['avg_path_length']:.1f}")
    return result

report = stress_test_routing(G_calibrated, samples=500)

A failure rate above 0.5% inside the SCC indicates residual structural fragmentation — typically unidirectional bottlenecks created by oneway=yes on a connector road with no return path, or improperly inverted turn restrictions. Once validated, serialize with nx.write_graphml() or convert to PBF for engine ingestion.

Configuration Reference

Parameter	Default	Range	Notes
`buffer_dist` (metres)	2000	500–5000	Larger buffers increase download size; 2 km is appropriate for metropolitan networks
`tolerance_m` (node snap)	1.0	0.5–1.5	Values above 2 m risk merging distinct physical intersections in dense urban grids
`SCC coverage threshold`	95%	90–99%	Drop below 90% → investigate; below 80% → ingestion or filter error
`Stress test sample size`	500	200–5000	500 gives ±3% margin; use 2000+ for networks above 1M edges
`Stress test failure threshold`	0.5%	0–2%	Freight networks with heavy access filtering tolerate up to 1.5%
`speed_default` — residential	30 kph	20–50	Override with local road authority data when available
`speed_default` — motorway	100 kph	80–130	EU/US split: 130 kph on French autoroutes, 65 mph on US interstates

Production Optimization and Scaling

For networks above 500k edges, the per-edge Python loop in weight normalization becomes a bottleneck. The vectorized DataFrame approach in Step 5 runs 20–40x faster for typical metropolitan graphs, but for continental-scale processing consider:

Compiled-tool pre-processing with osmium. Use osmium tags-filter to apply highway and access filters before the data ever enters Python. This reduces the working set by 60–80% for drive-only networks and makes the NetworkX ingestion step manageable on a single machine.

# Shell pre-processing — not Python, but saves substantial RAM
osmium tags-filter input.osm.pbf \
  nwr/highway=motorway,trunk,primary,secondary,tertiary,residential,living_street,service \
  -o filtered.osm.pbf --overwrite

Parallelized component analysis. For directed graphs above 2M nodes, NetworkX’s SCC implementation becomes memory-constrained. Consider igraph (via python-igraph) or cuGraph on GPU hardware. Both expose SCC APIs compatible with NetworkX adjacency list input.

KD-tree batching for large snapping jobs. The cKDTree.query_pairs() call returns all pairs at once; for very dense networks this can return millions of candidate pairs. Partition the node array spatially (by grid cell or quadtree bucket) and run snapping tile by tile to bound peak memory.

Validation and Testing

After running the repair pipeline, validate the output against these quantitative targets before passing the graph to a routing engine:

SCC coverage >= 95% of total nodes for metropolitan drive networks
WCC count == 1 for the primary network region (multiple WCCs indicate severed arterials, not just peripheral cul-de-sacs)
Zero edges with travel_time == NaN or <= 0 — a sign of length=0 geometry errors or malformed maxspeed strings
Stress test failure rate < 0.5% sampled from SCC nodes with the same access filter the production engine applies
No node pairs within 0.5 m remaining after snapping — re-run cKDTree.query_pairs(0.5) and assert the result is empty
Week-over-week SCC delta < 2% when updating from a new OSM extract — larger drops indicate a mapper changed a key way classification

Embed checks 1–4 in a CI job that runs on every fresh OSM extract. Version-control graph snapshots alongside extract timestamps so you can bisect regressions to specific OSM changesets.

Troubleshooting

SCC coverage keeps shrinking after weekly OSM extract updates

OSM editors occasionally reclassify ways — converting previously routable segments to access=private or highway=path. Run a week-over-week node diff against the SCC sets:

# Compare two serialised SCC sets
prev_scc = set(map(int, open("scc_prev.txt").read().split()))
curr_scc = set(map(int, open("scc_curr.txt").read().split()))
lost = prev_scc - curr_scc
print(f"Nodes lost from SCC: {len(lost)}")
# Then inspect OSM history for the OSM IDs in `lost`

Usually 2–5 changed ways account for the whole shrinkage. Retrieve the OSM node history via the OSM API (/api/0.6/node/{id}/history) to identify the offending edit.

Node snapping is connecting nodes on different road levels

Filter on the layer, bridge, and tunnel tags before running the KD-tree query. The snap_isolated_nodes function in Step 4 includes this guard via the lvl1 != lvl2 check. If you adapted the function and removed it, add it back. Also verify that OSM source data carries the layer tag on all elevated structures — in some regional extracts these are omitted and must be inferred from bridge=yes / tunnel=yes presence.

Stress test passes but production routing still fails

Production failures typically come from vehicle-class constraints applied at query time: maxweight, maxheight, or access=delivery tags that filter out edges post-SCC construction. Rebuild the SCC using the same access filter your routing engine applies at query time. For handling turn restrictions in routing graphs, the restriction relations may make certain source-destination combinations unreachable even within the SCC — test with the specific vehicle profile active.

Ferry routes and water crossings are fragmenting coastal or island networks

Include route=ferry relations as valid edges with a synthetic ferry highway class and a conservative speed (15 km/h). Excluding them severs island and coastal networks entirely. If ferry travel is undesirable for a vehicle class, add a high penalty weight rather than removing the edge — a 10x cost multiplier makes ferries a last resort while preserving graph connectivity.

Weight normalization produces travel_time = 0 or NaN on some edges

This usually means length=0 geometry errors or malformed maxspeed strings (e.g. "RO:urban" regional codes). The vectorized approach in Step 5 parses the first numeric token, which will return NaN for regional speed codes. Post-normalization, run df[df["travel_time"] <= 0] to surface problem edges, then apply a highway-class fallback for any remaining NaN values with fillna().

Building directed graphs from OSM PBF files with osmium and pyosmium — the upstream topology construction step this page validates
Configuring edge weights for freight logistics — domain-specific weight calibration for maxweight, axle limits, and vehicle-class routing
Handling turn restrictions in routing graphs using OSM relation tags — restriction parsing that can induce functional fragmentation even in topologically connected networks
Mapping node attributes for urban delivery zones — access zone and time-window attributes that affect last-mile reachability
Extracting OSM road networks with osmium for production pipelines — low-level extraction tooling that precedes the connectivity audit

OSM Graph Architecture & Network Modeling