Fusing public transit schedules with OSM road and pedestrian infrastructure into a single routable graph is one of the hardest graph-engineering problems in geospatial routing. The challenge is not simply overlaying two datasets: it requires constructing a unified directed graph where road segments, rail corridors, pedestrian paths, and schedule-driven transit services share consistent impedance semantics and well-defined transfer behaviour. This page sits within the OSM Graph Architecture & Network Modeling reference, specifically covering the integration layer that connects static infrastructure graphs with dynamic, time-dependent transit schedules.
The result of this work feeds directly into query-time routing engines. Without correct transfer node construction and time-dependent edge weights, a multi-modal path query will either ignore transit entirely or produce physically impossible itineraries — a bus that departs before the pedestrian leg from the origin completes, or a transfer that ignores a closed fare gate. Getting these semantics right at the graph-construction stage is far cheaper than patching them in the query engine.
Prerequisites
Ensure the following before assembling the transit layer:
- Python: 3.10 or later (3.11+ recommended for
tomlliband improvedmatchergonomics in routing logic). - Libraries:
networkx>=3.0,osmnx>=1.5,pandas>=2.0,geopandas>=0.14,shapely>=2.0,scipy>=1.11(for KD-tree snapping). - Data sources: A validated OSM PBF regional extract and one or more GTFS feeds (static schedule, not GTFS-RT for this workflow). GTFS-RT real-time feeds are layered on top after the static graph is stable.
- Graph foundation: A directed, weighted
networkx.MultiDiGraphwhere each edge carrieslength_m,speed_kph,travel_time_sec, andmodeattributes. If starting from raw PBF data, follow the workflow in building directed graphs from OSM PBF files to ensure consistent node indexing and topology validation before attempting GTFS integration. - Coordinate reference system: All spatial operations must share a single projected CRS (EPSG:3857 or a local UTM zone). CRS mismatches are the leading cause of stop-snapping failures and silent distance errors.
Install the core stack:
# pip install networkx>=3.0 osmnx>=1.5 pandas>=2.0 geopandas>=0.14 shapely>=2.0 scipy>=1.11
Conceptual Architecture
A multi-modal routing graph has three distinct edge classes:
| Edge class | Source | Weight type | Direction |
|---|---|---|---|
| Static road/pedestrian | OSM PBF | Fixed travel_time_sec |
Directed per oneway tag |
| Transit | GTFS stop_times.txt |
Time-dependent (departure_sec) |
Directed stop → stop |
| Transfer | Spatial snap (OSM ↔ GTFS) | Fixed transfer_penalty_sec + walk |
Bidirectional |
The key invariant is temporal isolation: transit edges carry absolute schedule times that are only meaningful at query time. Do not pre-combine them with static weights. Instead, the routing algorithm evaluates transit edge costs dynamically by comparing its current clock value against each candidate departure, adding waiting time to travel time.
Transfer nodes are the critical seam. Each GTFS stop must snap to exactly one OSM node within a configurable distance tolerance. The transfer edge between them encodes the pedestrian walking time plus any platform overhead (fare gates, stairs). Getting these two values right per stop type determines whether the final itineraries are realistic or physically impossible.
Step-by-Step Implementation
Step 1: Initialize the Base Infrastructure Graph
Load the OSM road and pedestrian network as a directed graph. Include highway=footway, highway=path, highway=cycleway, and all drivable types. Filter out access=private and access=no edges that would block pedestrian-transit transfers.
# requires: osmnx>=1.5, networkx>=3.0
import osmnx as ox
import networkx as nx
def load_base_graph(place_query: str, modes: list[str] | None = None) -> nx.MultiDiGraph:
"""
Load a multi-modal OSM graph with road + pedestrian edges.
modes: list of osmnx network_type values to merge, e.g. ['drive', 'walk']
"""
if modes is None:
modes = ["drive", "walk"]
graphs = [
ox.graph_from_place(place_query, network_type=mode, retain_all=True)
for mode in modes
]
# compose preserves all node/edge attributes from both graphs
G = graphs[0]
for g in graphs[1:]:
G = nx.compose(G, g)
# Enforce required edge attributes with typed defaults
for u, v, k, data in G.edges(data=True, keys=True):
data.setdefault("mode", "road")
length_m = data.get("length", 0.0)
speed_kph = data.get("speed_kph", 50.0)
data["travel_time_sec"] = (length_m / 1000.0) / speed_kph * 3600.0 if speed_kph > 0 else 9999.0
return G
Enforce strict attribute validation after loading. Edges missing length or speed_kph must receive classified defaults (e.g., highway=residential → 30 kph) rather than being silently dropped. Silent drops cause graph fragmentation that only surfaces as routing failures at query time.
Step 2: Ingest and Normalize GTFS Data
Parse stops.txt, routes.txt, trips.txt, and stop_times.txt. The critical detail is GTFS time encoding: stop_times.txt uses HH:MM:SS with hours that can exceed 23 to express overnight trips (25:30:00 = 1:30 AM next service day). Convert to seconds since midnight using vectorized arithmetic — never string-parse row-by-row.
# requires: pandas>=2.0, geopandas>=0.14, shapely>=2.0
import pandas as pd
import geopandas as gpd
from pathlib import Path
def load_gtfs_feed(gtfs_dir: str | Path) -> dict[str, pd.DataFrame | gpd.GeoDataFrame]:
"""
Load and normalize a GTFS static feed from a directory of .txt files.
Returns dict keyed by table name.
"""
d = Path(gtfs_dir)
tables = {}
for name in ["stops", "routes", "trips", "stop_times"]:
tables[name] = pd.read_csv(d / f"{name}.txt", dtype=str)
# Normalize stop geometries to GeoDataFrame
stops = tables["stops"].copy()
stops = stops.dropna(subset=["stop_lat", "stop_lon"])
stops["stop_lat"] = pd.to_numeric(stops["stop_lat"], errors="coerce")
stops["stop_lon"] = pd.to_numeric(stops["stop_lon"], errors="coerce")
stops = stops.dropna(subset=["stop_lat", "stop_lon"])
tables["stops"] = gpd.GeoDataFrame(
stops,
geometry=gpd.points_from_xy(stops["stop_lon"], stops["stop_lat"]),
crs="EPSG:4326"
)
# Vectorized time normalization: HH:MM:SS → seconds since midnight
# Hours > 23 are intentional in GTFS — do NOT subtract 86400
st = tables["stop_times"].copy()
for col in ["arrival_time", "departure_time"]:
if col not in st.columns:
continue
parts = st[col].str.split(":", expand=True).astype(int)
st[f"{col}_sec"] = parts[0] * 3600 + parts[1] * 60 + parts[2]
tables["stop_times"] = st
return tables
After loading, drop stops with missing coordinates and validate that every stop_id referenced in stop_times.txt has a matching entry in stops.txt. Orphaned stop references indicate a corrupted feed and will produce disconnected transfer nodes.
Step 3: Construct Transit Edges and Transfer Nodes
This step has two distinct sub-tasks: building directed trip edges between consecutive stops, and spatially snapping each stop to its nearest OSM node to create transfer edges.
# requires: networkx>=3.0, scipy>=1.11, numpy>=1.24
import networkx as nx
import numpy as np
from scipy.spatial import KDTree
def add_transit_edges(G: nx.MultiDiGraph, tables: dict) -> nx.MultiDiGraph:
"""
Add GTFS-derived transit edges and transfer (snap) edges to G in place.
Transfer tolerance is 150 m (projected, metres).
"""
TRANSFER_TOLERANCE_M = 150.0
TRANSFER_PENALTY_SEC = 180.0 # walking + platform overhead baseline
stops_gdf = tables["stops"].to_crs(epsg=3857) # project for metre-accurate distances
stop_times = tables["stop_times"].sort_values(["trip_id", "stop_sequence"])
# --- Build OSM node KD-tree ---
osm_nodes = list(G.nodes(data=True))
# osmnx nodes have 'x' (lon) and 'y' (lat) in WGS 84; project to 3857
from pyproj import Transformer
xform = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
node_ids, node_coords = [], []
for nid, attrs in osm_nodes:
if "x" in attrs and "y" in attrs:
px, py = xform.transform(attrs["x"], attrs["y"])
node_ids.append(nid)
node_coords.append((px, py))
kd = KDTree(node_coords)
# --- Snap each stop to nearest OSM node ---
stop_to_osm: dict[str, int] = {}
stop_xy = np.column_stack([stops_gdf.geometry.x.values, stops_gdf.geometry.y.values])
dists, idxs = kd.query(stop_xy, workers=-1)
for i, (sid, dist, idx) in enumerate(
zip(stops_gdf["stop_id"].values, dists, idxs)
):
if dist <= TRANSFER_TOLERANCE_M:
osm_node = node_ids[idx]
stop_to_osm[sid] = osm_node
# Add transit stop as a graph node
G.add_node(f"gtfs:{sid}", mode="transit_stop", stop_id=sid)
# Bidirectional transfer edge
walk_sec = dist / 1.2 # 1.2 m/s pedestrian speed
penalty = TRANSFER_PENALTY_SEC + walk_sec
G.add_edge(osm_node, f"gtfs:{sid}",
mode="transfer", transfer_penalty_sec=penalty,
travel_time_sec=penalty, weight=penalty)
G.add_edge(f"gtfs:{sid}", osm_node,
mode="transfer", transfer_penalty_sec=penalty,
travel_time_sec=penalty, weight=penalty)
# --- Add transit (trip) edges ---
grouped = stop_times.groupby("trip_id")
for trip_id, trip_df in grouped:
rows = trip_df.reset_index(drop=True)
for i in range(len(rows) - 1):
dep_stop = rows.loc[i, "stop_id"]
arr_stop = rows.loc[i + 1, "stop_id"]
dep_sec = rows.loc[i, "departure_time_sec"]
arr_sec = rows.loc[i + 1, "arrival_time_sec"]
travel = arr_sec - dep_sec
if travel < 0 or dep_stop not in stop_to_osm or arr_stop not in stop_to_osm:
continue # skip data errors
G.add_edge(
f"gtfs:{dep_stop}", f"gtfs:{arr_stop}",
mode="transit",
trip_id=trip_id,
route_id=rows.loc[i].get("route_id", ""),
departure_sec=dep_sec,
arrival_sec=arr_sec,
travel_time_sec=travel,
weight=travel,
)
return G
Avoid the common mistake of adding transfer edges for every OSM node within the tolerance radius. One stop → one nearest node prevents quadratic edge counts that will collapse query performance.
Step 4: Apply Time-Dependent Impedance at Query Time
Static Dijkstra cannot correctly route across transit edges because the edge cost depends on when you arrive at the origin stop — you may have to wait for the next departure. Implement a modified priority-queue traversal where the state tuple includes current_time.
# requires: heapq (stdlib), networkx>=3.0
import heapq
from typing import Optional
def time_dependent_dijkstra(
G: nx.MultiDiGraph,
source: int,
target: int,
departure_time_sec: int,
) -> tuple[float, list]:
"""
Dijkstra with time-dependent transit edge costs.
State: (cumulative_cost, current_time_sec, node_id)
Returns (total_cost, path_node_list).
"""
dist = {source: 0.0}
prev: dict[int, Optional[int]] = {source: None}
# heap: (cost, current_time, node)
heap = [(0.0, departure_time_sec, source)]
while heap:
cost, t_now, u = heapq.heappop(heap)
if cost > dist.get(u, float("inf")):
continue
if u == target:
break
for _, v, data in G.out_edges(u, data=True):
mode = data.get("mode", "road")
if mode == "transit":
dep = data.get("departure_sec", 0)
if dep < t_now:
continue # missed this departure; skip (next departure handled by parallel edge)
wait = dep - t_now
edge_cost = wait + data.get("travel_time_sec", 0.0)
t_next = dep + data.get("travel_time_sec", 0.0)
else:
edge_cost = data.get("travel_time_sec", data.get("weight", 0.0))
t_next = t_now + edge_cost
new_cost = cost + edge_cost
if new_cost < dist.get(v, float("inf")):
dist[v] = new_cost
prev[v] = u
heapq.heappush(heap, (new_cost, t_next, v))
# Reconstruct path
path, node = [], target
while node is not None:
path.append(node)
node = prev.get(node)
return dist.get(target, float("inf")), list(reversed(path))
For freight and logistics scenarios, configuring edge weights that reflect loading/unloading constraints and vehicle height restrictions must happen before this traversal, not inside it. Keep the routing loop lean.
Configuration Reference
| Parameter | Default | Tuning notes |
|---|---|---|
TRANSFER_TOLERANCE_M |
150 | Lower to 50 m in dense urban grids; raise to 250 m for rural stops |
TRANSFER_PENALTY_SEC |
180 | Add 60–120 s per fare gate; subtract for at-grade open stops |
| Pedestrian walk speed | 1.2 m/s | Reduce to 0.8 m/s for accessibility-sensitive routing |
stop_times sort key |
["trip_id", "stop_sequence"] |
Never sort by time alone — out-of-order sequences occur in real feeds |
KD-tree workers |
-1 (all CPUs) | Set to 1 for reproducible profiling; -1 for production batch snapping |
| Transit edge mode tag | "transit" |
Keep consistent — the query loop branches on this value |
| Transfer edge mode tag | "transfer" |
Separate from "transit" so penalty logic is not double-counted |
| Max wait window | None (unbounded) | Add a MAX_WAIT_SEC = 3600 guard in the traversal to prune unrealistic waits |
Production Optimization and Scaling
Transit layers multiply node counts by 3–5× when stop nodes, transfer edges, and per-departure edges are all materialized. Strategies to contain memory and latency:
Lazy time-dependent weights. Do not materialize a fully time-expanded graph (one node per stop per departure). Instead, store multiple transit edges between the same stop pair (one per trip) and resolve costs during traversal as shown above. Memory scales with trip count, not trip count × time horizon.
Edge attribute stripping. At graph-build time, remove attributes not needed for routing. A production routing graph should only carry mode, travel_time_sec, weight, transfer_penalty_sec, and departure_sec in memory. Strip route_short_name, trip_headsign, and similar display fields — store them in a separate lookup table keyed by trip_id.
Spatial indexing for dynamic queries. Keep a scipy.spatial.KDTree of stop coordinates loaded alongside the graph so that origin/destination stops provided as lat/lon at query time can snap in under 1 ms.
Compiled routing alternative. For high-throughput deployments, consider OpenTripPlanner as a compiled Java engine that natively handles GTFS + OSM with contraction hierarchies. The NetworkX implementation above is correct for moderate-scale analysis and prototyping; OTP or Valhalla’s multi-modal costing configuration is more appropriate above ~10,000 routing queries per minute.
Feed refresh and cache invalidation. Cache frequent origin-destination pairs using a TTL-based store (Redis or DynamoDB). Invalidate the cache on GTFS feed updates, which typically occur weekly for most agencies. Use a blue-green graph swap: build the updated graph into a secondary memory slot and atomically reassign the routing handle, keeping the old graph live until the new one passes validation.
Validation and Testing
After building the combined graph, run this validation sequence before accepting it for production:
Connectivity checks. Every GTFS stop node must have at least one transfer edge to the base network. Isolated stops indicate snapping tolerance misconfiguration or missing pedestrian paths (highway=footway) in the OSM extract.
# requires: networkx>=3.0
def audit_transit_stops(G: nx.MultiDiGraph) -> list[str]:
"""Return stop_ids whose gtfs: nodes have no transfer edges."""
orphans = []
for node in G.nodes:
if not str(node).startswith("gtfs:"):
continue
neighbors = list(G.predecessors(node)) + list(G.successors(node))
transfer_nbrs = [n for n in neighbors if not str(n).startswith("gtfs:")]
if not transfer_nbrs:
orphans.append(node)
return orphans
Temporal consistency. Within every trip, arrival_time_sec at stop N must be ≤ departure_time_sec at stop N. Across consecutive stops, departure_time_sec[i] <= arrival_time_sec[i+1]. Flag negative travel times as feed corruption, not as overnight crossings (overnight crossings are already handled by the hours > 23 encoding).
Routing benchmarks. Run 200–500 randomized origin-destination pairs across different modes and departure times. Compare total travel times against a reference (e.g., the GTFS-based OTP API) for the same city/region. Deviations greater than 20% indicate transfer penalty misconfiguration or a missed departure edge.
Memory profiling. Use tracemalloc or memory_profiler to confirm that adding the transit layer does not push the process past your deployment memory ceiling. A transit graph for a medium European city (2 million OSM nodes, 3 GTFS feeds) typically requires 4–8 GB RAM in networkx.MultiDiGraph format.
Troubleshooting
Transit stops fail to snap — all stops return as orphans
Root cause: CRS mismatch. GTFS coordinates are in WGS 84 (EPSG:4326) but the KD-tree is built from unprojected OSM node coordinates queried in degrees rather than metres.
Fix: Project both the stop GeoDataFrame and the OSM node coordinates to a metric CRS (e.g., EPSG:3857) before building the KD-tree. Check with stops_gdf.crs — it must not be None or EPSG:4326 at the point the KD-tree is queried.
Routing returns "no path found" for valid transit OD pairs
Root cause: The base graph and the transit stop nodes are disconnected. Transfer edges were added to nodes that exist in G.nodes but the base graph was reloaded or replaced after snapping, losing those edges.
Fix: Always add transfer edges to the same G object that will be used for routing. If you serialize and reload the graph, use networkx.readwrite.gpickle or msgpack to preserve all node/edge attributes including the "gtfs:" prefixed transit nodes.
Routing latency spikes above 2 seconds for city-scale graphs
Root cause: Too many transit edges per stop pair. If multiple trips share the same stop sequence, they each add a parallel edge — the traversal examines all of them at every node expansion.
Fix: First strip non-routing edge attributes. Then add a MAX_WAIT_SEC guard in the traversal (e.g., 3600 s) to prune departures too far in the future. For sustained high-throughput, pre-index departures per stop pair in a sorted list and binary-search for the next valid departure rather than scanning all parallel edges.
GTFS stop_times with hours > 23 produce negative travel times
Root cause: Code that incorrectly subtracts 86,400 seconds from times with hours ≥ 24, reversing their sequence position.
Fix: Do not subtract 86,400. GTFS uses 25:30:00 deliberately to express 1:30 AM the next service day as a value that is monotonically greater than the preceding 25:00:00 departure within the same trip. The integer arithmetic hours * 3600 + minutes * 60 + seconds is correct as-is and produces an increasing sequence.
Transfer penalties produce unrealistically long itineraries at major stations
Root cause: A single fixed TRANSFER_PENALTY_SEC value is applied to all stops regardless of station complexity. A simple at-grade bus stop and a three-level underground interchange are treated identically.
Fix: Classify stops by stop_location_type from stops.txt (0 = platform, 1 = station, 2 = entrance, 3 = generic node, 4 = boarding area). Apply tiered penalties: 60–120 s for platforms, 180–300 s for station entrances, 360–480 s for large interchange stations. Store the tier in the transfer edge attribute at snapping time.
Related
- Building directed graphs from OSM PBF files — foundational node indexing and edge topology required before GTFS integration
- Configuring edge weights for freight logistics — conditional impedance patterns that extend to multi-modal penalty matrices
- Graph fragmentation prevention in OSM data — disconnection diagnostics that apply equally to hybrid OSM/GTFS graphs
- Valhalla configuration for multi-modal analysis — compiled-engine alternative for high-throughput multi-modal queries
- Mapping node attributes for urban delivery zones — stop classification and zone-level access constraints that interact with transit transfer logic