technical report
This project uses public City of Vancouver water main and 311 service request data to predict which pipe segments are most likely to leak in the next year. The final output is an interactive web map that visualizes predicted 2026 leak risk for individual pipe segments.
The data pipeline began by importing records from City of Vancouver API endpoints into a PostgreSQL database running inside Docker Desktop. Python was used to request the raw API data and store each record as JSON, while SQL was used to join, filter, clean, and reshape the datasets into a machine-learning-ready format.
The workflow included joining the transmission and distribution water main datasets, cleaning and combining 311 water leak reports, matching reported leak locations to nearby water mains using coordinate information, and transforming the final result into a pipe-year dataset. Models were then trained using a time-based split and compared across multiple boosting algorithms, and the final predictions were exported to a GeoJSON file for visualization.
Water mains are a core part of municipal infrastructure. They move treated drinking water through the city and help supply connected systems such as hydrants, sprinklers, homes, and businesses. When a water main leaks, the consequences can extend beyond water loss.
Water main: A pipe that carries treated water through the municipal water system. Water mains connect to service lines, buildings, hydrants, and other water infrastructure.
Transmission main: A larger pipe that moves water across longer distances or between major parts of the water network. These mains typically carry higher volumes of water.
Distribution main: A local pipe that distributes water through neighbourhoods and to service connections, buildings, hydrants, and other local users.
Backflow: The unwanted reverse movement of water or contaminants into the drinking water system. Backflow preventers are installed on sprinkler systems because pressure drops when sprinklers activate. Water in sprinkler systems is extremely dirty because it can sit inside the pipes for many years. Leaks, pressure drops, fire events, or earthquake damage can increase backflow risk if protective systems are overwhelmed.
Sprinklers, hydrants, and drinking water connections are all connected to the broader water main system. If a main leaks or loses pressure, hydrants and sprinklers may have reduced pressure during an emergency. A leak can also create water quality concerns if dirt, debris, rust, or contaminants enter damaged infrastructure.
Which water main pipe segments are predicted to leak in the next year?
The goal was to build a model using publicly available data that estimates the probability that each pipe segment will leak in the following year.
The project used public City of Vancouver API datasets. The main sources were water distribution mains, water transmission mains, historical 311 water leak service requests, and recent 311 water leak service requests.
# Raw tables and the API endpoints that feed them (URLs abbreviated).
DATASETS = [
    {"table_name": "raw_distribution_mains", "url": "...water-distribution-mains..."},
    {"table_name": "raw_transmission_mains", "url": "...water-transmission-mains..."},
    {"table_name": "raw_311_water_leaks_2009_2021", "url": "...3-1-1-service-requests-2009-2021..."},
    {"table_name": "raw_311_water_leaks_2021_present", "url": "...3-1-1-service-requests..."},
]
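The import step can be illustrated with a minimal sketch. It assumes the API returns records in pages addressed by an offset and a page size, which is an assumption rather than something stated in this report. The names `iter_records`, `to_json_rows`, and `fake_fetch` are hypothetical, and the HTTP call is replaced by a stand-in so the example runs without network access.

```python
import json
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int, int], list],
                 page_size: int = 100) -> Iterator[dict]:
    """Yield every record from a paginated source until a short page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # last (possibly empty) page reached
            break
        offset += page_size


def to_json_rows(records) -> list:
    """Serialize each record as one JSON string, ready for a parameterized INSERT."""
    return [(json.dumps(rec, sort_keys=True),) for rec in records]


# Stand-in for a real HTTP request: serves 250 fake records in pages.
_FAKE = [{"id": i} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> list:
    return _FAKE[offset:offset + limit]


rows = to_json_rows(iter_records(fake_fetch, page_size=100))
```

Each tuple in `rows` holds one JSON string, which is the shape a database driver expects when inserting one raw record per row into a JSON column.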
The transmission and distribution water main datasets were unioned into one water main table. The 311 datasets were also unioned into one leak table. The 311 timestamps were simplified to leak year to make the data easier to join, manipulate, and train on.
Each water main was associated with a pipe ID. The leak reports were matched to water mains using coordinate data and GIS functions. This created a link between point-like 311 leak reports and line-like pipe geometries.
-- example spatial workflow
CREATE INDEX idx_pipe_geom
ON main
USING GIST (geom_line_utm);
CREATE INDEX idx_leak_geom
ON leakdata
USING GIST (geom_point_utm);
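The matching idea behind the spatial workflow can be sketched in plain Python. This is an illustration only: the project itself used GIS functions in the database, and the exact functions and distance cutoff are not given in this report. The sketch assumes projected coordinates in metres (as the `_utm` column names suggest), and the function names and the 30 m cutoff are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b; all are (x, y) tuples in metres."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_pipe(leak_xy, pipes, max_distance_m=30.0):
    """Return the id of the closest pipe within max_distance_m, else None."""
    best_id, best_dist = None, max_distance_m
    for pipe_id, coords in pipes.items():
        for a, b in zip(coords, coords[1:]):
            d = point_segment_distance(leak_xy, a, b)
            if d <= best_dist:
                best_id, best_dist = pipe_id, d
    return best_id


# Two hypothetical parallel pipes, 50 m apart.
pipes = {
    "P1": [(0.0, 0.0), (100.0, 0.0)],
    "P2": [(0.0, 50.0), (100.0, 50.0)],
}
match = nearest_pipe((40.0, 10.0), pipes)
```

A brute-force loop like this checks every pipe for every leak. The GIST indexes above exist so the database can avoid that and only examine nearby geometries.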
The final dataset was transformed so that each pipe had one row for each year from 2015 to 2026. A target column was created to represent whether the pipe leaked in the following year.
CASE
    WHEN LEAD(leaked_that_year) OVER (
        PARTITION BY pipe_id
        ORDER BY analysis_year
    ) = 1
    THEN 1
    ELSE 0
END AS leaks_next_year
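The same labelling logic can be sketched in Python for readers who prefer it to the window function. The function name is hypothetical, and one behaviour differs on purpose: the SQL above assigns 0 to the last year of each pipe because `LEAD` returns NULL there, while this sketch returns `None` to make explicit that the outcome for the final year is not yet known.

```python
def add_next_year_target(rows):
    """Attach leaks_next_year (1, 0, or None when unknown) to each pipe-year row.

    rows: list of dicts with pipe_id, analysis_year, and leaked_that_year.
    """
    leaked = {(r["pipe_id"], r["analysis_year"]): r["leaked_that_year"] for r in rows}
    out = []
    for r in sorted(rows, key=lambda r: (r["pipe_id"], r["analysis_year"])):
        nxt = leaked.get((r["pipe_id"], r["analysis_year"] + 1))
        target = None if nxt is None else int(nxt == 1)
        out.append({**r, "leaks_next_year": target})
    return out


rows = [
    {"pipe_id": "A", "analysis_year": 2023, "leaked_that_year": 0},
    {"pipe_id": "A", "analysis_year": 2024, "leaked_that_year": 1},
    {"pipe_id": "A", "analysis_year": 2025, "leaked_that_year": 0},
]
labeled = add_next_year_target(rows)
```

In this example the 2023 row is labelled 1 because the pipe leaked in 2024, the 2024 row is labelled 0, and the 2025 row has no label.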
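The data was split by year rather than randomly. This matters because the task is time-dependent: the model should learn from past pipe-year records and predict future leak risk. A random split would let the model train on records from later years than the ones it is evaluated on, which would inflate the measured performance.
Training set: historical pipe-year rows used to fit the model.
Validation set: a later year used to tune thresholds and compare models.
Test set: a held-out year used to estimate final performance.
Scoring set: 2025 pipe records scored to estimate 2026 leak risk.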
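A minimal sketch of a year-based split follows. The report only states that 2025 rows were scored, so the validation and test years used here (2023 and 2024) are assumptions for illustration, and `split_by_year` is a hypothetical helper.

```python
def split_by_year(rows, valid_year, test_year, score_year):
    """Partition pipe-year rows by analysis_year into four time-ordered sets."""
    if not valid_year < test_year < score_year:
        raise ValueError("expected valid_year < test_year < score_year")
    splits = {"train": [], "valid": [], "test": [], "score": []}
    for row in rows:
        year = row["analysis_year"]
        if year < valid_year:
            splits["train"].append(row)   # everything before validation
        elif year == valid_year:
            splits["valid"].append(row)
        elif year == test_year:
            splits["test"].append(row)
        elif year == score_year:
            splits["score"].append(row)
    return splits


# One hypothetical pipe with a row for each year from 2015 to 2025.
rows = [{"pipe_id": "A", "analysis_year": y} for y in range(2015, 2026)]
splits = split_by_year(rows, valid_year=2023, test_year=2024, score_year=2025)
```

Because every training row predates every validation and test row, the evaluation mimics how the model is actually used: fitted on the past, judged on the future.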
Multiple boosting models were tested, including XGBoost and CatBoost. Boosting models build many decision trees in sequence, where each new tree attempts to improve on the previous errors.
XGBoost is a strong model for structured tabular data and provided a useful comparison model for leak prediction.
CatBoost was useful because the dataset includes categorical variables such as material, lining material, and water main type. CatBoost produced the best overall result.
Since leaks are rare, accuracy is not a good main metric. The evaluation therefore focused on precision, recall, F1 score, ROC AUC, and especially PR AUC.
Precision: Of the pipes predicted to leak, how many actually leaked?
Recall: Of all pipes that actually leaked, how many did the model catch?
F1 score: A combined measure that balances precision and recall.
PR AUC: Measures ranking quality for the rare leak class.
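The threshold-based metrics can be made concrete with a small worked example. The labels and predictions below are invented for illustration and are not project data.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = leak)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Eight hypothetical pipes: three leaked, three were flagged.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of the three flagged pipes leaked and two of the three leaking pipes were flagged, so precision, recall, and F1 all equal 2/3.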
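from catboost import CatBoostClassifier

# Hyperparameters for the final CatBoost model.
model = CatBoostClassifier(
    iterations=700,                 # number of boosted trees
    learning_rate=0.05,
    depth=6,                        # depth of each tree
    loss_function="Logloss",        # binary classification loss
    eval_metric="PRAUC",            # track PR AUC during training
    auto_class_weights="Balanced",  # upweight the rare leak class
    bootstrap_type="MVS",
    random_seed=42,                 # reproducible runs
)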
CatBoost gave the best validation and testing results among the models tested based on PR AUC. The final model performed meaningfully better than a random baseline, but it is still imperfect because water main leaks are rare and the public feature space is limited.
PR AUC is the area under the precision-recall curve, which summarizes how well a classifier ranks positive cases above negative cases across different thresholds. It is especially useful for imbalanced classification problems where the positive event is rare. For a model that ranks at random, PR AUC is approximately equal to the positive rate, so the baseline here is the leak rate. The final model achieved a PR AUC around 3.8 times higher than that baseline, which suggests that it learned meaningful signal.
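The comparison against the baseline can be sketched with average precision, one common estimator of PR AUC. This is an illustration on invented data: the exact PR AUC computation used in the project (for example CatBoost's built-in metric) may differ in detail, and `average_precision` is a hypothetical helper.

```python
def average_precision(y_true, scores):
    """Mean of precision@k taken at each rank k where a true positive appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / hits if hits else 0.0


# Ten hypothetical pipes, two of which leaked.
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.10, 0.90, 0.80, 0.20, 0.70, 0.30, 0.10, 0.05, 0.20, 0.15]

ap = average_precision(y_true, scores)
baseline = sum(y_true) / len(y_true)  # positive rate, the random-ranking baseline
ratio = ap / baseline
```

The two leaking pipes sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, compared with a baseline of 0.2 for this toy data.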
The model is more useful as a risk ranking tool than as a strict yes/no classifier. The results below show how many actual leaks were found among the highest-risk pipes in the held-out test year.
In the held-out test year, the baseline leak rate was only about 1.07%. This means that randomly selecting 1,000 pipes would be expected to find roughly 11 leaks. The model's top 1,000 highest-risk pipes contained 64 actual leaks, which is about 5.95 times better than random selection. The top 5,000 highest-risk pipes captured 288 of the 751 actual leaks, or 38.35% of all leaks in the test year.
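The top-k figures above follow from simple arithmetic, sketched below. The helper `top_k_summary` and its toy inputs are hypothetical. The last three lines only re-derive ratios from the numbers reported in this section.

```python
def top_k_summary(y_true, scores, k):
    """Leaks found in the k highest-scored pipes, share of all leaks, and lift."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    found = sum(y_true[i] for i in top)
    total = sum(y_true)
    base_rate = total / len(y_true) if y_true else 0.0
    hit_rate = found / len(top) if top else 0.0
    return {
        "found": found,
        "capture": found / total if total else 0.0,       # share of all leaks caught
        "lift": hit_rate / base_rate if base_rate else 0.0,
    }


# Toy data: ten pipes, two leaks, one of them ranked in the top two.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1]
summary = top_k_summary(y_true, scores, k=2)

# Ratios implied by the reported test-year figures.
capture_top_5000 = 288 / 751             # share of all leaks in the top 5,000
hit_rate_top_1000 = 64 / 1000            # leak rate within the top 1,000
implied_base_rate = hit_rate_top_1000 / 5.95
```

The implied base rate comes out at roughly 1.08%, consistent with the stated baseline of about 1.07%.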
This means the model is not perfect as a binary classifier, but it does provide a meaningful ranking of pipe risk. In practice, this is useful because inspection and maintenance teams often need to prioritize a limited number of pipe segments rather than classify every pipe with perfect certainty.
After model selection, the CatBoost model was used to score the 2025 pipe-year records. Since the target is next-year leak status, the 2025 rows are used to predict 2026 leak risk.
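# Probability of the positive class (leak in the following year).
scoring_df["leak_probability_next_year"] = model.predict_proba(X_score)[:, 1]
# Binary flag using the threshold tuned on the validation year.
scoring_df["predicted_leak_next_year"] = (
    scoring_df["leak_probability_next_year"] >= best_threshold
)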
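The report does not say how `best_threshold` was chosen beyond tuning it on the validation year. One plausible approach, shown here purely as an assumption, is to pick the candidate threshold that maximizes F1 on validation data. The function name and the toy inputs are hypothetical.

```python
def best_f1_threshold(y_true, probs, candidates):
    """Return the (threshold, F1) pair with the highest F1 among the candidates."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [int(p >= t) for p in probs]
        tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # keep the first threshold reaching the best F1
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation data: six pipes with predicted leak probabilities.
y_valid = [0, 0, 1, 0, 1, 1]
p_valid = [0.05, 0.20, 0.35, 0.40, 0.60, 0.90]
threshold, f1 = best_f1_threshold(y_valid, p_valid, [0.1, 0.3, 0.5, 0.7])
```

Choosing the threshold on validation data and then applying it unchanged to the test and scoring years keeps the test estimate honest.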
The predictions were saved to a GeoJSON file so the original pipe line geometries could be displayed on the map. The final dashboard was built with HTML, CSS, JavaScript, Leaflet, and GitHub Pages.
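The export step can be sketched with the standard library alone. The property names mirror the scoring columns used earlier, while the helper names, the pipe ID, and the coordinates are hypothetical. GeoJSON expects longitude and latitude in WGS 84, so geometries stored in UTM would need to be reprojected first, a step this sketch assumes has already happened.

```python
import json


def to_feature(pipe_id, lonlat_coords, probability, predicted):
    """Build one GeoJSON Feature for a pipe segment with its prediction."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [list(c) for c in lonlat_coords],  # [lon, lat] pairs
        },
        "properties": {
            "pipe_id": pipe_id,
            "leak_probability_next_year": round(probability, 4),
            "predicted_leak_next_year": bool(predicted),
        },
    }


def to_feature_collection(features):
    """Wrap features in a FeatureCollection, the top-level object maps load."""
    return {"type": "FeatureCollection", "features": list(features)}


feature = to_feature(
    "PIPE-001",                                   # hypothetical pipe ID
    [(-123.1207, 49.2827), (-123.1190, 49.2830)],  # illustrative coordinates
    probability=0.12345,
    predicted=False,
)
geojson_text = json.dumps(to_feature_collection([feature]))
```

A web map library such as Leaflet can load the resulting text directly and style each line by its `leak_probability_next_year` property.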
The model shows useful predictive signal, but performance is limited by available features, label quality, and the difficulty of predicting rare infrastructure failures.
Overall, the model should be interpreted as an early warning and prioritization tool. It can help identify elevated-risk pipe segments, but infrastructure decisions should combine model output with engineering judgment and field inspection.