technical report

Water Main Leak Prediction Using CatBoost

This project uses public City of Vancouver water main and 311 service request data to predict which pipe segments are most likely to leak in the next year. The final output is an interactive web map that visualizes predicted 2026 leak risk for individual pipe segments.

1. Introduction

The data pipeline began by importing records from City of Vancouver API endpoints into a PostgreSQL database running inside Docker Desktop. Python was used to request the raw API data and store each record as JSON, while SQL was used to join, filter, clean, and reshape the datasets into a machine-learning-ready format.

The workflow included joining the transmission and distribution water main datasets, cleaning and combining 311 water leak reports, matching reported leak locations to nearby water mains using coordinate information, and transforming the result into a pipe-year dataset. Several boosting algorithms were then trained and compared using a time-based split, and the final predictions were exported to a GeoJSON file for visualization.

Pipeline: Vancouver API → Python import → Docker + PostgreSQL → SQL + PostGIS → CatBoost → HTML dashboard

2. Background and Motivation

Water mains are a core part of municipal infrastructure. They move treated drinking water through the city and help supply connected systems such as hydrants, sprinklers, homes, and businesses. When a water main leaks, the consequences can extend beyond water loss.

Water Main

A pipe that carries treated water through the municipal water system. Water mains connect to service lines, buildings, hydrants, and other water infrastructure.

Transmission Water Main

A larger pipe that moves water across longer distances or between major parts of the water network. These mains typically carry higher volumes of water.

Distribution Water Main

A local pipe that distributes water through neighbourhoods and to service connections, buildings, hydrants, and other local users.

Backflow

The unwanted reverse movement of water or contaminants into the drinking water system. Backflow preventers are installed on sprinkler systems because the pressure drop when sprinklers activate can pull water backwards. Water in sprinkler systems is extremely dirty because it sits inside the pipes for many years. Leaks, pressure drops, fire events, or earthquake damage can increase backflow risk if protective systems are overwhelmed.

Sprinklers, hydrants, and drinking water connections are all connected to the broader water main system. If a main leaks or loses pressure, hydrants and sprinklers may have reduced pressure during an emergency. A leak can also create water quality concerns if dirt, debris, rust, or contaminants enter damaged infrastructure.

  • Water quality issues: leaks can allow dirt, debris, rust, or contaminants to enter damaged infrastructure.
  • Backflow risk: structural damage or pressure changes can overwhelm backflow prevention systems.
  • Property damage: water leaks can damage roads, buildings, and private property.
  • Reduced fire protection: hydrants and sprinklers may have lower pressure during emergency response.
  • Higher costs: leaks can increase water loss, maintenance costs, and potentially bills.
  • Service interruptions: repairs may require water shutoffs, road work, or traffic disruption.

3. Research Question and Goal

research question

Which water main pipe segments are predicted to leak in the next year?

The goal was to build a model using publicly available data that estimates the probability that each pipe segment will leak in the following year.

4. Data Sources

The project used public City of Vancouver API datasets. The main sources were water distribution mains, water transmission mains, historical 311 water leak service requests, and recent 311 water leak service requests.

| Dataset | Purpose |
| --- | --- |
| Water distribution mains | Local water main pipe segments used for neighbourhood-level water delivery. |
| Water transmission mains | Larger pipe segments used to move water through major parts of the network. |
| 311 water leak requests, 2009–2021 | Historical public service requests related to water leak cases. |
| 311 water leak requests, 2021–present | Recent water leak service requests used to extend the leak history. |

DATASETS = [
    {"table_name": "raw_distribution_mains", "url": "...water-distribution-mains..."},
    {"table_name": "raw_transmission_mains", "url": "...water-transmission-mains..."},
    {"table_name": "raw_311_water_leaks_2009_2021", "url": "...3-1-1-service-requests-2009-2021..."},
    {"table_name": "raw_311_water_leaks_2021_present", "url": "...3-1-1-service-requests..."}
]
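
The import step for the datasets listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual import script: the offset/limit paging scheme, the `fetch_page` callable, and the helper names `iter_records` and `to_raw_rows` are all assumptions, and a stand-in function replaces the real HTTP request so the example runs offline.

```python
import json
from typing import Callable, Iterator


def iter_records(fetch_page: Callable[[int, int], list], page_size: int = 100) -> Iterator[dict]:
    """Yield records page by page until a short (or empty) page is returned.

    fetch_page(offset, limit) is a hypothetical callable that wraps the HTTP request.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


def to_raw_rows(table_name: str, records) -> list:
    """Serialize each record to a JSON string, ready for a JSON/JSONB column."""
    return [(table_name, json.dumps(rec, sort_keys=True)) for rec in records]


# Stand-in for a real API call (assumption: five records, paged by offset/limit).
def fake_fetch_page(offset: int, limit: int) -> list:
    data = [{"id": i} for i in range(5)]
    return data[offset:offset + limit]


rows = to_raw_rows("raw_distribution_mains", iter_records(fake_fetch_page, page_size=2))
```

Each `(table_name, json_text)` tuple could then be passed to a parameterized `INSERT` through whichever PostgreSQL driver is in use.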

5. Data Cleaning and Transformation

The transmission and distribution water main datasets were unioned into one water main table. The 311 datasets were also unioned into one leak table. The 311 timestamps were simplified to leak year to make the data easier to join, manipulate, and train on.

Each water main was associated with a pipe ID. The leak reports were matched to water mains using coordinate data and GIS functions. This created a link between point-like 311 leak reports and line-like pipe geometries.

-- example spatial workflow
CREATE INDEX idx_pipe_geom
ON main
USING GIST (geom_line_utm);

CREATE INDEX idx_leak_geom
ON leakdata
USING GIST (geom_point_utm);
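
The matching logic behind the spatial workflow can be illustrated in plain Python. The sketch below assumes projected coordinates in metres (as the `_utm` column names suggest) and assigns each leak point to its closest pipe, discarding matches beyond a cutoff. The 30 m cutoff and the function names are hypothetical. In PostGIS, a match like this would typically rely on functions such as `ST_DWithin` and `ST_Distance`; the report does not state which functions were used.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b; all are (x, y) tuples in metres."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment: both endpoints coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment's line and clamp the projection to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def nearest_pipe(leak_point, pipes, max_distance_m=30.0):
    """Return (pipe_id, distance) for the closest pipe within the cutoff, else None."""
    best = None
    for pipe_id, coords in pipes.items():
        for a, b in zip(coords, coords[1:]):
            d = point_segment_distance(leak_point, a, b)
            if best is None or d < best[1]:
                best = (pipe_id, d)
    if best is not None and best[1] <= max_distance_m:
        return best
    return None


# Two hypothetical pipes running east-west, 50 m apart.
pipes = {
    "P1": [(0.0, 0.0), (100.0, 0.0)],
    "P2": [(0.0, 50.0), (100.0, 50.0)],
}
match = nearest_pipe((40.0, 10.0), pipes)      # 10 m from P1
no_match = nearest_pipe((40.0, 500.0), pipes)  # too far from either pipe
```

This brute-force loop checks every pipe for every leak; the GIST indexes above exist precisely so the database can avoid that full scan.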

The final dataset was transformed so that each pipe had one row for each year from 2015 to 2026. A target column was created to represent whether the pipe leaked in the following year.

CASE
  WHEN LEAD(leaked_that_year) OVER (
    PARTITION BY pipe_id
    ORDER BY analysis_year
  ) = 1
  THEN 1
  ELSE 0
END AS leaks_next_year
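
The same next-year labelling can be expressed in Python. This sketch assumes one row per pipe per consecutive year, so looking up `analysis_year + 1` is equivalent to the SQL `LEAD`; the function name `add_next_year_label` is illustrative only.

```python
def add_next_year_label(rows):
    """Attach leaks_next_year to each pipe-year row.

    rows: list of dicts with pipe_id, analysis_year, and leaked_that_year.
    """
    leaked = {(r["pipe_id"], r["analysis_year"]): r["leaked_that_year"] for r in rows}
    labeled = []
    for r in sorted(rows, key=lambda r: (r["pipe_id"], r["analysis_year"])):
        next_value = leaked.get((r["pipe_id"], r["analysis_year"] + 1))
        # A missing following year falls through to 0, like the SQL ELSE branch.
        labeled.append({**r, "leaks_next_year": 1 if next_value == 1 else 0})
    return labeled


rows = [
    {"pipe_id": "A", "analysis_year": 2015, "leaked_that_year": 0},
    {"pipe_id": "A", "analysis_year": 2016, "leaked_that_year": 1},
    {"pipe_id": "A", "analysis_year": 2017, "leaked_that_year": 0},
]
labeled = add_next_year_label(rows)
```

Note that the most recent year for each pipe has no following row, so its label is 0 by construction rather than by observation. Such rows are suitable for scoring but not for training or evaluation.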

6. Model Training

The data were split by year rather than randomly. This matters because the task is time-dependent: the model should learn from past pipe-year records and predict future leak risk.

Training

Historical pipe-year rows used to fit the model.

Validation

A later year used to tune thresholds and compare models.

Testing

A held-out year used to estimate final performance.

Scoring

2025 pipe records scored to estimate 2026 leak risk.
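
A year-based split of this kind might look like the sketch below. The report only states that 2025 rows are used for scoring; the specific training, validation, and test years shown here (up to 2022, 2023, and 2024) are assumptions chosen for illustration.

```python
def split_by_year(rows, train_end, valid_year, test_year, score_year):
    """Partition pipe-year rows into chronological splits by analysis_year."""
    splits = {"train": [], "valid": [], "test": [], "score": []}
    for r in rows:
        year = r["analysis_year"]
        if year <= train_end:
            splits["train"].append(r)
        elif year == valid_year:
            splits["valid"].append(r)
        elif year == test_year:
            splits["test"].append(r)
        elif year == score_year:
            splits["score"].append(r)
    return splits


# One hypothetical pipe with a row for every year from 2015 to 2025.
rows = [{"pipe_id": "A", "analysis_year": y} for y in range(2015, 2026)]
splits = split_by_year(rows, train_end=2022, valid_year=2023, test_year=2024, score_year=2025)
```

Because every split boundary is a calendar year, no row from a later year can leak into the training set.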

Multiple boosting models were tested, including XGBoost and CatBoost. Boosting models build many decision trees in sequence, where each new tree attempts to improve on the previous errors.

XGBoost

XGBoost is a strong model for structured tabular data and provided a useful comparison model for leak prediction.

CatBoost

CatBoost was useful because the dataset includes categorical variables such as material, lining material, and water main type. CatBoost produced the best overall result.

Since leaks are rare, accuracy is not a good main metric. The evaluation therefore focused on precision, recall, F1 score, ROC AUC, and especially PR AUC.

Precision

Of the pipes predicted to leak, how many actually leaked?

Recall

Of all pipes that actually leaked, how many did the model catch?

F1 score

A combined measure (the harmonic mean) that balances precision and recall.

PR AUC

Measures ranking quality for the rare leak class.
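
As a small worked example of the threshold-based metrics, the sketch below computes precision, recall, and F1 from confusion counts. The counts are made up for illustration and are not taken from the project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # equivalent to the harmonic mean of P and R
    return precision, recall, f1


# Hypothetical counts: 10 pipes flagged, 3 of them leaked, 2 leaks were missed.
precision, recall, f1 = precision_recall_f1(tp=3, fp=7, fn=2)
```

Here precision is 3/10, recall is 3/5, and F1 is 6/15, which shows how a model can catch most leaks while still being wrong about most of the pipes it flags.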

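from catboost import CatBoostClassifier

# Gradient-boosted trees configured for a rare positive class.
model = CatBoostClassifier(
    iterations=700,                 # number of boosting rounds (trees)
    learning_rate=0.05,
    depth=6,                        # depth of each tree
    loss_function="Logloss",        # binary classification objective
    eval_metric="PRAUC",            # track area under the precision-recall curve
    auto_class_weights="Balanced",  # upweight the rare leak class
    bootstrap_type="MVS",
    random_seed=42                  # fixed seed for reproducibility
)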

7. Model Results and Interpretation

CatBoost gave the best validation and testing results among the models tested based on PR AUC. The final model performed meaningfully better than a random baseline, but it is still imperfect because water main leaks are rare and the public feature space is limited.

Final CatBoost test result

Precision: 6.33%
Recall: 27.70%
F1 score: 10.31%
PR AUC: 0.0407
ROC AUC: 0.7441
PR AUC lift: ~3.8×

PR AUC is the area under the precision-recall curve, measuring how well a classifier ranks positive cases above negative cases across different thresholds. It is especially useful for imbalanced classification problems where the positive event is rare. For a random ranking, PR AUC is approximately equal to the positive-class rate, which here is the leak rate of about 1.07%. The final model achieved a PR AUC of 0.0407, around 3.8 times higher than that baseline, which suggests that it learned meaningful signal.
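
One common way to estimate PR AUC is average precision, sketched below on toy data together with the lift arithmetic. This is an illustrative estimator that ignores tied scores; it is not necessarily the exact computation CatBoost performs for its `PRAUC` metric. The last line reuses the PR AUC (0.0407) and the approximate leak rate (1.07%) reported in this section.

```python
def average_precision(labels, scores):
    """Mean of precision values taken at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / hits if hits else 0.0


# Toy data: positives ranked 1st and 3rd out of five items.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(labels, scores)
base_rate = sum(labels) / len(labels)  # expected score of a random ranking
toy_lift = ap / base_rate

# Lift for the reported test result: PR AUC divided by the leak rate.
report_lift = 0.0407 / 0.0107
```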

Top-k risk ranking diagnostic

The model is more useful as a risk ranking tool than as a strict yes/no classifier. The table below shows how many actual leaks were found among the highest-risk pipes in the held-out test year.

| Inspection set | Actual leaks found | Precision@k | Recall@k | Lift over random |
| --- | --- | --- | --- | --- |
| Top 100 pipes | 9 of 751 | 9.00% | 1.20% | 8.37× |
| Top 250 pipes | 23 of 751 | 9.20% | 3.06% | 8.56× |
| Top 500 pipes | 32 of 751 | 6.40% | 4.26% | 5.95× |
| Top 1,000 pipes | 64 of 751 | 6.40% | 8.52% | 5.95× |
| Top 2,500 pipes | 164 of 751 | 6.56% | 21.84% | 6.10× |
| Top 5,000 pipes | 288 of 751 | 5.76% | 38.35% | 5.36× |
| Top 10,000 pipes | 353 of 751 | 3.53% | 47.00% | 3.28× |

In the held-out test year, the baseline leak rate was only about 1.07%. This means that randomly selecting 1,000 pipes would be expected to find roughly 11 leaks. The model's top 1,000 highest-risk pipes contained 64 actual leaks, which is about 5.95 times better than random selection. The top 5,000 highest-risk pipes captured 288 of the 751 actual leaks, or 38.35% of all leaks in the test year.

This means the model is not perfect as a binary classifier, but it does provide a meaningful ranking of pipe risk. In practice, this is useful because inspection and maintenance teams often need to prioritize a limited number of pipe segments rather than classify every pipe with perfect certainty.
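
The top-k diagnostic can be reproduced with a few lines of Python. The sketch below uses a ten-pipe toy example rather than the project's data, and the function name `top_k_report` is illustrative. Lift is computed as precision@k divided by the overall leak rate.

```python
def top_k_report(labels, scores, k):
    """Precision@k, recall@k, and lift over random for the k highest-scored items."""
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    found = sum(labels[i] for i in top)
    total_positives = sum(labels)
    base_rate = total_positives / len(labels)
    precision_at_k = found / k
    recall_at_k = found / total_positives if total_positives else 0.0
    lift = precision_at_k / base_rate if base_rate else 0.0
    return {"found": found, "precision": precision_at_k, "recall": recall_at_k, "lift": lift}


# Toy data: 10 pipes, 2 actual leaks (indices 1 and 6).
labels = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.8, 0.9, 0.1, 0.2, 0.3, 0.1, 0.5, 0.2, 0.1, 0.05]
report = top_k_report(labels, scores, k=2)

# The same arithmetic applied to the reported top-1,000 row: 64 leaks out of 751.
precision_top_1000 = 64 / 1000
recall_top_1000 = 64 / 751
```

In the toy example, inspecting the top two pipes finds one of the two leaks, giving a precision of 0.5 against a base rate of 0.2.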

8. Prediction and Visualization

After model selection, the CatBoost model was used to score the 2025 pipe-year records. Since the target is next-year leak status, the 2025 rows are used to predict 2026 leak risk.

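# Probability of the positive class (a leak in the following year)
scoring_df["leak_probability_next_year"] = model.predict_proba(X_score)[:, 1]

# Flag pipes whose probability meets the chosen threshold
scoring_df["predicted_leak_next_year"] = (
    scoring_df["leak_probability_next_year"] >= best_threshold
)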
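
The report says thresholds were tuned on the validation year but does not say by which criterion. One plausible approach, shown here purely as an assumption, is to choose the candidate threshold that maximizes F1 on validation data. The function name and the toy values are hypothetical.

```python
def best_f1_threshold(labels, probs, candidates):
    """Return (threshold, f1) for the candidate with the highest F1 score."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(labels, probs) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(labels, probs) if p < t and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the earliest candidate
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation data (not from the project).
val_labels = [0, 0, 1, 0, 1, 1]
val_probs = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
threshold, f1 = best_f1_threshold(val_labels, val_probs, candidates=[0.2, 0.5, 0.8])
```

Choosing the threshold on validation data and then reporting metrics on the held-out test year keeps the test estimate free of tuning bias.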

The predictions were saved to a GeoJSON file so the original pipe line geometries could be displayed on the map. The final dashboard was built with HTML, CSS, JavaScript, Leaflet, and GitHub Pages.
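
A GeoJSON export of this kind can be sketched with the standard library alone. The property names mirror the scoring columns above, but the input structure, the coordinates, and the function name are hypothetical. GeoJSON expects longitude/latitude coordinates, so geometries stored in a projected system such as UTM would need to be reprojected first.

```python
import json


def to_feature_collection(pipes):
    """Build a GeoJSON FeatureCollection of LineString features from scored pipes."""
    features = []
    for pipe in pipes:
        features.append({
            "type": "Feature",
            "geometry": {"type": "LineString", "coordinates": pipe["coords"]},
            "properties": {
                "pipe_id": pipe["pipe_id"],
                "leak_probability_next_year": round(pipe["prob"], 4),
            },
        })
    return {"type": "FeatureCollection", "features": features}


# One hypothetical pipe segment with [longitude, latitude] vertices.
scored_pipes = [
    {"pipe_id": "P1", "coords": [[-123.12, 49.28], [-123.11, 49.28]], "prob": 0.08317},
]
collection = to_feature_collection(scored_pipes)
geojson_text = json.dumps(collection)
```

The resulting text can be written to a `.geojson` file and loaded by Leaflet as a map layer, with the probability property driving the line colour.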

9. Discussion and Future Improvements

The model shows useful predictive signal, but performance is limited by available features, label quality, and the difficulty of predicting rare infrastructure failures.

  • Richer leak history: include exact dates, repeated reports, seasonal effects, and repair timing.
  • Local spatial features: count nearby historical leaks within 50 m, 100 m, or 250 m.
  • Better labels: remove ambiguous reports and filter matches by distance from the pipe.
  • More context: add pressure zones, soil type, road work, traffic, weather, or maintenance records if available.
  • Probability calibration: improve how well predicted probabilities match real-world leak rates.
  • Operational testing: compare predictions against inspection capacity and maintenance priorities.

Overall, the model should be interpreted as an early warning and prioritization tool. It can help identify elevated risk pipe segments, but infrastructure decisions should combine model output with engineering judgment and field inspection.