About This Project

In this assignment at the University of Pennsylvania Stuart Weitzman School of Design, CPLN 675 Land Use and Environmental Modeling, our goal was to develop a predictive flood inundation model using Calgary, Alberta, Canada as the base geography on which to train a model that could be applied to other cities, using the same factors.

For this model we selected the factors of: - Soil Drainage Efficacy (i.e., soil type and its capacity to allow water absorption) - Water Flow Accumulation (i.e., how much–and where–water collects topographically?) - Surface Permeability (i.e., whether impervious surfaces, such as buildings and concrete, sit on a given segment of land area) - Distance to Water Bodies (i.e., how far is a given point in land from where water collects?)

Calgary and Boise

We selected Boise, Idaho as our comparable city.

Setup & Visualizing Our Dependent Variable and Factors

Below, we walk through the steps of setting up and visualizing our data, to help us to understand the apparent relationship between our dependent variable, historic flooding (hist_flood), and our four factors in the Calgay geography.

Importing Preliminary Spatial Data

First, we load our two cities’ boundaries and some other preliminary datasets, and put them in web Mercator projections (CRS 3395).

calgary_boundary <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_BOUNDARY.geojson")


calgary_boundary <- calgary_boundary %>% 
  st_transform(crs = 3395)
  
boise_boundary <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_BOUNDARY.geojson")

boise_boundary <- boise_boundary %>% 
  st_transform(crs = 3395)

#HYDROLOGY
#calgary_hydrology <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Calgary/CalgaryHydrology.shp")
  #calgary_hydrology <- calgary_hydrology %>% 
   # st_transform(crs = 3395)

#boise_hydrology1 <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Boise/BoiseHydrology1.shp")
#boise_hydrology2 <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Boise/BoiseHydrology2.shp")
 # boise_hydrology1 <- boise_hydrology1 %>% 
  #    st_transform(crs = 3395)
  #boise_hydrology2 <- boise_hydrology2 %>% 
   #   st_transform(crs = 3395)
  
calgary_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_MASTERFISH.geojson")

boise_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_MASTERFISH.geojson")

Creating Fishnets

Now, we use R to create fishnets for the two cities with the same cell size and projections, to ensure consistency among forthcoming maps.

calgary_fishnet <- st_make_grid(calgary_boundary,
                                cellsize = 500,
                                square = FALSE) %>% 
  .[calgary_boundary] %>% 
  st_sf() %>% 
  mutate(uniqueID = rownames(.))

ggplot()+
  geom_sf(data = calgary_fishnet,
          fill = "pink")+
  geom_sf(data = calgary_boundary, 
          color = "black", fill = "transparent") +
  mapTheme

boise_fishnet <- st_make_grid(boise_boundary,
                                cellsize = 500,
                                square = FALSE) %>% 
  .[boise_boundary] %>% 
  st_sf() %>% 
  mutate(uniqueID = rownames(.))

ggplot()+
  geom_sf(data = boise_fishnet,
          fill = "lightblue")+
  geom_sf(data = boise_boundary, 
          color = "black", fill = "transparent") +
  mapTheme

#st_write(calgary_fishnet, "C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Derived Files/calgary_fishnet.shp", geometry?=?TRUE)
#st_write(boise_fishnet, "C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Derived Files/boise_fishnet.shp", geometry?=?TRUE)

Dependent Variable: Flooding Maps

Now, we visualize our two geographies with their outlays of historical flooding. Considering historical flooding is the dependent variable which all of our factors in our model are trying to predict, it is good to first understand where flooding occurs as a baseline of our model’s success.

grid.arrange(ncol = 2,
  ggplot() +
    labs(title="Calgary, Historical Flooding and Hydrology",
       subtitle="Calgary; Yellow = Historical flooding observed")+
  geom_sf(data = calgary_boundary,
           color = "black")+
    geom_sf(data=calgary_master_fish  %>% 
               filter(hist_flood == 6),
           fill="yellow",alpha = 0.6,color='transparent') +
  # geom_sf(data=calgary_hydrology,
   #        fill = 'light blue')+
    mapTheme,
  
  ggplot() +
    geom_sf(data = boise_boundary,
           color = "black")+
    geom_sf(data=boise_master_fish  %>% 
               filter(hist_flood == 6),
           fill="yellow",alpha = 0.6,color='transparent') +
   #geom_sf(data=boise_hydrology1,
    #       fill = 'light blue')+
    # geom_sf(data=boise_hydrology2,
     #     fill = 'light blue')+
    mapTheme+
  labs(title="Boise, Historical Flooding and Hydrology",
       subtitle="Boise; Yellow = Historical flooding observed"))

Importing Other Files for Factors and Map Elements

Here, we import our spatial data to map our factors and train our model. Most of this data was cleaned, derived, and assembled using ArcGIS Pro.

Mapping

Calgary

Here, we create maps of our four factors for our base geography, Calgary.

#IMPERVIOUS v. PERVIOUS SURFACES -- Good
ggplot() +
  geom_sf(data = calgary_boundary,
           color = "black")+
  geom_sf(data = calgary_surf_fish,
          aes(fill = Pervious),
          color = "transparent")+
  scale_fill_viridis('Surface Permeability (6 = least)',
                     direction = -1, # Direction of the color palette - either 1 or -1
                     option = 'F',  # Viridis has multiple palettes to choose from
                     alpha = 0.6)+  
  labs(title="Surface Permeability") +
  mapTheme

#DISTANCE TO WATER BODY -- Good
ggplot() +
  geom_sf(data = calgary_boundary,
          color = "black")+
  geom_sf(data = calgary_disth2o_fish,
          aes(fill = gridcode),
          color = "transparent")+
  scale_fill_viridis('Distance to Water (6 = closest)',
                     direction = -1, # Direction of the color palette - either 1 or -1
                     option = 'F',
                     alpha = 0.6)+  
  labs(title="Distance to Water Body") +
  mapTheme

  #FLOW ACCUMULATION -- good
  ggplot() +
    geom_sf(data = calgary_boundary,
            color = "black")+
    geom_sf(data = calgary_flow_fish,
            aes(fill = FlowAcc),
            color = "transparent")+
    scale_fill_viridis('Flow Accumulation (6 = greatest)',
                       direction = -1, 
                       option = 'F', 
                       alpha = 0.6)+  
    labs(title="Water Flow Accumulation") +
    mapTheme

Boise

Likewise, here we visualize the same four factors within Boise’s boundaries.

###Boise
  #SOIL DRAINAGE -- Good
  ggplot() +
   geom_sf(data = boise_boundary,
                         fill = "transparent", color = "black")+
                 geom_sf(data = boise_drain_fish,
                         aes(fill = DrainRank),
                         color = "transparent")+
                 scale_fill_viridis('Soil Drainage Efficacy (6 = best)',
                                    direction = -1, # Direction of the color palette - either 1 or -1
                                    option = 'F',  # Viridis has multiple palettes to choose from
                                    alpha = 0.6)+  
                 labs(title="Soil Type & Drainage") +
                 mapTheme

#FLOW ACCUMULATION -- good
  ggplot() +
   geom_sf(data = boise_boundary,
                         fill = "transparent", color = "black")+
                 geom_sf(data = boise_flow_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Flow Accumulation (6 = greatest)',
                                    direction = -1, 
                                    option = 'F',  
                                    alpha = 0.6)+  
                 labs(title="Water Flow Accumulation") +
                 mapTheme

#IMPERVIOUS v. PERVIOUS SURFACES -- good
  ggplot() +
    geom_sf(data = boise_boundary,
                         color = "black")+
                 geom_sf(data = boise_surf_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Surface Permeability (6 = least)',
                                    direction = -1, 
                                    option = 'F',  
                                    alpha = 0.6)+  
                 labs(title="Surface Permeability") +
                 mapTheme

#DISTANCE TO WATER BODY -- good
ggplot() +
  geom_sf(data = boise_boundary,
                         color = "black")+
                 geom_sf(data = boise_disth2o_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Distance to Water (6 = closest)',
                                    direction = -1, # Direction of the color palette - either 1 or -1
                                    option = 'F',
                                    alpha = 0.6)+  
                 labs(title="Distance to Water Body") +
                 mapTheme

Data Wrangling

Here, we clean the tabular data underlying the spatial data to prepare it for running through the model.

Plotting

In these plots, we see how our factors in Calgary average out in areas within the city that have flooding versus those that have not flooded. We notice that areas that have soil with better drainage efficacy (DrainRank) have been more likely to not flood. Likewise, areas with lots of flow accumulation are more likely to flood than areas with minimal accumulation.

Model Building

Here, we walk through the steps of building and training our model, and then applying that model to our other geography, Boise.

Binomial Model

Here, we employ a glm linear model, which finds that soil drainage efficacy (“DrainRank”) and Flow Accumulation (“FlowAcc”) are extremely statistically significant in their relationship to Calgary’s historical flooding. Distance to water body (“disth2o_gr”) was also found to be statistically significant. Surface permeability (“Pervious”), however, did not have a very strong relationship with observed flooding in Calgary.

calgaryModel <- glm(hist_flood ~ DrainRank + FlowAcc + Pervious + disth2o_gr,
                            family="binomial"(link="logit"), data = calgaryTrain1 %>%
                            as.data.frame() %>% dplyr::select(-geometry,-OBJECTID,hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr))                     

summary(calgaryModel)
## 
## Call:
## glm(formula = hist_flood ~ DrainRank + FlowAcc + Pervious + disth2o_gr, 
##     family = binomial(link = "logit"), data = calgaryTrain1 %>% 
##         as.data.frame() %>% dplyr::select(-geometry, -OBJECTID, 
##         hist_flood, DrainRank, FlowAcc, Pervious, disth2o_gr))
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.829715   0.604726   1.372 0.170047    
## DrainRank   -1.573263   0.171256  -9.187  < 2e-16 ***
## FlowAcc      0.495187   0.058706   8.435  < 2e-16 ***
## Pervious     0.007057   0.045789   0.154 0.877513    
## disth2o_gr   0.121958   0.034912   3.493 0.000477 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2026.9  on 7153  degrees of freedom
## Residual deviance: 1718.4  on 7149  degrees of freedom
## AIC: 1728.4
## 
## Number of Fisher Scoring iterations: 8

Model Validation

The items below interrogate how successful our model is at predicting Calgary’s flooding, using several different visualizations and statistical evaluations. In both the histogram and filled line graph, our predictions are that flooded areas have about a .5 chance of flooding, while there is more nuance for areas that do not have historical flooding.

classProbs <- predict(calgaryModel, calgaryTest1, type="response")

hist(classProbs)

testProbs <- data.frame(obs = as.numeric(calgaryTest1$hist_flood),
                        pred = classProbs)

ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) + 
  geom_density() +
  facet_grid(obs ~ .) + 
  xlab("Probability") + 
  geom_vline(xintercept = .5) +
     scale_fill_manual(values = c("#669999", "#00CCCC"),
                           labels = c("Not Flooded","Flooded"),
                           name = "") +
    plotTheme

Confusion Metrics

The confusion metrics predict a high degree of accuracy with our Calgary model, while still generating false positives and false negatives.

testProbs$predClass  = ifelse(testProbs$pred > .05 ,1,0)
#predicting that if its greater than 50% its a 1, less is 0

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2454   43
##          1  537   32
##                                           
##                Accuracy : 0.8108          
##                  95% CI : (0.7965, 0.8246)
##     No Information Rate : 0.9755          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0587          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.42667         
##             Specificity : 0.82046         
##          Pos Pred Value : 0.05624         
##          Neg Pred Value : 0.98278         
##              Prevalence : 0.02446         
##          Detection Rate : 0.01044         
##    Detection Prevalence : 0.18558         
##       Balanced Accuracy : 0.62356         
##                                           
##        'Positive' Class : 1               
## 
## How it works:
  # **Predicted = 0, Observed = 0 -> True Negative**
  # **Predicted = 1, Observed = 1 -> True Positive**
  #  **Predicted = 1, Observed = 0 -> False Positive**
  #  **Predicted = 0, Observed = 1 -> False Negative**
    # **1. Sensitivity - the proportion of actual positives (1's) that were predicted to be positive. Also known as "true positive rate".**
    # **2. Specificity - The proportion of actual negatives (0's) that were predicted to be negatives. Also known as "true negative rate".**

ROC Curve

Given that the ROC Curve is not bowed out to the top and left edges, the model is not tremendously over-fit.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7751

Cross Validation

Cross validation tests the strength of our model on sample data. We test our model on 100 test sets, and then we estimate the strength of our model.

ctrl<- trainControl(method = "cv", 
                     number = 100, 
                     p = 0.7, 
                     savePredictions = TRUE)

cvFit <- train(as.factor(hist_flood) ~ .,  data =
                 calgary_vars %>% 
                 as.data.frame() %>% 
              dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr), 
               method="glm", family="binomial",
               trControl = ctrl)

cvFit
## Generalized Linear Model 
## 
## 10220 samples
##     4 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 10118, 10117, 10118, 10118, 10118, 10117, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.9702579  0

The histogram of our cross validation shows that all tests of our model are highly accurate.

ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Accuracy",
       y="Count")+
  plotTheme

Attaching Predictions to Calgary Model

In the final step of building our Calgary model, we attach our model’s predictions to our original Calgary fishnet…

Visualizing Calgary Predictions

…and we visualize our model’s prediction of flooding against the historical data of observed flooding.

ggplot() + 
  geom_sf(data=calgary1, aes(fill=factor(ntile(allPredictions,5))), 
          colour=NA) +
   scale_fill_manual(values=c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(calgary1$allPredictions,
                                                 c(0, .25, .5, .75, 1),
                                                 na.rm=T)),
                    name="Predicted Quintile\nProbabilities(%)\n(Darkest=\nHighest)") +
    geom_sf(data=calgary1  %>% 
               filter(hist_flood == 1), 
               fill="yellow",alpha = .4, colour="NA") +
  mapTheme +
  labs(title="Observed and Predicted Flooding Areas",
       subtitle="Calgary; Yellow = Historical flooding; Gradient = Predicted flooding (Darkest = Most)")

We see that our Calgary model has some unusual vertical striping. We are not sure why this happened, but suspect it bay be the result of incongruity units among our datasets visualized. We elaborate below as to the potential implications of this, and how we would go about adjusting our model to try to rectify this incorrect visualization.

Calgary Confusion Metrics:

By running confusion metrics, we see that our model predicts true negatives and true positives well. However, it has a lot of false positives because soil drainage efficacy appears to be over-weighted, affecting the shape of the model and the confusion metrics visualization.

calgary1 %>%
  mutate(confResult=case_when(allPredictions < 4 & hist_flood==0 ~ "True_Negative",
                              allPredictions >= 1 & hist_flood==1 ~ "True_Positive",
                              allPredictions < 4 & hist_flood==1 ~ "False_Negative",
                              allPredictions >= 4 & hist_flood==0 ~ "False_Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("Red","Light Blue","Orange","Pink"),
                    name="Outcomes")+
  labs(title="Calgary Confusion Metrics") +
  mapTheme

Data Wrangling - Boise

Here, we pull in our Boise data, which is our original fishnet, but fully populated with Boise’s data on the four factors on which we trained our Calgary model.

boise_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_MASTERFISH.geojson")
## Reading layer `BOISE_MASTERFISH' from data source 
##   `https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_MASTERFISH.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 2255 features and 15 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -12954270 ymin: 5360116 xmax: -12923270 ymax: 5388839
## Projected CRS: WGS 84 / World Mercator
boise_vars <- 
    boise_master_fish %>% 
    dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr,OBJECTID)
  
boise_vars <- boise_vars %>% mutate(hist_flood=recode(hist_flood, '1'='0', '6'='1'))
boise_vars <- boise_vars %>% mutate(hist_flood = as.numeric(hist_flood))

boise_allPredictions <- 
  predict(cvFit, boise_vars, type="prob")[,2] 

boise1 <- 
  cbind(boise_vars,boise_allPredictions) %>% 
  mutate(boise_allPredictions = round(boise_allPredictions * 100))

Testing the Calgary Flood Model on Boise

Here, we apply our Calgary flood model to our other geography, Boise. We attach the prediction values, trained on the Calgary model, to Boise, based on Boise’s spatial distribution of the four factors that we input to our Calgary model.

ggplot() + 
  geom_sf(data=boise1, aes(fill=factor(ntile(boise_allPredictions,5))), 
          colour=NA) +
  scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(boise1$boise_allPredictions,
                                                            c(0, .25, .5, .75, 1),
                                                 na.rm=T)),
                    name="Predicted Quintile\nProbabilities(%)\n(Darkest=\nHighest)") +
    geom_sf(data=boise1  %>% 
               filter(hist_flood == 1), 
               fill="yellow",alpha = 0.5, colour="NA") +
 coord_sf(xlim = c(-12955270, -12923270), ylim = c(5352033, 5394976), expand = FALSE) +
  mapTheme +
  labs(title="Observed and Predicted Flooding Areas",
       subtitle="Boise; Yellow = Historical flooding; Gradient = Predicted flooding (Darkest = Most)")

Results

Unfortunately, it does not seem that our model works too well on Boise! The areas where our model predicted that Boise would flood have historically been free of flooding. While this may be a result of inaccurate/uncorrelated factors, it could also be the result of data incongruities that caused the visible, vertical striping on our Calgary model.

NEXT STEPS

If we were to revisit our model to tune it to be more accurate, we would train it to be more sensitive to surface permeability, and to de-prioritize soil drainage efficacy, which influenced the whole shape of the prediction in ways that are not accurate to the historical flood patterns. In order to do this, we might need to use a multivariate model that allows us to manually weight one factor more highly against another.

As a fist step in diagnosing why the model did not work very well on Boise, would run the confusion metrics on Boise to see where there is misalignment.

---
title: "Predictive Flood Modelling for Boise, Idaho"
author: "Lindsey Hover, Samriddhi Khare & Anastasia Osorio"
date: "4/6/2023"
output: 
  html_document:
    theme: readable
    toc: true
    toc_float: true
    code_download: true
    code_folding: hide
editor_options: 
  markdown: 
    wrap: 72
---

```{r eval=FALSE, include=FALSE}
rm(list=ls())
#setwd("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/RScripts")

install.packages("rmarkdown") 
install.packages("knitr")
install.packages("caret") 
install.packages("pscl")
install.packages("plotROC") install.packages("pROC")
install.packages("kableExtra") install.packages("gridExtra")
install.packages("raster") 
install.packages("kableExtra") install.packages("geojsonio)

```

```{r setup, include=FALSE,message = FALSE,cache=TRUE}
knitr::opts_chunk$set(echo = TRUE)
options(scipen=999)
library(knitr)
```

```{r libraries, message=FALSE, warning=FALSE, include=FALSE}
library(caret)
library(plotROC)
library(pROC)
library(sf)
library(tidyverse)
library(knitr)
library(kableExtra)
library(tigris)
library(viridis)
library(ggplot2)
library(gridExtra)
library(raster)
#library(rgdal)
library(kableExtra)
library(geojsonio)

```

```{r mapTheme, include=FALSE}
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())
```

# About This Project

In this assignment at the University of Pennsylvania Stuart Weitzman
School of Design, CPLN 675 Land Use and Environmental Modeling, our goal
was to develop a predictive flood inundation model using Calgary,
Alberta, Canada as the base geography on which to train a model that
could be applied to other cities, using the same factors.

For this model we selected the factors of: - Soil Drainage Efficacy
(i.e., soil type and its capacity to allow water absorption) - Water
Flow Accumulation (i.e., how much--and where--water collects
topographically?) - Surface Permeability (i.e., whether impervious
surfaces, such as buildings and concrete, sit on a given segment of land
area) - Distance to Water Bodies (i.e., how far is a given point in land
from where water collects?)

# Calgary and Boise

We selected Boise, Idaho as our comparable city.

# Setup & Visualizing Our Dependent Variable and Factors

Below, we walk through the steps of setting up and visualizing our data,
to help us to understand the apparent relationship between our dependent
variable, historic flooding (hist_flood), and our four factors in the
Calgay geography.

### Data Sources

-   <https://opendata.cityofboise.org/>
-   <https://opengisdata-idahodeq.opendata.arcgis.com/>
-   <https://insideidaho.org/browse-org.html#>
-   <https://apps.nationalmap.gov/downloader/>
-   <https://data.calgary.ca/>
-   <https://open.alberta.ca/opendata>

### Importing Preliminary Spatial Data

First, we load our two cities' boundaries and some other preliminary
datasets, and put them in web Mercator projections (CRS 3395).

```{r load_boundary_data, echo=TRUE, message=FALSE, warning=FALSE, results='hide'}
calgary_boundary <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_BOUNDARY.geojson")


calgary_boundary <- calgary_boundary %>% 
  st_transform(crs = 3395)
  
boise_boundary <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_BOUNDARY.geojson")

boise_boundary <- boise_boundary %>% 
  st_transform(crs = 3395)

#HYDROLOGY
#calgary_hydrology <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Calgary/CalgaryHydrology.shp")
  #calgary_hydrology <- calgary_hydrology %>% 
   # st_transform(crs = 3395)

#boise_hydrology1 <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Boise/BoiseHydrology1.shp")
#boise_hydrology2 <- st_read("C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Data_Boise/BoiseHydrology2.shp")
 # boise_hydrology1 <- boise_hydrology1 %>% 
  #    st_transform(crs = 3395)
  #boise_hydrology2 <- boise_hydrology2 %>% 
   #   st_transform(crs = 3395)
  
calgary_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_MASTERFISH.geojson")

boise_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_MASTERFISH.geojson")


```

### Creating Fishnets

Now, we use R to create fishnets for the two cities with the same cell
size and projections, to ensure consistency among forthcoming maps.

```{r fishnets, echo=TRUE}
calgary_fishnet <- st_make_grid(calgary_boundary,
                                cellsize = 500,
                                square = FALSE) %>% 
  .[calgary_boundary] %>% 
  st_sf() %>% 
  mutate(uniqueID = rownames(.))

ggplot()+
  geom_sf(data = calgary_fishnet,
          fill = "pink")+
  geom_sf(data = calgary_boundary, 
          color = "black", fill = "transparent") +
  mapTheme


boise_fishnet <- st_make_grid(boise_boundary,
                                cellsize = 500,
                                square = FALSE) %>% 
  .[boise_boundary] %>% 
  st_sf() %>% 
  mutate(uniqueID = rownames(.))

ggplot()+
  geom_sf(data = boise_fishnet,
          fill = "lightblue")+
  geom_sf(data = boise_boundary, 
          color = "black", fill = "transparent") +
  mapTheme


#st_write(calgary_fishnet, "C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Derived Files/calgary_fishnet.shp", geometry?=?TRUE)
#st_write(boise_fishnet, "C:/Users/Lindsey/Desktop/CPLN675_Land_Use_Modeling_Desktop/Midterm/Derived Files/boise_fishnet.shp", geometry?=?TRUE)
```

### Dependent Variable: Flooding Maps

Now, we visualize our two geographies with their outlays of historical
flooding. Considering historical flooding is the dependent variable
which all of our factors in our model are trying to predict, it is good
to first understand where flooding occurs as a baseline of our model's
success.

```{r hist_flooding, echo=TRUE}
grid.arrange(ncol = 2,
  ggplot() +
    labs(title="Calgary, Historical Flooding and Hydrology",
       subtitle="Calgary; Yellow = Historical flooding observed")+
  geom_sf(data = calgary_boundary,
           color = "black")+
    geom_sf(data=calgary_master_fish  %>% 
               filter(hist_flood == 6),
           fill="yellow",alpha = 0.6,color='transparent') +
  # geom_sf(data=calgary_hydrology,
   #        fill = 'light blue')+
    mapTheme,
  
  ggplot() +
    geom_sf(data = boise_boundary,
           color = "black")+
    geom_sf(data=boise_master_fish  %>% 
               filter(hist_flood == 6),
           fill="yellow",alpha = 0.6,color='transparent') +
   #geom_sf(data=boise_hydrology1,
    #       fill = 'light blue')+
    # geom_sf(data=boise_hydrology2,
     #     fill = 'light blue')+
    mapTheme+
  labs(title="Boise, Historical Flooding and Hydrology",
       subtitle="Boise; Yellow = Historical flooding observed"))
```

### Importing Other Files for Factors and Map Elements

Here, we import our spatial data to map our factors and train our model.
Most of this data was cleaned, derived, and assembled using ArcGIS Pro.

```{r fish_shps, message=FALSE, warning=FALSE, include=FALSE, results='hide'}


#SOIL DRAINAGE
calgary_drain_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_DRAINAGE.geojson")
  calgary_drain_fish <- calgary_drain_fish %>% 
    st_transform(crs = 3395)
  
boise_drain_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_DRAINAGE.geojson")
  boise_drain_fish <- boise_drain_fish %>% 
    st_transform(crs = 3395)
    
#FLOW ACCUMULATION -- USES MASTER FILE!!
calgary_flow_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_MASTERFISH.geojson")
  calgary_flow_fish <- calgary_flow_fish %>% 
    st_transform(crs = 3395)

boise_flow_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_FLOW_ACCU.geojson")

#IMPERVIOUS v. PERVIOUS SURFACES
calgary_surf_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_SURFACE.geojson")
  calgary_surf_fish <- calgary_surf_fish %>% 
    st_transform(crs = 3395) 

boise_surf_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_SURFACE.geojson")

#DISTANCE TO WATER BODY
calgary_disth2o_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_DIST_TO_WATER.geojson")
  calgary_disth2o_fish <- calgary_disth2o_fish %>% 
    st_transform(crs = 3395)
  
boise_disth2o_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_DIST_TO_WATER.geojson")
```

# Mapping

#### Calgary

Here, we create maps of our four factors for our base geography,
Calgary.

```{r calgary_factor_maps, echo=FALSE, message=FALSE, warning=FALSE, img_width = 10 }
###Calgary
  #SOIL DRAINAGE -- Good
  ggplot() +
    geom_sf(data = calgary_boundary,
           color = "black")+
    geom_sf(data = calgary_drain_fish,
            aes(fill = DrainageRa),
            color = "transparent")+
    scale_fill_viridis('Soil Drainage Efficacy (6 = best)',
                       direction = -1, # Direction of the color palette - either 1 or -1
                       option = 'F',  # Viridis has multiple palettes to choose from
                       alpha = 0.6)+  
    labs(title="Soil Type & Drainage") +
    mapTheme
 

```

```{r}
#IMPERVIOUS v. PERVIOUS SURFACES -- Good
ggplot() +
  geom_sf(data = calgary_boundary,
           color = "black")+
  geom_sf(data = calgary_surf_fish,
          aes(fill = Pervious),
          color = "transparent")+
  scale_fill_viridis('Surface Permeability (6 = least)',
                     direction = -1, # Direction of the color palette - either 1 or -1
                     option = 'F',  # Viridis has multiple palettes to choose from
                     alpha = 0.6)+  
  labs(title="Surface Permeability") +
  mapTheme

```

```{r}

#DISTANCE TO WATER BODY -- Good
ggplot() +
  geom_sf(data = calgary_boundary,
          color = "black")+
  geom_sf(data = calgary_disth2o_fish,
          aes(fill = gridcode),
          color = "transparent")+
  scale_fill_viridis('Distance to Water (6 = closest)',
                     direction = -1, # Direction of the color palette - either 1 or -1
                     option = 'F',
                     alpha = 0.6)+  
  labs(title="Distance to Water Body") +
  mapTheme

```

```{r}

  #FLOW ACCUMULATION -- good
  ggplot() +
    geom_sf(data = calgary_boundary,
            color = "black")+
    geom_sf(data = calgary_flow_fish,
            aes(fill = FlowAcc),
            color = "transparent")+
    scale_fill_viridis('Flow Accumulation (6 = greatest)',
                       direction = -1, 
                       option = 'F', 
                       alpha = 0.6)+  
    labs(title="Water Flow Accumulation") +
    mapTheme
  
```




#### Boise

Likewise, here we visualize the same four factors within Boise's
boundaries.

```{r boise_factor_maps, echo=TRUE, message=FALSE, warning=FALSE}
###Boise
  #SOIL DRAINAGE -- Good
  ggplot() +
   geom_sf(data = boise_boundary,
                         fill = "transparent", color = "black")+
                 geom_sf(data = boise_drain_fish,
                         aes(fill = DrainRank),
                         color = "transparent")+
                 scale_fill_viridis('Soil Drainage Efficacy (6 = best)',
                                    direction = -1, # Direction of the color palette - either 1 or -1
                                    option = 'F',  # Viridis has multiple palettes to choose from
                                    alpha = 0.6)+  
                 labs(title="Soil Type & Drainage") +
                 mapTheme
```


```{r}
               
#FLOW ACCUMULATION -- good
  ggplot() +
   geom_sf(data = boise_boundary,
                         fill = "transparent", color = "black")+
                 geom_sf(data = boise_flow_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Flow Accumulation (6 = greatest)',
                                    direction = -1, 
                                    option = 'F',  
                                    alpha = 0.6)+  
                 labs(title="Water Flow Accumulation") +
                 mapTheme

```

```{r}
               
#IMPERVIOUS v. PERVIOUS SURFACES -- good
  ggplot() +
    geom_sf(data = boise_boundary,
                         color = "black")+
                 geom_sf(data = boise_surf_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Surface Permeability (6 = least)',
                                    direction = -1, 
                                    option = 'F',  
                                    alpha = 0.6)+  
                 labs(title="Surface Permeability") +
                 mapTheme

```

```{r}
               
#DISTANCE TO WATER BODY -- good
ggplot() +
  geom_sf(data = boise_boundary,
                         color = "black")+
                 geom_sf(data = boise_disth2o_fish,
                         aes(fill = gridcode),
                         color = "transparent")+
                 scale_fill_viridis('Distance to Water (6 = closest)',
                                    direction = -1, # Direction of the color palette - either 1 or -1
                                    option = 'F',
                                    alpha = 0.6)+  
                 labs(title="Distance to Water Body") +
                 mapTheme
```

### Data Wrangling

Here, we clean the tabular data underlying the spatial data to prepare
it for running through the model.

```{r calgary_data_manipulation, include=FALSE}
calgary_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/calgary_geojsons/calgary_geojsons/CALGARY_MASTERFISH.geojson")

  calgary_vars <- 
    calgary_master_fish %>% 
    dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr,OBJECTID)
  
  calgary_vars <- calgary_vars %>% mutate(hist_flood=recode(hist_flood, '1'='0', '6'='1'))
  calgary_vars <- calgary_vars %>% mutate(hist_flood = as.numeric(hist_flood))
  
```

### Plotting

In these plots, we see how our factors in Calgary average out in areas
within the city that have flooding versus those that have not flooded.
We notice that areas that have soil with better drainage efficacy
(DrainRank) have been more likely to not flood. Likewise, areas with
lots of flow accumulation are more likely to flood than areas with
minimal accumulation.

```{r flood_or_not_plots, include=FALSE}
  calgary_plot <-
    calgary_vars %>% 
    as.data.frame() %>% 
    dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr) %>% 
      gather(variable, value, -hist_flood)
 
  # 6 = previously flooded ; 1 = not flooded
       ggplot(calgary_plot %>%
                group_by(hist_flood, variable) %>%
                summarize(mean = mean(value))) + 
         geom_bar(aes(as.factor(hist_flood), 
                      mean, 
                      fill=as.factor(hist_flood)),
                  stat="identity") + 
         facet_wrap(~variable, ncol=5, scales = "free") +
         scale_fill_manual(values = c("#669999", "#00CCCC"),
                           labels = c("Not Flooded","Flooded"),
                           name = "") +
         labs(x="Flooded or Not", y="Value")
```

# Model Building

Here, we walk through the steps of building and training our model, and
then applying that model to our other geography, Boise.

```{r model_train, include=FALSE}

trainIndex1 <- createDataPartition(calgary_vars$hist_flood, 
                                  p = .70,
list = FALSE,
times = 1)
       
       calgaryTrain1 <- calgary_vars[ trainIndex1,]
       calgaryTest1  <- calgary_vars[-trainIndex1,]
```

### Binomial Model

Here, we employ a glm linear model, which finds that soil drainage
efficacy ("DrainRank") and Flow Accumulation ("FlowAcc") are extremely
statistically significant in their relationship to Calgary's historical
flooding. Distance to water body ("disth2o_gr") was also found to be
statistically significant. Surface permeability ("Pervious"), however,
did not have a very strong relationship with observed flooding in
Calgary.

```{r binomial, echo=TRUE}


calgaryModel <- glm(hist_flood ~ DrainRank + FlowAcc + Pervious + disth2o_gr,
                            family="binomial"(link="logit"), data = calgaryTrain1 %>%
                            as.data.frame() %>% dplyr::select(-geometry,-OBJECTID,hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr))                     

summary(calgaryModel)

```

### Model Validation

The items below interrogate how successful our model is at predicting
Calgary's flooding, using several different visualizations and
statistical evaluations. In both the histogram and filled line graph,
our predictions are that flooded areas have about a .5 chance of
flooding, while there is more nuance for areas that do not have
historical flooding.

```{r validation, echo=TRUE}

classProbs <- predict(calgaryModel, calgaryTest1, type="response")

hist(classProbs)

testProbs <- data.frame(obs = as.numeric(calgaryTest1$hist_flood),
                        pred = classProbs)

ggplot(testProbs, aes(x = pred, fill=as.factor(obs))) + 
  geom_density() +
  facet_grid(obs ~ .) + 
  xlab("Probability") + 
  geom_vline(xintercept = .5) +
     scale_fill_manual(values = c("#669999", "#00CCCC"),
                           labels = c("Not Flooded","Flooded"),
                           name = "") +
    plotTheme
  

```

### Confusion Metrics

The confusion metrics predict a high degree of accuracy with our Calgary
model, while still generating false positives and false negatives.

```{r confusion_matrix, echo=TRUE, message=FALSE, warning=FALSE}
testProbs$predClass  = ifelse(testProbs$pred > .05 ,1,0)
#predicting that if its greater than 50% its a 1, less is 0

caret::confusionMatrix(reference = as.factor(testProbs$obs), 
                       data = as.factor(testProbs$predClass), 
                       positive = "1")



## How it works:
  # **Predicted = 0, Observed = 0 -> True Negative**
  # **Predicted = 1, Observed = 1 -> True Positive**
  #  **Predicted = 1, Observed = 0 -> False Positive**
  #  **Predicted = 0, Observed = 1 -> False Negative**
    # **1. Sensitivity - the proportion of actual positives (1's) that were predicted to be positive. Also known as "true positive rate".**
    # **2. Specificity - The proportion of actual negatives (0's) that were predicted to be negatives. Also known as "true negative rate".**
```

### ROC Curve

Given that the ROC Curve is not bowed out to the top and left edges, the
model is not tremendously over-fit.

```{r roc_curve, echo=FALSE}

ggplot(testProbs, aes(d = obs, m = pred)) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') 
```

```{r auc, echo=FALSE, warning=FALSE}
auc(testProbs$obs, testProbs$pred)
```

### Cross Validation

Cross validation tests the strength of our model on sample data. We test
our model on 100 test sets, and then we estimate the strength of our
model.

```{r k_fold, echo=TRUE, message=FALSE, warning=FALSE}
ctrl<- trainControl(method = "cv", 
                     number = 100, 
                     p = 0.7, 
                     savePredictions = TRUE)

cvFit <- train(as.factor(hist_flood) ~ .,  data =
                 calgary_vars %>% 
                 as.data.frame() %>% 
              dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr), 
               method="glm", family="binomial",
               trControl = ctrl)

cvFit

```

The histogram of our cross validation shows that all tests of our model
are highly accurate.

```{r cv_hist, echo=TRUE, message=FALSE, warning=FALSE}
ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram() +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Accuracy",
       y="Count")+
  plotTheme
```

### Attaching Predictions to Calgary Model

In the final step of building our Calgary model, we attach our model's
predictions to our original Calgary fishnet...

```{r predict_whole, message=FALSE, warning=FALSE, include=FALSE}

allPredictions <- 
  predict(cvFit, calgary_vars, type="prob")[,2]

calgary_vars <- 
  cbind(calgary_vars,allPredictions) %>%
  mutate(allPredictions = round(allPredictions * 100)) 


calgary1 <- calgary_vars %>%
 mutate(PredClass = ifelse(allPredictions > 10, 1, 0))

calgary1 <- calgary1 %>%
 mutate(Correct = ifelse(PredClass == hist_flood, "1", "0"),
         Incorrect = ifelse(PredClass != hist_flood, "1", "0"))

```

### Visualizing Calgary Predictions

...and we visualize our model's prediction of flooding against the
historical data of observed flooding.

```{r predicted_map1, echo=TRUE, message=FALSE, warning=FALSE}
ggplot() + 
  geom_sf(data=calgary1, aes(fill=factor(ntile(allPredictions,5))), 
          colour=NA) +
   scale_fill_manual(values=c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(calgary1$allPredictions,
                                                 c(0, .25, .5, .75, 1),
                                                 na.rm=T)),
                    name="Predicted Quintile\nProbabilities(%)\n(Darkest=\nHighest)") +
    geom_sf(data=calgary1  %>% 
               filter(hist_flood == 1), 
               fill="yellow",alpha = .4, colour="NA") +
  mapTheme +
  labs(title="Observed and Predicted Flooding Areas",
       subtitle="Calgary; Yellow = Historical flooding; Gradient = Predicted flooding (Darkest = Most)")
```

We see that our Calgary model has some unusual vertical striping. We are
not sure why this happened, but suspect it bay be the result of
incongruity units among our datasets visualized. We elaborate below as
to the potential implications of this, and how we would go about
adjusting our model to try to rectify this incorrect visualization.

### Calgary Confusion Metrics:

By running confusion metrics, we see that our model predicts true
negatives and true positives well. However, it has a lot of false
positives because soil drainage efficacy appears to be over-weighted,
affecting the shape of the model and the confusion metrics
visualization.

```{r error_map, echo=TRUE, message=FALSE, warning=FALSE}
calgary1 %>%
  mutate(confResult=case_when(allPredictions < 4 & hist_flood==0 ~ "True_Negative",
                              allPredictions >= 1 & hist_flood==1 ~ "True_Positive",
                              allPredictions < 4 & hist_flood==1 ~ "False_Negative",
                              allPredictions >= 4 & hist_flood==0 ~ "False_Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("Red","Light Blue","Orange","Pink"),
                    name="Outcomes")+
  labs(title="Calgary Confusion Metrics") +
  mapTheme

```

### Data Wrangling - Boise

Here, we pull in our Boise data, which is our original fishnet, but
fully populated with Boise's data on the four factors on which we
trained our Calgary model.

```{r echo=TRUE}

boise_master_fish <- st_read("https://raw.githubusercontent.com/ana-oso/CalgaryvBoise/0ab4aadf9e7d83409aefd5ac8dab3e6515240b1f/geojsons/boise_geojsons/boise_geojsons/BOISE_MASTERFISH.geojson")

boise_vars <- 
    boise_master_fish %>% 
    dplyr::select(hist_flood,DrainRank,FlowAcc,Pervious,disth2o_gr,OBJECTID)
  
boise_vars <- boise_vars %>% mutate(hist_flood=recode(hist_flood, '1'='0', '6'='1'))
boise_vars <- boise_vars %>% mutate(hist_flood = as.numeric(hist_flood))

boise_allPredictions <- 
  predict(cvFit, boise_vars, type="prob")[,2] 

boise1 <- 
  cbind(boise_vars,boise_allPredictions) %>% 
  mutate(boise_allPredictions = round(boise_allPredictions * 100))


```

# Testing the Calgary Flood Model on Boise

Here, we apply our Calgary flood model to our other geography, Boise. We
attach the prediction values, trained on the Calgary model, to Boise,
based on Boise's spatial distribution of the four factors that we input
to our Calgary model.

```{r echo=TRUE}
ggplot() + 
  geom_sf(data=boise1, aes(fill=factor(ntile(boise_allPredictions,5))), 
          colour=NA) +
  scale_fill_manual(values = c("#DEF5E5FF","#49C1ADFF","#357BA2FF","#3E356BFF","#0B0405FF"),
                    labels=as.character(quantile(boise1$boise_allPredictions,
                                                            c(0, .25, .5, .75, 1),
                                                 na.rm=T)),
                    name="Predicted Quintile\nProbabilities(%)\n(Darkest=\nHighest)") +
    geom_sf(data=boise1  %>% 
               filter(hist_flood == 1), 
               fill="yellow",alpha = 0.5, colour="NA") +
 coord_sf(xlim = c(-12955270, -12923270), ylim = c(5352033, 5394976), expand = FALSE) +
  mapTheme +
  labs(title="Observed and Predicted Flooding Areas",
       subtitle="Boise; Yellow = Historical flooding; Gradient = Predicted flooding (Darkest = Most)")

```

# Results

Unfortunately, it does not seem that our model works too well on Boise!
The areas where our model predicted that Boise would flood have
historically been free of flooding. While this may be a result of
inaccurate/uncorrelated factors, it could also be the result of data
incongruities that caused the visible, vertical striping on our Calgary
model.

### NEXT STEPS

If we were to revisit our model to tune it to be more accurate, we would
train it to be more sensitive to surface permeability, and to
de-prioritize soil drainage efficacy, which influenced the whole shape
of the prediction in ways that are not accurate to the historical flood
patterns. In order to do this, we might need to use a multivariate model
that allows us to manually weight one factor more highly against
another.

As a fist step in diagnosing why the model did not work very well on
Boise, would run the confusion metrics on Boise to see where there is
misalignment.
