This document is available here: http://alexknorre.com/files/geodata-with-R-30jun2022.html

Qick start: map of Philly

ph <- get_acs(state = "PA",
              county = "Philadelphia",
              geography = "tract",
              variables = "B19013_001",
                     geometry = TRUE,cache_table = T,
                     output = "wide") %>% 
  mutate(income = B19013_001E)
mapview(ph, zcol = "income")
ggplot(ph) +
  geom_sf(aes(fill = income)) +
  theme_minimal()

Geo data basics

The problem with maps: Earth is an ellipsoid

Various ways of projecting an ellipsoid (or its part) onto a plane

Source: http://practicalgeoskills.blogspot.com/2020/04/map-projections-meaning-and-examples.html

All projections have their own trade-offs

Mercator projection is the most popular one (Google Maps). It’s conformal: a direction (for example, compass arrow) on such map is a straight line. But it distorts areas near poles.

https://www.thetruesize.com

Coordinates

Cartesian (X-Y on a plane) vs spherical/ellipsoidal (latitude and longitude)

Source: Edzer Pebesma, Roger Bivand (CC BY-NC-ND 4.0)

Datum: basis (form of Earth). WGS 84, NAD 83.

Coordinate Reference Systems (CRS)

(Source: https://www.earthdatascience.org/courses/earth-analytics/spatial-data-r/intro-to-coordinate-reference-systems/)

Let’s try to replicate this and play with various CRS using shapefile of states from the US Census website:

# Source: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_20m.zip
us_map <- read_sf("~/Downloads/cb_2018_us_state_20m/")
us_map
## Simple feature collection with 52 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
## Geodetic CRS:  NAD83
## # A tibble: 52 × 10
##    STATEFP STATENS  AFFGEOID    GEOID STUSPS NAME         LSAD     ALAND  AWATER
##    <chr>   <chr>    <chr>       <chr> <chr>  <chr>        <chr>    <dbl>   <dbl>
##  1 24      01714934 0400000US24 24    MD     Maryland     00     2.52e10 6.98e 9
##  2 19      01779785 0400000US19 19    IA     Iowa         00     1.45e11 1.08e 9
##  3 10      01779781 0400000US10 10    DE     Delaware     00     5.05e 9 1.40e 9
##  4 39      01085497 0400000US39 39    OH     Ohio         00     1.06e11 1.03e10
##  5 42      01779798 0400000US42 42    PA     Pennsylvania 00     1.16e11 3.39e 9
##  6 31      01779792 0400000US31 31    NE     Nebraska     00     1.99e11 1.37e 9
##  7 53      01779804 0400000US53 53    WA     Washington   00     1.72e11 1.26e10
##  8 72      01779808 0400000US72 72    PR     Puerto Rico  00     8.87e 9 4.92e 9
##  9 01      01779775 0400000US01 01    AL     Alabama      00     1.31e11 4.59e 9
## 10 05      00068085 0400000US05 05    AR     Arkansas     00     1.35e11 2.96e 9
## # … with 42 more rows, and 1 more variable: geometry <MULTIPOLYGON [°]>
us_map %>% 
    ggplot() +
    geom_sf() +
    theme_minimal()

It does not look good: Alaska, Hawaii, and Puerto Rico are too far away from contiguous states. Let’s try removing them:

us_smaller <- us_map %>% filter(!(NAME %in% c("Alaska","Hawaii", "Puerto Rico")))

us_smaller %>% 
  ggplot() +
  geom_sf() +
  theme_minimal()

We can now try projecting states into different CRS:

p1<-us_smaller %>% 
  st_transform(crs = 3857) %>% 
  ggplot() +
  geom_sf() +
  labs(title = "EPSG: 3857 (Web Mercator,\nprojection (X-Y),\nused in Google Maps)")+
  theme_minimal()

p2<-us_smaller %>% 
  st_transform(crs = 4326) %>% 
  ggplot() +
  geom_sf() +
  labs(title = "EPSG: 4326 (WGS 84,\nelliptical (lot-lan),\nused in Google Earth)")+
  theme_minimal()

p3<-us_smaller %>% 
  st_transform(crs = 2955) %>% 
  ggplot() +
  geom_sf() +
  labs(title = "EPSG: 2955 (UTM projection, Zone 11N)")+
  theme_minimal()

p4<-us_smaller %>% 
  st_transform(crs = 2163) %>% 
  ggplot() +
  geom_sf() +
  labs(title = "EPSG: 2163 (U.S. National Atlas)")+
  theme_minimal()

ggarrange(p1,p2,p3,p4)

Spatial data structures

Points, lines, polygons

Points

shootings <- read.csv("https://phl.carto.com/api/v2/sql?q=SELECT+*,+ST_Y(the_geom)+AS+lat,+ST_X(the_geom)+AS+lng+FROM+shootings&filename=shootings&format=csv&skipfields=cartodb_id")

shootings_sf <- shootings %>% 
  filter(!is.na(lat) & !is.na(lng)) %>% 
  filter(lat > 30) %>% 
  st_as_sf(coords = c("lng","lat"), crs = 4326)

shootings_sf %>% select(location, geometry)
## Simple feature collection with 11996 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.27362 ymin: 39.88576 xmax: -74.95936 ymax: 40.1173
## Geodetic CRS:  WGS 84
## First 10 features:
##                  location                   geometry
## 1      2700 BLOCK REED ST  POINT (-75.18942 39.9357)
## 2   3000 BLOCK MIFFLIN ST POINT (-75.19553 39.93023)
## 3  2100 BLOCK MOUNTAIN ST POINT (-75.18002 39.93117)
## 4  2100 BLOCK MOUNTAIN ST POINT (-75.18002 39.93117)
## 5    1500 BLOCK S 23RD ST POINT (-75.18287 39.93339)
## 6     1600 BLOCK Annin St  POINT (-75.1715 39.93688)
## 7    1200 BLOCK S 22ND ST POINT (-75.18051 39.93611)
## 8    1600 BLOCK S 18TH ST POINT (-75.17497 39.93031)
## 9     600 BLOCK S 27TH ST POINT (-75.18427 39.94566)
## 10     2400 BLOCK REED ST POINT (-75.18425 39.93504)

Let’s map it:

shootings_sf %>% 
  ggplot() +
  geom_sf()

Let’s make it more aesthetically pleasing:

shootings_sf %>% 
  ggplot() +
  geom_sf(lwd = 0.3,alpha = 0.3) +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "br", which_north = "true", 
                          pad_x = unit(0.1, "in"),
                          pad_y = unit(0, "in"),
                          style = north_arrow_fancy_orienteering,height = unit(.3, "in"),
                          width = unit(0.3, "in")) +
     labs(title = "Shootings in Philadelphia") +
     theme_minimal()

Instead of ggplotting, we can also use interactive visualizations:

shootings_sf_selected <- shootings_sf %>% 
  select(date_,location, age,sex,race,fatal)

mapview(shootings_sf_selected,zcol = "fatal")

Polygons and tidycensus

tidycensus is a package to programmatically download data from US Census bureau (decennials censuses and American Community Survey data).

Before using it on your computer, you need to set up an API key: (for this, run ?tidycensus::census_api_key())

Let’s look at the variables ACS 2019 (5 yr) can give us:

acs_vars <- load_variables(2019, "acs5", cache = TRUE)

head(acs_vars)
## # A tibble: 6 × 3
##   name       label                                   concept   
##   <chr>      <chr>                                   <chr>     
## 1 B01001_001 Estimate!!Total:                        SEX BY AGE
## 2 B01001_002 Estimate!!Total:!!Male:                 SEX BY AGE
## 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years  SEX BY AGE
## 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years   SEX BY AGE
## 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
## 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE
ph <- get_acs(state = "PA",
              county = "Philadelphia",
              geography = "tract",
              variables = "B19013_001",
                     geometry = TRUE,cache_table = T,
                     output = "wide") %>% 
  mutate(income = B19013_001E)
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
ph
## Simple feature collection with 384 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.13799
## Geodetic CRS:  NAD83
## First 10 features:
##          GEOID                                                   NAME
## 1  42101014500    Census Tract 145, Philadelphia County, Pennsylvania
## 2  42101004202  Census Tract 42.02, Philadelphia County, Pennsylvania
## 3  42101004102  Census Tract 41.02, Philadelphia County, Pennsylvania
## 4  42101000804   Census Tract 8.04, Philadelphia County, Pennsylvania
## 5  42101025300    Census Tract 253, Philadelphia County, Pennsylvania
## 6  42101028100    Census Tract 281, Philadelphia County, Pennsylvania
## 7  42101008602  Census Tract 86.02, Philadelphia County, Pennsylvania
## 8  42101030700    Census Tract 307, Philadelphia County, Pennsylvania
## 9  42101031101 Census Tract 311.01, Philadelphia County, Pennsylvania
## 10 42101025900    Census Tract 259, Philadelphia County, Pennsylvania
##    B19013_001E B19013_001M                       geometry income
## 1        17991        6274 MULTIPOLYGON (((-75.15131 3...  17991
## 2        47935       14670 MULTIPOLYGON (((-75.15614 3...  47935
## 3        40958       12742 MULTIPOLYGON (((-75.16396 3...  40958
## 4       108293       11643 MULTIPOLYGON (((-75.17067 3... 108293
## 5        41545        3413 MULTIPOLYGON (((-75.18653 4...  41545
## 6        35591       10534 MULTIPOLYGON (((-75.15237 4...  35591
## 7        39394        7829 MULTIPOLYGON (((-75.22156 3...  39394
## 8        49906       11268 MULTIPOLYGON (((-75.0935 40...  49906
## 9        38292        9285 MULTIPOLYGON (((-75.08267 4...  38292
## 10       46548        8716 MULTIPOLYGON (((-75.1809 40...  46548

Let’s try intersecting shootings with census tracts. In other words, let’s project points onto polygons:

shootings_on_tracts <- st_join(ph, shootings_sf_selected)

“Error in st_geos_binop(”intersects”, x, y, sparse = sparse, prepared = prepared, : st_crs(x) == st_crs(y) is not TRUE”

Two datasets have different CRS systems. We need to reproject one into another’s CRS.

st_crs(ph)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]
st_crs(shootings_sf_selected)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
ph_wgs84 <- ph %>% 
  st_transform(crs = 4326)
st_crs(ph_wgs84)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         MEMBER["World Geodetic System 1984 (G2139)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

Let’s first attach shootings to each of the tracts:

shootings_on_tracts <- st_join(shootings_sf_selected, ph_wgs84, join = st_within, left = T) 

Then, let’s count how many shootings occured in each of the tracts:

shootings_by_tracts <- shootings_on_tracts %>% count(GEOID)
shootings_by_tracts
## Simple feature collection with 347 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -75.27362 ymin: 39.88576 xmax: -74.95936 ymax: 40.1173
## Geodetic CRS:  WGS 84
## First 10 features:
##          GEOID  n                       geometry
## 1  42101000100  2 MULTIPOINT ((-75.14644 39.9...
## 2  42101000200  6 MULTIPOINT ((-75.15631 39.9...
## 3  42101000300  1     POINT (-75.16274 39.95698)
## 4  42101000402 11 MULTIPOINT ((-75.16585 39.9...
## 5  42101000500  5 MULTIPOINT ((-75.15973 39.9...
## 6  42101000600  7 MULTIPOINT ((-75.15734 39.9...
## 7  42101000700  7 MULTIPOINT ((-75.16614 39.9...
## 8  42101000804  2 MULTIPOINT ((-75.16616 39.9...
## 9  42101000901  6 MULTIPOINT ((-75.16319 39.9...
## 10 42101001002  6 MULTIPOINT ((-75.14439 39.9...

Now we have table with 347 tract labels and each has a count of shootings. Let’s merge it with tracts:

shootings_by_tracts_no_geometry <- shootings_by_tracts %>% st_drop_geometry()
tracts_with_shooting_counts <- merge(ph_wgs84,shootings_by_tracts_no_geometry)
tracts_with_shooting_counts
## Simple feature collection with 347 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.1236
## Geodetic CRS:  WGS 84
## First 10 features:
##          GEOID                                                  NAME
## 1  42101000100     Census Tract 1, Philadelphia County, Pennsylvania
## 2  42101000200     Census Tract 2, Philadelphia County, Pennsylvania
## 3  42101000300     Census Tract 3, Philadelphia County, Pennsylvania
## 4  42101000402  Census Tract 4.02, Philadelphia County, Pennsylvania
## 5  42101000500     Census Tract 5, Philadelphia County, Pennsylvania
## 6  42101000600     Census Tract 6, Philadelphia County, Pennsylvania
## 7  42101000700     Census Tract 7, Philadelphia County, Pennsylvania
## 8  42101000804  Census Tract 8.04, Philadelphia County, Pennsylvania
## 9  42101000901  Census Tract 9.01, Philadelphia County, Pennsylvania
## 10 42101001002 Census Tract 10.02, Philadelphia County, Pennsylvania
##    B19013_001E B19013_001M income  n                       geometry
## 1       103585       17012 103585  2 MULTIPOLYGON (((-75.15154 3...
## 2        49871       17900  49871  6 MULTIPOLYGON (((-75.16234 3...
## 3        86296       15622  86296  1 MULTIPOLYGON (((-75.17994 3...
## 4        78947        8808  78947 11 MULTIPOLYGON (((-75.17329 3...
## 5        45438        9588  45438  5 MULTIPOLYGON (((-75.16506 3...
## 6        84632       12428  84632  7 MULTIPOLYGON (((-75.16401 3...
## 7        66923       16742  66923  7 MULTIPOLYGON (((-75.18091 3...
## 8       108293       11643 108293  2 MULTIPOLYGON (((-75.17067 3...
## 9        48929       10484  48929  6 MULTIPOLYGON (((-75.16475 3...
## 10      113431       16717 113431  6 MULTIPOLYGON (((-75.15034 3...
tracts_with_shooting_counts %>% 
  ggplot() +
  geom_sf(aes(fill = n))

mapview(tracts_with_shooting_counts, zcol = "n")

Nice thing about mapview is that you can stack several geometries:

mapview(tracts_with_shooting_counts, zcol = "n") +
  mapview(shootings_sf_selected, zcol = "fatal")

Is there correlation between median household income and shootings in areas?

plot(tracts_with_shooting_counts$income, tracts_with_shooting_counts$n)