December 15, 2017
The other night I attended an R-Ladies meetup where Kaylene McClanahan presented a brief into to R Markdown and set-up a mini-hackathon-style coding session with NYC noise complaint data she prepared. The group I was in was working on mapping, although we didn't have enough time to do everything we wanted, so below I demonstrate how density data can be calculated and used in combination with ggmap()
to produce interesting and useful visualizations.
Note: Code can be found on GitHub for the project (includuing the .Rhtml file used to make this article), and the specific density weighting code can be found here.
First, let's load the data and calculate the response time from the created date
and closed date
variables. Additionally, let's also filter out points with negative response time, response time over 50 hours (unrealistic), and anything with missing coordinates. I will be using base R for all of the (limited) data cleaning required here.
library(ggplot2) library(ggmap) #this library produces color palettes that are easier to discern library(cetcolor) source('weighted_map_raster.r') party_ready = read.csv('data/party_ready.csv') party_sideways = read.csv('data/party_sideways.csv') party_ready$duration = as.numeric(as.POSIXct(party_ready$closed_date) - as.POSIXct(party_ready$created_date)) party_ready_realistic = with(party_ready, party_ready[duration > 0 & duration < 3000 & !is.na(longitude),])
Above I loaded some code I wrote to calculate density values for a map of new york. Below, I will use them as a rasterized overlay of a map using geom_raster()
, which is much, much faster (~100 times) than using geom_tile()
if it is possible to use it. Note that you must use coord_cartesian()
after ggmap()
, since rasterized graphics don't work with non-cartesian projections.
default_density = createWeightedGrid(party_ready_realistic)
## [1] 0.7580717 ## [1] 0.003790359 ## [1] 0.005
weighted_density = createWeightedGrid(party_ready_realistic, variable='duration')
## [1] 0.7580717 ## [1] 0.003790359 ## [1] 0.005
weighted_density$normalized = weighted_density$weight/(1e-10 + default_density$weight) weighted_density$default_density = default_density$weight
The resulting data frames look like this:
cat(kable(head(default_density)))
lat_index | lon_index | longitude | latitude | weight |
---|---|---|---|---|
1 | 1 | -74.25507 | 40.49382 | 0.0499639 |
2 | 1 | -74.25507 | 40.49882 | 0.3094801 |
3 | 1 | -74.25507 | 40.50382 | 3.9012009 |
4 | 1 | -74.25507 | 40.50882 | 3.9115544 |
5 | 1 | -74.25507 | 40.51382 | 3.8654261 |
6 | 1 | -74.25507 | 40.51882 | 3.8625678 |
cat(kable(head(weighted_density)))
lat_index | lon_index | longitude | latitude | weight | normalized | default_density |
---|---|---|---|---|---|---|
1 | 1 | -74.25507 | 40.49382 | 9.243324 | 185.0000 | 0.0499639 |
2 | 1 | -74.25507 | 40.49882 | 53.871410 | 174.0707 | 0.3094801 |
3 | 1 | -74.25507 | 40.50382 | 401.538563 | 102.9269 | 3.9012009 |
4 | 1 | -74.25507 | 40.50882 | 402.550277 | 102.9131 | 3.9115544 |
5 | 1 | -74.25507 | 40.51382 | 393.665244 | 101.8427 | 3.8654261 |
6 | 1 | -74.25507 | 40.51882 | 393.339337 | 101.8336 | 3.8625678 |
We will also load a map of New York City, along with defining a labeler for log-scaled legends and a constant.
nyc <- get_map('new york city, new york', maptype='toner', source='stamen', zoom=11) antilog_formatter = function(x){ round(exp(x)) } r5 = sqrt(5)
Below is a graph of the density of noise complaints within New York City.
ggmap(nyc) + coord_cartesian() + geom_raster(data=weighted_density[weighted_density$normalized > exp(0.101),], aes(x=longitude, y=latitude, fill=log( default_density), alpha=log(default_density))) + scale_alpha_continuous() + scale_fill_gradientn('Density of Complaints', colors=cet_pal(5), label=antilog_formatter, breaks = log(c(1/r5, 1,r5, 4, 4*r5, 20, 20*r5, 100, 100*r5 ))) + ggtitle('Density of Noise Complaints in NYC (per year)') + guides(alpha='none')
## Warning: Removed 1243 rows containing missing values (geom_raster).
Below is an (accurate) estimate of the average wait time for complaints. It seems that the northernmost region has an unusually high wait time.
ggmap(nyc) + coord_cartesian() + geom_raster(data=weighted_density[weighted_density$normalized > exp(0.101),], aes(x=longitude, y=latitude, fill=log( normalized), alpha=log(normalized))) + scale_alpha_continuous() + scale_fill_gradientn('Average Response Time (minutes)', colors=cet_pal(5), label=antilog_formatter) + ggtitle('Average Response Time to Noise Complaints in NYC') + guides(alpha='none')
## Warning: Removed 1243 rows containing missing values (geom_raster).
This is the code that I used to generate the density maps. It is about 2.5 times slower than the kde2d()
from the MASS
package, but its behavior is more explicit, and it can (most importantly) be used to calculate weighted averages. There is also a function kde2d.weighted
from the ggtern
package, but it has a small bug in it and I haven't figured out how to get the results to interpret properly. The code below is about 3-4 times slower, but it works well.
getMatrixElementsByLinearIndex <- function(mat, row_idx, col_idx){ nr = nrow(mat) return(mat[row_idx + nr*(col_idx-1)]) } ## Function creates a kernel density grid, either based on density of points (default) ## or a weighted density # variable - character of a variable that is used to weight the density # kernel_lat_sd - the standard deviation/bandwidth of the kernel in latitude units; default # is approximately 1 mile # tile_latitude_size - the length of each tile's latitude dimension; longitude will scale # with this to be roughly square # max_influence_sd - the maximum number of standard deviations in either axis that a point # can create influence # padding_lat - the amount of latitude added in each direction where the density will still # be calculated; padding for longitude is calculated from this # normalize_weights - if TRUE, then the kernel density will be normalized over each affected # region for each data point createWeightedGrid <- function(data, variable=NULL, kernel_lat_sd=1/69, tile_latitude_size=0.005, max_influence_sd = 2, padding_lat=0.005, normalize_weights=TRUE){ data = data[(!is.na(data$longitude)) & !is.na(data$latitude),] if (!is.null(variable)){ data = data[!is.na(data[,variable]),] } longitude = data$longitude latitude = data$latitude med_lon = mean(range(longitude)) med_lat = mean(range(latitude)) lon_scale_factor = cos(med_lat * pi/180) padding_lon = lon_scale_factor * padding_lat range_lon = range(longitude, na.rm=TRUE) + c(-1, 1) * padding_lon range_lat = range(latitude, na.rm=TRUE) + c(-1, 1) * padding_lat tile_longitude_size = tile_latitude_size * lon_scale_factor padding_lon = lon_scale_factor * padding_lat lat_tile_values = seq(range_lat[1], range_lat[2], tile_latitude_size) lon_tile_values = seq(range_lon[1], range_lon[2], tile_longitude_size) n_lat_tiles = length(lat_tile_values) n_lon_tiles = length(lon_tile_values) weight_matrix = matrix(0, ncol=n_lon_tiles, nrow=n_lat_tiles) lat_to_index <- function(latitude){ round(n_lat_tiles * (latitude - range_lat[1])/diff(range_lat)) + 1 } lon_to_index <- function(longitude){ round(n_lon_tiles * (longitude - range_lon[1])/diff(range_lon)) } lat_index_radius = round(n_lat_tiles * max_influence_sd * tile_latitude_size/diff(range_lat)) lon_index_radius = round(n_lon_tiles * max_influence_sd * tile_longitude_size/diff(range_lon)) calculate_latitude_bounds <- function(latitude){ lat_index = lat_to_index(latitude) bounds = c(pmax(lat_index - lat_index_radius, 1), pmin(lat_index+lat_index_radius, n_lat_tiles)) return(bounds[1]:bounds[2]) } calculate_longitude_bounds <- function(longitude){ lon_index = lon_to_index(longitude) bounds = c(pmax(lon_index - lon_index_radius, 1), pmin(lon_index+lon_index_radius, n_lon_tiles)) return(bounds[1]:bounds[2]) } if (!is.null(variable)){ x = data[,variable] } else{ x = rep(1, nrow(data)) } calculate_distance <- function(latitude, longitude, grid){ sqrt(((grid$longitude-longitude)*lon_scale_factor)^2 + (grid$latitude-latitude)^2) } for (i in 1:nrow(data)){ x_latitude = latitude[i] x_longitude = longitude[i] valid_lat_indexes = calculate_latitude_bounds(x_latitude) valid_lon_indexes = calculate_longitude_bounds(x_longitude) latlon_grid = expand.grid(lat_index = valid_lat_indexes, lon_index = valid_lon_indexes) latlon_grid$latitude = lat_tile_values[latlon_grid$lat_index] latlon_grid$longitude = lon_tile_values[latlon_grid$lon_index] new_weights = 1/(pi*kernel_lat_sd) * exp(-((calculate_distance(x_latitude, x_longitude, latlon_grid)))^2/ (2*kernel_lat_sd)) if (normalize_weights){ new_weights = new_weights/(sum(new_weights) + 1e-8) } weight_matrix[latlon_grid$lat_index + n_lat_tiles * (latlon_grid$lon_index-1)] = weight_matrix[latlon_grid$lat_index + n_lat_tiles * (latlon_grid$lon_index-1)] + new_weights * x[i] } final_grid = expand.grid(lat_index = 1:n_lat_tiles, lon_index=1:n_lon_tiles) final_grid$longitude = lon_tile_values[final_grid$lon_index] final_grid$latitude = lat_tile_values[final_grid$lat_index] final_grid$weight = getMatrixElementsByLinearIndex(weight_matrix, final_grid$lat_index, final_grid$lon_index) return(final_grid) }
Tags: