Shopper & shopkeeper survey methodology

This Rmarkdown document summarizes the methodology used for this small survey of shopkeepers and shoppers on Columbia Road between Mintwood Place & Biltmore Street. The document shows all of the data analysis conducted.

The map above shows the block on which the survey was done.

Methodology

I surveyed roughly 130 people on the sidewalk and 12 “shopkeepers” (both staff in stores and in restaurants) over several days between July 14 2024 and July 27 2024.

The specific methodology changed slightly over time (I realized I needed to ask a few additional questions as I conducted the survey), so I show the results here broken out a number of ways. These tweaks didn’t change the top-line results. If you omit the first day of surveying, for example, or only count people who self-reported as regular shoppers on the last day, the main result (that people in general over-estimate the prevalence of driving and under-estimate walking) remains unchanged.

So what was the methodology?

I did two surveys: one of staff in stores+restaurants, and one of people walking by on the street. I only surveyed people and businesses on the NW side of Columbia Rd between Mintwood and Biltmore (with one exception, Julia’s Empanadas, which I omit from all of the analysis, since it’s actually on 18th St. If you include the data from Julia’s, the results don’t change.)

You can find the survey document here: https://drive.google.com/drive/folders/1iljrrzEF3Ax8WQACVvPGmcji2t2A8Z3V?usp=sharing

That folder also has all of the raw data.

The shopkeeper survey questions were,

How would you estimate the percentage of each mode of transportation that your customers use to get to your business?
How do you get to work?

The passerby questions were,

How would you estimate the percentage of each mode of transportation that people use to get to this strip?
How did you get here today?
How many times a week do you shop on this strip?

On the first day of surveying, I only stopped people going into or out of So’s Your Mom. Originally, I was thinking of doing this survey on a store-by-store basis. But I quickly realized this methodology would take forever to implement; there just aren’t that many people going into a single individual store in any given hour, relative to the total number of people passing by on the street.

So on day two of surveying, I asked everyone who passed by if they’d take the survey.

After day two of surveying, I was worried that I wouldn’t be able to tell, based on the data I was collecting, who actually shopped at these stores, and who was just passing through.

So on day 3 I started asking people “do you actually shop on this strip?” Finally, on day 4, I asked a better, more specific version of this question: “how many times a month do you shop on this strip?”

I should’ve been asking that question from the beginning! But live and learn. For what it’s worth, the main result actually gets more pronounced if you only use the data from this last day of surveying, looking only at people that report being regular shoppers on the strip. (See analysis below).

What can we reasonably infer from this?

This is a relatively small study, so the results should be interpreted with caution, especially viz a viz the smaller subgroups. The main result seems to be that, on average, both passersby, regular shoppers, and staff at businesses tend to over-estimate how much people drive to visit these businesses, and under-estimate walking, transit use, and biking/scootering.

That said! These are averages. Some individual staff at businesses gave estimates that were actually pretty close to the average. And it’s possible that the staff at businesses that gave lower walking estimates are actually correct for their specific business (it might really be the case that more people drive to their business).

(It would be super interesting if any of these businesses collected a little data on how people get to them!)

Otherwise, staff appear more likely to live farther away and drive to the business. People also seem to have a slightly tendency to favor their own mode of transportation when estimating the average shares.

Other caveats

I don’t have hard data on this, but I wonder if drivers/car-users were slightly under-sampled; it seemed like they were less likely to walk down the block to their destination (often a restaurant) and more likely to hop out of an uber directly in front of it. This could’ve made them more likely to decline the survey (“sorry my reservation is for right now”) and less likely to pass my survey station (on the days I wasn’t posted directly in front of the restaurant, for example). That said, I was posted directly in front of Perry’s on day 4, a Friday evening, and the reported % of drivers was still only 12%, well below the 20-25% that passersby and shopkeepers estimated. And for that matter, you could use this same logic to assert that the numbers of cylists & scooter-ers are under-counted, since they were impossible to survey while biking/scootering, and they often lock-up right in front of their destination, which may be down the block from my survey station.

Finally, one should interpret the mode of transport shares carefully in general. It would be easy to look at these results and think “huh well I guess no one bikes or scooters” and then jump to “I guess no one wants to use those modes of transit.” But that would be falling victim to the survivorship fallacy. If there were better, safer biking infrastructure, more people would bike (there have been umpteen studies on this, ex: https://www.ncbi.nlm.nih.gov/search/research-news/13155/#.)

Of course the same could be said for driving! (If you turned Kalorama park into a parking lot, more people might drive to Lapis). Personally I don’t think increasing parking is a good idea, but you get the point: current travel patterns don’t necessarily reflect what we’d want in a better/ideal world.

If you have any questions about the survey methodology, I’d be happy to chat. I am a mega dork who enjoys this stuff. You can reach me at edwardpierrerodrigue _at gmail _dot com

Finally–huge thanks to everyone who participated in the survey! It was really cool to talk to you all!

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=18, fig.height=10) 

# load the packages we need
library(tidyverse)
library(ggplot2)

# Load the potential shopper data ("psd") from google drive. 
# You can download the original potential shopper survey data as a CSV file using the link below.
psd <- read.csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQrLc_Jw_iuH2UPN01VHxnGnrz8tIRZrrgSAxCxtN9JPu0FzNh-a7i6g14QKxfLkBC0cOt2N4b-zC0t/pub?gid=1149018298&single=true&output=csv')
## Now we'll add shopkeeper survey info (skd='shopkeeper data'):
skd <- read.csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQrLc_Jw_iuH2UPN01VHxnGnrz8tIRZrrgSAxCxtN9JPu0FzNh-a7i6g14QKxfLkBC0cOt2N4b-zC0t/pub?gid=0&single=true&output=csv')

# You can also find all of the data for shopkeepers and potential shoppers here: 
# https://drive.google.com/drive/folders/1iljrrzEF3Ax8WQACVvPGmcji2t2A8Z3V?usp=sharing

# first, remove the response from the cool guy that works at Julia's Empanadas. 
# I'd talked to him originally thinking the survey would cover more of the Adams Morgan / Columbia Rd area
# but it was a fair amount of work just focusing on the strip of Columbia between Mintwood and Biltmore
# so ultimately kept the survey focused on that small but defined area.
# (he estimated 80% of customers drive and 20% walk, so if you include his responses in the analysis, it just makes the main results more pronounced)
skd <- skd[skd$store != "Julias Empanadas",]

cat("\n\nBelow are some basic summary stats\n\n")

## 
## 
## Below are some basic summary stats

# Total number of hours spent surveying potential shoppers:
cat('Approximate total time spend surveying shoppers/passersby: ', sum(psd$total_time[psd$response_num==1]), 'hours.')

## Approximate total time spend surveying shoppers/passersby:  9.38 hours.

# Total number of responses:
cat('\nTotal number of shopper survey responses: ', nrow(psd))

## 
## Total number of shopper survey responses:  130

# Total number of responses, omitting the first day:
cat('\nTotal number of shopper responses, omitting the first day: ', nrow(psd[psd$date != 'July 14 2024',]))

## 
## Total number of shopper responses, omitting the first day:  100

# Total number of non-responses:
nonresponses <- as.numeric(as.character(unlist(
    psd %>% group_by(date) %>% summarise(nonresponses = max(num_non_response)) %>% summarize(sum = sum(nonresponses)))
    ))
cat('\nTotal number of shoppers that declined the survey: ', nonresponses)

## 
## Total number of shoppers that declined the survey:  248

# Total nonresponse rate:
cat('\nTotal nonresponse rate: ', round(nonresponses / (nrow(psd) + nonresponses) * 100, 0), "%\n", 
    'This is actually not a bad response rate!\n')

## 
## Total nonresponse rate:  66 %
##  This is actually not a bad response rate!

# Average responses per hour:
cat('\nAverage number of shopper responses per hour: ', round(nrow(psd) / sum(psd$total_time[psd$response_num==1]), 1))

## 
## Average number of shopper responses per hour:  13.9

# Total number of shopkeeper responses:
cat('\nTotal number of shopkeeper responses: ', nrow(skd))

## 
## Total number of shopkeeper responses:  12

## Clean up so NAs are zero in the perception columns
## (If someone didn't list a particular mode, that should count as "zero")
for (col_name in c("car.estimate..includes.uber.", "transit..bus.or.metro..estimate",
     "walk.estimate", "bike.estimate", "other.estimate",
     "did.not.specify.if.did.not.provide...that.sum.to.100.")) {
  for (ii in 1:nrow(psd)) {
    if ( is.na(psd[[col_name]][ii]) ) {
      psd[[col_name]][ii] <- 0
    }
  }
}

## group together biking and scootering. there was only one scooter respondent
# and they're similar modes and are both done in bike lanes
psd$persons_method_of_travel[psd$persons_method_of_travel == 'scooter'] <- 'bike'
psd$persons_method_of_travel[psd$persons_method_of_travel == 'bike'] <- 'bike or scooter'

# group together biking and scootering in the perceptions columns: 
psd$bike.estimate[psd$Notes %in% c('"other" means scooter', 'Other was "scooter"',
                                 'This person said "scooter" for the "other" category')] <- 
  psd$bike.estimate[psd$Notes %in% c('"other" means scooter', 'Other was "scooter"',
                                   'This person said "scooter" for the "other" category')] +
  psd$other.estimate[psd$Notes %in% c('"other" means scooter', 'Other was "scooter"',
                                   'This person said "scooter" for the "other" category')]

names(psd)[names(psd) == 'bike.estimate'] <- 'bike.or.scooter.estimate'

# remove those scooter values from the other category so we don't double count:
psd$other.estimate[psd$Notes %in% c('"other" means scooter', 'Other was "scooter"',
                                  'This person said "scooter" for the "other" category')] <- 0

The code below charts the data and shows how the results wiggle around if you subset the data in different ways. The main results don’t change.

Plot showing how shopkeepers get to work:

skd %>%
  group_by(how_respondent_gets_to_work) %>%
  mutate(count = n(),
         group_index=row_number()) %>%
  filter(group_index == 1) %>%
  ungroup() %>%
  mutate(percent = count / sum(count) * 100,
         how_respondent_gets_to_work = ifelse(test=how_respondent_gets_to_work=='scooter', 
                                              yes='bike or scooter', 
                                              no=how_respondent_gets_to_work)) %>%
  ggplot() +
  geom_bar(aes(x=how_respondent_gets_to_work, y=percent), stat='identity', fill='#109c37') +
  ggtitle('58% of shopkeepers interviewed drive to work') +
  xlab('How respondent gets to work') +
  labs(caption = paste0("N=", nrow(skd))) +
  theme_minimal() +
  theme(plot.title = element_text(size = 24, face = "bold"),
        axis.title = element_text(size=20),
        axis.text = element_text(size=20),
        plot.caption = element_text(size=20, hjust=0)) +
  scale_y_continuous(breaks = seq(0, 70, 10))

Plot showing how passersby were travelling, including all data:

plot_potential_shopper_mot <- function(df, title) {
  rv1 <-
    df %>%
  group_by(persons_method_of_travel) %>%
  mutate(count = n(),
         group_index=row_number()) %>%
  filter(group_index==1) %>%
  ungroup() %>%
  mutate(percent = count / sum(count) * 100) 
  
  rv2 <-
    rv1 %>%
  ggplot() +
  geom_bar(aes(x=persons_method_of_travel, y=percent), stat='identity', fill='#109c37') +
  ggtitle(title) +
  xlab('How shopper travelled') +
  theme_minimal() +
  theme(plot.title = element_text(size = 24, face = "bold"),
        axis.title = element_text(size=20),
        axis.text = element_text(size=20),
        plot.caption = element_text(size=20, hjust=0)) +
  scale_y_continuous(breaks = seq(0, max(rv1$percent)+10, 10)) +
  labs(caption = paste0("N=", nrow(df)))
  
  rv1 <- rv1 %>% select(persons_method_of_travel, count, percent)
  return(list(rv1, rv2))
}

rv = plot_potential_shopper_mot(psd, title='75% of shoppers arrived on foot')
table = rv[[1]]
chart = rv[[2]]
table

## # A tibble: 4 × 3
##   persons_method_of_travel count percent
##   <chr>                    <int>   <dbl>
## 1 walk                        98   75.4 
## 2 bike or scooter              8    6.15
## 3 car                         11    8.46
## 4 transit                     13   10

chart

Same graph, but omitting respondents from the first day, when I was only surveying people entering So’s Your Mom:

rv = plot_potential_shopper_mot(psd[psd$date !='July 14 2024',], title='Same chart, but without the data from the first surveying day')
table = rv[[1]]
chart = rv[[2]]
table

## # A tibble: 4 × 3
##   persons_method_of_travel count percent
##   <chr>                    <int>   <dbl>
## 1 walk                        74      74
## 2 car                          9       9
## 3 bike or scooter              6       6
## 4 transit                     11      11

chart

Same graph, but only including respondents who were explicitly asked if they shop on this strip and said yes:

rv = plot_potential_shopper_mot(psd[psd$date !='July 19 2024' & psd$do_they_shop_on_this_strip == 'yes',], title='Same chart, but only including people who\nsaid they shopped on the strip on day 3 or said they\nshopped on the strip at least 1/month on survey day 4')
table = rv[[1]]
chart = rv[[2]]
table

## # A tibble: 4 × 3
##   persons_method_of_travel count percent
##   <chr>                    <int>   <dbl>
## 1 walk                        42   82.4 
## 2 bike or scooter              3    5.88
## 3 car                          3    5.88
## 4 transit                      3    5.88

chart

Same graph, but only including respondents said they shopped on the strip at least once a month on the 4th survey day:

rv = plot_potential_shopper_mot(psd[psd$date =='July 26 2024' & psd$do_they_shop_on_this_strip == 'yes',], title='Same chart, but only including people who\nsaid they shopped on the strip at\nleast 1/month on survey day 4')
table = rv[[1]]
chart = rv[[2]]
table

## # A tibble: 4 × 3
##   persons_method_of_travel count percent
##   <chr>                    <int>   <dbl>
## 1 walk                        18   85.7 
## 2 transit                      1    4.76
## 3 bike or scooter              1    4.76
## 4 car                          1    4.76

chart

Now we’ll make the final graph, comparing reported methods of travel, along with shopkeepers’ and shoppers’ perceptions.

make_modes_and_perceptions_graph <- function(psd, skd, font_size=12, main_graph=FALSE, title) {
  # get the mean value for each perceived mode of travel, as well as the number of people that actually used that mode:
  to_plot <-
    psd %>%
    group_by(persons_method_of_travel) %>%
    summarise(`avg % est. to walk` = mean(`walk.estimate`, na.rm=T),
              `avg % est. to drive` = mean(`car.estimate..includes.uber.`, na.rm=T),
              `avg % est. to bike or scooter` = mean(`bike.or.scooter.estimate`, na.rm=T),
              `avg % est. to take transit` = mean(`transit..bus.or.metro..estimate`, na.rm=T),
              `avg % est. to other` = mean(`other.estimate`, na.rm=T),
              n = n())
  
  
  # in a few cases the row-wise totals don't completely sum to 100%, (This happens when someone's %s didn't sum to 100)
  # so we'll adjust for that by normalizing by the row-total. 
  to_plot$total_pct <- rowSums(x = to_plot %>% select("avg % est. to walk", 
                                      "avg % est. to drive",
                                      "avg % est. to bike or scooter", 
                                      "avg % est. to take transit", 
                                      "avg % est. to other" )
                               )
  for (col_name in c( "avg % est. to walk", 
                      "avg % est. to drive",
                      "avg % est. to bike or scooter",
                      "avg % est. to take transit",
                      "avg % est. to other")) {
    
    to_plot[[col_name]] <- to_plot[[col_name]] / to_plot$total_pct
  }
  # now these totals should sum to 100:
  to_plot$total_pct <- rowSums(x = to_plot %>% select("avg % est. to walk", 
                                                      "avg % est. to drive",
                                                      "avg % est. to bike or scooter", 
                                                      "avg % est. to take transit", 
                                                      "avg % est. to other" )
                               )
  
  # pivot the data from wide to long so we can plot it using ggplot
  # I am omitting the "other" category if the person did not define "other"
  # this only happened in a few cases
  to_plot <-
    to_plot %>% 
    select(-total_pct) %>%
    tidyr::pivot_longer(!persons_method_of_travel, names_to = "mode", values_to = "percent") %>%
    filter(mode != 'avg % est. to other') %>%
    mutate(percent = percent * 100) %>%
    filter(mode != 'n')
  
  # create a set of columns that have the actual way people arrived at the strip:
  observed_pct_to_add <-
    data.frame(
      rep("Survey observed", 4),
             c('avg % est. to walk', 'avg % est. to drive', 'avg % est. to bike or scooter', 'avg % est. to take transit'),
             c(nrow(psd[psd$persons_method_of_travel=='walk',]),
               nrow(psd[psd$persons_method_of_travel=='car',]),
               nrow(psd[psd$persons_method_of_travel=='bike or scooter',]),
               nrow(psd[psd$persons_method_of_travel=='transit',]))
      ) 
  names(observed_pct_to_add) <- names(to_plot)
  observed_pct_to_add$percent <- observed_pct_to_add$percent / sum(observed_pct_to_add$percent) * 100
  
  # append the observed counts onto the table with the perception data, so we can show it in one graph:
  to_plot <- dplyr::bind_rows(to_plot, observed_pct_to_add)
  
  # rename the categories so they're easier to understand in the actual chart:
  to_plot$persons_method_of_travel[to_plot$persons_method_of_travel == 'bike or scooter'] <- 
    paste0('Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=', nrow(psd[psd$persons_method_of_travel == 'bike or scooter', ]),')')
  to_plot$persons_method_of_travel[to_plot$persons_method_of_travel == 'car'] <- 
    paste0('Mode share estimate among\nshoppers that arrived by car\n(n=', nrow(psd[psd$persons_method_of_travel == 'car', ]),')')
  to_plot$persons_method_of_travel[to_plot$persons_method_of_travel == 'transit'] <- 
    paste0('Mode share estimate among\nshoppers that arrived by transit\n(n=', nrow(psd[psd$persons_method_of_travel == 'transit', ]),')')
  to_plot$persons_method_of_travel[to_plot$persons_method_of_travel == 'walk'] <- 
    paste0('Mode share estimate among\nshoppers that arrived on foot\n(n=', nrow(psd[psd$persons_method_of_travel == 'walk', ]),')')
  to_plot$persons_method_of_travel[to_plot$persons_method_of_travel == 'Survey observed'] <- 
    paste0('How shoppers actually arrived\n(n=', nrow(psd),')')
  
  # rename the perceptions categories so they're simpler for the chart:
  to_plot$mode[to_plot$mode == 'avg % est. to bike or scooter'] <- 'bike or scooter'
  to_plot$mode[to_plot$mode == 'avg % est. to drive'] <- 'car'
  to_plot$mode[to_plot$mode == 'avg % est. to take transit'] <- 'transit'
  to_plot$mode[to_plot$mode == 'avg % est. to walk'] <- 'walk'
  
  
  # replace NA values with zero in the shopkeeper data:
  skd$walk_est[is.na(skd$walk_est)] <- 0
  skd$drive_est[is.na(skd$drive_est)] <- 0
  skd$transit_est[is.na(skd$transit_est)] <- 0
  skd$bike_scooter_est[is.na(skd$bike_scooter_est)] <- 0
  skd$unspecified_est[is.na(skd$unspecified_est)] <- 0
  
  # Add the shopkeeper estimate data to the main data frame:
  to_add <- data.frame(rep(paste0('Mode share estimate among\nshopkeepers\n(n=', nrow(skd), ')'), 4),
                       c('walk', 'car', 'transit', 'bike or scooter'),
                       c(
                         mean(skd$walk_est, na.rm = T), mean(skd$drive_est, na.rm = T), mean(skd$transit_est, na.rm = T), mean(skd$bike_scooter_est, na.rm = T)
                       ))
  names(to_add) <- names(to_plot)[1:3]
  
  final_plot <- dplyr::bind_rows(to_plot, to_add)
  
  # create a simple "max_percent" variable that'll just let us order the columns, so the chart is easier to read:
  final_plot <-
    final_plot %>%
    group_by(persons_method_of_travel) %>%
    mutate(max_percent = max(percent))
  
  # make the chart!
  p <-
    final_plot %>%
    ggplot() +
    geom_bar(aes(x=persons_method_of_travel, 
                 y=percent, 
                 fill=mode), 
             position="dodge", stat="identity") +
    scale_y_continuous(breaks=seq(from=0, to=max(final_plot$percent) + 5, by=5)) +
    theme_minimal() +
    ylab("Percent") +
    xlab("") +
    ggtitle(title) +
    theme(axis.title.y.left = element_text(size=font_size),
          plot.title = element_text(size = 18, face = "bold"),
          axis.text.y.left = element_text(size=font_size),
          axis.text.x = element_text(size=font_size),
          legend.text=element_text(size=font_size),
          legend.title=element_text(size=font_size))
  
  if (main_graph) {
    p <- 
      p + 
         annotate(
        'curve',
        x = 2.5,
        y = 75,
        yend = 75,
        xend = 1.5,
        linewidth = 1,
        curvature = -0.25,
        arrow = arrow(length = unit(0.5, 'cm'))
      ) +
      annotate("text", x = 2.5, y = 77,
               size=4,
               label = "75% of shoppers actually arrived on\nfoot in the survey") +
      annotate(
        'curve',
        x = 3.4,
        y = 76,
        yend = 70.3,
        xend = 2.47,
        linewidth = 1,
        curvature = 0.15,
        arrow = arrow(length = unit(0.5, 'cm'))
      ) +
      annotate("text", x = 3.86, y = 75.7,
               size=4,
               label = "Shopkeepers underestimate the\n% of shoppers that arrive on foot\nby 7 percentage points, on average\n(and over-estimate driving 3.25x)") +
      annotate(
        'curve',
        x = 5.2,
        y = 67,
        yend = 52.5,
        xend = 6.3,
        linewidth = 1,
        curvature = -0.35,
        arrow = arrow(length = unit(0.5, 'cm'))
      ) +
      annotate("text", x = 4.65, y = 67,
               size=4,
               label = "And even people walking thought only\n52% of shoppers arrived on foot, on average")
  }
   
  
  return(list(p, final_plot))

}

options(width = 2400)
rv = make_modes_and_perceptions_graph(psd, skd, font_size=12, 
                                      main_graph = TRUE, title='On average, both shoppers and shopkeepers underestimated walking & over-estimated car usage')

data_copy <- rv[[2]]
rv[[2]]$percent <- round(rv[[2]]$percent, 0)
rv[[2]] <- rv[[2]][, c('persons_method_of_travel', 'mode', 'percent')]
print.data.frame(rv[[2]])

##                                                      persons_method_of_travel            mode percent
## 1  Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=8)            walk      57
## 2  Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=8)             car      11
## 3  Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=8) bike or scooter      16
## 4  Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=8)         transit      16
## 5             Mode share estimate among\nshoppers that arrived by car\n(n=11)            walk      43
## 6             Mode share estimate among\nshoppers that arrived by car\n(n=11)             car      30
## 7             Mode share estimate among\nshoppers that arrived by car\n(n=11) bike or scooter       8
## 8             Mode share estimate among\nshoppers that arrived by car\n(n=11)         transit      20
## 9         Mode share estimate among\nshoppers that arrived by transit\n(n=13)            walk      32
## 10        Mode share estimate among\nshoppers that arrived by transit\n(n=13)             car      27
## 11        Mode share estimate among\nshoppers that arrived by transit\n(n=13) bike or scooter      15
## 12        Mode share estimate among\nshoppers that arrived by transit\n(n=13)         transit      25
## 13           Mode share estimate among\nshoppers that arrived on foot\n(n=98)            walk      52
## 14           Mode share estimate among\nshoppers that arrived on foot\n(n=98)             car      20
## 15           Mode share estimate among\nshoppers that arrived on foot\n(n=98) bike or scooter       8
## 16           Mode share estimate among\nshoppers that arrived on foot\n(n=98)         transit      20
## 17                                     How shoppers actually arrived\n(n=130)            walk      75
## 18                                     How shoppers actually arrived\n(n=130)             car       8
## 19                                     How shoppers actually arrived\n(n=130) bike or scooter       6
## 20                                     How shoppers actually arrived\n(n=130)         transit      10
## 21                             Mode share estimate among\nshopkeepers\n(n=12)            walk      68
## 22                             Mode share estimate among\nshopkeepers\n(n=12)             car      26
## 23                             Mode share estimate among\nshopkeepers\n(n=12)         transit       3
## 24                             Mode share estimate among\nshopkeepers\n(n=12) bike or scooter       1

rv[[1]]

We can also show the results in terms of the difference between observed means of transport and respondents’ perceptions.

make_dotplot <- function(df, title) {
  actual_var_index <- grep(pattern = "How shoppers actually arrived", x = df$persons_method_of_travel)
  actual_var <- unique(df$persons_method_of_travel[actual_var_index])
  variable_names <- unique(df$persons_method_of_travel)[!unique(df$persons_method_of_travel) %in% actual_var]

  to_plot <- data.frame(matrix(rep(NA, 5), ncol=5))
  names(to_plot) <- c("person_type", "mode", "actual_pct", "perceived_pct", "difference between perceived and actual percent")
  for (ii in variable_names) {
    for (mot in c("walk", "car", "bike or scooter", "transit")) {
      actual <- 0
      perceived <- 0
      actual <- df$percent[df$persons_method_of_travel==actual_var & df$mode == mot]
      perceived <- df$percent[df$persons_method_of_travel==ii & df$mode == mot]
      actual <- max(actual, 0, na.rm=T)
      perceived <- max(perceived, 0, na.rm = T)
      to_add <- c(ii, mot, actual, perceived, perceived - actual)
      to_plot[nrow(to_plot) + 1,] <- to_add
    }
  }
  
  to_plot <- to_plot[is.na(to_plot$person_type)==F,]
  
  to_plot$`difference between perceived and actual percent` <-
    as.numeric(to_plot$`difference between perceived and actual percent`)
  
  ggplot(to_plot, aes(x = `difference between perceived and actual percent`, y = mode)) +
    geom_segment(aes(yend = mode), xend = 0, colour = "grey50") +
    geom_point(size = 3, aes(colour = person_type)) +
      geom_vline(aes(xintercept=0)) +
      facet_grid(person_type ~ ., scales = "free_y", space = "free_y") +
    ylab("") +
    xlab("Percentage point difference between perceived and actual share") +
    ggtitle(title) +
    scale_x_continuous(n.breaks = 10, 
                       limits = c(min(to_plot$`difference between perceived and actual percent`)-5, 
                                  max(to_plot$`difference between perceived and actual percent`)+5)) +
    theme(legend.position="none", 
          axis.title.x = element_text(size=14),
          axis.text.x = element_text(size=14),
          axis.text.y = element_text(size=14),
          strip.text = element_text(size = 14),
          strip.text.y.right = element_text(angle=0),
          plot.title = element_text(size=18))
}


make_dotplot(rv[[2]], title = "Average percentage point difference between\ncustomers' actual mode share and the mode share perceived by shopkeepers")

Below are some robustness checks, where we slice the data different ways to see if the results change.

make_modes_and_perceptions_graph(psd[psd$date !='July 19 2024',], skd, font_size=12,  main_graph = FALSE, 
                                 title='Similar results if you omit the first survey day')[[1]]

make_modes_and_perceptions_graph(psd[psd$date !='July 19 2024' & psd$do_they_shop_on_this_strip == 'yes',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Average shopkeeper perception gap actually increases if\nif you only use responses from self-described shoppers')[[1]]

The N values for the respondents that drive, take transit, and bike are too small in the subsample below to mean anything, but the number of walkers relative to teh shopkeeper perception actually grows again.

make_modes_and_perceptions_graph(psd[psd$date =='July 26 2024' & psd$do_they_shop_on_this_strip == 'yes',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Again, average shopkeeper perception gap grows if you only\nuse the last day of data, counting only people that shop at least once a month')[[1]]

We can also just look at the results if we subset to look at each day’s worth of data individually.

Analyzing every day individually, shopkeepers’ average estimate of the share of their customers is only accurate on July 26. This was the day that saw the fewest walkers, and my anecdotal hunch is that this might be because people are more likely to drive / take uber to Namak & Perry’s, since those are kind of “destination restaurants” (as opposed to Yes, for example). And I was standing outside Namak and Perry’s all night on the 26th, a Friday, so I suspect this was the high-water mark for driving, and the lower-bound for walking. (The staff person at Namak estimated that only 40% of customers walk).

Notably, even when we look only at data from the 26th, shopkeepers still significantly underestimated biking and transit, while overestimating car use.

p1 <- make_modes_and_perceptions_graph(psd[psd$date =='July 14 2024',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 14 shopper data')
p2 <- make_modes_and_perceptions_graph(psd[psd$date =='July 15 2024',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 15 shopper data')
p3 <- make_modes_and_perceptions_graph(psd[psd$date =='July 19 2024',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 19 shopper data')
p4 <- make_modes_and_perceptions_graph(psd[psd$date =='July 26 2024',], 
                                 skd, font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 26 shopper data')

p1[[1]]

p2[[1]]

p3[[1]]

p4[[1]]

make_dotplot(p1[[2]], title = 'Results using only Jul 14 shopper data')

make_dotplot(p2[[2]], title = 'Results using only Jul 15 shopper data')

make_dotplot(p3[[2]], title = 'Results using only Jul 19 shopper data')

make_dotplot(p4[[2]], title = 'Results using only Jul 26 shopper data')

What if we only look at data from the first day, comparing the responses from people going into So’s Your Mom with the responses from the business people in the shop?

Below are the raw responses. There’s a bit of an odd data hiccup in that one of the staff estimated that 80% of customers walk, but didn’t provide a share for anything except walking. If you just include the one complete response, the shopkeeper overestimated both walking and driving.

make_modes_and_perceptions_graph(psd[psd$date =='July 14 2024',], 
                                 skd[skd$store == "Sos your mom" , ], font_size=12, main_graph = FALSE, 
                                 title="Results using only Jul 14 shopper data & responses from So's Your Mom staff")

## [[1]]

## 
## [[2]]
## # A tibble: 24 × 4
## # Groups:   persons_method_of_travel [6]
##    persons_method_of_travel                                                     mode            percent max_percent
##    <chr>                                                                        <chr>             <dbl>       <dbl>
##  1 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" walk              63.2         63.2
##  2 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" car                2.63        63.2
##  3 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" bike or scooter   15.8         63.2
##  4 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" transit           18.4         63.2
##  5 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             walk              75           75  
##  6 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             car               15           75  
##  7 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             bike or scooter    0           75  
##  8 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             transit           10           75  
##  9 "Mode share estimate among\nshoppers that arrived by transit\n(n=2)"         walk              50           50  
## 10 "Mode share estimate among\nshoppers that arrived by transit\n(n=2)"         car               30           50  
## # ℹ 14 more rows

If we just take that 20% that was unspecified in the 2nd shopkeeper’s response and distribute it equally among the other options (biking/scootering, driving, and transit), we see results that are similar-ish to the main results. The shopkeepers actually over-estimate walking, but, like the other shopkeepers on average, they over-estimate car-use while underestimating biking, scootering, and transit-use.

make_modes_and_perceptions_graph(psd[psd$date =='July 14 2024',], 
                                 skd[skd$store == "Sos your mom" , ] %>%
                                 mutate(drive_est = ifelse(date=="July 27 2024", yes=20/3, no=drive_est),
                                        transit_est = ifelse(date=="July 27 2024", yes=20/3, no=transit_est),
                                        bike_scooter_est = ifelse(date=="July 27 2024", yes=20/3, no=bike_scooter_est)
                                        ) , 
                                 font_size=12, main_graph = FALSE, 
                                 title="Results using only Jul 14 shopper data & responses from\nSo's Your Mom staff, imputing missing shares")

## [[1]]

## 
## [[2]]
## # A tibble: 24 × 4
## # Groups:   persons_method_of_travel [6]
##    persons_method_of_travel                                                     mode            percent max_percent
##    <chr>                                                                        <chr>             <dbl>       <dbl>
##  1 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" walk              63.2         63.2
##  2 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" car                2.63        63.2
##  3 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" bike or scooter   15.8         63.2
##  4 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=2)" transit           18.4         63.2
##  5 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             walk              75           75  
##  6 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             car               15           75  
##  7 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             bike or scooter    0           75  
##  8 "Mode share estimate among\nshoppers that arrived by car\n(n=2)"             transit           10           75  
##  9 "Mode share estimate among\nshoppers that arrived by transit\n(n=2)"         walk              50           50  
## 10 "Mode share estimate among\nshoppers that arrived by transit\n(n=2)"         car               30           50  
## # ℹ 14 more rows

We can do something similar with the the responses from staff at Yes Organic Market and the data from survey days 2 and 3. While this isn’t an exact 1-to-1 comparison, relatively more customers were entering Yes on those survey days, because I was standing in front of the store.

This shows results that are consistent with the overall averages.

make_modes_and_perceptions_graph(psd[psd$date =='July 15 2024' | psd$date =='July 19 2024',], 
                                 skd[skd$store == "Yes Organic Market" , ], font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 15 & 19 shopper data\nand shopkeeper responses from Yes Organic Market staff')

## [[1]]

## 
## [[2]]
## # A tibble: 24 × 4
## # Groups:   persons_method_of_travel [6]
##    persons_method_of_travel                                                     mode            percent max_percent
##    <chr>                                                                        <chr>             <dbl>       <dbl>
##  1 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=5)" walk               46          46  
##  2 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=5)" car                16          46  
##  3 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=5)" bike or scooter    20          46  
##  4 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=5)" transit            18          46  
##  5 "Mode share estimate among\nshoppers that arrived by car\n(n=5)"             walk               36          36  
##  6 "Mode share estimate among\nshoppers that arrived by car\n(n=5)"             car                30.4        36  
##  7 "Mode share estimate among\nshoppers that arrived by car\n(n=5)"             bike or scooter    12          36  
##  8 "Mode share estimate among\nshoppers that arrived by car\n(n=5)"             transit            21.6        36  
##  9 "Mode share estimate among\nshoppers that arrived by transit\n(n=6)"         walk               15.8        36.7
## 10 "Mode share estimate among\nshoppers that arrived by transit\n(n=6)"         car                31.7        36.7
## # ℹ 14 more rows

Ditto for the last day, when I was located in front of Namak: the shopkeepers appear to overestimate car-use and underestimate all other modes.

make_modes_and_perceptions_graph(psd[psd$date =='July 26 2024',], 
                                 skd[skd$store == "Namak" , ], font_size=12, main_graph = FALSE, 
                                 title='Results using only Jul 26 shopper data\nand shopkeeper responses from Namak staff')

## [[1]]

## 
## [[2]]
## # A tibble: 24 × 4
## # Groups:   persons_method_of_travel [6]
##    persons_method_of_travel                                                     mode            percent max_percent
##    <chr>                                                                        <chr>             <dbl>       <dbl>
##  1 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=1)" walk             100          100  
##  2 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=1)" car                0          100  
##  3 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=1)" bike or scooter    0          100  
##  4 "Mode share estimate among\nshoppers that arrived by bike or scooter\n(n=1)" transit            0          100  
##  5 "Mode share estimate among\nshoppers that arrived by car\n(n=4)"             walk              35           36.2
##  6 "Mode share estimate among\nshoppers that arrived by car\n(n=4)"             car               36.2         36.2
##  7 "Mode share estimate among\nshoppers that arrived by car\n(n=4)"             bike or scooter    6.25        36.2
##  8 "Mode share estimate among\nshoppers that arrived by car\n(n=4)"             transit           22.5         36.2
##  9 "Mode share estimate among\nshoppers that arrived by transit\n(n=5)"         walk              43           43  
## 10 "Mode share estimate among\nshoppers that arrived by transit\n(n=5)"         car               20           43  
## # ℹ 14 more rows

Overall, the general conclusions seem strong: on average, shopkeepers seem to underestimate walking (to varying degrees), overestimate car use, and underestimate transit use, and underestimate biking and scootering (even if some shopkeepers’ estimates are in line or above the true mode share values).

Finally, just as an aside, here is the shopping frequency by mode of transportation:

psd %>%
  filter(date=='July 26 2024') %>%
  group_by(persons_method_of_travel) %>%
  summarize(mean_trips_per_month = round(mean(times_shop_on_this_strip_per_month, na.rm=T), 1),
            n=n()) %>%
  knitr::kable()

persons_method_of_travel	mean_trips_per_month	n
bike or scooter	16.0	1
car	2.6	4
transit	4.8	5
walk	8.8	23

And the table below shows the percentage of respondents that report shopping at least once a month, and more than once a month:

psd %>%
  filter(date == "July 26 2024") %>%
  mutate(at_least_once_a_month_flag = ifelse(test=times_shop_on_this_strip_per_month >= 1, yes=1, no=0),
         more_than_once_a_month_flag = ifelse(test=times_shop_on_this_strip_per_month > 1, yes=1, no=0)) %>%
  summarise(at_least_once_a_month_pct = round(mean(at_least_once_a_month_flag, na.rm=T)*100, 0),
         more_than_once_a_month_pct = round(mean(more_than_once_a_month_flag, na.rm=T)*100, 0)
         )

##   at_least_once_a_month_pct more_than_once_a_month_pct
## 1                        94                         64

psd %>%
  filter(date == "July 26 2024") %>%
  mutate(at_least_once_a_month_flag = ifelse(test=times_shop_on_this_strip_per_month >= 1, yes=1, no=0),
         more_than_once_a_month_flag = ifelse(test=times_shop_on_this_strip_per_month > 1, yes=1, no=0)) %>%
  group_by(persons_method_of_travel) %>%
  summarise(
            at_least_once_a_month_pct = round(mean(at_least_once_a_month_flag)*100, 0),
            at_least_once_a_month_count = sum(at_least_once_a_month_flag),
            more_than_once_a_month_pct = round(mean(more_than_once_a_month_flag)*100, 0),
            more_than_once_a_month_count = sum(more_than_once_a_month_flag)
            ) %>%
  knitr::kable()

persons_method_of_travel	at_least_once_a_month_pct	at_least_once_a_month_count	more_than_once_a_month_pct	more_than_once_a_month_count
bike or scooter	100	1	100	1
car	75	3	25	1
transit	100	5	20	1
walk	96	22	78	18

Shopper & shopkeeper survey methodology

pete rodrigue

2024-07-22