Dataset Preparation

This document lays out the data handling process of multiple databases related to broadband availability, speed, and entrepreneurship of Texas, Maine, and Kansas. Databases used here are as follows:

Microsoft Broadband Availability Dataset
Measurement Lab (M-Lab) Datasets Exported via Google BigQuery
GoDaddy Venture Density & Demographic Dataset
Texas Sole Proprietors Data
IRR Rural Index

In the following sections, I will import, clean, and set up the aforementioned datasets to be joined by county.

Install and call packages used for the folloiwng process.

rm(list = ls())   ## Clear the workspace


#install.packages("tidyverse")
#install.packages("ggplot2")
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(reticulate)

Microsoft Broadband Availability Dataset

Microsoft dataset is available in its GitHub repository. Therefore, we will directly import its dataset from the URL. Microsoft updated the dataset June, 2021 to apply data as of October 2020. The updated dataset also includes updated FCC availability as of December 2019.

Note that the percentage variables are imported as factors. We need them to be in the form of integers. Converting factors directly to numeric vectors using as.numeric() would result in loss of information. Following the methods suggested here, the code below converts BROADBAND.AVAILABILITY.PER.FCC and BROADBAND.USAGE to numeric values.

## Create new variabes in numeric form
ms_broadband_2019 <- ms_broadband_2019 %>% 
  rename(pct_broadband_FCC_dec_2017 = `BROADBAND AVAILABILITY PER FCC`,
         pct_broadband_MS_nov_2019 = `BROADBAND USAGE`)

ms_broadband_2020 <- ms_broadband_2020 %>% 
  rename(pct_broadband_FCC_dec_2019 = `BROADBAND AVAILABILITY PER FCC`,
         pct_broadband_MS_oct_2020 = `BROADBAND USAGE`)

ms_broadband <- left_join(ms_broadband_2019, select(ms_broadband_2020, `COUNTY ID`, pct_broadband_FCC_dec_2019, pct_broadband_MS_oct_2020), by = "COUNTY ID")
## Confirm the change
str(ms_broadband)

Next step is to filter out information on the states other than Texas

## Filtering TX
ms_broadband_txksme <- ms_broadband %>% 
  filter(ST == "TX" | ST == "KS" | ST == "ME") %>% droplevels() # Deletes unused levels in factor variables in a dataset

## Inspect the result
str(ms_broadband_txksme)

GoDaddy Dataset

The GoDaddy dataset has been used by researchers (Mossberger, Tolbert, & LaCombe, 2020) for studying local entrepreneurship represented by actual adoption and usage of the Internet. The dataset includes venture density variables in various time frames. Furthermore, the dataset includes numbers of important demographic/economic factors derived from the American Community Survey estimates.

In the following section, I will import, inspect and set up the dataset for further join with other databases.

## Import the Dataset
godaddy_cbsa <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/GoDaddy_CBSA.csv", header = T)    # GoDaddy dataset at the unit of Core Based Statistical Area (CBSA) defined by U.S. Department of Housing and Urban Development
godaddy_county <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/godaddy_nov_2020_county.csv", header = T)    # GoDaddy dataset at the county level

## Inspect Structure
str(godaddy_cbsa)
str(godaddy_county)

The GoDaddy dataset also has to be filtered to include only Texas data. Since we are at the moment only interested in county level information, the filtering will only be conducted to the county level dataset. The county level dataset has a variable named cfips, which is a FIPS code. FIPS code idnetifies Texas by the first two digit: ‘48’.

## Filtering Texas using FIPS code
str(godaddy_county)

godaddy_county_txksme <- godaddy_county %>% 
  filter(startsWith(as.character(cfips), "48") | startsWith(as.character(cfips), "23") |
           startsWith(as.character(cfips), "20")) %>% # Filter Texas, Kansas, Maine
  mutate(population_2018 = as.numeric(gsub(",","",as.character(population))),
         cfips = as.character(cfips)) %>% # Convert population from factor to numeric
  separate(county, into = c("county", "state"), sep = ", ", remove = T) %>% # Separate county variable which had state and county data at the same time
  droplevels()  # Drop unused levels from factor variables

## Inspect Structure
str(godaddy_county_txksme)

M-Lab Datasets

Measurement Lab (M-Lab) datasets consists of user reported download (DL) and upload (UL) speed by geographical information. The easiest way to access and retrieve data from the database is by making queries via Google BigQuery platform using Standard SQL statements. You can find more about datasets available from M-Lab and overview here. Currently M-Lab suggests using the Unified View dataset to researchers. The datasets from the following time frames are retrieved through BigQuery using SQL statements shared by M-Lab team modified to match the time frames. The SQL queries and an overview of the methods of calculating median speeds are available here. I made some simple modifications calculating the proportion of people who have recorded median speed of 25Mbps download/3Mbps upload in a given time period, which would align better with FCC’s broadband definition. The actual query used is available here.

September 2019
October 2019
November 2019
December 2019

These time frames are initially selected because it matches the ones provided in the GoDaddy dataset. GoDaddy dataset provides venture density information in several time frame. While there are several time frames in 2018, the intervals are inconsistent. Since the time frames of 2019 are equally distanced (i.e., monthly) from September to December, this time frame was initially selected for M-Lab data retrieval.

In this section, I will import the retrieved M-Lab datasets and clean them.

## Import the Dataset
mlab_sept <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/Mlab_State_All_Sept_2019_2.csv", header = T)
mlab_oct <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/Mlab_State_All_Oct_2019_2.csv", header = T)
mlab_nov <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/Mlab_State_All_Nov_2019_2.csv", header = T)
mlab_dec <- read.csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/Mlab_State_All_Dec_2019_2.csv", header = T)

## Inspect Structure
str(mlab_sept)
str(mlab_oct)
str(mlab_nov)
str(mlab_dec)

#### Cleaning and Filtering the Dataset ####
county_db <- tidycensus::fips_codes
county_tx <- county_db[which(county_db$state == "TX"), ]
county_ks <- county_db[which(county_db$state == "KS"), ]
## September 2019 ##
mlab_sept_txksme <- mlab_sept %>% 
  filter(state == "TX" | state == "KS" | state == "ME") %>% 
  select(-c("BB_state","BB_county_name","DLmed_state","DLmed_county_name","ULmed_state","ULmed_county_name","ul_sample_state","ul_sample_county_name","dl_sample_state","dl_sample_county_name")) %>% 
  mutate(county = paste(as.character(county_name), "County", sep = " "),
         frac_over_25DL = 1 - frac_under_25mbpsDL) %>% 
  droplevels() %>% 
  left_join(., county_db, by = c("state" = "state", "county" = "county")) %>% 
  mutate(county_FIPS = paste0(state_code, county_code)) %>% 
  select(-c("state_name","county_name"))

## October 2019 ##
mlab_oct_txksme <- mlab_oct %>% 
  filter(state == "TX" | state == "KS" | state == "ME") %>% 
  select(-c("BB_state","BB_county_name","DLmed_state","DLmed_county_name","ULmed_state","ULmed_county_name","ul_sample_state","ul_sample_county_name","dl_sample_state","dl_sample_county_name")) %>% 
  mutate(county = paste(as.character(county_name), "County", sep = " "),
         frac_over_25DL = 1 - frac_under_25mbpsDL) %>% 
  droplevels() %>% 
  left_join(., county_db, by = c("state" = "state", "county" = "county")) %>% 
  mutate(county_FIPS = paste0(state_code, county_code)) %>% 
  select(-c("state_name","county_name"))

## November 2019 ##
mlab_nov_txksme <- mlab_nov %>% 
  filter(state == "TX" | state == "KS" | state == "ME") %>% 
  select(-c("BB_state","BB_county_name","DLmed_state","DLmed_county_name","ULmed_state","ULmed_county_name","ul_sample_state","ul_sample_county_name","dl_sample_state","dl_sample_county_name")) %>% 
  mutate(county = paste(as.character(county_name), "County", sep = " "),
         frac_over_25DL = 1 - frac_under_25mbpsDL) %>% 
  droplevels() %>% 
  left_join(., county_db, by = c("state" = "state", "county" = "county")) %>% 
  mutate(county_FIPS = paste0(state_code, county_code)) %>% 
  select(-c("state_name","county_name"))

## December 2019 ##
mlab_dec_txksme <- mlab_dec %>% 
  filter(state == "TX" | state == "KS" | state == "ME") %>% 
  select(-c("BB_state","BB_county_name","DLmed_state","DLmed_county_name","ULmed_state","ULmed_county_name","ul_sample_state","ul_sample_county_name","dl_sample_state","dl_sample_county_name")) %>% 
  mutate(county = paste(as.character(county_name), "County", sep = " "),
         frac_over_25DL = 1 - frac_under_25mbpsDL) %>% 
  droplevels() %>% 
  left_join(., county_db, by = c("state" = "state", "county" = "county")) %>% 
  mutate(county_FIPS = paste0(state_code, county_code)) %>% 
  select(-c("state_name","county_name"))

Texas Sole Proprietors Dataset

This dataset contains information of sole proprietors number provided by the Bureau of Business Research at \(IC^{2}\) Institute. The information is presented at the county level.

## Import the Dataset
tx_proprietor <- read_csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/Sole-Proprietors-tx-combined.csv")

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   county = col_character(),
##   pct_change_proprietors_2000_2017 = col_double(),
##   proprietors_2000 = col_double(),
##   proprietors_2017 = col_double(),
##   pct_proprietors_employment_2000 = col_double(),
##   pct_proprietors_employment_2010 = col_double(),
##   pct_proprietors_employment_2017 = col_double(),
##   total_employment_2017 = col_double(),
##   wage_salary_jobs_2017 = col_double(),
##   pct_change_pro_emp_2000_2017 = col_double(),
##   pct_change_pro_emp_2010_2017 = col_double()
## )

## Inspect the Structure
str(tx_proprietor)

This dataset’s variables are straighforward as it was pre-cleaned on Excel spreadsheet.

IRR Rural Index

This dataset contains information of IRR index at the county level. The index calculates the extent to which a certain county could be considered as ‘rural’. Smaller the value of the index, more rural the county is considered to be.

There are two datasets that contains data from 2000 and 2010. I will import these datasets, filter for Texas and set up for a final merge of all datasets we have.

## Import the Dataset
rural_2000 <- read_csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/IRR-rural-2000.csv")

## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────
## cols(
##   FIPS2000 = col_double(),
##   `County Name` = col_character(),
##   IRR2000 = col_double()
## )

rural_2010 <- read_csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/IRR-rural-2010.csv")

## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────
## cols(
##   FIPS2010 = col_double(),
##   `County Name` = col_character(),
##   IRR2010 = col_double()
## )

## Inspect Structure
str(rural_2000)
str(rural_2010)

#### Clean the Datasets ####
## Filter Texas counties using FIPS code ##

rural_2000_txksme <- rural_2000 %>% 
  filter(startsWith(as.character(FIPS2000), "48")| startsWith(as.character(FIPS2000), "23") |
           startsWith(as.character(FIPS2000), "20")) %>% 
  separate(`County Name`, into = c("county", "state"), sep = ", ", remove = T)

rural_2010_txksme <- rural_2010 %>% 
  filter(startsWith(as.character(FIPS2010), "48")| startsWith(as.character(FIPS2010), "23") |
           startsWith(as.character(FIPS2010), "20")) %>% 
  separate(`County Name`, into = c("county", "state"), sep = ", ", remove = T)

## Inspect Structure
str(rural_2000_txksme)
str(rural_2010_txksme)

Rural-Urban Continuum Codes

In addition to the IRR index, I also import the Rural-Urban Continuum Codes (RUCC, 2013). The RUCC forms a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area.

rucc2013 <- read_csv("https://raw.githubusercontent.com/jwroycechoi/broadband-entrepreneurship/master/Datasets/ruralurbancodes2013.csv")

## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────
## cols(
##   FIPS = col_character(),
##   State = col_character(),
##   County_Name = col_character(),
##   Population_2010 = col_number(),
##   RUCC_2013 = col_double(),
##   Description = col_character()
## )

rucc2013 <- rucc2013 %>% 
  mutate(Description = str_extract(Description, "^.*?(?= -)")) %>% 
  rename(metro_desc = Description)

Merge into Single Dataset at the County Level

In this section, I will join all the datasets I have cleand and processed above into a single dataset containing all the information at the county level rows. The names of the datasets created from the process above are as follows

ms_broadband_txksme
godaddy_county_txksme
mlab_sept_txksme; mlab_oct_txksme; mlab_nov_txksme; mlab_dec_txksme
tx_proprietor
rural_2000_txksme; rural_2010_txksme

I will first inspect the datasets quickly to see if row counts of all datasets are matched since it is possible some of them (especially M-Lab datasets) have less/missing observations. Afterwards, I’ll join the datasets into one.

#### Inspect the Number of Observations ####
knitr::kable(tibble(MS = count(ms_broadband_txksme)$n, godaddy = count(godaddy_county_txksme)$n,
       mlabsept = count(mlab_sept_txksme)$n, mlaboct = count(mlab_oct_txksme)$n, mlabnov = count(mlab_nov_txksme)$n, mlabdec = count(mlab_dec_txksme)$n,
       txprop = count(tx_proprietor)$n, irr2000 = count(rural_2000_txksme)$n, irr2010 = count(rural_2010_txksme)$n))

MS	godaddy	mlabsept	mlaboct	mlabnov	mlabdec	txprop	irr2000	irr2010
375	374	367	369	368	370	254	383	383

Several missing observation from M-Lab datasets and one missing from GoDaddy dataset here. Therefore, it is critical that these datasets are not used as a reference dataset for joining. Folloiwng code chunk will use join functions of dplyr.

#### Join the Datasets ####

str(ms_broadband_txksme)
str(godaddy_county_txksme)
str(mlab_dec_txksme)
str(tx_proprietor)
str(rural_2000_txksme)

## The variables to match from each datasets are
## ms_broadband_tx$COUNTY.ID          Integer
## ms_broadband_tx$COUNTY.NAME        Factor    ("____ County")
## godaddy_county_tx$cfips            Integer
## godaddy_county_tx$county           Character ("____ County")
## mlab datasets$county               Character ("____ County")
## tx_proprietor$county               Character ("____ County")
## rural datasets#county              Character ("____ County")

## Create a character vector for ms_broadband_tx dataset to minimize potential error beforehand
ms_broadband_txksme <- ms_broadband_txksme %>% mutate(county = `COUNTY NAME`,
                                                      county_FIPS = as.character(`COUNTY ID`)) %>% 
  select(-c("COUNTY NAME","COUNTY ID"))
str(ms_broadband_txksme)

## Nested left_joins on IRR dataset
names(mlab_sept_txksme) <- paste(names(mlab_sept_txksme), ".sept", sep = "")
names(mlab_oct_txksme) <- paste(names(mlab_oct_txksme), ".oct", sep = "")
names(mlab_nov_txksme) <- paste(names(mlab_nov_txksme), ".nov", sep = "")
names(mlab_dec_txksme) <- paste(names(mlab_dec_txksme), ".dec", sep = "")

rural_2010_txksme <- rural_2010_txksme %>% mutate(FIPS2010 = as.character(FIPS2010))
godaddy_county_txksme <- godaddy_county_txksme %>% mutate(cfips = as.character(cfips))

tx_bb_entrepreneur_merged <- left_join(ms_broadband_txksme, rural_2010_txksme, by = c("county_FIPS" = "FIPS2010")) %>% 
  left_join(., select(godaddy_county_txksme, c("cfips","county","state", "population_2018",
                                               starts_with("venturedensity"), starts_with("highlyactive"),
                                           "prosperityindex2016","prosperityindex2018",
                                           "change_prosperity_07_16","change_prosperity_07_18")), by = c("county_FIPS" = "cfips")) %>% 
  left_join(., mlab_sept_txksme, by = c("county_FIPS" = "county_FIPS.sept")) %>% 
  left_join(., mlab_oct_txksme, by = c("county_FIPS" = "county_FIPS.oct")) %>% 
  left_join(., mlab_nov_txksme, by = c("county_FIPS" = "county_FIPS.nov")) %>% 
  left_join(., mlab_dec_txksme, by = c("county_FIPS" = "county_FIPS.dec"))

str(tx_bb_entrepreneur_merged)

## Drop columns with duplicate information

str(tx_bb_entrepreneur_merged)
tx_bb_entrepreneur_merged <- tx_bb_entrepreneur_merged %>%  
  select(-starts_with("state.")) %>% 
  select(-starts_with("county.")) %>% 
  select(-starts_with("county_cod")) %>% 
  select(-starts_with("state_"))

County Business Patterns & Nonemployer Statistics Dataset from the Census Bureau

In this section, we will get data from the County Business Patterns and the Nonemployer Statistics datasets provided by the Census Bureau. These datasets provide annual statistics of businesses with paid employees and no paid employees or payroll, respectively at a detailed geography and industry level. Using censusapi package, I will interact directly with the Census Bureau API to retrieve and modify data. For those who want to follow, one should request an API key first before proceeding.

The process here follows a very useful guideline provided by the developer of censusapi package. You can find the introduction with useful information and examples of interacting with various APIs here, and also detailed documentations here. If you don’t need to interact with economic or other datasets and only need demographic information available from the American Community Survey estimates, tidycensus from Kyle Walker is also a good alternative. Here, I will mainly show the process using censusapi package.

After installing the package, set up your API key by copying and pasting the code below. Put the API key you got from the Census Bureau instead of what says "YOUR KEY GOES HERE" below.

# For general usage, use install.packages("censusapi") to install most stable version of the package.

#devtools::install_github("jwroycechoi/censusapi")     # A fork that fixes some issues regarding NAICS2017 and labels
library(censusapi)

# Add key to .Renviron
Sys.setenv(CENSUS_KEY="YOUR KEY GOES HERE")
# Reload .Renviron
readRenviron("~/.Renviron")
# Check to see that the expected key is output in your R console
Sys.getenv("CENSUS_KEY")

## 
## Attaching package: 'censusapi'

## The following object is masked from 'package:methods':
## 
##     getFunction

You can take a look at available APIs and other important metadata related to specific APIs using the following functions. Refer to specific documentations of the functions as well as the specific API that you want to investigate.

listCensusApis(): Retrieve all available API handles provided by the Census Bureau.
listCensusMetadata(): Retrieve important metadata of specific APIs in specific years, such as the variables or geography.

We are interested in County Business Patterns and Nonemployers Statistics. We can check the variables available from the datasets and the geogrphic level using the code below.

#### CBP Dataset ####
## County Business Pattern dataset variables and geography
View(listCensusMetadata(name = "cbp", vintage = 2012, type = "variables"))
View(listCensusMetadata(name = "cbp", vintage = 2017, type = "variables"))
View(listCensusMetadata(name = "cbp", vintage = 2018, type = "variables"))
listCensusMetadata(name = "cbp", vintage = 2017, type = "geography")
listCensusMetadata(name = "cbp", vintage = 2018, type = "geography")

# You can also put the variable names into separate object if you want to use it for further queries
cbp2017_var <- listCensusMetadata(name = "cbp", vintage = 2017, type = "variables")$name

#### Nonemployer Dataset ####
View(listCensusMetadata(name = "nonemp", vintage = 2012, type = "variables"))
View(listCensusMetadata(name = "nonemp", vintage = 2017, type = "variables"))
View(listCensusMetadata(name = "nonemp", vintage = 2018, type = "variables"))
listCensusMetadata(name = "nonemp", vintage = 2017, type = "geography")
listCensusMetadata(name = "nonemp", vintage = 2018, type = "geography")

In order to retrive the datasets from the Census database you use getCensus function with specific arguments. Generally, required arguments are name, vintage(i.e., year), vars(i.e., variables), and region. In the code below, I’ll retrieve CBP and Nonemployer datasets from year 2017 and 2018 for all counties within three states of our interest.

After raw data retrieval, I will create variables that could potentially represent entrepreneurial activities within the counties. The variables have to be aggregated into county level observations. Note that currently our research purpose does not weigh into investigating industry breakdowns or revenues. Following list of variables will be aggregated at the county level:

County Business Patterns Dataset

emp_cbp_YEAR: Total employment for all industry sectors and all establishments in the county
pct_50_est_cbp_YEAR: % of establishments with less than 50 employees over all establishments in the county
pct_10_est_cbp_YEAR: % of establishments with less than 10 employees over all establishments in the county

Nonemployer Statistics

neest_nemp_YEAR: Total nonemployer establishments for all industry sectors in the county
nonfarmneest_nemp_YEAR: Total nonfarm nonemployer establishments in the county
pct_nonfarmneest_nemp_YEAR: % of nonfarm nonemployer establishments over all nonemployer establishments in the county

Since we will be using same lines of code for different years, I’ll write a local function for County Business Pattern and Nonemployer Statistics database that will

Retrieve data set for each year
Calculate the variables
Merge into single data frame

## Retrieve CBP datasets ##
## This function:
## 1. Calls Census API for Employment, Employment size category, Establishments, and year
## 2. Summarizes total employment, total business establishments, establishments with less than 50 employees, establishments with less than 10 employees,
##    year, % of establishments with less than 50 employees over total establishments, % of establishmets with less than 10 employees over total establishments
## Note: You can enter multiple years and counties of a single state

getCBPemp <- function(yr, countyfips = "*", statefips){
  yrlst <- list(yr)
  dfs <- vector("list", length(yr))
  for (i in seq_along(yrlst[[1]])) {
    if(as.numeric(yrlst[[1]][i]) > 2016){
      varlist <- c("EMP","EMP_F","EMPSZES","EMPSZES_LABEL","ESTAB","ESTAB_F","YEAR")
    } else {
      varlist <- c("EMP","EMP_F","EMPSZES","EMPSZES_TTL","ESTAB","ESTAB_F","YEAR")
      }
    dfs[[i]] <- getCensus(name = "cbp",
                         vintage = as.numeric(yrlst[[1]][i]),
                         vars = varlist,
                         region = paste0("county:", paste0(countyfips, collapse = ",")),
                         regionin = paste0("state:", paste0(statefips, collapse = ","))) %>% 
      mutate(county_FIPS = paste0(state, county, sep = ""),
             EMP = as.numeric(EMP),
             ESTAB = as.numeric(ESTAB),
             EMP = case_when(EMP_F == "a" ~ round(median(seq(0,19)), digits = 0),  # For employment range symbols, replace with median of the range
                             EMP_F == "b" ~ round(median(seq(20,99)), digits = 0),
                             EMP_F == "c" ~ round(median(seq(100,249)), digits = 0),
                             EMP_F == "e" ~ round(median(seq(250,499)), digits = 0),
                             EMP_F == "f" ~ round(median(seq(500,999)), digits = 0),
                             EMP_F == "g" ~ round(median(seq(1000,2499)), digits = 0),
                             EMP_F == "h" ~ round(median(seq(2500,4999)), digits = 0),
                             EMP_F == "i" ~ round(median(seq(5000,9999)), digits = 0),
                             EMP_F == "j" ~ round(median(seq(10000,24999)), digits = 0),
                             EMP_F == "k" ~ round(median(seq(25000,49999)), digits = 0),
                             EMP_F == "l" ~ round(median(seq(50000,99999)), digits = 0),
                             EMP_F == "m" ~ 100000,
                             TRUE ~ EMP)) %>% 
      filter(EMPSZES == "001"|
               EMPSZES == "210"|EMPSZES == "212"|EMPSZES == "220"|EMPSZES == "230"|EMPSZES == "241") %>% 
      mutate(est_total = ifelse(EMPSZES == "001", ESTAB, 0),
             est_50 = ifelse(EMPSZES != "001", ESTAB, 0),
             est_10 = ifelse(EMPSZES == "210"|EMPSZES == "212" | EMPSZES == "220", ESTAB, 0)) %>% 
      group_by(county_FIPS) %>% 
      summarise(emp_cbp = sum(EMP, na.rm = T),
                est_cbp = sum(est_total, na.rm = T),
                est_50_cbp = sum(est_50, na.rm = T),
                est_10_cbp = sum(est_10, na.rm = T),
                year = yrlst[[1]][i]) %>% 
      mutate(pct_50_est_cbp = est_50_cbp / est_cbp,
             pct_10_est_cbp = est_10_cbp / est_cbp)
  }
  df <- bind_rows(dfs)
  df <- df %>% 
    pivot_wider(names_from = year,
                names_glue = "{.value}_{year}",
                values_from = c(emp_cbp, est_cbp, est_50_cbp, est_10_cbp, pct_50_est_cbp, pct_10_est_cbp))
  return(df)
}

cbp_txksme <- getCBPemp(c(2012,2017,2018,2019), statefips = c(20,23,48))

## Retrieve Nonemployer datasets ##
getNEMPemp <- function(yr, countyfips = "*", statefips){
  yrlst <- list(yr)
  dfs <- vector("list", length(yr))
  for (i in seq_along(yrlst[[1]])) {
    if(as.numeric(yrlst[[1]][i]) < 2017){
      varlist <- c("NESTAB","NESTAB_F","NAICS2012","SECTOR")
    } else if (as.numeric(yrlst[[1]][i] >= 2017)) {
      varlist <- c("NESTAB","NESTAB_F","NAICS2017","SECTOR")
    } 
    dfs[[i]] <- getCensus(name = "nonemp",
                         vintage = as.numeric(yrlst[[1]][i]),
                         vars = varlist,
                         region = paste0("county:", paste0(countyfips, collapse = ",")),
                         regionin = paste0("state:", paste0(statefips, collapse = ","))) %>% 
      mutate(county_FIPS = paste(state, county, sep = ""),
             NESTAB = as.numeric(NESTAB)) %>% 
      rename_with(~gsub("[[:digit:]]+","",.)) %>% 
      mutate(est_total = ifelse(NAICS == "00", NESTAB, 0),
             nonfarm_est_total = ifelse(str_detect(NAICS, "^11", negate = TRUE), NESTAB, 0),
             nonfarm_est_total = ifelse(NAICS == "00", 0, nonfarm_est_total)) %>% 
      group_by(SECTOR, county_FIPS) %>% slice(1) %>% ungroup() %>% 
      group_by(county_FIPS) %>% 
      summarise(neest_nemp = sum(est_total, na.rm = T),
                nonfarmneest_nemp = sum(nonfarm_est_total, na.rm = T),
                year = yrlst[[1]][i]) %>% 
      mutate(pct_nonfarmneest_nemp = nonfarmneest_nemp / neest_nemp)
  }
  df <- bind_rows(dfs)
  df <- df %>% 
    pivot_wider(names_from = year, names_glue = "{.value}_{year}", values_from = c(neest_nemp, nonfarmneest_nemp, pct_nonfarmneest_nemp))
  return(df)
}

nonemp_txksme <- getNEMPemp(c(2012,2017,2018), statefips = c(20,23,48)) %>% mutate(chg_pct_nonfarmneest_nemp_2012_2018 = pct_nonfarmneest_nemp_2018 - pct_nonfarmneest_nemp_2012)

Other Economic Statistics

Additionally, I will gather several other economic statistics from few other datasets. THe information gathered here will be employments, nonfarm proprietors, number of firms, and change in number of firms. Information regarding the employments and nonfarm proprietors will be gathered from dataset provided by the Bureau of Economic Analysis (BEA). Previous data from CBP and Nonemployer Statistics (NS) provide information at the level of establishments. Datasets from the Annual Business Survey (ABS) provides firm level statistics. The Annual Business Survey is a survey conducted by the Census Bureau that replaces five-year Survey of Business Owners (SBO) for employer businesses, the Annual Survey of Entrepreneurs (ASE), the Business R&D and Innovation for Microbusinesses survey (BRDI-M), and the innovation section of the Business R&D and Innovation Survey (BRDI-S) from survey year 2017. Comparable information from previous time frame will be imported from SBO and ASE datasets.

From the aforementioned datasets, I will retrieve and create the following measures:

Bureau of Economic Analysis (BEA)

pct_nonfarm_propr_YEAR: % of nonfarm proprietors over total employment
pct_chg_bea_2012_2018: Change in % of nonfarm proprietors over total employment between 2012 and 2018

Annual Business Survey (ABS) (+ SBO, ASE)

firm_YEAR: Number of firms with paid employees
chg_firm_YEAR: Change in the number of firms with paid employees

BEA Dataset API Call

For BEA data, you have to separately register for BEA API key. In R, there is a package that helps you interact with BEA API called bea.R. Detailed information and instructions can be found here.

#### BEA API Interaction and Dataset Explore ####
## Use bea.R package. An introduction of how the package works is available in the GitHub page (https://github.com/us-bea/bea.R) ##
## Mainly we want to explore what kind of data is available and will be useful in terms of entrepreneurship.
## Specifically non-farm proprietorship information that is comparable across different states.

install.packages("bea.R")

library(bea.R)
library(tidyverse)

# Set up the API Key
beaKey <- "YOUR API KEY"

#### Checking datasets and available tables etc. ####
## BEA API exploration requires few prior knowledge on available datasets and tables
## Datasets available are listed here (https://apps.bea.gov/API/signup/index.cfm)
## Tables and parameters used for querying specific datasets are available in the appendices in the API user guide (https://apps.bea.gov/API/bea_web_service_api_user_guide.htm#tabs-9a)
## Another way to identify what you are looking for is playing around with interactive application provided by BEA here (https://apps.bea.gov/itable/itable.cfm).
## In our case we want to look at county level regional information. Specifically, nonfarm proprietors employment

### First, you can use beaSearch to see what data tables are available by keyword

beaSearch("Regional", beaKey, asHtml = T)     # Currently, regional dataset is depracated and seems like the bea.R package hasn't yet been fully updated

### Second, inspect the parameters and parameter values
## We can look at what parameters are available for querying the dataset 'Regional' by using code below
beaParams(beaKey = beaKey, 'Regional')
## Further, you can see what values are available for specific parameter using code below
beaParamVals(beaKey = beaKey, 'Regional', 'TableName')
beaParamVals(beaKey = beaKey, 'Regional', 'LineCode')

library(bea.R)

## Loading required package: data.table

## data.table 1.14.0 using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com

## **********
## This installation of data.table has not detected OpenMP support. It should still work but in single-threaded mode.
## This is a Mac. Please read https://mac.r-project.org/openmp/. Please engage with Apple and ask them for support. Check r-datatable.com for updates, and our Mac instructions here: https://github.com/Rdatatable/data.table/wiki/Installation. After several years of many reports of installation problems on Mac, it's time to gingerly point out that there have been no similar problems on Windows or Linux.
## **********

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

## Creating a generic function for 'toJSON' from package 'jsonlite' in package 'googleVis'

## Note: As of February 2018, beaGet() requires 'TableName' for NIPA and NIUnderlyingDetail data instead of 'TableID.' See https://github.us-bea/bea.R for details.

### Finally, we will use the parameter and parameter values to retrieve data table
## In terms of LineCode, we are interested in:
## 10: Total employment
## 20: Wage and salary employment
## 40: Proprietors employment
## 60: Nonfarm proprietors employment

## In terms of GeoFips, we will query all counties and then filter later on for the three states: Kansas, Maine, Texas
## In terms of Year, the research team has decided to look at the most recent year (2018) and 2012 as it is the year the nation started
## to recover from the Recession

## The LineCode parameter does not allow multiple values input.
## Therefore, we have to generate separate tables for each LineCode and merge afterwards

#### Function to Retrieve Data tables from BEA API ####
getBEAstat <- function(yr, beaKey, linecode = c(10, 40, 60)) {
  beaKey <- beaKey
  yrlist <- list(yr)
  linecodelist <- list(linecode)
  dfs <- vector("list", length(linecode))
  for (i in seq_along(linecodelist[[1]])) {
    beaSpecs <- list(
      'UserID' = beaKey,
      'Method' = 'GetData',
      'datasetname' = 'Regional',
      'TableName' = 'CAEMP25N',
      'LineCode' = linecodelist[[1]][i],
      'GeoFips' = 'COUNTY',
      'Year' = paste0(yr, collapse = ",")
    )
    if (linecodelist[[1]][i] == 10) {
      dfs[[i]] <- beaGet(beaSpecs, asWide = F) %>% 
        filter(str_detect(GeoFips, "^48")|str_detect(GeoFips, "^20")|str_detect(GeoFips, "^23")) %>% 
        select(GeoFips, TimePeriod, DataValue, GeoName) %>% 
        mutate(GeoName = gsub(",.*$","",GeoName)) %>% 
        rename(totalemp_bea = DataValue,
               county = GeoName)
    } else if (linecodelist[[1]][i] == 40) {
      dfs[[i]] <- beaGet(beaSpecs, asWide = F) %>% 
        filter(str_detect(GeoFips, "^48")|str_detect(GeoFips, "^20")|str_detect(GeoFips, "^23")) %>%
        select(GeoFips, TimePeriod, DataValue, GeoName) %>% 
        mutate(GeoName = gsub(",.*$","",GeoName)) %>% 
        rename(proprietors_bea = DataValue,
               county = GeoName)
    } else if (linecodelist[[1]][i] == 60) {
      dfs[[i]] <- beaGet(beaSpecs, asWide = F) %>% 
        filter(str_detect(GeoFips, "^48")|str_detect(GeoFips, "^20")|str_detect(GeoFips, "^23")) %>% 
        select(GeoFips, TimePeriod, DataValue, GeoName) %>% 
        mutate(GeoName = gsub(",.*$","",GeoName)) %>% 
        rename(nonfarmproprietors_bea = DataValue,
               county = GeoName)
    }
  }
  
  bea_10 <- dfs[[1]] %>% select(-county)
  bea_40 <- dfs[[2]] %>% select(-county)
  bea_60 <- dfs[[3]] %>% select(-county)
  bea_prop <- bea_10 %>% 
    left_join(., select(bea_40, proprietors_bea, TimePeriod, GeoFips), by = c("GeoFips" = "GeoFips", "TimePeriod" = "TimePeriod")) %>% 
    left_join(., select(bea_60, nonfarmproprietors_bea, TimePeriod, GeoFips), by = c("GeoFips" = "GeoFips", "TimePeriod" = "TimePeriod")) %>% 
    mutate(pct_nonfarm_bea = nonfarmproprietors_bea / totalemp_bea) %>% 
    rename(year = TimePeriod,
           county_FIPS = GeoFips) %>% 
    pivot_wider(names_from = year, names_glue = "{.value}_{year}", values_from = c(totalemp_bea, proprietors_bea, nonfarmproprietors_bea, pct_nonfarm_bea))
  
  return(bea_prop)
  
}

bea_nonfarmproprietor_txksme <- getBEAstat(c(2012, 2017, 2018, 2019), beaKey = beaKey) %>% mutate(pct_chg_bea_2012_2019 = pct_nonfarm_bea_2019 - pct_nonfarm_bea_2012)

ABS Dataset API Call

In this section, I will import relevant datasets from ABS, ASE, and SBO and create a dataset that will be used for final merge.

#### Look at ABS Metadata ####
## Detailed information about available tables and variables can be found in PDF documents here (https://www.census.gov/programs-surveys/abs/technical-documentation/api.html)

View(listCensusMetadata(name = "abscs", vintage = 2017, type = "variables"))
View(listCensusMetadata(name = "ase/csa", vintage = 2015, type = "variables"))
View(listCensusMetadata(name = "ase/csa", vintage = 2015, type = "geographies"))  # Can't use ASE data because it does not have county level info
View(listCensusMetadata(name = "sbo", vintage = 2012, type = "variables"))
View(listCensusMetadata(name = "sbo", vintage = 2012, type = "geographies"))

## ABS Company Summary Datatable ##
firmlevel_2017 <- getCensus(name = "abscs",
                        vintage = 2017,
                        vars = c("EMP","EMP_F","FIRMPDEMP","FIRMPDEMP_F"),
                        region = "county:*",
                        regionin = "state:20,23,48")

firmlevel_2012 <- getCensus(name = "sbo",
                            vintage = 2012,
                            vars = c("EMP","EMP_F","FIRMPDEMP","FIRMPDEMP_F"),
                            region = "county:*",
                            regionin = "state:20,23,48")

## Clean and transform ##
firmlevel_2012 <- firmlevel_2012 %>% 
  mutate(county_FIPS = paste(state, county, sep = ""),
         EMP = as.numeric(EMP),
         FIRMPDEMP = as.numeric(FIRMPDEMP),
         EMP = case_when(EMP_F == "a" ~ round(median(seq(0,19)), digits = 0),  # For employment range symbols, replace with median of the range
                         EMP_F == "b" ~ round(median(seq(20,99)), digits = 0),
                         EMP_F == "c" ~ round(median(seq(100,249)), digits = 0),
                         EMP_F == "e" ~ round(median(seq(250,499)), digits = 0),
                         EMP_F == "f" ~ round(median(seq(500,999)), digits = 0),
                         EMP_F == "g" ~ round(median(seq(1000,2499)), digits = 0),
                         EMP_F == "h" ~ round(median(seq(2500,4999)), digits = 0),
                         EMP_F == "i" ~ round(median(seq(5000,9999)), digits = 0),
                         EMP_F == "j" ~ round(median(seq(10000,24999)), digits = 0),
                         EMP_F == "k" ~ round(median(seq(25000,49999)), digits = 0),
                         EMP_F == "l" ~ round(median(seq(50000,99999)), digits = 0),
                         EMP_F == "m" ~ 100000,
                         TRUE ~ EMP),
         FIRMPDEMP = case_when(FIRMPDEMP_F == "S" ~ NA_real_, TRUE ~ FIRMPDEMP)) %>% 
  rename(emp_abs_2012 = EMP, firm_2012 = FIRMPDEMP) %>% 
  select(-c("EMP_F","FIRMPDEMP_F","state","county"))

firmlevel_2017 <- firmlevel_2017 %>% 
  mutate(county_FIPS = paste(state, county, sep = ""),
         EMP = as.numeric(EMP),
         FIRMPDEMP = as.numeric(FIRMPDEMP),
         EMP = case_when(EMP_F == "a" ~ round(median(seq(0,19)), digits = 0),  # For employment range symbols, replace with median of the range
                         EMP_F == "b" ~ round(median(seq(20,99)), digits = 0),
                         EMP_F == "c" ~ round(median(seq(100,249)), digits = 0),
                         EMP_F == "e" ~ round(median(seq(250,499)), digits = 0),
                         EMP_F == "f" ~ round(median(seq(500,999)), digits = 0),
                         EMP_F == "g" ~ round(median(seq(1000,2499)), digits = 0),
                         EMP_F == "h" ~ round(median(seq(2500,4999)), digits = 0),
                         EMP_F == "i" ~ round(median(seq(5000,9999)), digits = 0),
                         EMP_F == "j" ~ round(median(seq(10000,24999)), digits = 0),
                         EMP_F == "k" ~ round(median(seq(25000,49999)), digits = 0),
                         EMP_F == "l" ~ round(median(seq(50000,99999)), digits = 0),
                         EMP_F == "m" ~ 100000,
                         TRUE ~ EMP),
         FIRMPDEMP = case_when(FIRMPDEMP_F == "S" ~ NA_real_, TRUE ~ FIRMPDEMP)) %>% 
  rename(emp_abs_2017 = EMP, firm_2017 = FIRMPDEMP) %>% 
  select(-c("EMP_F","FIRMPDEMP_F","state","county"))

## Join the two years, filter Texas, and calculate change variable ##
firmlevel_txksme <- left_join(firmlevel_2012, firmlevel_2017, by = "county_FIPS") %>% filter(str_detect(county_FIPS, "^48")|str_detect(county_FIPS, "^20")|str_detect(county_FIPS, "^23")) %>% 
  mutate(chg_firm_2012_2017 = firm_2017 - firm_2012)

ACS Broadband Measures and Demographics

The ACS broadband subscription measures currently available from the dataset provided by the Arizona State University research team represents people with “any” type of broadband including cell phone subscription. This not only could potentially inflate the subscription number, but also does not allow much comparability with other broadband variables. Therefore, I will call the ACS data directly to retrieve fixed broadband subscription excluding wireless options such as satellite and cell phone broadband.

Furthermore, I will update the demographic information using direct and most up-to-date data from the API.

#### ACS Broadband Subscription Dataset ####
View(listCensusMetadata(name = "acs/acs5", vintage = 2018, type = "variables"))
View(listCensusMetadata(name = "acs/acs5/profile", vintage = 2018, type = "variables"))
View(listCensusMetadata(name = "pep/charagegroups", vintage = 2019, type = "variables"))
View(listCensusMetadata(name = "acs/acs5/subject", vintage = 2018, type = "variables"))
View(listCensusMetadata(name = "acs/acs5/subject", vintage = 2018, type = "geography"))

## Broadband Subscription Percentage from Subject Table ##
acs_broadband_pct <- getCensus(name = "acs/acs5/subject", vintage = 2019,
                               vars = c("S2801_C02_017E","S2801_C02_019E","S2801_C02_016E","S2801_C02_018E",
                                        "S2801_C02_006E","S2801_C02_002E"),
                               region = "county:*",
                               regionin = "state:20,23,48")

# Rename variables #
acs_broadband_pct <- acs_broadband_pct %>% 
  rename(pct_cellularonly_acs_2019 = S2801_C02_016E,
         pct_fixed_acs_2019 = S2801_C02_017E,
         pct_sat_acs_2019 = S2801_C02_018E,
         pct_nobroadband_acs_2019 = S2801_C02_019E,
         pct_mobileonly_acs_2019 = S2801_C02_006E,
         pct_anycomputer_acs_2019 = S2801_C02_002E) %>% 
  mutate(pct_anybroadband_acs_2019 = (100 - pct_nobroadband_acs_2019)/100,
         pct_nocomputer_acs_2019 = (100 - pct_anycomputer_acs_2019)/100,
         pct_cellularonly_acs_2019 = pct_cellularonly_acs_2019/100,
         pct_fixed_acs_2019 = pct_fixed_acs_2019/100,
         pct_sat_acs_2019 = pct_sat_acs_2019/100,
         pct_nobroadband_acs_2019 = pct_nobroadband_acs_2019/100,
         pct_mobileonly_acs_2019 = pct_mobileonly_acs_2019/100,
         pct_anycomputer_acs_2019 = pct_anycomputer_acs_2019/100,
         county_FIPS = paste(state, county, sep = ""))

# Calculate Digital Distress Index informed by Gallardo & Geideman (2019) (https://medium.com/design-and-tech-co/digital-distress-what-is-it-and-who-does-it-affect-part-1-e1214f3f209b)
normalize <- function(x, range = c(0,1)) {
  norm <- (range[2] - range[1])*((x - min(x))/(max(x) - min(x))) + range[1]
  return(norm)
} # Function that normalizes the given vector

acs_broadband_pct <- acs_broadband_pct %>% 
  mutate(distress_subscription_2019 = scale(pct_cellularonly_acs_2019 + pct_nobroadband_acs_2019)[,],
         distress_devices_2019 = scale(pct_mobileonly_acs_2019 + pct_nocomputer_acs_2019)[,],
         digital_distress_2019 = normalize(distress_subscription_2019 + distress_devices_2019, range = c(0,100)))

## Educational Attainment and Industry Profile, Self-employment from the ACS 2019 ##
acs_demo <- getCensus(name = "acs/acs5/profile", vintage = 2019,
                      vars = c("DP02_0067PE","DP02_0068PE","DP02_0066PE", # Educational attainment
                               "DP03_0033PE","DP03_0034PE","DP03_0035PE","DP03_0036PE","DP03_0037PE",
                               "DP03_0038PE","DP03_0039PE","DP03_0040PE","DP03_0041PE","DP03_0042PE",
                               "DP03_0043PE","DP03_0044PE","DP03_0045PE", # Industry profile
                               "DP03_0049PE","DP03_0009PE",
                               "DP05_0007PE","DP05_0008PE", # GenZ,
                               "DP05_0009PE","DP05_0010PE", # Millennials
                               "DP05_0011PE","DP05_0012PE", # GenX
                               "DP05_0013PE","DP05_0014PE","DP05_0015PE", # Boomers
                               "DP05_0001E"), # Total population
                      region = "county:*",
                      regionin = "state:20,23,48")

# Rename variables #
demo_cont <- acs_demo %>% 
  mutate(DP02_0066PE = 100 - DP02_0066PE) %>% # Substract from 100 to get % of ppl less than highschool
  rename(pctlessthanhigh_2019 = DP02_0067PE,
         pctbachelors_2019 = DP02_0068PE,
         pctgraduate_2019 = DP02_0066PE,
         pctagriculture_2019 = DP03_0033PE, pctconstruction_2019 = DP03_0034PE, pctmanufacture_2019 = DP03_0035PE,
         pctwholesale_2019 = DP03_0036PE, pctretail_2019 = DP03_0037PE, pcttransportation_2019 = DP03_0038PE,
         pctinformation_2019 = DP03_0039PE, pctfinance_2019 = DP03_0040PE, pctprofessional_2019 = DP03_0041PE,
         pctedu_healthcare_social_2019 = DP03_0042PE, pctarts_ent_2019 = DP03_0043PE,
         pctpublic_admin_2019 = DP03_0045PE, pctother_occupation_2019 = DP03_0044PE,
         pct_self_employed_2019 = DP03_0049PE,
         pct_unemployment_2019 = DP03_0009PE,
         population_2019 = DP05_0001E,
         genz1 = DP05_0007PE, genz2 = DP05_0008PE, mill1 = DP05_0009PE, mill2 = DP05_0010PE,
         genx1 = DP05_0011PE, genx2 = DP05_0012PE, boom1 = DP05_0013PE, boom2 = DP05_0014PE,
         boom3 = DP05_0015PE) %>% 
  mutate(county_FIPS = paste0(state, county),
         pctlessthanhigh_2019 = pctlessthanhigh_2019/100, pctbachelors_2019 = pctbachelors_2019/100,
         pctgraduate_2019 = pctgraduate_2019/100, pctagriculture_2019 = pctagriculture_2019/100,
         pctconstruction_2019 = pctconstruction_2019/100, pctmanufacture_2019 = pctmanufacture_2019/100,
         pctwholesale_2019 = pctwholesale_2019/100, pctretail_2019 = pctretail_2019/100,
         pcttransportation_2019 = pcttransportation_2019/100, pctinformation_2019 = pctinformation_2019/100,
         pctfinance_2019 = pctfinance_2019/100, pctprofessional_2019 = pctprofessional_2019/100,
         pctedu_healthcare_social_2019 = pctedu_healthcare_social_2019/100, pctarts_ent_2019 = pctarts_ent_2019/100,
         pctpublic_admin_2019 = pctpublic_admin_2019/100, pctother_occupation_2019 = pctother_occupation_2019/100,
         pct_self_employed_2019 = pct_self_employed_2019/100, pct_unemployment_2019 = pct_unemployment_2019/100,
         pct_genz_2019 = genz1 + genz2, pct_millennial_2019 = mill1 + mill2, pct_genx_2019 = genx1 + genx2,
         pct_boomers_2019 = boom1 + boom2 + boom3,
         pct_genz_2019 = pct_genz_2019/100, pct_millennial_2019 = pct_millennial_2019/100, pct_genx_2019 = pct_genx_2019/100,
         pct_boomers_2019 = pct_boomers_2019/100) %>% 
  select(-c(state, county, genz1, genz2, mill1, mill2, genx1, genx2, boom1, boom2, boom3))

FCC 477 Broadband Availability for Year 2019

I will here retrieve FCC 477 broadband data from 2016 to 2019. FCC publishes fixed broadband deployment data every 6 months. Here I will get data from December 2016 to December 2019 from Area Tables API provided by the FCC. In order to interact with FCC broadband data directly through its API, one needs to register for the FCC data portal as well as personal API App Token. Instructions can be found here. Package used to interact with the data is RSocrata.

Alternative to using the API, we could also download the dataset directly from FCC website using the URLs. As API interaction did not work properly at the time of writing this, we will retrieve data directly from the URL.

Using the raw data, We will generate:

pct25_3_(month_yr)_fcc: % of broadband availability at the speed of 25/3 per county
pct100_10_(month_yr)_fcc: % of broadband availability at the speed of 100/10 per county
pct250_25_(month_yr)_fcc: % of broadband availability at the speed of 250/25 per county
pct1000_10_(month_yr)_fcc: % of broadband availability at the speed of 1000/100 per county

library(RSocrata)
#### Import FCC 477 Data of Three States from Area Tables API ####
## Queries for ME, KS, TX for each time point
# Dec 2016
fcc_AT_dec_2016 <- read.socrata(
  "https://opendata.fcc.gov/resource/xv2f-wqqz.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# June 2017
fcc_AT_jun_2017 <- read.socrata(
  "https://opendata.fcc.gov/resource/yikn-7er3.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# Dec 2017
fcc_AT_dec_2017 <- read.socrata(
  "https://opendata.fcc.gov/resource/fnid-qg8r.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# Jun 2018
fcc_AT_jun_2018 <- read.socrata(
  "https://opendata.fcc.gov/resource/cekp-f8tj.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# Dec 2018
fcc_AT_dec_2018 <- read.socrata(
  "https://opendata.fcc.gov/resource/wucg-w9k9.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# Jun 2019
fcc_AT_jun_2019 <- read.socrata(
  "https://opendata.fcc.gov/resource/udcw-naqn.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)
# Dec 2019
fcc_AT_dec_2019 <- read.socrata(
  "https://opendata.fcc.gov/resource/ws2a-amik.json?$where=type = 'county' AND (starts_with(id,'20') OR starts_with(id,'48') OR starts_with(id,'23')) AND (speed = '25' OR speed = '100' OR speed = '250' OR speed = '1000')&$order=id",
  app_token = token,
  email = email,
  password = password
)

## Queries for downloading FCC staff county estimates
# Create temporary file space to download the dataset
temp <- tempfile()
temp2 <- tempfile()
## Import FCC Staff's block estimates ##
## The dataset provides FCC's estimates of population, households, and housing units per census block ##
## The links are available at https://www.fcc.gov/reports-research/data/staff-block-estimates ##
download.file("https://www.fcc.gov/file/19314/download", temp)  # 2019 Estimates
unzip(zipfile = temp, exdir = temp2)
fcc_staff_est_us2019 <- read.csv(file.path(temp2, "us2019.csv"), header = T)
unlink(temp)
unlink(temp2)
rm(temp)
rm(temp2)

#### Clean Dataset ####
## Staff Estimates for US Census Block Level ##
## Aggregate into county level estimates
fcc_staff_est_us2019 <- fcc_staff_est_us2019 %>% 
  mutate(block_fips = as.character(block_fips),
         county_fips = str_sub(block_fips, 1, 5)) %>% 
  select(stateabbr, county_fips, block_fips, starts_with("pop")) %>% 
  group_by(county_fips) %>% 
  summarise_at(., vars(starts_with("pop")), sum)

#### Calculating the Broadband Availability ####
## General process:
## 1. Filter by tech code in order to eliminate satellite technology
## 2. Based on speed categories, sum up population of the counties that have at least one provider
##    providing broadband at the speed category
## 3. Merge with the staff estimate total county population for each year
## 4. Calculate percentages

dflst <- list(dec_2016 = fcc_AT_dec_2016,
              jun_2017 = fcc_AT_jun_2017,
              dec_2017 = fcc_AT_dec_2017,
              jun_2018 = fcc_AT_jun_2018,
              dec_2018 = fcc_AT_dec_2018,
              jun_2019 = fcc_AT_jun_2019,
              dec_2019 = fcc_AT_dec_2019)

# Function to clean and calculate all the variables for retrieved datasets

getFCCpct <- function(dflst) {
  dfs <- vector("list", length(dflst))
  for (i in seq_along(dflst)) {
    if (gsub("\\_.*","",names(dflst[i])) == "jun") {
      staff_est <- fcc_staff_est_us2019 %>% select(., county_fips, ends_with(as.character(as.numeric(gsub("[^[:digit:]]","", names(dflst)[i]))-1)))
    } else {staff_est <- fcc_staff_est_us2019 %>% select(., county_fips, ends_with(gsub("[^[:digit:]]", "", names(dflst)[i])))}
    
    dfs[[i]] <- dflst[[i]] %>% 
      filter(tech == "acfow") %>% mutate_at(vars(starts_with("has_")), as.numeric) %>% 
      mutate(pop25_3 = ifelse(speed == "25" & (has_1 > 0 | has_2 > 0 | has_3more > 0), rowSums(select(., starts_with("has_"), -has_0)), 0),
             pop100_10 = ifelse(speed == "100" & (has_1 > 0 | has_2 > 0 | has_3more > 0), rowSums(select(., starts_with("has_"), -has_0)), 0),
             pop250_25 = ifelse(speed == "250" & (has_1 > 0 | has_2 > 0 | has_3more > 0), rowSums(select(., starts_with("has_"), -has_0)), 0),
             pop1000_100 = ifelse(speed == "1000" & (has_1 > 0 | has_2 > 0 | has_3more > 0), rowSums(select(., starts_with("has_"), -has_0)), 0)) %>% 
      group_by(id) %>% summarise(pop25_3 = sum(pop25_3),
                                 pop100_10 = sum(pop100_10),
                                 pop250_25 = sum(pop250_25),
                                 pop1000_100 = sum(pop1000_100),
                                 mth_yr = names(dflst)[i]) %>% 
      left_join(., staff_est, by = c("id" = "county_fips")) %>%
      mutate(pct25_3 = (pop25_3/select(., starts_with("pop20")))[,1],
             pct100_10 = (pop100_10/select(., starts_with("pop20")))[,1],
             pct250_25 = (pop250_25/select(., starts_with("pop20")))[,1],
             pct1000_100 = (pop1000_100/select(., starts_with("pop20")))[,1])
  }
  df <- bind_rows(dfs)
  df <- df %>% 
    pivot_wider(names_from = mth_yr,
                names_glue = "{.value}_{mth_yr}_fcc",
                values_from = c(pct25_3, pct100_10, pct250_25, pct1000_100)) %>% 
    gather(var, val, pct25_3_dec_2016_fcc:pct1000_100_dec_2019_fcc, na.rm = T) %>% 
    group_by(id, var) %>% 
    distinct(val) %>% 
    spread(var, val) %>% mutate_at(vars(-id), ~ifelse(.x > 1, NA, .x))
  
  return(df)
}

fcc477_txksme_county <- getFCCpct(dflst = dflst)

Now we will merge these datasets by county FIPS numbers altogether

## Create a V2 merged dataset ##
tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged %>%  
  left_join(., cbp_txksme, by = c("county_FIPS" = "county_FIPS")) %>% 
  left_join(., nonemp_txksme, by = c("county_FIPS" = "county_FIPS")) %>% 
  left_join(., acs_broadband_pct, by = c("county_FIPS" = "county_FIPS")) %>% 
  left_join(., bea_nonfarmproprietor_txksme, by = c("county_FIPS" = "county_FIPS")) %>% 
  left_join(., firmlevel_txksme, by = c("county_FIPS" = "county_FIPS")) %>% 
  left_join(., select(rucc2013, c("FIPS","RUCC_2013","metro_desc")), by = c("county_FIPS" = "FIPS")) %>% 
  left_join(., demo_cont, by = c("county_FIPS" = "county_FIPS"))

## Generate average GoDaddy & M-lab broadband measures and clean unnecessary variables ##
tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged_v2 %>% 
  mutate(vd_mean_19 = rowMeans(select(., matches("^venturedensity.*19$")), na.rm = T),
         vd_mean_20 = rowMeans(select(., matches("^venturedensity.*20$")), na.rm = T),
         havd_mean_19 = rowMeans(select(., matches("^highlyactive.*19$")), na.rm = T),
         havd_mean_20 = rowMeans(select(., matches("^highlyactive.*20$")), na.rm = T),
         pct_broadband_mlab = rowMeans(select(., c(frac_BB.sept, frac_BB.oct, frac_BB.nov, frac_BB.dec)), na.rm = T)) %>% 
  select(-c(starts_with("frac_under_"), starts_with("frac_over"), starts_with("samples."), starts_with("BB_samples"), state, county.y, starts_with("DL"), starts_with("UL"), starts_with("dl_sample"), starts_with("ul_sample")))

tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged_v2 %>% 
  left_join(., fcc477_txksme_county, by = c("county_FIPS" = "id"))

## Modify nonemployer percentage - Nonemployer share over total employment
tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged_v2 %>% 
  mutate(pct_nonfarmneest_nemp_2018 = nonfarmneest_nemp_2018 / totalemp_bea_2018,
         pct_nonfarmneest_nemp_2012 = nonfarmneest_nemp_2012 / totalemp_bea_2012,
         pct_nonfarmneest_nemp_2012 = ifelse(is.na(pct_nonfarmneest_nemp_2012) | is.infinite(pct_nonfarmneest_nemp_2012),NA, pct_nonfarmneest_nemp_2012),
         pct_nonfarmneest_nemp_2018 = ifelse(is.na(pct_nonfarmneest_nemp_2018) | is.infinite(pct_nonfarmneest_nemp_2018),NA, pct_nonfarmneest_nemp_2018),
         metro_f = as.factor(metro_desc)) 

## Create industry diversity index following equations used by Alesina et al. (1999) and Gallardo et al. (2020)
## The equation is Industry = 1 - sum(ind_share^2)

tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged_v2 %>% 
  mutate(indstry_diversity = 1 - (pctagriculture_2019^2 + pctconstruction_2019^2 + 
                                    pctmanufacture_2019^2 + pctwholesale_2019^2 + pctretail_2019^2 +
                                    pcttransportation_2019^2 + pctinformation_2019^2 +
                                    pctfinance_2019^2 + pctprofessional_2019^2 +
                                    pctedu_healthcare_social_2019^2 + pctother_occupation_2019^2 +
                                    pctpublic_admin_2019^2 + pctarts_ent_2019^2)) %>% 
  mutate(pct_broadband_MS_nov_2019 = case_when(is.na(pct_broadband_MS_nov_2019) ~ 0, TRUE ~ pct_broadband_MS_nov_2019),
         pct_broadband_MS_oct_2020 = case_when(is.na(pct_broadband_MS_oct_2020) ~ 0, TRUE ~ pct_broadband_MS_oct_2020),
         pct_broadband_FCC_dec_2017 = case_when(is.na(pct_broadband_FCC_dec_2017) ~ 0, TRUE ~ pct_broadband_FCC_dec_2017),
         pct_broadband_FCC_dec_2019 = case_when(is.na(pct_broadband_FCC_dec_2019) ~ 0, TRUE ~ pct_broadband_FCC_dec_2019),
         pct_fixed_acs_2019 = case_when(is.na(pct_fixed_acs_2019) ~ 0, TRUE ~ pct_fixed_acs_2019),
         pct_broadband_mlab = case_when(is.na(pct_broadband_mlab) ~ 0, TRUE ~ pct_broadband_mlab),
         nonfarmprop_percapita = nonfarmproprietors_bea_2019 / population_2019,
         est_10_cbp_percapita = est_10_cbp_2019 / population_2019,
         est_50_cbp_percapita = est_50_cbp_2019 / population_2019)

## Create composite measure of broadband Quality of Service (QoS) using Microsoft's broadband variable and Mlab data
## This is done by standardizing each variable, adding the two, and normalizing it
tx_bb_entrepreneur_merged_v2 <- tx_bb_entrepreneur_merged_v2 %>% 
  mutate(pct_bb_qos = normalize((scale(pct_broadband_MS_oct_2020)[,] + scale(pct_broadband_mlab)[,]),
                                      range = c(0,100))/100,
         digital_distress_2019 = digital_distress_2019/100)

## Write CSV ##
write_csv(tx_bb_entrepreneur_merged_v2, "Broadband-Entrepreneurship-TXKSME.csv")