Chapter 3 Data

3.1 Risk Screening Environmental Indicators Model Data

The Risk Screening Environmental Indicators (RSEI) Model is a geographically detailed data set produced by the United States Environmental Protection Agency (EPA). RSEI data is based upon the Toxic Release Inventory (TRI), a data set on toxic releases in the United States.

TRI data is self reported by facilities, with each observation being a release of a reporting chemical at a reporting facility. For each observation, data is collected on which chemical was released, how much of it was released, and the facility from which it was released. This data is collected for a mandated list of industries, facilities, and chemicals that change over time. For a release to be reported it has to be a mandated chemical, released by a mandated facility in a mandated industry.

The detailed location and chemical data that is collected through TRI is reformatted through a fate and transport model to create RSEI. The RSEI data shows where each release has traveled, from its initial point of release, on an 800m by 800m grid across the USA. The ultimate data is an observation for each release, for each square in the grid that the release hits. This gives us an idea of how the chemicals spread from the source locations, and enables us to create a map across the entire nation for where TRI chemicals accumulate in any year between 1988 and 2014. This is a much improved measure of pollution, as it theoretically shows the amount of toxins experienced in any given location.

The initial RSEI disaggregated microdata is approximately 4 terabytes of data. Given that the 800m by 800m grid across the United States has on the order of 10 million squares, each release hits many squares, we have many releases, and the data covers 24 years, this is an enormous data set.

The RSEI reformat of the TRI contains the following variables: X and Y are the geographic identifiers for the cell on the grid across the US, with (0, 0) in the center of the US. Release number tells us which release that row is associated with. Chemical number through media are all release specific data. Conc (concentration) is specific to the release at that square, indicating the concentration of the chemical across that grid cell. ToxConc is the toxicity weighted concentration of that release in that grid cell. ‘Score’ variables are toxicity values as weighted by the population in that grid cell.

Below see an example of the disaggregated microdata:

X	Y	Release	Chem	Facility	Media	Conc	ToxConc	Score
-185	51	2050156	317	3	1	4.6e-4	2.28e-3	0
-184	41	2050156	317	3	1	3.3e-4	1.65e-3	0
-184	42	2050156	317	3	1	3.3e-4	1.67e-3	0
-184	43	2050156	317	3	1	3.3e-4	1.66e-3	0
-184	44	2050156	317	3	1	3.4e-4	1.68e-3	0
…	…	…	…	…	…	…	…	…

3.2 Important Caveats

Due to the nature of the data, there are a few interesting caveats to consider.

The RSEI data is self reported, and has been thought to contain some severe under reporting (De Marchi and Hamilton, 2006). This could lead to low estimates of toxicity, and could bias results if certain areas or industries are more likely to under report. Highly regulated chemicals may be more likely to be under reported, as facilities are incentivized to release as little as possible.

The data is entirely based off a black box fate and transport model. The model has uncertainty that is not reported in RSEI and therefore is not being addressed. The raw data we use are estimates to begin with, but are being treated as observed data.

The data only captures releases for certain facility types within certain industries, for certain chemical types within those facilities. Not all chemicals are mandated reporting, and any analysis that is done based off the data can’t be extrapolated to discuss toxicity more generally. Some important industries (like mining) are not monitored over the entire period, and are removed to ensure consistency over time, so areas that strongly represent those industries will have deflated toxicity estimates.

Because RSEI only captures a specific subset of chemicals, it is difficult to relate the RSEI scores to health or quality of life outcomes in an area. There are many obvious environmental hazards that are likely to have strong influence on public health and living conditions that TRI doesn’t address, eg: brownfields, solid waste disposal, animal farming, hazardous waste, etc.

Another detail of the data that makes interpreting results difficult is how chemicals are compared. RSEI gives the weight of the release and the chemical of the release, but chemicals have very different levels of toxicity. A small amount of mercury released is much worse than a small amount of CO2. To that end, the EPA assigned each chemical a ‘toxicity’ weight, by which the amount of the chemical released is multiplied. This means that we can aggregate all the chemicals in an area, and compare the overall toxicity over space and time. The toxicity weights allow us to compare to other chemicals’ toxicity weight, and overall toxicity of a given chemical release, but also means that the values only have meaning in comparison.

Despite the limitations that the data presents, it provides an incredibly detailed and complex view of toxicity in the United States that is worth delving in to.

3.3 Data Continuity

3.3.1 Census Comparability

RSEI data is available between 1988 and 2015, and over that time Census geographies have experienced considerable overhaul. Many of our questions of interest involve demographic characteristics, and the changes we see in environmental toxicity over time for those demographics. As such, we need aggregate toxicity to block group or tract level in order to merge with Census data. To do so, we calculate the toxicity values in Census 2010 geography. Since we want to use Census areas as the unit of analysis, we use crosswalks that help us transform past Census data to the 2010 geographies.

3.3.2 Chemical Consistency

Consistency across time, space, and chemicals is vital in order to be able to compare toxicity values across time. Though the EPA is incredibly detailed in their data collection, they are also subject to changing scientific consensus and legislation, and therefore haven’t been able to provide entirely consistent data.

Over time, as chemicals have been found to be toxic, they have been added to the list of TRI mandated reporting chemicals. There are also chemicals that have been removed from the mandated reporting list. Because the list of chemicals reported changes over time, aggregating all the data would cause us to see artificial jumps in toxicity. These jumps wouldn’t be reflective of actual increases in toxicity, but rather toxicity that was beginning to be measured. These jumps may change what areas appear toxic in the data, as industries that don’t have to report at some point in the time frame will be entirely removed from the data set. Several of those industries are very highly polluting, and after elimination locations that focus strongly on those industries will show deflated scores.

3.3.3 Industry Reporting Consistency

TRI regulates who needs to self report using the North American Industry Classification System (NAICS), and before NAICS was available used its predecessor, the Standard Industrial Classification (SIC) system. Just as we see changes in the regulations for chemicals, we see changes in the regulations of various industries. NAICS codes that need to report are regulated independently of chemical codes, and NAICS codes that are not consistently reported across the time period of interest must be removed to maintain continuity.

As an example, a textiles facility releasing mercury might fall under mandated reporting due to their NAICS code, but the neighboring mining facility that also releases mercury might not have to report because of the NAICS code. If that changes over the time period, and suddenly the mining industry also mandates reporting, we will see an artificial huge jump in the mercury present in that area if we don’t remove by industry.

3.4 Data Cleanup

Data cleaning was done with R using the DBI and SQLite packages (R-SIG-DB et al., 2018; Muller et al., 2018.). Since there is so much data, it’s not feasible to process it using typical R functions, so we used a SQL database. This significantly speeds up the processes detailed below.

The disaggregated microdata from RSEI is one observation per release per cell on the 800m by 800m grid that it reached. We filter each of the observations to check that it is

from a chemical that is consistent across the relevant years, and
from a release that is linked to an industry that is consistent across years,

then allocate the observation to the appropriate Census geographic unit.

Filtering out observations whose releases are not under a regulated NAICS category for the entire time is complex. Using a table that contains the regulation and deregulation dates of NAICS codes we can find the consistent industry categories. However, the only reference to NAICS or SIC codes in RSEI data are not release specific. The Facility table provides the 6 NAICS codes most commonly associated with the facility. However, NAICS codes are release specific, not facility specific, meaning that for each emission reported a NAICS code is reported. Removing by facility will not yield accurate results, since facilities may have different types of NAICS emissions. The textiles facility we used as an example earlier might make both shoes and jackets. These production outputs would have different NAICS codes, meaning only releases originating from one production line would have to be reported. To get data on the NAICS codes by submission, we use the original TRI reporting form data, linking to RSEI disaggregated microdata by the document control number.

In the data cleansing process, we filter the data to remove industries and chemicals that were not consistent over the period of inquiry, and aggregate the data to the relevant Census geographies. The data then forms a data set for each year, with an observation for each block group, and just one measure, an aggregated toxicity level.

block	concentration	area
010010201001	627.3050138	6.520168
010010201002	499.6297799	8.48669
010010202001	578.8311689	3.137173
010010202002	756.3733114	1.962949
010010203001	637.7356488	5.907125
…	…	…

It’s important to keep in mind throughout this discussion that the ‘concentration’ estimate must be interpreted in the context of the consistency adjustments. The pollution estimates are comparable across time and location, but only in the context of continuous EPA regulation, and can not be interpreted independently of one another.