Traffic fatality data sets for Austin, TX.
The traffic fatality reports published by APD are hosted on their website for 2 years. Past this point, they are archived and are no longer publicly accessible.
Our automated workflow is composed of the following 4 steps:
- Generate the raw data sets.
- Import/Merge the external data sets.
- Augment the data sets.
- Merge the results into the *-all-*.json data sets.
Generating the data sets this way ensures that they can always be recreated if a problem occurs.
Each step is detailed in a dedicated section below.
The data sets named fatalities-{year}-raw.json are generated directly from ScrAPD without any manual intervention. A data set is created for each year.
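A raw data set for a given year can be regenerated with a command along the following lines. This is a sketch that mirrors the invocation used in the full update command at the end of this document; the exact flags may differ between ScrAPD versions, and the output file name is just an example:

```bash
# Regenerate the 2019 raw data set directly from ScrAPD
# (same flags as the full update command shown at the end of this document).
scrapd -v --format json --from "Jan 1 2019" --to "Dec 31 2019" > fatalities-2019-raw.json
```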
External data sets are imported and merged into the raw data sets. Each of these external data sets has a dedicated documentation file in the docs folder. The external data sets currently in use are:
- Socrata
Here is an example showing how the data is imported:
```bash
TOPDIR=$(git rev-parse --show-toplevel)
for YEAR in {17..18}; do
  python "${TOPDIR}/tools/scrapd-importer-fatalities-socrata.py" \
    "${TOPDIR}/datasets/fatalities-20${YEAR}-raw.json" \
    "${TOPDIR}/external-datasets/socrata-apd-archives/socrata-apd-20${YEAR}.json" \
    > "${TOPDIR}/datasets/fatalities-20${YEAR}-augmented.json"
done
```

The data sets named fatalities-{year}-augmented.json are data sets that have been enhanced in order to improve the quality of the data.
```bash
# Generate empty data sets.
for f in fatalities-20{13..20}-augmented.json; do echo "[]" > "${f}"; done
```

Augment the data sets:

```bash
for f in fatalities-20{13..20}-augmented.json; do
  python "${TOPDIR}/tools/scrapd-augmenter-geocoding-geocensus.py" -i "${f}"
done
```

The data sets are augmented with ScrAPD augmenters that can be found in the tools folder. Each augmenter adds a specific type of information to the entries.
Currently the following augmenters are available:
- scrapd-augmenter-geocoding-geocensus
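For instance, a single data set can be augmented in place as shown below. This simply mirrors the loop from the previous snippet for one file; the year and path are illustrative:

```bash
# Run the geocensus geocoding augmenter in place on a single data set
# (same -i invocation as the loop above; the file name is an example).
python tools/scrapd-augmenter-geocoding-geocensus.py -i datasets/fatalities-2019-augmented.json
```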
Corrections can also be added manually. They can add extra fields or update values in existing fields, and they are applied last.
A correction must be made in a file named augmentation-manual-{year}.json. All the corrections for the same year MUST be grouped together in the same file. The order of the entries does not matter, but if the same entry appears several times, the last one (i.e. the lowest one in the file) supersedes all the others.
Example of a correction file:

```json
[
  {
    "19-0400694": {
      "Type": "Pedestrian",
      "Gender": "Female"
    },
    "19-0320079": {
      "Type": "Bicycle",
      "Gender": "Unknown"
    }
  }
]
```

Apply the changes for a specific year:
```bash
python tools/scrapd-merger.py -i datasets/fatalities-2020-augmented.json augmentations/2020/augmentation-manual-2020.json
```

The data sets whose year is "all" are a combination of all the data sets of the same category. They are generated using the following jq commands:
```bash
jq -s add fatalities-20{17..20}-raw.json > fatalities-all-raw.json
jq -s add fatalities-20{17..20}-augmented.json > fatalities-all-augmented.json
```

A full update happens when ScrAPD gets updated with changes that drastically improve the quality of the data which was retrieved.
ScrAPD versions which required a full update:
- 1.5.0
- 1.5.1
As a result, all the data sets were updated with the following command:
```bash
for i in {17..20}; do
  python tools/scrapd-merger.py -i "fatalities-20${i}-raw.json" \
    <(scrapd -v --format json --from "Jan 1 20${i}" --to "Dec 31 20${i}")
done
```