Calculating public burden using OIRA data -- Part Two
An experiment in using open data to make government better
Published on: Feb 13, 2017

Yesterday, I published an article about using open government data to hunt for paper-based information requests by the government. Based on the data, it looked like a lot of hours are still being spent filling out paper-based forms. As I noted, though, I ran out of time to do careful analysis. So, today, let's dig deeper.

First, we'll plot a histogram to look at the distribution of burden across requests. To do so, we'll use pandas to examine the results data, and specifically its hist plotting method.

In [1]:
# Set up the graphing environment. Because I'm using jupyter notebooks, first I need to tell
# it to show the graphs inline. I also use the `ggplot` style, because it's less hideous. 
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
In [2]:
import pandas as pd
data = pd.read_json('results.json')
data.burden.plot.hist()
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fca4799d358>

Wait. Hold on right there. That's not what you'd expect to see. That looks like there's an outlier. Let's see what that might be... To do so, we look for the top ten burdens.

In [3]:
data[["burden", "title"]].sort_values('burden', ascending=False).head(10)
Out[3]:
burden title
720 2997500000 U. S. Business Income Tax Return
729 48731780 IRA Contribution Information
719 34115874 Form 1099-DIV--Dividends and Distributions
718 24951529 Return of Organization Exempt From Income Tax ...
248 20036012 2017-2018 Free Application for Federal Student...
509 13500230 National Fire Incident Reporting System (NFIRS...
717 10880812 Employer's Annual Tax Return for Agricultural ...
497 9902378 Arrival and Departure Record
449 7736084 Physician Quality Reporting System (PQRS) (CMS...
713 7041290 Customer Due Diligence Requirements for Financ...

Oh dear. Looks like we've got a pretty obvious mistake here: "U.S. Business Income Tax Return" can definitely be filed electronically. The same goes for the other entries on the list. And that one outlier alone accounts for roughly 3 billion of the 3.3 billion hours. Oof. So what gives?
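As a quick sanity check, the outlier's share of the total can be confirmed with back-of-the-envelope arithmetic (using the figure from the table above and the roughly 3.3 billion-hour total, not recomputed from results.json):

```python
# Share of total burden attributable to the single largest ICR.
# Both figures are the ones quoted in the text, treated as approximate.
top_burden = 2_997_500_000     # "U. S. Business Income Tax Return"
total_burden = 3_300_000_000   # approximate total burden from Part One
share = top_burden / total_burden
print("{:.0%}".format(share))  # → 91%
```

So a single information collection request dominates the entire total.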

Well, it turns out that OIRA reports burden at the level of the information collection request: if any one of the forms that make up a request is not available electronically, the burden for all of the forms in that request gets lumped together. And unfortunately, there doesn't seem to be an obvious way to back the other forms out. So that figure isn't very useful on its own.

Let's see what the total burden is if you remove the top 20% of information collection requests.

In [4]:
"{:,} hours".format(data.burden.sum() - data.sort_values('burden', ascending=False).head(220).burden.sum())
Out[4]:
'5,589,316 hours'

So, that feels a lot more sane, and a lot less exciting. There are only 5,589,316 hours of public burden for everything but the top 20% of information collection requests.
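For a rough sense of scale, that works out to only a few thousand hours per request on average. (The 890 count below is my assumption: roughly 1,110 ICRs in the dataset, minus the top 220.)

```python
# Back-of-the-envelope: average burden per ICR outside the top 20%.
# The 890 count is an assumption, not read from results.json.
remaining_hours = 5_589_316
remaining_icrs = 890
print("{:,.0f} hours per ICR".format(remaining_hours / remaining_icrs))  # → 6,280 hours per ICR
```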

In the end, this is a great lesson in how a data schema can lead to incorrect conclusions.

Still, we have some good data near the bottom of the chart.

In [5]:
data.sort_values('burden').head(890).burden.plot.hist(bins=30)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fca467f8080>

In other words, there are a lot of information requests that each account for only a couple hundred hours of public burden. Not a surprising result, but perhaps even more useful in the end: it suggests there are about 200 forms in the middle that account for much of the remaining burden hours. That seems like a good place to start.
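One way to make "a good place to start" concrete is a cumulative-share calculation: sort the requests by burden and see how many it takes to cover most of the total. A minimal sketch on a toy DataFrame (the burden values here are invented, not taken from results.json):

```python
import pandas as pd

# Toy stand-in for the burden column: two huge aggregated ICRs,
# a middle tier, and a long tail of small requests (all values invented).
toy = pd.DataFrame({'burden': [3_000_000, 900_000] + [50_000] * 20 + [500] * 100})

# Rank by burden and compute each ICR's cumulative share of the total.
ranked = toy.sort_values('burden', ascending=False).reset_index(drop=True)
ranked['cum_share'] = ranked.burden.cumsum() / ranked.burden.sum()

# Number of ICRs needed to cover 95% of total burden.
n_95 = int((ranked.cum_share < 0.95).sum()) + 1
print(n_95)  # → 19
```

Running the same cumsum on the real data.burden column would pin down exactly how many mid-sized requests carry the remaining hours.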