QUALITY DATA

The Go Code Colorado data team would like to tell you about quality public data. On this page, you’ll find information about how public data is different from private data and slightly different from open data, and also the key factors of quality data and standards for file naming.

PUBLIC DATA

Aside from being free to use, public data differs from private data in two major ways. The first is that public data does not include any personally identifiable information (PII), whereas private data sometimes does. The second is that public data is always secondary data: it is collected by the government for the purposes of government operations (where it is primary data), and it becomes secondary data when data consumers re-purpose it. This means that, quite often, data consumers are limited in their capacity to request changes to the data to meet their needs. In some scenarios, there is an open feedback loop (much like the one created by Go Code Colorado!) in which data consumers can request enhancements to the formatting or content of the data.

There is a lot of variety in the way that different government agencies manage their data. There are differences from state to state and city to city, and within states there is a wide variety of organizational structures. In some cases, the state will aggregate some of the local data to produce a data product designed to measure patterns across the municipal jurisdictions that collect the data. Quite often for these aggregate datasets, the data at the local level retains a greater level of detail. To determine whether a dataset is managed at the state or local level, think about your own experience as a citizen for clues about how data is collected about us. An easy example is where you go to renew your driver’s license versus where you pay your property taxes.

Public Data, Open Data and Private Data

The main difference between public data and open data is that public data is derived from the government, while open data can have a variety of sources. Thus, there are three distinct categories of available data from the perspective of a data consumer using data to create a commercially available product: Public, Open and Private. These categories are determined by their types of licenses and allowable use cases.

Open Data is created for the people by the people, meaning that a collaborative organization builds the data through a group of volunteers responsible for housing the data and carrying the burden of gathering, collating, cleaning and otherwise curating the data to make it available. Open Data is essentially crowdsourced data, and like all crowdsourced data, it has issues with versioning and currency. Examples in this category are OpenStreetMap, OpenAddresses.io, OpenAerialMap and Mapillary (providing Open Street View). All of these represent open data that is created by public contribution with limited restrictions on the data’s use (essentially, share any improvements made to the data back to the community), regardless of whether the user is producing a free or for-sale product.

Public data could also be considered “open government data”, in the sense that it can also be used freely without constraint in a commercial product because of its Public Domain use license. There are government datasets that are not public, and thus the term “public data” is formally designated to refer to “open government data”. In this way, public data is more “open” than open data, because the Public Use License has the fewest restrictions on use.

Private datasets are expected to be the cleanest, most reliable (accurate, kept up to date, etc.), and best documented datasets available for data consumers. This of course comes at the price of a restricted use agreement and the associated monetary cost of use.

QUALITY DATA

There are three key factors to quality data:

  1. Is the data accurate?
  2. Is it up-to-date?
  3. Can it be combined easily with other data for analysis?

Use the tag word “gocodecolorado” on the Colorado Information Marketplace (CIM) to find datasets curated for the Go Code Colorado competition.

Data Curation is the process of transforming data from its original build and native management into a series of datasets that are easily discoverable by and formatted to the needs of the data consumer.

As a reminder, all datasets on CIM can be accessed through the portal’s application programming interface (API).
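
As a rough sketch of what such an API request could look like: CIM lives at data.colorado.gov and is updated through Socrata tooling, and Socrata portals generally expose each dataset at a /resource/<dataset id>.json endpoint that accepts parameters such as $limit and $where. The dataset identifier "abcd-1234" below is a made-up placeholder and the helper name is hypothetical; the real identifier would come from the dataset's own page.

```python
from urllib.parse import urlencode

CIM_DOMAIN = "data.colorado.gov"  # CIM host named in the dataset metadata


def build_resource_url(dataset_id, limit=100, where=None):
    """Build a SODA-style JSON query URL for one CIM dataset.

    dataset_id is the portal's identifier for the dataset; the value used
    in the example below is a placeholder, not a real dataset.
    """
    params = {"$limit": limit}
    if where:
        params["$where"] = where  # optional row filter
    return f"https://{CIM_DOMAIN}/resource/{dataset_id}.json?{urlencode(params)}"


# Placeholder identifier for illustration only.
url = build_resource_url("abcd-1234", limit=5)
print(url)
```

The function only builds the URL, so it can be checked without network access; the resulting address could then be fetched with any HTTP client.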

—Indications Data Has Been Curated—

The indications below are listed in order of importance to the consumer, though it could certainly be debated whether “Machine Readable” or “Fresh” is of greater importance.

Machine Readable

Machine Readable means that the data can easily be loaded into any analytical or statistical processing system in order to look for patterns and trends in the data. Other use cases include building a product or tool that uses frequently changing data to power a data accessibility application, or that otherwise combines available data into a product of value. There are countless reasons why government agencies do not give public access to production databases, and thus the process of making public data machine readable involves architecting data migration pathways and extract, transform, load (ETL) processes.

Look for indications that distinct effort has gone into ensuring the data is presented in the appropriate size and scale. This means that tables are built to present a meaningful collection of fields, so as to minimize the number of different API calls required to convey a story or produce an app with the data. Different tables should be selected and combined to produce a final product based on the breakdown of the database’s internal themes and maximized for interpretation by the user. The size of each resulting table is determined in the publishing process by a data architect who knows the right way to balance the breakdown of tables between groups that encourage theme discoverability and groups that are so segmented that they become inefficient.

The complexity of a database (schema) determines the number of tables that are produced and published for public consumption. In some cases, information is duplicated on different tables to provide the necessary context for that table. A good example of this is two datasets that each have a County FIPS Code field and also have the associated County Name (as opposed to having the name in a lookup table). For datasets exported from a single source, a key field (two associated UniqueID fields in two related tables) will allow them to be joined together. In other cases, similar datasets have columns with common attributes that can be finessed to create a unique ID for each record, so that these tables can then also be joined.
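
The two joining approaches described above can be sketched in plain Python. This is a minimal illustration, not how CIM itself joins tables: the field names, the function names and every row value are invented for the example, and a composite ID is only as unique as the combination of columns it is built from.

```python
def join_on_key(left, right, key):
    """Inner-join two lists of dict rows on a shared key field."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def add_composite_id(rows, fields, id_field="record_id"):
    """Derive an ID from several common columns when no key field exists."""
    return [
        {**row, id_field: "|".join(str(row[f]).strip().lower() for f in fields)}
        for row in rows
    ]


# Illustrative rows only; the counts are made up.
airports = [
    {"county_fips": "08031", "county_name": "Denver", "airports": 1},
    {"county_fips": "08013", "county_name": "Boulder", "airports": 2},
]
businesses = [
    {"county_fips": "08031", "county_name": "Denver", "businesses": 10},
]

# Case 1: both tables carry the same key field.
joined = join_on_key(airports, businesses, "county_fips")
print(joined)

# Case 2: no key field, so build one from shared attributes.
clinics = [{"name": " Main St Clinic", "city": "Denver"}]
tagged = add_composite_id(clinics, ["name", "city"])
print(tagged[0]["record_id"])
```

Normalizing case and whitespace before building the composite ID helps records from different sources line up, but near-duplicates with different spellings would still need manual review.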

The Related Datasets section of the metadata has other datasets that could be joined (come from the same source) or have a similar theme (come from a different source at the same agency, or a similar source at different agencies). All of the CDLE datasets provide a good example, as the two Occupational Projections datasets are related. Additionally, the Labor Market Information system (LMI) of the Colorado Department of Labor and Employment (CDLE) is a large database of more than 70 tables. The portion of the dataset that is public has been through the ETL (extract, transform and load) process from the LMI database and loaded onto the CIM as five separate tables:

Fresh

What does it mean to have Fresh data?

Sometimes it means maintaining a temporal record, which requires publishing new data instead of updating existing datasets. Other times it means maintaining a connection that keeps the data true to the source database on a daily basis. It turns out there are many different ways that government agencies can (and do!) manage their data.

Thus, “Fresh Data” is defined by the distance in time (or lack thereof) between the “last updated date” and the “expected update date”. Use the statistics page for each dataset to see how many times the dataset has been updated.
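
One way to put a number on this definition is to count the days between the two dates. The sketch below assumes the “expected update date” is the date by which the next update should have happened, so a gap of zero means the dataset is on schedule; the function name and the sample dates are illustrative only.

```python
from datetime import date


def freshness_gap_days(last_updated, expected_update):
    """Days by which the last update trails the expected update date.

    Returns 0 when the dataset was updated on or after the expected date.
    """
    return max((expected_update - last_updated).days, 0)


# Sample dates for illustration only.
gap = freshness_gap_days(date(2017, 1, 1), date(2017, 1, 31))
print(gap)
```

A smaller gap suggests fresher data; how large a gap is acceptable depends on how often the source itself changes.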

Static vs Dynamic Data

Static data is data that is not updated. Static data often is a snapshot of a particular time period, or could be a report/analysis.

Dynamic data is subject to updates and changes. Dynamic datasets are often updated on a scheduled basis to keep them current.

Why do some datasets have a date in the title and some don’t?

Datasets on CIM sometimes have a date in their title. This is most commonly because the dataset only contains data for the date specified in the title. For example, traffic datasets on CIM are given by the data provider as annual datasets. These datasets are static, by-year data. A 2016 dataset is not updated with 2017 data; instead, a new 2017-only dataset is created.

Last Update Date(s)

The “Updated Date” at the top right (and top left of the “About This Dataset” section) of every dataset page is a derived field that changes depending on what was last updated. Quite a few dates are captured in the metadata. Here is an overview:

Data Last Updated – When an update or replace was run for the data of the dataset

Metadata Last Updated – When a change to the title, description, or various other metadata pieces was made

Date Created – When the dataset was published on data.colorado.gov

Date of initial Dataset Creation – When the dataset was created by the data curator/provider.

Expected Update Frequency

In the “About” section of a dataset on CIM, or if you click on “Show More” on the “Primer” page, additional metadata information will be displayed. There are a few important fields under “Data Updates” and “Data Quality” – each with at least one example provided.

Expected Update Frequency – The general frequency – “annual, monthly, quarterly, daily”

Update Schedule – The expected update schedule for this dataset – “Daily to CIM”

Update Method – The update method – “Automated by Socrata Scheduler”

Update Type – Distinguishes between manual and automatic updates – “Automated – Daily”

Source Update Schedule – The expected update schedule at the data source – “monthly, quarterly”
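
The Expected Update Frequency value can be combined with the Data Last Updated date to flag datasets that look overdue. The sketch below is one possible approach, not a CIM feature: the day counts assigned to each frequency label are rough assumptions, and the function name is hypothetical.

```python
from datetime import date, timedelta

# Approximate interval per frequency label; these day counts are assumptions.
FREQUENCY_DAYS = {"daily": 1, "monthly": 30, "quarterly": 91, "annual": 365}


def is_overdue(data_last_updated, frequency, today):
    """True when more than one expected interval has passed since the last update."""
    interval = timedelta(days=FREQUENCY_DAYS[frequency.lower()])
    return today > data_last_updated + interval


# Sample dates for illustration only.
print(is_overdue(date(2017, 1, 1), "Monthly", date(2017, 3, 1)))
print(is_overdue(date(2017, 1, 1), "annual", date(2017, 3, 1)))
```

A frequency label outside the four listed would raise a KeyError here, which is a deliberate prompt to decide how that label should be treated.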

Metadata and Description are Clear and Succinct

All metadata has been created by interviewing the data providers and data stewards in order to characterize the data as thoroughly as possible in relation to the needs of the data user, and is presented on this website in a way that is designed to enhance the creative process of designing an entry for the Go Code Colorado competition. This summary largely under-represents the level of effort that goes into the discovery process with the data provider to better understand how the data is made and currently maintained, the changes the data has gone through over the years, and other textural components that can help ensure the secondary use of the data is appropriate to the application it is being used for.

STANDARDS FOR FILE NAMING FORMAT

Go Code Colorado datasets employ a standard naming convention, in addition to a standard format for the “brief description” that immediately follows the dataset title in the CIM catalog.

Title formatting for a good first impression:

[3 word desc] + [data type] + [geography]

  1. Airport Locations for Colorado
  2. Business Entities for Colorado
  3. Bike Lane Routes for Denver
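
Reading the three examples above, the pattern appears to be a short description of up to three words, then the data type, then “for” and the geography. The helper below is a sketch of that reading; the function name and the three-word check are assumptions drawn from the examples rather than a published rule.

```python
def build_title(description, data_type, geography):
    """Assemble a dataset title as '<description> <data type> for <geography>'."""
    if len(description.split()) > 3:
        raise ValueError("description should be three words or fewer")
    return f"{description} {data_type} for {geography}"


print(build_title("Bike Lane", "Routes", "Denver"))
print(build_title("Airport", "Locations", "Colorado"))
```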

Description formatting for Improved Catalog Browsing

  1. Maximum 30 words
  2. The first 15 words show in the CIM catalog
    1. What’s in it?
    2. Who is the data provider?
    3. What is the time frame?
    4. How is it maintained?
    5. How was it created?
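
The word limits above are easy to check automatically. The following sketch counts whitespace-separated words, which is an assumption about how the catalog counts them, and returns the portion that would appear in the catalog listing; the function name and the sample text are invented for illustration.

```python
MAX_WORDS = 30      # maximum description length
PREVIEW_WORDS = 15  # words visible in the catalog listing


def check_description(text):
    """Report word count, limit compliance and the catalog preview."""
    words = text.split()
    return {
        "word_count": len(words),
        "within_limit": len(words) <= MAX_WORDS,
        "catalog_preview": " ".join(words[:PREVIEW_WORDS]),
    }


# Made-up description for illustration only.
sample = (
    "Locations of public airports in Colorado, provided by the state "
    "and refreshed annually from the source database."
)
report = check_description(sample)
print(report["word_count"], report["within_limit"])
print(report["catalog_preview"])
```

Because only the first 15 words are visible while browsing, it helps to put what the dataset contains and who provides it at the start of the description.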

GOOD GOVERNANCE

Best practices for open data programs include:

  • Thorough documentation on the technical data publishing process.
  • Published open data strategic plan and progress toward publishing goals.
  • Published documentation on the organizational steps in the data provider process.
  • Good documentation on what the metadata is and why it’s valuable.
  • Meaningful metadata.
  • Metrics and dashboards showing various data update schedules.
  • Use of standards and naming conventions.
  • Quality keywords and enhanced searching.
  • Successful apps built off the data, e.g. SF, Chicago.
  • Run interesting programs around the data, e.g. NYC.
  • Control over catalog populating and ensuring authoritative sources.
  • Smart regulation of public publishing and crowd sourced data.

Collectively these indicators can all be used to gauge the quality of the data.