The Go Code Colorado data team would like to tell you about quality public data. On this page, you’ll find information about how public data is different from private data, the key factors of quality data and standards for file naming.
Aside from being free to use, public data differs from private data in two major ways. First, public data does not include personally identifiable information, whereas private data sometimes does. Second, public data is always secondary data: the government collects it for the purposes of government operations (primary data), and it becomes secondary data when data consumers re-purpose it. This means that data consumers are often limited in their capacity to request changes to the data that meet their needs. In some scenarios, there is an open feedback loop (much like the one created by Go Code Colorado!) in which data consumers can request enhancements to the formatting or content of the data. The main difference between public data and open data is that public data comes from the government, while open data can have a variety of sources.
There is a lot of variety in how different government agencies manage their data. There are differences from state to state and city to city, and within states there is a wide variety of organizational structures. That said, there are some indicators we can look to as citizens that give us a clue about how data is collected about us. Think about where you go to renew your driver’s license versus where you pay your property taxes. The graphic below shows examples to help you see the pattern. In some cases, the state will aggregate local data to produce a data product designed to measure patterns across the municipal jurisdictions that collect the data. Quite often, for these aggregate datasets, the data at the local level retains a greater level of detail.
Public Data, Open Data and Private Data
From the perspective of a data consumer using data to create a commercially available product, there are three distinct categories of available data. These categories determine the types of licenses and allowable use cases for each. Open data is created for the people, by the people: a collaborative organization builds the data through a group of volunteers who shoulder the burden of gathering, collating, cleaning and otherwise curating the data to make it available. Open data is essentially crowd-sourced data, and like all crowd-sourced data, it has issues with versioning and currency. Examples in this category are OpenStreetMap and Addresses.io — both represent open data created by public contribution, with minimal restrictions on the data’s use regardless of whether the user is producing a free or for-sale product. Public data could also be considered open government data, in the sense that it too can be used freely, without constraint, in a commercial product. There are government datasets that are not public, so the term public data formally refers to open government data. In some ways public data is more open than open data, because the public-use license carries the fewest restrictions on use. Private datasets are usually the cleanest, most reliable (accurate, kept up to date, etc.) and best documented datasets available to data consumers. This, of course, comes at the price of a restricted-use agreement and/or money.
There are three key factors to quality data:
- Is the data accurate?
- Is it up-to-date?
- Can it be combined easily with other data for analysis?
The following content explains how the tag word “gocode” on the Colorado Information Marketplace (CIM) means quality.
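CIM runs on the Socrata platform, whose Discovery API can search a catalog by tag. As a hedged sketch (the Discovery endpoint and the `data.colorado.gov` search context are assumptions based on how Socrata portals are commonly queried, not details stated in this page), a catalog-search URL for the “gocode” tag might be built like this:

```python
from urllib.parse import urlencode

def catalog_search_url(tag, domain="data.colorado.gov", limit=20):
    """Build a Socrata Discovery API URL that lists datasets carrying a tag.

    The endpoint and parameter names follow Socrata's public Discovery API;
    the domain defaults to CIM's assumed hostname.
    """
    base = "https://api.us.socrata.com/api/catalog/v1"
    params = {"tags": tag, "search_context": domain, "limit": limit}
    return f"{base}?{urlencode(params)}"

url = catalog_search_url("gocode")
```

Fetching that URL would return JSON metadata for every matching dataset, which is one quick way to pull the curated Go Code Colorado collection into a script.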
Data curation is the process of transforming data from its original build and native management into a series of datasets that are formatted to the needs of the data consumer.
As a reminder, all datasets on CIM can be accessed through the portal’s application programming interface (API).
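For datasets on a Socrata portal, the API mentioned above is the SODA API, where each dataset is addressed by a short resource identifier. A minimal sketch of building such a request follows; the dataset id `abcd-1234` and the `county` column are placeholders for illustration, not real CIM resources:

```python
from urllib.parse import urlencode

def soda_query_url(dataset_id, domain="data.colorado.gov", **params):
    """Build a SODA API URL for a dataset.

    Keyword arguments map to $-prefixed SoQL parameters
    (e.g. limit=100 becomes $limit=100).
    """
    query = urlencode({f"${k}": v for k, v in params.items()})
    return f"https://{domain}/resource/{dataset_id}.json?{query}"

# "abcd-1234" is a hypothetical dataset identifier used for illustration.
url = soda_query_url("abcd-1234", limit=100, where="county = 'Denver'")
```

A GET request to a URL like this returns the matching rows as JSON, ready to load into an analysis tool or app.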
Indications Data Has Been Curated
Best practices for open data programs include:
- Thorough documentation on the technical data publishing process.
- Published open data strategic plan and progress toward publishing goals.
- Published documentation on the organizational steps in the data provider process.
- Good documentation on what the metadata is and why it’s valuable.
- Meaningful metadata.
- Metrics and dashboards showing various data update schedules.
- Use of standards and naming conventions.
- Quality keywords and enhanced searching.
- Successful apps built off the data, e.g., NYC, SF, Chicago.
- Control over catalog populating and ensuring authoritative sources.
- Smart regulation of public publishing and crowd sourced data.
In addition to these administrative and applied signals that an open data portal is populated with data of value, you can also look at the formatting of the data itself to get a feel for how much cleaning and curation has gone into the publishing.
Look for indications that distinct effort has gone into presenting the data at the appropriate size and scale. Tables should be built to present a meaningful collection of fields, minimizing the number of API calls required to tell a story or build an app with the data. A database should be broken down along its internal themes into tables that can be selected and combined into a final product maximized for interpretation by the user. The right size is determined in the publishing process by a data architect who knows how to balance the breakdown of tables between groupings that encourage theme discoverability and groupings so segmented that they become inefficient.
The complexity of a database determines the number of tables that are produced and published for public consumption. In some cases, information is duplicated across tables to provide the necessary context for each table. A good example is two datasets that each have a field for the County FIPS Code; both would carry this column in addition to a descriptive field such as County Name. In the best scenario, similar datasets have columns with common attributes or a shared unique ID for each record. These tables can be joined.
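The join described above can be sketched in a few lines. This is a minimal illustration, assuming each dataset arrives as a list of record dictionaries sharing a `county_fips` column; the field names and figures are hypothetical, chosen only to show the mechanics:

```python
def join_on_fips(left, right, key="county_fips"):
    """Inner-join two lists of record dicts on a shared key column."""
    index = {row[key]: row for row in right}  # look up right-hand rows by key
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# Illustrative records; the income figure is made up for the example.
population = [{"county_fips": "08031", "county_name": "Denver", "population": 715522}]
income = [{"county_fips": "08031", "median_income": 78177}]

joined = join_on_fips(population, income)
```

Because both tables carry the FIPS code, the descriptive County Name travels along with the join, which is exactly why curated datasets duplicate that context column.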
The Labor Market Information system (LMI) of the Colorado Department of Labor and Employment (CDLE) is a large database of more than 70 tables. The portion of the dataset that is public has been through the extract, transform and load (ETL) process from the LMI database and loaded onto the CIM as five separate tables:
- Employment by Industry
- Income Data by County
- Employment and Unemployment Estimates
- Occupational Employment Statistics
- Current Employment Statistics
Metadata and Description are Clear and Succinct
All metadata has been created by interviewing the data providers and data stewards in order to characterize the data as thoroughly as possible in relation to the needs of the data user. It is presented on this website in a way designed to enhance the creative process of designing an entry for the Go Code Colorado competition.
STANDARDS FOR FILE NAMING FORMAT
Go Code Colorado datasets employ a standard naming convention, in addition to a standard format for the “brief description” that immediately follows the dataset title in the CIM catalog.
Title formatting for a good first impression:
[3 word desc] + [data type] + [geography]
- Airport Locations for Colorado
- Business Entities for Colorado
- Bike Lane Routes for Denver
Description formatting for improved catalog browsing:
- Maximum 30 words
- First 15 words show in the CIM catalog
- What’s in it?
- Who is the data provider?
- What is the time frame?
- How is it maintained?
- How was it created?