Defining Data for Visualization

Best Practices

About This Document

When creating a data visualization in the CDC Open Visualization Editor (COVE), you must answer several questions about the source dataset. This process is called data definition. By defining the data, you tell the tool how to use the source data in the visualization.

This document is intended primarily for COVE developers, but data owners, managers, and analysts may also benefit.

Here are a few notes to keep in mind:

  • COVE supports a wide variety of data visualizations, some of which require specific types of data in a specific format.  Except where noted, the term chart in this document refers to the commonly used visualizations based on the Cartesian coordinate system: bar, line, and combo.
  • If you are interested in the data definition for a specialized visualization type (such as the box-and-whiskers plot or the scatter plot), we suggest you visit the documentation specific to the visualization type. To find documentation by type, visit the index page for data presentation.
  • Examples of source data are provided as screenshots of Microsoft Excel files, but the principles apply whether the source data are in JSON or CSV format.
  • This document provides general, high-level guidance. The accompanying guided exercises help COVE developers see the specifics of how source data are defined in the tool and serve as an introduction to chart and map development in COVE.  [link to exercise index page to come]
  • The example visualizations are static images, but “live” interactive visualizations are presented with the exercises.  Links to sample source data are included in the exercise instructions.

Key Data Definition Questions

To fully define the data for a visualization, you must answer a few key questions on the “Import Data” and “Configure” tabs in COVE:

  • Source dataset:  For each visualization, you can upload a source file to COVE or provide a URL.  (With a URL source, you can specify that data are pulled automatically from the source URL with each refresh of the visualization.)
  • Source orientation and number of data series: In the source dataset, what is the structure of the data to be visualized — vertical or horizontal?  And are multiple data series involved?  The answers to these questions on the “Import Data” tab lay the foundation for the visualization you’re building. Based on your answers, you may have to answer one or more additional questions before moving on to the “Configure” tab.  The section “Getting Started: The Import Data Tab” explains these concepts and provides links to exercises that provide detailed guidance on the conditional questions.
  • Data Series:  For charts, you must select the source data for the data series you’re visualizing.  A chart can show one or more color-coded data series, which are represented in the chart legend (although you may want to hide the legend when there is only one data series). For maps, the tool doesn’t ask about data series; instead it asks you to specify the “data column.”
  • Dates/Categories: For charts, you must select the source data for the date/category axis.  Chart exercises 1 and 2 provide details on defining the data series and date/category axis, as well as other chart tips and guidelines.  [link to come]
  • Filter Columns:  For filterable data, you must select the source column for each filter control.  This is a critical step that can be overlooked because the visualization tool doesn’t generate an error message for this omission.  For more information, see “Setting Up Filter Controls” in this document.

In addition to the key specifications above, you and the content owner should also consider:

  • Tooltip Content:  When end users interact with a data point in a data map or chart, a tooltip displays the point value and other information.  Some information is included automatically, but you can add source data to tooltips to provide more context.
  • Data Table Contents:  As with tooltips, the tool automatically includes certain data in the supporting data table, but you can specify additional source data.  For more information on tooltips and the data table, see “Managing Content in Tooltips and the Data Table” in this document.

Getting Started: The Import Data Tab

The COVE user interface is organized in three tabs:  Choose Visualization Type, Import Data, and Configure.  This section provides guidance on the first two data definition questions presented on the Import Data tab. Keep in mind that answering these questions is just the start of setting up a data visualization.  Exercises 1 through 3 demonstrate the complete configuration of the example charts.  (Exercise links follow this section.) If you’re new to COVE or just want to have a better grasp of how the tool works, we suggest that you work the exercises in this section after reviewing this documentation.

Determining Source Data Orientation

When you specify the source data, the first question is about the orientation of the source data:  is it vertical or horizontal?  With maps, the orientation is typically vertical, as illustrated in the source data below.

Screenshot of map source data with values structured vertically

To answer the orientation question for bar, line, and combo charts, you must know which categories are to be presented on the date/category axis.  Let’s look at some examples.

Source Data Example 1 for Charts

In the first chart below, Age Groups are presented on the date/category axis, and because these categories are formatted horizontally in the source data on the left, the source dataset is defined as “horizontal.”  The second chart uses the exact same source data but presents the vertically formatted Sex categories along the date/category axis, so the source dataset is defined as “vertical.”

Source data in horizontal format and resulting chart

Excluding categories:  Note that the “All” value for Sex is not included in the charts above even though the Sex column in the source data has “All” values.  As you will see when you work through the chart exercises (links to come after the next source data example), COVE makes it easy to exclude data from the date/category axis and the legend.

Source Data Example 2 for Charts

Now let’s look at the same source data but in a different structure.  With this example, both the Sex categorization and Age Group categorization are structured vertically.  This means that we would select “Vertical” to build both versions of the chart.

Source data defined as “vertical” and resulting bar chart

About long vs. wide format:  The data structure illustrated above is sometimes referred to as long format.  Some COVE users find that a data file in a purely long format is generally easier to work with than a wide-format file because it is more flexible and makes it easier to specify filter controls, tooltip content, and similar settings.
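To make the two structures concrete, the following minimal Python sketch pivots a small long-format dataset into a wide one. It is an illustration only and is not part of COVE. The Rate values are invented sample data, and the helper name long_to_wide is hypothetical.

```python
# Hypothetical sample values; the Sex and Age Group columns follow the
# examples in this document, but the numbers are made up for illustration.
long_rows = [
    {"Sex": "Female", "Age Group": "18-44", "Rate": 10.0},
    {"Sex": "Female", "Age Group": "45-64", "Rate": 20.0},
    {"Sex": "Male", "Age Group": "18-44", "Rate": 12.0},
    {"Sex": "Male", "Age Group": "45-64", "Rate": 24.0},
]


def long_to_wide(rows, row_key, column_key, value_key):
    """Pivot long rows so each distinct column_key value becomes its own column."""
    wide = {}  # dicts keep insertion order, so row order is preserved
    for row in rows:
        record = wide.setdefault(row[row_key], {row_key: row[row_key]})
        record[row[column_key]] = row[value_key]
    return list(wide.values())


# One row per Sex, one column per Age Group (the "wide" structure).
wide_rows = long_to_wide(long_rows, "Sex", "Age Group", "Rate")
for record in wide_rows:
    print(record)
```

In the long version, every categorization (Sex, Age Group) has its own column, which is why it can serve as a filter, tooltip, or data table column without reshaping.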

Determining the Number of Data Series

Are there multiple series represented in your data?  That’s the second key question on the Import Data tab.

COVE treats the data sources for maps as single-series datasets regardless of the dataset format and content. This is primarily because data maps are inherently single-series visualizations (although filter controls can let users change which series is currently displayed). For maps, therefore, you can answer No to this question.

The datasets for line, bar, and combo charts can be single- or multi-series. As the question implies, your answer must reflect whether the source data contain multiple series, not how many series you plan to visualize in the chart. For these visualization types, the answer can be Yes or No.

The first two datasets below, shown in their entirety, clearly have only one data series for age groups. The third dataset has multiple series for both age groups and sex. With the third dataset, whether you intend to visualize only one data series or multiple, the answer to the question about multiple series would be Yes.

Screenshots of single-series and multi-series datasets for comparison

Chart-Building Exercises:  The supporting exercise series begins with three chart-building exercises to demonstrate configuration with different data formats.  These exercises provide a great introduction to the Import Data tab for charts as well as options for fine-tuning your chart behavior and presentation.  Go to Exercise 1.

Setting Up Filter Controls

With filter controls, you can provide end users with different views of the data in a single visualization. This feature is available for maps and all charts except single-data-point charts such as data bite, gauge, and waffle chart.  To take advantage of this feature, you must have a dataset formatted for filtering because of the way that filter controls are typically set up in COVE:  each filter control is associated with a single column of source data.

Example of Formatting That Limits Filtering

Consider the map data below.  The data for males and females are in separate columns. If a goal is to allow end users to filter the data by sex, the data must be reformatted.  (However, the same data file could be used to generate two data maps, one for males and one for females.)

Source data with values for males and females in separate columns

The Solution for Filtering

To present the data together in one map with a filter for sex, the solution is to reformat the data so that the categories for sex are contained in a single column:

Source data and map with Sex filter
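If the source file is prepared outside COVE, the reformatting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a COVE feature: the column names State, Year, Male, Female, and Rate, the sample values, and the helper name melt are all hypothetical.

```python
# Hypothetical wide-format map data: one column per sex.
wide_rows = [
    {"State": "Alabama", "Year": 2020, "Male": 5.1, "Female": 4.2},
    {"State": "Alaska", "Year": 2020, "Male": 6.3, "Female": 5.8},
]


def melt(rows, id_columns, value_columns, category_name, value_name):
    """Stack several value columns into one category column and one value column."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            new_row = {key: row[key] for key in id_columns}
            new_row[category_name] = column   # e.g. "Male" or "Female"
            new_row[value_name] = row[column]  # the value from that column
            long_rows.append(new_row)
    return long_rows


# After reshaping, the single Sex column can back a filter control.
long_rows = melt(wide_rows, ["State", "Year"], ["Male", "Female"], "Sex", "Rate")
for row in long_rows:
    print(row)
```

Each original row becomes two rows, one per sex, so the Sex categories now live in a single column.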

Handling Duplicate Data

It’s important to understand that, with the source data above, a filter control for the Sex column is not just desirable — it’s essentially required. In fact, filter controls are required for both Year and Sex to resolve duplicate data.  This is because a data map can show only one value for each state at a time.

Without these filter controls, COVE would still generate a map, but the values shown could be unreliable because more than one source row would exist for each state. This is one reason that COVE developers should become familiar with the content and format of the source data, so that they can resolve potential issues as early as possible.
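One way to check a source file for this problem before importing it is to count how many rows share the same state and filter values. The sketch below is a hypothetical pre-import check, not something COVE provides. The function name, column names, and sample values are assumptions made for illustration.

```python
from collections import Counter


def find_duplicate_locations(rows, location_column, filter_columns):
    """Return (location, filter values...) keys that match more than one row."""
    counts = Counter(
        (row[location_column],) + tuple(row[col] for col in filter_columns)
        for row in rows
    )
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical long-format map data with two years and two sexes per state.
rows = [
    {"State": "Alabama", "Year": 2020, "Sex": "Male", "Rate": 5.1},
    {"State": "Alabama", "Year": 2020, "Sex": "Female", "Rate": 4.2},
    {"State": "Alabama", "Year": 2021, "Sex": "Male", "Rate": 5.4},
    {"State": "Alabama", "Year": 2021, "Sex": "Female", "Rate": 4.5},
]

# With no filters, or with a Sex filter alone, several rows remain per state.
no_filters = find_duplicate_locations(rows, "State", [])
sex_only = find_duplicate_locations(rows, "State", ["Sex"])
# With filters on both Year and Sex, each state resolves to exactly one row.
both_filters = find_duplicate_locations(rows, "State", ["Year", "Sex"])

print(no_filters)    # [('Alabama',)]
print(sex_only)      # [('Alabama', 'Female'), ('Alabama', 'Male')]
print(both_filters)  # []
```

An empty result for the chosen set of filter columns suggests that those filters are sufficient to leave one value per state.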

Managing filter settings: COVE automatically pulls the options for each filter control from the source data.  For example, if a year of data were added to the source data above, the additional year would be added to the Year drop-down selection. When a filter value is added to the source data, the COVE developer may need to edit the filter control to ensure that the correct default selection is set. The developer can also choose the type of control:  tab, pill, or drop-down selector.
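Conceptually, the options for a filter control are the distinct values in its source column. The Python sketch below illustrates that idea only. It is not COVE's implementation, the ordering of options in COVE may differ, and the function name and sample rows are hypothetical.

```python
def filter_options(rows, column):
    """Distinct values of a column, in order of first appearance."""
    return list(dict.fromkeys(row[column] for row in rows))


# Hypothetical source rows before and after a new year of data arrives.
rows = [
    {"State": "Alabama", "Year": 2020, "Sex": "Male"},
    {"State": "Alabama", "Year": 2020, "Sex": "Female"},
    {"State": "Alabama", "Year": 2021, "Sex": "Male"},
    {"State": "Alabama", "Year": 2021, "Sex": "Female"},
]
options_before = filter_options(rows, "Year")

rows.append({"State": "Alabama", "Year": 2022, "Sex": "Male"})
options_after = filter_options(rows, "Year")

print(options_before)  # [2020, 2021]
print(options_after)   # [2020, 2021, 2022]
```

Because the list of options changes with the data, a default selection that was correct before the update (for example, the most recent year) is worth rechecking afterward.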

Exercise 4:  Quantitative Map Building and Setting Up Filter Controls.  This exercise demonstrates how to build a COVE visualization with filter controls.  It also serves as an introduction to quantitative map building in COVE.  Go to Exercise 4.

Managing Content in Tooltips and the Data Table

Most COVE visualization types support additional columns.  By “additional” we mean columns of data that are not visualized but are instead presented as supplemental information in tooltips (pop-ups), the supporting data table, or both. With long-formatted data, it’s easy to take advantage of this feature.

In the illustration below, the source file includes two pieces of information for each state:  Rate and Funding Status.  The developer has decided to visualize the numeric Rate while including the Funding Status as supplemental information in the pop-ups that display when a user interacts with the map. Funding Status can also be included in the supporting data table.

Funding status as additional column in map pop-up

About categorical maps:  The map in the illustration above is a numeric or quantitative map in that the color-coding is based on numeric data (Rate). With the same source data, you could also create a categorical map based on the Funding Status.  The color-coded categories in a map can be sequential (i.e., based on a scale or intensity range) or qualitative.  See example categorical maps.

Exercise 5: Configuring a Categorical Map.  Data definition for a categorical map is a bit different from the definition of a numeric / quantitative map. This exercise walks you through the key steps. It also demonstrates how to add data to tooltips and the data table. Go to Exercise 5.

Working with Confidence Intervals

With single-series bar, line, and combo charts, you can include one set of confidence intervals (CIs). At this time, only single-series charts can display CI values.  (With the forecast chart, you can include multiple confidence intervals over time.)

In the source dataset, each CI group must be formatted as two columns:  one for the lower-bound values and one for the upper-bound values.

Screenshot of source dataset with CIs and resulting bar chart
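Before importing a dataset with CIs, it can help to confirm that both bound columns are present and that each value falls between its bounds. The following Python sketch is a hypothetical pre-import check, not a COVE feature. The column names (Age Group, Rate, Lower CI, Upper CI) and the sample values are assumptions for illustration.

```python
def check_confidence_intervals(rows, value_column, lower_column, upper_column):
    """Return indexes of rows whose bounds are missing or do not bracket the value."""
    problems = []
    for index, row in enumerate(rows):
        lower = row.get(lower_column)
        value = row.get(value_column)
        upper = row.get(upper_column)
        if lower is None or value is None or upper is None:
            problems.append(index)  # a required cell is missing
        elif not (lower <= value <= upper):
            problems.append(index)  # the value lies outside its interval
    return problems


# Hypothetical single-series data: one lower-bound and one upper-bound column.
rows = [
    {"Age Group": "18-44", "Rate": 10.0, "Lower CI": 8.5, "Upper CI": 11.5},
    {"Age Group": "45-64", "Rate": 20.0, "Lower CI": 18.0, "Upper CI": 22.0},
    {"Age Group": "65+", "Rate": 30.0, "Lower CI": 31.0, "Upper CI": 34.0},
]

problem_rows = check_confidence_intervals(rows, "Rate", "Lower CI", "Upper CI")
print(problem_rows)  # [2]
```

The third sample row is flagged because its lower bound is greater than the value itself.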

Exercise 6: Working with Confidence Intervals.  In addition to demonstrating the configuration of confidence intervals, this exercise provides experience in working with single-series chart data. Go to Exercise 6.

Working with Multiple Metric Series

At times multiple metrics may be available in the same source dataset, for example:

  • Total Cases and Cases per 100K, by Year
  • Cases per 100K and Program Costs, by State

See the document Data Visualization: Presenting Multiple Metrics for a discussion of four scenarios involving multiple metric series from a single source dataset.
