SANDS—Semi-Automated Non-response Detection for Surveys

Purpose

  • When analyzing data, researchers sometimes need to filter out responses that may not or do not answer the question.
  • The SANDS—Semi-Automated Non-response Detection for Surveys—model helps researchers process survey response data by detecting these non-responses.
  • Learn more about how SANDS helps researchers review large amounts of survey response data.

About SANDS

SANDS—Semi-Automated Non-response Detection for Surveys—is an open-access AI tool developed by the National Center for Health Statistics (NCHS). It helps researchers and survey administrators detect responses that may not or do not answer the question (non-responses) in open-ended survey text. The model helps human reviewers quickly triage a large volume of text for manual review.

To use SANDS, follow the model card or the detailed instructions in the Getting Started section.

Before applying the model to real data, review the Uses, Misuses and out-of-scope use, and Risk, limitations, and biases sections below.

Model details

This model is a fine-tuned version of the supervised SimCSE BERT base uncased model. It was introduced at the American Association of Public Opinion Research (AAPOR) 2022 Annual Conference.

The model is uncased, so it treats important, Important, and ImPoRtAnT the same.

  • Developed by: CDC's National Center for Health Statistics
  • Model Type: Text Classification
  • Language(s): English
  • License: Apache-2.0

Parent Model: For more details about SimCSE, we encourage users to visit the SimCSE GitHub repository and the base model on Hugging Face. The model was fine-tuned on 3,000 labeled, open-ended responses from the NCHS Research and Development Survey (RANDS) during COVID-19 Rounds 1 and 2. The base SimCSE BERT model was trained on BookCorpus and English Wikipedia.

Training procedure

  • Learning rate: 5e-5
  • Batch size: 16
  • Number of training epochs: 4
  • Base Model pooling dimension: 768
  • Number of labels: 5
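
These hyperparameters map onto a standard Hugging Face fine-tuning setup. The sketch below is illustrative only: the dataset file, column names, and label mapping are assumptions, not the actual NCHS training script.

# Illustrative fine-tuning sketch; not the actual NCHS training code.
# The dataset file and its "text"/"label" columns are assumptions.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Supervised SimCSE BERT base uncased checkpoint (the parent model on Hugging Face)
base_model = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=5)

# Hypothetical labeled data: one open-ended response per row, integer label 0-4
df = pd.read_csv("labeled_responses.csv")
dataset = Dataset.from_pandas(df).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="sands-finetune",
    learning_rate=5e-5,              # hyperparameters listed above
    per_device_train_batch_size=16,
    num_train_epochs=4,
)

Trainer(model=model, args=args, train_dataset=dataset).train()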

Getting started

To use SANDS, first install Python. Then use a package manager to install the pandas, torch, and transformers packages:

> pip install transformers pandas torch

Once you’ve installed the packages, the following code illustrates how to download the model and score a fixed set of responses:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pandas as pd

# Load the model
model_location = "NCHS/SANDS"
model = AutoModelForSequenceClassification.from_pretrained(model_location)
tokenizer = AutoTokenizer.from_pretrained(model_location)

# Create example responses to test
responses = [
    "sdfsdfa",
    "idkkkkk",
    "Because you asked",
    "I am a cucumber",
    "My job went remote and I needed to take care of my kids",
]

# Run the model and compute a score for each response
with torch.no_grad():
    tokens = tokenizer(responses, padding=True, truncation=True, return_tensors="pt")
    output = model(**tokens)
    scores = torch.softmax(output.logits, dim=1).numpy()

# Display the scores in a table
columns = ["Gibberish", "Uncertainty", "Refusal", "High-risk", "Valid"]
df = pd.DataFrame(scores, columns=columns, index=responses)
df.index.name = "Response"
print(df)

The code should output the following table:

Outputs of various text inputs to the SANDS model

Response                                                  Gibberish  Uncertainty  Refusal  High-risk  Valid
sdfsdfa                                                       0.998        0.000    0.000      0.000  0.000
idkkkkk                                                       0.002        0.995    0.001      0.001  0.001
Because you asked                                             0.001        0.001    0.976      0.006  0.014
I am a cucumber                                               0.001        0.001    0.002      0.797  0.178
My job went remote and I needed to take care of my kids      0.000        0.000    0.000      0.000  1.000

Uses

This model is intended to be used on survey responses for data cleaning. When analyzing data, researchers can use SANDS to filter out non-responses. For each response, the model returns a probability vector that sums to 1 across five categories (a filtering sketch follows this list):

  • Gibberish
  • Refusal
  • Uncertainty
  • High-risk
  • Valid
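
Continuing the Getting started example, one minimal way to use these scores for data cleaning is to keep responses whose highest-scoring category is Valid and set aside High-risk responses for manual review. The argmax rule here is an illustrative choice, not part of the model.

import numpy as np

# `scores` and `responses` come from the Getting started example above.
columns = ["Gibberish", "Uncertainty", "Refusal", "High-risk", "Valid"]
predicted = [columns[i] for i in np.argmax(scores, axis=1)]

# Keep responses the model scores as Valid; route High-risk responses to a human reviewer.
valid_responses = [r for r, label in zip(responses, predicted) if label == "Valid"]
needs_review = [r for r, label in zip(responses, predicted) if label == "High-risk"]

print("Valid:", valid_responses)
print("Needs manual review:", needs_review)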

Response types

  • Gibberish: Nonsensical response where the respondent entered text without regard for English syntax
    • Examples: "ksdhfkshgk" and "sadsadsadsadsadsadsad"
  • Refusal: Responses in valid English that either directly refuse to answer the question asked or provide no contextual relationship to the question asked
    • Examples: "Because" or "Meow"
  • Uncertainty: Responses indicating that the person does not understand the question, does not know the answer to the question, or does not know how to respond to the question
    • Examples: "I dont know" or "unsure what you are asking"
  • High-Risk: Responses that may be valid depending on the context and content of the question—these responses require human expertise to determine if they are valid
    • Examples: "Necessity" or "I am a cucumber"
  • Valid: Responses that answer the question at hand and provide an insight to the respondents' thoughts on the subject of the question
    • Examples: "COVID began for me when my children's school went online and I needed to stay home to watch them" or "staying home, avoiding crowds, still wear masks"

Misuses and out-of-scope use

The model has been trained specifically to identify survey non-response when the survey respondent has given an open-ended response, but their answer does not address the question or provide meaningful insight. Examples of these types of responses include "meow," "ksdhfkshgk," or "idk."

The model was fine-tuned on 3,000 labeled, open-ended responses to web probes on questions relating to the COVID-19 pandemic. These responses were gathered from NCHS's Research and Development Survey.

Web probes are questions designed to draw out information about how respondents understand, think about, and respond to the questions being evaluated. They differ from traditional open-ended survey questions. The context of our labeled responses was limited to COVID-19 and health topics, so performance may drop for responses outside this scope.

The model was trained on responses to both web- and phone-based open-ended probes. It may be less effective on more traditional open-ended survey questions or on responses collected through other mediums.

This model does not assess the factual accuracy of responses, nor does it filter out responses that reflect demographic biases. It was not trained to verify facts about people or events, so using the model for such classification is out of scope.

We did not train the model to recognize non-response in any language other than English. Responses in languages other than English are out of scope, and the model will perform poorly on them. Any correct classifications of non-English text are a result of the base SimCSE or BERT models.

Risk, limitations, and biases

To investigate if there were differences between demographic groups on sensitivity and specificity, we conducted two-tailed Z-tests across demographic groups. These included:

  • Education (some college or less and bachelor’s or more)
  • Sex (male or female)
  • Mode (computer or telephone)
  • Race and ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, and all others who are non-Hispanic)
  • Age (18-29, 30-44, 45-59, and 60+)

There were 4,813 responses to 3 probes. To control for family-wise error rate, we applied the Bonferroni correction to the alpha level (α < 0.00167).

There were statistically significant differences in specificity between education levels, mode, and White and Black respondents. There were no statistically significant differences in sensitivity.

Respondents with some college or less had lower specificity compared to those with more education (0.73 versus 0.80, p < 0.0001). Respondents who used a smartphone or computer to complete their survey had a higher specificity than those who completed the survey over the telephone (0.77 versus 0.70, p < 0.0001). Black respondents had a lower specificity than White respondents (0.65 versus 0.78, p < 0.0001). Effect sizes for education and mode were small (h = 0.17 and h = 0.16, respectively). The effect size for race was between small and medium (h = 0.28).
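
For reference, the reported effect sizes are consistent with Cohen's h for two proportions, which can be computed directly from the specificities above. The per-group sample sizes are not reported here, so only the effect-size calculation is sketched below; small differences from the reported values come from rounding the specificities.

import math

def cohens_h(p1, p2):
    # Cohen's h effect size for the difference between two proportions
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Specificities taken from the text above
print(round(abs(cohens_h(0.80, 0.73)), 2))  # ~0.17, the small effect size reported for education
print(round(abs(cohens_h(0.77, 0.70)), 2))  # ~0.16, mode
print(round(abs(cohens_h(0.78, 0.65)), 2))  # ~0.29 from rounded values; reported as 0.28 for race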

Because the model was fine-tuned from SimCSE, which was itself fine-tuned from BERT, it will reproduce the biases inherent in these base models. Due to tokenization, the model may incorrectly classify typos, especially in acronyms. For example, LGBTQ is classified as valid, while the typo LBGTQ is classified as gibberish.
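
You can observe this tokenization effect directly by comparing how the tokenizer splits a term and a misspelling of it; the exact subword pieces depend on the BERT vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NCHS/SANDS")

# A typo can split into unusual subword pieces, which can push a response toward Gibberish.
print(tokenizer.tokenize("LGBTQ"))
print(tokenizer.tokenize("LBGTQ"))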

Open source license

Model and code are released as open source under the Creative Commons Universal Public Domain dedication. This includes source files and code samples, if any, in the content. You may use the code, model, and content in this repository in your own projects, except for any official trademarks.

Open-source projects are made available and contributed to under licenses that include terms that, for the protection of contributors, make clear that the projects are offered:

  • “As-is”
  • Without warranty
  • Disclaiming liability for damages resulting from using the projects

This model is no different. The open content license it is offered under includes such terms.