A reproducible R workflow to preserve variable and value labels in Stata, SPSS, and SAS datasets for transparent and reproducible health research

Post published: September 15, 2025
Post category: Volume 37. No 3 (2025)

Wingston Felix Ng’ambi¹, Adamson Sinjani Muula²⁻⁴

1. Health Economics and Policy Unit, Department of Health Systems and Policy, Kamuzu University of Health Sciences, Lilongwe, Malawi
2. Africa Centre of Excellence in Public Health and Herbal Medicine (ACEPHEM), Kamuzu University of Health Sciences, Blantyre, Malawi
3. Professor and Head, Department of Community and Environmental Health, School of Global and Public Health
4. President, ECSA College of Public Health Physicians, Arusha, Tanzania

Abstract
Introduction
Large-scale health surveys like the Demographic and Health Surveys (DHS) and WHO STEPS are essential for tracking health trends and guiding policies in low- and middle-income countries. However, when these datasets are imported into tools like R, they often lose crucial metadata, namely their variable and value labels, so that clear categories become cryptic codes. This slows analysis, increases the risk of errors, and weakens data reuse.
Methods
We developed a reproducible workflow in R to import and process survey data while preserving variable and value labels. Using open-source packages such as haven, labelled, and tidyverse, we automated reading of datasets, extraction of metadata, replacement of codes with readable labels, and renaming of variables with full descriptions. The workflow was designed to be modular, easy to adapt, and accessible for analysts with basic R skills.
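The four steps described above (reading, metadata extraction, code replacement, renaming) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' published script: the toy data frame stands in for a file that would normally be read with haven::read_dta(), haven::read_sav(), or haven::read_sas(), and the codes, labels, and object names (survey, codebook, readable, freq) are hypothetical.

```r
library(haven)     # labelled(), is.labelled(), as_factor()
library(labelled)  # var_label()
library(dplyr)     # tibble(), mutate(), across(), count()

# Hypothetical stand-in for an imported survey file.
# Variable names follow DHS style; codes and labels are illustrative only.
survey <- tibble(
  v312 = labelled(c(0, 3, 3, 1),
                  labels = c("Not using" = 0, "Pill" = 1, "Injectable" = 3),
                  label  = "Current contraceptive method"),
  v501 = labelled(c(0, 1, 0, 1),
                  labels = c("Never married" = 0, "Married" = 1),
                  label  = "Current marital status")
)

# Step 1: extract variable labels into a small codebook.
var_labels <- var_label(survey)  # named list: variable name -> description
codebook <- tibble(
  variable    = names(var_labels),
  description = unlist(var_labels, use.names = FALSE)
)

# Step 2: replace numeric codes with their value labels (as factors).
readable <- mutate(survey, across(where(is.labelled), as_factor))

# Step 3: rename variables with their full descriptions from the codebook.
names(readable) <- codebook$description[match(names(readable), codebook$variable)]

# Step 4: frequency table on the readable data.
freq <- count(readable, `Current contraceptive method`)
print(freq)
```

Keeping the codebook as a separate object means the original short names (e.g., v312) remain recoverable after renaming, which may help when results need to be traced back to the source questionnaire.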
Results
We tested the workflow on the contraceptive use module from the 2015/16 Malawi DHS and the tobacco use module from Malawi’s Global Youth Tobacco Survey. Without our process, variables appeared as vague codes (e.g., v312) and responses as plain numbers. After applying our workflow, these were transformed into clear, labelled categories like “Injectable” or “Never Married.” Frequency tables generated from the cleaned data were easier to interpret and share. This automated approach saved several hours of manual recoding and reduced the risk of errors.
Conclusion
By maintaining metadata, our workflow improves transparency, reproducibility, and efficiency in digital health research. This supports better training, clearer communication, and more reliable use of health data for policy and program decisions.

Keywords: digital health, data harmonisation, metadata preservation, health surveys, reproducible research
