{"id":13527,"date":"2025-09-19T13:27:32","date_gmt":"2025-09-19T13:27:32","guid":{"rendered":"https:\/\/www.mmj.mw\/?p=13527"},"modified":"2025-09-19T13:30:00","modified_gmt":"2025-09-19T13:30:00","slug":"a-practical-guide-to-key-considerations-in-logistic-regression-for-clinical-and-public-health-research-r-tutorial","status":"publish","type":"post","link":"https:\/\/www.mmj.mw\/?p=13527","title":{"rendered":"A Practical Guide to Key Considerations in Logistic Regression for Clinical and Public Health Research: R tutorial"},"content":{"rendered":"\n<p><strong>Wingston Felix Ng\u2019ambi<sup>1<\/sup>, Cosmas Zyambo, Adamson Sinjani Muula<sup>3-5<\/sup><\/strong><\/p>\n\n\n\n<ol type=\"1\"><li>Health Economics and Policy Unit, Department of Health Systems and Policy, Kamuzu University of Health Sciences, Lilongwe, Malawi<\/li><li>Africa Centre of Excellence in Public Health and Herbal Medicine (ACEPHEM), Kamuzu University of Health Sciences, Blantyre, Malawi<\/li><li>Department of Public Health and Family Medicine, University of Zambia, Lusaka, Zambia<\/li><li>Professor and Head, Department of Community and Environmental Health, School of Global and Public Health<\/li><li>President, ECSA College of Public Health Physicians, Arusha, Tanzania<\/li><\/ol>\n\n\n\n<h4><a><strong><span class=\"has-inline-color has-black-color\">Abstract<\/span><\/strong><\/a><\/h4>\n\n\n\n<p>Logistic regression is one of the most widely used statistical methods in clinical and public health research, especially for analysing binary outcomes such as disease presence, treatment uptake, or health behaviour change. While its use appears straightforward, important methodological decisions made during data preparation, model specification, and interpretation can substantially influence findings and their policy relevance. This paper offers a practical, step-by-step guide to key considerations in logistic regression, tailored for clinical and public health researchers. 
Drawing from applied examples in infectious diseases, non-communicable diseases, and health services research, we highlight common pitfalls and best practices for variable selection, model diagnostics, handling missing data, assessing fit, and presenting results. The guide aims to bridge the gap between statistical theory and real-world application, equipping researchers to produce more robust, transparent, and policy-relevant evidence.<\/p>\n\n\n\n<h4><a><strong><span class=\"has-inline-color has-black-color\">Key Words<\/span><\/strong><\/a><\/h4>\n\n\n\n<p>Logistic regression, Clinical research, Public Health, infectious diseases, non-communicable diseases, health services research<\/p>\n\n\n\n<h4><strong>Introduction<\/strong><\/h4>\n\n\n\n<p>Logistic regression is one of the most widely used statistical tools in clinical and public health research<sup><a href=\"#ref-schober2021\">1<\/a><\/sup>. Its versatility, interpretability, and capacity to model binary outcomes make it an indispensable method in epidemiology, health services research, and health policy analysis<sup><a href=\"#ref-vittinghoff2012\">3<\/a><\/sup>. In clinical and public health, binary outcomes are common<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>; for example, whether an individual is vaccinated, whether a patient adheres to treatment, or whether a pregnant woman delivers in a health facility. Logistic regression allows researchers to examine the relationship between such outcomes and multiple predictors, while adjusting for potential confounding factors<sup><a href=\"#ref-pourhoseingholi2012\">9<\/a><\/sup>. In global health research contexts, logistic regression underpins analyses from flagship datasets such as the Demographic and Health Surveys (DHS), the Multiple Indicator Cluster Surveys (MICS), the World Health Organization\u2019s STEPS surveys, and disease-specific surveys such as the Population-based HIV Impact Assessments (PHIA). 
These data sources are often the backbone of national health policy reviews, donor-funded programme evaluations, and academic studies <sup><a href=\"#ref-franzen2017\">10<\/a><\/sup>. For instance, logistic regression has been used to identify determinants of HIV testing uptake across sub-Saharan Africa, factors influencing childhood immunisation in South Asia, and socio-economic predictors of maternal health service use in Latin America. Despite its broad adoption, the method is often misapplied or under-reported. Studies may omit critical details about variable selection, ignore checks for key assumptions such as linearity in the logit, or misinterpret odds ratios as risk ratios. Where many datasets are collected using complex survey designs, a frequent oversight is failing to account for clustering, stratification, and sampling weights<sup><a href=\"#ref-sheffel2019\">11<\/a><\/sup>. This can lead to biased estimates, misleading standard errors, and ultimately, practice and policy recommendations that are not supported by the data<sup><a href=\"#ref-simundic2013\">12<\/a><\/sup>.<\/p>\n\n\n\n<p>The gap between statistical theory and its practical application is a persistent threat to the sound interpretation of research findings. Researchers may be constrained by small sample sizes<sup><a href=\"#ref-faber2014\">13<\/a><\/sup>, incomplete data<sup><a href=\"#ref-carpenter2021\">14<\/a><\/sup>, or limited access to statistical training and software<sup><a href=\"#ref-ozgur2015\">15<\/a><\/sup>. These challenges can result in oversimplified analyses that miss important interactions or produce unstable estimates<sup><a href=\"#ref-shrier2021\">16<\/a><\/sup>. For example, a study examining determinants of hypertension treatment in a country may exclude socio-economic variables because of missing data, or might not test for interaction effects between age and sex, thereby overlooking potentially important policy insights. 
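<p>The odds-ratio pitfall noted above is easy to demonstrate: for a common outcome, the odds ratio can be far larger than the risk ratio. A toy calculation on a hypothetical 2x2 table (the numbers are illustrative only):<\/p>

```r
# Hypothetical 2x2 table: exposure vs a common binary outcome
# exposed:   60 events, 40 non-events
# unexposed: 40 events, 60 non-events
risk_exposed   <- 60 / (60 + 40)      # 0.60
risk_unexposed <- 40 / (40 + 60)      # 0.40

rr <- risk_exposed / risk_unexposed   # risk ratio = 1.5
or <- (0.60 / 0.40) / (0.40 / 0.60)   # odds ratio = 2.25

c(risk_ratio = rr, odds_ratio = or)
```

<p>Reporting the odds ratio of 2.25 as "2.25 times the risk" would overstate the association; the two measures only approximate each other when the outcome is rare.<\/p>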
The aim of this paper is to provide a practical, step-by-step guide to key considerations when applying logistic regression in public health research, with a specific focus on contexts relevant to both global and African health studies. This guide bridges the gap between statistical best practice and the realities of working with large-scale survey data, health facility datasets, and programmatic monitoring systems. It is written for applied researchers, programme evaluators, postgraduate students, and public health practitioners who may not be statisticians by training but who regularly use logistic regression to generate evidence for decision-making.<\/p>\n\n\n\n<p><strong>Data sets analysed<\/strong><\/p>\n\n\n\n<p>Throughout the paper, we combine statistical principles with worked examples in R, using real-world datasets from both global and African sources. Examples include:<\/p>\n\n\n\n<p>&#8211; Predictors of full immunisation among children aged 12\u201323 months using the Malawi Demographic and Health Survey (DHS) 2015\u201316 data <a href=\"https:\/\/dhsprogram.com\/data\/dataset\/Malawi_Standard-DHS_2015.cfm\">MDHS 2015-2016<\/a>. 
Below is the code that creates the dhs_mw dataset for analysis.<\/p>\n\n\n\n<p># Set the working directory to the project folder that contains the DHS data files,<br># e.g. setwd(\"C:\/my_project\")&nbsp; # replace with your own path<br><br># Load required packages &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br><br>library(haven)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # for reading Stata (.dta) files like DHS datasets<br>library(dplyr)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # for data wrangling (filtering, mutating, selecting)<\/p>\n\n\n\n<p><br>Attaching package: &#8216;dplyr&#8217;<\/p>\n\n\n\n<p>The following objects are masked from &#8216;package:stats&#8217;:<br><br>&nbsp;&nbsp;&nbsp; filter, lag<\/p>\n\n\n\n<p>The following objects are masked from &#8216;package:base&#8217;:<br><br>&nbsp;&nbsp;&nbsp; intersect, setdiff, setequal, union<\/p>\n\n\n\n<p>library(labelled)&nbsp;&nbsp; # for handling variable\/value labels from DHS<br>library(mice)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # for multiple imputation of missing values<\/p>\n\n\n\n<p><br>Attaching package: &#8216;mice&#8217;<\/p>\n\n\n\n<p>The following object is masked from &#8216;package:stats&#8217;:<br><br>&nbsp;&nbsp;&nbsp; filter<\/p>\n\n\n\n<p>The following objects are masked from &#8216;package:base&#8217;:<br><br>&nbsp;&nbsp;&nbsp; cbind, rbind<\/p>\n\n\n\n<p>library(survey)&nbsp;&nbsp;&nbsp;&nbsp; # for survey-weighted analyses (logistic regression)<\/p>\n\n\n\n<p>Loading required package: grid<\/p>\n\n\n\n<p>Loading required package: Matrix<\/p>\n\n\n\n<p>Loading required package: survival<\/p>\n\n\n\n<p><br>Attaching package: &#8216;survey&#8217;<\/p>\n\n\n\n<p>The following object is masked from &#8216;package:graphics&#8217;:<br><br>&nbsp;&nbsp;&nbsp; dotchart<\/p>\n\n\n\n<p># 1. 
Load DHS dataset &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br><br>dhs_mw &lt;- read_dta(\"MWKR7AFL.DTA\")<br># Reads the DHS Malawi Kids Recode file (KR) in Stata format (.dta).<br># The object &#8216;dhs_mw&#8217; now contains the full DHS dataset for analysis.<br><br># 2. Select children age 12\u201323 months &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br><br>dhs_mw &lt;- dhs_mw %>%<br>\u00a0 filter(b8 >= 1 &amp; b8 &lt;= 2)<br># b8 = age of child in completed years.<br># Keeping b8 of 1\u20132 approximates the target group old enough to have<br># completed all basic vaccinations; where an age-in-months variable is<br># available in your recode file, it selects the exact 12\u201323-month window.<br><br># 3. Define full immunisation outcome &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;<br><br>dhs_mw &lt;- dhs_mw %>%<br>\u00a0 mutate(<br>\u00a0\u00a0\u00a0 full_immunisation = case_when(<br>\u00a0\u00a0\u00a0\u00a0\u00a0 # child fully immunised if all required vaccines are coded 1, 2 or 3<br>\u00a0\u00a0\u00a0\u00a0\u00a0 # (received per card date, mother&#8217;s report, or card mark)<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h2 %in% c(1,2,3) &amp;\u00a0\u00a0 # BCG<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h3 %in% c(1,2,3) &amp;\u00a0\u00a0 # DPT1<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h4 %in% c(1,2,3) &amp;\u00a0\u00a0 # DPT2<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h5 %in% c(1,2,3) &amp;\u00a0\u00a0 # DPT3<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h6 %in% c(1,2,3) &amp;\u00a0\u00a0 # Polio1<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h7 %in% c(1,2,3) &amp;\u00a0\u00a0 # Polio2<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h8 %in% c(1,2,3) &amp;\u00a0\u00a0 # Polio3<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h9 %in% c(1,2,3) &amp;\u00a0\u00a0 # Measles<br>\u00a0\u00a0\u00a0\u00a0\u00a0 h9a %in% c(1,2,3)\u00a0\u00a0\u00a0 # Sometimes measles2 \/ rubella etc depending on DHS round<br>\u00a0\u00a0\u00a0\u00a0\u00a0 ~ 1,<br>\u00a0\u00a0\u00a0\u00a0\u00a0 TRUE ~ 0<br>\u00a0\u00a0\u00a0 )<br>\u00a0 )<br># Creates a binary outcome variable &#8216;full_immunisation&#8217;.<br># A child is coded 1 if they 
received all required vaccines:<br>#\u00a0\u00a0 BCG, 3 doses of DPT, 3 doses of polio, and measles.<br># Otherwise, coded as 0.<br># NOTE: variable names (bcg, dpt1, etc.) may differ in your DHS file<br># and need to be checked.<br><br># 4. Recode predictors and keep analysis variables &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br><br>dhs_mw &lt;- dhs_mw %>%<br>\u00a0 mutate(<br>\u00a0\u00a0\u00a0 child_age = b8,\u00a0\u00a0 # child&#8217;s age in completed years (converted to a factor below)<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 mother_education = factor(v106,\u00a0\u00a0 # mother&#8217;s education level<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = c(0,1,2,3),<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 labels = c(\"none\", \"primary\", \"secondary\", \"higher\")),<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 wealth_quintile = factor(v190,\u00a0\u00a0 # household wealth quintile<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = 1:5,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 labels = c(\"poorest\", \"poorer\", \"middle\", \"richer\", \"richest\")),<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 urban = ifelse(v025 == 1, 1, 
0),\u00a0\u00a0 # place of residence<br>\u00a0\u00a0\u00a0 # v025 = type of residence (1=urban, 2=rural).<br>\u00a0\u00a0\u00a0 # We recode to binary (1=urban, 0=rural).<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 mother_age = factor(v013,\u00a0\u00a0 # mother&#8217;s age group<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = 1:7,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 labels = c(\"15-19\", \"20-24\", \"25-29\",<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \"30-34\", \"35-39\", \"40-44\", \"45-49\")),<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 region = factor(v024,\u00a0\u00a0 # region of residence<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = c(1,2,3),<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 labels = c(\"Northern\", \"Central\", \"Southern\")),<br>\u00a0\u00a0\u00a0<br>\u00a0\u00a0\u00a0 cluster = v021,\u00a0\u00a0\u00a0 # cluster (primary sampling unit, PSU)<br>\u00a0\u00a0\u00a0 strata\u00a0 = 
v022,\u00a0\u00a0\u00a0 # survey strata<br>\u00a0\u00a0\u00a0 weight\u00a0 = v005 \/ 1000000\u00a0\u00a0 # sample weight (divide by 1,000,000 as per DHS guidance)<br>\u00a0 ) %>%<br>\u00a0 select(full_immunisation, child_age, mother_education, mother_age,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 wealth_quintile, urban, region, cluster, strata, weight)<br># After recoding, we keep only the variables relevant for the analysis.<br><br># Ensure variables have the data types expected by the modelling commands below<br>dhs_mw$mother_education &lt;- factor(dhs_mw$mother_education,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = c(\"none\", \"primary\", \"secondary\", \"higher\"))<br>dhs_mw$child_age &lt;- factor(dhs_mw$child_age, levels = c(1,2))<br>dhs_mw$urban &lt;- factor(dhs_mw$urban, levels = c(0,1))<br>dhs_mw$wealth_quintile &lt;- factor(dhs_mw$wealth_quintile,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 levels = c(\"poorest\", \"poorer\", \"middle\", \"richer\", \"richest\"))<\/p>\n\n\n\n<p>By grounding the discussion in real datasets, we demonstrate how key considerations such as outcome coding, variable selection, model diagnostics, and interpretation translate into practical steps. We also highlight common pitfalls and provide recommendations to ensure logistic regression results are robust, transparent, and practice and policy-relevant. 
In doing so, this paper aims to demystify logistic regression for applied public health research, in contexts where data complexity and policy urgency demand both statistical rigour and pragmatic decision-making.<\/p>\n\n\n\n<p><strong>Step-by-step guide to key considerations<\/strong><\/p>\n\n\n\n<p>This section gives a hands-on, practical walk through the most important choices you will make when using logistic regression in clinical and public health work. Each part explains why the step matters, shows common mistakes, and gives short R code you can copy and adapt. We use short, clear language, one idea per sentence, and real public health examples from Africa and elsewhere.<\/p>\n\n\n\n<p><em>Step 1: Define the research question<\/em> A clear question is the most important start. Ask: what exactly is the outcome? Is it naturally binary, or am I forcing a binary split? Be explicit about whether your aim is explanation, causal inference, or prediction<sup><a href=\"#ref-ito2025\">17<\/a><\/sup>. Each aim needs a slightly different approach. A precise research question is the foundation of any analysis. This requires clear specification of the outcome, with attention to whether it is naturally binary or whether dichotomisation is being imposed on a continuous or ordinal variable, which can reduce information<sup><a href=\"#ref-ratan2019\">18<\/a><\/sup>. Equally important is explicit definition of the analytic aim. Explanatory analyses prioritise identifying associations and patterns between variables, whereas causal inference focuses on estimating the effect of an exposure or intervention under defined assumptions<sup><a href=\"#ref-hammerton2021\">19<\/a><\/sup>. In contrast, prediction models aim to maximise accuracy in forecasting outcomes, often using machine learning methods that prioritise performance over interpretability. 
Distinguishing between these aims at the outset ensures alignment between research design, choice of methods, and interpretation of results<sup><a href=\"#ref-elshawi2019\">20<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Example, explanatory question<\/em>: \u201cWhat factors predict facility delivery in rural Malawi?\u201d Here you want effect estimates that adjust for confounding. Use odds ratios carefully, and consider risk differences for policy clarity.<\/p>\n\n\n\n<p><em>Example, predictive question<\/em>: \u201cCan we predict which children will be fully immunised in Malawi?\u201d Here you focus on discrimination and calibration, and care less about causal interpretation.<\/p>\n\n\n\n<p><em>Practical checks, before modelling<\/em>: Before starting any modelling, it is important to carry out some practical checks to ensure your analysis is robust. First, inspect the frequency of your outcome: very rare events or extremely common outcomes can pose challenges for standard regression methods and may require alternative approaches or careful interpretation. Second, decide on your reference category for categorical variables, code it clearly (for example, 0 versus 1), and document this choice explicitly in your methods section. Doing so improves transparency, reproducibility, and clarity for readers, reviewers, and policymakers who may rely on your findings.<\/p>\n\n\n\n<p>R, quick checks:<\/p>\n\n\n\n<p># check outcome distribution<br>table(dhs_mw$full_immunisation)<\/p>\n\n\n\n<p><br>&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 1<br>5626&nbsp; 874<\/p>\n\n\n\n<p>prop.table(table(dhs_mw$full_immunisation))<\/p>\n\n\n\n<p><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>0.8655385 0.1344615<\/p>\n\n\n\n<p><em>Step 2: Data preparation<\/em> <em>Outcome coding<\/em>: This section addresses outcome coding, predictor coding, handling of missing data, survey design features, and simple data transformations. 
For outcome coding, clarity is essential. Outcomes should be coded consistently, often using 0 and 1 to indicate absence and presence of the event, respectively<sup><a href=\"#ref-vittinghoff2012\">3<\/a><\/sup>. It is critical to state explicitly which value represents the event of interest. For example, in the Malawi Demographic and Health Survey (DHS) child immunisation dataset, the outcome can be coded as 1 = fully immunised and 0 = not fully immunised. Such explicit coding not only reduces ambiguity but also facilitates reproducibility and comparability across analyses. R, example:<\/p>\n\n\n\n<p>table(dhs_mw$full_immunisation, useNA = \"ifany\")<\/p>\n\n\n\n<p><br>&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 1<br>5626&nbsp; 874<\/p>\n\n\n\n<p><em>Predictor coding<\/em>: For predictor coding, simplicity and interpretability should guide decisions. Categories should be meaningful and analytically relevant, while avoiding excessive fragmentation into many small groups with very few observations, which can compromise statistical power and model stability<sup><a href=\"#ref-heinze2018\">21<\/a><\/sup>. When continuous variables are used in interaction terms, centring them around their mean is recommended, as this improves the interpretability of regression coefficients and reduces multicollinearity<sup><a href=\"#ref-iacobucci2016\">22<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Missing data<\/em>: For missing data, exploration of patterns is the first step. It is important to assess whether data are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), as each mechanism has different implications for analysis<sup><a href=\"#ref-carpenter2021\">14<\/a><\/sup>. In many health survey settings, multiple imputation provides a robust approach for handling missingness and yields less biased estimates compared to ad hoc methods<sup><a href=\"#ref-vanbuuren2018\">25<\/a><\/sup>. 
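<p>Before choosing between imputation and alternatives, it helps to quantify how much is missing per variable. A minimal base-R sketch on a small hypothetical data frame (names and values are illustrative):<\/p>

```r
# Small hypothetical data frame with missing values
df <- data.frame(
  age    = c(25, 31, NA, 40, 28),
  income = c(1200, NA, NA, 2500, 1800)
)

# Proportion of missing values per variable
prop_missing <- sapply(df, function(x) mean(is.na(x)))
prop_missing                              # age: 0.2, income: 0.4

# Variables above a chosen threshold (here 10%) are candidates for imputation
names(prop_missing)[prop_missing > 0.10]
```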
Complete-case analysis should be avoided when the proportion of missing data is greater than 10%, as this can reduce statistical power and lead to biased results if the missingness is not completely at random<sup><a href=\"#ref-article\">26<\/a><\/sup>.<\/p>\n\n\n\n<p>R, example with mice:<\/p>\n\n\n\n<p>library(mice)<br># md.pattern(dhs_mw) # Our data is completely observed with no missing data<br>imp &lt;- mice(dhs_mw %&gt;% select(full_immunisation, child_age, mother_education, wealth_quintile, urban),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; m = 5, method = \"pmm\", seed = 123)<\/p>\n\n\n\n<p><br>&nbsp;iter imp variable<br>&nbsp; 1&nbsp;&nbsp; 1<br>&nbsp; 1&nbsp;&nbsp; 2<br>&nbsp; 1&nbsp;&nbsp; 3<br>&nbsp; 1&nbsp;&nbsp; 4<br>&nbsp; 1&nbsp;&nbsp; 5<br>&nbsp; 2&nbsp;&nbsp; 1<br>&nbsp; 2&nbsp;&nbsp; 2<br>&nbsp; 2&nbsp;&nbsp; 3<br>&nbsp; 2&nbsp;&nbsp; 4<br>&nbsp; 2&nbsp;&nbsp; 5<br>&nbsp; 3&nbsp;&nbsp; 1<br>&nbsp; 3&nbsp;&nbsp; 2<br>&nbsp; 3&nbsp;&nbsp; 3<br>&nbsp; 3&nbsp;&nbsp; 4<br>&nbsp; 3&nbsp;&nbsp; 5<br>&nbsp; 4&nbsp;&nbsp; 1<br>&nbsp; 4&nbsp;&nbsp; 2<br>&nbsp; 4&nbsp;&nbsp; 3<br>&nbsp; 4&nbsp;&nbsp; 4<br>&nbsp; 4&nbsp;&nbsp; 5<br>&nbsp; 5&nbsp;&nbsp; 1<br>&nbsp; 5&nbsp;&nbsp; 2<br>&nbsp; 5&nbsp;&nbsp; 3<br>&nbsp; 5&nbsp;&nbsp; 4<br>&nbsp; 5&nbsp;&nbsp; 5<\/p>\n\n\n\n<p>mod_imp &lt;- with(imp, glm(full_immunisation ~ child_age + mother_education + wealth_quintile + urban,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; family = binomial))<br>pool(mod_imp)<\/p>\n\n\n\n<p>Class: mipo&nbsp;&nbsp;&nbsp; m = 5<br>&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;term m&nbsp;&nbsp;&nbsp; estimate&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ubar b&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; t 
dfcom<br>1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (Intercept) 5 -2.02286323 0.017112417 0 0.017112417&nbsp; 6490<br>2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child_age2 5 -0.17364443 0.005336868 0 0.005336868&nbsp; 6490<br>3&nbsp;&nbsp;&nbsp; mother_educationprimary 5&nbsp; 0.04161282 0.014670901 0 0.014670901&nbsp; 6490<br>4&nbsp; mother_educationsecondary 5&nbsp; 0.11769675 0.021149348 0 0.021149348&nbsp; 6490<br>5&nbsp;&nbsp;&nbsp;&nbsp; mother_educationhigher 5 -0.03311987 0.094604028 0 0.094604028&nbsp; 6490<br>6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilepoorer 5&nbsp; 0.13756188 0.013136974 0 0.013136974&nbsp; 6490<br>7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilemiddle 5&nbsp; 0.17057447 0.013670904 0 0.013670904&nbsp; 6490<br>8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilericher 5&nbsp; 0.31267124 0.014138052 0 0.014138052&nbsp; 6490<br>9&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilerichest 5&nbsp; 0.29185892 0.019885581 0 0.019885581&nbsp; 6490<br>10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urban1 5&nbsp; 0.10163447 0.012979206 0 0.012979206&nbsp; 6490<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; df riv lambda&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fmi<br>1&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>2&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>3&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>4&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>5&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>6&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>7&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>8&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 
0.0003081547<br>9&nbsp; 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<br>10 6487.247&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0 0.0003081547<\/p>\n\n\n\n<p><em>Survey design and weights<\/em>: Survey design and weights require special attention, as many&nbsp; national health datasets are based on complex sampling strategies that involve stratification, multi-stage sampling, clustering, and unequal probabilities of selection<sup><a href=\"#ref-heeringa2017\">29<\/a><\/sup>. Ignoring these design features can result in biased standard errors and, in some cases, biased point estimates<sup><a href=\"#ref-lumley2011\">30<\/a><\/sup>. Analyses should therefore apply survey-aware methods that incorporate weights and account for clustering to ensure valid statistical inference<sup><a href=\"#ref-lumley2011\">30<\/a><\/sup>.<\/p>\n\n\n\n<p>R, using survey package:<\/p>\n\n\n\n<p>library(survey)<br><br>dhs_design &lt;- svydesign(ids = ~cluster, strata = ~strata, weights = ~weight, data = dhs_mw, nest = TRUE)<br><br>svyglm(full_immunisation ~ child_age + mother_education + wealth_quintile + urban,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; design = dhs_design, family = quasibinomial())<\/p>\n\n\n\n<p>Stratified 1 &#8211; level Cluster Sampling design (with replacement)<br>With (849) clusters.<br>svydesign(ids = ~cluster, strata = ~strata, weights = ~weight,<br>&nbsp;&nbsp;&nbsp; data = dhs_mw, nest = TRUE)<br><br>Call:&nbsp; svyglm(formula = full_immunisation ~ child_age + mother_education +<br>&nbsp;&nbsp;&nbsp; wealth_quintile + urban, design = dhs_design, family = quasibinomial())<br><br>Coefficients:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child_age2&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
-2.24446&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.15802&nbsp;<br>&nbsp; mother_educationprimary&nbsp; mother_educationsecondary&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.19456&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.29615&nbsp;<br>&nbsp;&nbsp; mother_educationhigher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilepoorer&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.52557&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.27186&nbsp;<br>&nbsp;&nbsp; &nbsp;wealth_quintilemiddle&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintilericher&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.22911&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.47902&nbsp;<br>&nbsp;&nbsp; wealth_quintilerichest&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; urban1&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.33950&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.05492&nbsp;<br><br>Degrees of Freedom: 6499 Total (i.e. Null);&nbsp; 784 Residual<br>Null Deviance:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5135<br>Residual Deviance: 5097&nbsp;&nbsp;&nbsp;&nbsp; AIC: NA<\/p>\n\n\n\n<p><em>Transformations and outliers<\/em>: For transformations and outliers, distributions of continuous predictors should be carefully examined. 
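<p>One specific check for continuous predictors is whether their effect is linear on the logit scale; a spline term can be compared against the linear fit. A sketch on simulated data (all variable names hypothetical), using the splines package that ships with R:<\/p>

```r
library(splines)

set.seed(42)
age <- runif(500, 18, 80)                        # hypothetical continuous predictor
p   <- plogis(-4 + 0.15 * age - 0.0015 * age^2)  # true relationship is non-linear
y   <- rbinom(500, 1, p)

mod_linear <- glm(y ~ age, family = binomial)
mod_spline <- glm(y ~ ns(age, df = 3), family = binomial)  # natural cubic spline

# A clearly lower AIC for the spline model suggests the
# linear-in-the-logit assumption is too restrictive
AIC(mod_linear, mod_spline)
```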
Skewed predictors may benefit from log transformations or flexible spline functions to improve model fit and interpretability<sup><a href=\"#ref-choi2022\">31<\/a><\/sup>. Extreme values should be handled cautiously; winsorising or truncating should only be applied when there is clear justification, as inappropriate handling of outliers can distort results and reduce external validity<sup><a href=\"#ref-kwak2017\">32<\/a><\/sup>.<\/p>\n\n\n\n<p>R, centring example:<\/p>\n\n\n\n<p>dhs_mw &lt;- dhs_mw %&gt;%<br>&nbsp; mutate(child_age = case_when(<br>&nbsp;&nbsp;&nbsp; child_age == 1 ~ 0,<br>&nbsp;&nbsp;&nbsp; child_age == 2 ~ 1,<br>&nbsp;&nbsp;&nbsp; TRUE ~ NA_real_&nbsp; # keep missing or unexpected values as NA<br>&nbsp; ))<br><br># Check the result<br>table(dhs_mw$child_age, useNA = \"ifany\")<\/p>\n\n\n\n<p><br>&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 1<br>3248 3252<\/p>\n\n\n\n<p>dhs_mw &lt;- dhs_mw %&gt;% mutate(child_age_c = child_age - mean(child_age, na.rm = TRUE))<\/p>\n\n\n\n<p><em>Document every recode<\/em>: Finally, every recoding step should be documented in detail. Transparent documentation supports reproducibility, allows other researchers to replicate the analysis, and helps informed policymakers and reviewers trust the validity of the results<sup><a href=\"#ref-ngambi2025\">33<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Step 3: Variable selection<\/em> <em>Choose predictors with purpose<\/em>: Selection of predictors should be grounded in theory, prior evidence, and where possible, formal tools such as directed acyclic graphs (DAGs) to make assumptions explicit<sup><a href=\"#ref-piccininni2020\">34<\/a><\/sup>. This approach ensures that models reflect plausible causal structures and avoid spurious associations<sup><a href=\"#ref-pearl2009\">35<\/a><\/sup>. 
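<p>A causal diagram of this kind can be written down and queried directly in R. A minimal sketch, assuming the dagitty package is installed; the diagram and variable names are illustrative, not a complete causal model of immunisation:<\/p>

```r
library(dagitty)  # install.packages("dagitty") if needed

# Illustrative DAG: wealth confounds the education -> immunisation relationship
g <- dagitty("dag {
  mother_education -> full_immunisation
  wealth -> mother_education
  wealth -> full_immunisation
}")

# Minimal adjustment set for the effect of mother_education on full_immunisation
adjustmentSets(g, exposure = "mother_education", outcome = "full_immunisation")
```

<p>Here the query returns wealth as the variable to adjust for, which makes the confounding assumption explicit and checkable by reviewers.<\/p>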
While data-driven methods, such as automated variable selection, can be valuable in prediction tasks, they are less appropriate for explanatory or causal analyses, where reliance on prior knowledge provides stronger justification for including or excluding variables<sup><a href=\"#ref-dyer2025\">36<\/a><\/sup>. A clear rationale for each predictor strengthens the credibility and interpretability of the final model<sup><a href=\"#ref-westreich2012\">37<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Events per variable (EPV)<\/em>: The number of outcome events relative to the number of predictors is a key consideration in model building. A widely cited rule of thumb is to maintain at least ten outcome events for each parameter estimated, often referred to as the events-per-variable (EPV) guideline<sup><a href=\"#ref-bujang2018\">38<\/a><\/sup>. When EPV is low, models risk instability, overfitting, and inflated standard errors. In such situations, strategies include reducing the number of predictors, collapsing sparse categories into broader groups, or applying penalised regression methods such as ridge, lasso, or elastic net. These approaches help preserve model stability while still capturing relevant information<sup><a href=\"#ref-altelbany2021\">39<\/a><\/sup>.<\/p>\n\n\n\n<p>R, check EPV:<\/p>\n\n\n\n<p>events &lt;- sum(dhs_mw$full_immunisation == 1, na.rm = TRUE)<br>num_params &lt;- 6&nbsp; # approximate number of model coefficients<br>epv &lt;- events \/ num_params<br>epv<\/p>\n\n\n\n<p>[1] 145.6667<\/p>\n\n\n\n<p><em>Automated stepwise selection<\/em>: Automated stepwise selection is a common approach for reducing the number of predictors in a model, but it is inherently unstable and can produce results that are sensitive to small changes in the data. When this method is employed, it is essential to report its use transparently, including the criteria for adding or removing variables<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. 
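<p>As a more stable alternative to stepwise selection when EPV is low, the penalised methods mentioned above can be sketched as follows, assuming the glmnet package (which expects a numeric predictor matrix rather than a formula; all data here are simulated):<\/p>

```r
library(glmnet)  # install.packages("glmnet") if needed

set.seed(123)
n <- 200
X <- matrix(rnorm(n * 10), n, 10)   # 10 hypothetical predictors
colnames(X) <- paste0("x", 1:10)
y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))

# Lasso-penalised logistic regression; lambda chosen by cross-validation
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.1se")   # many coefficients are shrunk exactly to zero
```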
To ensure that the final model is robust and reliable, validation techniques such as bootstrapping or cross-validation should be applied, providing an assessment of model stability and predictive performance beyond the original sample<sup><a href=\"#ref-james2013\">40<\/a><\/sup>.<\/p>\n\n\n\n<p>R, stepwise warning example:<\/p>\n\n\n\n<p>library(MASS)<\/p>\n\n\n\n<p><br>Attaching package: &#8216;MASS&#8217;<\/p>\n\n\n\n<p>The following object is masked from &#8216;package:dplyr&#8217;:<br><br>&nbsp;&nbsp;&nbsp; select<\/p>\n\n\n\n<p># List all variables in `dhs_mw`<br>for (var in names(dhs_mw)) {<br>&nbsp; print(var)<br>}<\/p>\n\n\n\n<p>[1] &quot;full_immunisation&quot;<br>[1] &quot;child_age&quot;<br>[1] &quot;mother_education&quot;<br>[1] &quot;mother_age&quot;<br>[1] &quot;wealth_quintile&quot;<br>[1] &quot;urban&quot;<br>[1] &quot;region&quot;<br>[1] &quot;cluster&quot;<br>[1] &quot;strata&quot;<br>[1] &quot;weight&quot;<br>[1] &quot;child_age_c&quot;<\/p>\n\n\n\n<p>full_mod &lt;- glm(full_immunisation ~ ., family = binomial, data = dhs_mw)<br>step_mod &lt;- stepAIC(full_mod, direction = &quot;both&quot;, trace = FALSE)<br>summary(step_mod)<\/p>\n\n\n\n<p><br>Call:<br>glm(formula = full_immunisation ~ child_age + wealth_quintile +<br>&nbsp;&nbsp;&nbsp; region, family = binomial, data = dhs_mw)<br><br>Coefficients:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Estimate Std. 
Error z value Pr(&gt;|z|)&nbsp;&nbsp;&nbsp;<br>(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -2.06314&nbsp;&nbsp;&nbsp; 0.12461 -16.557&nbsp; &lt; 2e-16 ***<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.16720&nbsp;&nbsp;&nbsp; 0.07325&nbsp; -2.283 0.022459 *&nbsp;<br>wealth_quintilepoorer&nbsp;&nbsp; 0.14488&nbsp;&nbsp;&nbsp; 0.11461&nbsp;&nbsp; 1.264 0.206198&nbsp;&nbsp;&nbsp;<br>wealth_quintilemiddle&nbsp;&nbsp; 0.21043&nbsp;&nbsp;&nbsp; 0.11700&nbsp;&nbsp; 1.798 0.072102 .&nbsp;<br>wealth_quintilericher&nbsp;&nbsp; 0.36978&nbsp;&nbsp;&nbsp; 0.11733&nbsp;&nbsp; 3.152 0.001624 **<br>wealth_quintilerichest &nbsp;0.40253&nbsp;&nbsp;&nbsp; 0.11774&nbsp;&nbsp; 3.419 0.000629 ***<br>regionCentral&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.35952&nbsp;&nbsp;&nbsp; 0.10316&nbsp;&nbsp; 3.485 0.000492 ***<br>regionSouthern&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.17059&nbsp;&nbsp;&nbsp; 0.10419&nbsp; -1.637 0.101574&nbsp;&nbsp;&nbsp;<br>&#8212;<br>Signif. 
codes:&nbsp; 0 &#8216;***&#8217; 0.001 &#8216;**&#8217; 0.01 &#8216;*&#8217; 0.05 &#8216;.&#8217; 0.1 &#8216; &#8216; 1<br><br>(Dispersion parameter for binomial family taken to be 1)<br><br>&nbsp;&nbsp;&nbsp; Null deviance: 5132.1&nbsp; on 6499&nbsp; degrees of freedom<br>Residual deviance: 5068.9&nbsp; on 6492&nbsp; degrees of freedom<br>AIC: 5084.9<br><br>Number of Fisher Scoring iterations: 4<\/p>\n\n\n\n<p>library(broom)<br>library(knitr)<br><br># Assume step_mod is your fitted glm\/logistic model<br># step_mod &lt;- glm(full_immunisation ~ child_age + sex + &#8230;, family = binomial(), data = dhs_mw)<br><br># --------------------<br># Create a tidy table<br># --------------------<br>tidy_table &lt;- broom::tidy(step_mod) %&gt;%<br>&nbsp; mutate(<br>&nbsp;&nbsp;&nbsp; OR = exp(estimate),&nbsp; # convert logOR to OR<br>&nbsp;&nbsp;&nbsp; lower_CI = exp(estimate - 1.96 * std.error),&nbsp; # 95% CI lower<br>&nbsp;&nbsp;&nbsp; upper_CI = exp(estimate + 1.96 * std.error),&nbsp; # 95% CI upper<br>&nbsp;&nbsp;&nbsp; p_value = p.value&nbsp; # p-value<br>&nbsp; ) %&gt;%<br>&nbsp; dplyr::select(term, OR, lower_CI, upper_CI, p_value)<br><br># --------------------<br># Print nicely<br># --------------------<br>knitr::kable(tidy_table, digits = 3, caption = &quot;Logistic Regression Results: OR and 95% CI&quot;)<\/p>\n\n\n\n<p>Logistic Regression Results: OR and 95% CI<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><td>term<\/td><td>OR<\/td><td>lower_CI<\/td><td>upper_CI<\/td><td>p_value<\/td><\/tr><\/thead><tbody><tr><td>(Intercept)<\/td><td>0.127<\/td><td>0.100<\/td><td>0.162<\/td><td>0.000<\/td><\/tr><tr><td>child_age<\/td><td>0.846<\/td><td>0.733<\/td><td>0.977<\/td><td>0.022<\/td><\/tr><tr><td>wealth_quintilepoorer<\/td><td>1.156<\/td><td>0.923<\/td><td>1.447<\/td><td>0.206<\/td><\/tr><tr><td>wealth_quintilemiddle<\/td><td>1.234<\/td><td>0.981<\/td><td>1.552<\/td><td>0.072<\/td><\/tr><tr><td>wealth_quintilericher<\/td><td>1.447<\/td><td>1.150<\/td><td>1.822<\/td><td>0.002<\/td><\/tr><tr><td>wealth_quintilerichest<\/td><td>1.496<\/td><td>1.187<\/td><td>1.884<\/td><td>0.001<\/td><\/tr><tr><td>regionCentral<\/td><td>1.433<\/td><td>1.170<\/td><td>1.754<\/td><td>0.000<\/td><\/tr><tr><td>regionSouthern<\/td><td>0.843<\/td><td>0.687<\/td><td>1.034<\/td><td>0.102<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><em>Penalised regression for many predictors<\/em>: Penalised regression methods are useful when models include many predictors or when the primary goal is accurate prediction<sup><a href=\"#ref-altelbany2021\">39<\/a><\/sup>. Techniques such as lasso regression are widely used because they simultaneously perform variable selection and shrinkage, reducing overfitting and improving model generalisability. 
By constraining or penalising coefficient estimates, these methods stabilise models with high-dimensional data while retaining the most informative predictors, making them particularly valuable in settings with limited events per variable or complex predictor structures<sup><a href=\"#ref-james2013\">40<\/a><\/sup>.<\/p>\n\n\n\n<p>R, lasso example with glmnet:<\/p>\n\n\n\n<p>library(glmnet)<\/p>\n\n\n\n<p>Loaded glmnet 4.1-8<\/p>\n\n\n\n<p>x &lt;- model.matrix(full_immunisation ~ child_age + mother_education + wealth_quintile + urban + region + strata + cluster + mother_age, data = dhs_mw)[, -1]<br>y &lt;- dhs_mw$full_immunisation<br>cvfit &lt;- cv.glmnet(x, y, family = &quot;binomial&quot;, alpha = 1)<br>coef(cvfit, s = &quot;lambda.min&quot;)<\/p>\n\n\n\n<p>20 x 1 sparse Matrix of class &#8220;dgCMatrix&#8221;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s1<br>(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -1.919637609<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.104732144<br>mother_educationprimary&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mother_educationsecondary&nbsp; 0.050703849<br>mother_educationhigher&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>wealth_quintilepoorer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>wealth_quintilemiddle&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>wealth_quintilericher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.137566617<br>wealth_quintilerichest&nbsp;&nbsp;&nbsp;&nbsp; 
0.135410726<br>urban1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.052631921<br>regionCentral&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.287798303<br>regionSouthern&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.166838474<br>strata&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>cluster&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mother_age20-24&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.002730957<br>mother_age25-29&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mother_age30-34&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.031755025<br>mother_age35-39&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mother_age40-44&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mother_age45-49&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.166722734<\/p>\n\n\n\n<p><em>Collinearity<\/em>: Collinearity among predictors can compromise model stability and interpretability<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. It is important to assess correlations using measures such as variance inflation factors (VIF) and to address highly correlated variables by either dropping redundant predictors, combining them into meaningful composite measures, or applying dimensionality reduction techniques such as principal component analysis<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. 
Managing collinearity ensures that coefficient estimates remain reliable and that the model provides clear and interpretable insights<sup><a href=\"#ref-vatcheva2016\">41<\/a><\/sup>.<\/p>\n\n\n\n<p>R, VIF:<\/p>\n\n\n\n<p>library(car)<\/p>\n\n\n\n<p>Loading required package: carData<\/p>\n\n\n\n<p><br>Attaching package: &#8216;car&#8217;<\/p>\n\n\n\n<p>The following object is masked from &#8216;package:dplyr&#8217;:<br><br>&nbsp;&nbsp;&nbsp; recode<\/p>\n\n\n\n<p>vif(glm(full_immunisation ~ child_age + mother_education + wealth_quintile + urban, family = binomial, data = dhs_mw))<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GVIF Df GVIF^(1\/(2*Df))<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.002348&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.001173<br>mother_education 1.315415&nbsp; 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.046752<br>wealth_quintile&nbsp; 1.698182&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.068435<br>urban&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.447586&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.203157<\/p>\n\n\n\n<p><em>Step 4: Model specification<\/em> Model specification involves selecting the appropriate functional form for predictors, testing potential interactions, and deciding whether to include random effects when the data exhibit clustering beyond the survey design<sup><a href=\"#ref-kirkwood2003\">42<\/a><\/sup>. Proper specification ensures that the model accurately captures the underlying relationships without introducing bias or misspecification. 
Considering interactions allows for more nuanced understanding of how predictors jointly influence the outcome, while random effects account for unobserved heterogeneity in clustered data, improving both the validity and interpretability of the model<sup><a href=\"#ref-kirkwood2003\">42<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Linearity in the logit<\/em>: Logistic regression assumes that the log odds of the outcome change linearly with each continuous predictor<sup><a href=\"#ref-kirkwood2003\">42<\/a><\/sup>. It is important to assess this assumption, either graphically or using formal statistical tests<sup><a href=\"#ref-schreiber-gregory2018\">44<\/a><\/sup>. If the relationship is found to be non-linear, flexible approaches such as spline functions or fractional polynomials can be applied to capture the true shape of the association, ensuring that the model accurately represents the data and improves predictive performance<sup><a href=\"#ref-hosmer2013\">46<\/a><\/sup>.<\/p>\n\n\n\n<p>R, simple graphical check and natural spline:<\/p>\n\n\n\n<p>library(splines)<br>mod_linear &lt;- glm(full_immunisation ~ child_age + mother_education + wealth_quintile + urban, family = binomial, data = dhs_mw)<br>mod_spline &lt;- glm(full_immunisation ~ ns(child_age, df = 4) + mother_education + wealth_quintile + urban, family = binomial, data = dhs_mw)<\/p>\n\n\n\n<p>Warning in ns(child_age, df = 4): shoving &#8216;interior&#8217; knots matching boundary<br>knots to inside<\/p>\n\n\n\n<p>anova(mod_linear, mod_spline, test = &#8220;Chisq&#8221;)<\/p>\n\n\n\n<p>Analysis of Deviance Table<br><br>Model 1: full_immunisation ~ child_age + mother_education + wealth_quintile +<br>&nbsp;&nbsp;&nbsp; urban<br>Model 2: full_immunisation ~ ns(child_age, df = 4) + mother_education +<br>&nbsp;&nbsp;&nbsp; wealth_quintile + urban<br>&nbsp; Resid. Df Resid. 
Dev Df Deviance Pr(&gt;Chi)<br>1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 6490&nbsp;&nbsp;&nbsp;&nbsp; 5110.5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 6490&nbsp;&nbsp;&nbsp;&nbsp; 5110.5&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p>Note that because child_age was recoded to a binary indicator, the spline cannot add flexibility here, which is why the two models have identical deviance (zero additional degrees of freedom); with a genuinely continuous predictor, this comparison would reveal any departure from linearity. The same limitation triggers the smoothing warning below.<\/p>\n\n\n\n<p>For plotting the logit:<\/p>\n\n\n\n<p>dhs_mw &lt;- dhs_mw %&gt;% mutate(pred = predict(mod_linear, type = &quot;response&quot;),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; logit = log(pred \/ (1 - pred)))<br>library(ggplot2)<br>ggplot(dhs_mw, aes(x = child_age, y = logit)) + geom_point(alpha = 0.3) + geom_smooth()<\/p>\n\n\n\n<p>`geom_smooth()` using method = &#8216;gam&#8217; and formula = &#8216;y ~ s(x, bs = &#8220;cs&#8221;)&#8217;<\/p>\n\n\n\n<p>Warning: Failed to fit group -1.<br>Caused by error in `smooth.construct.cr.smooth.spec()`:<br>! x has insufficient unique values to support 10 knots: reduce k.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"758\" height=\"606\" src=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-4.png\" alt=\"\" class=\"wp-image-13528\" srcset=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-4.png 758w, https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-4-300x240.png 300w\" sizes=\"(max-width: 758px) 100vw, 758px\" \/><\/figure>\n\n\n\n<p><strong>Figure 1: Distribution of full immunization by age of the child<\/strong><\/p>\n\n\n\n<p><em>Interactions<\/em>: Interactions should be included in a model only when there is a plausible rationale based on theory or prior evidence<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. Once included, interactions should be tested statistically and visualised to aid interpretation. 
In many policy-relevant analyses, certain interactions can be particularly meaningful<sup><a href=\"#ref-cotter2023\">47<\/a><\/sup>; for example, the effect of education on health outcomes may differ between urban and rural residents. Careful consideration of interactions enhances the model\u2019s ability to capture nuanced relationships and provides more actionable insights for decision-making.<\/p>\n\n\n\n<p>R, example interaction:<\/p>\n\n\n\n<p>mod_int &lt;- glm(full_immunisation ~ child_age + mother_education * urban + wealth_quintile, family = binomial, data = dhs_mw)<br>summary(mod_int)<\/p>\n\n\n\n<p><br>Call:<br>glm(formula = full_immunisation ~ child_age + mother_education *<br>&nbsp;&nbsp;&nbsp; urban + wealth_quintile, family = binomial, data = dhs_mw)<br><br>Coefficients:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Estimate Std. 
Error z value Pr(&gt;|z|)&nbsp;&nbsp;&nbsp;<br>(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -1.996732&nbsp;&nbsp; 0.132556 -15.063&nbsp; &lt; 2e-16 ***<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.174629&nbsp;&nbsp; 0.073106&nbsp; -2.389&nbsp; 0.01691 *&nbsp;<br>mother_educationprimary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.008395&nbsp;&nbsp; 0.124631&nbsp;&nbsp; 0.067&nbsp; 0.94630&nbsp;&nbsp;&nbsp;<br>mother_educationsecondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.092991&nbsp;&nbsp; 0.155946&nbsp;&nbsp; 0.596&nbsp; 0.55097&nbsp;&nbsp;&nbsp;<br>mother_educationhigher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.009672&nbsp;&nbsp; 0.506280&nbsp;&nbsp; 0.019&nbsp; 0.98476&nbsp;&nbsp;&nbsp;<br>urban1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;-0.398505&nbsp;&nbsp; 0.541564&nbsp; -0.736&nbsp; 0.46183&nbsp;&nbsp;&nbsp;<br>wealth_quintilepoorer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.140799&nbsp;&nbsp; 0.114704&nbsp;&nbsp; 1.227&nbsp; 0.21964&nbsp;&nbsp;&nbsp;<br>wealth_quintilemiddle&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.172510&nbsp;&nbsp; 0.117111&nbsp;&nbsp; 1.473&nbsp; 0.14074&nbsp;&nbsp;&nbsp;<br>wealth_quintilericher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.314012&nbsp;&nbsp; 0.119719&nbsp;&nbsp; 2.623 &nbsp;0.00872 **<br>wealth_quintilerichest&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.291767&nbsp;&nbsp; 0.141237&nbsp;&nbsp; 2.066&nbsp; 0.03885 *&nbsp;<br>mother_educationprimary:urban1&nbsp;&nbsp;&nbsp; 0.547274&nbsp;&nbsp; 0.555365&nbsp;&nbsp; 0.985&nbsp; 
0.32441&nbsp;&nbsp;&nbsp;<br>mother_educationsecondary:urban1&nbsp; 0.495062&nbsp;&nbsp; 0.564461&nbsp;&nbsp; 0.877&nbsp; 0.38046&nbsp;&nbsp;&nbsp;<br>mother_educationhigher:urban1&nbsp;&nbsp;&nbsp;&nbsp; 0.401599&nbsp;&nbsp; 0.796192&nbsp;&nbsp; 0.504&nbsp; 0.61398&nbsp;&nbsp;&nbsp;<br>&#8212;<br>Signif. codes:&nbsp; 0 &#8216;***&#8217; 0.001 &#8216;**&#8217; 0.01 &#8216;*&#8217; 0.05 &#8216;.&#8217; 0.1 &#8216; &#8216; 1<br><br>(Dispersion parameter for binomial family taken to be 1)<br><br>&nbsp;&nbsp;&nbsp; Null deviance: 5132.1&nbsp; on 6499&nbsp; degrees of freedom<br>Residual deviance: 5109.4&nbsp; on 6487&nbsp; degrees of freedom<br>AIC: 5135.4<br><br>Number of Fisher Scoring iterations: 4<\/p>\n\n\n\n<p><em>Random effects<\/em>: When data have a hierarchical or clustered structure that is not fully accounted for by survey weights, multilevel models with random effects should be considered<sup><a href=\"#ref-bender2005\">48<\/a><\/sup>. For example, children may be nested within clusters, and clusters within districts. Multilevel models explicitly capture between-cluster variation and can provide either cluster-specific or population-averaged estimates, depending on the choice of link function and the interpretation required<sup><a href=\"#ref-austin2024\">49<\/a><\/sup>. 
Incorporating random effects improves model accuracy, accounts for unobserved heterogeneity, and ensures valid inference in clustered data.<\/p>\n\n\n\n<p>R, simple multilevel example with lme4:<\/p>\n\n\n\n<p>library(lme4)<\/p>\n\n\n\n<p>Warning: package &#8216;lme4&#8217; was built under R version 4.5.1<\/p>\n\n\n\n<p>mod_mixed &lt;- glmer(full_immunisation ~ child_age + mother_education + (1 | cluster), family = binomial, data = dhs_mw)<br>summary(mod_mixed)<\/p>\n\n\n\n<p>Generalized linear mixed model fit by maximum likelihood (Laplace<br>&nbsp; Approximation) glmerMod<br>&nbsp;Family: binomial&nbsp; ( logit )<br>Formula: full_immunisation ~ child_age + mother_education + (1 | cluster)<br>&nbsp;&nbsp; Data: dhs_mw<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; AIC&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; BIC&nbsp;&nbsp;&nbsp; logLik -2*log(L)&nbsp; df.resid<br>&nbsp;&nbsp; 4939.8&nbsp;&nbsp;&nbsp; 4980.4&nbsp;&nbsp; -2463.9&nbsp;&nbsp;&nbsp; 4927.8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 6494<br><br>Scaled residuals:<br>&nbsp;&nbsp;&nbsp; Min&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1Q&nbsp; Median&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3Q&nbsp;&nbsp;&nbsp;&nbsp; Max<br>-1.0933 -0.3682 -0.2737 -0.2371&nbsp; 3.9218<br><br>Random effects:<br>&nbsp;Groups&nbsp; Name&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Variance Std.Dev.<br>&nbsp;cluster (Intercept) 1.081&nbsp;&nbsp;&nbsp; 1.04&nbsp;&nbsp;&nbsp;<br>Number of obs: 6500, groups:&nbsp; cluster, 849<br><br>Fixed effects:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Estimate Std. 
Error z value Pr(&gt;|z|)&nbsp;&nbsp;&nbsp;<br>(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -2.29534&nbsp;&nbsp;&nbsp; 0.14281 -16.072&nbsp;&nbsp; &lt;2e-16 ***<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.15959&nbsp;&nbsp;&nbsp; 0.07996&nbsp; -1.996&nbsp;&nbsp; 0.0459 *&nbsp;<br>mother_educationprimary&nbsp;&nbsp;&nbsp; 0.12493&nbsp;&nbsp;&nbsp; 0.13413&nbsp;&nbsp; 0.931&nbsp;&nbsp; 0.3516&nbsp;&nbsp;&nbsp;<br>mother_educationsecondary&nbsp; 0.30202&nbsp;&nbsp;&nbsp; 0.15524&nbsp;&nbsp; 1.945&nbsp;&nbsp; 0.0517 .&nbsp;<br>mother_educationhigher&nbsp;&nbsp;&nbsp;&nbsp; 0.19440&nbsp;&nbsp;&nbsp; 0.33794&nbsp;&nbsp; 0.575&nbsp;&nbsp; 0.5651&nbsp;&nbsp;&nbsp;<br>&#8212;<br>Signif. codes:&nbsp; 0 &#8216;***&#8217; 0.001 &#8216;**&#8217; 0.01 &#8216;*&#8217; 0.05 &#8216;.&#8217; 0.1 &#8216; &#8216; 1<br><br>Correlation of Fixed Effects:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (Intr) chld_g mthr_dctnp mthr_dctns<br>child_age&nbsp;&nbsp; -0.284&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mthr_dctnpr -0.827&nbsp; 0.023&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mthr_dctnsc -0.732&nbsp; 0.009&nbsp; 0.759&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>mthr_dctnhg -0.340&nbsp; 0.013&nbsp; 0.346&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.323&nbsp;&nbsp;&nbsp;<\/p>\n\n\n\n<p><em>Keep models parsimonious<\/em>: Models should be kept as parsimonious as possible, including only predictors that are necessary and justified by theory or prior evidence. 
Complex models require more data to estimate parameters reliably, and overfitting can occur when too many variables are included relative to the number of outcome events. A parsimonious approach improves interpretability, reduces variance in estimates, and increases the likelihood that the model will generalise to other datasets or populations<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Step 5: Model diagnostics and fit<\/em> <em>Good modelling requires checking fit and influence<\/em>: Good modelling practice requires careful assessment of model fit and the influence of individual observations<sup><a href=\"#ref-hilbe2009\">4<\/a><\/sup>. Multiple diagnostic tools should be employed rather than relying on a single test, as different diagnostics provide complementary information<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. Evaluating residuals, leverage, influence measures, and overall goodness-of-fit helps identify potential model misspecification, outliers, or highly influential points, ensuring that the final model is both robust and reliable.<\/p>\n\n\n\n<p><em>Discrimination (AUC)<\/em>: Discrimination assesses a model\u2019s ability to distinguish between individuals who experience the event and those who do not. The c-statistic, commonly referred to as the area under the receiver operating characteristic curve (AUC), provides a summary measure of this separation<sup><a href=\"#ref-sadatsafavi2022\">50<\/a><\/sup>. In policy-relevant applications, an AUC above 0.7 is generally considered acceptable, while an AUC above 0.8 indicates strong discrimination; however, the interpretation should always consider the context, the outcome, and the intended use of the model<sup><a href=\"#ref-corbacioglu2023\">51<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Calibration<\/em>: Calibration evaluates whether predicted probabilities align with observed outcomes<sup><a href=\"#ref-walsh2017\">52<\/a><\/sup>. 
Good calibration indicates that the model\u2019s predicted risk accurately reflects the actual probability of the event. Common approaches include calibration plots and the Hosmer-Lemeshow test<sup><a href=\"#ref-hosmer2013\">46<\/a><\/sup>; however, because the Hosmer-Lemeshow test can be overly sensitive in large samples, visual inspection of calibration curves is often more informative<sup><a href=\"#ref-hilbe2009\">4<\/a><\/sup>. Properly calibrated models are critical for decision-making and policy applications, ensuring that predicted risks can be interpreted with confidence<sup><a href=\"#ref-huang2020\">53<\/a><\/sup>.<\/p>\n\n\n\n<p>R, Hosmer-Lemeshow, ROC, Brier:<\/p>\n\n\n\n<p>library(ResourceSelection)<\/p>\n\n\n\n<p>Warning: package &#8216;ResourceSelection&#8217; was built under R version 4.5.1<\/p>\n\n\n\n<p>ResourceSelection 0.3-6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2023-06-27<\/p>\n\n\n\n<p>hoslem.test(dhs_mw$full_immunisation, fitted(mod_linear), g = 10)<\/p>\n\n\n\n<p><br>&nbsp;&nbsp;&nbsp; Hosmer and Lemeshow goodness of fit (GOF) test<br><br>data:&nbsp; dhs_mw$full_immunisation, fitted(mod_linear)<br>X-squared = 5.644, df = 8, p-value = 0.687<\/p>\n\n\n\n<p>library(pROC)<\/p>\n\n\n\n<p>Type &#8216;citation(&#8220;pROC&#8221;)&#8217; for a citation.<\/p>\n\n\n\n<p><br>Attaching package: &#8216;pROC&#8217;<\/p>\n\n\n\n<p>The following objects are masked from &#8216;package:stats&#8217;:<br><br>&nbsp;&nbsp;&nbsp; cov, smooth, var<\/p>\n\n\n\n<p>roc_obj &lt;- roc(dhs_mw$full_immunisation, fitted(mod_linear))<\/p>\n\n\n\n<p>Setting levels: control = 0, case = 1<\/p>\n\n\n\n<p>Setting direction: controls &lt; cases<\/p>\n\n\n\n<p>auc(roc_obj)<\/p>\n\n\n\n<p>Area under the curve: 0.5484<\/p>\n\n\n\n<p># Brier score<br>mean((dhs_mw$full_immunisation - fitted(mod_linear))^2)<\/p>\n\n\n\n<p>[1] 0.1159879<\/p>\n\n\n\n<p><em>Residuals and influential points<\/em>: Assessing residuals and influential points is essential<sup><a 
href=\"#ref-hilbe2009\">4<\/a><\/sup>. Evaluating residuals, leverage, and influence measures helps identify observations that may disproportionately affect coefficient estimates<sup><a href=\"#ref-dey2025\">55<\/a><\/sup>. This is particularly important in small samples, where single extreme or influential observations can distort results. Detecting and addressing such points through careful investigation or sensitivity analyses improves model reliability and strengthens confidence in the findings.<\/p>\n\n\n\n<p>infl &lt;- influence.measures(mod_linear)<br>#summary(infl) #Activate this to see the results<\/p>\n\n\n\n<p><em>Model validation<\/em>: Model validation is a critical step in both predictive and inferential analyses<sup><a href=\"#ref-shmueli2010\">57<\/a><\/sup>. For prediction-focused models, internal validation can be performed using cross-validation or hold-out samples to assess performance and generalisability<sup><a href=\"#ref-coley2023\">58<\/a><\/sup>. For inference-oriented models, bootstrapping provides a way to evaluate optimism, adjust confidence intervals, and check the stability of parameter estimates. In practice, tools such as the caret<sup><a href=\"#ref-maxkuhn2008\">59<\/a><\/sup> or cv.glmnet<sup><a href=\"#ref-friedman2009\">60<\/a><\/sup> packages in R facilitate cross-validation for penalised models<sup><a href=\"#ref-song2021\">61<\/a><\/sup>, while the rms<sup><a href=\"#ref-harrell2022\">62<\/a><\/sup> package supports bootstrapping for inferential models. It is essential to report validation diagnostics in the manuscript, include relevant plots in supplementary materials, and document any influential observations that were removed or retained, ensuring transparency and reproducibility.<\/p>\n\n\n\n<p><em>Step 6: Interpretation and presentation<\/em><\/p>\n\n\n\n<p>Interpreting model results in a clear and actionable way is essential<sup><a href=\"#ref-persoskie2017\">63<\/a><\/sup>, especially for policy and decision-making. 
While odds ratios are standard outputs in logistic regression, they can be difficult to interpret when the outcome is common<sup><a href=\"#ref-sperandei2014\">64<\/a><\/sup>. Whenever possible, presenting predicted probabilities, absolute risks, or risk differences provides a more intuitive understanding of the effect sizes and facilitates practical decision-making<sup><a href=\"#ref-thompson2025\">65<\/a><\/sup>. Translating statistical findings into measures that are directly meaningful to stakeholders enhances the usability and impact of the research.<\/p>\n\n\n\n<p>Odds ratios and confidence intervals, R:<\/p>\n\n\n\n<p>library(broom)<br>tidy(mod_linear, exponentiate = TRUE, conf.int = TRUE)<\/p>\n\n\n\n<p># A tibble: 10 \u00d7 7<br>&nbsp;&nbsp; term&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; estimate std.error statistic&nbsp; p.value conf.low conf.high<br>&nbsp;&nbsp; &lt;chr&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;<br>&nbsp;1 (Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.132&nbsp;&nbsp;&nbsp; 0.131&nbsp;&nbsp;&nbsp; -15.5&nbsp;&nbsp; 6.11e-54&nbsp;&nbsp;&nbsp; 0.102&nbsp;&nbsp;&nbsp;&nbsp; 0.170<br>&nbsp;2 child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.841&nbsp;&nbsp;&nbsp; 0.0731&nbsp;&nbsp;&nbsp; -2.38&nbsp; 1.75e- 2&nbsp;&nbsp;&nbsp; 0.728&nbsp;&nbsp;&nbsp;&nbsp; 0.970<br>&nbsp;3 mother_educationpri\u2026&nbsp;&nbsp;&nbsp; 1.04&nbsp;&nbsp;&nbsp;&nbsp; 0.121&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.344 7.31e- 1&nbsp;&nbsp;&nbsp; 0.826&nbsp;&nbsp;&nbsp;&nbsp; 1.33<br>&nbsp;4 mother_educationsec\u2026&nbsp;&nbsp;&nbsp; 1.12&nbsp;&nbsp;&nbsp;&nbsp; 
0.145&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.809 4.18e- 1&nbsp;&nbsp;&nbsp; 0.848&nbsp;&nbsp;&nbsp;&nbsp; 1.50<br>&nbsp;5 mother_educationhig\u2026&nbsp;&nbsp;&nbsp; 0.967&nbsp;&nbsp;&nbsp; 0.308&nbsp;&nbsp;&nbsp;&nbsp; -0.108 9.14e- 1&nbsp;&nbsp;&nbsp; 0.515&nbsp;&nbsp;&nbsp;&nbsp; 1.73<br>&nbsp;6 wealth_quintilepoor\u2026&nbsp;&nbsp;&nbsp; 1.15&nbsp;&nbsp;&nbsp;&nbsp; 0.115&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.20&nbsp; 2.30e- 1&nbsp;&nbsp;&nbsp; 0.917&nbsp;&nbsp;&nbsp;&nbsp; 1.44<br>&nbsp;7 wealth_quintilemidd\u2026&nbsp;&nbsp;&nbsp; 1.19&nbsp;&nbsp;&nbsp;&nbsp; 0.117&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.46&nbsp; 1.45e- 1&nbsp;&nbsp;&nbsp; 0.943&nbsp;&nbsp;&nbsp;&nbsp; 1.49<br>&nbsp;8 wealth_quintilerich\u2026&nbsp;&nbsp;&nbsp; 1.37&nbsp;&nbsp;&nbsp;&nbsp; 0.119&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.63&nbsp; 8.55e- 3&nbsp;&nbsp;&nbsp; 1.08&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.73<br>&nbsp;9 wealth_quintilerich\u2026&nbsp;&nbsp;&nbsp; 1.34&nbsp;&nbsp;&nbsp;&nbsp; 0.141&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.07&nbsp; 3.85e- 2&nbsp;&nbsp;&nbsp; 1.01&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.76<br>10 urban1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.11&nbsp;&nbsp;&nbsp;&nbsp; 0.114&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.892 3.72e- 1&nbsp;&nbsp;&nbsp; 0.884&nbsp;&nbsp;&nbsp;&nbsp; 1.38<\/p>\n\n\n\n<p>Predicted probabilities and marginal effects, R:<\/p>\n\n\n\n<p>newdata &lt;- expand.grid(child_age = c(1, 2), mother_education = c(&quot;none&quot;, &quot;secondary&quot;),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wealth_quintile = c(&quot;poorest&quot;, &quot;richest&quot;), urban = c(0, 1))<br><br>newdata$urban &lt;- factor(newdata$urban, levels = levels(dhs_mw$urban))<br><br>newdata$pred &lt;- predict(mod_linear, newdata, type = &quot;response&quot;)<br>newdata<\/p>\n\n\n\n<p>&nbsp;&nbsp; child_age mother_education wealth_quintile urban&nbsp; 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pred<br>1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.10006454<br>2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.08547725<br>3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.11117341<br>4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.09513788<br>5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.12958323<br>6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.11122458<br>7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.14344683<br>8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 0 0.12340252<br>9&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
poorest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.10959612<br>10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.09376410<br>11&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.12162015<br>12&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poorest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.10425454<br>13&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.14148453<br>14&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.12167547<br>15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.15639262<br>16&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp; 1 0.13482409<\/p>\n\n\n\n<p><em>Common outcomes<\/em>: When the outcome of interest is common, odds ratios can overstate the magnitude of associations, making interpretation challenging. In such cases, reporting adjusted risk ratios can provide a more intuitive measure of effect<sup><a href=\"#ref-knol2011\">67<\/a><\/sup>. 
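<\/p>\n\n\n\n<p>For a single exposure, the relationship between the two measures can be sketched by hand: given a baseline risk p0 in the unexposed group, the risk ratio implied by an odds ratio is approximately OR \/ ((1 - p0) + p0 * OR). The values below are illustrative assumptions, not estimates from the models in this paper:<\/p>\n\n\n\n<p>or &lt;- 2.5&nbsp;&nbsp; # assumed adjusted odds ratio<br>p0 &lt;- 0.60&nbsp; # assumed baseline risk in the unexposed group<br>rr &lt;- or \/ ((1 - p0) + p0 * or)<br>rr&nbsp; # approximately 1.32, far smaller than the odds ratio suggests<\/p>\n\n\n\n<p>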
A practical approach is to use a Poisson regression model with robust standard errors<sup><a href=\"#ref-hilbe2009\">4<\/a><\/sup>, which yields approximate risk ratios that are easier for policymakers and other stakeholders to interpret and act upon. This method improves the clarity and usability of findings without compromising statistical validity.<\/p>\n\n\n\n<p>R, Poisson with robust SE:<\/p>\n\n\n\n<p>library(sandwich)<br>library(lmtest)<\/p>\n\n\n\n<p>Loading required package: zoo<\/p>\n\n\n\n<p><br>Attaching package: &#8216;zoo&#8217;<\/p>\n\n\n\n<p>The following objects are masked from &#8216;package:base&#8217;:<br><br>&nbsp;&nbsp;&nbsp; as.Date, as.Date.numeric<\/p>\n\n\n\n<p>mod_pois &lt;- glm(full_immunisation ~ child_age + mother_education + wealth_quintile + urban,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; family = poisson(link = &#8220;log&#8221;), data = dhs_mw)<br>coeftest(mod_pois, vcov = vcovHC(mod_pois, type = &#8220;HC0&#8221;))<\/p>\n\n\n\n<p><br>z test of coefficients:<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Estimate Std. 
Error&nbsp; z value&nbsp; Pr(&gt;|z|)&nbsp;&nbsp;&nbsp;<br>(Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -2.148666&nbsp;&nbsp; 0.117840 -18.2337 &lt; 2.2e-16 ***<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -0.149931&nbsp;&nbsp; 0.063185&nbsp; -2.3729&nbsp; 0.017649 *&nbsp;<br>mother_educationprimary&nbsp;&nbsp;&nbsp; 0.036564&nbsp;&nbsp; 0.105423&nbsp;&nbsp; 0.3468&nbsp; 0.728720&nbsp;&nbsp;&nbsp;<br>mother_educationsecondary&nbsp; 0.101486&nbsp;&nbsp; 0.125990&nbsp;&nbsp; 0.8055&nbsp; 0.420525&nbsp;&nbsp;&nbsp;<br>mother_educationhigher&nbsp;&nbsp;&nbsp; -0.026540&nbsp;&nbsp; 0.263740&nbsp; -0.1006&nbsp; 0.919846&nbsp;&nbsp;&nbsp;<br>wealth_quintilepoorer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.120942&nbsp;&nbsp; 0.100385&nbsp;&nbsp; 1.2048 &nbsp;0.228285&nbsp;&nbsp;&nbsp;<br>wealth_quintilemiddle&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.149702&nbsp;&nbsp; 0.101941&nbsp;&nbsp; 1.4685&nbsp; 0.141963&nbsp;&nbsp;&nbsp;<br>wealth_quintilericher&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.271568&nbsp;&nbsp; 0.102762&nbsp;&nbsp; 2.6427&nbsp; 0.008225 **<br>wealth_quintilerichest&nbsp;&nbsp;&nbsp;&nbsp; 0.253477&nbsp;&nbsp; 0.121070&nbsp;&nbsp; 2.0936&nbsp; 0.036292 *&nbsp;<br>urban1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.086229&nbsp;&nbsp; 0.094482&nbsp;&nbsp; 0.9126&nbsp; 0.361428&nbsp;&nbsp;&nbsp;<br>&#8212;<br>Signif. codes:&nbsp; 0 &#8216;***&#8217; 0.001 &#8216;**&#8217; 0.01 &#8216;*&#8217; 0.05 &#8216;.&#8217; 0.1 &#8216; &#8216; 1<\/p>\n\n\n\n<p><em>Tables and figures<\/em>: Effective presentation of results is critical for clarity and impact. Tables should clearly display adjusted estimates, sample sizes, number of events, and key model fit statistics<sup><a href=\"#ref-dwivedi2022\">68<\/a><\/sup>. 
Supplementary tables can provide a complete list of variables considered, along with the code used to generate the final model, supporting transparency and reproducibility. Visualisations, such as predicted probability plots, are particularly valuable for communicating results to policymakers, translating complex statistical outputs into intuitive insights that can inform decision-making<sup><a href=\"#ref-padilla2018\">69<\/a><\/sup>.<\/p>\n\n\n\n<p><em>Language and claims<\/em>: When reporting results from observational studies, careful attention should be paid to the language used<sup><a href=\"#ref-olarte2021\">70<\/a><\/sup>. Causal claims should be avoided unless the study design and methods explicitly support them<sup><a href=\"#ref-STERRANTINO2024100415\">71<\/a><\/sup>. Instead of stating that \u201cX causes Y,\u201d it is more appropriate to describe findings as \u201cX is associated with Y.\u201d For prediction-focused analyses, terms such as \u201cpredicts\u201d or \u201cforecast\u201d are suitable, reflecting the model\u2019s purpose without implying causation. Using cautious and precise language enhances credibility and prevents overinterpretation of the results<sup><a href=\"#ref-ito2025\">17<\/a><\/sup>.<\/p>\n\n\n\n<p><strong>Practical checklist to finish modelling<\/strong><\/p>\n\n\n\n<p>Before you write results, tick these items:<\/p>\n\n\n\n<ul><li><em>Outcome coded and documented, 0\/1, event defined<\/em>: Outcomes should be coded consistently as 0 and 1, with the event clearly defined. All recoding steps must be documented to ensure transparency and reproducibility.<\/li><li><em>Missing data handled, and method reported<\/em>: Patterns of missing data should be explored, appropriate methods applied, and the chosen approach clearly reported. 
Multiple imputation is often preferred over complete-case analysis when missingness is substantial.<\/li><li><em>Survey design accounted for when needed<\/em>: When working with complex survey data, stratification, clustering, and weights should be accounted for using survey-aware methods to produce valid estimates and standard errors.<\/li><li><em>EPV checked, penalisation used if EPV is low<\/em>: The ratio of outcome events to predictors should be checked. If EPV is low, the number of predictors can be reduced, categories collapsed, or penalised methods such as lasso or Firth regression applied.<\/li><li><em>Linearity in logit assessed, non-linear terms used when needed<\/em>: Continuous predictors should be assessed for linearity in the logit. Non-linear relationships can be accommodated using splines or fractional polynomials to improve model fit.<\/li><li><em>Interactions tested only when sensible<\/em>: Interactions should only be included when there is a plausible rationale. Testing and visualising interactions aids interpretation, such as examining how effects differ by subgroups like urban versus rural residents.<\/li><li><em>Diagnostics run, AUC and calibration reported, influential points checked<\/em>: Model diagnostics should be conducted thoroughly. Report discrimination (AUC), calibration, residuals, and check for influential points to ensure model robustness.<\/li><li><em>Results presented with ORs or RRs, predicted probabilities and clear interpretation<\/em>: Present results with interpretable measures such as odds ratios, risk ratios, predicted probabilities, or risk differences. 
Clear presentation helps stakeholders and policymakers understand and act on findings.<\/li><li><em>Code and data workflow saved for reproducibility<\/em>: All code, workflows, and key decisions should be saved and documented to allow others to replicate the analysis and verify results.<\/li><\/ul>\n\n\n\n<p><strong>Common pitfalls and how to avoid them<\/strong><\/p>\n\n\n\n<p>Despite its power and flexibility, logistic regression is often misapplied in clinical and public health research. These errors can weaken the validity of findings, obscure important associations, and sometimes lead to misleading policy recommendations<sup><a href=\"#ref-ranganathan2017\">43<\/a><\/sup>. Below, we outline frequent pitfalls, illustrate them with examples from global and African contexts, and provide practical strategies to avoid them, complete with R code snippets for immediate use. By recognising these common pitfalls and applying these practical solutions, health researchers can ensure their logistic regression analyses are robust, interpretable, and useful for decision-making. This is especially critical in settings where data and analytical challenges are common.<\/p>\n\n\n\n<ul><li><em>Misunderstanding the odds ratio<\/em> Odds ratios can overstate the magnitude of associations when outcomes are common, typically above 10%<sup><a href=\"#ref-sheldrick2017\">73<\/a><\/sup>. For instance, in Malawi, facility delivery rates are often high, so ORs may exaggerate the perceived effect. 
To avoid misinterpretation, it is advisable to report marginal effects, predicted probabilities, or adjusted risk ratios alongside ORs, and to provide clear explanations of what each measure represents.<\/li><\/ul>\n\n\n\n<p># Display odds ratios with confidence intervals<br>library(broom)<br>mod_logistic = mod_int<br>tidy(mod_logistic, exponentiate = TRUE, conf.int = TRUE)<\/p>\n\n\n\n<p># A tibble: 13 \u00d7 7<br>&nbsp;&nbsp; term&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; estimate std.error statistic&nbsp; p.value conf.low conf.high<br>&nbsp;&nbsp; &lt;chr&gt;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;&nbsp;&nbsp;&nbsp;&nbsp; &lt;dbl&gt;<br>&nbsp;1 (Intercept)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.136&nbsp;&nbsp;&nbsp; 0.133&nbsp;&nbsp; -15.1&nbsp;&nbsp;&nbsp; 2.82e-51&nbsp;&nbsp;&nbsp; 0.104&nbsp;&nbsp;&nbsp;&nbsp; 0.175<br>&nbsp;2 child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.840&nbsp;&nbsp;&nbsp; 0.0731&nbsp;&nbsp; -2.39&nbsp;&nbsp; 1.69e- 2&nbsp;&nbsp;&nbsp; 0.727&nbsp;&nbsp;&nbsp;&nbsp; 0.969<br>&nbsp;3 mother_educationpri\u2026&nbsp;&nbsp;&nbsp; 1.01&nbsp;&nbsp;&nbsp;&nbsp; 0.125&nbsp;&nbsp;&nbsp;&nbsp; 0.0674 9.46e- 1&nbsp;&nbsp;&nbsp; 0.794&nbsp;&nbsp;&nbsp;&nbsp; 1.29<br>&nbsp;4 mother_educationsec\u2026&nbsp;&nbsp;&nbsp; 1.10&nbsp;&nbsp;&nbsp;&nbsp; 0.156&nbsp;&nbsp;&nbsp;&nbsp; 0.596&nbsp; 5.51e- 1&nbsp;&nbsp;&nbsp; 0.809&nbsp;&nbsp;&nbsp;&nbsp; 1.49<br>&nbsp;5 mother_educationhig\u2026&nbsp;&nbsp;&nbsp; 1.01&nbsp;&nbsp;&nbsp;&nbsp; 0.506&nbsp;&nbsp;&nbsp;&nbsp; 0.0191 9.85e- 1&nbsp;&nbsp;&nbsp; 0.332&nbsp;&nbsp;&nbsp;&nbsp; 2.52<br>&nbsp;6 
urban1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.671&nbsp;&nbsp;&nbsp; 0.542&nbsp;&nbsp;&nbsp; -0.736&nbsp; 4.62e- 1&nbsp;&nbsp;&nbsp; 0.197&nbsp;&nbsp;&nbsp;&nbsp; 1.74<br>&nbsp;7 wealth_quintilepoor\u2026&nbsp;&nbsp;&nbsp; 1.15&nbsp;&nbsp;&nbsp;&nbsp; 0.115&nbsp;&nbsp;&nbsp;&nbsp; 1.23&nbsp;&nbsp; 2.20e- 1&nbsp;&nbsp;&nbsp; 0.919&nbsp;&nbsp;&nbsp;&nbsp; 1.44<br>&nbsp;8 wealth_quintilemidd\u2026&nbsp;&nbsp;&nbsp; 1.19&nbsp;&nbsp;&nbsp;&nbsp; 0.117&nbsp;&nbsp;&nbsp;&nbsp; 1.47&nbsp;&nbsp; 1.41e- 1&nbsp;&nbsp;&nbsp; 0.944&nbsp;&nbsp;&nbsp;&nbsp; 1.50<br>&nbsp;9 wealth_quintilerich\u2026&nbsp;&nbsp;&nbsp; 1.37&nbsp;&nbsp;&nbsp;&nbsp; 0.120&nbsp;&nbsp;&nbsp;&nbsp; 2.62&nbsp;&nbsp; 8.72e- 3&nbsp;&nbsp;&nbsp; 1.08&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.73<br>10 wealth_quintilerich\u2026&nbsp;&nbsp;&nbsp; 1.34&nbsp;&nbsp;&nbsp;&nbsp; 0.141&nbsp;&nbsp;&nbsp;&nbsp; 2.07&nbsp;&nbsp; 3.88e- 2&nbsp;&nbsp;&nbsp; 1.01&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.76<br>11 mother_educationpri\u2026&nbsp;&nbsp;&nbsp; 1.73&nbsp;&nbsp;&nbsp;&nbsp; 0.555&nbsp;&nbsp;&nbsp;&nbsp; 0.985&nbsp; 3.24e- 1&nbsp;&nbsp;&nbsp; 0.646&nbsp;&nbsp;&nbsp;&nbsp; 6.02<br>12 mother_educationsec\u2026&nbsp;&nbsp;&nbsp; 1.64&nbsp;&nbsp;&nbsp;&nbsp; 0.564&nbsp;&nbsp;&nbsp;&nbsp; 0.877&nbsp; 3.80e- 1&nbsp;&nbsp;&nbsp; 0.600&nbsp;&nbsp;&nbsp;&nbsp; 5.79<br>13 mother_educationhig\u2026&nbsp;&nbsp;&nbsp; 1.49&nbsp;&nbsp;&nbsp;&nbsp; 0.796&nbsp;&nbsp;&nbsp;&nbsp; 0.504&nbsp; 6.14e- 1 &nbsp;&nbsp;&nbsp;0.338&nbsp;&nbsp;&nbsp;&nbsp; 7.98<\/p>\n\n\n\n<p># Calculate predicted probabilities (marginal effects)<br>newdata &lt;- data.frame(<br>&nbsp; mother_education = c(&#8220;none&#8221;, &#8220;primary&#8221;),<br>&nbsp; urban = c(0, 1),<br>&nbsp; wealth_quintile = c(&#8220;richest&#8221;, &#8220;middle&#8221;),<br>&nbsp; child_age = c(1, 1)<br>)<br>newdata$urban &lt;- factor(newdata$urban, levels = levels(dhs_mw$urban))<br><br>newdata$pred_prob &lt;- predict(mod_logistic, 
newdata, type = &#8220;response&#8221;)<br>print(newdata)<\/p>\n\n\n\n<p>&nbsp; mother_education urban wealth_quintile child_age pred_prob<br>1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; richest&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 0.1324355<br>2&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;primary&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middle&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1 0.1368518<\/p>\n\n\n\n<ul><li><em>Ignoring linearity in the logit for continuous predictors<\/em> Failing to check whether continuous variables have a linear relationship with the log odds can lead to biased coefficient estimates and misinterpretation<sup><a href=\"#ref-norton2018\">74<\/a><\/sup>. To address this, the relationship should be assessed using graphical methods, and non-linear terms such as splines or fractional polynomials can be incorporated when needed.<\/li><\/ul>\n\n\n\n<p># library(splines)<br># dhs_mw$mother_age &lt;- as.numeric(dhs_mw$mother_age)<br># mod_linear &lt;- glm(full_immunisation&nbsp; ~ mother_age + child_age, family = binomial, data = dhs_mw)<br># mod_spline &lt;- glm(full_immunisation&nbsp; ~ ns(mother_age, df = 4) + child_age, family = binomial, data = dhs_mw)<br># anova(mod_linear, mod_spline, test = &#8220;Chisq&#8221;)<\/p>\n\n\n\n<ul><li><em>Overfitting the model<\/em> Including too many predictors relative to the number of outcome events can produce unstable estimates and reduce generalisability. 
To avoid overfitting, maintain at least 10 EPV, or, when this is not possible, use penalised regression methods such as lasso, ridge, or Firth logistic regression<sup><a href=\"#ref-heinze2020\">76<\/a><\/sup>.<\/li><\/ul>\n\n\n\n<p># # Calculate EPV<br># events &lt;- sum(df$outcome == 1)<br># num_predictors &lt;- length(coef(mod_logistic)) - 1<br># epv &lt;- events \/ num_predictors<br># print(epv)<br>#<br># # Example of penalised logistic regression with LASSO<br># library(glmnet)<br># x &lt;- model.matrix(outcome ~ ., df)[,-1]<br># y &lt;- df$outcome<br># cvfit &lt;- cv.glmnet(x, y, family = &#8220;binomial&#8221;, alpha = 1)<br># coef(cvfit, s = &#8220;lambda.min&#8221;)<\/p>\n\n\n\n<ul><li><em>Omitted variable bias<\/em> Failing to include important confounders can produce biased effect estimates and misleading conclusions. To avoid this, predictors should be selected based on domain knowledge, prior literature, or formal tools such as directed acyclic graphs (DAGs) to identify key confounding variables<sup><a href=\"#ref-pearl2016\">78<\/a><\/sup>.<\/li><\/ul>\n\n\n\n<p>No direct code, but ensure confounders are in your model syntax, e.g.: <em>glm(outcome ~ exposure + confounder1 + confounder2, family = binomial, data = df)<\/em><\/p>\n\n\n\n<ul><li><em>Multicollinearity<\/em> Highly correlated predictors can inflate standard errors and destabilise coefficient estimates, reducing model reliability. 
To address multicollinearity, variance inflation factors (VIF) should be checked, and highly correlated variables can be removed, combined, or otherwise reduced through dimensionality reduction techniques.<\/li><\/ul>\n\n\n\n<p>library(car)<br>vif(mod_logistic)<\/p>\n\n\n\n<p>there are higher-order terms (interactions) in this model<br>consider setting type = &#8216;predictor&#8217;; see ?vif<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GVIF Df GVIF^(1\/(2*Df))<br>child_age&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.003608&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.001802<br>mother_education&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.600133&nbsp; 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.332600<br>urban&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 32.683686&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5.716965<br>wealth_quintile&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.737765&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.071517<br>mother_education:urban 140.361228&nbsp; 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.279683<\/p>\n\n\n\n<ul><li><em>Complete or quasi-complete separation<\/em> When a predictor perfectly or nearly perfectly predicts the outcome, standard logistic regression cannot estimate coefficients reliably, leading to model failure. 
To address this, penalised regression methods such as Firth\u2019s logistic regression can be used, providing stable estimates even in the presence of separation<sup><a href=\"#ref-fekadu2020\">81<\/a><\/sup>.<\/li><\/ul>\n\n\n\n<p># # Firth logistic regression using the logistf package<br># library(logistf)<br>#<br># firth_mod &lt;- logistf(full_immunisation ~ mother_education + wealth_quintile + urban + child_age, data = dhs_mw)<br># summary(firth_mod)<\/p>\n\n\n\n<ul><li><em>Ignoring interaction effects<\/em> Failing to include meaningful interactions can conceal important differences in how predictors affect the outcome across subgroups<sup><a href=\"#ref-greenland2008\">83<\/a><\/sup>. To address this, interactions should be tested and included when supported by theory or prior evidence, and results should be visualised to aid interpretation.<\/li><\/ul>\n\n\n\n<p># mod_interaction &lt;- glm(outcome ~ predictor1 * predictor2 + other_vars, family = binomial, data = df)<br># summary(mod_interaction)<\/p>\n\n\n\n<ul><li><em>Poor handling of missing data<\/em><\/li><\/ul>\n\n\n\n<p>Relying solely on complete-case analysis can produce biased results when data are not missing completely at random<sup><a href=\"#ref-little2019\">85<\/a><\/sup>. To avoid this, patterns of missingness should be explored, and appropriate methods such as multiple imputation<sup><a href=\"#ref-azur2011\">86<\/a><\/sup> should be applied to reduce bias and maintain statistical power.<\/p>\n\n\n\n<p># library(mice)<br># md.pattern(df)<br># imp &lt;- mice(df, m = 5, method = &#8216;pmm&#8217;, seed = 123)<br># fit_imp &lt;- with(imp, glm(outcome ~ predictors, family = binomial))<br># pool(fit_imp)<\/p>\n\n\n\n<ul><li><em>Inadequate model diagnostics<\/em><\/li><\/ul>\n\n\n\n<p>Failing to assess model fit and predictive performance can lead to misleading conclusions<sup><a href=\"#ref-ATM29812\">87<\/a><\/sup>. 
To avoid this, a combination of diagnostics should be used, including the Hosmer\u2013Lemeshow test, ROC\/AUC for discrimination, calibration plots, and checks of residuals and influential points, ensuring the model is robust and reliable.<\/p>\n\n\n\n<p>library(ResourceSelection)<br>hoslem.test(dhs_mw$full_immunisation, fitted(mod_logistic), g = 10)<\/p>\n\n\n\n<p><br>&nbsp;&nbsp;&nbsp; Hosmer and Lemeshow goodness of fit (GOF) test<br><br>data:&nbsp; dhs_mw$full_immunisation, fitted(mod_logistic)<br>X-squared = 4.4301, df = 8, p-value = 0.8164<\/p>\n\n\n\n<p>library(pROC)<br>roc_obj &lt;- roc(dhs_mw$full_immunisation, fitted(mod_logistic))<\/p>\n\n\n\n<p>Setting levels: control = 0, case = 1<\/p>\n\n\n\n<p>Setting direction: controls &lt; cases<\/p>\n\n\n\n<p>auc(roc_obj)<\/p>\n\n\n\n<p>Area under the curve: 0.5488<\/p>\n\n\n\n<p>plot(roc_obj)<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"758\" height=\"606\" src=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-5.png\" alt=\"\" class=\"wp-image-13529\" srcset=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-5.png 758w, https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-5-300x240.png 300w\" sizes=\"(max-width: 758px) 100vw, 758px\" \/><\/figure>\n\n\n\n<p><strong>Figure 2: Sensitivity and specificity of the fitted model<\/strong><\/p>\n\n\n\n<ul><li><em>Over-reliance on P-values<\/em><\/li><\/ul>\n\n\n\n<p>Focusing solely on statistical significance can obscure the practical importance of findings<sup><a href=\"#ref-Greenland2016\">88<\/a><\/sup>. 
To avoid this, effect sizes and confidence intervals should be reported alongside p-values, and the discussion should emphasise real-world relevance and policy implications.<\/p>\n\n\n\n<p>No code needed but ensure your results tables include estimates with 95% confidence intervals and interpret them clearly.<\/p>\n\n\n\n<p><strong>Policy and Practice relevance: <\/strong>Logistic regression in clinical and public health research is not an end in itself. The ultimate goal is to inform policies and interventions that improve health outcomes<sup><a href=\"#ref-Olowe2024\">89<\/a><\/sup>. This section highlights how to interpret, communicate, and apply logistic regression findings to support evidence-based decision-making<sup><a href=\"#ref-Zardo2014\">90<\/a><\/sup>, with examples from African and global health contexts.<\/p>\n\n\n\n<ul><li><em>From Odds Ratios to Meaningful Measures<\/em><\/li><\/ul>\n\n\n\n<p>While odds ratios are standard outputs in logistic regression, they can be abstract and unintuitive, especially for non-technical audiences. Expressing results as predicted probabilities, absolute risks, or risk differences provides a clearer, more actionable view of the findings, making it easier for policymakers to understand and use the information for decision-making<sup><a href=\"#ref-Osborne2015\">91<\/a><\/sup>. Example: In Malawi\u2019s DHS analysis on childhood immunisation, an OR of 2.5 for maternal secondary education might sound large. 
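<\/p>\n\n\n\n<p>The arithmetic behind such a translation is straightforward; the baseline coverage of 60% below is an illustrative assumption, not a model estimate:<\/p>\n\n\n\n<p>p0 &lt;- 0.60&nbsp;&nbsp;&nbsp; # assumed baseline probability of full immunisation<br>or &lt;- 2.5&nbsp;&nbsp;&nbsp;&nbsp; # odds ratio for maternal secondary education<br>odds1 &lt;- or * p0 \/ (1 - p0)<br>p1 &lt;- odds1 \/ (1 + odds1)<br>p1&nbsp; # approximately 0.79<\/p>\n\n\n\n<p>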
But translating this into a predicted probability increase (e.g., from 60% to roughly 79% immunisation coverage) better illustrates the potential impact of education-focused programs.<\/p>\n\n\n\n<p>R code to calculate and plot predicted probabilities:<\/p>\n\n\n\n<p>library(ggplot2)<br># Create new data for different maternal education levels<br>newdata &lt;- data.frame(<br>\u00a0 mother_education = factor(c(&#8220;none&#8221;, &#8220;primary&#8221;, &#8220;secondary&#8221;, &#8220;higher&#8221;),<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0levels = c(&#8220;none&#8221;, &#8220;primary&#8221;, &#8220;secondary&#8221;, &#8220;higher&#8221;)),<br>\u00a0 child_age = mean(dhs_mw$child_age, na.rm = TRUE),<br>\u00a0 wealth_quintile = &#8220;middle&#8221;,<br>\u00a0 urban = 0<br>)<br><br># Match the factor structure of the model<br>newdata$urban &lt;- factor(newdata$urban, levels = levels(dhs_mw$urban))<br><br><br># Predict probabilities<br>newdata$pred_prob &lt;- predict(mod_logistic, newdata, type = &#8220;response&#8221;)<br><br># Plot predicted probabilities<br>ggplot(newdata, aes(x = mother_education, y = pred_prob)) +<br>\u00a0 geom_point(size = 3) +<br>\u00a0 ylim(0,1) +<br>\u00a0 labs(title = &#8220;Predicted Probability of Full Immunisation by Maternal Education&#8221;,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 x = &#8220;Maternal Education Level&#8221;,<br>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 y = &#8220;Predicted Probability&#8221;) +<br>\u00a0 theme_minimal()<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"957\" height=\"567\" src=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-6.png\" alt=\"\" class=\"wp-image-13530\" srcset=\"https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-6.png 957w, https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-6-300x178.png 300w, 
https:\/\/www.mmj.mw\/wp-content\/uploads\/2025\/09\/image-6-768x455.png 768w\" sizes=\"(max-width: 957px) 100vw, 957px\" \/><\/figure>\n\n\n\n<p><strong>Figure 3: Predicted probability of full immunisation by maternal education<\/strong><\/p>\n\n\n\n<ul><li><em>Identifying high-risk groups for targeted interventions<\/em> Logistic regression is a valuable tool for identifying populations at higher risk of adverse outcomes<sup><a href=\"#ref-Muller2014\">92<\/a><\/sup>. By highlighting groups with elevated predicted probabilities, it helps policymakers and program managers target interventions more effectively, ensuring that resources are directed where they are most needed<sup><a href=\"#ref-Shipe2019\">93<\/a><\/sup>. Example: In Malawi DHS data (2004-2016), logistic regression revealed that informal employment, being male, and low education significantly reduced the odds of HIV testing uptake<sup><a href=\"#ref-Ngambi2020HIVTestingMalawi\">94<\/a><\/sup>. Such evidence supports outreach campaigns in rural areas targeting low-education groups.<\/li><\/ul>\n\n\n\n<p>R code snippet to generate subgroup predicted probabilities:<\/p>\n\n\n\n<p># Define groups for prediction<br>newdata &lt;- expand.grid(<br>&nbsp; urban = c(0, 1),<br>&nbsp; mother_education = factor(c(&#8220;none&#8221;, &#8220;secondary&#8221;), levels = levels(dhs_mw$mother_education)),<br>&nbsp; child_age = mean(dhs_mw$child_age, na.rm = TRUE),<br>&nbsp; wealth_quintile = &#8220;middle&#8221;<br>)<br><br># Match the factor structure of the model<br>newdata$urban &lt;- factor(newdata$urban, levels = levels(dhs_mw$urban))<br><br>newdata$pred_prob &lt;- predict(mod_logistic, newdata, type = &#8220;response&#8221;)<br>print(newdata)<\/p>\n\n\n\n<p>&nbsp; urban mother_education child_age wealth_quintile&nbsp; pred_prob<br>1&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none 0.5003077&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middle 0.12880232<br>2&nbsp;&nbsp;&nbsp;&nbsp; 
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; none 0.5003077&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middle 0.09029033<br>3&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary 0.5003077&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middle 0.13960206<br>4&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; secondary 0.5003077&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; middle 0.15160838<\/p>\n\n\n\n<ul><li><em>Informing resource allocation<\/em><\/li><\/ul>\n\n\n\n<p>Quantifying how predictors influence predicted probabilities can directly inform resource allocation. For example, if wealth quintile strongly affects hypertension treatment in Zambia<sup><a href=\"#ref-Mwale2025\">95<\/a><\/sup>, programs can target support to lower-income groups. Policymakers often prefer absolute differences in risk or predicted probabilities rather than odds ratios, as these measures clearly indicate the potential impact of interventions and facilitate practical decision-making.<\/p>\n\n\n\n<p>R code to calculate risk differences:<\/p>\n\n\n\n<p>library(margins)<\/p>\n\n\n\n<p>mod_logistic &lt;- glm(full_immunisation ~ wealth_quintile + child_age, family = binomial, data = dhs_mw)<br><br># Calculate marginal effects (risk differences)<br>marg_eff &lt;- margins(mod_logistic)<br>summary(marg_eff)<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; factor&nbsp;&nbsp;&nbsp;&nbsp; AME&nbsp;&nbsp;&nbsp;&nbsp; SE&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; z&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; p&nbsp;&nbsp; lower&nbsp;&nbsp; upper<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child_age -0.0202 0.0085 -2.3872 0.0170 -0.0368 -0.0036<br>&nbsp; wealth_quintilemiddle&nbsp; 0.0197 0.0125&nbsp; 1.5784 0.1145 
-0.0048&nbsp; 0.0442<br>&nbsp; wealth_quintilepoorer&nbsp; 0.0153 0.0120&nbsp; 1.2725 0.2032 -0.0083&nbsp; 0.0389<br>&nbsp; wealth_quintilericher&nbsp; 0.0394 0.0133&nbsp; 2.9552 0.0031&nbsp; 0.0133&nbsp; 0.0655<br>&nbsp;wealth_quintilerichest&nbsp; 0.0442 0.0135&nbsp; 3.2746 0.0011 &nbsp;0.0177&nbsp; 0.0706<\/p>\n\n\n\n<ul><li><em>Presenting results to non-technical audiences<\/em><\/li><\/ul>\n\n\n\n<p>Clear visualisation and simple language improve uptake of findings.<\/p>\n\n\n\n<ul><li>Use predicted probability plots rather than tables of ORs alone.<\/li><li>Translate statistics into plain English: e.g., \u201cChildren of mothers with secondary education are 25 percentage points more likely to be fully immunised.\u201d<\/li><li>Highlight policy-relevant variables.<\/li><li>Discuss limitations and uncertainties openly.<\/li><li><em>Supporting program evaluation and monitoring<\/em> Repeated logistic regression analyses on program data can help track progress over time, identify barriers to implementation, and inform mid-course corrections<sup><a href=\"#ref-kirkwood2003\">42<\/a><\/sup>. By quantifying changes in predicted probabilities or risk differences, programs can evaluate effectiveness and adjust strategies to better reach high-risk populations<sup><a href=\"#ref-Muller2014b\">96<\/a><\/sup>. Example: In a malaria control program in Nigeria, logistic regression showed bed net ownership strongly predicted prevention behaviour. Over time, logistic regression helped evaluate whether increased net distribution translated into behaviour change.<\/li><\/ul>\n\n\n\n<p><strong>Cautions in Policy and Application:<\/strong><\/p>\n\n\n\n<ul><li><em>Associations do not equal causation<\/em>: Logistic regression applied to observational data identifies associations and potential targets, but it does not establish causal relationships<sup><a href=\"#ref-harrell2015\">8<\/a><\/sup>. 
Findings should be interpreted cautiously, and causal claims should only be made if supported by study design or additional evidence. Clear language helps prevent misinterpretation and overstatement of results<sup><a href=\"#ref-DAmico2025\">97<\/a><\/sup>.<\/li><li><em>Consider confounding and bias<\/em>: When interpreting regression results for policy decisions, it is essential to account for potential confounding and other sources of bias<sup><a href=\"#ref-Mathur2022\">98<\/a><\/sup>. Associations observed in the data may be influenced by unmeasured or inadequately controlled factors, and failing to consider these can lead to inappropriate recommendations<sup><a href=\"#ref-DAmico2025\">97<\/a><\/sup>. Careful model specification, adjustment for key confounders, and transparent reporting strengthen the reliability of findings for decision-making.<\/li><li><em>Community engagement<\/em>: Logistic regression provides valuable quantitative evidence, but combining these results with qualitative insights and community engagement enriches interpretation<sup><a href=\"#ref-Noyes2019\">99<\/a><\/sup>. Local and expert knowledge can contextualise statistical associations, highlight barriers or enablers not captured in the data, and guide the design of interventions that are culturally appropriate and feasible, ensuring that policies are evidence-informed, expert-informed, and locally relevant<sup><a href=\"#ref-Khatri2024\">100<\/a><\/sup>.<\/li><\/ul>\n\n\n\n<h4><a><strong><span class=\"has-inline-color has-black-color\">Conclusion and Recommendations<\/span><\/strong><\/a><\/h4>\n\n\n\n<p>Logistic regression remains a fundamental tool in clinical and public health research, offering insights into factors associated with key health outcomes. Its results become truly valuable when translated into understandable, actionable evidence. 
Using predicted probabilities, risk differences, and subgroup analyses helps policymakers prioritise interventions, allocate resources efficiently, and communicate findings effectively. Including clear visualisations and plain-language interpretations enhances impact in global and African health settings. When applied carefully, logistic regression can guide policy and practice across diverse settings, including complex health systems. This paper has presented a practical guide to the core considerations in logistic regression modelling. We highlighted essential steps, from defining clear research questions and preparing data thoughtfully, through choosing variables wisely and specifying models correctly, to conducting thorough diagnostics. Common pitfalls such as overfitting, misinterpreting odds ratios, ignoring interactions, and mishandling missing data were identified, alongside concrete ways to avoid them. Finally, we emphasised translating statistical results into meaningful, actionable evidence for policymakers and practitioners.<\/p>\n\n\n\n<p>Key recommendations for researchers and practitioners include:<\/p>\n\n\n\n<ul><li><em>Plan carefully from the start<\/em>: Define your outcome clearly, understand your data\u2019s limitations, and anticipate sample size needs to ensure robust models.<\/li><li><em>Prepare and code variables thoughtfully<\/em>: Treat continuous predictors properly, handle missing data with modern methods such as multiple imputation, and account for survey design complexities.<\/li><li><em>Use variable selection strategies that balance theory and data-driven approaches<\/em>: Avoid automatic stepwise selection without validation.<\/li><li><em>Check model assumptions rigorously<\/em>: Evaluate linearity, interactions, collinearity, and separation issues; adjust models accordingly.<\/li><li><em>Validate your models<\/em>: Use internal validation techniques such as bootstrapping or cross-validation to assess reliability and generalisability.<\/li><li><em>Present results 
clearly and meaningfully<\/em>: Go beyond odds ratios\u2014provide predicted probabilities, risk differences, and subgroup analyses to inform practical decisions.<\/li><li><em>Communicate findings in plain language with supportive visuals<\/em>: This enhances uptake by non-technical audiences, including policymakers, program managers, and communities.<\/li><li><em>Recognise the limits of observational data<\/em>: Use logistic regression results as one part of the evidence, integrating other study designs and community perspectives.<\/li><\/ul>\n\n\n\n<p>By following these principles, global and African public health researchers can harness logistic regression to generate credible, actionable evidence. This evidence can support the design, targeting, and evaluation of interventions that improve health outcomes and reduce inequities.<\/p>\n\n\n\n<p><strong>Funding<\/strong><\/p>\n\n\n\n<p>None<\/p>\n\n\n\n<p><strong>Competing interests<\/strong><\/p>\n\n\n\n<p>We declare no competing interests.<\/p>\n\n\n\n<p><strong>Availability of data<\/strong><\/p>\n\n\n\n<p>All data are available within the manuscript.<\/p>\n\n\n\n<p><strong>Author contributions<\/strong><\/p>\n\n\n\n<p>WFN = Conceptualization and drafting of the paper and code. CZ and ASM = Reviewing the paper and the code. Both authors reviewed and approved the paper together with the code.<\/p>\n\n\n\n<h4><a><span class=\"has-inline-color has-black-color\">References<\/span><\/a><\/h4>\n\n\n\n<p><a>1. Schober, P., &amp; Vetter, T. R. (2021). Logistic regression in medical research. <em>Anesthesia and Analgesia<\/em>, <em>132<\/em>(2), 365\u2013366. <\/a><a href=\"https:\/\/doi.org\/10.1213\/ANE.0000000000005247\"><em>https:\/\/doi.org\/10.1213\/ANE.0000000000005247<\/em><\/a><\/p>\n\n\n\n<p><a>2. Gallis, J. A., &amp; Turner, E. L. (2019). Relative measures of association for binary outcomes: Challenges and recommendations for the global health researcher. <em>Annals of Global Health<\/em>, <em>85<\/em>(1), 137. 
<\/a><a href=\"https:\/\/doi.org\/10.5334\/aogh.2581\"><em>https:\/\/doi.org\/10.5334\/aogh.2581<\/em><\/a><\/p>\n\n\n\n<p><a>3. Vittinghoff, E., Glidden, D. V., Shiboski, S. C., &amp; McCulloch, C. E. (2012). <em>Regression methods in biostatistics: Linear, logistic, survival, and repeated measures models<\/em>. Springer. <\/a><a href=\"https:\/\/doi.org\/10.1007\/978-1-4614-1353-0\"><em>https:\/\/doi.org\/10.1007\/978-1-4614-1353-0<\/em><\/a><\/p>\n\n\n\n<p><a>4. Hilbe, J. M. (2009). <em>Logistic regression models<\/em> (1st ed.). Chapman &amp; Hall.<\/a><\/p>\n\n\n\n<p><a>5. Pearce, N. (2016). Analysis of matched case-control studies. <em>BMJ (Clinical Research Ed.)<\/em>, <em>352<\/em>, i969. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmj.i969\"><em>https:\/\/doi.org\/10.1136\/bmj.i969<\/em><\/a><\/p>\n\n\n\n<p><a>6. Rothman, K. J. (2008). <em>Epidemiology: An introduction<\/em> (2nd ed.). Oxford University Press.<\/a><\/p>\n\n\n\n<p><a>7. Vandenbroucke, J. P., &amp; Pearce, N. (2007). Case-control studies: Basic concepts. <em>International Journal of Epidemiology<\/em>, <em>36<\/em>(5), 948\u2013952. <\/a><a href=\"https:\/\/doi.org\/10.1093\/ije\/dym207\"><em>https:\/\/doi.org\/10.1093\/ije\/dym207<\/em><\/a><\/p>\n\n\n\n<p><a>8. Harrell, F. E. (2015). Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis. <em>Springer Series in Statistics<\/em>, <em>39<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1007\/978-3-319-19425-7\"><em>https:\/\/doi.org\/10.1007\/978-3-319-19425-7<\/em><\/a><\/p>\n\n\n\n<p><a>9. Pourhoseingholi, M. A., Baghestani, A. R., &amp; Vahedi, M. (2012). How to control confounding effects by statistical analysis. <em>Gastroenterology and Hepatology from Bed to Bench<\/em>, <em>5<\/em>(2), 79\u201383.<\/a><\/p>\n\n\n\n<p><a>10. Franzen, S. R., Chandler, C., &amp; Lang, T. (2017). Health research capacity development in low and middle income countries: Reality or rhetoric? 
A systematic meta-narrative review of the qualitative literature. <em>BMJ Open<\/em>, <em>7<\/em>(1), e012332. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmjopen-2016-012332\"><em>https:\/\/doi.org\/10.1136\/bmjopen-2016-012332<\/em><\/a><\/p>\n\n\n\n<p><a>11. Sheffel, A., Wilson, E., Munos, M., &amp; Zeger, S. (2019). Methods for analysis of complex survey data: An application using the Tanzanian 2015 Demographic and Health Survey and Service Provision Assessment. <em>Journal of Global Health<\/em>, <em>9<\/em>(2), 020902. <\/a><a href=\"https:\/\/doi.org\/10.7189\/jogh.09.020902\"><em>https:\/\/doi.org\/10.7189\/jogh.09.020902<\/em><\/a><\/p>\n\n\n\n<p><a>12. Simundi\u0107, A.-M. (2013). Bias in research. <em>Biochemia Medica (Zagreb)<\/em>, <em>23<\/em>(1), 12\u201315. <\/a><a href=\"https:\/\/doi.org\/10.11613\/bm.2013.003\"><em>https:\/\/doi.org\/10.11613\/bm.2013.003<\/em><\/a><\/p>\n\n\n\n<p><a>13. Faber, J., &amp; Fonseca, L. M. (2014). How sample size influences research outcomes. <em>Dental Press Journal of Orthodontics<\/em>, <em>19<\/em>(4), 27\u201329. <\/a><a href=\"https:\/\/doi.org\/10.1590\/2176-9451.19.4.027-029.ebo\"><em>https:\/\/doi.org\/10.1590\/2176-9451.19.4.027-029.ebo<\/em><\/a><\/p>\n\n\n\n<p><a>14. Carpenter, J. R., &amp; Smuk, M. (2021). Missing data: A statistical framework for practice. <em>Biometrical Journal<\/em>, <em>63<\/em>(5), 915\u2013947. <\/a><a href=\"https:\/\/doi.org\/10.1002\/bimj.202000196\"><em>https:\/\/doi.org\/10.1002\/bimj.202000196<\/em><\/a><\/p>\n\n\n\n<p><a>15. Ozgur, C., Kleckner, M., &amp; Li, Y. (2015). Selection of statistical software for solving big data problems: A guide for businesses, students, and universities. <em>SAGE Open<\/em>, <em>5<\/em>(2). <\/a><a href=\"https:\/\/doi.org\/10.1177\/2158244015584379\"><em>https:\/\/doi.org\/10.1177\/2158244015584379<\/em><\/a><\/p>\n\n\n\n<p><a>16. Shrier, I., Redelmeier, D. A., Schnitzer, M. E., &amp; Steele, R. J. (2021). 
Challenges in interpreting results from \u2019multiple regression\u2019 when there is interaction between covariates. <em>BMJ Evidence-Based Medicine<\/em>, <em>26<\/em>(2), 53\u201356. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmjebm-2019-111225\"><em>https:\/\/doi.org\/10.1136\/bmjebm-2019-111225<\/em><\/a><\/p>\n\n\n\n<p><a>17. Ito, C., Al-Hassany, L., Kurth, T., &amp; Glatz, T. (2025). Distinguishing description, prediction, and causal inference: A primer on improving congruence between research questions and methods. <em>Neurology<\/em>, <em>104<\/em>(4), e210171. <\/a><a href=\"https:\/\/doi.org\/10.1212\/WNL.0000000000210171\"><em>https:\/\/doi.org\/10.1212\/WNL.0000000000210171<\/em><\/a><\/p>\n\n\n\n<p><a>18. Ratan, S. K., Anand, T., &amp; Ratan, J. (2019). Formulation of research question &#8211; stepwise approach. <em>Journal of Indian Association of Pediatric Surgeons<\/em>, <em>24<\/em>(1), 15\u201320. <\/a><a href=\"https:\/\/doi.org\/10.4103\/jiaps.JIAPS_76_18\"><em>https:\/\/doi.org\/10.4103\/jiaps.JIAPS_76_18<\/em><\/a><\/p>\n\n\n\n<p><a>19. Hammerton, G., &amp; Munaf\u00f2, M. R. (2021). Causal inference with observational data: The need for triangulation of evidence. <em>Psychological Medicine<\/em>, <em>51<\/em>(4), 563\u2013578. <\/a><a href=\"https:\/\/doi.org\/10.1017\/S0033291720005127\"><em>https:\/\/doi.org\/10.1017\/S0033291720005127<\/em><\/a><\/p>\n\n\n\n<p><a>20. Elshawi, R., Al-Mallah, M. H., &amp; Sakr, S. (2019). On the interpretability of machine learning-based model for predicting hypertension. <em>BMC Medical Informatics and Decision Making<\/em>, <em>19<\/em>, 146. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12911-019-0874-0\"><em>https:\/\/doi.org\/10.1186\/s12911-019-0874-0<\/em><\/a><\/p>\n\n\n\n<p><a>21. Heinze, G., Wallisch, C., &amp; Dunkler, D. (2018). Variable selection &#8211; a review and recommendations for the practicing statistician. <em>Biometrical Journal<\/em>, <em>60<\/em>(3), 431\u2013449. 
<\/a><a href=\"https:\/\/doi.org\/10.1002\/bimj.201700067\"><em>https:\/\/doi.org\/10.1002\/bimj.201700067<\/em><\/a><\/p>\n\n\n\n<p><a>22. Iacobucci, D., Schneider, M. J., Popovich, D. L., &amp; Bakamitsos, G. A. (2016). Mean centering helps alleviate \u201cmicro\u201d but not \u201cmacro\u201d multicollinearity. <em>Behavior Research Methods<\/em>, <em>48<\/em>, 1308\u20131317. <\/a><a href=\"https:\/\/doi.org\/10.3758\/s13428-015-0624-x\"><em>https:\/\/doi.org\/10.3758\/s13428-015-0624-x<\/em><\/a><\/p>\n\n\n\n<p><a>23. Dong, Y., &amp; Peng, C. J. (2013). Principled missing data methods for researchers. <em>SpringerPlus<\/em>, <em>2<\/em>(1), 222. <\/a><a href=\"https:\/\/doi.org\/10.1186\/2193-1801-2-222\"><em>https:\/\/doi.org\/10.1186\/2193-1801-2-222<\/em><\/a><\/p>\n\n\n\n<p><a>24. Sterne, J. A. C., White, I. R., Carlin, J. B., Spratt, M., Royston, P., Kenward, M. G., Wood, A. M., &amp; Carpenter, J. R. (2009). Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. <em>BMJ<\/em>, <em>338<\/em>, b2393. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmj.b2393\"><em>https:\/\/doi.org\/10.1136\/bmj.b2393<\/em><\/a><\/p>\n\n\n\n<p><a>25. Buuren, S. van. (2018). <em>Flexible imputation of missing data<\/em> (2nd ed.). Chapman &amp; Hall\/CRC.<\/a><\/p>\n\n\n\n<p><a>26. Ross, R., Breskin, A., &amp; Westreich, D. (2020). When is a complete case approach to missing data valid? The importance of effect measure modification. <em>American Journal of Epidemiology<\/em>, <em>189<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1093\/aje\/kwaa124\"><em>https:\/\/doi.org\/10.1093\/aje\/kwaa124<\/em><\/a><\/p>\n\n\n\n<p><a>27. Szwarcwald, C. L. (2023). National health surveys: Overview of sampling techniques and data collected using complex designs. <em>Epidemiologia e Servi\u00e7os de Sa\u00fade<\/em>, <em>32<\/em>(3), e2023431. 
<\/a><a href=\"https:\/\/doi.org\/10.1590\/S2237-96222023000300014.EN\"><em>https:\/\/doi.org\/10.1590\/S2237-96222023000300014.EN<\/em><\/a><\/p>\n\n\n\n<p><a>28. Lumley, T., Diehr, P., Emerson, S., &amp; Chen, L.-J. (2004). The importance of the normality assumption in large public health data sets. <em>Annual Review of Public Health<\/em>, <em>25<\/em>(1), 151\u2013169. <\/a><a href=\"https:\/\/doi.org\/10.1146\/annurev.publhealth.25.101802.123014\"><em>https:\/\/doi.org\/10.1146\/annurev.publhealth.25.101802.123014<\/em><\/a><\/p>\n\n\n\n<p><a>29. Heeringa, S. G., West, B. T., &amp; Berglund, P. A. (2017). Applied survey data analysis. <em>Chapman and Hall\/CRC<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1201\/9781315371650\"><em>https:\/\/doi.org\/10.1201\/9781315371650<\/em><\/a><\/p>\n\n\n\n<p><a>30. Lumley, T. (2011). Complex surveys: A guide to analysis using r. <em>Wiley<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1002\/9781118257607\"><em>https:\/\/doi.org\/10.1002\/9781118257607<\/em><\/a><\/p>\n\n\n\n<p><a>31. Choi, G., Buckley, J. P., Kuiper, J. R., &amp; Keil, A. P. (2022). Log-transformation of independent variables: Must we? <em>Epidemiology<\/em>, <em>33<\/em>(6), 843\u2013853. <\/a><a href=\"https:\/\/doi.org\/10.1097\/EDE.0000000000001534\"><em>https:\/\/doi.org\/10.1097\/EDE.0000000000001534<\/em><\/a><\/p>\n\n\n\n<p><a>32. Kwak, S. K., &amp; Kim, J. H. (2017). Statistical data preparation: Management of missing values and outliers. <em>Korean Journal of Anesthesiology<\/em>, <em>70<\/em>(4), 407\u2013411. <\/a><a href=\"https:\/\/doi.org\/10.4097\/kjae.2017.70.4.407\"><em>https:\/\/doi.org\/10.4097\/kjae.2017.70.4.407<\/em><\/a><\/p>\n\n\n\n<p><a>33. Ng\u2019ambi, W. F., &amp; Muula, A. S. (2025). A reproducible r workflow to preserve variable and value labels in stata, SPSS, and SAS datasets for transparent and reproducible health research. <em>Malawi Medical Journal<\/em>, <em>37<\/em>(3).<\/a><\/p>\n\n\n\n<p><a>34. 
Piccininni, M., Konigorski, S., Rohmann, J. L., &amp; Kurth, T. (2020). Directed acyclic graphs and causal thinking in clinical risk prediction modeling. <em>BMC Medical Research Methodology<\/em>, <em>20<\/em>(1), 179. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12874-020-01058-z\"><em>https:\/\/doi.org\/10.1186\/s12874-020-01058-z<\/em><\/a><\/p>\n\n\n\n<p><a>35. Pearl, J. (2009). Causal inference in statistics: An overview. <em>Statistics Surveys<\/em>, <em>3<\/em>, 96\u2013146. <\/a><a href=\"https:\/\/doi.org\/10.1214\/09-SS057\"><em>https:\/\/doi.org\/10.1214\/09-SS057<\/em><\/a><\/p>\n\n\n\n<p><a>36. Dyer, B. P. (2025). Variable selection for causal inference, prediction, and descriptive research: A narrative review of recommendations. <em>European Heart Journal Open<\/em>, <em>5<\/em>(3), oeaf070. <\/a><a href=\"https:\/\/doi.org\/10.1093\/ehjopen\/oeaf070\"><em>https:\/\/doi.org\/10.1093\/ehjopen\/oeaf070<\/em><\/a><\/p>\n\n\n\n<p><a>37. Westreich, D., &amp; Greenland, S. (2013). The table 2 fallacy: Presenting and interpreting confounder and modifier coefficients. <em>American Journal of Epidemiology<\/em>, <em>177<\/em>(4), 292\u2013298. <\/a><a href=\"https:\/\/doi.org\/10.1093\/aje\/kws412\"><em>https:\/\/doi.org\/10.1093\/aje\/kws412<\/em><\/a><\/p>\n\n\n\n<p><a>38. Bujang, M. A., Sa\u2019at, N., Sidik, T., &amp; Joo, L. C. (2018). Sample size guidelines for logistic regression from observational studies with large population: Emphasis on the accuracy between statistics and parameters based on real life clinical data. <em>Malaysian Journal of Medical Sciences<\/em>, <em>25<\/em>(4), 122\u2013130. <\/a><a href=\"https:\/\/doi.org\/10.21315\/mjms2018.25.4.12\"><em>https:\/\/doi.org\/10.21315\/mjms2018.25.4.12<\/em><\/a><\/p>\n\n\n\n<p><a>39. Altelbany, S. (2021). Evaluation of ridge, elastic net and lasso regression methods in precedence of multicollinearity problem: A simulation study. 
<em>Journal of Applied Economics and Business Studies<\/em>, <em>5<\/em>, 131\u2013142. <\/a><a href=\"https:\/\/doi.org\/10.34260\/jaebs.517\"><em>https:\/\/doi.org\/10.34260\/jaebs.517<\/em><\/a><\/p>\n\n\n\n<p><a>40. James, G., Witten, D., Hastie, T., &amp; Tibshirani, R. (2013). <em>An introduction to statistical learning<\/em> (Vol. 112, pp. 3\u20137). Springer. <\/a><a href=\"https:\/\/doi.org\/10.1007\/978-1-4614-7138-7\"><em>https:\/\/doi.org\/10.1007\/978-1-4614-7138-7<\/em><\/a><\/p>\n\n\n\n<p><a>41. Vatcheva, K. P., Lee, M., McCormick, J. B., &amp; Rahbar, M. H. (2016). Multicollinearity in regression analyses conducted in epidemiologic studies. <em>Epidemiology (Sunnyvale)<\/em>, <em>6<\/em>(2), 227. <\/a><a href=\"https:\/\/doi.org\/10.4172\/2161-1165.1000227\"><em>https:\/\/doi.org\/10.4172\/2161-1165.1000227<\/em><\/a><\/p>\n\n\n\n<p><a>42. Kirkwood, B. R., &amp; Sterne, J. A. C. (2003). <em>Essential medical statistics<\/em>. Blackwell Science.<\/a><\/p>\n\n\n\n<p><a>43. Ranganathan, P., Pramesh, C. S., &amp; Aggarwal, R. (2017). Common pitfalls in statistical analysis: Logistic regression. <em>Perspectives in Clinical Research<\/em>, <em>8<\/em>(3), 148\u2013151. <\/a><a href=\"https:\/\/doi.org\/10.4103\/picr.PICR_87_17\"><em>https:\/\/doi.org\/10.4103\/picr.PICR_87_17<\/em><\/a><\/p>\n\n\n\n<p><a>44. Schreiber-Gregory, D., &amp; Bader, K. (2018). <em>Logistic and linear regression assumptions: Violation recognition and control<\/em>.<\/a><\/p>\n\n\n\n<p><a>45. Schuster, N. A., Rijnhart, J. J. M., Twisk, J. W. R., &amp; Heymans, M. W. (2022). Modeling non-linear relationships in epidemiological data: The application and interpretation of spline models. <em>Frontiers in Epidemiology<\/em>, <em>2<\/em>, 975380. <\/a><a href=\"https:\/\/doi.org\/10.3389\/fepid.2022.975380\"><em>https:\/\/doi.org\/10.3389\/fepid.2022.975380<\/em><\/a><\/p>\n\n\n\n<p><a>46. Hosmer, D. W., Lemeshow, S., &amp; Sturdivant, R. X. (2013). Applied logistic regression. 
<em>Wiley Series in Probability and Statistics<\/em>, <em>398<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1002\/9781118548387\"><em>https:\/\/doi.org\/10.1002\/9781118548387<\/em><\/a><\/p>\n\n\n\n<p><a>47. Cotter, J., Schmiege, S., Moss, A., &amp; Ambroggio, L. (2023). How to interact with interactions: What clinicians should know about statistical interactions. <em>Hospital Pediatrics<\/em>, <em>13<\/em>(10), e319\u2013e323. <\/a><a href=\"https:\/\/doi.org\/10.1542\/hpeds.2023-007259\"><em>https:\/\/doi.org\/10.1542\/hpeds.2023-007259<\/em><\/a><\/p>\n\n\n\n<p><a>48. Bender, R., Augustin, T., &amp; Blettner, M. (2005). Generalisability and the random effects model. <em>Biometrical Journal<\/em>, <em>47<\/em>(1), 19\u201331. <\/a><a href=\"https:\/\/doi.org\/10.1002\/bimj.200410258\"><em>https:\/\/doi.org\/10.1002\/bimj.200410258<\/em><\/a><\/p>\n\n\n\n<p><a>49. Austin, P. C., Kapral, M. K., Vyas, M. V., Fang, J., &amp; Yu, A. Y. X. (2024). Using multilevel models and generalized estimating equation models to account for clustering in neurology clinical research. <em>Neurology<\/em>, <em>103<\/em>(9), e209947. <\/a><a href=\"https:\/\/doi.org\/10.1212\/WNL.0000000000209947\"><em>https:\/\/doi.org\/10.1212\/WNL.0000000000209947<\/em><\/a><\/p>\n\n\n\n<p><a>50. Sadatsafavi, M., Saha-Chaudhuri, P., &amp; Petkau, J. (2022). Model-based ROC curve: Examining the effect of case mix and model calibration on the ROC plot. <em>Medical Decision Making<\/em>, <em>42<\/em>(4), 487\u2013499. <\/a><a href=\"https:\/\/doi.org\/10.1177\/0272989X211050909\"><em>https:\/\/doi.org\/10.1177\/0272989X211050909<\/em><\/a><\/p>\n\n\n\n<p><a>51. \u00c7orbac\u0131o\u011flu, \u015e. K., &amp; Aksel, G. (2023). Receiver operating characteristic curve analysis in diagnostic accuracy studies: A guide to interpreting the area under the curve value. <em>Turkish Journal of Emergency Medicine<\/em>, <em>23<\/em>(4), 195\u2013198. 
<\/a><a href=\"https:\/\/doi.org\/10.4103\/tjem.tjem_182_23\"><em>https:\/\/doi.org\/10.4103\/tjem.tjem_182_23<\/em><\/a><\/p>\n\n\n\n<p><a>52. Walsh, C., Sharman, K., &amp; Hripcsak, G. (2017). Beyond discrimination: A comparison of calibration methods and clinical usefulness of predictive models of readmission risk. <em>Journal of Biomedical Informatics<\/em>, <em>76<\/em>, 10.1016\/j.jbi.2017.10.008. <\/a><a href=\"https:\/\/doi.org\/10.1016\/j.jbi.2017.10.008\"><em>https:\/\/doi.org\/10.1016\/j.jbi.2017.10.008<\/em><\/a><\/p>\n\n\n\n<p><a>53. Huang, Y., Li, W., Macheret, F., Gabriel, R. A., &amp; Ohno-Machado, L. (2020). A tutorial on calibration measurements and calibration models for clinical prediction models. <em>Journal of the American Medical Informatics Association<\/em>, <em>27<\/em>(4), 621\u2013633. <\/a><a href=\"https:\/\/doi.org\/10.1093\/jamia\/ocz228\"><em>https:\/\/doi.org\/10.1093\/jamia\/ocz228<\/em><\/a><\/p>\n\n\n\n<p><a>54. Zhang, Z. (2016). Residuals and regression diagnostics: Focusing on logistic regression. <em>Annals of Translational Medicine<\/em>, <em>4<\/em>(10), 195. <\/a><a href=\"https:\/\/doi.org\/10.21037\/atm.2016.03.36\"><em>https:\/\/doi.org\/10.21037\/atm.2016.03.36<\/em><\/a><\/p>\n\n\n\n<p><a>55. Dey, D., Haque, M. S., Islam, M. M., et al. (2025). The proper application of logistic regression model in complex survey data: A systematic review. <em>BMC Medical Research Methodology<\/em>, <em>25<\/em>, 15. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12874-024-02454-5\"><em>https:\/\/doi.org\/10.1186\/s12874-024-02454-5<\/em><\/a><\/p>\n\n\n\n<p><a>56. Steyerberg, E. W. (2010). Clinical prediction models: A practical approach to development, validation, and updating. <em>Springer<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1007\/978-0-387-77244-8\"><em>https:\/\/doi.org\/10.1007\/978-0-387-77244-8<\/em><\/a><\/p>\n\n\n\n<p><a>57. Shmueli, G. (2010). To explain or to predict? 
<em>Statistical Science<\/em>, <em>25<\/em>(3), 289\u2013310. <\/a><a href=\"https:\/\/doi.org\/10.1214\/10-STS330\"><em>https:\/\/doi.org\/10.1214\/10-STS330<\/em><\/a><\/p>\n\n\n\n<p><a>58. Coley, R. Y., Liao, Q., Simon, N., &amp; Shortreed, S. M. (2023). Empirical evaluation of internal validation methods for prediction in large-scale clinical data with rare-event outcomes: A case study in suicide risk prediction. <em>BMC Medical Research Methodology<\/em>, <em>23<\/em>(1), 33. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12874-023-01844-5\"><em>https:\/\/doi.org\/10.1186\/s12874-023-01844-5<\/em><\/a><\/p>\n\n\n\n<p><a>59. Kuhn, M. (2008). Building predictive models in R using the caret package. <em>Journal of Statistical Software<\/em>, <em>28<\/em>, 1\u201326. <\/a><a href=\"https:\/\/doi.org\/10.18637\/jss.v028.i05\"><em>https:\/\/doi.org\/10.18637\/jss.v028.i05<\/em><\/a><\/p>\n\n\n\n<p><a>60. Friedman, J., Hastie, T., &amp; Tibshirani, R. (2009). <em>glmnet: Lasso and elastic-net regularized generalized linear models<\/em> (Vol. 1). <\/a><a href=\"https:\/\/cran.r-project.org\/web\/packages\/glmnet\/index.html\"><em>https:\/\/cran.r-project.org\/web\/packages\/glmnet\/index.html<\/em><\/a><\/p>\n\n\n\n<p><a>61. Song, Q. C., Tang, C., &amp; Wee, S. (2021). Making sense of model generalizability: A tutorial on cross-validation in R and Shiny. <em>Advances in Methods and Practices in Psychological Science<\/em>, <em>4<\/em>(1). <\/a><a href=\"https:\/\/doi.org\/10.1177\/2515245920947067\"><em>https:\/\/doi.org\/10.1177\/2515245920947067<\/em><\/a><\/p>\n\n\n\n<p><a>62. Harrell, F. E., Jr. (2022). <em>rms: Regression modeling strategies<\/em>. <\/a><a href=\"https:\/\/CRAN.R-project.org\/package=rms\"><em>https:\/\/CRAN.R-project.org\/package=rms<\/em><\/a><\/p>\n\n\n\n<p><a>63. Persoskie, A., &amp; Ferrer, R. A. (2017). A most odd ratio: Interpreting and describing odds ratios. <em>American Journal of Preventive Medicine<\/em>, <em>52<\/em>(2), 224\u2013228. 
<\/a><a href=\"https:\/\/doi.org\/10.1016\/j.amepre.2016.07.030\"><em>https:\/\/doi.org\/10.1016\/j.amepre.2016.07.030<\/em><\/a><\/p>\n\n\n\n<p><a>64. Sperandei, S. (2014). Understanding logistic regression analysis. <em>Biochemia Medica<\/em>, <em>24<\/em>(1), 12\u201318. <\/a><a href=\"https:\/\/doi.org\/10.11613\/BM.2014.003\"><em>https:\/\/doi.org\/10.11613\/BM.2014.003<\/em><\/a><\/p>\n\n\n\n<p><a>65. Thompson, J., Watson, S. I., Middleton, L., et al. (2025). Estimating relative risks and risk differences in randomised controlled trials: A systematic review of current practice. <em>Trials<\/em>, <em>26<\/em>, 1. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s13063-024-08690-w\"><em>https:\/\/doi.org\/10.1186\/s13063-024-08690-w<\/em><\/a><\/p>\n\n\n\n<p><a>66. Ning, Y., Lam, A., &amp; Reilly, M. (2022). Estimating risk ratio from any standard epidemiological design by doubling the cases. <em>BMC Medical Research Methodology<\/em>, <em>22<\/em>, 157. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12874-022-01636-3\"><em>https:\/\/doi.org\/10.1186\/s12874-022-01636-3<\/em><\/a><\/p>\n\n\n\n<p><a>67. Knol, M., Le Cessie, S., Algra, A., Vandenbroucke, J., &amp; Groenwold, R. (2011). Overestimation of risk ratios by odds ratios in trials and cohort studies: Alternatives to logistic regression. <em>CMAJ : Canadian Medical Association Journal<\/em>, <em>184<\/em>, 895\u2013899. <\/a><a href=\"https:\/\/doi.org\/10.1503\/cmaj.101715\"><em>https:\/\/doi.org\/10.1503\/cmaj.101715<\/em><\/a><\/p>\n\n\n\n<p><a>68. Dwivedi, A. K. (2022). How to write statistical analysis section in medical research. <em>Journal of Investigative Medicine<\/em>, <em>70<\/em>(8), 1759\u20131770. <\/a><a href=\"https:\/\/doi.org\/10.1136\/jim-2022-002479\"><em>https:\/\/doi.org\/10.1136\/jim-2022-002479<\/em><\/a><\/p>\n\n\n\n<p><a>69. Padilla, L. M., Creem-Regehr, S. H., Hegarty, M., et al. (2018). Decision making with visualizations: A cognitive framework across disciplines. 
<em>Cognitive Research: Principles and Implications<\/em>, <em>3<\/em>, 29. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s41235-018-0120-9\"><em>https:\/\/doi.org\/10.1186\/s41235-018-0120-9<\/em><\/a><\/p>\n\n\n\n<p><a>70. Olarte Parra, C., Bertizzolo, L., Schroter, S., Dechartres, A., &amp; Goetghebeur, E. (2021). Consistency of causal claims in observational studies: A review of papers published in a general medical journal. <em>BMJ Open<\/em>, <em>11<\/em>(5), e043339. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmjopen-2020-043339\"><em>https:\/\/doi.org\/10.1136\/bmjopen-2020-043339<\/em><\/a><\/p>\n\n\n\n<p><a>71. Sterrantino, A. F. (2024). Observational studies: Practical tips for avoiding common statistical pitfalls. <em>The Lancet Regional Health &#8211; Southeast Asia<\/em>, <em>25<\/em>, 100415. <\/a><a href=\"https:\/\/doi.org\/10.1016\/j.lansea.2024.100415\"><em>https:\/\/doi.org\/10.1016\/j.lansea.2024.100415<\/em><\/a><\/p>\n\n\n\n<p><a>72. Carlin, J. B., &amp; Moreno-Betancur, M. (2025). On the uses and abuses of regression models: A call for reform of statistical practice and teaching. <em>Statistics in Medicine<\/em>, <em>44<\/em>(13-14), e10244. <\/a><a href=\"https:\/\/doi.org\/10.1002\/sim.10244\"><em>https:\/\/doi.org\/10.1002\/sim.10244<\/em><\/a><\/p>\n\n\n\n<p><a>73. Sheldrick, R., Chung, P., &amp; Jacobson, R. (2017). Math matters: How misinterpretation of odds ratios and risk ratios may influence conclusions. <em>Academic Pediatrics<\/em>, <em>17<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1016\/j.acap.2016.10.008\"><em>https:\/\/doi.org\/10.1016\/j.acap.2016.10.008<\/em><\/a><\/p>\n\n\n\n<p><a>74. Norton, E. C., &amp; Dowd, B. E. (2018). Log odds and the interpretation of logit models. <em>Health Services Research<\/em>, <em>53<\/em>(2), 859\u2013878. <\/a><a href=\"https:\/\/doi.org\/10.1111\/1475-6773.12712\"><em>https:\/\/doi.org\/10.1111\/1475-6773.12712<\/em><\/a><\/p>\n\n\n\n<p><a>75. 
Riley, R., Snell, K., Martin, G., Whittle, R., Archer, L., Sperrin, M., &amp; Collins, G. (2020). Penalisation and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. <em>Journal of Clinical Epidemiology<\/em>, <em>132<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1016\/j.jclinepi.2020.12.005\"><em>https:\/\/doi.org\/10.1016\/j.jclinepi.2020.12.005<\/em><\/a><\/p>\n\n\n\n<p><a>76. Heinze, G., Wallisch, C., &amp; Dunkler, D. (2020). Variable selection &#8211; a review and recommendations for the practicing statistician. <em>Biometrical Journal<\/em>, <em>62<\/em>(1), 39\u201367. <\/a><a href=\"https:\/\/doi.org\/10.1002\/bimj.201900037\"><em>https:\/\/doi.org\/10.1002\/bimj.201900037<\/em><\/a><\/p>\n\n\n\n<p><a>77. Shrier, I., &amp; Platt, R. W. (2008). Reducing bias through directed acyclic graphs. <em>BMC Medical Research Methodology<\/em>, <em>8<\/em>, 70. <\/a><a href=\"https:\/\/doi.org\/10.1186\/1471-2288-8-70\"><em>https:\/\/doi.org\/10.1186\/1471-2288-8-70<\/em><\/a><\/p>\n\n\n\n<p><a>78. Pearl, J., &amp; Mackenzie, D. (2018). <em>The book of why: The new science of cause and effect<\/em>. Basic Books.<\/a><\/p>\n\n\n\n<p><a>79. Suhas, S., Manjunatha, N., Kumar, C. N., et al. (2023). Firth\u2019s penalized logistic regression: A superior approach for analysis of data from India\u2019s National Mental Health Survey, 2016. <em>Indian Journal of Psychiatry<\/em>, <em>65<\/em>(12), 1208\u20131213. <\/a><a href=\"https:\/\/doi.org\/10.4103\/indianjpsychiatry.indianjpsychiatry_827_23\"><em>https:\/\/doi.org\/10.4103\/indianjpsychiatry.indianjpsychiatry_827_23<\/em><\/a><\/p>\n\n\n\n<p><a>80. Gupta, A., Gupta, R., Singh, S., et al. (2018). Understanding logistic regression analysis. <em>Journal of the Indian Association of Pediatric Surgeons<\/em>, <em>23<\/em>(3), 131\u2013136. 
<\/a><a href=\"https:\/\/doi.org\/10.4103\/jiaps.JIAPS_46_18\"><em>https:\/\/doi.org\/10.4103\/jiaps.JIAPS_46_18<\/em><\/a><\/p>\n\n\n\n<p><a>81. Fekadu, A., Medhin, G., Selamu, M., &amp; Hanlon, C. (2021). Predicting adult hospital admission from emergency department using machine learning: Analyzing electronic health records of 1.4 million visits in the USA. <em>Psychological Medicine<\/em>, <em>51<\/em>(2), 290\u2013299. <\/a><a href=\"https:\/\/doi.org\/10.1017\/S0033291719001434\"><em>https:\/\/doi.org\/10.1017\/S0033291719001434<\/em><\/a><\/p>\n\n\n\n<p><a>82. Ferreira, J. C., &amp; Patino, C. M. (2017). Subgroup analysis and interaction tests: Why they are important and how to avoid common mistakes. <em>Jornal Brasileiro de Pneumologia<\/em>, <em>43<\/em>(3), 162. <\/a><a href=\"https:\/\/doi.org\/10.1590\/S1806-37562017000000170\"><em>https:\/\/doi.org\/10.1590\/S1806-37562017000000170<\/em><\/a><\/p>\n\n\n\n<p><a>83. Greenland, S., Pearl, J., &amp; Robins, J. M. (1999). Causal diagrams for epidemiologic research. <em>Epidemiology<\/em>, <em>10<\/em>(1), 37\u201348. <\/a><a href=\"https:\/\/doi.org\/10.1097\/00001648-199901000-00008\"><em>https:\/\/doi.org\/10.1097\/00001648-199901000-00008<\/em><\/a><\/p>\n\n\n\n<p><a>84. Kontopantelis, E., White, I. R., Sperrin, M., &amp; Buchan, I. E. (2017). Outcome-sensitive multiple imputation: A simulation study. <em>BMC Medical Research Methodology<\/em>, <em>17<\/em>(2). <\/a><a href=\"https:\/\/doi.org\/10.1186\/s12874-016-0281-5\"><em>https:\/\/doi.org\/10.1186\/s12874-016-0281-5<\/em><\/a><\/p>\n\n\n\n<p><a>85. Little, R. J. A., &amp; Rubin, D. B. (2019). <em>Statistical analysis with missing data<\/em> (3rd ed.). Wiley. <\/a><a href=\"https:\/\/doi.org\/10.1002\/9781119482260\"><em>https:\/\/doi.org\/10.1002\/9781119482260<\/em><\/a><\/p>\n\n\n\n<p><a>86. Azur, M. J., Stuart, E. A., Frangakis, C., &amp; Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? 
<em>International Journal of Methods in Psychiatric Research<\/em>, <em>20<\/em>(1), 40\u201349. <\/a><a href=\"https:\/\/doi.org\/10.1002\/mpr.329\"><em>https:\/\/doi.org\/10.1002\/mpr.329<\/em><\/a><\/p>\n\n\n\n<p><a>87. Zhou, Z.-R., Wang, W.-W., Li, Y., Jin, K.-R., Wang, X.-Y., Wang, Z.-W., Chen, Y.-S., Wang, S.-J., Hu, J., Zhang, H.-N., Huang, P., Zhao, G.-Z., Chen, X.-X., Li, B., &amp; Zhang, T.-S. (2019). In-depth mining of clinical data: The construction of clinical prediction model with R. <em>Annals of Translational Medicine<\/em>, <em>7<\/em>(23). <\/a><a href=\"https:\/\/atm.amegroups.org\/article\/view\/29812\"><em>https:\/\/atm.amegroups.org\/article\/view\/29812<\/em><\/a><\/p>\n\n\n\n<p><a>88. Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., &amp; Altman, D. G. (2016). Statistical tests, p values, confidence intervals, and power: A guide to misinterpretations. <em>European Journal of Epidemiology<\/em>, <em>31<\/em>(4), 337\u2013350. <\/a><a href=\"https:\/\/doi.org\/10.1007\/s10654-016-0149-3\"><em>https:\/\/doi.org\/10.1007\/s10654-016-0149-3<\/em><\/a><\/p>\n\n\n\n<p><a>89. Olowe, K., Edoh, N., Zouo, S., &amp; Olamijuwon, J. (2024). Comprehensive review of logistic regression techniques in predicting health outcomes and trends. <em>World Journal of Advanced Pharmaceutical and Life Sciences<\/em>, <em>7<\/em>, 016\u2013026. <\/a><a href=\"https:\/\/doi.org\/10.53346\/wjapls.2024.7.2.0039\"><em>https:\/\/doi.org\/10.53346\/wjapls.2024.7.2.0039<\/em><\/a><\/p>\n\n\n\n<p><a>90. Zardo, P., &amp; Collie, A. (2014). Predicting research use in a public health policy environment: Results of a logistic regression analysis. <em>Implementation Science<\/em>, <em>9<\/em>, 142. <\/a><a href=\"https:\/\/doi.org\/10.1186\/s13012-014-0142-8\"><em>https:\/\/doi.org\/10.1186\/s13012-014-0142-8<\/em><\/a><\/p>\n\n\n\n<p><a>91. Osborne, J. (2015). <em>Best practices in logistic regression<\/em>. SAGE Publications. 
<\/a><a href=\"https:\/\/doi.org\/10.4135\/9781483399041\"><em>https:\/\/doi.org\/10.4135\/9781483399041<\/em><\/a><\/p>\n\n\n\n<p><a>92. Muller, C., &amp; Maclehose, R. (2014). Estimating predicted probabilities from logistic regression: Different methods correspond to different target populations. <em>International Journal of Epidemiology<\/em>, <em>43<\/em>. <\/a><a href=\"https:\/\/doi.org\/10.1093\/ije\/dyu029\"><em>https:\/\/doi.org\/10.1093\/ije\/dyu029<\/em><\/a><\/p>\n\n\n\n<p><a>93. Shipe, M. E., Deppen, S. A., Farjah, F., &amp; Grogan, E. L. (2019). Developing prediction models for clinical use using logistic regression: An overview. <em>Journal of Thoracic Disease<\/em>, <em>11<\/em>(Suppl 4), S574\u2013S584. <\/a><a href=\"https:\/\/doi.org\/10.21037\/jtd.2019.01.25\"><em>https:\/\/doi.org\/10.21037\/jtd.2019.01.25<\/em><\/a><\/p>\n\n\n\n<p><a>94. Ng\u2019ambi, W., Chiumia, I. K., Chagoma, N., &amp; Mfutso-Bengo, J. (2020). Factors associated with uptake of HIV testing in malawi: A trend analysis of the malawi demographic and health survey data from 2004 to 2016. <em>Journal of HIV and AIDS Research<\/em>, <em>2<\/em>(1), 101.<\/a><\/p>\n\n\n\n<p><a>95. Mwale, F., Ng\u2019ambi, W. F., Zonda, J. M., Mfutso-Bengo, J., &amp; Nkhoma, D. (2025). Assessing the effects of socioeconomic status on use of hypertension care services in zambia: Insights from 2017 WHO stepwise survey. <em>International Journal of Noncommunicable Diseases<\/em>, <em>10<\/em>(2), 83\u201392. <\/a><a href=\"https:\/\/doi.org\/10.4103\/jncd.jncd_15_25\"><em>https:\/\/doi.org\/10.4103\/jncd.jncd_15_25<\/em><\/a><\/p>\n\n\n\n<p><a>96. Muller, C. J., &amp; MacLehose, R. F. (2014). Estimating predicted probabilities from logistic regression: Different methods correspond to different target populations. <em>International Journal of Epidemiology<\/em>, <em>43<\/em>(3), 962\u2013970. 
<\/a><a href=\"https:\/\/doi.org\/10.1093\/ije\/dyu029\"><em>https:\/\/doi.org\/10.1093\/ije\/dyu029<\/em><\/a><\/p>\n\n\n\n<p><a>97. D\u2019Amico, F., Marmiere, M., Fonti, M., Battaglia, M., &amp; Belletti, A. (2025). Association does not mean causation, when observational data were misinterpreted as causal: The observational interpretation fallacy. <em>Journal of Evaluation in Clinical Practice<\/em>, <em>31<\/em>(1), e14288. <\/a><a href=\"https:\/\/doi.org\/10.1111\/jep.14288\"><em>https:\/\/doi.org\/10.1111\/jep.14288<\/em><\/a><\/p>\n\n\n\n<p><a>98. Mathur, M. B., &amp; VanderWeele, T. J. (2022). Methods to address confounding and other biases in meta-analyses: Review and recommendations. <em>Annual Review of Public Health<\/em>, <em>43<\/em>, 19\u201335. <\/a><a href=\"https:\/\/doi.org\/10.1146\/annurev-publhealth-051920-114020\"><em>https:\/\/doi.org\/10.1146\/annurev-publhealth-051920-114020<\/em><\/a><\/p>\n\n\n\n<p><a>99. Noyes, J., Booth, A., Moore, G., Flemming, K., Tun\u00e7alp, \u00d6., &amp; Shakibazadeh, E. (2019). Synthesising quantitative and qualitative evidence to inform guidelines on complex interventions: Clarifying the purposes, designs and outlining some methods. <em>BMJ Global Health<\/em>, <em>4<\/em>(Suppl 1), e000893. <\/a><a href=\"https:\/\/doi.org\/10.1136\/bmjgh-2018-000893\"><em>https:\/\/doi.org\/10.1136\/bmjgh-2018-000893<\/em><\/a><\/p>\n\n\n\n<p><a>100. Khatri, R. B., Endalamaw, A., Erku, D., &amp; al., et. (2024). Enablers and barriers of community health programs for improved equity and universal coverage of primary health care services: A scoping review. <em>BMC Primary Care<\/em>, <em>25<\/em>(1), 385. 
<\/a><a href=\"https:\/\/doi.org\/10.1186\/s12875-024-02629-5\"><em>https:\/\/doi.org\/10.1186\/s12875-024-02629-5<\/em><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Wingston Felix Ng\u2019ambi1, Cosmas Zyambo, Adamson Sinjani Muula3-5 Health Economics and Policy Unit, Department of Health Systems and Policy, Kamuzu University of Health Sciences, Lilongwe, Malawi Africa Centre of Excellence in Public Health and Herbal Medicine (ACEPHEM), Kamuzu University of Health Sciences, Blantyre, Malawi Department of Public Health and Family Medicine, University of Zambia, Lusaka, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[75,108],"tags":[135],"_links":{"self":[{"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/posts\/13527"}],"collection":[{"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13527"}],"version-history":[{"count":1,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/posts\/13527\/revisions"}],"predecessor-version":[{"id":13531,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=\/wp\/v2\/posts\/13527\/revisions\/13531"}],"wp:attachment":[{"href":"https:\/\/www.mmj.mw\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13527"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13527"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mmj.mw\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13527"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}