Every two years, the NPSO organises the Innovation Day, at which PhD candidates from Belgium and the Netherlands can present their research. This year marks the 7th edition. As in previous years, the NPSO Innovation Prize will be awarded through a vote among the audience. The prize is sponsored by Statistics Netherlands (CBS). The topics are diverse, ranging from data collection to analysis to machine learning. In all cases, the focus is on innovation relevant to social-science and survey research.
Location: Utrecht University Library, Boothzaal, Heidelberglaan 3, Utrecht
Confirmed speakers:
- Charlotte Müller (UU) - Testing the Performance and Bias of Large Language Models in Generating Synthetic Survey Data
- Chris Lam (TUE/CBS) - How to integrate AI into smart surveys: A case for the Budget Survey
- Danielle McCool (UU) - The role of participant understanding in data donation studies
- Daniëlle Remmerswaal (UU) - Do respondents show higher activity and engagement in app-based diaries compared to web-based diaries? A case study using Statistics Netherlands’ Household Budget Diary.
- Daria Dementeva (KUL) - Capturing respondents' geographic contexts in surveys: A GIS approach to the uncertain geographic context problem
- Jim Achterberg (LUMC) - Synthetic data in health technology assessments
- Lisa Sivak (RUG) - Measuring the predictability of fertility outcomes using mass collaboration and simulation
- Maik Beuken (Hogeschool Zuyd) - Identification of clusters in a large general adult population sample of the Dutch Health Survey Using Machine Learning: A systemic interpretation of personal and environmental characteristics on BMI related weight categories
- Maike Weipe (UU) - Combining register and survey data to understand carbon-intensive time use
- Rachel de Jong (UL) - Anonymity and disclosure control for network data
- Sophie Berkhout (UU/CBS) - MATILDA: Your resource for intensive longitudinal research
- Thom Volker (UU) - Being certain about the uncertain: Prediction intervals with missing data
Programme outline:
- 10:30 - 11:00: Walk-in with coffee and tea
- 11:00 - 11:10: Opening
- 11:10 - 12:30: First plenary session
- 12:30 - 13:00: Lunch (free of charge)
- 13:00 - 16:00: Second and third plenary sessions
- 16:00 - 16:15: Closing presentation
- 16:15 - 16:30: Voting round and award of the NPSO Innovation Prize
- 16:30 - 17:00: Drinks
SESSION 1
Maik Beuken - Identification of clusters in a large general adult population sample of the Dutch Health Survey using Machine Learning: A systemic interpretation of personal and environmental characteristics on BMI related weight categories
Overweight and obesity are growing public health concerns in the Netherlands, projected to affect 64% of adults by 2050. In this study we aimed to uncover BMI-related patterns in a large, multi-source health dataset using variable selection prior to clustering, to support tailored health behavior interventions. A mixed methods approach was applied, combining quantitative analysis of Dutch Health Survey data with proximity-to-facilities and air pollution data, and qualitative expert input. Through the variable selection method we identified 47 BMI-related variables. Clustering was performed using k-means on principal components. An expert panel study explored the practical relevance of the findings. Data from 15,202 respondents revealed 10 clusters in the total, overweight, and obese groups, and 9 in the morbidly obese group. Cluster profiles varied in, among other things, socioeconomic status and health conditions. Experts recognized associations between selected variables and BMI, merged clusters, and identified four overarching patterns. Structured data analysis combined with expert interpretation revealed distinct BMI-related subgroups that may inform targeted health promotion strategies. Expert interpretation was challenging due to limited contextual data, highlighting the importance of method choice for practical application.
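The clustering step mentioned above ("k-means on principal components") can be sketched in a few lines. This is an illustrative toy example with invented data, not the study's actual pipeline or variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the selected BMI-related survey variables (n x p);
# two subgroups are planted so the clusters are recoverable.
X = rng.normal(size=(200, 6))
X[:100] += 3.0

# 1. Standardise, then PCA via SVD; project onto the leading components.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T  # scores on the first two principal components

# 2. Plain Lloyd's k-means on the component scores,
#    with a deterministic, spread-out initialisation.
def kmeans(Z, k, iters=50):
    idx = np.linspace(0, len(Z) - 1, k).astype(int)
    centers = Z[np.argsort(Z[:, 0])][idx]
    for _ in range(iters):
        labels = ((Z[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([Z[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

labels = kmeans(scores, k=2)  # cluster assignment per respondent
```

In the study itself, variable selection first narrowed the data to 47 BMI-related variables; here the input is random toy data, so only the mechanics of the method carry over.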
Sophie Berkhout – MATILDA: Your resource for intensive longitudinal research
Intensive longitudinal research has become increasingly popular in the social sciences to study how processes unfold over time. Such studies involve collecting many repeated measures of multiple individuals, referred to as intensive longitudinal data (ILD). As with any study design, it is important that the research goal, theory, measurements, and analyses are aligned. In the context of ILD, time plays an essential role in study design, raising questions such as: At what timescale does the process operate? What dynamics are of interest? How often and for how long do you need to measure to capture these dynamics?
The theory, measurements, and analyses of a study are interconnected, meaning that a decision on one of these aspects impacts the others. For instance, the sampling frequency determines which patterns of the process can be observed. While many resources exist for ILD studies, such as methodological papers and books, they often focus on only one of these aspects, making it difficult to navigate their interdependencies. Additionally, because the field is rapidly evolving, resources can become outdated quickly.
In this presentation, I introduce MATILDA (https://matilda.fss.uu.nl/), a new online resource that provides peer-reviewed, up-to-date educational articles designed to guide researchers in aligning theory, measurement, and analysis when studying processes with ILD.
Lisa Sivak - Measuring the predictability of fertility outcomes using mass collaboration and simulation
Accurate predictions of life outcomes can inform social theory and policy. Yet, social science predictions are often inaccurate. We examine what limits predictive accuracy in the case of fertility, studying who has a child within three years under near-best currently available conditions. We use full-population data from the Dutch registers and high-quality LISS survey data in the Predicting Fertility data challenge. Over 150 people participated in the data challenge and submitted over 70 models, ranging from traditional machine learning approaches to state-of-the-art foundation models. Across both data sources, predictive performance was only modest. To assess how much prediction could improve in principle, we used simulation to estimate an upper bound based on the inherent randomness of conception and pregnancy outcomes. Even under conservative assumptions, this randomness substantially contributes to predictive error. Remaining errors appear only partly due to limited sample size, as learning curves suggest, and likely reflect unmeasured predictors or other random factors. These findings underscore the difficulty of individual-level life outcome prediction and highlight the important role of randomness in human lives.
Charlotte Müller – Testing the Performance and Bias of Large Language Models in Generating Synthetic Survey Data
The idea of replacing human survey respondents with LLM-generated synthetic respondents has gained broad attention in both academic and market research as a promising, timely, and cost-efficient method of data collection. However, previous research has shown that LLMs are likely to reproduce limitations and biases found in their training data, including underrepresentation of certain subgroups and imbalanced topic coverage. Using survey data from the Netherlands and Germany, we conduct a large-scale analysis examining model performance across different item types (factual, attitudinal, behavioral) and social-science-related topics. We deploy LLM agents in two different few-shot learning setups: a sociodemographic setup, which draws on several individual background variables, and a panel-informed setup, which additionally incorporates previous responses. This combined setup allows us to evaluate whether synthetic respondents serve as a suitable proxy for human respondents and, if so, when and for whom they perform best.
Our findings show a substantial lack of prediction accuracy and systematic biases in the sociodemographic setup, which are not mitigated by changes in input language or model calibration. However, providing the LLM with richer respondent-level information, such as previous responses, significantly improves performance. These insights provide important guidance on the opportunities and boundaries of using synthetic respondents as a data collection tool in survey-data-based research.
SESSION 2
Chris Lam - How to integrate AI into smart surveys: A case for the Budget Survey
Every four years, Statistics Netherlands conducts the national budget survey to gain insight into the spending behavior of Dutch households. Participation in the survey, however, is burdensome, as respondents must manually record each product they purchase. To alleviate this burden, Statistics Netherlands is developing a smart survey in the form of a smartphone application that allows respondents to take photographs of their receipts instead. This can turn the current time-consuming process into the simple act of taking a photograph.
This approach, however, does not eliminate the burden; it merely transfers it from the respondent to Statistics Netherlands, which now faces the challenge of processing this stream of photographs. To deal with them, automated systems are being developed, for which the use of AI technology is being explored.
In this session we show how Statistics Netherlands will use AI to process smart survey data and discuss our findings so far.
Danielle McCool - The role of participant understanding in data donation studies
Respondents often rate concerns over data privacy as very important when asked about potential reasons for not sharing their data. The more potentially sensitive the contents of the data, the more hesitant people are to provide it, even to trusted organizations. In response to heightened privacy concerns among respondents and stricter data minimization practices, different tactics have emerged with the goal of offering privacy-sensitive research techniques. While this is a step forward from the perspective of research ethics, it is unclear whether this reduces participants’ own doubts. An experiment was added to several data donation studies in which respondents were provided an explanation of a privacy-preserving data donation technique, followed by a quiz to test their understanding of the procedure. We find that increased respondent understanding of the privacy-preserving technique predicts donation. This effect is independent of other critical, but non-actionable, factors such as age, education, trust, and technological savvy, providing a mechanism by which researchers can potentially improve donation rates. We also investigate how understanding differs across respondent characteristics, within studies, and by question, with some fun results.
Daniëlle Remmerswaal - Do respondents show higher activity and engagement in app-based diaries compared to web-based diaries? A case study using Statistics Netherlands’ Household Budget Diary.
Smartphones offer opportunities for official statistics, promising improved user experience, reduced response burden, and higher data quality. We investigate whether respondents show higher activity and engagement in app-based diaries compared to traditional web-based diaries.
We use Statistics Netherlands’ Household Budget Survey (HBS) as a case study. The HBS is a diary survey conducted every five years to capture household expenditure on goods and services. In 2020, Statistics Netherlands conducted a 4-week web-based survey. In 2021, they conducted a 2-week app-based survey on a smaller sample. We compare participation behavior of respondents in the two modes.
We hypothesize that app respondents will show greater retention and engagement, driven primarily by the smartphone’s constant availability, the smart features that simplify data collection, and the user-friendly design of the smartphone app. Our first research question (RQ1) examines whether retention rates are higher among app respondents, measured through stages of nonresponse and survival analysis of last activity. The second question (RQ2) assesses objective burden by comparing average time spent per purchase between app and web respondents, with a focus on the time-saving potential of the app’s scanning feature. The third question (RQ3) explores whether app respondents are more active and show more spread-out reporting patterns.
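The "survival analysis of last activity" in RQ1 can be illustrated with a hand-rolled Kaplan-Meier estimator: each respondent's last day of diary activity is an event time, censored if the respondent was still active at the end of the field period. Data, field length, and respondent counts below are invented:

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time t.

    durations: day of last activity per respondent
    observed:  1 = dropped out on that day, 0 = censored (still active)
    """
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    surv, curve, i = 1.0, [], 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = at_t = 0
        while i < len(pairs) and pairs[i][0] == t:  # group ties at time t
            deaths += pairs[i][1]
            at_t += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk  # KM product-limit step
            curve.append((t, surv))
        n_at_risk -= at_t
    return curve

# Day of last activity for 8 hypothetical respondents; 0 = still active
# at the end of a 14-day field period.
days =     [3, 5, 5, 9, 14, 14, 14, 14]
observed = [1, 1, 1, 1,  0,  0,  0,  0]
curve = kaplan_meier(days, observed)
# After day 9: S = (7/8) * (5/7) * (4/5) = 0.5
```

Comparing such retention curves between the app and web samples is one way to operationalise RQ1; the actual analysis may of course use a dedicated survival library.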
Daria Dementeva – Capturing respondents' geographic contexts in surveys: A GIS approach to the uncertain geographic context problem
Capturing respondents' geographical contexts in surveys remains a complex task, as it is far from being a mere technical detail but shapes the understanding of how geographical environments influence individual attitudes, opinions, and behaviors. This has several consequences for how respondents' geographical contexts are measured in surveys. While augmenting individual survey responses with administrative geographical contexts (e.g., ZIP codes, municipalities, census tracts, block tracts, or statistical sectors) or pseudo-administrative ones (e.g., official grid cells) has become an established strategy, more recent approaches have also employed a variety of non-administrative, subjective geographical contexts, such as egohoods/circular buffer-based geographies, self-drawn maps, or mobility and social media-based activity spaces. Frequently, however, these approaches are implemented with little consideration for situating survey outcomes of interest within a theoretically meaningful geographical context. In geography, this has been referred to as the Uncertain Geographic Context Problem (UGCP), a measurement issue concerning the approximation/delineation of theoretically relevant geographical contexts in relation to individual-level outcomes (e.g., attitudes, opinions, behaviors). To date, the UGCP has rarely been addressed in survey-based applications. Drawing on Geographic Information Systems (GIS) approaches to the UGCP, I take advantage of these methods and integrate them in scenarios where georeferenced survey data are limited, such as when only respondents' locations are available and the collection of respondent-specific subjective geographical contexts is not foreseen in the survey design. In particular, I focus on augmenting survey responses with GIS-based travel-time neighborhoods.
Neighborhoods are approximated using plausible walking-time or driving-time areas constructed around respondents' locations, taking into account street networks and built environment layouts that may shape (or even determine) walking and driving route choices. I employ this methodology in the theoretical application of the interethnic group relations-neighborhood context nexus, focusing on the relationship between perceived ethnic threat and neighborhood-level ethnic diversity and socioeconomic status, and how this relationship can be influenced by neighborhood context approximation. Based on this application, I conclude with recommendations for survey researchers wishing to (1) improve the measurement of geographical contexts and (2) tailor them to survey-based outcomes of interest.
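The core computation behind a travel-time neighborhood is a shortest-path search with a time budget over a street network. A minimal sketch, with an invented toy network and walking times (real applications would use actual street-network data and routing engines):

```python
import heapq

def travel_time_neighbourhood(graph, origin, budget):
    """Nodes reachable from `origin` within `budget` seconds (Dijkstra)."""
    best = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, cost in graph.get(node, {}).items():
            nt = t + cost
            if nt <= budget and nt < best.get(nbr, float("inf")):
                best[nbr] = nt
                heapq.heappush(heap, (nt, nbr))
    return best  # node -> travel time in seconds

# Toy street network: edge weights are walking times in seconds.
streets = {
    "home": {"corner": 60, "park": 300},
    "corner": {"home": 60, "shop": 120},
    "shop": {"corner": 120, "market": 400},
    "park": {"home": 300},
}
ten_min = travel_time_neighbourhood(streets, "home", budget=600)
# "market" lies 60 + 120 + 400 = 580 s away, inside a 10-minute buffer
```

The set of street segments reached this way, rather than a crow-flies circle, defines the neighborhood polygon, which is what lets the method respect street networks and built-environment layouts as described above.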
Maike Weipe - Combining Register and Survey Data to Understand Carbon-Intensive Time Use
Statistical institutions collect extensive register data and administer valuable surveys, yet the latter typically cover only a limited share of the population. While the integration of register and survey data already holds great promise, an even broader range of opportunities arises when multiple surveys can be combined - even when they lack (substantial) overlap. Our research is situated within the field of climate sociology and aims to estimate individual carbon emissions in the Netherlands by combining household expenditure and individual time-use data. Both types of data demand considerable resources from researchers and participants alike and they are therefore usually not collected together.
In this presentation, we introduce a preliminary framework for integrating multiple sources of data with each other by combining various sources of Dutch register data. This approach not only enables the combination of multiple surveys beyond their original samples but also opens up new possibilities for more inclusive, representative, and scalable social research.
Ultimately, the project contributes to the innovation of data integration methods in the social sciences, offering a responsible and efficient way to expand the analytical reach of surveys and to generate new insights into the social foundations of (climate-relevant) behaviour.
SESSION 3
Jim Achterberg - Synthetic data in Health Technology Assessments
Healthcare Technology Assessments (HTAs), such as those performed for medical devices and digital health applications, require high-quality datasets to prove efficacy and safety. In practice, however, medical data poses significant challenges regarding privacy, bias, fairness, and missingness. That is, medical datasets are often too sensitive to freely share across stakeholders within HTA lifecycles, unfairly represent patient populations, or contain too little data due to high data collection costs.
Synthetic data provides a potential solution. Synthetic data is algorithmically generated data which can augment or replace real datasets for the purpose of privacy-preservation or rebalancing imbalanced datasets. One of the main drawbacks of synthetic data, however, is the inherent tradeoff between realism (fidelity) and privacy-preservation. That is, increasingly realistic synthetic data is more similar to the real dataset that it was based on, and therefore risks leaking sensitive medical information. Fortunately, the same tradeoff does not necessarily hold for usefulness of synthetic data in a particular task, such as those found in HTAs. We developed a method for synthetic data generation which directly optimizes for task-specific usefulness instead of fidelity, entitled Fidelity Agnostic Synthetic Data, which is therefore especially useful in the HTA context.
Thom Volker - Being certain about the uncertain: Prediction intervals with missing data
The development and application of prediction models is often complicated by missing data. Most statistical methods do not readily allow for incorporating missing values during estimation. Consequently, the model of choice cannot be estimated. Imputation can be a solution: the missing values are replaced by values drawn from an imputation model, after which the prediction model can be estimated. Broadly, two strategies can be defined: single and multiple imputation. With single imputation, each missing cell is filled in with the most probable value given the other variables in the data, under some model. Historically, single imputation has been the method of choice in a prediction context. It is simple, fits readily into subsequent workflows, and does not necessarily harm predictive accuracy. However, if the interest is in accurate quantification of predictive uncertainty, single imputation is problematic. Attempting to reconstruct the missing value precisely is often naïve, and by doing so, the distribution of the data is heavily distorted. Importantly, the variability in the data is reduced, resulting in uncertainty estimates that are too small.
We propose a multiple imputation-based workflow that yields valid uncertainty estimates. Multiple imputation accounts for the uncertainty around the missing values by drawing imputations from a distribution. The uncertainty in the imputations is propagated through the prediction model, and the scale of this uncertainty can be used to quantify the prediction uncertainty using Rubin’s rules. We show that, using this approach, valid prediction intervals can be constructed in a linear model setting, regardless of whether missing data occur in the training data, the test data, or both. Moreover, the prediction interval is valid regardless of the amount of missing data, and it is individualized, in the sense that it scales with the amount of missing data an individual record has. We implemented the required estimation method in the R package mice.
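The pooling step via Rubin's rules can be sketched numerically: given m imputed-data predictions for one new record, the within- and between-imputation variances combine into one interval. This is an illustrative stand-alone computation with invented numbers, not the mice implementation:

```python
from statistics import NormalDist, mean, variance

def rubin_interval(point_preds, within_vars, alpha=0.05):
    """Pool m predictions for one record into a prediction interval."""
    m = len(point_preds)
    qbar = mean(point_preds)      # pooled point prediction
    w = mean(within_vars)         # average within-imputation variance
    b = variance(point_preds)     # between-imputation variance
    t = w + (1 + 1 / m) * b       # total variance (Rubin's rules)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal quantile; a t-quantile
                                             # with adjusted df is more exact
    half = z * t ** 0.5
    return qbar - half, qbar + half

# Predictions for one record from m = 5 imputed datasets, each with an
# estimated predictive variance of 0.20.
lo, hi = rubin_interval([4.8, 5.1, 5.0, 5.3, 4.9], [0.20] * 5)
```

Because the between-imputation variance b grows with the number of imputed cells in a record, the interval widens for records with more missing data, which is the individualization property described above.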
Rachel de Jong - Anonymity and disclosure control for network data
Networks offer unique opportunities to enhance our understanding of the structure of our society and economy. However, since these network datasets can contain information on individual entities, publishing or sharing them can lead to a breach of privacy. Even after pseudonymization, i.e., removing identifiers such as a person’s name, individuals can still be identified based on their surrounding network structure. In this work, we study k-anonymity for networks. Moreover, we study the effect of different attacker scenarios, in which an attacker who tries to obtain sensitive information from the network differs in the amount of knowledge available. Additionally, we extend the notion of k-anonymity to account for uncertainty in attacker knowledge. Finally, we investigate perturbation-based anonymization approaches to preserve anonymity.
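A toy illustration of structural re-identification (not the authors' method): if an attacker knows only a person's number of contacts, each node's anonymity is the count of nodes sharing its degree, and the graph is k-anonymous for the smallest such count. Graph and names are invented:

```python
from collections import Counter

def degree_anonymity(adjacency):
    """Per-node anonymity set sizes under a degree-knowledge attacker."""
    degrees = {v: len(nbrs) for v, nbrs in adjacency.items()}
    freq = Counter(degrees.values())           # how many nodes share each degree
    per_node = {v: freq[d] for v, d in degrees.items()}
    return per_node, min(per_node.values())    # node-level counts, overall k

graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
per_node, k = degree_anonymity(graph)
# "c" (degree 3) and "d" (degree 1) are unique, so k = 1: an attacker who
# knows their degree can single them out even after pseudonymization.
```

Richer attacker scenarios, such as knowledge of a node's surrounding subgraph, shrink these anonymity sets further, which is why the amount of attacker knowledge matters in the analysis above.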