Early in my tenure at RAND, I helped develop methods for blending probability and nonprobability data. High-quality sources of survey data (i.e., probability samples) often do not yield large enough samples when a hard-to-reach segment of a population is targeted, and researchers often attempt to supplement them with nonprobability convenience samples. I originally explored these approaches for a survey of caregivers of veterans of the US Armed Forces who served following September 11, 2001, using a convenience sample of caregivers drawn from the Wounded Warrior Project (Robbins et al., 2021).
Motivated by the potential utility of nonprobability data, I was awarded a competitive NSF grant (Award #1837959, $991,127) to explore methods for producing generalizable inferences from highly non-representative big data sources (e.g., social media). The proposed case study involved using data from Twitter (i.e., tweets) to gauge public opinion on political candidates in real time during an election cycle. The difficulty of generalizing from Twitter users to a broader population is rooted in the fact that little is known about them: the basic demographic characteristics that are usually used to develop survey weights (such as gender, age, race, and education) are unknown for Twitter users. The proposed work involved designing and administering a survey to both a probability sample of the US population (collected through the NORC AmeriSpeak Panel) and a convenience sample of Twitter users (collected using a targeted advertising campaign on Twitter). The survey collected respondents' basic demographics along with their political beliefs and social media usage patterns.
First, the research team, which I assembled and led, applied the prior methods (Robbins et al., 2021) to illustrate that it is possible to adjust the Twitter convenience sample to be representative of the broader population of US adults (Pollard et al., 2026). We then used information available for all Twitter users (i.e., their user profiles and tweets) to develop proxies for their demographic and related characteristics. These proxies are developed using regularized regression, with survey characteristics as outcomes and 100,000+ indicators derived from the user profiles and tweets as predictors, and are shown to closely replicate the intended characteristics. The Twitter sample, when weighted to generalize to the broader population, can be used to derive population benchmarks for these proxy variables. We then collected a “universe” of tens of thousands of Twitter users and employed sentiment analysis to quantify the degree of approval or disapproval expressed in their tweets toward presidential candidates in the 2020 election cycle. Fine-tuned large language models were shown to be highly effective at capturing benchmark sentiment scores (Griswold et al., 2025). Using the population benchmarks of the proxy variables, we weighted the Twitter universe to generalize to the US adult population, which enabled us to develop a Twitter-based estimate of political approval that we argue is representative of US adults and can be tracked over time throughout an election cycle. We illustrated that this score (calculated using the weighted universe) mimics contemporaneous public opinion polling but is more responsive to major events such as presidential debates and the Capitol Hill riots of January 6, 2021.
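The final weighting step can be illustrated with a minimal sketch. It assumes a single categorical proxy and simple poststratification, which is a stand-in for whatever calibration the project actually used across many proxies. All function names, category labels, weights, and sentiment values below are hypothetical toy inputs, not project data.

```python
from collections import defaultdict


def weighted_share(values, weights):
    """Weighted share of each category (the population benchmark)."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {cat: total / grand_total for cat, total in totals.items()}


def poststratify(cells, benchmarks):
    """Weights that make the weighted cell shares equal the benchmarks.

    Raises KeyError if a cell has no benchmark (assumption: every
    universe category also appears in the weighted survey sample).
    """
    counts = defaultdict(int)
    for cell in cells:
        counts[cell] += 1
    n = len(cells)
    return [benchmarks[cell] / (counts[cell] / n) for cell in cells]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Step 1: the surveyed Twitter sample, already weighted to US adults
# (hypothetical weights), supplies the benchmark for the proxy variable.
survey_proxy = ["under_40", "under_40", "under_40", "40_plus"]
survey_weights = [0.5, 0.5, 1.0, 2.0]
benchmarks = weighted_share(survey_proxy, survey_weights)

# Step 2: the unsurveyed "universe" has only the proxy and a sentiment score.
universe_proxy = ["under_40", "under_40", "under_40", "40_plus"]
universe_sentiment = [0.8, 0.6, 0.4, -0.2]
universe_weights = poststratify(universe_proxy, benchmarks)

# Step 3: compare the unweighted and weighted approval scores.
raw_score = sum(universe_sentiment) / len(universe_sentiment)
weighted_score = weighted_mean(universe_sentiment, universe_weights)
print(round(raw_score, 3), round(weighted_score, 3))
```

In this toy case the universe over-represents the younger group relative to the benchmark, so the weights pull the approval score toward the older group's sentiment.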
References
Journal Articles
- Blending of probability and convenience samples as applied to a survey of military caregivers
M. W. Robbins, B. Ghosh-Dastidar, and R. Ramchand
Journal of Survey Statistics and Methodology, 2021
Probability samples are the preferred method for providing inferences that are generalizable to a larger population. However, in many cases, this approach is unlikely to yield a sample size large enough to produce precise inferences. Our goal here is to improve the efficiency of inferences from a probability sample by combining (or blending) it with a nonprobability sample, which is (by itself) potentially fraught with selection biases that would compromise the generalizability of results. We develop novel methods of statistical weighting that may be used for this purpose. Specifically, we make a distinction between weights that can be used to make the two samples representative of the population individually (disjoint blending) and those that make only the combined sample representative (simultaneous blending). Our focus is on weights constructed using propensity scores, but consideration is also given to calibration weighting. We include simulation studies that, among other illustrations, show the gain in precision provided by the convenience sample is lower in circumstances where the outcome is strongly related to the auxiliary variables used to align the samples. Motivating the exposition is a survey of military caregivers; our interest is focused on unpaid caregivers of wounded, ill, or injured US servicemembers and veterans who served following September 11, 2001. Our work serves not only to illustrate the proper execution of blending but also to caution the reader with respect to its dangers, as invoking a nonprobability sample may not yield substantial improvements in precision when assumptions are valid and may induce biases in the event that they are not.
- A Demonstration of Propensity Score Weighting to Adjust a Social Media Nonprobability Sample Survey of Political Attitudes
M. Pollard, M. W. Robbins, and M. G. Griswold
Public Opinion Quarterly, 2026
Interest in using nonprobability online samples continues to grow despite concerns about selection bias. Many methods exist for adjusting nonprobability data so it may yield generalizable inferences. Here we investigate whether a propensity score weighting method can balance differences between a probability sample and a nonprobability sample of Twitter (now X) users to evaluate the feasibility of using social media data for producing generalizable inferences on public opinion. We fielded identical surveys to 2,001 probability-sampled respondents (June 30-July 22, 2022) and 949 Twitter users (March 1-July 13, 2022); final analytic sample sizes were 1,972 and 822, respectively. The nonprobability sample differed significantly in demographic characteristics (younger, lower income, higher educational attainment), and broadly endorsed significantly more liberal attitudes toward a range of political and policy issues than the probability sample. We show that the propensity score weighting procedure, using demographics, techno/psychographics, and political ideology, reconciles differences between the samples for 25 of the 27 attitudes assessed. The results demonstrate the feasibility and utility of the propensity score weighting procedure to replicate a probability sample with nonprobability social media data and add to the literature on the use of nonprobability samples to draw population-level inferences.
- Stay Tuned: Improving Sentiment Analysis and Stance Detection Using Large Language Models
M. G. Griswold, M. W. Robbins, and M. Pollard
Political Analysis, 2025
Sentiment analysis and stance detection are key tasks in text analysis, with applications ranging from understanding political opinions to tracking policy positions. Recent advances in large language models (LLMs) offer significant potential to enhance sentiment analysis techniques and to evolve them into the more nuanced task of detecting stances expressed toward specific subjects. In this study, we evaluate lexicon-based models, supervised models, and LLMs for stance detection using two corpuses of social media data—a large corpus of tweets posted by members of the U.S. Congress on Twitter and a smaller sample of tweets from general users—which both focus on opinions concerning presidential candidates during the 2020 election. We consider several fine-tuning strategies to improve performance—including cross-target tuning using an assumption of congressmembers’ stance based on party affiliation—and strategies for fine-tuning LLMs, including few shot and chain-of-thought prompting. Our findings demonstrate that: 1) LLMs can distinguish stance on a specific target even when multiple subjects are mentioned, 2) tuning leads to notable improvements over pretrained models, 3) cross-target tuning can provide a viable alternative to in-target tuning in some settings, and 4) complex prompting strategies lead to improvements over pretrained models but underperform tuning approaches.