Digest 2021 #1: Data and Democracy
Welcome to my first digest, on the theme of ‘Data and Democracy’. In this article I’ll share some of the best resources I’ve found in the last few months of 2021, covering science, data, statistics, modelling, policy, and democracy.
Parker, Simon. ‘Conservative Anarchism, Self-Organisation and the Future of Government’. Medium (blog), 22 June 2020. https://medium.com/@SimonFParker/conservative-anarchism-self-organisation-and-the-future-of-government-2ef5447b7f02.
Simon Parker is a civil servant in the UK working for Redbridge Council in London. I found this article by web-searching for ‘“systems thinking” conservatism’, as I was curious to see if anyone had examined how systems thinking can fit into a conservative political-philosophical mindset. Parker’s article doesn’t look at this, but I am very glad I found it because it has been one of the most interesting and exciting articles I’ve read all year. Like a good brownie, this article is dense but not overwhelming: it’s chock full of insights but still manages to be readable over a cup of tea. At the end Parker remarks that they were “going to write a book, but decided to splurge the ideas down as a short essay instead”, and I would love to see some of these ideas developed further.
In this article, Parker discusses the origins and failure of New Public Management; the application of agile development, design thinking, and systems thinking to policy-making; and self-organisation as an end goal for the next evolution of democracy. To illustrate these ideas, Parker comments on recent political elections in the UK and USA; previous governments in the UK; US-Mexico border legislation; the effect of austerity on local government budgets; housing policy and the competing interests at play in the housing system; Occupy, Enspiral, and the Sunflower Movement; and Brexit, Trump, and Extinction Rebellion.
Overall, this is a very insightful read for anyone interested in public policy, democracy, and/or organisation design.
boyd, danah. ‘Statistical Imaginaries’. Data: Made Not Found (by danah) (blog), 1 December 2021. https://zephoria.substack.com/p/statistical-imaginaries.
In this talk (available as a transcript or a video of a presentation boyd gave at the Microsoft Research Summit 2021, linked via the post), danah boyd explains the inherently political nature of data, and of census data in particular. Various aspects of data and its politicisation are discussed, and two issues stand out to me. First, the way that uncertainty is indicated and communicated to policy-makers and the public affects the perception of data and statistics as authoritative and accurate, with scientists and statisticians having to make trade-offs between scientific honesty and the appearance of correctness. Second, the choice of sampling methods and of the questions included in a survey is deeply political and motivated by the interests of politicians and policy-makers, who do not want data that reveals weaknesses in their policies. This shaping of data and data-gathering processes takes the form of ‘wilful ignorance’, where certain data remain ungathered so that they cannot be used to develop a critical narrative. boyd also discusses differential privacy, and the wider gap between the vision of data as a source of raw, unbiased truth and the reality of data as complicated, opaque technologies that obscure the assumptions and biases embedded within them.
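On the differential privacy point: here is my own minimal sketch of the Laplace mechanism, one textbook way of achieving differential privacy for a count query. It is only meant to make the privacy-versus-accuracy trade-off concrete, and is not any statistical agency's actual implementation.

```python
import numpy as np

def dp_count(true_count: float, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Adding Laplace noise with scale sensitivity/epsilon masks the contribution
    of any single individual (who can change a count by at most `sensitivity`).
    """
    noise = np.random.default_rng().laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# A hypothetical small-area population count of 1,200 people.
# Smaller epsilon means stronger privacy but noisier statistics: exactly the
# kind of trade-off boyd argues must be communicated honestly.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon:>4}: released count ≈ {dp_count(1200, epsilon):.1f}")
```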
Recommended for anyone who works with data, including policy-makers.
‘2021 EU Conference on Modelling for Policy Support: Collaborating across Disciplines to Tackle Key Policy Challenges’. Knowledge for Policy, European Commission. Accessed 27 December 2021. https://knowledge4policy.ec.europa.eu/event/2021-eu-conference-modelling-policy-support-collaborating-across-disciplines-tackle-key_en.
The 2021 EU Conference on Modelling for Policy Support took place over five days in November and was incredibly engaging and interesting. I summarised some of the key themes I picked up on in a message to the JISCMail SIMSOC mailing list, which I reproduce here:
Three other themes arose during the EU conference which I think are relevant to this discussion [on SIMSOC] of validation: co-production with stakeholders; communication of results to stakeholders; and management of uncertainty. These overlap but I will briefly discuss each in turn.
Many modellers at the conference extolled the value of designing models collaboratively with stakeholders such as domain experts, policy-makers, service users, and citizens. In particular, it was suggested that these people should be brought into the modelling process as early as possible: to collaboratively determine the questions we want our models to answer; to explain the limits of models; and to determine data requirements and other needs of the modellers, such as expert knowledge on certain processes. Building on this suggestion is the idea of ‘iterated’ model development, where complexity is ‘layered in’ to the model and stakeholders are consulted at each iteration: workshopping system dynamics or agent behaviours, and discussing results. This co-production approach offered two benefits. Firstly, it allowed stakeholders to build trust in the model, because their feedback was incorporated throughout the process and because they were better able to understand the limitations of the models. Secondly, it resulted in more robust models in terms of ‘process validation’, because it allowed modellers to validate process assumptions very quickly. I also heard at least twice at the conference that sometimes ‘the process is more important than the result’, i.e. we can learn a lot just by undertaking (collaborative) modelling, and these learnings are valuable in themselves, separately from the outputs generated by the models.
Many stakeholders are non-technical and time-constrained, so it is important to communicate model results appropriately: know your audience. As modellers it is essential that we understand the needs of those who consume our models and their results: what do they actually want to know? What are their questions? Communication of uncertainty and variance is particularly fraught, because large variances in projections for different regions/populations/contexts can cause scepticism about the results, even if this variance is generated by true underlying differences between these contexts. This is doubly true for any measures of uncertainty (e.g. intervals on an effect size vs the mean/expected effect size): “if the model is accurate, why are you so uncertain about this output?” Two suggestions for communicating uncertainty and variance stood out to me: explain it, and eliminate it. With explanation, we can look at the drivers of the uncertainty/variance: what parameter or input is driving this output? We can then communicate this to stakeholders in a narrative fashion: “Italy has different land availability compared to France, so the main driver for agricultural changes in Italy is repurposing of land, whereas France will be more reliant on technological innovation”. With elimination, if we can really pin down the question our stakeholders want answered, we can perform further analyses on our results which eliminate uncertainty at the cost of granularity. Do we need to present a ranking of each region on some metric, or do we just need to cluster similar regions and talk about a qualitative profile instead? [I sketch this clustering idea in code below, after the reproduced message.] The research question is as much a tool for framing and presenting our research as it is a problem to be solved. :) See Session 9 from the conference for more on these ideas.
Finally, drilling into the uncertainty question more: how do we as researchers begin to understand and quantify the uncertainty in our models? The ‘sensitivity analysis’ is the most common tool we have for exploring uncertainty, but many modellers expressed their inability to conduct sensitivity analyses due to time constraints! Personally I feel uncomfortable with this: if we are modelling to advise policy, a thorough understanding of uncertainty is essential, because we are literally messing with people’s lives, and we need to be damn sure that we will not cause adverse outcomes. A comprehensive understanding of the risk-reward trade-off is essential, and if we can’t develop confidence about the scale of potential risk, then perhaps we should default to ‘fail-safe’ policies, which are less likely to have catastrophic adverse effects if they fail, even if the potential pay-off of success is only small. I believe we need new or more advanced tools for sensitivity analysis, including fuzzing of input datasets and techniques for uncertainty propagation, which allow us to determine how uncertainty ‘flows’ through a model; indicate which assumptions we have least confidence in; and determine which areas of our models we need to really ‘pin down’, thus directing future research. Frederik [Schaff] stated that “there is no standard way to understand / analyse causation generically”, but I have become quite enamoured with the causal inference literature and the use of directed acyclic graphs (DAGs) for modelling the data-generating process (DGP), and I feel there is some potential to integrate these causal inference tools further with system dynamics, microsimulation, and agent-based modelling.
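To make that last point a bit more concrete, here is a minimal sketch of Monte Carlo uncertainty propagation and a crude sensitivity analysis. The ‘model’ is an entirely hypothetical toy, and correlation is a blunt sensitivity measure (variance-based methods such as Sobol indices would be more rigorous), but it illustrates the workflow of fuzzing inputs and seeing how uncertainty flows through to the outputs:

```python
import numpy as np

# Entirely hypothetical toy policy model: the outcome depends on an uptake
# rate, a unit cost, and an effect size. Real policy models are far richer;
# this exists only to demonstrate the mechanics of the analysis.
def toy_model(uptake, unit_cost, effect_size, budget=1_000_000):
    people_reached = budget / unit_cost * uptake
    return people_reached * effect_size

rng = np.random.default_rng(42)
n = 10_000

# 'Fuzz' each uncertain input by sampling it from a plausible distribution.
uptake = rng.uniform(0.2, 0.8, n)
unit_cost = rng.normal(50.0, 10.0, n).clip(min=1.0)
effect_size = rng.normal(0.05, 0.02, n)

# Propagate the input uncertainty through the model.
outcome = toy_model(uptake, unit_cost, effect_size)

# Crude sensitivity measure: correlation of each input with the outcome.
for name, samples in [("uptake", uptake),
                      ("unit_cost", unit_cost),
                      ("effect_size", effect_size)]:
    r = np.corrcoef(samples, outcome)[0, 1]
    print(f"{name:12s} correlation with outcome: {r:+.2f}")

# The spread of `outcome` is the propagated uncertainty: report its quantiles,
# not just its mean, when advising policy.
print("outcome 5th-95th percentile:", np.percentile(outcome, [5, 95]).round(0))
```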
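And on the ‘eliminate it’ suggestion from the communication discussion above: a minimal sketch, with made-up results for hypothetical regions, of replacing a noisy ranking with a handful of qualitative profiles via clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
regions = [f"region_{i}" for i in range(12)]

# Made-up model outputs for 12 hypothetical regions: projected contribution of
# land repurposing vs technological innovation to agricultural change.
outputs = np.column_stack([
    rng.uniform(0.0, 1.0, 12),   # share of change driven by land repurposing
    rng.uniform(0.0, 1.0, 12),   # share of change driven by technology
])

# Instead of ranking regions on a single, uncertain metric, group them into a
# few clusters and describe each cluster as a qualitative profile.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(outputs)

for cluster_id in range(3):
    members = [r for r, label in zip(regions, kmeans.labels_) if label == cluster_id]
    land, tech = kmeans.cluster_centers_[cluster_id]
    print(f"Profile {cluster_id}: land-driven≈{land:.2f}, tech-driven≈{tech:.2f} -> {members}")
```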
Recordings of all the conference sessions (excluding workshops) are available on the European Commission website, covering many topics and models in domains as varied as banking, epidemiology, and climate change adaptation. Recommended viewing for policy-makers, and for researchers who work with or produce evidence for policy-makers.
The OR Society. ‘A Guide to Automated Forecasting of Criminal Court Cases at the Ministry of Justice - Ben Lavelle’, 2019. https://www.youtube.com/watch?v=_j92yPEmIrM.
This half-hour presentation by UK civil servant Ben Lavelle was recommended to me by another UK civil servant, Sarah Livermore. Lavelle gave the talk to the Operational Research Society to share recent work at the Ministry of Justice on a new data operations pipeline, built to improve the reproducibility and user experience of producing datasets from court case files. It’s really exciting to see modern software engineering tools and practices like version control and continuous integration/deployment making their way through to the civil service to support such important work.
Recommended watching/listening for software engineers, data engineers, researchers, and analysts with an interest in justice, policy, and reproducible, transparent, and production-oriented workflows.
Bounegru, Liliana, and Jonathan Gray, eds. The Data Journalism Handbook, 2021. https://www.aup.nl/en/book/9789048542079/the-data-journalism-handbook.
Data journalism involves the use of data and visualisation to discover, understand, and communicate stories about the real world. This open access book draws together over 50 short write-ups about various data journalism projects and the experiences of their investigators.
The large number of short chapters makes this work very easy to dip into if you’re looking for an interesting, quick read while you sip your preferred hot beverage. Ideal for anyone interested in journalism, data, and research ethics/justice.
McElreath, Richard. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2020.
After discussing some statistical modelling with my friend Joe Atkins-Turkish, I was motivated to seek out a textbook on Bayesian statistics, to make sure I wasn’t missing any philosophical nuances in my understanding of the Bayesian vs. frequentist approaches. Statistical Rethinking turned out to be so much more than that. In this graduate-level text, McElreath covers Bayesian analysis; model comparison via cross-validation and information criteria; multilevel models; and graphical causal models à la the causal inference work of Judea Pearl. In other words, this textbook explains many of the advanced methods that anyone working with data would want in their toolkit.
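The book's code is in R and Stan; as a small taste of the kind of Bayesian machinery it builds up, here is my own illustrative Python sketch (not an example from the book) of grid approximation of the posterior for a simple binomial proportion:

```python
import numpy as np

# Grid approximation of the posterior for a binomial proportion p:
# observe 7 successes in 10 trials, assume a flat prior, and compute the
# (unnormalised) likelihood at each candidate value of p on a grid.
grid = np.linspace(0, 1, 1000)
prior = np.ones_like(grid)

successes, trials = 7, 10
likelihood = grid**successes * (1 - grid)**(trials - successes)

posterior = likelihood * prior
posterior /= posterior.sum()          # normalise so the posterior sums to 1

# Summarise the posterior by sampling from it: mean and an 89% percentile
# interval (an interval McElreath favours in the book precisely because
# there is nothing special about 95%).
samples = np.random.default_rng(1).choice(grid, size=10_000, p=posterior)
print("posterior mean:", round(samples.mean(), 3))
print("89% interval:", np.percentile(samples, [5.5, 94.5]).round(3))
```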
So far I’ve only read the first chapter, which likens statistical (and process/causal) models to the Golem of Prague; McElreath’s argument there parallels one of my own adages:
Computers are very loyal and very stupid: they will only do exactly what you tell them to, no more, no less.
This first chapter is very approachable but dense with insights. McElreath comments on the need to understand the models we use, the distinction between statistical models and process or causal models, and the issues with hypothesis testing and falsificationism, before going on to justify the necessity of the tools and methods he documents throughout the rest of the book.
I’d recommend the first chapter alone to anyone working with or interested in data, statistics, or science.