Recorded on March 8, 2023, this video features a lecture by Jo Guldi, Professor of History and Practicing Data Scientist at Southern Methodist University. Professor Guldi’s lecture was entitled “Towards a Practice of Text-Mining to Understand Change Over Historical Time: The Persistence of Memory in British Parliamentary Debates in the Nineteenth Century.”
Co-sponsored by Social Science Matrix, the UC Berkeley Department of History, and D-Lab, this talk was presented as part of the Social Science / Data Science event series, a collaboration between Social Science Matrix and D-Lab.
Abstract
A world awash in text requires interpretive tools that traditional quantitative science cannot provide. Text mining is dangerous because analysts trained in quantification often lack a sense of what could go wrong when archives are biased or incomplete. Professor Guldi’s talk reviewed a brief catalogue of disasters created by data science experts who voyage into humanistic study. It found a solution in “hybrid knowledge,” or the application of historical methods to algorithms and analysis.
Case studies engaged recent work from the philosophy of history (including Koselleck, Erll, Assmann, Tanaka, Chakrabarty, Jay, Sewell, and others) and investigated the “fit” of algorithms with each historical frame of reference on the past. The talk profiled recent research into the status of “memory” in British politics: the persistence of references to previous eras in British history, to historical conditions per se, and to futures hoped for and planned, using NLP analysis. It presented the promise and limits of text-mining strategies such as Named Entity Recognition and Parts of Speech Analysis for modeling temporal experience as a whole, suggesting how these methods might support students of social science and the humanities, and also revealing how traditional topics in these subjects offer a new research frontier for students of data science and informatics.
About Jo Guldi
Jo Guldi, Professor of History and Practicing Data Scientist at Southern Methodist University, is author of four books: Roads to Power: Britain Invents the Infrastructure State (Harvard 2012), The History Manifesto (Cambridge 2014), The Long Land War: The Global Struggle for Occupancy Rights (Yale 2022), and The Dangerous Art of Text Mining (Cambridge forthcoming). Her historical work ranges from archival studies in nation-building, state formation, and the use of technology by experts. She has also been a pioneer in the field of text mining for historical research, where statistical and machine-learning approaches are hybridized with historical modes of inquiry to produce new knowledge. Her publications on digital methods include “The Distinctiveness of Different Eras,” American Historical Review (August 2022) and “The Official Mind’s View of Empire, in Miniature: Quantifying World Geography in Hansard’s Parliamentary Debates,” Journal of World History 32, no. 2 (June 2021): 345–70. She is a former junior fellow at the Harvard Society of Fellows.
Listen to the lecture below, or on Google Podcasts or Apple Podcasts.
Transcript
Jo Guldi: Towards a Practice of Text Mining to Understand Change Over Historical Time
[MUSIC PLAYING]
[JAMES VERNON] Good afternoon, everyone. My name is James Vernon. And I am a teacher in the history department. And I’m delighted to be able to welcome Jo Guldi back to Berkeley today.
Before I get going, I have been told by Marion that I need to remind everyone that the event is co-sponsored by the History Department; and by the Data Lab, D-Lab; and by the Social Science Matrix. So there’s a great team of co-sponsorship that speaks to the enormous interest in Jo’s work across the campus.
I’ve known Jo for probably close to 20 years, which is a little embarrassing– 15, for sure, but not far off 20. She’s one of the most fearless and most energetic scholars that I know. And before I try and explain something about the work that she does, I should let you know that she already published more books than I have. And I’m probably almost twice her age.
Her first book is very special to me because we worked on it a little together. It was published by Harvard in 2012, and it’s called Roads to Power– Britain Invents the Infrastructure State. Jo did her graduate work here at Berkeley. But in a way that is absolutely characteristic of Jo and the types of work that she does, she came out of comp lit into urban planning and architecture to settle finally in history.
But while she was in history, she was already very much taking the digital turn and helping to build the conversations that ended up at the D-Lab here on campus. Her next book was The History Manifesto that she co-authored with– oh, this is embarrassing. I’m blanking on his name.
[JO GULDI] David Armitage.
[JAMES VERNON] David Armitage, the very distinguished professor in my field of work and a professor at Harvard. And then last year, she published the book that you can see on the table here, which she was talking about at lunchtime today, The Long Land War– The Global Struggle for Occupancy Rights, which is published very beautifully by Yale. And because Jo is Jo, there is now another book that she is going to be talking about today called The Dangerous Art of Text Mining, which is coming out very shortly with Cambridge University Press with this very striking image as its front cover.
So what unites or what brings together the different types of writing and scholarship that Jo does? What I love about Jo’s work is that it’s constantly pushing us to think methodologically about how we produce scholarship, but it’s also, in that process, continually urging us to think about questions of scale. And for historians especially, that’s a really profound question.
And it’s led Jo to a very intense engagement with the growing field of digital humanities, to think about the way in which, working through methods like text mining, we can work on a scale, both geographically and temporally, that historians are usually very uncomfortable working at. Because historians tend generally to be very grounded in an archive that gives them a very specific relationship to time and to space. And Jo has always been pushing at those boundaries in the way in which she’s traveled between archival scholarship and her work in terms of digital methods.
The other thing that I want to say about Jo’s work, which I think brings together the different projects that she’s worked on, is, to me, a continuing preoccupation with two questions. One is about the nature of power.
Often, for Jo, that’s about thinking about state formation and the set of new technologies of power that came into the world in the 19th and the 20th century. But Jo has always been deeply interested in those groups of people who mobilize against those new technologies of power and of regulation. And I think The Long Land War and the book Roads to Power are really great examples of that type of scholarship.
And then the other element that was there for Jo, I think even when she began in comp lit was an absolute preoccupation with what she began by talking about as landscape. And Jo was profoundly influenced by scholars in the history of architecture and urban planning around the politics of landscape. But the question of land and understandings of property and property rights have really been at the center of the types of historical questions that she’s been asking in her scholarship.
So all of which is to long-windedly key up the talk that she’s going to do for us this afternoon on The Dangerous Art of Text Mining– A Methodology for Digital History, which I’m delighted to say, as someone who was trained in the history of politics, is going to be dealing in part with the vast amounts of text that were collected by various forms of publication, including Hansard, the official collection of the Houses of Parliament in Britain. Jo, it’s wonderful to have you here. And we’re really looking forward to your talk.
[JO GULDI] Thank you so much, James. It’s really a delight to be here. It’s a delight to be among old friends, old advisers, new friends at a time when Berkeley is initiating initiatives like The Matrix and the D-Lab and text mining and data science. I hope to hear more about those conversations at dinner, including getting the names right.
But I’m very grateful to Marion, in particular, for facilitating this in such a way that such a diverse set of audiences could come together, and to the History Department for being willing to host a nonconventional history talk. We had a great discussion of the archival book. We talked about archives at noon. So if you were there, hi again. And if you weren’t there, archives are real. But we’re not talking about them right now. Right now, we’re going to talk about data.
So there is no doubt that today, social science is becoming a big data subject. As part of that process, fields make new discoveries. In political science, economics, and sociology, investigations into robustness, inductive versus deductive thinking, correlation versus causality, and false positives have generated important new standards.
Because history is both a social science as well as a qualitative discipline in the humanities with an appreciation for description, our processes will look different as history starts to engage data. It will need new standards of validation. Temporal experience as a criterion of successful discovery means slightly different work than the work proposed in fields which take certainty as the standard alone.
Because we are ultimately a positivist science descended from the Enlightenment, we in history must take an interest in those specific findings about events, periodizations, and other temporal experiences contained within our textual archives. And here, I believe could be the beginning of a productive new dialogue across the social sciences about the best practices for characterizing temporal change over time and validating conclusions made on the basis of algorithms.
So in today’s talk, I’d like to tell you about what validation might look like when applying text mining to the analysis of the experience of past events. I’ll be introducing three historical concepts– memory, periodization, and archive– and telling you about applying statistics and machine learning to gain insight about these categories of temporal experience. But first, I want to tell you a little about what text mining is, how validation is traditionally performed in history, and why I recommend working with standards of meaning from the field of history when engaging in text mining.
Mining refers specifically to the extraction of valuable metallic ores from the Earth and metaphorically to any process that extracts rare and valuable content from its surrounding context. Data mining begins with counting but includes statistical transformations to test for relationships, such as correlation and significance. And text mining typically treats text as the data in question.
It begins with computational transformations that break up and classify digital strings of archival text into units representing constituent words and phrases. We’re just counting words. Next, we might apply statistical manipulations in order to study those meaningful signals and their relationships– typically the kind that trained analysts from the humanities and social sciences would like to detect in the course of reading, but now carrying out that analysis on a bigger scale.
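A minimal sketch of that first step, not from the talk itself: tokenize a digitized string into word units and count the constituent words. The sample sentence is invented.

```python
# Break a string of archival text into word tokens and count them.
import re
from collections import Counter

text = "The hon. Member spoke of the Corn Laws, and of the Corn Laws again."

tokens = re.findall(r"[a-z]+", text.lower())  # split into constituent words
counts = Counter(tokens)                      # count each word

print(counts.most_common(3))  # [('the', 3), ('of', 2), ('corn', 2)]
```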
Validating the results of quantitative text mining represents a new terrain that requires thinking with the tools of multiple disciplines. In history, as with other fields in the humanities, problems of believability have been traditionally linked to forms of dense qualitative description of unique objects. So consider, for example, this surveyor’s notebook from Ireland in the 1880s, which includes notes on local farmsteads, the cost of rent, and economic investments by peasants, all of which is easy enough to translate into the numbers on a spreadsheet.
But there are also elements of the surveyor’s notebook that I would find in the archive that are more difficult to discern when abstracted into data. The shape of the book is about yay big, which means that it’s ideally organized so that it can fit into the pocket of a traveling coat. And I can take it with me on the railroad as I visit the farmsteads across Ireland. It’s a traveler’s notebook.
Reading up on the imprint leads me to discover that it was created in a woman-owned paper shop in Dublin for the purposes of accommodating travelers. So there are many travelers of the era who might be taking notes with devices such as this. It’s only by reading the invisible context of the data, not just the data itself, that one might learn about the surveyor’s politics and how his beliefs in designing a participatory economy factored into his new method of surveying the landscape, therefore, collecting rents and asking how much of the local investment originated with the tenant versus the landlord.
Relevant details for answering that question are located in this object as an object and in the biography of the surveyor and his other published writings, not in the words on the page that would be transcribed in the process of text mining.
So it follows that text mining only ever applies to a sliver of the available knowledge about the past. It offers no replacement for the archive as a whole or the practice of history as a whole, a field that’s potentially concerned with the life of surveyors as well as the agrarian rebellions and political movements with which they intersect.
So just to make it super clear, I’m not talking about creating a magic button that says, make history now, and outdating the entire history department. You’re still going to need them. Are we clear? Yes? OK, good.
The surveyor’s notebook also illustrates the kinds of expertise that have traditionally been necessary to manufacture a compelling interpretation of the past. The reader’s willingness to trust qualitative descriptions often hinges on forms of expertise such as paleography, rhetoric, the history of technology, or the history of the book. A deep training in these specialized skills, like how to read the surveyor’s handwriting, supports the individual historian’s skill of appreciating specific details that support meaningful interpretation of the object.
So we like our details. We like knowing that there’s expertise behind these facts and interpretations. And what this means is that if practitioners from history are going to trust me when I start touching computers, they’re going to want a similar level of maniacal engagement with detail from the new field of text mining.
They will want to test the bias of each data set as well as the limits and promise of every algorithm as well as the result of what happens when the algorithm is applied. It’s not going to be enough in history to treat the algorithm like a black box. It will get me jeered out of the history seminar.
Above all, historians expect some reckoning with meaning– with significant stories that have not previously been heard, yet which hold the potential of shaking up our understanding of the past, adding fundamentally new information to our stories about nations, institutions, individuals, and human experience in general.
Discussions of accuracy and meaningfulness are vital for text mining because the tools of text mining are by their nature reductive. They produce meaning by taking a knife to the data about the past, reducing past experience to a minimized selection of experience and information. Each data-driven visualization produced by text mining is merely one of the possible representations of what Timothy Tangherlini, in the front row, calls “the vast unread” behind the data analysis.
Every exercise in data mining works with a portion of possible truths housed in the totality of each historical archive, such that any given interpretation of data necessarily represents massive information loss. Yet, text mining is powerful if done right. If the analyses of text mining are visualized successfully, a tiny image offers at least potentially an outsized return on investment, distilling shelf-miles of text into a valuable, pithy representation of what those words did.
So a single visualization might reduce the story of a single institution’s politics and how they’ve changed over 100 years, as in this visualization of the State of the Union addresses over time. Or it might give us a mirror of what stories people have told us about how COVID was transmitted, or how American novelists present white characters and Black characters and how those representations in fiction have changed over time in a more diverse nation whose publishing industry has remained biased according to the standards of the 1950s.
But there are also many data-driven analyses of texts that purport to offer substantive insight and fail. Such failures of text mining occur when algorithmic distillations of text are misapplied, with the result of analyses that are empty, biased, or simply false. So failing to account for the limits and bias of archives is the major reason for retractions of several recent articles published in Science and the Proceedings of the National Academy of Sciences and attacked by historians.
For example, one representation of the history of the world, associated with an article in the journal Science and circulated on the Nature website, claimed to represent the history of migration all over the world but failed to acknowledge any activity by non-whites, including the transportation of enslaved humans across the Atlantic as a part of world history. It left out some important things.
A second problem sometimes encountered is failing to account for grammatical relationships, which creates representations of language based on bags of words that may miss something. You may have noticed that the Wordle used to be all over the front pages of American newspapers. It disappeared around the time that Trump was elected, possibly because the Trump speeches and the Obama speeches create exactly the same word cloud.
They both say “the American people” over and over and over again. That’s because all of the grammatical relationships and the two-word phrases are left out, which means that it’s just “economy,” “America,” “the people.” I wrote this without thinking about you in the front audience. But if you do what Tim Tangherlini does and you put the grammar words back in, then you have all of these other relationships, which are much more telling about the distinctiveness of populist speech on the left versus populist speech on the right.
So don’t leave out the grammar. A third problem is failing to engage with existing knowledge about the past, which results in data-driven text mining that’s simply redundant. For instance, famously, the publications by the Culturomics group, which invented n-grams, included among their conclusions the fact that Nazi Germany censored books. We already knew that. Not new. Not a discovery. And unimpressive to historians. Many chuckles in the history department about that one.
So in a qualitative/quantitative question like the determination of relevant historical events, validation is more than a matter of p-values and error bars. A validation practice that’s satisfying to historians as well as to theorists of information must demonstrate intimacy with algorithms and their parameters, as well as a technical interest in the results of those algorithms applied to historical questions. It must model temporal experience in a way recognizable to historians, with a capacity for innovation on what we already know about the past.
The former criterion requires an appreciation of statistics, the latter of history. In metallurgy, the way you get to the valuable ore is geology, the field that understands the geography and the physiognomy of the Earth. In human experience, the fields that understand meaning and value about what’s changing in the past are the humanities and social sciences, and history in particular.
We have already established that the appreciation of the bias in each archive is a matter of robust analysis. So allow me to introduce my data. I’m talking about the collection of the speeches of the UK parliament from 1806 to 1911. So these are the speeches of the House of Commons and the House of Lords.
The major speeches were recorded. More of them were recorded over time. Here are the members of the House of Commons. Gladstone is up front. The Lord Chancellor’s in the back. All of these people have papers. Often, it’s the text of the speech they intend to read.
Their speeches are going to circulate outside parliament to the rest of the nation. Because up there, in the gallery, there are journalists. Towards the end of the century, they’re taking notes by shorthand. Earlier, they have more primitive methods. They’re trying to write down all of the memorable speeches to be republished the next day in the newspaper.
Those speeches are bundled by one printer, in particular Thomas Curson Hansard. And so the printed speeches are known as Hansard. Hansard is my data set. It’s now a data set. It was turned into a data set for the first time in the 1990s. It was cleaned in the early 2000s. My lab cleaned more.
It’s comprehensive, periodic, and well preserved. One of the nice things about working with Hansard is that we know what’s in and what’s out. It’s not like the Google Books data set, which is a random assortment of printed text that could be novels or magazines or essays, biased in all sorts of directions.
We know how parliament was biased. Do you see any women? No. People of color? Not so much. We know. We know who they are and what’s in and what’s out. So we can describe it very well. And when we’re modeling change, we know what kind of change we’re talking about.
My data is, as I said, 1806 to 1911. It’s big data. It’s 46,000 individual speakers, 100,000 separate debates, and a million speeches, about a quarter billion words in all. It’s too much to read.
Also, we have established that the discipline of history represents a guide to the analysis of texts in the past that are meaningful. In my forthcoming book, The Dangerous Art of Text Mining, I argue for beginning with the building blocks of historical understanding. And I show that list here– memory, period, archive, event, influence, change over time, and modernization.
These are concepts from the field of history which have been heavily theorized. The philosophy of history has engaged some of these terms for a century to help us understand different elements of temporal experience. The historical past is not all one. There are multiple pasts.
So I believe that in bringing these concepts into dialogue with data science, it’s possible to advance towards a newly robust practice of digital history and also to add to the robustness, usefulness, and meaningfulness of the practice of data science itself. In this talk, I’m prepared to show approaches to three of the categories I mentioned here– memory, periodization, and archive. If time remains, I’ll join those case studies to one of my theoretical publications on critical search, about how to move from qualitative and quantitative data to meaning.
Can I put you to work? Please hand these out. So first, let’s take the concept of memory. Memory was introduced by Maurice Halbwachs at the beginning of the 20th century as distinct from history, which is the study of what actually happened in the past, in its totality. Memory is collective, anchored to place and to oral tradition. It’s the source of identity.
History is institutional and expert. It’s bound up with Enlightenment dialectics of argumentation about the truth of what happened. Memory studies traces the popular, partial, and often deeply political reception of the past, in contrast to the study of history proper, which applies social science to pursue the truth of the totality of past experience.
So historians have canonically studied memory through the creation of new rituals and monuments, whether Tudor funerary monuments like this one, which was created to tell you all that Richard Boyle, Earl of Cork, was a really cool dude; or the monuments that commemorated Civil War generals, which have become more problematic in a more enlightened age; or through such instances of invented rituals like the invention of academic regalia or the Scottish kilt.
In what follows, I’m concerned with acts of memorialization in speech. What happens when a politician says, I remember the Boston Tea Party? Well, they didn’t say that a lot in the British Parliament in the 19th century. They said, I remember the Glorious Revolution. But which events did they memorialize the most?
So let me show you a first instance– a very baby instance. And then we’ll get to the images in your handout in just a second. Let’s begin with the simplest possible search for memory.
I’m looking for numbers between 1066 and 1911 in the parliamentary debates that were mentioned more than 20 times in any given year. Now, this could misfire. It’s possible that somebody says, 1067, and they are counting bales of hay or talking about conversions from the metric system.
But the strong diagonal line suggests that the numbers are meaningful, because people in parliament tend to refer to legislation in their own year, the previous year, the next year. Next year, in 2024, we’re going to pass this bill. Last year, we passed that bill of 2022.
So we refer to those mentions a lot. Occasionally, they refer to deadlines expected in the future like that 1870 is the deadline for a diplomatic convention. Spot checking the numbers annotated in this graph confirms that they are almost universally references to years.
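A hedged sketch of this simplest possible search, assuming the corpus has been reduced to a mapping from year of debate to speech texts; the data structure and sample below are invented stand-ins, not the actual pipeline.

```python
# Count (year of debate, year mentioned) pairs for four-digit numbers
# between 1066 and 1911 found in the speeches.
import re
from collections import Counter

speeches = {1839: ["The Act of 1838 ...", "Ever since 1688 ..."]}  # invented

mentions = Counter()
for debate_year, texts in speeches.items():
    for text in texts:
        for match in re.findall(r"\b1[0-9]{3}\b", text):
            year_mentioned = int(match)
            if 1066 <= year_mentioned <= 1911:  # keep plausible years only
                mentions[(debate_year, year_mentioned)] += 1

# Pairs mentioned more than 20 times in a given year, ready to plot as a
# double timeline: x = year of mention (the debate), y = year mentioned.
points = [pair for pair, n in mentions.items() if n > 20]
```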
So the visual form of the chart is itself a modest innovation. I adopted a single dot plot to represent what I call a double timeline, meaning time is on the x-axis and time is on the y-axis. And the x here on the x-axis is the year of mention.
So this is the date of the political speech when Gladstone is saying the Glorious Revolution. On the y-axis is the year mentioned. So this is the year that’s mentioned in the text, the date that appears in the number.
And it lends itself to some starter observations. Some years are being added while others are dropping off each year, producing the diagonal. But that’s not all we find. I won’t tell you about all of the conclusions. You can say a lot of interesting things.
But one of the first interesting things to look for is strong vertical lines, suggesting a moment of memorialization. So that strong vertical line is 1838. This is the era of the Tamworth Manifesto, the making of the modern Conservative Party associated with Robert Peel. It’s a moment when Conservatives are reforming themselves in order to argue against Whig ideas about modernity. And the Conservative claim to an aristocratic idea of Britain is grounded in the conviction that Conservatives accurately represent the ancient aristocratic past, the glory days of Britain.
And so what do the Conservatives do in parliament? They start sprinkling their speeches with completely random associations from the Tudor and Stuart years. It’s not enlightening reading if you look at 1567 and 1592 and 1667 and ask, why are they debating these years?
They’re talking about the status of the earldom of Mar. They’re talking about Tudor funerary vestments like what priests are supposed to wear officially on certain days to celebrate certain sacraments. It’s not really important stuff. It’s not vital to the life of the nation. What they’re doing is symbolically signaling, I am attached to the ancient Tudor and medieval past. I know about all these things that happened in the Tudor age because I, a Conservative, have this special relationship to the past.
So you can imagine pursuing a question of memory like this by hand. But the success of this method confirms that the computer can, with great efficiency, identify changing relationships to the past simply by the simplest possible means of looking for numbers. Now, we get a very different perspective on memory and relationships to the past if we start applying different text mining methods.
And part of what I’m going to argue for, in terms of a validation process appropriate to history, is this process of exploring the data– exploratory data analysis via iterating over separate methods. So machine learning gives us something called named entity recognition, in which the computer recognizes the parts of speech at the level of the sentence.
And it makes a guess about which of these noun phrases, noun-like phrases might be the name of a person, the name of a place, or the name of an event based on suggestions by, for example, the different parts of speech around that noun-like phrase.
So if we ask the computer to guess about the events from the sentences in Hansard and then we give each of those events a number– those numbers are added by me– then we can organize them again in a double timeline. And we can track the phrases not mentioned with numbers but the actual phrase like the phrase “the French Revolution” over time.
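A sketch of that guessing step, using spaCy’s off-the-shelf English model as a stand-in for whatever NER system the lab actually ran; the sentence and year are invented, and whether the model tags a given phrase as an EVENT depends on the model and the context.

```python
# Ask a pretrained NER model which noun-like phrases look like events,
# then pair each detected event with the date of the speech.
import spacy

nlp = spacy.load("en_core_web_sm")  # its OntoNotes label set includes EVENT

speech_year = 1848  # invented example
doc = nlp("Honourable members will recall the French Revolution with dread.")

events = [(speech_year, ent.text) for ent in doc.ents if ent.label_ == "EVENT"]
# Accumulated over the whole corpus, these (year of speech, event) pairs can
# be organized into the same kind of double timeline as the year mentions.
```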
So, again, tracked over time, here’s the date of the speech. There’s the date of the actual event. This is not surprising at all in British history. There’s book after book about 19th century history that says the British were terrified that the French Revolution was going to happen in Britain and it would lead to the decapitation of the aristocracy and the seizure of their lands.
So they never stop talking about it. So it’s very nice that that’s confirmed. But what we didn’t know is that the Crimean War has that duration of sustained memory. In contrast, most of the colonial wars– the Boer War, the Zulu War, the Egyptian War, the Persian War– are forgotten about. There’s a tiny lingering shadow of memory.
So we can learn something about the persistence of memory. The other one that stands out is the Plan of Campaign. The Plan of Campaign is very important to me, because in the other book, I start with Ireland in the 1880s because of Irish activism.
This is the moment when the Irish tenants, who weren’t allowed to own or inherit land, rise up against the landlords and say, we want our land back. It’s called the Plan of Campaign. And they don’t stop talking about the Plan of Campaign after it happens, which tells you something about manufacturing memory, maybe on behalf of the Irish lobby.
But this is a troubling chart in another way, because there aren’t that many events from social history. There’s the Great Exhibition, which isn’t remembered that long, maybe a decade after it happens. But where are the peasants? Where are the post-colonial subjects?
Britain’s a big place. We know that Parliament is mostly an aristocratic institution, not for the entire century. But there should be some other kinds of events. Why is the computer only finding the wars? Well, the answer is that it’s a matter of scale.
So I control the vocabulary. The max n– the maximum number of mentions per year– in the previous chart was 989. Here, it’s 77. So we’re going down in terms of scale. We’re using scale. And controlling the vocabulary is a way of exploring social events.
So we’re looking just for social events. And these events are spoken about a degree less. It’s not that the computer didn’t find them. But they’re less numerous. So they didn’t show up unless you know that social event history is something you should look for.
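One hedged sketch of this move, assuming events have already been extracted and counted by year; the data, the ceiling, and the function are illustrative, not the lab’s code.

```python
# Keep only events whose busiest year stays at or below a ceiling (max n),
# so that quieter social events are not drowned out by the wars.
event_counts = {  # invented: event -> {year of speech: mentions}
    "the Crimean War": {1854: 900, 1855: 989, 1870: 40},
    "the Great Exhibition": {1851: 60, 1852: 12},
}

def filter_by_scale(counts, max_n):
    """Drop any event that ever exceeds max_n mentions in a single year."""
    return {event: by_year for event, by_year in counts.items()
            if max(by_year.values()) <= max_n}

social_scale = filter_by_scale(event_counts, max_n=77)  # keeps the Exhibition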
This chart gives us an interesting opportunity to compare the persistence of memory around the Irish famine and the Bengal famine. The Irish famine is talked about in parliament nearly every year after it happens. The Bengal famine– the chart says 1880, but the number is wrong there– is forgotten about virtually after it happens, despite millions upon millions dead.
So you see the bias of empire, in which the Irish have representation but the citizens of Bengal don’t have representation. It shows up immediately: parliament is incapable of memorializing what happened, despite its responsibility for the famine.
There’s clear data that parliament is mainly ignoring the colonies. And this is something that I’ve published about in other studies. The exception that shows up in the data– and this tracks with the consensus in British history– is the Great Revolt also known as the Indian Mutiny. The Great Revolt of 1857 is remembered every year because the British are reminding themselves that they have to be very afraid of Indian subjects and it’s necessary to arm themselves against them.
The tool seems to work to identify and compare patterns of memory. Much of this, we already knew. But we didn’t know it with the same level of specificity until it could be measured.
So let’s look, again, at a smaller subset, descending, again, by another factor. And here, the max n drops from 77 to 6. So we’re going down by roughly a factor of 10. Again, we’re looking for events that match the bigram “riot.” So famous riots in British history. What are they talking about?
Riots are a matter of a local uprising, usually involving the working class, dealt with by parliament hastily and quickly forgotten, although one of the things that springs out from this analysis is the lasting memory of the Gordon and Featherstone riots.
The Gordon riots, we know about. We know about them in 1780. We don’t know about the way in which they’re being invoked in parliament in the late 19th century. And the same with the Featherstone riots.
Here, we get into a really specific frame of a question for investigation. The most remembered early riots are the Rebecca riots and the Gordon riots. But why is it that the Hyde Park riot of 1866 seems to evoke comparisons with the Gordon riots of 1780, whereas for the Wexford riot of 1883 and later riots, these comparisons utterly vanish? What does the obsolescence of memory tell us about how contemporary riots were interpreted?
So in contradistinction to the previous charts of events, here’s a fundamentally new question about the vanishing of memory that we couldn’t have asked before text mining. A decade ago, a student of history could have asked the question about any two riots and how they were invoked together. And they might have used keyword search to plumb the riots two at a time. But they could not have begun with an archive and an entire category of temporal experience, memory, and progressed from there to identify riots as a subject where the patterns of memory are full of surprises.
OK, so next, I’d like to move on to the issue of periodization. And, again, I have a handout. In the next case study, I’ll be applying an algorithm for finding statistical distinctiveness, TF-IDF, to the problem of periodization. Now, in contrast to memory, periodization is a theory about how time is divided.
Historians argue about the significance of centuries and decades, posing questions like, when did the 19th century begin? The French Revolution is an excellent candidate. In short, we’re interested in what distinguishes one decade from another or one century from another as well as the possibility that some decades are just more historically meaningful than others. Big moment of change. Sometimes you feel like everything is changing around you.
So to investigate distinctiveness, I turn to a vintage statistical algorithm. This is old data science, not new data science: TF-IDF. TF-IDF was introduced in 1972 by the statistician Karen Sparck Jones. It’s used in library schools as a tool for classifying articles, by ranking the likelihood of each word–document pair relative to all words and all documents in a data set.
So in my article, I applied this algorithm not to articles in the library, not asking what’s the word most distinctive of each article or each author, but to time periods. Hence, TF-IDF becomes TF-IPF, Inverse Period Frequency. What are the words that are most distinctive of each time period?
And the time period can vary. I can ask, what are the words most distinctive of each decade, of each 20-year period, of each year, of each week, of each day? So I wrote a for loop and iterated through those periods. And the results are interesting.
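A minimal sketch of TF-IPF as described, treating all the text from one period as a single document and ranking each period’s most distinctive words; the period texts are invented, and scikit-learn’s TfidfVectorizer stands in for whatever implementation was actually used.

```python
# Rank the words most distinctive of each period by treating each period's
# concatenated speeches as one "document" for TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

periods = {  # invented: period label -> all speech text from that period
    "1800-1819": "corn laws harvest failure bank of england cash payments",
    "1820-1839": "catholic emancipation reform bill corporations",
    "1840-1859": "repeal of the corn laws irish famine cholera",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(periods.values()).toarray()
vocab = vectorizer.get_feature_names_out()

for label, row in zip(periods, matrix):
    top = row.argsort()[::-1][:3]  # the three highest-scoring words
    print(label, [vocab[i] for i in top])
```

Re-running the same loop with the corpus sliced by decade, year, week, or day gives the different granularities discussed next.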
At the level of the 20-year period, we get fairly predictable results that look something like a table of contents for a textbook of British history. So the concerns of the day change from the Corn Laws and their effect at a time of harvest failure, to the Bank of England and cash payments, to the repeal of the Corn Laws and the Irish famine, to cholera and the temperance movement, rent strikes and crofters’ rebellions, workingmen’s compensation, and public education.
That’s a pretty good approximation of what happened in the 19th century, according to Britain’s parliament. Those are some of the most distinctive historical events of each time period. So that’s no new information about British history.
But what this proves is that the technique could be applied to periodize any new corpus. So, for example, you could apply it to Reddit over the last 10 years– and we don’t really have a big theory about what the turning point of the last 10 years was. You could make up some theories. But my students have done this. They take the sexuality thread from Reddit. And then they can identify the moment of a transgender suicide that changes the conversation in the language that’s being used.
So we can use this kind of approach to periodize archives where we don’t have a working theory of what the major events or the major change points are. Things become much more interesting when I start looking at the words that are most distinctive of a single day. So what the computer is looking for here is a word that was said 500 times on one day in British history and then never said again in parliament, or maybe mentioned once a year.
So such a word is plumbers. Plumbers get one day in the entire 19th century. This is the date when the Plumbers Union comes to parliament with some materials that they want officially approved and discussed.
And it’s notable that once we go down to the level of one day, the concerns change. At longer periods of time, we’re talking about matters of state. At middle periods of time, we talk about colonization. We talk about the railways. We talk about particular factory interests.
By the time we get down to one day, you’re seeing interests in Britain who only have enough power over parliament to command the attention of parliamentary representatives for a single day out of the century. So there are lots of working class concerns. There are brewers and distilleries. There’s the issue of vagrants– what should we do about vagrants. There are the silk workers. There are environmental concerns like smoke.
So time turns out to operate as something like an index of parliamentary attention. We can compare how much time each interest gets. The Chartists, James, only get 1 month. The abolition of slavery gets 6 months. The plumbers get 1 day.
So this gives us an interesting way of measuring parliamentary attention, bringing politics back into the meaning of temporality. Another intriguing aspect of these distinctiveness measures is that they can be adjusted in order to reveal different shapes of time. So what I’m talking about right now is essentially temporal fossils.
Temporal fossils are words or phrases that come into usage for a single day or 20-year period. And then you never hear about them again. After the Corn Laws are discussed, we don’t go back and redebate the Corn Laws. After the railways are discussed, we don’t go back and redebate the railways at the same level. We might mention occasional railway bills. But there’s a moment of the railways that’s several months long.
But we can adjust that. We might be interested in different shapes of time. For example, words that come into being– neologisms that are talked about, and then they persist for the rest of time. So we can tweak the math to look for those. That would be historical novelty. And I gave you the timeline for historical novelty. We can look at last gasps– words that were used, and then, at a certain time period, they go away.
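A hedged sketch of what tweaking the math for these shapes of time might look like; the thresholds and classification rules are invented for illustration.

```python
# Classify a term's yearly counts into a rough shape of time.
def shape_of_time(yearly_counts, threshold=10):
    """yearly_counts: one count per year, in chronological order."""
    active = [n >= threshold for n in yearly_counts]
    if not any(active):
        return "background noise"
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    if last - first + 1 <= 2:
        return "temporal fossil"     # bursts briefly, then silence
    if last == len(active) - 1:
        return "historical novelty"  # comes into being and persists
    return "last gasp"               # persists for a while, then goes away

print(shape_of_time([0, 0, 55, 3, 0, 0]))     # temporal fossil
print(shape_of_time([0, 0, 12, 18, 25, 31]))  # historical novelty
print(shape_of_time([20, 25, 18, 12, 0, 0]))  # last gasp
```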
So looking for different categories reflects different shapes of time. When we think about these investigations of what’s temporally distinctive, we have the possibility of many complementary objects of study, each of which can give us an intersecting feeling for what is coming and going relative to the past.
My third category of study is the archive. In history, the archive is key. Our findings are only as good as our ability to muster a record of specific documentary instances to prove an argument.
So one of the challenges in working with big data is to talk between distant reading, the overview, typically produced by word count, and close readings of an archive. That is, to move from the aggregate visualizations, like the ones I’ve been showing, back to specific speech acts that are recorded in the archive with names, dates, and verifiable records about exactly what happened. This is one of the trickiest moves to make but also one of the most vital to the discipline of history.
So the method that I’ll be applying in this section is word embeddings. And it offers one possible approach to the study of historical processes. Word embeddings can be used to detect many kinds of historical forces. But one of their virtues is to demonstrate the changing contexts in which certain words have appeared in historical debates.
So I’m showing you the collocates, the most distinctive collocates– words that appear in the same sentence as the keyword environmentalist– by 5-year period. So they very likely appear in the same sentence, or certainly in the same speech. And here, we’ve switched from Hansard to the US Congress. Sorry, I tricked you. Ha-ha, we’re in America now.
The 1970s are the first decade in which people start using the neologism environmentalist in Congress. Word endings matter. I’m using the word environmentalist rather than environmental or environment, because a hostile work environment or a childhood learning environment would give us a totally different picture of what the changing discourse was about.
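A sketch of one per-period embedding approach consistent with this description: train a small word2vec model on each 5-year slice of the record and ask which words sit closest to the keyword in that slice. The slices structure is assumed, and the actual models and parameters behind the published charts may differ.

```python
# Train one word2vec model per 5-year period and list the words nearest to
# "environmentalist" in each period's embedding space.
from gensim.models import Word2Vec

def neighbors_by_period(slices, keyword="environmentalist", topn=25):
    """slices: {period label: list of tokenized sentences from that period}."""
    out = {}
    for period, sentences in slices.items():
        model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
        if keyword in model.wv:  # the word may not clear min_count early on
            out[period] = model.wv.most_similar(keyword, topn=topn)
    return out
```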
Word embeddings give me direct access to how the context in which they were debated was changing over time. So this image and the longer chart that you have on the first page shows the change in context. And if you glance over it, you’ll see many possible directions in which the inquiry could go.
We have the names of industries, like the logging industry. We have the names of species that are being debated, like different kinds of owls and fish. We have geographical regions that come in and out of the debate. At some points, the debate over environmentalism is mostly about the Pacific Northwest. Other times, it’s about California. Other times, it’s about international affairs.
So any one of these objects could be the next move for an inquiry. It’s up to the analyst to decide which of these are the most salient. Now, it’s because of other historical debates that I chose the question to inspect that I did.
And the other historical debate in question is Oreskes and Conway’s engagement with the question of why American politics didn’t solve global warming way back in the 1970s, when the scientists were like, hey, we just did the measurements and global warming is real– you should really do something. Because they did take it directly to Congress.
So Oreskes and Conway went through all of the paperwork on what the scientists discovered, and when that information was presented to Congress, and how politics was tilted away from the scientific consensus. And for that reason, I decided to focus on the moral discourse of environmentalism. In Oreskes and Conway’s account, the discovery of the climate emergency in the 1960s was met with urgent pleas from the scientific community for federal attention and sustained research.
By the 1980s, however, a handful of rogue scientists with ties to fossil fuel companies had begun to deliberately distort the analysis of climate change, using doctored graphics and bringing defamation lawsuits against the scientists who tried to argue with them on the basis of data. Weak reporting in The Wall Street Journal further undermined the cause of the truth, with the result that doubt was cast upon the broad consensus of climate science, which was routinely denounced as a hoax.
So by naming names, Oreskes and Conway bring something like a suit of litigation against The Wall Street Journal and other interlocutors. And we might call this the litigative mode of writing in the humanities. Historians often imagine their work as that of litigants. And so my question was how far text mining could help us to specify the suit against figures in Congress for their role in sowing distrust against environmentalism.
In table 12.2, we see a distillation of the previous table. So just to be clear, in the previous table, 12.1, you just have the first 25 or so rows. But really, it’s a much longer table. It’s as long as the number of unique words in the corpus.
I read the first 500 rows. Out of those first 500 rows, I hand-selected anything that looked like a moral discourse like the word kook or the word hoax or the word elitist. And there were a lot of those terms. So that’s what I’m presenting to you in 12.2.
It’s word embeddings per 5-year period hand-sorted by me. There’s a human in the box. I could have used sentiment analysis. But for reasons I’m happy to discuss, I don’t trust it. I think the results are garbage when applied not to Amazon reviews but to Congress. Happy to say more. Definitely don’t apply it to the 19th century. Big trouble.
So moral discourse in 12.2. Two moments spring out. There are more moral discourse words in some eras than others. There’s 1975. There’s 2000. Things heat up. And then they cool down in the middle, which is interesting.
Importantly, the word embeddings don’t tell us how these words are being used. We can guess. But we don’t ultimately know whether people are saying, environmentalists are elitists. Or they could be saying, it’s so elitist that your fossil fuel industry is doing x. I’m going to call an environmentalist. We don’t know. We don’t know who’s the environmentalist. So we need to look more closely.
So in the next several iterations, I show how I validated the context in which the term environmentalist appeared. I used word count, simple word count of bigrams, to track what was being said about environmentalists. So these are literally two words that are welded together. They’re not saying, I’m going to tell the environmentalist that your fossil fuel dude is so radical. They’re saying, radical environmentalists did x. Overzealous environmentalists did x.
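A minimal sketch of that bigram count, tallying hostile modifier plus “environmentalists” pairs by 5-year period; the modifier list and data structures are illustrative.

```python
# Count two-word phrases welding a hostile modifier to "environmentalists",
# bucketed into 5-year periods.
from collections import Counter

MODIFIERS = {"radical", "extreme", "overzealous"}

def hostile_bigrams(speeches):
    """speeches: iterable of (year, list of lowercased tokens) pairs."""
    counts = Counter()
    for year, tokens in speeches:
        period = (year // 5) * 5
        for first, second in zip(tokens, tokens[1:]):
            if first in MODIFIERS and second == "environmentalists":
                counts[(period, f"{first} environmentalists")] += 1
    return counts
```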
And this chart reveals a chronology. There’s a discourse of distrust at the beginning. But it escalates after 1995, up to 80 mentions of these phrases in 2005 to 2009. The findings suggest that the rhetoric of attacking environmentalists coincided with Newt Gingrich’s campaign to win Congress for the Republican Party during the 1990s, adopting the language-focused strategy associated with the Republican Frank Luntz and the famous 1990 memo “Language: A Key Mechanism of Control.”
I present this to Americans. They’re like, oh, I know what that is. They didn’t know which phrases. They didn’t know how it intersected with environmental history. But they were like, oh, that’s Newt. Look, it changed things.
So next, I decided to investigate which speakers use phrases like extreme environmentalists that we saw in the previous chart. It turns out that 90% of the phrases on the previous chart were coming from only six speakers, these guys. The chart also suggests that well before Gingrich, Senator Ted Stevens of Alaska was already pioneering a rhetoric of mistrust.
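A sketch of that attribution step, assuming each speech record carries a speaker name; the records structure, phrase, and sample output are illustrative.

```python
# Tally how often each speaker utters a given phrase across their speeches.
from collections import Counter

def phrase_by_speaker(records, phrase="extreme environmentalists"):
    """records: iterable of (speaker, speech text) pairs."""
    tally = Counter()
    for speaker, text in records:
        tally[speaker] += text.lower().count(phrase)
    return tally.most_common()  # e.g. [("Mr. STEVENS", 41), ...] (invented)
```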
So, descending to the level of keywords in context, every speech in which Ted Stevens utters the phrase extreme environmentalist gives you precisely the story of what happened. Stevens coins the rhetoric of extreme environmentalists to sow mistrust against the defenders of the Alaskan Wildlife Refuge, on behalf of the Trans-Alaska pipeline, from 1973 to the end of the 1970s.
In the 1990s, he returns to the phrase again, in the context of this Luntz-Gingrich political move, in order to manufacture a memory of the success of labeling the extreme environmentalists. It worked for me. We were able to pass the law by just one deciding vote. Thank you, Spiro Agnew. You too can swing the Republican Party in these tough times against environmentalists if you just use phrases like mine.
He essentially tells the story in the middle of Congress. It’s a bit more coded than that. But he’s telling other Republicans how well this strategy works. He switches, again, in 2001, as you can see not on the screen but in your handout, the last page, where he swaps out the phrase extreme environmentalists for radical environmentalists in the aftermath of 9/11, making a nonexplicit but symbolic case for equating radical Islamism qua terrorism with radical environmentalism, also terrorism.
So by using computational approaches alongside conceptual questions, such as the historian’s problem of explanation, the analyst can leverage the scalar power of the algorithm against questions that matter. Such an approach requires a commitment to a textual database like the US Congress, to a series of methods– at minimum, word count and word embeddings– as well as a working knowledge of debates over meaning in history and major forces in the 20th century.
The training required almost certainly places this work outside the grasp of the lone data analyst with no training in the humanities, because context is that important. The most important part of this work is explanation, in the sense of identifying which questions matter and how they matter. My case study leans heavily on historians like Naomi Oreskes for the relevant questions: why did Congress not respond to the science of climate change? Who sowed the mistrust? If you had to take them to a court of law, would it stand up?
The project then treats word embeddings, word count, and guided reading as stops along the way towards a full argumentation, providing answers for the questions that already exist in the world. Text mining can support historical explanation, but not automatically. The historical explanation that combines evidence from word embeddings will almost always have to rely on the analyst’s understanding of and appreciation for historical reasoning, historical context, and grand questions of historical change, none of which are provided in computational form.
So regarding text mining as an art that requires a sense of what matters, engaged with the humanities and social sciences, produces a very different research process from a data science paper concerned solely with algorithmic innovation. So critical search is my theory about how moving between quantitative and qualitative analysis can be done. It requires the unpacking of the bias of each data set, of each algorithm, of each value inside each algorithm, and constructively putting the results into a dialogue with primary and secondary sources.
The process here is submitted as an alternative to faddish adaptations of new tools and visualizations quickly applied, or to following computer science questions about predicting the results of algorithms applied to historical topics, where prediction is a problematic term. And I’m happy to say more on that. But that’s what’s in the book. I don’t like prediction because historians denounced it a long time ago. It doesn’t work for us, except in very minute cases such as military history. I’ll say more. Ask about it.
The theory of critical search departs from the fact that historians are ultimately liable to be persuaded by combining qualitative and quantitative approaches together, measuring data as an index for further reading in depth of the kind that we’ve provided in keywords in context. In the critical search article, I outline three ideal steps that can be applied in any order, iteratively, while moving through this process.
Seeding refers to the discussion of the concepts, the archives, the individuals submitted as a focus of inquiry. Today, we talked about parliament, memory, periodization. So those terms come from elsewhere. The seeds come from elsewhere.
Broad winnowing means concentrating on one view of the document base and asking, what does the data say about what’s interesting here? So any of the visualizations we’ve looked at represents an opportunity for winnowing. That means moving from the overview of the 19th century as a whole to a particular question– why are the plumbers talking to parliament? You can follow it up. We know everything except the plumbers. Or why is Ted Stevens the first guy? Winnowing down.
And then guided reading is the process of actually returning to what’s on the page, to the primary source, by the time we’ve got keywords in context. And it looks like what historians do anyway. They carefully read debates. They think about what matters. We’ve just had a shortcut to those texts that matter. And now, we need to think about them.
So you look at the document. You hold the book in your hand. You actually read the speech. We practice guided reading when we follow the visualization of the years mentioned in parliament back to the era of the Tamworth Manifesto in the 1830s.
And what this looks like in process might be way more complex. Here’s an attempt to put a flow chart around what my lab did once, not necessarily replicable. We seeded. We seeded again. We winnowed. We read. We seeded again. And then winnowed and read and then read some more.
The point is that it’s an iterative process and often guided by historians’ concerns about interpretation. We’re trying to use this process to unpack the black box of algorithmic methods to engage these three ideal processes until we understand historical change.
So where do we go from here? Well, I think digital history is at the beginning of a process, not at the end. This list of categories of temporal experience is just a starter. My experiments with mapping these categories of temporal experience to statistics, to algorithms, to machine learning are likewise just the beginning of a process. It’s not exhaustive.
But in order to be robust, the discipline of digital history needs to begin with some approaches to historical experience that will allow us to have concrete arguments, not over existential issues like whether text mining has a place in the academy, but over constructive approaches that can produce accuracy, specificity, and relevant meaning. Thank you very much.
[JAMES VERNON] You have plenty of time for questions both online and in the room.
[AUDIENCE MEMBER] This is fabulous. Thank you. So I have a question about the place that you see text mining fitting in with, say, the historiography of the Hansard Corpus. Because a lot of the time, when text mining or digital humanities gets brought up, it’s a story of rupture: now we’ve got this new method that’s coming out of whole cloth. And so I appreciate the ways that you’re trying to bridge with earlier historiographical methods.
And so I want to ask the question in sort of a funny way, which is: was there something algorithmic or machinic about the work of telling history prior to this, which you are picking up or refining in some way? And then similarly, might the parliamentarians themselves in the 19th century have thought of aspects of their work in that way?
And just, I think that Frank Luntz actually is a great figure for that because he goes out and says, why, yes, language does have arbitrary features to it. And so when you slice it out into 5-year periods and watch the words connect with other words relatively freely, it seems like there’s a neat– I don’t know– homology in method and object.
[JO GULDI] Thank you. Thank you. Yes, a fascinating question. So first, I think there are two questions embedded in the one. And one of them is about parliament– you started off asking about the historiography of parliament.
And in the longer discussion of validation that's in the book, one of the moves that I make on behalf of Hansard is to say, it's really interesting to work with Hansard, because Hansard has been the basis for writing the history of the British nation for over 100 years. Nobody has read all of Hansard, because it's a quarter billion words. But enormous chunks of Hansard, of these speeches and debates, have been read and digested in order to write the history of the abolition of the slave trade and so on.
So I bring that up because it offers a validation that I didn't discuss. We were talking about validating by multiple algorithmic essays into similar questions, like multiple algorithms that can unpack memory. But one of the things I'm aware of is that when I periodize the 19th century and say, this looks familiar, it looks familiar because I've read those British history textbooks, some of which are based on histories written by people who read Hansard.
So it’s good if the machine’s model matches the model created by the humans who are doing the reading. So that’s a very rare expert mark to have for such a large corpus. And that’s one of the words that I– reasons that I fell in love with parliament, even though it’s an elite institution filled with white men and was trained as a social historian here.
I fell in love with it because we do need to test and compare the work of these machines to the work of historians. There are few cases where we can do that, and we can do it right here. If the machine approximates what the historians said when the historians read it, pretty good. Pretty good.
But then you have a second question, which is more of a rhetoric question about how modular speech is. And I think it's a really good question, because you're right– Luntz knew something about speech and about rhetoric. And rhetorical manuals since Quintilian have made use of the fact that we can categorize speech acts.
This one’s an extended metaphor. This one is a really compelling violent juxtaposition of two images designed to get your attention. Because they work. You can see them at work in the speeches of Julius Caesar or in Shakespeare or in Gladstone.
And I’ve contemplated research projects that would dive into that more fully. And I’ve talked to historians about this. And they’ve told me that this is profoundly uninteresting. But I think it might be interesting to people from other disciplines. So I’ll tell you what they are. And if you want them, go for it.
[INAUDIBLE]
Awesome, OK. So this is for you. There are compilations of speeches compiled as literature, as objects of study. By the end of the 19th century, William Jennings Bryan, the great populist in America, is compiling the world's great speeches in 10 volumes. And most of it is from Hansard. There's a lot of Disraeli. There's a lot of Gladstone. There's a lot of the debates over the slave trade, the debates over Warren Hastings.
So on the one hand, there's an opportunity to look for textual reuse, because we can figure out who's quoting Shakespeare the most, who's quoting other members of parliament, which members of parliament get quoted. We can figure out the characteristics of the speeches that get excerpted and republished in this form for rhetorical study.
And so I think there are lots of rhetorical questions about what speech is doing and what the patterns are inside the speech that are super fascinating for the literature department. Even if the history department is left cold, that's OK. Yeah.
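[One simple way to approach that textual-reuse question computationally is n-gram shingling; a sketch under the assumption that speeches and source texts are available as plain strings. The sample texts are placeholders.]

```python
def shingles(text, n=5):
    """The set of n-word sequences in a text, for crude reuse matching."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def reuse_score(speech, source, n=5):
    """Fraction of a speech's n-grams that also occur in the source text."""
    grams = shingles(speech, n)
    return len(grams & shingles(source, n)) / max(1, len(grams))

# Placeholder texts; in practice, full speeches and full plays.
shakespeare = "to be or not to be that is the question"
speeches_by_speaker = [("Gladstone", "to be or not to be that is the question of reform")]
for speaker, speech in speeches_by_speaker:
    if reuse_score(speech, shakespeare) > 0.01:
        print(speaker, "appears to be quoting Shakespeare")
```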
[AUDIENCE MEMBER] I’m a PhD student in the history department. And I have two interrelated questions. I may sound like the devil’s advocate to the historians here. One is that you mentioned sentiment analysis and that you’re very skeptical of it. As we know and what I know right now, it’s very basic and can lead to a lot of inaccurate querying.
But as we work with these tools, as we read more and know more, does that improve? Right now, we're talking about a number of software systems, the current one being ChatGPT, teaching themselves to be more or less vanilla, more or less colorful. Do you think change over time will lead us to far more self-taught systems that are more sophisticated?
And I mean, what should historians remember in terms of the ethics of doing that? Just a big question there. And the second question that I had was: right now, a lot of this data is in English and largely from Euro-America. A lot of the data in, say, third-world, non-English archives is still just being digitized. And I'm wondering whether the exclusion or inclusion of those archives in the digital world– I mean, what are the larger ramifications of that for NLP or any kind of data mining?
[JO GULDI] Yes, thank you. Very intelligent questions. Very smart. So sentiment analysis out of the box, trained on Amazon product reviews and Twitter, is a total disaster when applied to the 19th century. Striking examples include that "socialist" is coded as fear.
So that might be true if you're training the algorithm on– I don't know– bros who code and buy things from Amazon. And then when they're afraid of the toaster, they're like, it's a socialist toaster. I think "mom" is also coded as fear, as in, I guess my mom would like this toaster. I don't know. It's coming from some set of data.
But apply this to the era of the flourishing of Fabian socialism, when parliamentarians talk about how great the socialist era will be when we have flush toilets in all residences in the city. They're not afraid. They're really excited, because it's not the Cold War.
So those distinctions become really important. It also gets really confused by rhetorical gestures. Disraeli and Gladstone get coded as massively sad because they say things like, I beg you to consider. Or, I fear that my honorable friend has a false idea of the information.
So those are rhetorical gestures that don't actually convey a sentiment. Identifying the historical sentiments is a bit more tricky. And you can come up with a lot of garbage.
But I think that you’re right, that there are issues about training sentiment identification data sets on different types of speech. It would be too expensive and not really worthwhile to mechanical [? torque ?] it through Victorian Hansard. At least for my purposes, I wasn’t interested enough in sentiment analysis to spend valuable time and money on that.
But GPT technology seems to be one of the promising arenas. It cannot write history from scratch. But it can make training data sets, maybe, said Nicole Coleman yesterday at Stanford. She's in the library; she does data. She was showing me how well it categorized some reports on fish from the 1970s. Could it categorize sentiment? I don't know. But it seems really plausible. I think that's an excellent research project.
You ask about non-English archives. I'm not a specialist. But I will say that digital history deeply cares about this question, because one of the values of history departments today, in every history department, is the representation of the Global South, the representation of ethnic diversity, which we represent through hires, to make sure that there is a professor of the Middle East and a professor of Russia and a professor of China, a professor of Africa, a professor of African-American Studies in every department.
Even in a tiny department like the one I work in, we try to aim for some of that diversity, because we believe that the voices in those archives tell us things that the voices in parliament will never tell us. We have to go back to those archives. So then the question becomes, what's the relationship of those archives to digitalization?
And perhaps the most compelling set of answers about that comes from Alex Gil, originally a literature scholar, who was at the Columbia University library and has now migrated to a faculty position at Yale. And Alex Gil is responsible for a concept called nimble tents, which is about how to send packages to the Caribbean, to ask local people to document their own documents and stories and then send us back the data so we can help with analysis, rather than the much more expensive option of sending a graduate student to the Caribbean for several years to write down all of the stories.
So this is an acknowledgment that there are storytellers and archivists in many parts of the world who could become our collaborators in a process of documenting and preserving knowledge from the Global South or from other civilizations. And there's an ethics to that. And perhaps the best book on that subject, the ethics of documentation, how to work with local groups, and where this has been done really well, is Roopika Risam's New Digital Worlds; Risam is at Dartmouth.
[AUDIENCE MEMBER] Hi. I'm also a history graduate student. So I understand we all have our limits in terms of archives. And I was just wondering, since it's a new method that some of us will probably end up working with, why is it that it also starts from the top down? This parliamentary archive was used to write a certain kind of history. Why is it that a new methodology also takes that as a point of beginning?
And that leads to another question. Some of us do microhistory, or have a very thin source base. How can we contribute to a digital history method? In what ways can we take something that doesn't have a large source base and use that as a foundation for this kind of original history method, instead of something which is already a major archive?
[JO GULDI] Yes, excellent first question. So why is it true that digital history in our enlightened age replicates the bias of the 19th century? I mean, it's like a zombie kind of history. We got over dead-white-man history. And yet all we have in the digital space is more dead white men. I think it produces an appropriate level of shock and horror in history departments when that realization is made. And so I do a lot of apologizing and pointing to Roopika Risam and Alex Gil.
But the reason that it's true is very important to understand. Many of these digitalization projects have been funded at the national level, in the European Union. So the National Library of Sweden has digitalized the Swedish newspapers, the Swedish novels, and 300 years of the Swedish parliament. Lots of data.
Finland has done the same. Britain did the same, starting in the 1990s. Parliament funded it. So it was one of the first digitalization projects. It's essentially nationalism all over again, coordinated through the governing bodies and through libraries, without historians in the mix.
Now, there’s an institutional response also coming from historians in the EU, where the conversation is actually about 10 years in the future from North American history departments. So this is something that we can learn from. So in Finland, for example, the historians have realized that leaving the question of what gets digitalized to the Finnish parliament results in nationalism. And that doesn’t match their values.
So humanities deans from 20 different universities have converged to write a 20-million-euro grant proposal for the national research budget of Finland. And in this, they have asked lots of historians, who have no digital skills whatsoever, to help them rank in importance the archives that can be found in Finland. Some of those archives are about the Finnish people. But some of them are about immigrants.
Some of them are about minorities. Some of them are about women. Some of them are about the working class. And they're from all different periods. So we would probably want some diversity of time period, some medieval texts, some modern texts, some representation of immigrants, some representation of the geographical diversity of the country. And historians can have a really interesting conversation about that.
Now, what coordinating it on a national level means is that they can present a plan. The humanities' budget of Finland is like 0.5% of the national research budget of the nation. A lot more goes to civil engineering. A lot more goes to public health. No offense; public health is trying to cure cancer. Way to go.
But if history is able, using 20 [INAUDIBLE] at a time, to shift the conversation from 0.5% of the research budget to 2% of the research budget, because we have a plan, then that's a massive windfall for history. And you can start imagining capturing those archives and providing for future generations of historians and the public to understand the diversity of the nation's past in a new way. So then the digital history projects can be aligned with the values of the history department, which is very interesting.
And then you ask, what is it that a microhistorian can do? Where are the microhistorians innovating in terms of method? Right now, the NEH is making a lot of grants to microhistorians. So I was just at Stanford. And I was meeting with a historian of the Middle East who was telling me about digitalizing one really big book of records of debt relationships. I was hearing from another historian, of Latino experience in the United States, who came across one really cool archive, the yellow pages of Latino New York. He's mapping them all.
So those techniques look very different from mine. The microhistorical project that I would love to do, to capture the voices of the working class in Britain– I know you're standing right there– is a text matching, a text similarity exercise, where we find a collection of working-class pamphlets.
There aren’t that many of them relative to the speech is in parliament. But wouldn’t it be nice to know who in parliament sounds like the working class speeches, which ideas from the working class pamphlets get taken up into parliament, which ones persist for the longest time? So we can just as generations of post-colonial historians read the imperial archive against the grain in order to understand the politics of the Great Revolt and the permanent settlement.
In just the same way, digital historians can also use microarchives to read the macroarchives of parliament against the grain. So that's technically possible. And that's actually, I think, one of the next most important hurdles for the discipline of digital history. And I'm very happy to pursue that with anybody who feels inclined to. Yes. Yes, please. Oh, sorry, there's another–
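[A sketch of that text-similarity exercise using TF-IDF and cosine similarity; the pamphlet and speech strings are placeholders, and scikit-learn is just one of several reasonable choices.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpora: working-class pamphlets and parliamentary speeches.
pamphlets = ["the land belongs to the people who work it ..."]
speeches = ["honourable members, on the land question ...",
            "the malt tax falls heavily upon the farmer ..."]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(pamphlets + speeches)
sims = cosine_similarity(matrix[:len(pamphlets)], matrix[len(pamphlets):])

# For each pamphlet, which speech in parliament sounds most like it?
for i, row in enumerate(sims):
    print(f"pamphlet {i} best matches speech {row.argmax()} ({row.max():.2f})")
```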
[AUDIENCE MEMBER] Yeah, so I’m glad you started with the critique that a lot of times, findings from text mining aren’t very surprising. I do computational stuff. And I haven’t really been impressed with most text mining findings for that reason.
And I was just hoping– I'm not a historian– that you could highlight in your results where that didn't happen. Because I was looking at your findings here. And a lot of the time, it was, the parliament cared about the French Revolution, as we already knew, these sorts of things.
And so I would just love for you to highlight– because I don't really know history– where these sorts of methods surprised historians, or where historians pushed back on your findings because they go against the grain. Partly because I do think that with a lot of text mining, you can just read the texts and come to the same conclusions. And I think that's why it's often very repetitive of work done decades previously. So if you could highlight that for us in terms of your own discipline, that'd be really interesting.
[JO GULDI] Yeah, thanks. So I showed one such moment here, in 1838. We really couldn't have found that with traditional methods. Is it earth-shattering to the writing of 19th-century British history to know that the new Conservative Party was signaling with these references to memory? It's not earth-shattering. But it's definitely a finding of a kind that we couldn't find otherwise.
In one presentation, an American historian leapt out of their chair at this one: the first mention of the American Constitution in Hansard is in 1832. 1832 is the debate over the British Reform Act. So it's about whether the middle class gets the vote.
And that’s really interesting because Britain is heartbroken at losing its North American colonies and doesn’t know whether the American Constitution is a declaration of war or we should totally dismiss it. So the first time they acknowledge its existence in parliament is in 1832 is they’re deciding, should we also give more people to vote like they do in America? And some people in the House of Lords say some very nice things about the American Constitution and how it’s given America political stability.
So that moment of reflection also reaffirms that there's this moment of America acting almost as a beacon to the world quite early, shocking the British discourse. And that we didn't know. I gave the example of my students– I mean, the plumbers are a real finding. I passed by it quickly because I'm interested in the method, not the finding. But the plumbers could be a dissertation. It could absolutely be dissertation-length material in British history, as could half a dozen of the other terms that I mentioned on this visualization.
So part of it is about the level of what constitutes shocking: interesting versus shocking. Most of our findings when text mining the British parliament shouldn't be shocking, because there have been literally hundreds of British historians reading Hansard over the last 100 years to understand the 19th century. If aliens had invaded and constructed pyramids in the middle of the Victorian era, that would be shocking, and it would probably be wrong; we would probably know about those things. But we can find interesting patterns that we didn't previously know about.
But what is shocking, and could get public interest, is my students' work that I was telling you about, where we don't know what the last 12 years of transgender intellectual history are until you model it on Reddit, using some of these techniques. And then it jumps out. It's one particular transgender suicide. You could theorize that the transgender suicides are really important.
But there’s one that tilts the discourse. And you model the before. And you model the after. And they’re nothing like each other. So that, you couldn’t guess. And it’s newsworthy. So yeah, it’s an excellent question because I opened with the standard of meaning. The standard meaning is one of the things that history holds up. That’s why we examine historiography and what other people have– how other people have written the history before us. It is the standard to which we hold ourselves. And most of my articles engage that question in some way.
[AUDIENCE MEMBER] First, I wanted to just mention something about that 1830s American Constitution point. I think that's when the US gave the right to vote to white males regardless of income and property. And so that was a major expansion. So maybe that could be why they started talking about the issue. I'm not sure. But that could be.
So my question– I'm a Berkeley history alumnus from the '90s. And I have a question about sources. When I was at Berkeley, I worked for almost 2 years at a group called the Pacific Institute for Research and Evaluation, PIRE, P-I-R-E. It's now at the SkyDeck building.
So it’s an NGO. And my job was newspaper coding, going through old newspapers in regions and looking for keywords related to substance abuse like alcohol use and trauma. And so I learned then that relying on newspapers was actually a powerful way to affect policy because that was the idea. You rely on newspapers to tell people, to tell lawmakers, hey, we should ban certain things here and there.
So my question is more about war, the Iraq war. My cousin was sent to Fallujah for Operation Phantom Fury. He has PTSD. A lot of bad things happened. But I remember when the war started, the military demanded that the news reporters, the news agencies, had to be embedded within military units.
Now, Reuters and Al-Jazeera refused, and their reporters, by acci– they say by accident– got blown up. And after that, the news basically would report what the military wanted them to say. So my concern is, how do historians get around these kinds of limits? Because newspapers are very important. Of course, parliamentary discussions are much more efficient. But newspapers constituted a major source of information to affect policy.
And when I was at Berkeley, I remember one of my English professors saying that there was a military guy who said that during the Vietnam War, there was overwhelming support. There was very little disaffection with the war, or dissension. People were all behind it.
And the professor was like, I don't remember it that way. So I feel like a lot of stuff in the news is only now coming out. A few days ago, the New York Times published an article about rectal feeding being used in Iraq against Iraqi civilians, or people, or suspects. And these things are just coming out now.
Because I remember when the Iraq war started, the news kept saying, this is good. The Iraqi children, look, they're all rushing out to greet us. They love us. And if you reported some independent news, you might end up gone. So yeah, what ways would historians use to circumvent these kinds of state-led blockages?
[JO GULDI] Yeah, thanks, really interesting question. So in the case of the experience of people in a war confrontation, a historian of the 20th century using traditional methods would consult the newspaper and then pursue oral history, diaries, newsletters, and other written documents in order to account for, and triangulate against, the suppression of the official record that you're describing.
And so what you’re essentially describing is the dangers of relying solely on an official corpus, whether parliament or the newspaper, and looking for those other records and the stories that they tell. So I take on board absolutely vital. And that’s why the discipline of history is not about to surrender its sword and shield to the data science department.
It’s going to continue teaching all of these techniques of engaging with archives. Or all of the archives going to get digitalized while there are oral history projects that are recording the testimony of people, activists in the anti-Apartheid movement of the 1980s, of Vietnam War. And these days, that testimony is often digitalized and digital in nature.
So there’s a possibility of putting the newspaper account of the Vietnam War into conversation, dialogue with the account of veterans themselves. And that could be really informative because there are algorithms for finding what’s in bucket A but not in bucket B. That’s really important, really important issue for the public.
Hasn’t been done yet. Hasn’t been done yet even for the low-hanging fruit. The low-hanging fruit is we have parliament. We have the British newspapers any day now from the living with machines project. We have the British novel.
The low-hanging fruit is what’s in the British novel that’s not in parliament, what’s in parliament that’s not in the novel. There are novelists who are in parliament like Benjamin Disraeli. There are novelists who were read by parliament who inspired reforms like Charles Dickens.
So there should be a lot of transference. What's the lag? That's a data-intensive project that some of us are dying to see somebody go after in a really data-driven way.
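[A minimal sketch of a bucket-A-not-bucket-B comparison, here as smoothed log-odds of word frequencies; the two corpus strings are hypothetical stand-ins for the novels and Hansard.]

```python
from collections import Counter
import math

def distinctive(corpus_a, corpus_b, top=20):
    """Words overrepresented in corpus A relative to corpus B (smoothed log-odds)."""
    a = Counter(corpus_a.lower().split())
    b = Counter(corpus_b.lower().split())
    total_a, total_b = sum(a.values()), sum(b.values())
    score = {w: math.log((a[w] + 1) / (total_a + 1))
                - math.log((b[w] + 1) / (total_b + 1)) for w in a}
    return sorted(score, key=score.get, reverse=True)[:top]

# Placeholder strings; in practice, the full text of each corpus.
novels_text = "fog fog orphan workhouse marriage marriage"
hansard_text = "honourable honourable member bill clause clause"
print(distinctive(novels_text, hansard_text))  # in the novels, not in parliament
print(distinctive(hansard_text, novels_text))  # and the reverse
```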
You highlight another category that I would love to add to that list of historical concepts. And that's corrections of memory after the fact. So this happens in Congress itself. It happens in the newspaper. They say, whoops, we missed this 10 years ago. This was happening. Now there's an investigation. Oh, we did use torture in Guantanamo. Torture was permitted, oops.
My methods would miss that, because the investigation would track as the discussion. Or it would show up in terms of memory, because Guantanamo is over, and the memory section would show up: now we're talking about Guantanamo after the fact.
So yeah, it’s interesting to play with corrections of memory as a particular genre. Is there a way just based on data of examining that? I think the memory discussions aren’t really thick enough enhancer defined up. The newspapers will offer more material. It’s a really interesting question. yeah, Thank you.
[AUDIENCE MEMBER] I have a quick question. You mentioned prediction is a problem. Could you elaborate on that? Because I believe a lot of these algorithms also work in terms of prediction, with the success of things like GPT– just predicting-the-next-word type of stuff. So are you saying it in a general historical sense? Or are you saying it with regard to the algorithms? Just curious.
[JO GULDI] So prediction– in the book, I talk about prediction a lot. And I work it through many philosophers of history. I take prediction to Jill Lepore. And Jill Lepore says, you want to predict the future? Prediction is an inherently risky proposition.
I take prediction to Karl Popper. Karl Popper says, the field of history does not do prediction. We don’t believe that there are laws of history because human beings are inherently creative and come up with new responses, new forms of governance. So therefore, the search for historical laws is in its nature doomed.
I take prediction to Reinhart Koselleck. And Koselleck, following Arthur Danto's ideal chronicler, says, if we had an ideal archive of every thought or mood or wish that every human had ever had, we could measure all of the repeated events against all of the singular events. And we could develop a total predictive mechanism. But we don't have that, because our archives are imperfect.
And then I take prediction to the military historians. And the military historians say, I have no problem with prediction. On a military field, if there's a hill and you can hide an army behind it, that predicts that on another military field with a hill, I can also hide an army there. It's predictive. No problem.
So what that tells me is that my discipline is hostile to the word prediction. But when I go over to the data science department or the computer science department, they're like, what are you trying to predict? What is the most surprising discovery we can help you make? We have test sets and training sets. And they predict things. And the prediction is the measure of accuracy.
And so I say, oh. Oh, you're using the word prediction in three different ways. You're going to annoy four out of five historians just by using the word. Test and training data sets can be useful for teaching an algorithm about 19th-century sentiment. I'm not going to say, don't do that. That's great.
But maybe predicting the future, which some data scientists, à la Peter Turchin, are trying to do with historical data sets, in a way that would be offensive to many historians, relies too much on a concept of laws of human behavior that can be predicted on the basis of past conflict. And most of the history department thinks that that is not going to work.
And yet, there is a conversation with mathematicians like Kristoffer Nielbo, who we love. Nielbo says, if you look at smaller data sets like Reddit threads– there are some Reddit threads where they're introducing new terminology every month. They're using new words. There are other Reddit data sets where they're always using the same vocabulary over and over again, which means maybe they're refining their use of a couple of keywords– you can predict that these two communities are going to continue to operate in the same direction.
This is a really interesting investigation of prediction. I don’t know if it’s history. But it probably applies to the future. And you can predict some things about the future without annoying Karl Popper.
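[One way to operationalize that contrast is a month-by-month vocabulary-novelty rate; a sketch assuming the threads arrive as chronologically ordered (label, text) pairs, with placeholder input strings.]

```python
def novelty_by_month(months):
    """Share of each month's vocabulary never seen in earlier months."""
    seen, rates = set(), []
    for label, text in months:
        vocab = set(text.lower().split())
        rates.append((label, len(vocab - seen) / max(1, len(vocab))))
        seen |= vocab
    return rates

# A community coining terms scores high; one recycling the same
# keywords trends toward zero.
threads = [("2021-01", "new term alpha ..."), ("2021-02", "alpha again ...")]
print(novelty_by_month(threads))
```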
[JAMES VERNON] I have a really nerdy question about Hansard. Hansard, as you know, was created from an extra-parliamentary campaign to try and ensure that the business of parliament was available to the British population. It was what we would now call an exercise in the transparency of governance. And yet, we also know that the way in which parliament worked changed really dramatically in the 19th century.
So I’m wondering whether– I mean, this seems a lot more basic than the forms of analysis that you’re doing. But I’m wondering whether one thing that seems that would be able to do with your techniques would be to understand how the performance of parliamentary debates change, whether there’s more space or less space given to parliamentary debate, whether more or less politicians are speaking or not.
And I’m wondering whether that’s just a level that you feel is not interesting. But it seems to me it could actually tell you something that historians working in paper archives find very hard to track. Whereas when you look at the size of your data set, you would be able to deliver that type of analysis.
[JO GULDI] Yes, absolutely. So I'm thinking of a book you put me onto, by Ryan Vieira, which suggests that after 1867, when the working class gets the vote, parliament stays later hours, debating what the parliamentary representative can do for the silverware industry. Because they know that the next day, in the newspaper, their speech is going to be reprinted. And they're going to be held accountable by working men who can vote.
So Vieira’s book has no data. It has lots of evidence not in the form of data and not in the form of quantitative accounts. And it’s a trivial exercise to count the words and investigate 1867 and who’s spending more time.
One of the reasons that we've held off on that is that as soon as we got into issues of representation, the speaker metadata became really important, because it's important to track the individuals who were introduced in 1867 but weren't there before 1867. And the speaker metadata inherited from multiple past Hansard projects is unbelievably bad.
The Digging into Data project of 2010 spent a million dollars cleaning Hansard. Even so, their data set was terrible. And the political scientists publishing right now with that data set and its speakers have about 10% accuracy: 1 in 10 cases is accurate, can be matched.
We worked with a chemistry PhD who was used to working with genomic information to reconcile the speakers' names. And we think we're at 90% accuracy now. But I have not worked with that data, because I'm waiting for my team to finish the cleaning process.
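[Name reconciliation of that kind can be sketched with the standard library's fuzzy matcher; the canonical list, cutoff, and example spelling are all illustrative, not from the project itself.]

```python
import difflib

canonical = ["William Gladstone", "Benjamin Disraeli"]

def reconcile(raw_name, choices=canonical, cutoff=0.6):
    """Match a messy transcribed name against a canonical speaker list."""
    match = difflib.get_close_matches(raw_name, choices, n=1, cutoff=cutoff)
    return match[0] if match else None

print(reconcile("W. Gladston"))  # -> 'William Gladstone'
```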
So that can absolutely be next up in terms of priority. I think you're right that it promises great results. I would love to work with Ryan. Yeah.
[JAMES VERNON] It might be my nerdy question [INAUDIBLE].
Thank you, everyone, for coming and for your fantastic questions for Jo about her work. Can we give her a round of applause? And thank you so much.
[MUSIC PLAYING]