CRELS

Consequential Sentences: Computational Analyses of California Parole Hearing Transcripts

Recorded on April 1, 2025, this video features a talk by AJ Alvero, a computational sociologist at Cornell University, presenting findings from an analysis of parole hearing transcripts in California.

This talk is part of a symposium series presented by the UC Berkeley Computational Research for Equity in the Legal System Training Program (CRELS), which trains doctoral students representing a variety of degree programs and expertise areas in the social sciences, computer science, and statistics. The talk was co-sponsored by the Berkeley Institute for Data Science (BIDS).

Abstract

In California, candidates for parole are able to present their case with the support of an attorney to commissioners appointed by the state. These hearings are professionally transcribed, making them highly amenable to a variety of social scientific questions and computational text analysis. In this talk, I will discuss a large project analyzing every parole hearing transcript in California that occurred from November 2007 until November 2019, along with a wealth of administrative data, some of which was obtained after successfully suing the California Department of Corrections and Rehabilitation (CDCR). In some of our early work, we find that patterns in the text based on the words being used and who is using them (e.g., words used by the parole commissioner) have stronger explanatory power than variables used in past studies. To conclude, I will discuss forthcoming work which takes advantage of the unique structure of the transcripts.

Podcast and Transcript

Listen to this podcast below or on Apple Podcasts.

 

[MUSIC PLAYING]

DAVID HARDING: All right, welcome, everyone. My name is Dave Harding. I’m a professor in sociology and faculty director of CRELS, the Computational Research for Equity in the Legal System.

And so we’re a program that trains doctoral students in a variety of degree programs on campus in the intersection of social sciences, computer science, statistics, and the substantive domain of the legal system. Our talk today is co-sponsored by the Berkeley Institute for Data Science and, of course, by the Social Science Matrix, who’s hosting us today.

Just before I introduce our speaker, I want to let doctoral students know that the call for applications for next year’s CRELS’ fellowships, our traineeships, and also the computational social science training program traineeships is out now. Details are on our website crels.berkeley.edu. And the other thing you can do to stay in the loop is to subscribe to our newsletter. There’s also a link on the CRELS’ website for that.

So today we’ll hear from AJ Alvero, who’s a computational sociologist at Cornell University, with affiliations in sociology, information science, and computer science. He earned his PhD at Stanford and also a master’s degree in statistics there.

His research examines moments of high-stakes evaluation, specifically college admissions and parole hearings, which I think we’ll hear about today. In doing so, he addresses questions and topics related to the sociological inquiry of artificial intelligence, culture, language, education, race and ethnicity, and organizational decision making. So welcome, AJ. Take it from there.

AJ ALVERO: Thank you for the lovely introduction. Really happy to be here with you all. And let’s begin. So like David mentioned, my name is AJ Alvero. I’m an assistant research professor at the Center for Data Science for Enterprise and Society at Cornell University.

And the talk today is entitled Consequential Sentences: Computational Analyses of California Parole Hearing Transcripts. We’ll go over a forthcoming paper, and I’ll also go over some of our plans for future follow-up studies.

OK, so when people hear the word parole, they tend to think of scenes like this. This is from the 1994 film the Shawshank Redemption. Arguably, this is one of the most widely seen examples of what a parole hearing looks like in film and media.

And the scenes tend to look like this. So you have a room. And in the room, there’s a table with old white men in positions of power who decide the fate of a parole candidate. And for example, in the film the Shawshank Redemption, such a candidate is played by Morgan Freeman’s character Red.

So the board asks the candidate questions about their crime and what they did during their time in prison, and asks them to make a case as to why they should be allowed to leave. So this is where the title of my talk comes from.

So the parole candidate is able to speak in complete sentences to explain how they’ve been rehabilitated, their behaviors in prison, if they’ve been on good behavior. They’re allowed to express remorse and so on, again, in complete sentences.

In turn, the board, again represented by the people at the table, is able to determine whether to extend the candidate’s sentence or to release them back into society.

So as it turns out, the state of California manages their parole process in a fairly similar way. I wonder if there’s some connection between Hollywood depicting parole processes like this and the way California does it.

And California is also one of the few states to do it like this. The caveat, though, is that when these parole hearings are happening, there are fewer windows, it’s not as well lit, and there are fewer people in the room.

But there’s one key person missing from this scene who is a major player in California parole hearings, and that is the transcriptionist. So in California, the job of the transcriptionist is to accurately record anything and everything that is said in each parole hearing, so that the parole candidate, the board, and members of the public are able to review everything that was said in a given hearing.

So I just want to note, the transcriptionist is not actually in the room, but they play a key role in the entire parole hearing process in California. And this is the Penal Code that enshrines this practice in the state.

So this was codified in the early 2000s, meaning that ever since the early 2000s, every single parole hearing has been transcribed digitally into PDFs. And of course, it’s not just one parole hearing, it’s every parole hearing.

And as you can imagine, this creates a lot of data. This is a lot of writing. This is a lot of communication that is digitally inscribed and available for the public to examine. So prior to this, parole hearings were recorded physically as microfiche. But through digitization, the amount of data that becomes available just cascades.

So in this talk, in our broader project, we’re looking at digitized parole hearing transcripts from the year 2007 to 2019. And just in those 12 years, there have been over 35,000 parole hearings conducted in the state of California, which translates to over 5 million pages transcribed, representing over 700 million words. So again, this is a very large corpus.

So lots of information is embedded in the transcripts. I mean, most notably the words that are spoken by each member of a given hearing, the parole candidate, members of the parole board, and attorneys representing the parole candidates.

But it’s not just literally the words, but it’s also what they’re saying. So for example, an attorney might note that the victim of the parole candidate’s crime is in the audience. And again, this information is available in the transcripts.

I just want to note, though, that the process of transcription might not be perfect. There was a 2019 study by a team of linguists which found that courtroom transcribers, who presumably also make up the pool of parole hearing transcribers, tended to be less accurate in transcribing the words of Black plaintiffs and defendants. So there’s a limitation there, but there’s still a lot of data to work with.

But even with this amount of data, it still isn’t everything that is important if you want to understand parole hearings. So for example, a key piece of information such as the race of the candidate is typically not explicitly stated in the parole transcripts. So you need to do– you need to do some more homework to get that information, which is what we did.

So this is just a little bit of meta discussion about the project. So even with all the data recorded in the transcripts, there were key pieces we still needed to acquire, specifically and especially the racial identity of the parole candidate.

So the California Department of Corrections and Rehabilitation, or the CDCR, would not share race and ethnicity data; they gave a couple of reasons. But obviously, this is a key piece of information that could have really broad impact on society and policy and our understanding of moments of evaluation like this.

So in order to get this data, our team sued the state of California, with a lot of help from the Electronic Frontier Foundation, and forced them to provide racial data. There were other pieces of information that they also provided, along with race, that I’ll go over, but this was the key thing that we were looking for.

So combined, we have the parole hearing transcripts, and now we have this traditional tabular data, what we call CSVs, spreadsheets, et cetera. And these form the data core of the project. So it’s this unstructured information from the transcripts and then highly structured information collected, organized, collated, and provided by the CDCR that we had to get through a lawsuit.

So I know there’s a lot of students in the audience. And before getting into the specifics of the paper and our analyses, I also wanted to spend a little bit of time talking about some meta considerations and questions, especially in light of the current US political landscape.

So the paper I’m going to present today– I actually began my career as a tenure track professor in sociology at the University of Florida, even before the current Trump administration. And the administration at the University of Florida told me pretty flatly that this work is illegal based on new state-level policies regarding higher education.

They also told me, just as flatly, when we were applying for grants, and we were invited to apply for a pretty big grant, that there was a non-zero chance that they would reject the money. So I just want to say, I’m not trying to spook you all or anything. And of course, we in this room are not immune to a lot of the federal restrictions and regulations that are coming down.

Maybe I’m wrong, but I do believe that in California, you’re not going to have the same kind of eagerness to enforce these highly restrictive policies that our colleagues in states like Florida have to face. So again, just talk to you all as the students. So that’s one consideration.

Beyond this, there are additional basic questions that I hope this talk can answer for you all. So one is, how do you do this kind of work? I’m talking about lawsuits, 35,000 transcripts, 700 million words. What are some really basic philosophies that you need to consider when you set out to do this work?

The other is why study transcripts at all? There’s a fair amount of literature that relies on tabular data to examine parole processes. But what is it about the transcripts that provides something that traditional data does not? So as you might guess, I believe studying the transcripts is very important. But again, I hope I can convince you as well.

So the first point I want to make is that NLP and AI, despite all the advances, despite the wonders of ChatGPT, are still not powerful enough to do this work without strong, explicit human input.

So in this work, we draw a lot on computational grounded theory. This is a methodological framework designed by sociologist Laura Nelson, where, again, human input and curation is a key part of the process.

So the next thing, the next meta point I want to make is that doing this work requires a combination of domain expertise and interdisciplinary collaboration. So these are the current main collaborators on this project.

So two of my collaborators are Kristen Bell, a law professor at the University of Oregon, and Ryan Sakoda, a law professor at the University of Iowa. Beyond their legal expertise, they actually have practical, professional experience in parole hearings. So again, they bring a lot of domain expertise to this project.

Our other collaborator, Jake Searcy, is a professor in data science at the University of Oregon. And he has a lot of experience designing larger data science projects, as well as technical training in particle physics, which, again, you might not think is connected, but he brings a lot of that knowledge into this work.

I think, collectively, what I’m trying to argue here is that if we want to solve these kinds of big, broad social problems and questions, similar to a 2017 paper by Duncan Watts, where he argues that social scientists could maybe do a better job at solving social problems, it’s important to bring in different experiences and perspectives.

So then finally, the other meta consideration is that the influx of computational methods to analyze text has created opportunities to study communications, processes, language, and important social outcomes that were not designed with these tools in mind.

When the decision to transcribe California parole hearings was made, they weren’t thinking about, oh, yeah. And in the future, there could be some cool computational work that leverages text as data. This was not on their radar.

And I think we’re still in that moment where this idea that text as data or text is data is fairly novel, especially to these big social actors. And I think we have an opportunity to talk back to different social institutions and processes.

And other scholars in other domains have made similar arguments. So there’s a historian at Columbia named Matthew Connelly. He’s working with a large archive of federal government communications, totally internal data. I think he has every internal communication from the Pentagon over several decades.

And again, when they were communicating, they weren’t thinking about these tools coming into existence a few decades later. So these are some meta questions. And I hope these can help seed the conversation and get you thinking about how you can also set out and do this work.

But now we’re going to put those aside for now and go straight into the paper that I’m going to talk about, which is a forthcoming paper in the Berkeley Technology Law Journal. And it’s just about machine learning and parole. And I want to briefly touch on some theoretical and empirical motivations.

So the first is that there’s actually a rich history of social scientists studying parole. So in doing some lit review work, I found articles from the American Sociological Review, which is the flagship journal of the American Sociological Association, as early as 1940, where they’re trying to find out and explain– how are these parole decisions made, and how is it that people are affected differently given their backgrounds, and identities, and experiences? And there was another one in 1955, which compared sociological with psychological perspectives to this basic question.

Anwar and Fang, Young, Huebner and Bynum, and many others have all focused on the role of race in predicting outcomes. Is it the case that the race of the parole candidate yields differential predictability in a given outcome?

So two caveats to consider. One is that while these studies are rich and very interesting, they did not consider the full breadth of the transcripts in the actual hearings. So it is possible that in states like California, where there is a full hearing that is, again, transcribed, there could be other things at play that could predict outcomes. And that’s where we come in.

The other thing that I think it’s really important to consider, especially if you’re interested in doing this work, is that the definitions of race here are not self-defined. There isn’t a moment where the parole candidate says, well, this is how I define my own race, as I understand it. All the definitions of race are imposed by the CDCR. They look at you. They say you belong in this category. Job’s done.

It’s a subtle but important point. And it’s aligned with what sociologist Nancy Lopez calls street race. So how you perceive yourself might be different from the way the world perceives you.

So we’re also motivated by our own past research. So this project, which I’ve alluded to a couple of times, we call Project Recon. And the Recon there is short for both reconsideration and reconnaissance.

So the overarching goal is to influence policy and practice by shining a light on the parole process in California. And then eventually to help make a tool to help decision makers identify anomalous outcomes and decisions in order to reconsider those cases.

So for example, if it turns out that there’s systematic racial bias in California parole hearings, we want to make a tool that can help them identify those cases and provide evidence as to why they should spend the time reconsidering them.

So to that end, we have former collaborator Jenny Hong. She’s now a research scientist at Meta. She spent a lot of her PhD working on this exact tool. And some of the technical innovations that she came up with, we leverage in the study that I’m going to talk about today.

This is also true for work done by Graham Todd, who’s now a computer science PhD student at NYU, as well as Catalin Voss, who did the classic Silicon Valley thing: he was a computer scientist turned startup founder. But he also did a lot of work on this project.

And even with the current team, Ryan Sakoda, the law professor at Iowa that we’re working with, has, independently of all of our work, also been examining parole processes and outcomes.

So I also just want to point out the early work led by Kristen Bell; she’s the leader of this project. She wrote a paper in 2021 that, again, designed and implemented this platform, not just at a high level, but actually testing it out. And in the paper I present today, we present the analyses that informed that platform, if that makes sense.

And that leads, again, to the paper that I’m going to spend most of the talk today going over, which is called Using Machine Learning to Scrutinize Parole Release Hearings. So it’s forthcoming in the Berkeley Technology Law Journal. We’re really excited. If you want a copy, I think I can share. I have to check with my lawyer, literally, about what I’m allowed to do.

Again, beyond this work and this very particular paper, I also want to share some of the perspectives that I bring into this work as a computational sociologist. So one is what I call or think about as this nexus between science and policy. And this first bucket is something that, again, I personally draw a lot from.

And a lot has been written specifically about parole in this context, as well as technology in society. So for example, there was a special issue in The ANNALS of the American Academy of Political and Social Science, co-edited by David Harding, who’s in the room, Bruce Western, and Jasmin Sandelson. But even beyond that, again, there’s a lot written about technology and society.

So quantification is a framework that I draw a lot on, such as the Espeland and Stevens paper from 2008. And as technology has advanced, there’s also a lot of work about datafication, digitization, and algorithmic bifurcation.

And there’s a great Annual Review piece by Jenna Burrell and Marion Fourcade, also in the room, about algorithms in society. And I think a lot of those arguments are about how state actors, agencies, and organizations, things like the CDCR, use data and algorithms to shape outcomes, to shape social processes, to mold society into the ways that they want to see it in their respective domains.

So my, perhaps, idealistic hope is that in the same way that there’s this top-down energy to using data and algorithms to shape society, perhaps there are also opportunities to be reflexive, to take more of a bottom-up approach, to talk back to these state actors, agencies, and organizations, to push back with data and analyses.

So I’m particularly inspired by the 2018 Gender Shades paper by Buolamwini and Gebru. In that paper, they described racial bias in facial recognition technology. And the pushback from Amazon and the big tech companies was really strong.

But eventually, they changed their policies and practices. And now, arguably, we live in a world where this idea of mass facial recognition is frowned upon. And I think their work really helped develop that.

So finally, this is getting more classic sociological. So I tend to view things a lot from the perspective of culture and language. And to that end, I draw on a lot of work on evaluations such as the work by Lauren Rivera about cultural matching in job hiring.

And Michele Lamont has done a lot of work about evaluation. In the legal research setting, there have also been studies about parole that take this perspective. So there was a paper by Bronniman in 2020 that examined expressions of remorse in parole hearings.

And there was another paper by Greene and Dalke, which analyzed expressions of anger and masculinity in parole hearings. So there’s a rich body of literature there. So I just wanted to share– this is the literature that we’re building upon and from. And these are the perspectives that I bring to my contributions to this forthcoming paper.

OK, so getting into the actual paper, Using Machine Learning to Scrutinize Parole Release Hearings. So we go over the following questions. One is: is information extracted from the transcripts using manual annotation, which is the human input that I mentioned, and NLP more predictive of outcomes than traditional tabular data?

By tabular data, I mean the spreadsheets, again, highly structured data created and collected by the CDCR, that are used most often in this research. So I’m going to give you a sneak peek as to some of the answers to these questions. The answer is yes, but with some important caveats.

So question two, to what extent does the commissioner assigned to preside over a given hearing explain variation in parole release decisions? The answer is quite a bit. It’s a shocking amount. And I’m going to go over that.

Then finally, is hiring a private attorney correlated with a higher likelihood of receiving parole? So one important feature of California parole is that parole candidates are entitled to an attorney. So you have the option to use your own money and resources, or have someone do it on your behalf, to hire someone, or the board will appoint an attorney for you.

And the answer is that hiring a private attorney is associated with a higher likelihood of receiving parole. And it’s not just because they have cool shoes and they’re very fancy. A lot of what we see is that they have a very different approach to defending a parole candidate in each hearing.

So just to go over the data. As I mentioned, our data coverage is 12 years. We begin on January 1, 2007, and go all the way to November 22, 2019. And this represents the universe of parole hearings in this time frame.

So as I mentioned, there’s over 35,000 total hearings, but there were some confidentiality concerns and a few data extraction issues. So the final data set for this particular paper is 34,993.

And the reason why I wanted to point this out is that there is a small difference. And even though there are some key limitations to using NLP, machine learning, and AI in this work, it’s still pretty good. You’re still only going to lose a little over 100 transcripts, in our case, if that makes sense.

So as I already mentioned, over 5 million pages, over 700 million words. Each transcript is about 100 pages, on average, and contains 20,000 words. And I just really want to be explicit here. So every time I mentioned information extracted from the transcripts, this is literally what I’m talking about– words or groups of words or phrases that were extracted from the transcripts.

So most people who go to these parole hearings are not granted parole the first time, hence you have fewer parole candidates and more parole hearings. And then finally, the three sources of data that I’m going to be explaining a lot. The first is the NLP, and what I mean by that is the features from the transcripts that were extracted using the computational methods.

There was also a large, long process of manually annotating the transcripts, which we used to help with the NLP, though there were also things that we extracted that we didn’t end up using. And then the tabular data, which, again, is the traditional source of data for this kind of work. And this is the data that we sued the CDCR to access.

So just to give you a sense of what this process looks like on a year to year basis– so these are the numbers for 2019. So in 2019, there were 55,000 prisoners eligible for parole. And of those 55,000, 6,000 hearings were scheduled in 2019. So just right there, your odds are not looking good.

So these are the key members of each parole hearing. You have the commissioners: the presiding commissioner, who makes the final decision, and the deputy commissioner, who helps out, talks things through, gives points, et cetera, but doesn’t make the final decision.

You have the parole candidate. You have their attorney. And sometimes you’ll have– the victim will show up to either say, oh, throw away the lock and key or the DA might also show up and do the same thing.

So from these 6,000 hearings, 4,800 were denied, meaning that they were given an additional prison sentence and 1,000 were granted parole. So this is about 80%. So 80% of the time, they are going to deny. And then there’s a very small chance that if you’re in this 1,100, the governor will review and overturn. Oh, yes.

AUDIENCE: How should we think about the commissioners? Who are these people? What are their backgrounds?

AJ ALVERO: They are employees of the CDCR. I will go over some of this. They are employees of the CDCR. And this is just part of their job. They’re somewhat randomly assigned, but not totally. So if you do parole hearings in Soledad, they’re not going to ask you to go down to Tehachapi, if that makes sense. Yeah, yeah.

AUDIENCE: [INAUDIBLE] stipulated [INAUDIBLE] or all the other [INAUDIBLE]?

AJ ALVERO: Yes, anything that ends, they’re back here. I mean, the vast majority of which are just– they go through this whole process and they’re denied. Again, this is just 2019, but the process looks the same for 2007, all the way on, even though the numbers change.

The transcripts that I’ve been talking about for so long already record every single word that is spoken here. Methods: so I already mentioned we have this tabular data, which was provided by the CDCR, and we have the hearing transcripts.

So the tabular data– I mean, again, if you look at the literature, you throw it into a regression model. Voila. Not to make light of it, but this is just how it is.

Again, this is where our intervention comes in. So we use the tabular data also for training. But here we have these two processes: human reading and human labeling of text, which gets us our manually extracted data, and then computationally read and extracted data from our extraction model. And that’s our NLP. Yes.

AUDIENCE: [INAUDIBLE]

AJ ALVERO: I’m going to show that, yeah. This is a good question. But these are arguably the key ones, right? So for the extraction model, we used pre-trained RoBERTa and BigBird models. And again, there was heavy input from the manual annotation and extraction.

I also want to note, part of the reason why we used this approach is that a lot of this work was done before the release of ChatGPT and the LLMs. But it’s also the case that even now, the most powerful models all rely on APIs, and we can’t just share the data with OpenAI, if that makes sense. But that might not be the case now with smaller models.

So these are word embedding models. BERT was created at Google. It stands for Bidirectional Encoder Representations from Transformers. So essentially, with word embedding models, you have a big data set.

When I say pre-trained, I just mean that, before the LLMs came out, Google Research took all the text on the internet, using the Common Crawl, and trained their models on that. And then from there, you can take these models and fine-tune them on your given data set.

I mean, it was literally out of the box, and then we fine-tuned it for our purposes. For that slice of time, that was the computational standard. If this helps, let me know.
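To make the "pre-trained, then fine-tuned" idea concrete, here is a minimal sketch using the Hugging Face transformers library to fine-tune a pretrained RoBERTa classifier on manually annotated transcript passages. The file name, label scheme, and hyperparameters are illustrative assumptions, not the project’s actual extraction pipeline.

```python
# Minimal sketch (assumptions, not the project's actual pipeline): fine-tune a
# pretrained RoBERTa classifier on manually annotated transcript passages,
# e.g., labeling whether a passage discusses the psychological risk assessment.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL = "roberta-base"  # a BigBird checkpoint could be swapped in for longer passages

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Hypothetical CSV of hand-annotated passages with columns "text" and "label".
dataset = load_dataset("csv", data_files={"train": "annotated_passages.csv"})["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    # Passages are assumed to be split upstream so they fit the model's input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-parole-extractor",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()
```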

So getting to the first question: is information extracted from the transcripts more predictive of outcomes than traditional data? So again, for the tabular data, we had this highly structured data for each parole hearing. And then we have the NLP extracted features for each parole hearing as well. And then for the manual coding, we hand coded only 688 transcripts.

And our model is a pretty straightforward logistic regression model. And basically, the punchline here is that, using out-of-sample area under the curve, which we chose because of imbalances in the data, both in terms of comparing 35,000 hearings with roughly 700 hand-coded ones and in the fact that most of the hearings end in a denial, we see that the NLP approach outperformed the other two.
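As a rough sketch of that comparison, and assuming hypothetical column names rather than the paper’s actual code, the idea is to fit the same logistic regression on each feature set and compare held-out AUC, a metric that tolerates the grant/deny imbalance:

```python
# Sketch of the comparison described above (hypothetical column names, not the
# paper's code): fit the same logistic regression on each feature set and
# compare out-of-sample AUC.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

hearings = pd.read_csv("hearings.csv")  # hypothetical file: one row per hearing
y = hearings["granted"]                 # 1 = parole granted, 0 = denied

feature_sets = {
    # Columns echoing the tabular variables named in the talk; names are assumed.
    "tabular": ["age", "private_attorney", "commissioner_grant_rate", "initial_hearing"],
    # Transcript-derived features produced by the extraction model.
    "nlp": [c for c in hearings.columns if c.startswith("nlp_")],
}

for name, cols in feature_sets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        hearings[cols], y, test_size=0.3, stratify=y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: out-of-sample AUC = {auc:.3f}")
```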

So there was a question earlier about, OK, well, what is this tabular data that you’ve been going on about? So we have the demographics, all of which or most of which are statistically significant. We have whether or not they retained a private attorney.

We have the commissioner grant rate. And then we have whether or not this is initial hearing, years since 2007, and the prison type. And this is it. That’s everything from the tabular data. Again, a lot of literature is based on this.

And look at the other information that you might want to know, information that would go into deciding whether or not someone is granted parole, or that you would want if you’re analyzing trends and tendencies in parole hearings. So nothing about the conviction, nothing about rehabilitation steps, nothing about disciplinary history or special designations, such as whether they’re elderly or a youth offender.

These are the kinds of things that– I mean, just between us, this seems very important if you want to understand how people are making these decisions. But these are not available in the traditional data. They are available when you add this computational lens to the data.

And here, just going through, we can see that for the manual inspection, psychological assessment was very predictive. The closer to 0, the lower the likelihood. Time, chronos bucket is programs and activities you did while you were serving.

And then with the NLP model, what we were able to do is take all of the manually annotated features and only focus on the ones that had the strongest predictive capabilities. And that’s what you see here. Yes.

AUDIENCE: The psychological assessment, what is it? Is it simply you’ve taken this step or is it–

AJ ALVERO: It’s a risk assessment.

AUDIENCE: [INAUDIBLE]

AJ ALVERO: Yes, it’s a risk assessment. So if you’re rated as high risk to society, this is capturing that. I mean, it shouldn’t be too surprising that if you have that label, they’re probably not going to grant you parole.

So obviously, there is more information in the transcripts. I mean, we didn’t need to have a whole talk to understand that. But importantly, this information is more predictive of the decisions and outcomes than the traditional data.

I also want to note that in the paper we control for commissioner variability and the trend still holds. Does this answer your question from– was this helpful or? No, not in the parole hearings.

All right, so going to question two: to what extent does the commissioner assigned to preside over a given hearing explain variation in parole release decisions? So here I’m going to show you a bar chart. Each bar represents a presiding commissioner who conducted 50 or more hearings. And the y-axis here is their respective grant rates. I’m going to show you the five highest grant rates and the five lowest.

So on the high side, there is a commissioner who will grant parole over 50% of the time. And it goes down a little bit, but even still, if you’re in the top five, you’re granting parole 40 to 50% of the time.

On the low end, you have less than a 5% chance, based on the presiding commissioner. And here’s the rest. And as you can see, it’s just a pretty steady slope. So hopefully you’re here, but you could very well just be somewhere over here or maybe even down there.

So it’s important, again, as a reminder: the presiding commissioner is the one who makes the final decision as to whether or not someone is granted parole. And the potential for idiosyncrasy suggests that these trends are worth pursuing in further analysis.
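A minimal sketch of how those per-commissioner grant rates could be computed, restricted to commissioners with 50 or more hearings (the column names here are hypothetical, not the project’s actual schema):

```python
# Sketch (hypothetical column names): grant rate per presiding commissioner,
# keeping only commissioners who presided over 50+ hearings.
import pandas as pd

hearings = pd.read_csv("hearings.csv")

rates = (hearings.groupby("presiding_commissioner")["granted"]
         .agg(n_hearings="size", grant_rate="mean")
         .query("n_hearings >= 50")
         .sort_values("grant_rate", ascending=False))

print(rates.head(5))   # five highest grant rates
print(rates.tail(5))   # five lowest grant rates
```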

So next, is hiring a private attorney correlated with a higher likelihood of receiving parole? So as a reminder, each parole candidate is given an attorney: either a privately hired, retained attorney or a board-appointed attorney.

I also just want to note that this public attorney is not the same thing as a public defender in regular court, but it is analogous to that, where they’re assigned someone.

So if you have one of these appointed attorneys or public attorneys, just totally naively, you have a little over a 20% chance of receiving parole, meaning that you have a slightly under 80% chance of not receiving parole.

If we now compare with the privately hired attorneys, and 7,000 people went this route, you can see the probability of getting parole essentially doubles. You go from 20% to 40%.

We see that you have roughly double the likelihood, but do the transcripts give us a clue as to why that might be the case? Or at the very least, are the private attorneys, are they saying something different? There’s not a stamp on their forehead that says private. What are they doing differently in the actual hearings?

So here, this is taking the total number of words used by each member or actor in the hearings and breaking it down by who’s speaking them. So for the board-appointed attorneys, again, these are the publicly appointed attorneys, they take up about 8% of the words used in each hearing. The parole candidates are at about 26%, and the commissioners are at about 40%.

Compare this with the privately hired attorneys. I know 12% versus 8% doesn’t seem like a whole lot, but this means that the privately retained and hired attorneys are getting 50% more air time, literally just the words being spoken at each hearing, than the board-appointed attorneys. I also want to note, these differences– through statistical testing, these are all statistically significant differences as well.
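Because each speaker turn in the transcripts is labeled, the share of words per role can be computed directly. Below is a rough sketch that assumes a simple "SPEAKER NAME: utterance" turn format; the project’s actual parser is not shown here.

```python
# Rough sketch (assumed turn format "SPEAKER NAME: utterance", not the
# project's actual parser): compute each speaker's share of total words.
import re
from collections import Counter

def word_share(transcript_text):
    turn_pattern = re.compile(r"^([A-Z][A-Z .'\-]+):\s*(.*)$")
    counts, speaker = Counter(), None
    for line in transcript_text.splitlines():
        m = turn_pattern.match(line.strip())
        if m:
            speaker, utterance = m.group(1), m.group(2)
        else:
            utterance = line  # continuation of the current speaker's turn
        if speaker:
            counts[speaker] += len(utterance.split())
    total = sum(counts.values())
    return {spk: n / total for spk, n in counts.items()}

example = """PRESIDING COMMISSIONER SMITH: Good morning. Please state your name.
INMATE DOE: Good morning. My name is John Doe.
ATTORNEY LEE: My client has completed several programs."""
print(word_share(example))
```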

OK, so purely from this perspective again, OK, the private attorneys are taking up more air time. What does that even mean? Are they speaking more? Or are they just taking up space? Or are they speaking differently?

So here’s where we answer that question. So we had this bespoke statistical model of word frequency for each word used in the hearings. And then here is just a log scale of the total– of how often those words appeared.

So here, as you go lower, these are the words most often associated with or used by the board-appointed attorneys. And I know it’s hard to see here; for most of the words, there’s not a huge difference. But these were the words that were most associated with the publicly assigned attorneys, coming from them.

Like, literally, in the transcripts, uh was the most common, followed by um. And you can even see inaudible here, meaning that maybe they were muffled or they weren’t speaking clearly. So I mentioned air time and noise; it’s quite literally noise that is most often associated with the publicly assigned attorneys.

So on the other side, we have the privately retained attorneys. And here we can see she and her. So it was the case that female parole candidates were more likely to hire attorneys, and this is captured by that. I know this one is a little bit harder to see, but a lot of these terms are argumentative terms: evidence, arbitrary, exhibit, et cetera.

So what we interpret this as is that the privately retained attorneys are making their case more clearly and maybe more articulately than the public attorneys. Yeah.
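One common way to surface the words most distinctive of each attorney type (not necessarily the bespoke word-frequency model used in the paper) is a smoothed log-odds-ratio comparison of word counts between the two groups; the token lists below are toy illustrations:

```python
# One common approach (not necessarily the paper's bespoke model): compare word
# use between board-appointed and privately retained attorneys with smoothed
# log-odds ratios; positive scores lean "retained", negative lean "appointed".
import math
from collections import Counter

def log_odds(appointed_tokens, retained_tokens, prior=0.5):
    a, r = Counter(appointed_tokens), Counter(retained_tokens)
    na, nr = sum(a.values()), sum(r.values())
    vocab = set(a) | set(r)
    scores = {}
    for w in vocab:
        # Add a small prior so words unseen in one group still get a finite score.
        pa = (a[w] + prior) / (na + prior * len(vocab))
        pr = (r[w] + prior) / (nr + prior * len(vocab))
        scores[w] = math.log(pr / (1 - pr)) - math.log(pa / (1 - pa))
    return scores

appointed = "uh um uh the board um inaudible uh report".split()
retained = "evidence exhibit arbitrary evidence plausible the record".split()
scores = log_odds(appointed, retained)
for w, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{w}: {s:+.2f}")
```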

AUDIENCE: [INAUDIBLE]

AJ ALVERO: No, but we did the same thing but in the other direction. So it is the case that there are many, very specific terms that aren’t as ambiguous. And we wanted to see, to what extent are the different attorney types using these different types of terms?

And again, just as a reminder, we have two former attorneys who worked in parole on our team. So we were able to come up with a list of– these are legal standard terms and also cases where parole was central.

So I know this is a lot, but the trend I want to point out here is that for all of these, again, keywords and terms about parole, the retained attorneys were more likely to use them. And each of these with the little arrows going up means that these were the terms most associated with getting a grant. In fact, there is only one exception.

So as a note, plausible here is, yes, the word plausible, but it’s also a legal standard for when someone is claiming innocence. So it would pop up more in cases where the parole candidate is claiming innocence.

And in general, the board does not like that. They don’t want you to claim innocence. They want you to show remorse, et cetera. So I think that helps explain why this term was used more often by the retained attorneys, yes, but in general was associated with a lower likelihood of receiving parole.

We have not done that yet, but we’re working on analogous studies. I’m going to talk about some of them in a second. Yeah, from a total administrative, demographic perspective. We get into some of this, more to your question and comment, in the future studies. But if you still have that question after I go over them, raise your hand again.

But along these same lines, we also broke it down by– again, this is the CDCR label of the parole candidate’s race. We see that if you’re labeled as white or other– and other captures a lot of people from Asian-American backgrounds, you’re more likely to retain an attorney compared to Black and Latinx.

So just a quick recap, the information is more predictive, but also, it comes from domain relevant information. But again, something we want to follow up with is–

AUDIENCE: [INAUDIBLE] they a lot, at this point, [INAUDIBLE].

AJ ALVERO: So question two, to what extent? High variability, from over 50% to less than 5%, meaning that in these important decisions, there is some influence from idiosyncrasy. And is hiring a private attorney, how does that change your outcomes? You’re more likely to receive parole. And also, they tend to speak differently. They have a different approach.

So again, what I’m trying to argue here, going back, is that parole transcripts uniquely show the ways that different factors of the parole hearings are predictive of outcomes. So these results also point to future directions of research, which I’m going to quickly go over.

So I mentioned earlier, we had a grant that we applied for. I don’t work at UF anymore. So I’m able to reapply and all that. So that’s awesome. So one thing we want to look at is racial bias in parole outcomes.

So we see that the tabular data, and the transcripts, and the attorney type, and the presiding commissioner, we see that all these are important in explaining parole outcomes.

But it’s also the case that the racial identification of the parole candidate is also very important to all of these same pieces of information. So what we want to do is start to compare and triangulate: how does racial bias affect the hearings, the evaluations, et cetera?

So we also want to do a causal study. So Ryan Sakoda, who’s on our team, not only is he an awesome law professor, he also has a PhD in economics from Harvard. And he was, like, let me do my causal thing.

So we want to examine a couple of policy implementations. So one was the expansion of elderly parole in 2014 and 2020 and the implementation of youth offender parole in these three years.

So then finally, something else we want to do is model the parole hearing as a stochastic process. And what is a stochastic process? You can imagine all the parole hearings have this start point. And all of them have an endpoint of either grant or deny. And each parole hearing takes a path.

You’re going to start here. And you’re going to end up in grant or you’re going to go to deny. But of course, they’re not going to go in straight lines. There’s going to be moments where maybe you end up here, but maybe it seems like you’re going to go to grant, or maybe it seems like, oh, for sure, you’re going to go to deny, et cetera. Same for the grant decisions.

And I think here is where we want to start asking: at what point in the hearing can we accurately, reliably predict the outcome? Is it the case that we can do it even on the first page, where it’s like, what is your name? These kinds of things.

So there was a study in PNAS a couple of years ago, where they did a similar analysis with movie scripts. And it’s like each turn for each character was treated as a step in the process.

Literally, if you have a script, it’s Bobby says this, Jimmy says that. And the transcripts are structured in a very similar way. And it allows for this kind of analysis of what happens at each turn or each step of the hearings.
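To give a flavor of that turn-by-turn framing, here is a toy sketch, not the planned implementation: score the partial transcript after each speaker turn with any fitted probability model and trace how the predicted probability of a grant evolves over the hearing. The training texts and turns below are invented placeholders.

```python
# Sketch of the turn-by-turn framing (not the planned implementation): after each
# speaker turn, score the partial transcript with a fitted classifier and record
# the running probability of a grant, giving one "trajectory" per hearing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: full hearing texts and their outcomes.
train_texts = ["expressed remorse completed programs low risk",
               "claims innocence high risk disciplinary violations"]
train_labels = [1, 0]  # 1 = granted, 0 = denied

vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(train_texts), train_labels)

def grant_trajectory(turns):
    """Probability of a grant after each successive speaker turn."""
    probs = []
    for i in range(1, len(turns) + 1):
        partial = " ".join(turns[:i])
        probs.append(clf.predict_proba(vectorizer.transform([partial]))[0, 1])
    return probs

hearing_turns = ["please state your name",
                 "i have completed programs and expressed remorse",
                 "the risk assessment rates him low risk"]
print(grant_trajectory(hearing_turns))
```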

Thank you very much. This has been great. And I also wanted to put up some citations that informed this work. So thank you all.

[APPLAUSE]

DAVID G.: Hi, I’m David G. I’m faculty at public health. Really interesting talk. This is obviously far from my discipline, but really interested in how this might get taken up by practitioners. Is it something that attorneys that are representing these clients might be interested in and receptive to?

AJ ALVERO: Yeah, so I have two comments on that. So one is at least purely from a data information perspective stance, absolutely. There are things– at least from, again, also a correlational perspective, there seem to be things that parole boards respond more favorably to than others.

So I think as long as we’re going to continue to have this process in California, I think there is a lot of opportunity to coach people up in certain ways. I don’t know how receptive they are to it, but I think, this analysis and information has never been provided to them. So it could be helpful.

The other thing is that California passed the Racial Justice Act. So even beyond coaching up attorneys and what they say, it is possible that some of this work could also maybe even overturn decisions from, again, a legal policy perspective. So maybe we don’t even have to go to the lawyers. Maybe we can just overturn some of these decisions in that way.

AUDIENCE: Seems like there might be some bias.

AUDIENCE MEMBER: Hi, my name is Isaac D. I’m a graduate student in the [INAUDIBLE].

AJ ALVERO: Excited, you too.

ISAAC D: Yeah, thank you. And yeah, thank you for this wonderful talk. And I was also struck by the just wild variance in commissioner outcomes. And it strikes me that there’s also potentially differences, not just in the outcomes but what individual commissioners pay attention to and care about. And so that might show up.

And some commissioners care a lot more about domestic violence than other commissioners. Some commissioners show more racial bias than other commissioners. So I’m curious if you’ve looked into that or how you would approach that.

And then also thinking a little bit about the spatial and time dynamics. So there are a lot of changes over this period. And certain commissioners are hearing cases at different periods of time, where there’s a different legal framework and different informal expectations around the parole board. So yeah, the time and place question, and then just other forms of variation beyond outcomes with the commissioners. Thank you.

AJ ALVERO: So the first thing I’ll say is that we have looked at what you were asking but from a slightly different perspective. So the racial bias future studies thing that I mentioned, one thing that we want to do is take a very upfront and explicit sociolinguistic perspective and ask this question of, is it the case that there are things that, generally, will lead to– are there things that parole candidates can say in the hearings or maybe the way they describe themselves, again, going back to the sentiment analysis idea.

Are there things that they can say that is generally associated with getting a grant? But is it the case that favorability is mediated by race? So for example, is it the case that if you are seen by the parole board as a Black parole candidate and you take a similar approach as white parole candidates who end up getting parole, are you treated the same way?

Or is there some kind of expectation that you should be speaking in this particular way because of this particular background, getting into sociolinguistics 101. So we’re on the same wavelength, but maybe slightly different.

The other question is– so I actually didn’t mention this, but we’re going to get transcripts all the way up to 2025. So something we wanted to look at was the effect of COVID. So during COVID, the parole hearings moved to Zoom. And we want to do a study examining, is it the case that the move to Zoom hurt people, helped people, et cetera? So we’re also on a similar wavelength, but we haven’t gotten there yet; there are a lot of things we want to do.

AUDIENCE: Hi, thank you. My name is Alan. I’m a PhD student, also from public health. And I was curious, I don’t know if quantitatively this is a different mechanism, but whether you looked at the likelihood and patterns in the odds of not getting parole granted.

So not just what, linguistically and contextually, increases the likelihood of getting parole granted, but also, was there something along the process where, at this point, or given these kinds of common patterns, OK, this isn’t going to happen. And understanding the bias and injustice through that lens.

AJ ALVERO: For this particular paper, I think we did some of that. We backed into some of that, in the sense of we weren’t– that’s not what we were setting out to do because ultimately, again, the big overarching idea is that these analyses can inform some tool that gets used to identify anomalous cases.

So I think if we found patterns where we can identify what are things that are said that– you’re talking about reducing the likelihood of–

AUDIENCE: I’m just thinking counterfactually, like when you’re showing how– I mean, in a causal sense. But if you’re showing how these were the terms and phrases used that increase the likelihood of parole. And you might suggest, in a practice sense, these might be ways to cater a practitioner’s or a lawyer’s logic, words, process.

But also understanding, well, what’s happening if the other outcome is occurring where these might be places to avoid? Especially if you’re thinking about potentially for this walk of– OK, where along the hearing is there going to be– I’m imagining.

If I’m a commissioner and, say, hypothetically I have a racial bias. And my racial bias gets really activated at one moment. And at that point, the rest of the hearing is–

[INTERPOSING VOICES]

–because well, why? But it’s just entertained. And through this hearing, it just might feel insensitive or impractical, but the commissioner’s, OK, implicitly it’s over.

And that might reveal some other– I’m imagining, some other mechanisms where the system is working to not actually grant parole is working to enforce and control. And that’s perhaps highlighting some feedback mechanism. OK, well, there is some other pattern here in which the prisons are doing as they’re intended into incarcerate.

AJ ALVERO: Yeah, what is their actual function here? Yeah, I mean, as context, the legal standard is that you go through the parole process to be released. That is the law. That’s what’s enshrined. That’s, again, this legal standard. But as we can see, it’s 80% of the time, you’re not going to get– you’re not going to get parole.

Something we’ve talked about in, again, this stochastic process is exactly what you’re talking about. What are the points when– I mean, it could even be the presiding commissioner says something, where you just start going– it seems like you’re going to be– you’re in that grant trajectory. What are the moments that you just plummet and end up in denial? Yeah, we’re doing a lot of work on that right now.

AUDIENCE: Thank you.

AJ ALVERO: I don’t know if I answered your question.

AUDIENCE: Yeah, no, that was helpful. Thank you.

AUDIENCE: Hey, what’s up everybody. My name is Clarence. I’m an alum of the Goldman School of Public Policy. So OG up in here. What’s up, Professor Harding, I just recognized you. But yeah, I want to say amazing job. It’s so dope to see text analysis getting the recognition that it deserves because usually tabular data is king and the go-to.

But I had a similar question. I just want to know, how long did that lawsuit take to get all that data and information? And did you say it was only for the tabular data that the lawsuit was for or was it for the transcripts, too?

AJ ALVERO: It was for the tabular data. This is also one of the tensions of this work. Technically, the parole transcripts are publicly available. So you could go to cdcr.com and find the parole hearing transcripts. But getting all of them, I think they had to help us get all of them rather than us downloading them one by one.

And the other issue is, even though they’re technically public data, they’re also really sensitive. So imagine if we do this work and we just publish everyone’s name who has appeared in our data set. And then every time they google themselves, it’s not, I overcame, I went through this process, whatever. It’s, here’s the parole paper. So there’s a lot of tension there.

Yeah, to answer your question, yes, the tabular data is what we had to sue for. I think overall, I don’t know the timeline, maybe a year or two. I’m not totally sure. I’d have to check.

PROFESSOR: I’ll allow one last question.

AUDIENCE: Thank you. My name is [? Rae ?] [? Willis-Conger. ?] I’m a grad student in the sociology and demography department. Thank you for this talk. It seems like you extracted a lot of information through NLP that was largely categorical and then compared it to the largely categorical demographic data that you sued for.

And obviously, there are differences. But you talk later in the talk about maybe a more narrative analysis. So I’m interested in your thoughts on the differences between narrative analysis using NLP versus this more categorical approach, whether something is more suited to public policy intervention or speaking to inequalities, that sort of thing.

AJ ALVERO: Yeah, so for the first question about the categorical nature, you’re 100% correct. And what we were trying to do was– basically, we trained our model to classify the transcripts.

And that’s where it can get tricky, where maybe one presiding commissioner uses a slightly different term than another one, but they’re talking about the same piece of information. That’s where the model is able to say, well, this person said that, that person said that, but it all means the same thing. So we can add it to the bucket.

That’s what a lot of this work was built on, and that’s where a lot of it is going. And I would imagine, for public policy, that’s what they would prefer, is my assumption. For the narrative piece, I think you can make really interesting cases and arguments.

But sometimes it can get tricky, where there is this narrative interpretation act: you’re assuming that when they said something, that is exactly what they meant, and it’s part of this broader story and narrative.

And I think we have to be careful with that because maybe they aren’t– maybe they’re really nervous. And we’re not going to know that just from the transcripts. Or maybe someone made a wink, I don’t know, a furtive wink at the parole candidate and totally threw them off their game. We aren’t going to know that either.

Anyway, to get to your question, that’s part of the reason why we’re really trying to anchor some of these future analyses in the outcomes, because regardless of what we don’t know about the actual hearings, we do know what ends up happening at the end.

And I think that’s our move to get out of some of the interpretive trouble that can pop up with narrative analysis, if that makes sense. I will also say, personally, I did a lot of– my PhD was about studying college admissions essays.

And in that case, we didn’t have the outcomes. So we were always keenly aware of our limitations. So I think here, that’s been my thing is we need to ground this in the outcomes to have the most credibility and validity in our claims.

PROFESSOR: I think let’s join in thanking AJ for a wonderful presentation.

[MUSIC PLAYING]

 
