Bruce Murphy
Murphy’s Law

State Justice System Uses Racially Biased Test

Wisconsin Corrections uses test with bias against blacks in sentencing, parole.

By Bruce Murphy - Jan 5th, 2017 12:39 pm

Columbia Correctional Institution. Photo by Dual Freq (Own work) [CC BY-SA 3.0 or GFDL], via Wikimedia Commons.

Paul Zilly had been convicted of stealing a push lawnmower and some tools in Barron County in northwestern Wisconsin back in 2013. “The prosecutor recommended a year in county jail and follow-up supervision that could help Zilly with ‘staying on the right path,'” as a story by Pro Publica recounts. “His lawyer agreed to a plea deal.”

But Judge James Babler also had before him the results of Zilly’s scores on the Correctional Offender Management Profiling for Alternative Sanctions, or COMPAS test. The test, sold by the for-profit company Northpointe, and used widely in Wisconsin, rates an offender’s chances of committing more crimes in the future. It had rated Zilly as a high risk for future violent crime and a medium risk for general recidivism. “When I look at the risk assessment,” Babler said in court, “it is about as bad as it could be,” the story reports.

As a result, Babler overturned the plea deal agreed on by the prosecution and defense and instead sentenced Zilly to two years in state prison and three years of supervision.

Tests like this, known as risk assessments, “are increasingly common in courtrooms across the nation” and are “used to inform decisions about who can be set free at every stage of the criminal justice system, from assigning bond amounts — as is the case in Fort Lauderdale — to even more fundamental decisions about defendants’ freedom,” Pro Publica reports. And “Northpointe’s software is among the most widely used assessment tools in the country… In Arizona, Colorado, Delaware, Kentucky, Louisiana, Oklahoma, Virginia, Washington and Wisconsin, the results of such assessments are given to judges during criminal sentencing.”

The goal was to make these decisions more scientific and objective. But in 2014, “then U.S. Attorney General Eric Holder warned that the risk scores might be injecting bias into the courts” and “called for the U.S. Sentencing Commission to study their use,” Pro Publica notes. No such study was undertaken.

And so Pro Publica did its own study: its reporters obtained the risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014 “and checked to see how many were charged with new crimes over the next two years.”

The results showed the COMPAS test was “remarkably unreliable in forecasting violent crime: Only 20 percent of the people predicted to commit violent crimes actually went on to do so.”

The results also suggested the test might be racially biased: “The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.” Meanwhile, “white defendants were mislabeled as low risk more often than black defendants.”

“Could this disparity be explained by defendants’ prior crimes or the type of crimes they were arrested for?” Pro Publica asked. “No. We ran a statistical test that isolated the effect of race from criminal history and recidivism, as well as from defendants’ age and gender. Black defendants were still 77 percent more likely to be pegged as at higher risk of committing a future violent crime and 45 percent more likely to be predicted to commit a future crime of any kind.”

Since that story was published, in May 2016, “some of the nation’s top researchers at Stanford University, Cornell University, Harvard University, Carnegie Mellon University, University of Chicago and Google” decided to study the test, as a follow-up story published by Pro Publica last week reports:

“The scholars set out to address this question: Since blacks are re-arrested more often than whites, is it possible to create a formula that is equally predictive for all races without disparities in who suffers the harm of incorrect predictions?

“Working separately and using different methodologies, four groups of scholars all reached the same conclusion. It’s not…The researchers found that the formula, and others like it, have been written in a way that guarantees black defendants will be inaccurately identified as future criminals more often than their white counterparts.”
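
One way to see the researchers' conclusion: if a score is equally calibrated for two groups (same positive predictive value and same hit rate) while the groups have different underlying re-arrest rates, the false positive rates are forced apart. A minimal sketch of that algebra in Python, using illustrative numbers rather than actual Broward County figures:

```python
def forced_false_positive_rate(base_rate, true_positive_rate, ppv):
    """False positive rate implied by a fixed hit rate and calibration.

    Solving PPV = TPR*p / (TPR*p + FPR*(1-p)) for FPR: if two groups
    share the same TPR and PPV but have different base rates p, their
    false positive rates cannot be equal.
    """
    p = base_rate
    return true_positive_rate * p * (1 - ppv) / (ppv * (1 - p))

# The same calibrated score (PPV 0.6, hit rate 0.6) applied to groups
# with different re-arrest base rates produces very different false
# positive rates (0.51 and 0.39 here are illustrative, not real data):
print(forced_false_positive_rate(0.51, 0.6, 0.6))  # ~0.42
print(forced_false_positive_rate(0.39, 0.6, 0.6))  # ~0.26
```

The higher a group's base rate, the higher the false positive rate any equally calibrated score must impose on it.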

Northpointe, the company that sells the COMPAS tool, said it had no comment on the critiques. As for Wisconsin, “State corrections officials declined repeated requests to comment” for the first story published by Pro Publica, it noted.

In response to my request for comment, Tristan D. Cook, Communications Director for the Wisconsin Department of Corrections, asked me to email my questions and then sent this response:

“In 2010, DOC selected COMPAS as the primary tool for the Department of Corrections. One of the primary functions of criminogenic risk and need assessments is to provide a score that reflects the relative criminogenic risk of a specific offender relative to the overall population. Criminogenic risk and need assessments can be used in a variety of situations, including pre-sentence investigations, inmate classification, supervision intake and reclassification, and release planning. Counties can also elect to use COMPAS.”

Cook had no response to my main question, whether the department was concerned about the potential bias of the COMPAS test. But his response suggests the test can be used at every stage of the criminal justice system. Which is exactly what Pro Publica reported:

“Wisconsin has been among the most eager and expansive users of Northpointe’s risk assessment tool in sentencing decisions…In 2012, the Wisconsin Department of Corrections launched the use of the software throughout the state. It is used at each step in the prison system, from sentencing to parole.

“In a 2012 presentation, corrections official Jared Hoy (a Policy Initiatives Advisor for Wisconsin’s Department of Corrections) described the system as a ‘giant correctional pinball machine’ in which correctional officers could use the scores at every ‘decision point.’

“…Some Wisconsin counties use other risk assessment tools at arrest to determine if a defendant is too risky for pretrial release. Once a defendant is convicted of a felony anywhere in the state, the Department of Corrections attaches Northpointe’s assessment to the confidential presentence report given to judges, according to Hoy’s presentation.”

It seems remarkable that Wisconsin officials would have no concern about racial bias in a test used so widely, all the more so in a state criticized as a leader in the incarceration of African American males. As the watershed study by the UW-Milwaukee Employment & Training Institute found, Wisconsin leads the nation by far in the percent of black males who are incarcerated: statewide, 12.8 percent of all African-American males are incarcerated, nearly double the national average (6.7 percent) and well ahead of 2nd place Oklahoma (9.7 percent).

“State DOC records show incarceration rates at epidemic levels for African American males in Milwaukee County,” the study notes. “Over half of African American men in their 30s and half of men in their early 40s have been incarcerated in state correctional facilities.”

This problem goes back decades and predates the COMPAS test. But the question is whether the adoption of this test will undercut reform efforts (such as the creation of drug courts in cities like Milwaukee and Madison that aim to reduce the level of incarceration for non-violent offenders). Is the use of the COMPAS test as a “giant correctional pinball machine” throughout the system a wise decision given the evidence of its racial bias?

In the case of Paul Zilly, the man sentenced to two years in prison for stealing the lawnmower and tools, a public defender appealed the sentence after he was sent to prison and called an interesting witness: Tim Brennan, the former professor of statistics at the University of Colorado who had created the COMPAS test and later sold it to the company that now owns it. “Brennan testified that he didn’t design his software to be used in sentencing,” Pro Publica reports.

“After Brennan’s testimony, Judge Babler reduced Zilly’s sentence, from two years in prison to 18 months. ‘Had I not had the COMPAS, I believe it would likely be that I would have given one year, six months,’ the judge said at an appeals hearing…

“Zilly said the score didn’t take into account all the changes he was making in his life — his conversion to Christianity, his struggle to quit using drugs and his efforts to be more available for his son. ‘Not that I’m innocent, but I just believe people do change.'”

Zilly, by the way, is white. Meanwhile, the cumulative impact of the COMPAS test on black offenders in Wisconsin is unmeasured and unknown.

15 thoughts on “Murphy’s Law: State Justice System Uses Racially Biased Test”

  1. Vincent Hanna says:

    Why would a person who stole a lawn mower be deemed “a high risk for future violent crime?” That didn’t raise anyone’s eyebrows?

  2. happyjack27 says:

    don’t know enough about the tests or issues in the article, but

    any such test should be based on empirical data and definitely should be bayesian.

    also should include confidence intervals. an expected correlation of 1 with a variance of 0.1 is a lot different than one with a variance of 10.

    not sure of any analytic way to approach that, but one could assign a “cost” value to each outcome and then just run a monte-carlo, to get the expected cost.
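
    a rough sketch of that monte-carlo idea in python, assuming a beta posterior over the re-offense probability (the bayesian part) and a made-up cost per outcome:

```python
import random

def expected_release_cost(reoffenses, clean_records, cost_reoffend=10.0,
                          n_samples=100_000, seed=42):
    """Monte Carlo estimate of the expected cost of releasing an offender.

    Rather than a single point score, keep a Beta posterior over the
    re-offense probability (built from `reoffenses` vs `clean_records`
    among similar past offenders) and average the outcome cost over
    that uncertainty. Cost units are arbitrary and illustrative.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw a plausible re-offense probability from the posterior.
        p = rng.betavariate(reoffenses + 1, clean_records + 1)
        # Simulate the outcome: a re-offense incurs the cost, otherwise zero.
        total += cost_reoffend if rng.random() < p else 0.0
    return total / n_samples
```

    with, say, 20 re-offenses among 100 comparable records, the estimate lands near 10.0 × 21/102 ≈ 2.06, and the spread of the sampled probabilities is exactly the confidence information mentioned above.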

  3. Tim says:

    happyjack27, I’m not a statistics expert, so we’ll start there. My knowledge of Bayesian statistics is the basic coin flip scenario, but uncle Google has told me that it also means a model that takes in new info as it comes.

    How would a Bayesian model improve outcomes?

    Separately, any model used should be open to the public and open to debate.

  4. happyjack27 says:

    “bayes” means a lot of things, but it generally refers to what can be gleaned from “bayes rule”.
    e.g. using bayesian priors on a probability distribution is acknowledging bayes rule.
    uncle google was probably referring to “bayesian updating”.

    i’m not aware of any “coin flip scenario” that’s bayesian. bayesian reasoning needs at least two probability distributions; a coin flip seems like one.


    let’s say we’re doing a test for cancer.

    we can’t just say if you pass the test you have cancer.
    you have to consider the false positive rate of the test, the false negative rate of the test, and the initial probability of having cancer.

    You can’t just say “90% of the people who test positive have cancer, therefore if you fail the test, there’s only a 10% chance of having cancer.” That’s bad math. You need to know the test’s miss rate.

    it’s explained better here:

    now the same principle holds true for predicting future behavior.

    indeed, the same holds true for EVERYTHING – it’s just how probabilities work.
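
    the cancer arithmetic above, made concrete with bayes rule (the prevalence and error rates below are made-up illustrative numbers, not from any real test):

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    # Total probability of testing positive: true positives plus false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# A rare disease (1% prevalence) and a seemingly good test (90% sensitivity,
# 10% false positive rate): a positive result still means only about an
# 8% chance of actually having the disease.
print(posterior_given_positive(0.01, 0.90, 0.10))  # ~0.083
```

    the same posterior logic applies to predicting re-arrest: the base rate of the population matters as much as the test itself.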

    on your separate point, i agree with the first part: the model should be published openly.
    as for “open to debate”, i’d say definitely not by the public – the public is stupid and arrogant. open to experts, yes. mathematicians, software developers, etc. but i’d hate to see sound mathematics being “balanced out” by unsound mathematics.

  5. happyjack27 says:

    let me continue on my doctor analogy for your second point:

    i wouldn’t want my medical treatment to be crowd-sourced. i’d rather be treated by a medical professional.

    likewise for anything involving complex formal reasoning, such as this.

  6. happyjack27 says:

    a good video explanation of how bayes rule leads to more accurate predictions:

  7. andsoitgoes says:

    For Vincent Hanna – the assessment scores a number of “criminogenic” factors such as criminal history/criminal thinking, antisocial thinking/friends/family, alcohol and drug issues, education/employment factors and attitudes, age, etc… The lawn mower thief very likely had a lot of other stuff going on before this theft. Was he sentenced for the behavior involved in his crime or for his attitudes and beliefs? Most likely the latter. These factors are used in some algorithm that, yes, can be adjusted by someone with authority and without the user’s knowledge to change outcomes overall. The adjusting trend seems to have been to rate groups LOWER in risk overall to avoid stiffer, more costly penalties because, not surprisingly, in an increasingly unequal society, there are more and more people out there with antisocial beliefs. Certainly too many to lock them all up. That said, political pressure can cause a broad upward adjustment, as has been done in the case of sex offenders or, say, repeat drunk drivers. The assessment is known by trained users NOT to be a valid instrument in cases of offenders with mental health problems and/or chronic drug/alcohol use, for sexual offenders, or for those with personality disorders – basically the majority of criminals. The assessment is used on offenders with those issues regardless. They are playing games with symptoms, in my mind, pretending to have effective “treatments” while profiteers figure out new and inventive ways to take advantage financially and politically of the mess.

  8. happyjack27 says:

    From andsoitgoes’ description, it sounds like i had feared – the scoring is rudimentary, unsophisticated, unscientific, arbitrary, and capricious.

    They could at the very least collect empirical data (e.g. criminal records) and run it through a basic classifier.
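
    A minimal sketch of such a basic classifier: plain-Python logistic regression fit to empirical records (the single feature, a prior-conviction count, and the re-arrest labels here are hypothetical toy data):

```python
import math

def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Fit a tiny logistic regression classifier by stochastic gradient descent.

    `rows` are feature vectors built from empirical records (for example,
    prior conviction counts); `labels` are 1 for re-arrest within two
    years, 0 otherwise. A teaching sketch, not a production system.
    """
    w = [0.0] * (len(rows[0]) + 1)          # feature weights plus a bias term
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted re-arrest probability
            g = p - y                        # gradient of the log-loss
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def predict(w, x):
    """Probability of the positive class under the trained weights."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

    In practice one would use an established library and many features, but the point stands: the model is fit to empirical outcomes, not to someone's adjustable scoring scheme.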

  9. andsoitgoes says:

    This numbers stuff is not my field. Could be they do whatever that is – empirical data. Criminal history is used and somehow data is accumulated from different users across their customers I am told, but to what effect and what purpose is not clear to me. What I am saying is, it takes an hour or 2 or 3 of gathering history, looking at records, and talking with someone to have a very good idea of what a person’s risks and needs are without expensive software. The problem isn’t figuring out risks and needs – it never was. The problem is there are no effective cures/treatment within the prison/jail walls OR outside of them in our communities and we can’t just keep locking everyone up. We are throwing expensive resources at individuals, most of it in salaries for someone else, and most are rejecting those resources – participating in them to avoid consequences but rejecting them all the same. I would reject what is offered too. The few successful find their way with authentic supports and motivations that are mostly free and fundamental – the same ones you and I work off of. Without progress in gaining socio/economic/political equity and power, there is not progress in society and that includes reducing crime.

  10. happyjack27 says:

    “What I am saying is, it takes an hour or 2 or 3 of gathering history, looking at records, and talking with someone to have a very good idea of what a person’s risks and needs are without expensive software. ”

    Let’s remove the part “…talking with someone…”: since computer software clearly can’t do this, it doesn’t make for a fair comparison.

    So that’s 1-3 hours of labor at, let’s say, $50/hr. Compare that with 1-3 seconds for a computer program to do the same, with no labor costs, only electricity costs.

    Clearly the software is the far cheaper approach.

    Now let’s look at accuracy.

    An individual has neither the billions of data points that a machine learning algorithm would have, nor the processing power to handle it all, nor the objectivity.

    Each individual human is going to come to drastically different conclusions based on their different experiences, prejudices, mental short-cuts, personality, and basically how they’re feeling at the time.

    In sum, the judgement of a human, without proper and objective data analysis, is by the very definition arbitrary and capricious.

    That’s not even getting into the accuracy of having only very few data points at your disposal.

    All in all, as far as the ability to accurately predict, without “…talking with someone…”, a properly written and trained machine learning algorithm can beat any human hands down, in addition to producing more consistent — and thus equitable — results.

    A human can then take these computed likelihoods and confidence intervals, go ahead and do their interview and all that, and then make a judgement off of the combination.

    There’s nothing to be gained by making one channel of information noisy.

    One might make an analogy to insurance. Would an insurance company be better off hiring quants to plug empirical data into a probability model, or to have a whole bunch of people set individual rates for people based on their “gut”? Clearly the former. The latter would both be more expensive and more risky.

  11. andsoitgoes says:

    oh, I agree with you. For all the good those results do, you can do your assessment in a kiosk or online at home or, if homeless, at the library, and get the results back in an email along with the recommended programming. The disclaimer is that the results are not accurate for mental health, chronic alcohol and drug issues, sex offenses, and personality disorders – most criminal behavior involves one or more of these. The tool, right off the bat, applies to few, but is used for all anyway. But regardless, offenders can also be emailed that programming/treatment and email their work back to the treatment-completed inbox. A software program is really very, very cheap and efficient. All you need is the right software. I am sure our present administration can come up with a privateer that can provide those services after checking the donations list. Offenders can complete modules of recommended programming from their smart phone as they case the neighborhood. That will work, right? I don’t think so, any more than what we are doing now, but it will be a whole lot cheaper. But what really engages people and moves them toward pro-social and away from anti-social behavior? Better software? Echo? Siri? More data?

  12. happyjack27 says:

    Me thinks we need more data to answer that question. hehe.

    Not sure we’re talking about the same thing.

    The software i’m talking about is like a medical history + risk factor analysis based on history, except for crime.

    So think of a medical history. fill in stuff that’s missing, such as alcohol abuse, personality disorders, allergies, previous surgeries, etc. and the doctor can fill stuff in too. And out comes your risk profile for malaria, cancer, dementia, etc.

    computers are already better at finding cancer than your doctor:

    And then the doctor can take this risk profile under advisement when considering treatment options, etc.

  13. Matine says:

    The problem isn’t about the cost of gathering history, looking at records, and feeding hard data into an algorithm. The problem is the bias inherent in relevancy of the data gathered. You can see a sample test here=>

    Here are questions that are related to poverty: how often have you moved in the last 12 months; how many of your friends/acquaintances have ever been arrested; in your neighborhood, have some of your friends or family been crime victims. Other questions are influenced by racial bias that may have impacted the respondent: have you ever been expelled from school. Some questions just don’t make any sense to me as factors that signal future crime behavior: how often do you feel bored; do you feel discouraged at times. The test is inherently racially biased and biased against the poor and its questions are not well thought out.

  14. happyjack27 says:

    Matine, most of those highlighted questions would pose no issue for a machine learning algorithm.

    #4 obviously is problematic. It’s a non-first-person subjective question. Still, a machine learning algorithm would probably learn to ignore it after finding little correlation. So it still wouldn’t be problematic except for being a waste of space. (Since the algorithm just throws the answer away.)

    The rest of those are self-assessment, so they aren’t problematic.

    A machine learning algorithm doesn’t care about the “accuracy” of the answers, only how well they’re correlated with other things. So even if a person self-assesses inaccurately — even if many people in a group do — that doesn’t matter to the algorithm. It can still learn something from that data point, even if it’s only that a certain group consistently mis-assesses.

    Again, “accuracy” is not important for a machine learning algorithm, only correlation. You want a broad spectrum of input data that all has a high correlation to the variables you’re trying to predict.

    The ideal input data for a machine learning algorithm has:

    * Maximum mutual information between independent variables and dependent variables
    * Maximum entropy among independent variables (in other words, the questions are all very different)

    This gives you maximum information about the dependent variables.
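
    Both properties can be measured directly from data. A minimal sketch: empirical mutual information, in bits, between one questionnaire item and the outcome variable (the toy answer/outcome lists in the test are made up):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete variables,
    e.g. answers to one questionnaire item (xs) and the re-arrest outcome
    (ys). Items with near-zero MI carry no predictive signal."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with counts normalized by n.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

    A perfectly informative binary item yields 1.0 bit on balanced data; an item unrelated to the outcome yields roughly 0, which is how a learner ends up ignoring it.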
