Death: Reality vs Reported


Site and writeup by Owen Shen.
Data collection and analysis by Hasan Al-Jamaly, Maximillian Siemers, Owen Shen, and Nicole Stone.

Background:

How do people die?

How do people think we die?

And is there a difference?

Well, it turns out there's a fascinating study conducted by Paul Slovic and Barbara Combs where they looked at how often different types of deaths were mentioned in the news. They then compared the frequency of news coverage with the actual frequency of people who died for each cause.

The results are what one might cynically expect:

"Although all diseases claim almost 1,000 times as many lives as do homicides, there were about three times as many articles about homicides than about all diseases. Furthermore, homicide articles tended to be more than twice as long as articles reporting deaths from diseases and accidents."

Since 1979, when the original Combs and Slovic study was conducted, there have been several more empirical analyses which have found largely similar results. (Notably, here and here)

For our final capstone project for the fantastic Bradley Voytek's COGS 108 course at UCSD, we thought it would be interesting for us to have our own go at examining potential disparities between actual deaths and their corresponding media attention.

For anyone curious about any of the steps throughout this project, the original data and code we used to do all this analysis are available here on GitHub.


Data: The Gathering

For our project, we looked at four sources:

  1. The Centers for Disease Control and Prevention’s WONDER database for public health data (1999-2016).
  2. Google Trends search volume (2004-2016).
  3. The Guardian’s article database.
  4. The New York Times’ article database.

For all of the above data, we looked at the top 10 largest causes of mortality, as well as terrorism, overdoses, and homicides, three other causes of death which we believe receive a lot of media attention.

In all the charts below, we’ve normalized their value by dividing by the sum of all values for that year. Thus, the values given represent their relative share, rather than absolute counts. This is mainly to make comparisons between distributions easier, as what we really care about here is the proportionality in representation across different sources.
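
As a minimal sketch of the normalization described above (the function name and the counts are illustrative assumptions, not the project's actual code or data), each value is divided by the sum of all values for that year:

```python
def normalize_shares(counts):
    """Convert raw per-cause counts for one year into relative shares."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize: all counts are zero")
    return {cause: value / total for cause, value in counts.items()}


# Hypothetical counts for a single year (illustrative, not the project's data).
example_counts = {"heart disease": 600, "cancer": 550, "homicide": 50}
example_shares = normalize_shares(example_counts)

for cause, share in example_shares.items():
    print(f"{cause}: {share:.3f}")
```

Because every source is rescaled to sum to 1 within a year, a death count, a search volume, and an article count can be compared on the same axis.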

First off, as our “ground truth”, we’ll look at the causes of mortality as given by the CDC.


Immediately, we can see that cancer and heart disease take up a major chunk of all deaths, each responsible for around 30% of the total death count. On the graph, everything is visible except for terrorism, which is so small it doesn’t show up unless we zoom in. (You can do this by clicking on different causes in the legend to “strike them out” from the graph.)

Next, here’s the Google Trends data. (Because Google Trends didn’t start until 2004, we alas aren’t able to explore search data from 1999-2003.)


The two major changes seem to be that heart disease is underrepresented and terrorism is very much overrepresented. Suicide also appears to have several times the relative share here that it has in the actual death data. The rest of the causes look like they’re within the same order of magnitude as the CDC data.

Now here’s the data for The Guardian and The New York Times. We put them both below as they appear quite similar. (We’ll be able to quantify the degree of similarity in the next section.)


Here, we see that terrorism, cancer, and homicides are the causes of death most often mentioned in the newspapers. Though the share that cancer occupies seems largely proportional, homicides and terrorism both appear grossly overrepresented relative to their respective shares of total deaths.

Finally, here’s all of the above data presented in one graph, so we can see them side-by-side:



Data Analysis

After our cursory glance at the data, we have reason to think that the distributions given to these different causes of death for each source (CDC, Google Trends, The Guardian, and The NYT) are not in fact the same.

To examine whether or not these distributions are the same, we’ll use a 𝛘² (chi-squared) test for homogeneity, which can tell us whether a categorical variable (here, cause of death) is distributed the same way in two groups.
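
For readers who want to see the mechanics, here is a minimal pure-Python sketch of the test statistic for a homogeneity test. The function name and the counts are hypothetical, and this is not necessarily how the project's own code computes it; it assumes every category has a nonzero total across the sources.

```python
def chi_squared_homogeneity(table):
    """Return (statistic, degrees_of_freedom) for a table of counts.

    `table` is a list of rows, one per source; each row holds the count
    for every category, in the same order. Assumes no category has a
    total of zero across all sources.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both sources shared one distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts for three causes in two sources (not the project's data).
source_a = [300, 280, 10]
source_b = [30, 170, 250]
stat, dof = chi_squared_homogeneity([source_a, source_b])
print(f"chi-squared = {stat:.3f}, dof = {dof}")
```

A p-value can then be read off the chi-squared distribution with `dof` degrees of freedom, for instance with `scipy.stats.chi2.sf(stat, dof)` if SciPy is installed; `scipy.stats.chi2_contingency` performs the whole calculation in one call.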

We’ll run 𝛘² tests with these four pairings of our data:

  1. CDC and Google Trends
  2. CDC and The Guardian
  3. CDC and The New York Times
  4. The Guardian and The New York Times

Here are the results:

| Data Compared | 𝛘² Test Statistic | p-value |
| --- | --- | --- |
| CDC and Google Trends | 49.242 | 1.897×10⁻⁶ |
| CDC and The Guardian | 1198.758 | 3.205×10⁻²⁴⁹ |
| CDC and The NYT | 1204.499 | 1.860×10⁻²⁵⁰ |
| The Guardian and The NYT | 0.056 | 0.999 |

As we guessed, the 𝛘² values for tests 1-3 are indeed quite high. Especially for tests 2 and 3, the p-value is incredibly low, meaning that we would basically never expect to see results like these if our null hypothesis (that the newspaper’s categorical distribution matches the CDC’s) were true.

We can also see that the comparison between The NYT and The Guardian has a very low 𝛘² value and a p-value near 1, so we cannot reject the hypothesis that they came from the same distribution. So now we have evidence that our two media sources are roughly similar, and that their distribution differs from how causes of death actually affect the population.

During our preliminary graphing of the data, we noted that terrorism and homicides appeared overrepresented in the news data, and that heart disease appeared underrepresented. Below, we’ve listed the factor of difference in representation between the CDC data and the newspaper data for the 13 causes of death.

(For the Factor of Difference column, we took the larger value of Avg Deaths Proportion/Avg Newspaper Proportion and Avg Newspaper Proportion/Avg Deaths Proportion and added "Over" or "Under" to denote whether this value was over or underrepresented relative to the Avg Deaths Proportion value.)

| Cause of Death | Avg Deaths Proportion | Avg Newspaper Proportion | Factor of Difference |
| --- | --- | --- | --- |
| Alzheimer's Disease | 0.036 | 0.009 | 4.172 Under |
| Cancer | 0.279 | 0.171 | 1.631 Under |
| Car Accidents | 0.057 | 0.025 | 2.285 Under |
| Diabetes | 0.035 | 0.028 | 1.260 Under |
| Heart Disease | 0.305 | 0.029 | 10.388 Under |
| Homicide | 0.008 | 0.251 | 30.796 Over |
| Kidney Disease | 0.023 | 0.002 | 10.793 Under |
| Lower Respiratory Disease | 0.064 | 0.018 | 3.520 Under |
| Overdose | 0.014 | 0.002 | 7.143 Under |
| Pneumonia & Influenza | 0.028 | 0.041 | 1.486 Over |
| Stroke | 0.053 | 0.059 | 1.119 Over |
| Suicide | 0.017 | 0.118 | 6.878 Over |
| Terrorism | 0.000 | 0.306 | 3906.304 Over |
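
The Factor of Difference rule described above can be sketched as follows. The function name is an assumption for illustration, and the inputs are the rounded Heart Disease proportions from the table.

```python
def factor_of_difference(deaths_share, news_share):
    """Return (factor, label), where factor is the larger of the two ratios.

    label is "Over" when the newspaper share is at least the deaths share
    and "Under" otherwise. Both shares must be positive.
    """
    if deaths_share <= 0 or news_share <= 0:
        raise ValueError("both shares must be positive")
    if news_share >= deaths_share:
        return news_share / deaths_share, "Over"
    return deaths_share / news_share, "Under"


# Rounded Heart Disease proportions taken from the table above.
factor, label = factor_of_difference(0.305, 0.029)
print(f"Heart Disease: {factor:.3f} {label}")
```

Feeding in the three-decimal proportions from the table gives factors that differ slightly from the listed ones, presumably because the listed factors were computed from unrounded values. Terrorism, whose deaths proportion rounds to 0.000, cannot be reproduced from the rounded table at all.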

Here's a graphical representation of the Avg Newspaper Proportion/Avg Deaths Proportion factors. (Note that the y-axis is log-scaled.)

The most striking disparities here are those for kidney disease, heart disease, terrorism, and homicide. Kidney disease and heart disease are both about 10 times underrepresented in the news, while homicide is about 31 times overrepresented, and terrorism is a whopping 3,900 times overrepresented. Kidney disease is a little surprising; we had guessed at the other three, but it was only by calculating the factor here that this disparity became visible.


Conclusion

We set out to see if the public attention given to causes of death was similar to the actual distribution of deaths. After looking at our data, we found that, as in earlier studies, the attention given by news outlets and Google searches does not match the actual distribution of deaths.

This suggests that general public sentiment is not well-calibrated with the ways that people actually die. Heart disease and kidney disease appear largely underrepresented in the sphere of public attention, while terrorism and homicides capture a far larger share, relative to their share of deaths caused.

Though we have shown a disparity between attention and reality, we caution against drawing immediate conclusions for policy. One major issue we have failed to address here is that of tractability; just because a cause of death claims more lives does not mean that it is easily addressable.

A more nuanced look at which causes of mortality to prioritize would likely require something like an evaluation framework.


Full Disclosure

Throughout the course of this project, we engaged in several, shall we say, questionable, methodological conveniences to make the analysis easier on us. These transgressions would likely doom us to the third circle of Statistics Hell—not as bad as p-hacking, but definitely worse than failing to preregister. Thus, to keep our consciences clean, we present to you:

Statistical Sins We Committed:

  1. The article search APIs returned a list of all articles which contained the word anywhere (headline or body). Though we originally wanted to look just at headlines, filtering for titles proved unwieldy, so we ended up just grabbing the raw number of hits anywhere. This is a potential confounder in our analysis, especially as some words like “stroke” have multiple usages; it also means that our news data isn't exactly representative of media hype.
  2. Also, for the article search, we searched for different synonyms and added them up for our categories, as certain words, e.g. “murder”, have roughly the same meaning as our initial search terms, e.g. “homicide”, and we wanted to take this into account. However, this might have led to unequal coverage of different topics, as certain words had more synonyms than others. For example, we used hits from “heart disease”, “heart failure”, and “cardiovascular disease” to account for the heart disease category, but only “Alzheimer’s” for the “Alzheimer’s Disease” category.
  3. My understanding is that a 𝛘² test is typically applied to counts of categorical data where the categories are mutually exclusive; that’s a dubious assumption here, as several keywords, e.g. “homicide” and “terrorism”, might be mentioned in the same article. So there’s definitely some double-counting going on here, which muddies our analysis.
  4. Also, for the 𝛘² test, we used the average counts across all years, rather than running pairwise tests year-by-year. This could prove problematic because, if the underlying distribution differs from year to year, our comparisons using just the average might not be totally valid.
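
To make the second sin concrete, here is a rough sketch of how synonym hits might be summed into categories. The search terms are the ones named in item 2; the hit counts and the function name are made up for illustration and are not the project's data or code.

```python
# Search terms per category, using the synonyms mentioned in item 2.
CATEGORY_TERMS = {
    "Heart Disease": ["heart disease", "heart failure", "cardiovascular disease"],
    "Alzheimer's Disease": ["Alzheimer's"],
    "Homicide": ["homicide", "murder"],
}


def hits_per_category(term_hits, category_terms):
    """Sum the article hits of every search term belonging to a category."""
    return {
        category: sum(term_hits.get(term, 0) for term in terms)
        for category, terms in category_terms.items()
    }


# Hypothetical article hit counts per search term (not real API results).
hypothetical_hits = {
    "heart disease": 120,
    "heart failure": 80,
    "cardiovascular disease": 40,
    "Alzheimer's": 90,
    "homicide": 150,
    "murder": 600,
}

category_totals = hits_per_category(hypothetical_hits, CATEGORY_TERMS)
for category, total in category_totals.items():
    print(f"{category}: {total}")
```

A category with more synonyms gets more chances to accumulate hits, which is exactly the unequal-coverage concern raised above. An article matching two terms is also counted twice, which ties into item 3.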