August 07, 2010
"We often say, rightly, that literacy is crucial to public life: If you can't write, you can't think. The same is now true in math. Statistics is the new grammar."
—Clive Thompson, May 2010
As a child I used to wonder if it was possible to read everything that was printed in the world, as it was published. I would imagine getting up in the morning, reading the New York Times, the Wall Street Journal, the Washington Post, the Chicago Tribune, the newspapers in Canada, Japan, Russia, China, etc., then moving on to any magazines that were published that week, then the works of fiction, the non-fiction, the academic journals, the reference books, and I would realize pretty quickly that it wasn't possible. But for some reason this imaginary game has stayed with me for years. If you gathered a group of people, how many would it take to read everything? There must be fewer writers than readers, so how many people would you need? One percent of the world's population? Two percent? More?
Now that the Web is mainstream and there are millions of authors publishing every day, keeping up is even more daunting. And what about data? There are datasets out there, generated by automated sensors, that record billions or trillions of numbers per day, more each day than one could consume in a lifetime. In fact, of the many zettabytes of data that now exist, only a small portion has been or ever will be read by a human.
Fortunately we don't need to read all the world's newspapers or datasets to get by in life. Through some crude process, part intuitive and part rational, we choose to read what we guess is relevant to us. These choices have a profound impact on how we see, think, and feel about the world. American newspapers paint a very different picture of the world than do Chinese newspapers. Liberal magazines have different values than conservative magazines. The sports section is different from the financial section, and fiction is different from documentary.
We absorb media because it is relevant to what we are doing (or because it is entertaining, but then it is still usually relevant to our lives). A financial analyst will tend to read financial publications and watch financial news reports. A sports writer will read sports news and digest tables of sports data. Most people, including world leaders, read things that help them solve the problems they are working with.
Even if we can find sources of information that we like, we cannot trust them entirely. Writing is still done by humans (well, mostly), and humans make mistakes. Humans also have biases and agendas and they may occasionally try to convince us of something, subtly or overtly. We have to interpret the "facts," as told by our news sources, and try to decide whether they are true, based on our experience and all the other information (also subject to interpretation) that we have received during our lives. This is not an easy task, and it is one at which we are becoming increasingly lazy.
It's hard enough to discern fact from almost-fact when we're reading words. In school we're trained to read words critically, but not numbers, and numbers are getting more and more prevalent in everyday discourse, often given extra authority by use of the word "statistics." Except for those of us who are trained statisticians, we are not equipped to understand statistics. I include myself in this. There are simply too many ways to manipulate (and innocently misunderstand) numbers for any small table or one-liner published in a newspaper to be at all meaningful, and yet we tend to believe, blindly, because there are numbers, and because somehow numbers equal truth, or at least some sense of the truth.
But it's not just that the vast majority of us lack the numerical familiarity required for fluency. The dangerous part is that we don't realize this. We think we get it. Numbers seem simple. Most of us can do basic arithmetic, many of us remember some algebra, and some of us recall a little calculus, but this is very different from an understanding of statistics and aggregation, which is what's needed to evaluate statements like these:
- "The world's climate has been warming for the past 80 years."
- "The world's climate has been cooling for the past 10 years."
- "The earth has another 20 years of crude oil remaining."
- "The earth has another 50 years of crude oil remaining."
These statements are powerful. They influence political policy, the direction of research and industry, and even our personal values and lifestyles.
Dr. Albert A. Bartlett (University of Colorado) goes as far as to say that "the greatest shortcoming of the human race is our inability to understand the exponential function." His famous lecture on simple arithmetic is easy to understand but is virtually a revelation for most people (I highly encourage you to watch at least the first 30 minutes). We know how to add and multiply, but we don't really understand how real world quantities grow and accumulate.
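Bartlett's point can be made concrete in a few lines. This sketch (my own illustrative rates, not numbers from his lecture) computes the doubling time implied by steady percentage growth, which is what the familiar "rule of 70" approximates:

```python
import math

def years_to_double(annual_growth_pct):
    """Years until a quantity doubles at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_pct / 100)

# Even "modest" growth compounds fast: 7% per year doubles in about a decade,
# and 1% per year doubles in about 70 years.
for rate in (1, 2, 7):
    print(f"{rate}% per year doubles in {years_to_double(rate):.1f} years")
```

This is the intuition most of us lack: a quantity growing at a fixed percentage doesn't creep upward, it doubles, then doubles again, on a fixed clock.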
In addition to exponential growth, there are statistical concepts like sample size, standard deviation, and confidence intervals which are prerequisites to understanding and reading numbers critically. And these are just the mathematical problems.
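To see why sample size and spread matter, here is a minimal sketch of a 95% confidence interval for a mean, using the normal approximation (the samples are invented for illustration):

```python
import math
import statistics

def confidence_interval_95(sample):
    """Approximate 95% CI for the mean, via the normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return (mean - 1.96 * sem, mean + 1.96 * sem)

# Two samples with the SAME mean but different spread and size tell
# very different stories -- a single reported average hides all of this.
tight = [9.8, 10.1, 9.9, 10.2, 10.0, 9.9, 10.1, 10.0]
loose = [4.0, 16.0, 7.0, 13.0]
print(confidence_interval_95(tight))  # narrow interval
print(confidence_interval_95(loose))  # wide interval
```

A news blurb that reports "the average is 10" for either sample is technically true and practically useless, which is exactly the problem.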
There are also data collection and filtering problems. Temple University's John Allen Paulos recently published an article in the New York Times Magazine which gives some very good examples of how drastically data handling affects published numbers. How data is gathered, cleaned, and categorized is critical to any statistical study, and the details are rarely (if ever) provided by increasingly short news blurbs about the latest hot button issues.
Along with today's onslaught of numbers come the accompanying charts and graphs: tools for visualizing relationships among numbers. Thanks to more accessible graphing software, a growing number of authors include graphs with their work, and some make graphs their work. Among this latter group is David McCandless, who runs a data visualization blog called Information is Beautiful. McCandless calls himself "an independent data journalist and information designer" and he has a very nice aesthetic sensibility. Most of the graphs on his site (created both by himself and others) are quite attractive, and they have appeared in reputable sources like The Guardian as well as in his two books of data graphics.
He also likes circles.
I like circles too, and I think most of us do. They're simple and yet powerful. For a given perimeter, the circle encloses the most area of any shape, which makes it efficient, and it can be visually strong or soft, depending on context, so it's also quite versatile.
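That efficiency claim is the isoperimetric property, and it's easy to check numerically. A quick sketch (my own example, comparing a circle against a square of the same perimeter):

```python
import math

perimeter = 100.0

# Square with this perimeter: side 25, area 625.
square_area = (perimeter / 4) ** 2

# Circle with this circumference: radius P / (2*pi), area P^2 / (4*pi),
# which works out to about 795.8 -- more area for the same boundary.
radius = perimeter / (2 * math.pi)
circle_area = math.pi * radius ** 2

print(circle_area > square_area)  # True
```

The circle beats the square here, and it beats every other shape the same way.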
But the circle is not suitable for visualizing data, at least not in the manner of the current data graphics fashion, of which McCandless is a part. Let me illustrate with one of his recent charts from The Guardian (I'm going to pick on McCandless here, but he is only part of a larger trend):
Let's look at the left-hand column first, which shows total amount of money contributed by each country. Quickly now, look at the circles and answer these questions about the data:
A. Did the US give twice as much as Canada?
B. Did Canada give twice as much as Spain?
C. Did Spain give twice as much as Germany?
Let's look at each of these questions more closely:
A: No, the US did not give twice as much as Canada. It gave just 28% more, yet the US's circle on McCandless's chart is more than three times as large as Canada's. When I say "larger" I'm talking about geometric area, which is presumably how you're supposed to read the chart, though you could also read the diameter, which arguably makes more sense given that we're encoding one-dimensional numbers. In any case, charts like this rarely come with reading instructions. Here are the areas and diameters of all the circles:
Left-Hand Column Circle Data
It's obvious that the quantities being depicted don't really align with the sizes of the circles. The problem is shown clearly in the Pixels per $1m column in the table. If you want to build a readable graph, your scale absolutely cannot change: a shifting scale undermines the entire premise of data visualization.
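The two candidate encodings, and the mistake, can be spelled out in code. This sketch uses the 28% US/Canada gap from above as normalized values (not the chart's actual pixel measurements):

```python
import math

us, canada = 1.28, 1.00  # the US gave 28% more than Canada

# Correct: make AREA proportional to value, so the visual ratio
# between two circles equals the data ratio.
def radius_for(value):
    return math.sqrt(value / math.pi)

area_ratio = (math.pi * radius_for(us) ** 2) / (math.pi * radius_for(canada) ** 2)
# area_ratio == 1.28, exactly the data ratio

# Common mistake: make DIAMETER proportional to value. The area then
# grows with the SQUARE of the value, exaggerating every difference.
diameter_encoded_area_ratio = (us / canada) ** 2  # about 1.64

print(area_ratio, diameter_encoded_area_ratio)
```

Note that McCandless's circles match neither encoding: the US circle has over three times Canada's area, so the scale isn't merely mis-chosen, it shifts from one circle to the next.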
B: Canada actually gave nearly three times as much as Spain. The area of its circle is just over three times that of Spain's, so this one happens to be fairly close to scale, but I doubt many people can judge that difference by eye anyway. (Can you?)
C: If you got this one correct you're very sharp. You're part of the minority who can compare geometric areas accurately. Spain gave 2.25 times as much as Germany and its circle is 1.95 times as large, which is not correct, but sort of close. Again, not many people are capable of making this distinction. To most (myself included) Spain's circle looks only slightly larger than Germany's. At right is an illustration of the problem.
It doesn't get much easier when the circles are larger and closer together: it's also hard to tell that the U.S.'s circle is three times as large as Canada's. Even if you have some experience comparing the areas of circles (considering two dimensions at once) it's still not nearly as natural as comparing the lengths of lines (one dimension).
If we now look at the second and third columns in McCandless's chart ("Most giving people?" and "Most generous?") we see the same circles as in the first column. This seems suspicious even before closer examination (three different data sets with the same ratios among the top ten values?). A quick look at the numbers in the third column should convince you that these circles are at least as inaccurate as those in the first column. Guyana gave .088% of its GDP, while Ghana gave .018%. That is nearly five times as much, yet Guyana's circle, as seen above, is just over three times as large as Ghana's. The rest of the circles are no better, and the scale is even more inconsistent than that of the left-hand column. In effect, the U.S.'s contribution in the first column is overstated and Guyana's in the third is understated. However, the entire chart is so misleading that this seems like a moot point.
To summarize, there are two major problems here:
- A lack of consistent scale in the shapes that represent the numbers.
- A visualization device that hurts our ability to recognize problem #1.
Problem #1 is bad, and #2 is what is enabling its persistence in many graphics produced today.
In case you're curious, here are bar graphs of the data in the first and third columns. Note how much easier it is to compare the lengths of the bars than it is to compare the areas of the circles. Also note how different the data looks in these graphs compared to McCandless's.
"Most cash?" (millions of dollars)
"Most generous?" (percentage of GDP)
The area perception problem is not unique to circles. The chart at right, using rectangles, was posted on Gizmodo and provoked a good discussion in the comments. Several readers point out the lack of consistent scale, and one goes as far as redrawing the chart with a consistent scale. The result looks quite different, which makes the reader's point, though the redrawn chart is not much more readable or useful.
My favorite comment is from a user named Merricat on April 14, 2010, who redraws the chart as a bar graph and adds an additional data point (see below, right). It's not that Merricat's graph is accurate (I'm pretty sure it's not, nor is it trying to be), but in posting it he shows that the original chart uses a data set that has been highly curated to make a specific point (Google has a lot more servers than these other companies). This is where it's important to ask questions like:
- Does this data include all known, relevant companies?
- If not, why have these companies been selected?
- What does this particular comparison show?
I found this chart through a link at the O'Reilly Radar, the blog of technical publisher O'Reilly Media which has become an authority on technology and innovation over the past 30 years. They run a multitude of conferences and publish thousands of books, many of which I've purchased over the years because they tend to be the best on their subject. That O'Reilly's blog links to such a careless and misleading chart represents a real problem. It's one thing if David McCandless and Gizmodo are making attractive designs, adding some data, calling it journalism, and publishing it on their own web sites. It's another thing entirely that McCandless publishes regularly in The Guardian and that O'Reilly is buying Gizmodo's junk.
McCandless even says outright on his web site:
"I'm interested in how designed information can help us understand the world, cut through BS and reveal the hidden connections, patterns and stories underneath. Or, failing that, it can just look cool!"
If that's not a dead giveaway that a source shouldn't be trusted, I don't know what is. Imagine the same sentiment offered by a writer:
"I'm interested in how well-written articles can help us understand the world, cut through BS and reveal the hidden connections, patterns and stories underneath. Or, failing that, they can just sound good!"
If a writer said that, they would be considered an author of fiction. For some reason we don't have the same standards for data graphic authors.
The data graphics we read shape our thoughts and decisions just as much as the words we read. Their great expressive power increases the need for a high degree of literacy. Most of us are pretty good at spotting words which are, as McCandless says, BS, but we're not so good with graphs. It's hard enough to wade through the numerical thinking behind a study, harder still when that thinking is encoded in a poorly conceived illustration. It's a little like we've moved to a new country and don't yet know the language. We're picking up words here and there but not really learning the grammar, and that's a problem.
Veteran visualizer Edward Tufte says that graphs should "induce the viewer to think about the substance" of a graphic, "encourage the eye to compare different pieces of data," and "reveal the data at several levels of detail" (VDQI, p. 13). In other words, they should help us see things that we couldn't otherwise see. There is a real pleasure in reading a good chart, map, or graph. The best visualizations are the ones where you can immerse yourself in the data, making comparisons, learning, and developing a feel for the numbers almost effortlessly. You are drawn into the graph, like you are drawn into a good story.
Good graphs invite a multitude of thoughtful comparisons. Today's new wave of eye-candy charts often just say: "look how big that one is!" They are products of laziness, not rigor; sensationalism, not journalism.
The second chapter of Edward Tufte's third major book, Visual Explanations, is unforgettable. I don't know anyone who's read the book who can't immediately recall his tragic explanation of how the Challenger space shuttle explosion could have been avoided if proper charts and graphs had been used by those advising NASA not to launch on January 28, 1986. Tufte's evidence shows that the engineers arguing against the launch had all the information they needed to make a persuasive case, yet in all of their visual aids they failed to display a single chart that clearly related air temperature to o-ring damage. A simple plot of previous launch temperatures and the resulting o-ring damage (which Tufte provides) results in a graph so clear that even a child would have stopped the launch.
Whenever we read charts or graphs which present numbers we must think about the statistical properties of those numbers, as well as their source, and the author's motivation. Even without a deep knowledge of statistics one can spot misleading information simply by observing certain graphical practices I've described in this article:
Be skeptical of charts that use geometric area to represent quantities. The use of geometric areas doesn't always indicate bad reasoning underneath, but any such chart is confusing, potentially deceptive, and needs to be re-drawn so that it can be read by humans.
Be skeptical of charts that look particularly good. Authors should be spending time showing the data, not dressing it up. This does not mean that all good-looking charts are bad, nor that good charts should be ugly, only that visual design skill does not indicate authority on a subject.
Be skeptical of charts that don't show a lot of data. Lack of data often means an author is stretching to make a point, or has not found a lot of evidence to back up their claim.
Be skeptical of charts that don't cite a source or an author's name. Ideally they should also discuss how and when the data was gathered and processed.
Seven people died in the Challenger explosion. Countless others have died over the past 50 years in wars waged and policies based on statistical evidence. Government officials are persuaded by graphical presentations that their enemies' likelihood of attack is increasing, or that their destructive capability is growing. Citizens are convinced by similar "statistical" evidence that Country X will soon outnumber and overrun them, or that some threat within the populace is looming. Graphs, appropriately, have permeated the highest levels of dialog, where the conversations affect the greatest number of lives. When used carefully they are powerful tools of reasoning. When used ignorantly or nefariously they are powerful tools of deception.
Numbers are just words. Don't believe the hype.