The Myth of “Raw” Data

Previously I wrote about the allure of big data. Now I turn to the question of “raw” data. Is there such a thing or is it a myth, an oxymoron — like “jumbo shrimp” or “just one episode on Netflix”?

Why do we cling to this notion of raw data if it doesn’t exist?

I recently read “Raw Data” Is an Oxymoron (2013, edited by Lisa Gitelman), a fascinating book that turns “raw” data on its head in just about every way, looking both back in time and across disciplines. Just listen to this mind-blowing sentence from the introduction: “Indeed, the seemingly indispensable misperception that data are ever raw seems to be one way in which data are forever contextualized — that is, framed — according to a mythology of their own supposed decontextualization” (p. 6). Basically, the moment we think of data as “raw,” we are already framing and molding them to fit our preconceptions of rawness.

Data: from Latin to English

One of my favorite chapters in this edited volume is a historical and linguistic overview of the word “data.” When faced with any dauntingly broad topic, I always love homing in on the word itself to get some definitional and etymological clarity (remember, etymology=words, entomology=bugs). So I just ate up Daniel Rosenberg’s chapter, “Data Before the Fact”; the following sections are my summary of this chapter.

Rosenberg traces the word “data” from Latin into English, a “naturalization” process that occurred during the 1700s. (I suppose that if people can be naturalized when they change citizenship, so too can words when they take up residence in a new language.) “Data” comes from the Latin verb “dare,” to give, so right off the bat we have this inclination to think of data as “a given.” The common Latin phrase “data desuper” means “given from above.”

Indeed, the early English-language instances of the word “data” were primarily in the context of theology and mathematics. Data were either given from above and therefore not questioned, or they were given as a set of assumptions before starting a mathematical proof. Either way, data were something you started with, something everyone agreed was “beyond argument.”

Data: from given to gotten

By the 1800s, English-language usage of the word “data” had begun to shift away from something given to something obtained. Specifically, data came to be thought of as something gained through empirical observation and experimentation. This latter connotation is closer to what we have today: even if we think of data as raw, we do tend to think of it as something that you get or collect from out in the world.

Rosenberg made these observations by searching a large collection of texts called Eighteenth Century Collections Online. Although Google Ngram was not available at the time of his research, he also discusses using it for these types of queries. Just for fun, I tried a Google Ngram of the words “data,” “fact,” and “evidence” from 1800 to 2000. Here’s what that looks like:

I have to note the irony of the Google Ngram page footer: “Run your own experiment! Raw data is available for download here.”

Data: from plural to “mass” noun

Returning to the book’s introduction: it explains how data are inherently “aggregative.” That is, we tend to think of data in herds rather than as solo animals. And here I was sadly robbed of what I thought was one of my solidly “smarty pants” moves. I used to pride myself on correctly treating “data” as a plural noun: i.e., “data are” versus “datum is.” But now I understand that it’s about as common to say “data is” as “data are,” and bright folks like Steven Pinker are telling us to wake up and smell the mass noun (p. 19). This “massness” of data is broader than just a grammatical issue, however. I tie it back to the concept of big data: data are (is? ack!) powerful in the aggregate. A datum kind of presupposes a horde of like-minded data, such that we don’t pay much attention to an individual datum or data point.

Data: rawness is relative

My experience with the idea of “raw” data is that it’s all relative. In the genetics data coordinating center where I work, raw data are the genetic data (the A’s, C’s, T’s and G’s) that we get from the genotyping lab. When I’ve talked to people who work in genotyping labs, they say, “Oh no, the raw data is what comes off the machine” (the “machine” here being a genotyping or sequencing machine). It seems that data are raw when they first come into our possession, or at least that’s a convenient way to think about it. It’s similar to going to the grocery store: the raw produce is in the bins, and it’s what you take home to chop up and cook. Rawness may be relative in practice, but in absolute terms, Gitelman and the book’s contributing authors would remind us, it’s elusive!

1 Comment

  1. […] Today’s society has a borderline morbid fascination with big data, which I’ve also written about previously in “Big Data, Big Deal?”, and you can see how the dominant metaphors perpetuate this fascination.  A particularly problematic metaphor in my mind is that of data as a natural resource that should be mined, extracted, and purified. In this construct, data are commodified and spatialized. Just think of all the untapped reserves of “raw” data waiting for the boldest and most pioneering person to tap into: data logged daily by our smartphones, our Facebook profiles, and even our very bodies. In this metaphor, data become pre-factual and given, rather than contextual and imagined (whereas in actuality you have to conceive of something as a data point before you collect it — aha, even there,  I did it: “collect data” as if I was picking wild huckleberries on a mountainside…which I recently did, incidentally). But full circle back to etymology: the very word “data” is from the Latin verb for “to give”….so it’s not totally our fault that it’s easy to take data as “a given.” (More on other cool things you can learn about the word “data” in my earlier post.) […]
