

The Myth of “Raw” Data

Previously I wrote about the allure of big data. Now I turn to the question of “raw” data. Is there such a thing or is it a myth, an oxymoron — like “jumbo shrimp” or “just one episode on Netflix”?

Why do we cling to this notion of raw data if it doesn’t exist?

I recently read “Raw Data” Is an Oxymoron (2013, edited by Lisa Gitelman), which is a fascinating book that turns “raw” data on its head just about every which way, looking both back in time and across disciplines. Just listen to this mind-blowing sentence from the introduction: “Indeed, the seemingly indispensable misperception that data are ever raw seems to be one way in which data are forever contextualized — that is, framed — according to a mythology of their own supposed decontextualization” (pg 6). Basically, by thinking the thought of “raw data” we are already framing and molding it to our preconceptions of rawness.

Data: from Latin to English

One of my favorite chapters in this edited volume was a historical and linguistic overview of the word “data.” When faced with any dauntingly broad topic, I always love homing in on the word itself to get some definitional and etymological clarity (remember, etymology=words, entomology=bugs). So I just ate up Daniel Rosenberg’s chapter, “Data Before the Fact”; the following sections are my summary of this chapter.

Rosenberg traces the word “data” from Latin into English, a “naturalization” process that occurred during the 1700s. (I suppose that if people can be naturalized when they change citizenship, so too can words when they take up residence in a new language.) “Data” comes from the Latin verb “dare,” to give, so right off the bat we have this inclination to think of data as “a given.” The common Latin phrase “data desuper” means “given from above.”

Indeed, the early English language instances of the word “data” were primarily in the context of theology and mathematics. Data were either given from above and were therefore not questioned, or they were given as a set of assumptions before starting a mathematical proof. Either way, data were something you started with, something everyone mutually agreed were “beyond argument.”

Data: from given to gotten

By the 1800s, English-language usage of the word “data” had begun to shift away from something given to something obtained. Specifically, data came to be thought of as something gained through empirical observation and experimentation. This latter connotation is closer to what we have today: even if we think of data as raw, we do tend to think of it as something that you get or collect from out in the world.

Rosenberg made these observations by searching a large collection of texts called the Eighteenth-Century Collections Online. He also discusses using Google Ngram, which was not yet available at the time of his research, for these types of queries. Just for fun, I tried a Google Ngram of the words “data, fact, and evidence” from 1800 to 2000. Here’s what that looks like:

I have to note the irony of the Google Ngram page footer: “Run your own experiment! Raw data is available for download here.”
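For anyone who wants to take the footer up on its offer, here is a minimal sketch of how you might rebuild that plot from the downloadable raw files, assuming the tab-separated layout of the version-2 1-gram export (word, year, match_count, volume_count); the file names below are placeholders, and note that the Ngram Viewer itself also normalizes by the total word count per year.

```python
# A minimal sketch of replotting "data, fact, evidence" from the downloadable
# raw 1-gram files, assuming the tab-separated version-2 layout:
# word <TAB> year <TAB> match_count <TAB> volume_count.
# File names are hypothetical placeholders; the real files are very large.
import pandas as pd
import matplotlib.pyplot as plt

COLUMNS = ["word", "year", "match_count", "volume_count"]
WORDS = {"data", "fact", "evidence"}

def load_counts(path):
    """Read one raw 1-gram file and keep only the words of interest."""
    df = pd.read_csv(path, sep="\t", names=COLUMNS)
    return df[df["word"].isin(WORDS)]

# The export is split into many files (roughly by starting letter).
frames = [load_counts(p) for p in ["googlebooks-eng-1gram-d.tsv",
                                   "googlebooks-eng-1gram-e.tsv",
                                   "googlebooks-eng-1gram-f.tsv"]]
counts = pd.concat(frames)
counts = counts[(counts["year"] >= 1800) & (counts["year"] <= 2000)]

# One column per word, raw match counts over time. (The Ngram Viewer divides
# by total words per year, which lives in a separate "total counts" file.)
pivot = counts.pivot_table(index="year", columns="word",
                           values="match_count", aggfunc="sum")
pivot.plot(title="Raw match counts, 1800-2000")
plt.show()
```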

Data: from plural to “mass” noun

Returning to the book’s introduction: it explains how data are inherently “aggregative” — i.e., we tend to think of data in herds rather than as solo animals. And here I was sadly robbed of what I thought was one of my solidly “smarty pants” moves. I used to pride myself on correctly treating “data” as a plural noun: i.e., “data are” versus “datum is.” But now I understand that it’s about equally common to say “data is” versus “data are,” and bright folks like Steven Pinker are telling us to wake up and smell the mass noun (pg 19). This “massness” of data is broader than just a grammatical issue, however. I tie it back to the concept of big data: data are (is? ack!) powerful in the aggregate. Data kind of presupposes a horde of like-minded data, such that we don’t pay much attention to an individual datum/data point.

Data: rawness is relative              

My experience with the idea of “raw” data is that it’s all relative. In the genetics data coordinating center where I work, raw data are the genetic data (the A’s, C’s, T’s and G’s) that we get from the genotyping lab. When I’ve talked to people who work in genotyping labs, they say “Oh no, the raw data is what comes off the machine” (here the “machine” being a genotyping or sequencing machine). Seems like data are raw when they first come into our possession – at least that’s a convenient way to think about it. It’s similar to when you go to the grocery store: the raw produce is in the bins; it’s what you take home to chop up and cook. Rawness may be relative in practice, but in absolute terms – Gitelman and the book’s contributing authors would remind us – it’s elusive!

My CV and a decade of evolving DNA genotyping technologies

Lately I’ve been describing myself as “having over a decade of experience in human genetics research,” which makes me feel rather old (I recognize that older people will scoff at this and younger people will smirk and nod). Nevertheless, it’s true: I started working in human genetics research at Duke University right after finishing my undergraduate degree, in the winter of 2005. In the summer of 2009 I moved to a genetics research group at the University of Washington, where I still work.

The developments I’ve witnessed in just this relatively short time demonstrate how quickly the field of genetics research has been changing. This is no coincidence, as one reason I was drawn to genetics was the promise that I wouldn’t be stuck doing the same thing my whole career. And it’s proven true thus far, because my job path has been partly shaped by different phases or waves of genotyping technology. Each is a way of looking at DNA – to tell which of the chemical bases A, C, T, and G exists at specific places in the human genome. I’ll walk you down memory lane, stopping at three signposts along the way to discuss these different technologies. But I’ll also start and end with a detour….

Detour 1: One summer during my undergraduate

One summer at UNC-Greensboro, where I was doing my undergrad, my genetics professor hired me to do a small summer project. It was only a few hours a week and a little bit of money, but I was thrilled to have something to supplement my job at Panera. The task was to use a software program to help design bits of DNA that can be used to genotype single genetic variants. These bits of DNA are called primers, and they are typically made up of ~20 to 30 DNA bases located near the variant of interest. The primers bind to the nearby DNA and are then used to make many copies of it so that it’s easier to measure the variant. This whole process is called polymerase chain reaction, or PCR, and it basically launched modern biotechnology.
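Just to make the idea concrete, here is a toy sketch of naive primer picking: slide a window over the flanking sequence and keep candidates with reasonable GC content and a rule-of-thumb melting temperature. This is only an illustration, not the program I used, and real primer-design tools (Primer3, for example) check much more: hairpins, primer-dimers, genome-wide specificity, and so on.

```python
# Toy sketch of naive primer candidate selection, assuming we already have a
# flanking sequence upstream of the variant. Real primer-design software also
# checks hairpins, dimers, and specificity against the whole genome.

def gc_content(seq):
    """Fraction of G/C bases in a primer candidate."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Rule-of-thumb melting temperature: 2*(A+T) + 4*(G+C) degrees C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def candidate_primers(flank, min_len=20, max_len=25):
    """Slide windows over the flank and keep candidates with sane GC and Tm."""
    picks = []
    for length in range(min_len, max_len + 1):
        for start in range(len(flank) - length + 1):
            primer = flank[start:start + length]
            if 0.4 <= gc_content(primer) <= 0.6 and 55 <= wallace_tm(primer) <= 65:
                picks.append((primer, wallace_tm(primer)))
    return picks

# Hypothetical 40-base flank upstream of a variant of interest.
flank = "ATGCGTACGTTAGCCTAGGCATCGATCGGATCCATGCAAT"
for primer, tm in candidate_primers(flank)[:3]:
    print(primer, tm)
```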

I was tasked with designing primers for a few dozen variants that my professor and his collaborators wanted to study in relation to ADHD. So I was going into genetic databases, finding the flanking sequences, and then plugging them into this primer design program to find the optimal bits of DNA to use. This was the whole summer project. Now, I haven’t done this type of work since, but my guess is that current bioinformatics tools would enable one person to do that whole project in an hour. Maybe even 30 minutes, and still have time to get a coffee.

Phase 1: Single variants (Duke)

When I started working at Duke, things were pretty far along. We had PCR, and we had the Human Genome Project and thus a database of the complete human genome sequence. The projects I worked on initially were genotyping single variants at a time, via something called a TaqMan assay (“Taq” is a special type of the enzyme polymerase – yes, the same polymerase of PCR fame!). A single person working in the lab could push through a dozen or so TaqMan assays in a day, if they were wearing their headphones (with no music) just so other people wouldn’t bother them on the lab floor (I know for a fact this is done). This single-variant approach was pretty standard at the time. Before I left Duke, however, this trickle of genetic data was starting to turn into a babbling stream.
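To give a flavor of what a single-variant call looks like, here is a toy sketch of the idea behind TaqMan allelic discrimination: each sample yields two allele-specific fluorescence signals, and the genotype call depends on which signals rise above background. The thresholds and sample values are made up; real instruments cluster all the samples on a plate rather than applying fixed cutoffs.

```python
# Toy sketch of calling a single-variant genotype from two allele-specific
# fluorescence intensities (the idea behind a TaqMan allelic discrimination
# plot). Thresholds and sample values are invented for illustration.

def call_genotype(allele1_signal, allele2_signal, threshold=0.5):
    """Return 'AA', 'AB', 'BB', or 'no call' from normalized dye intensities."""
    a = allele1_signal > threshold
    b = allele2_signal > threshold
    if a and b:
        return "AB"   # both dyes fluoresce: heterozygote
    if a:
        return "AA"   # only the allele-1 dye: homozygote for allele 1
    if b:
        return "BB"   # only the allele-2 dye: homozygote for allele 2
    return "no call"  # neither signal rose above background

samples = {"sample1": (1.8, 0.1), "sample2": (1.2, 1.4), "sample3": (0.2, 0.1)}
for name, (a1, a2) in samples.items():
    print(name, call_genotype(a1, a2))
```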

Phase 2: Microarrays (Duke -> UW)

In the early 2000s, companies were starting to develop ways to multiplex these genotyping assays. Called microarrays, or DNA “chips,” these were small surfaces on which you could array hundreds of thousands (now millions) of genotyping experiments at once. One of the Duke projects I worked on did a microarray experiment during my last year there. I remember that it was too much data to go through our normal database process, so our senior programmer had to manually force it in. All of a sudden there were 300,000 more variants than before. Of course, then there’s the data cleaning, which was now required on a much larger scale. And that’s what brought me to UW….

I came to UW to work on a new set of projects initiated by the National Institutes of Health to look at gene-environment interactions in a series of complex human diseases. Each of these projects was using microarray technology, so they needed a lot of manpower and brainpower (and Sarah power!) to help do quality control and assurance for all that microarray genotyping data.
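As a taste of what that quality control involves, here is a minimal sketch of two standard per-variant checks, call rate and minor allele frequency, on a tiny genotype matrix. The variant names, thresholds, and data are illustrative only, not our center’s actual pipeline.

```python
# Minimal sketch of two standard per-variant QC checks on microarray genotypes:
# call rate (fraction of samples with a non-missing call) and minor allele
# frequency. Genotypes are coded 0/1/2 copies of the alternate allele, with
# NaN for missing. Variant names, data, and thresholds are illustrative.
import numpy as np
import pandas as pd

# Rows = samples, columns = variants (a real array has on the order of 1e6 columns).
genotypes = pd.DataFrame(
    {"rs0001": [0, 1, 2, 1, np.nan],
     "rs0002": [0, 0, 0, np.nan, np.nan],
     "rs0003": [2, 2, 1, 2, 2]},
    index=[f"sample{i}" for i in range(1, 6)],
)

call_rate = genotypes.notna().mean()        # per-variant fraction of non-missing calls
alt_freq = genotypes.mean() / 2             # alternate allele frequency (NaNs skipped)
maf = np.minimum(alt_freq, 1 - alt_freq)    # minor allele frequency

qc = pd.DataFrame({"call_rate": call_rate, "maf": maf})
qc["keep"] = (qc["call_rate"] >= 0.95) & (qc["maf"] >= 0.01)
print(qc)
```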

Phase 3: Sequencing (UW -> now)

While my work at UW is still primarily with microarray datasets, our center is starting to work more and more with DNA sequencing data. Recall that microarray experiments involve looking at a million or so pre-defined places in the genome. DNA sequencing, on the other hand, goes base by base to look at almost every site. Even though sequencing has gotten much faster and cheaper in the past few years, it’s still too pricey to be the de facto approach for every research project. But give it a few years and it will likely have supplanted DNA microarrays.
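For a sense of what sequencing output looks like before any analysis, here is a minimal sketch that walks through reads in FASTQ format (the standard four-lines-per-read text format that sequencers emit) and tallies base composition; the file name is a placeholder.

```python
# Minimal sketch of walking through raw sequencing reads in FASTQ format
# (four lines per read: header, bases, '+', per-base quality scores) and
# tallying base composition. The file name is a placeholder.
from collections import Counter
from itertools import islice

def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file."""
    with open(path) as handle:
        while True:
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            header, seq, _, qual = (line.rstrip("\n") for line in record)
            yield header, seq, qual

base_counts = Counter()
n_reads = 0
for header, seq, qual in read_fastq("reads.fastq"):   # placeholder path
    base_counts.update(seq)
    n_reads += 1

print(n_reads, "reads")
print(dict(base_counts))
```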

DNA sequencing readout from an automated sequencer. Image credit: NHGRI Image Gallery

Detour 2: Doby-Croc

When I first started dating my husband a few years back, there was some lore generated about what I did for a living. During a conversation I was not present for, my now-husband told his uncle that I worked in genetics, and inevitably their conversation ended with the decision that I should make the “Doby-Croc.” Half Doberman Pinscher, half crocodile, slogan “the ultimate in homeland security” (don’t tell Trump!). Clearly they envisioned me tinkering away at a lab bench with a white coat and safety goggles, bioengineering the species mash-ups of tomorrow. (Had I been there I would have headed off this misconception at the pass by clarifying that I work at a computer, in what otherwise looks like your typical office job.)

DNA technology isn’t quite there yet, though with CRISPR who knows – but that’s a story for another day!
