Processing #Pruittdata: an ode to data, data collection, and data analysts

In trying to wrap my brain around the effects of #Pruittdata, I’ve been thinking a lot about how I collaborate, who I trust, and how I approach the process of doing science. The following is my thought process, trying to deal with the aftermath. It became somewhat of a manifesto – an ode to data, data collection, data analysis, and my empirical and analytical collaborators.


TL;DR: Don’t be snobby: both empiricists and analysts bring incredibly valuable skills and perspectives to the process of doing science. Don’t blame victims. Be careful who you trust in this process, but continue to do science that excites you.


To outsiders, the life of a data-collecting animal behavior researcher must look very strange. Especially when we conduct observational studies on wild animals, we spend significant amounts of time just waiting for our study animals to show up, get caught in traps, or stop sleeping and do something scientifically interesting. But despite all that waiting, we can’t get bored, can’t lose concentration, can’t even blink, or we might miss that critical observation.

That leaves a lot of tedious, boring hours in between. This is especially true when we study animals that move around or hide, or seem to disappear from the surface of the world the second they sense a biologist anywhere near. Out collecting data on slow days, I sometimes feel like some sort of ambush predator, maybe a crocodile, motionless and camouflaged, waiting for a tasty morsel of an observation to wander by so that I can pounce on it and trap it in my notebook as data, to crunch, digest, and assimilate later.

On the analysis side, the daily workflow can also look strange. Analysts are often motivated by trying to find generalities, applying a similar analytical approach across many different kinds of datasets to pursue similar goals or find similar patterns. It’s easy for an outsider to view analysts as “just pushing a button”, causing results to be magically summarized in the most efficient and appealing way. It’s common to see us analysts as people with hammers, swinging the same tool at a suite of different types of questions and types of data, or as “data monkeys” who swoop in once all the hard work of collecting data is done and who, perhaps with little or no experience in the biology of the system, dive into the analyses, making assumptions willy-nilly with only the loosest basis in the biological reality of the system.

But there is just as much expertise on this side as on the side of the data collectors. I’ve been on both sides of this – both in the field collecting my own data, and on the side of the data analyst, brought in after data have already been collected to help make sense of complicated patterns. Sitting down with someone else’s data, even when everything is well documented, is an exercise in data forensics. A good data collector / analyst collaboration requires a significant amount of discussion. How the data were collected or how behaviors were measured can require radically different types of analytical or statistical approaches. Seemingly benign assumptions on both sides can result in wildly different results, so it’s critical to build this discussion into the collaboration. In some cases, like with historical data, or data scraped from the web, there’s a lot that we can’t know about how the data came to be, and it’s especially critical to keep those missing details in mind as analyses progress.
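To make that forensic first pass a little more concrete, here is a minimal sketch of the kind of sanity checks I mean, written in R. Everything in it is hypothetical – the file name, the column names like individual_id and boldness_score, and the specific checks would all depend on the actual dataset and how it was collected.

```r
# A minimal, hypothetical first pass over a collaborator's dataset.
# File and column names here are invented for illustration.
library(dplyr)

dat <- read.csv("collaborator_dataset.csv")

# Basic structure: how many rows, what types, anything obviously broken?
str(dat)
summary(dat)

# How many observations per individual? Wildly uneven counts are worth a conversation.
dat %>%
  count(individual_id, sort = TRUE) %>%
  head()

# Exact duplicate rows, or particular values that repeat suspiciously often,
# are things to flag and discuss with the data collector, not silently drop.
sum(duplicated(dat))
dat %>%
  count(boldness_score, sort = TRUE) %>%
  head()

# Are any values outside the range the measurement method could plausibly produce?
range(dat$boldness_score, na.rm = TRUE)
```

None of these checks replace the conversation itself; they are just a way of generating better questions to ask.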

Staying curious and poking into the potential assumptions others may have made can slow down the process of collaborating, but this slowing down, and taking time to think things out, is essential for my collaborative process, whether I’m the empiricist or the analyst. In an environment of publish or perish, with such strong incentives for high productivity, taking this time sometimes turns into a fight against those pressures. Finding collaborators willing to take this time, or requiring impatient collaborators to make time for this process, is one way to ensure good and trustworthy results.

It’s easy for both empiricists and analysts to be snobby, to build up the skills and expertise of their own side and dismiss or minimize the skills and expertise on the other side. When the goal is to do good science, and to be confident in the results, this approach is counterproductive. I have seen this blasé attitude from both sides: from data collectors who dismiss the efforts of their analyst collaborators and ridicule their lack of knowledge of a particular biological system, and from data analysts who seem completely unconcerned with and uninterested in the particulars of data collection or a system’s natural history and just want to get their hands on some – any – data that they can throw into their analytical meat grinder. This is a giant red flag for me, and I try to avoid both collaborating with analysts who request data and accepting data from empirical scientists who request analyses if I get a sense that they hold these kinds of views. Sometimes it is hard to tell.

Integrating empiricists and analysts is becoming more and more critical. Computational approaches can be used to model complicated interactions and to predict how systems “should” behave given certain rules, and those simulated results can then be compared against real empirical data. If they match, the “rules” in the computational model could plausibly be the same “rules” that generated similar patterns in the real data (a toy sketch of this simulate-and-compare loop appears below). But it’s critical to have a back-and-forth discussion, and to think about the plausibility of these rules from both sides. As more animal behavior studies assemble long-term datasets on their systems, I am increasingly seeing the typical small-sample-size behavioral dataset morph into something much larger. Smaller datasets are often easier for someone without a lot of analytical training to manage – as long-term datasets grow, more empiricists are finding themselves faced with a dataset that has grown into “big data” and is now too cumbersome for them to process easily.
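Here is that toy simulate-and-compare sketch in R. The “rule” (a familiarity bias in who interacts with whom), the summary statistic, and every number in it are invented for illustration – a real comparison would be built around the biology of the actual system and agreed on with the empirical collaborators.

```r
# Toy simulate-and-compare sketch: simulate interaction data under a simple rule,
# then ask which rule strength best reproduces a pattern in the observed data.
# All rules, statistics, and values are hypothetical.
set.seed(42)

simulate_interactions <- function(n_individuals, familiarity_bias) {
  # Pairwise interaction counts under a made-up "interact more with familiar partners" rule
  familiarity <- matrix(runif(n_individuals^2), n_individuals, n_individuals)
  expected_rate <- exp(familiarity_bias * familiarity)
  matrix(rpois(n_individuals^2, lambda = expected_rate), n_individuals, n_individuals)
}

# Summary statistic to compare between simulation and observation: skew of interaction counts
skewness_stat <- function(m) {
  x <- m[upper.tri(m)]
  mean((x - mean(x))^3) / sd(x)^3
}

observed_skew <- 1.8  # placeholder; this would come from the real empirical dataset

# Simulate across a range of rule strengths and see which gets closest to the observed pattern
bias_values <- seq(0, 2, by = 0.25)
simulated_skew <- sapply(bias_values, function(b) {
  mean(replicate(100, skewness_stat(simulate_interactions(30, b))))
})

bias_values[which.min(abs(simulated_skew - observed_skew))]
```

The point is not this particular model, but the back-and-forth: the empiricist can say whether a familiarity rule is even biologically plausible, and the analyst can say whether the chosen summary statistic could actually distinguish between competing rules.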

On a much shorter timescale, a similar process can happen when empiricists adopt remote sensing methods. Instead of chasing through forests for hours after their elusive study animals, hoping for the chance of a single sighting per day, empiricists can now have location data streamed back to them, tracking those animals every second. Remote recording of vocalizations and camera trap data can produce massive datasets. These changes require empiricists to change their approach to collecting and managing these suddenly gigantic datasets.

My first postdoc was at the National Institute for Mathematical and Biological Synthesis, referred to as NIMBioS (pronounced “nimbus”, like the clouds). As a synthesis center, all of us worked with other people’s data or on modeling projects. I was actually contractually forbidden from collecting new data while I was at NIMBioS. Soon after I started at NIMBioS, I went to a bird conference. It was funny how people couldn’t categorize me anymore: one person I knew approached me and said “So Liz, are you like… a mathematician now?” I gave a talk about some cool analyses with a long-term dataset, and someone came up to me afterward and said “You know, it’s good to actually go into the field sometimes to better understand the biology of the system you’re working with”. Both of these illustrate the idea that if I was not out collecting the data, I was suddenly suspect, no longer a “real” field biologist. Funnily enough, I had collected data for that particular project in the field for four months, and prior to learning how to code and run social network analyses part-way through grad school, I’d considered myself a pure field biologist! I’ve since built my career on leveraging the fact that I enjoy working with both empiricists and analysts.

When we’ve managed to get them, data are prized, coveted, and worried over to no end, no matter what kind of scientist you are. Looking back on old data notebooks, recorded in the field, I can see traces of the back-story of data collection. Slow days may have doodles in the notebook margins. Especially rare observations may be surrounded with excitedly scrawled exclamation points and arrows. Pages are often smudged with sweat or dirt or dust, with mosquitoes sometimes trapped between the pages, flattened like pressed flowers. Looking back on my old R code, especially the in-progress trouble-shooting versions, is a hilarious dive into the frustrations and successes of working through the logic involved in complicated analyses: excited comments when code actually works, increasingly frustrated comments when things aren’t coming together and I’m missing some critical piece of logic. I joke that the best thing about coding in R is shaking my fist at the sky and yelling “ARRRRRRR!!!!” like a pirate. When my kids were little and getting dropped off at daycare, I trained them to say “Bye Ma, good luck with R” and, on coming home after a long day of analysis, “So, did you win against R?” Neither empiricists nor analysts have it easy, and an appreciation of the frustrations encountered and the expertise required leads to healthier, more creative, and more respectful collaboration.

Both empiricists and analysts love their data. Our results are only as good as our underlying data, and science can only move forward in productive directions when our results are based on good data and our analyses are done in a clear and well-justified manner. This is what makes the idea of intentionally fabricated data so horrifying, from either perspective. The idea of going into a painstakingly collected dataset and “adjusting values”, or “adding observations”, is absolutely abhorrent. The idea that a trusted collaborator could pass on a manipulated dataset to be analyzed in good faith by analytical collaborators breaks the trust that is fundamental to the scientific process.

I was asked once, in a weirdly aggressive job interview question (*not* at my current institution), “I’m sorry, can you even do your science without your collaborators?” This question has stuck with me ever since. At the time, it was an attack on my analytical capabilities. But more generally, no. No I can’t. No one can collect all the data in the world, no empiricist knows every single best way to try to detect complex patterns in data, and no analyst could do all the analyses they’d want to without data.

In my favorite collaborations, we all think carefully about assumptions, we try to understand the details of both the biology and the mechanics of the analyses, and we come to new insights that would not be possible otherwise. I like to bridge these two approaches, almost serving as a translator between the two. It’s become one of my favorite ways to do science.

So, thank you to all of my collaborators who have helped make this work. I could not do what I do without you and I could not be as creative as I’d like without the understanding and insight I get from your new perspectives. Like others, in order to do the science that excites me, that enriches our knowledge of the world, that makes this whole process worthwhile, I need to continue to trust both my empirical and analytical collaborators and to be trustworthy myself. I also need to break ties with those who have given me reason not to trust them further and to be careful about who I trust in the future.