Tuesday, July 26, 2011

Dealing With Data

When I arrived at GiantU, I got involved in a project to understand how scientists deal with data, how the Information Sciences can help, and how to get scientists to start sharing data. I've learned more about metadata than I ever wanted to, but I've also realized this: dealing with data is largely a personal issue. The knee-jerk reaction seems to be to declare that we need standards. However, I feel XKCD describes the issue with standards succinctly.

The project here involves the Information Science people, biologists, and materials scientists. The biologists seem to have it (comparatively) together, especially for gene sequencing. MSE (materials science and engineering), on the other hand, is a mess. Within a lab group, there may be some standards for how data is organized and formatted (behold the power of nested folders and descriptive file names), but there's very little agreement as to what is the data and what is the metadata (i.e., descriptive information). If you're tracking rainfall in the Pacific Northwest basin, there is a fairly discrete list of data/metadata you need to report: time frame of data collection, geographic coordinates, amount of precipitation, etc.

Materials science can be very process-driven (remember the tetrahedron?), so if you're tracking the density of carbon precipitates in a bar sample, you need the precipitate count and the bar size, but you also need the composition, the specimen manufacturing history, the heat-treating history, the sectioning method, the polishing method, the counting technique, and the microscopy method, just to start with. So while the final data may be the density of carbon precipitates, the *main* data is more or less everything else leading up to it. Several years ago, there was a project to develop a markup language for materials science, sponsored by some of the major professional societies. Even that conservative effort still came up with 39 tags, most of which require significant text for any given data point.
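To make that concrete, here is a rough sketch in Python of what a single precipitate-density record might look like. Everything in it is my own illustration: the class names, the field names, and the sample values are made up, and I'm assuming the density is reported as a count per examined area. It is not the markup language mentioned above, just a way to see how little of the record is the "result".

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ProcessingHistory:
    """Hypothetical container for everything done to the specimen before measurement."""
    composition: str
    manufacturing: str
    heat_treatment: str
    sectioning_method: str
    polishing_method: str


@dataclass
class PrecipitateMeasurement:
    """Hypothetical record: one density value plus the context needed to interpret it."""
    precipitate_count: int
    examined_area_mm2: float  # assumption: density is reported per unit area
    counting_technique: str
    microscopy_method: str
    history: ProcessingHistory

    def density_per_mm2(self) -> float:
        # The "final data" is a single number derived from two fields.
        if self.examined_area_mm2 <= 0:
            raise ValueError("examined area must be positive")
        return self.precipitate_count / self.examined_area_mm2


# Placeholder values for illustration only, not real measurements.
record = PrecipitateMeasurement(
    precipitate_count=120,
    examined_area_mm2=4.0,
    counting_technique="manual count",
    microscopy_method="optical",
    history=ProcessingHistory(
        composition="placeholder alloy",
        manufacturing="placeholder route",
        heat_treatment="placeholder schedule",
        sectioning_method="placeholder saw",
        polishing_method="placeholder polish",
    ),
)

print(record.density_per_mm2())
print(json.dumps(asdict(record), indent=2))
```

Two fields produce the number; the other seven are free text describing how the specimen got there, and in practice each of those would need far more than a short string.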

There is an increasing push for researchers to share data, but in what format? Who decides what format? Right now, the biggest decider for raw data formats is the software you're using, which can depend on your equipment manufacturer, and may be adjustable. What counts as data?

Many of the Information Science folks feel very passionately about the issue, but they seemed to think of it as a matter of building an appropriate database. Their overwhelming reaction in working with us has been, "We never realized how much metadata you had."

Open data is a nice philosophy, but there are a lot of barriers to be overcome, far beyond where the data will be stored.

1 comment:

  1. I'm just finishing my first year as a grad student and just beginning to generate significant amounts (i.e., more than I have ever had to handle before) of data, so I've been thinking about this a lot lately. What is important to include? What is critical to put in filenames without making them too unwieldy? My system is still evolving but I think it's on an upward swing.
    (Hello from a materials chemist!)