Idea: Gold Standards for Microscopy Data - Guidelines

From NoskeWiki


NOTE: This page is a daughter page of: Gold Standards for Microscopy Data

The idea of this page is to introduce important concepts, advice, and recommendations to programmers who create software and image files in microscopy, and to provide a basis against which such software and files can be measured.

Guidelines: An Introduction to the Field

One thing is certain: programmers will always want to write/contribute new tools. What is not certain, however, is whether or not these tools get used! For this reason, we should aim to provide practical advice and code snippets (via a wiki interface) to anyone about to commence a programming project in the field of cell science. Our goal is to help these people learn from the advice and mistakes of more experienced programmers: people who have been around much longer and know what does and doesn't work well in cell science. Although we push the idea of standards, I imagine this will be at most 50% standards (probably less), with the rest being advice for new developers. Think of it as "a guidebook for programmers in life sciences".

This big section is almost like a "discussion" where the reader is introduced to all the important concepts they need to know, leading up to the List of gold standards. It's also very incomplete right now, so feel free to add text!

New Programs versus Plugins

Here we can highly recommend that if the programmer truly wants their tool used, they should probably aim to write a plugin for existing software (especially in cases where the programmer can talk to the main author of the program and integrate helpful tools into a standard release), and we could create a list of such programs which should be considered. We should really emphasize here that it is highly unlikely any new program will take off outside of a group/university unless it offers a five-fold improvement: most cell biologists are set in their ways, and with hundreds of programs already out there, nobody has time to test them all!

New File Format versus Existing Formats

Here we can list some formats already in use and talk about proprietary microscope formats. Unless there is a good reason for creating a new format, it is easiest to adopt something existing, because even if a program is written to convert between formats, scientists hate dealing with those extra steps!

Choice of Platform and Language

Here we can discuss what factors should influence the programmer's choice of platform (Windows / OS X / Linux etc). Ideally they should choose something cross-platform!

Making Code Open Source: Sharing Code

Here we can encourage people to make code open source.

Sharing Data

Here we can briefly discuss the problem biologists have sharing data, and why it's important that they do share their data (especially if already published), so that it gains exposure. The reasons biologists are often unwilling to share data are the same reasons programmers are often unwilling to share code: (a) they might know it's dodgy, or (b) they're afraid it will be reproduced by people without acknowledging where it came from!


Here we explain to new programmers in life sciences how critical it is to acknowledge where data comes from: all too often acknowledgement gets forgotten. At my old institute I saw many talks where "software" was presented without the programmers ever mentioning who actually produced the data (the important thing) in all their images. As good practice, every image in a slide should have the name of the person who collected the data, and on the last slide you should emphasize who collected the data and who did the specimen preparation! If you can say who did the specimen prep, you earn big credibility with biologists. This isn't always easy to keep track of, so the best software should actually provide functionality to record who did what.


Here we introduce the (important) idea of provenance, Kepler workflows, and being able to reproduce results by running exactly the same filters etc.
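As a rough illustration of the idea, here is a minimal sketch of a provenance log, assuming a simple in-memory history: each step stores the filter name, its parameters, and who ran it, so the same sequence can be replayed later. The names ProvenanceLog, Step, record and replay, and the toy filters, are hypothetical and are not part of Kepler or any existing tool.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    """One processing step: which filter ran, with what parameters, by whom."""
    filter_name: str
    params: dict
    operator: str


@dataclass
class ProvenanceLog:
    """Ordered history of the steps applied to a dataset (hypothetical design)."""
    steps: list = field(default_factory=list)

    def record(self, filter_name, params, operator):
        self.steps.append(Step(filter_name, dict(params), operator))

    def to_json(self):
        # Serialize the history so it can travel with the image file.
        return json.dumps([asdict(s) for s in self.steps], indent=2)

    def replay(self, data, filters):
        # Re-run the recorded steps, in order, using a name -> function table.
        for step in self.steps:
            data = filters[step.filter_name](data, **step.params)
        return data


# Toy filters operating on a list of pixel values (illustration only).
def threshold(pixels, level):
    return [1 if p >= level else 0 for p in pixels]


def invert(pixels):
    return [1 - p for p in pixels]


FILTERS = {"threshold": threshold, "invert": invert}

log = ProvenanceLog()
log.record("threshold", {"level": 128}, operator="A. Biologist")
log.record("invert", {}, operator="A. Programmer")

# Replaying the log on the same input reproduces the same result.
result = log.replay([10, 200, 128, 90], FILTERS)
print(result)
print(log.to_json())
```

Storing the operator with each step also covers the "who did what" record mentioned above, although a real tool would probably need timestamps and software versions too.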

New Concept: The "Life-cycle of Microscopy Data"

Here we might introduce the idea of a "life-cycle" shown as diagrams. In many cases data collected from the scope might go:

*acquired* > Reconstructed > Segmented > Movie > *DIE*   :(

Where *acquired* means collected on the scope, and *DIE* means it is kept private, forgotten about, or erased to make space, and realistically won't ever get used/queried again. Hopefully data is not just turned into images/movies, but can be analyzed to help answer a scientific question:

*acquired* > Reconstructed > Segmented > Analyzed > Published > *DIE*  :(

Unfortunately, most published data dies too. What we hope for is that data like this can live forever and be accessible to your peers, and this can be achieved by uploading it to a public, distributed database:

*acquired* > Reconstructed > Segmented > Analyzed > Published > Shared on Database > *lives on*  :)

By doing this, the data you've worked for may lead to further collaborations/publications/animations, and maybe even contribute to simulations.

Making Use of Ontologies

Here we discuss what ontologies are and why it's important that objects get uniquely identified and labelled correctly so that the next person to look at the data can work out what they're looking at.

Use of Specimen Coordinates and Atlases

Here we discuss why it's important to not only say what specimen something is from, but exactly where - and could cite a couple of examples such as mouse retina! Biologists should be encouraged to use coordinate systems/atlases to register a more exact position.

Image File Format

Here we could/should list out minimum metadata requirements, which should include (at a bare minimum):

  • microscope details
  • specimen details (species, sex, condition, age)
  • location (eg: not just retina, but a coordinate system)
  • pixel size along each axis
  • some level of provenance (history of what's been done)
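
The list above could be sketched as a small record type. This is only an illustration: the class name ImageMetadata, the field names, and the choice of nanometres for pixel size are assumptions, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field


@dataclass
class ImageMetadata:
    """Bare-minimum metadata for a microscopy image (hypothetical field names)."""
    microscope: str          # microscope details
    species: str             # specimen details
    sex: str
    condition: str
    age: str
    location: tuple          # position in a coordinate system / atlas
    pixel_size_nm: tuple     # (x, y, z) pixel size; nanometres is an assumption
    history: list = field(default_factory=list)  # history of what's been done

    def missing_fields(self):
        """Return the names of required fields that were left empty."""
        required = ["microscope", "species", "sex", "condition", "age",
                    "location", "pixel_size_nm"]
        return [name for name in required if not getattr(self, name)]


# Placeholder values for illustration only.
meta = ImageMetadata(
    microscope="example electron microscope",
    species="Mus musculus",
    sex="",                          # deliberately left blank
    condition="wild type",
    age="8 weeks",
    location=(1.2, 0.4, 0.0),
    pixel_size_nm=(1.0, 1.0, 4.0),
)
print(meta.missing_fields())
```

A check like missing_fields() lets a program warn the user at save time, before incomplete metadata gets written into a file and forgotten.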

We should strongly recommend that existing formats be used rather than each person creating their own. We could also suggest that any program should be able to load and write TIFF stacks (001.tif, 002.tif), since this is something most programs (ImageJ, Chimera, IMOD, QuickTime etc) support, but it is best to use them only when necessary, as you lose metadata in the process.

Pyramidal Images

Here we can talk about the importance of pyramidal files, and suggest a compliance with existing standards/formats.
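
A pyramidal file stores the same image at several resolutions, so a viewer can load only the level it needs. The following is a minimal sketch of how the level sizes might be worked out, assuming each level halves the previous one and the pyramid stops once the image fits in a single tile. The function name pyramid_levels and the 256-pixel tile are assumptions; real formats differ in the details.

```python
import math


def pyramid_levels(width, height, tile=256):
    """List (width, height) per level, halving until one tile covers the image."""
    levels = [(width, height)]
    while width > tile or height > tile:
        # Round up so odd sizes never lose their last row or column.
        width = max(1, math.ceil(width / 2))
        height = max(1, math.ceil(height / 2))
        levels.append((width, height))
    return levels


levels = pyramid_levels(4096, 2048)
for level, (w, h) in enumerate(levels):
    print(f"level {level}: {w} x {h}")
```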

Agreeing on ONE standard might be trickier than I thought, though! The three people I talked to use three different standards: Jamie (Marclab) uses a hierarchy of JPEGs (but admits this may not be the best way), Raj (UCSD) uses BigTIFF, and Rich Stoner (UCSD) uses JPEG 2000. Whatever we use, I think it would be a good idea to have a format that supports the metadata we described above. It is not my decision, but we could probably push for either:

  • BigTIFF
    • pros: can add own metadata; although a single file, it supports pyramidal 3D files using its internal file structure; good for image processing.
    • cons: no compression options.
  • JPEG 2000
    • pros: can add own metadata; supports pyramidal files; smaller file size if needed; supports lossless compression; code at iipimage.
    • cons: slow for image autosegmentation **.

I'm actually no expert in either, so you'd have to adjust this list (there are probably mistakes).

... and most recently Pete van-der Heide (from Australia) suggested [[1]] - something Neils Volkerman (Burnham Institute) is starting to use.

On Segmentation

Here we can start by suggesting that certain standard names (eg "object type", "contour", "point", "slice") should be used for the various elements, as variable names etc. Although binary formats are faster to load, we might push somewhat for XML, since it is always difficult to know what one might want to add later. I imagine we should suggest that files support hierarchy, and tags that uniquely tie every object to an ontology, to prevent name ambiguity and allow maximum data-mining and analysis. We could also recommend use of a spatial data structure, and a minimum bounding box maintained around every 2D/3D structure, in order to allow fast analysis.

  • Dmitri's tree example
  • Maryann's stuff
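
To make the suggestions in this section concrete, here is a minimal sketch of a hierarchical segmentation object that carries an ontology tag, reports a bounding box, and writes itself out as XML. The class name SegObject, the XML layout, and the "ONT:..." identifiers are assumptions for illustration, not real ontology terms or an existing file format. For simplicity the bounding box is recomputed on demand, whereas a real tool would more likely keep it updated as contours change.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SegObject:
    """A segmented object: contours of (x, y, slice) points plus child objects."""
    object_type: str
    ontology_id: str                       # placeholder ID, not a real term
    contours: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def bounding_box(self):
        """Min/max corners over this object's points and all its children."""
        points = [p for contour in self.contours for p in contour]
        for child in self.children:
            box = child.bounding_box()
            if box is not None:
                points.extend(box)         # a child's two corners are enough
        if not points:
            return None
        mins = tuple(min(p[i] for p in points) for i in range(3))
        maxs = tuple(max(p[i] for p in points) for i in range(3))
        return mins, maxs

    def to_xml(self):
        """Build a nested XML element mirroring the object hierarchy."""
        elem = ET.Element("object", type=self.object_type,
                          ontology=self.ontology_id)
        for contour in self.contours:
            c = ET.SubElement(elem, "contour")
            for x, y, z in contour:
                ET.SubElement(c, "point", x=str(x), y=str(y), slice=str(z))
        for child in self.children:
            elem.append(child.to_xml())
        return elem


crista = SegObject("crista", "ONT:0000002", contours=[[(2, 3, 5), (4, 6, 5)]])
mito = SegObject("mitochondrion", "ONT:0000001",
                 contours=[[(0, 0, 4), (10, 8, 4)]], children=[crista])

print(mito.bounding_box())
xml_text = ET.tostring(mito.to_xml(), encoding="unicode")
print(xml_text)
```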

On Segmentation GUI

Certain things work well. In this section we could list some good examples (with diagrams and animation) and even some bad examples. We should emphasize that the best tools are created by the people who use the tools.

On Animation

Here we can suggest that although it's possible to spend a lot of time creating tools for animation, it's also very valuable to be able to export to a format such as .obj or .vrml, since this allows you to import the data into a huge array of other programs for animation, including actual animation programs such as Blender, Cinema4D, Maya etc.
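
As an illustration of how little code a basic export can take, here is a minimal sketch that writes a triangle mesh as Wavefront .obj text, using only "v" (vertex) and "f" (face) records. The function name mesh_to_obj and the single-triangle mesh are hypothetical; a real exporter would probably also write normals, object names and materials.

```python
def mesh_to_obj(vertices, faces):
    """Return Wavefront OBJ text for a triangle mesh.

    vertices: list of (x, y, z) tuples
    faces:    list of 0-based vertex index triples
    """
    lines = ["# exported mesh (illustrative)"]
    for x, y, z in vertices:
        lines.append(f"v {x} {y} {z}")
    for face in faces:
        # OBJ face indices are 1-based, so shift each index by one.
        lines.append("f " + " ".join(str(i + 1) for i in face))
    return "\n".join(lines) + "\n"


vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
obj_text = mesh_to_obj(vertices, faces)
print(obj_text)
```

The resulting text can be saved with a .obj extension and imported into the animation programs mentioned above.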

On Databases

Here we could mention some of the databases that already exist (such as CCDB), to discourage people from reinventing the wheel. If, however, they feel they must, we can suggest all the standards they should meet.

On Simulations

Here we can point out that GOOD data, if correctly labelled (using shared vocabularies) and indexed etc, actually has a chance of living on and being used to dock proteins and/or in simulations. This reinforces the importance of naming things correctly at the very beginning!