Idea: Gold Standards for Microscopy Data
NOTE: This page is a place to get communication started, but the next step is to create a website (possibly http://www.microscopystandards.org/) - something without my face on it!
In order to increase work output and file interoperability, and to realize the full potential of microscopy data, we wish to create a set of "gold standards" and "best practices" to guide and measure both the file formats and the software used to view, segment and analyze microscopy data. In the fields of cell biology and structural biology, software standards are all but non-existent, so it is really not that surprising that software development is a mess. Without any guidance/authority, researchers (mostly young programmers) develop their own image viewing/segmenting/analysis software in "isolation", and thus the following three problems have become so common they are a joke among biologists:
- Reinventing the wheel: Recreating features that already exist in other programs, and not necessarily making any improvement.
- Tools which never see use by biologists: An incredibly common phenomenon where programmers isolate themselves from biologists, and their programs/algorithms are so hard to install and/or use (or don't really help answer any biological questions) that the programmer is the only person on earth who ever uses them.
- Inventing new (incompatible) file formats: A problem whereby each developer decides to create his own unique structures for storing images, image alignment, points, contours, surfaces etc.
By creating a website (affiliated with not one, but a plethora of universities), drumming up support/awareness from researchers all over the world, and creating publications, we hope to create a set of practical guidelines to help new developers (1) create tools which biologists find useful and easy-to-use, (2) integrate algorithms into existing full-featured and open source applications and (3) maximize file interoperability between applications. Only through these steps do we believe we can start to reduce the millions of research dollars lost to reinventing the wheel or to creating software that no-one but the author ever uses!
Before creating a website, we first need to (a) thoroughly check that something similar to this idea doesn't already exist (it would look bad if we create a site warning about standards/work duplication while a similar site already exists), and (b) give it a great name!
The name we choose should sound authoritative and broad (in case its scope increases later), but must also be memorable and have an available domain name. See: How to: come up with good names.
One domain name I noticed was totally free is microscopystandards.org ("Microscopy Standards"). This isn't necessarily the best name (I have zero personal attachment to it), but until we brainstorm other names this is the name and website I'll refer to. Ultimately I believe we don't want this website to appear closely associated with any one university or single person. One of the problems we face is that most research groups want to become a dominant authority in their field and instinctively resist taking on ideas from competing institutes - hence anything closely associated with a single group looks "less official". Instead we should associate with a large collection of universities and people (the more the better), although obviously it will take a few key people to drive it and write a publication. One outrageous idea to generate public interest/publicity/equality is to have the first author be "Dr About Time" from "The University of Common Sense", although I have no idea how this might fly with a journal! What might make more sense is to try and get our name or site absorbed/connected/adopted/affiliated with INCF (since this is a large organization and all about uniting cell biologists) and/or NITRC (which might be a bit less likely, as these guys are an independent group and mostly concerned with MRI programs).
This idea is not something I'd ever claim as my own! Many, many people are sick of non-interoperable software and reinvention (and the other problems I've described) and believe standards should exist in our field (see: EMDB for example).
I was inspired to create this page after Maryann Martone (a huge advocate of uniting cell biologists and making data shareable) introduced me to Dr James Anderson from the Marclab at a neuroscience convention. Not only did James share many of my views and frustrations, but after spending several years working for Microsoft, he was appalled by the lack of any standards or formal training - things which represent a critical cornerstone of the software industry. Standards for interface design, file formats and coding practices are used throughout industry - in fact these kinds of standards are ubiquitous in most organizations and fields, yet are completely lacking in cell science. We decided it is high time a group of us united to do something about it! With many modern microscopes now able to collect petabytes worth of information per year (1 petabyte = 1 million GB), the next level of scientific discovery in our field can only be achieved by storing, sharing and data-mining this data using databases and distributed structures - now, more than ever, it is important to try and introduce some basic standards.
The ideas here have been born partly of a growing necessity, but mostly from pure frustration. This frustration is shared by thousands of senior biologists and software engineers across the world: frustration that microscope data rarely exists in a format that can be shared, data-mined or reused between groups, frustration that so many software projects never get used, and frustration that the cellular and structural biology community has a total lack of standards for microscopy image files and segmentation (making it hard to convert between programs).
Almost every microscopy group around the world - every university with an electron microscope and/or light microscopes - has between one and a dozen computer scientists/engineers who create their own software products and file formats to visualize, segment (to automatically and/or manually delimit compartments) and analyze data from their microscopes - not to mention programs for image alignment/reconstruction/montaging/joining etc. Most of these tools represent duplication of work, most are never used by anyone outside the lab, and most are abandoned after the programmer leaves. There are several reasons things got to this poor state in academia:
- Lack of Training, Documentation and Advice. When a new programmer arrives, he's often on his own. Without guidance and good advice, he's likely to pick a project which sounds fun/appealing, but will start from scratch, reinventing many, many wheels (loading/saving images, viewing images, drawing contours etc.) along the way. In addition to poor coding practice, he'll almost certainly use the platforms/libraries and file formats he's most familiar with - not the ones which best suit other people.
- Student Programmers. Students can be excellent programmers, but without guidance (without anyone to teach them good coding practice, for example) these budding programmers are all too likely to get over-excited and do their "own thing". Rushing ahead like this, without first checking what else exists and what is actually valuable, is a deadly combination of youthful eagerness and lack of experience.
- Wanting Glory. Every so often someone will take on a huge project. The programmer wants to create a masterpiece, but he VASTLY underestimates the amount of work. Underestimating work (typically by a factor of at least 2, and often up to 100) is something all programmers do, but the younger ones are exceptionally bad at it. When you add together unrealistic expectations, poor planning, "heroism" and "feature creep", the result will be a failed project. Even if this programmer manages to create something brilliant, he fails to realize that unless it is stable, unless it integrates well with existing data (without needing extra steps for file conversion), unless he's willing to maintain the program for many years (ultimately the rest of his working life), unless it works on every platform, and unless it is an order of magnitude better/faster than the dozen other programs which do the same thing, hundreds of users won't suddenly just switch to it!
- Promoting the Group's Image. This is similar to the previous problem, except it is the group seeking glory. The big problem here is that the collective group (especially the group leader) wants to be able to say "we develop some cool software". Unfortunately it's all too easy to use software to generate a nice image your group leader can show off: "this is what we can do". This may impress a few people, but more and more people are realizing these pictures often represent something only one computer scientist in the lab knew how to generate, and are probably not reproducible by anyone else! The tough questions to face are: (1) how many other programs out there can already do the same type of visualization/segmentation/analysis - and is yours any more popular or powerful, (2) how many people and years have gone into this software, and (3) how many people are currently using this software?! Far more impressive than a pretty 3D picture is a graph of the number of users who (genuinely) use your program over time!
- Overprotecting code. In this field it's incredibly common for a programmer to burrow into a hole and not share their code with anyone - at very best they might ask one or two biologists to give them some data to analyze. In many cases you can understand the paranoia involved: often this code represents the bulk of a person's thesis, and they're afraid of sharing it because "someone might copy my unique ideas and publish before I do". In fact, ideas in this field are rarely as unique as you think: in automatic segmentation, for example, every man and his dog is creating the next brilliant 3D watershed/filter/energy-minimizing/machine-learning technique... but the person who gets the glory isn't the one who turns it into a single publication (only marginally different from a hundred other publications), but the one who actually turns it into a practical tool or plugin which a biologist can run with a couple of button clicks! The danger of programming in isolation is that you'll probably write your code in such a way that only you can ever use it... and when you finally do publish and finish your thesis, you'll move on to the next thing and find it too much effort to turn your code into something a single person will ever use again. Deep down you'll realize your brilliant new algorithm hasn't helped anyone in the way you hoped it would. Keeping code private until publication isn't necessarily a bad thing, but you should always write your code in such a way that you're ready to distribute it to the world on the day your publication is accepted.
- A separation between biologists and computer scientists. Wise advice my old supervisor gave me: the most successful software engineers in cell science are almost always those who have a biology background or (since only a precious few have been trained in both cell and computer science) who "sit with" cell biologists and have their own "biological driver". Only by interacting every day and sharing an office with the biologists and/or microscopes (the people who do the real work) is it possible for a programmer to slowly learn/absorb what is and isn't useful to these people in their day-to-day projects. Slowly this programmer will learn to speak "both languages" and learn to introduce himself not as a "programmer" but as a "scientist". Only when he truly understands the importance of (a) having a "biological question" and (b) hanging out with and forming relationships with biologists can this new scientist start to produce useful tools and become indispensable. Far too often, the computer scientists/engineers end up physically isolated (sometimes in different buildings!) from the biologists. When you put a bunch of programmers together it's very easy for them to distract each other with "the next greatest thing" and slowly forget the biologists exist. Rather than contributing tools that biologists find useful and believe in, there's now a danger these people will contribute to the mental "divide" or even resentment between the two groups. If biologists in your own university don't use your tools, what hope do you have of convincing others to?!
- Inventing new (incompatible) file formats. When it comes time to save data generated by your tool (or even to import data), the temptation is to simply generate your own custom text or binary file. This is a bad idea! Any data you output should be easy for biologists to share with each other. For every new format you invent, you'll need to implement new conversion programs to turn this data into a file format your collaborators can open in the other programs they use for further visualization/segmentation/analysis. Depending on what you need to save (especially if it's image data, rather than vector data), there is probably already a well-supported file format out there which you could be using. The most useful file formats are the ones which allow you to add extra variables (eg: RDF triples or XML) if you need to.
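To make that last point concrete, here is a minimal Python sketch (all element and attribute names are invented for illustration) of why an extensible format like XML beats a rigid custom binary file: extra metadata can be attached later, by you or by a collaborator's tool, without breaking readers that don't know about it.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def save_contour(points, path, **extra):
    """Write contour points as XML; any keyword arguments become
    extensible metadata attributes on the root element."""
    root = ET.Element("contour", {k: str(v) for k, v in extra.items()})
    for x, y in points:
        ET.SubElement(root, "point", x=str(x), y=str(y))
    ET.ElementTree(root).write(path)

def load_contour(path):
    """Read the points back; metadata attributes this reader never
    planned for (added by other tools) survive the round trip."""
    root = ET.parse(path).getroot()
    points = [(float(p.get("x")), float(p.get("y")))
              for p in root.findall("point")]
    return points, dict(root.attrib)

path = os.path.join(tempfile.gettempdir(), "contour.xml")
save_contour([(0.0, 1.0), (2.5, 3.5)], path,
             organelle="mitochondrion", slice=12)
points, meta = load_contour(path)
print(points)              # [(0.0, 1.0), (2.5, 3.5)]
print(meta["organelle"])   # mitochondrion
```

A reader written before the "organelle" attribute existed would still load the points unchanged - that forward compatibility is exactly what a fixed-layout custom binary format cannot give you.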
- Reinventing the wheel. When you consider the vast amount of software around (much of it kept private) and the large number of student programmers (young programmers who are enthusiastic to jump into coding, without necessarily checking what exists), it's not surprising there is so much "wheel reinvention" in our field. Years after starting a project, many programmers may hear the words they fear most: "wait, so how is your program different from the X program from the Y lab which we all use - why didn't you just use that?!". Often the programmer simply doesn't want to know - denial is bliss - but this is the kind of thing you should know before you invest so much time in any project!
- Creating system-dependent code. Within academia the best code is (a) open source, (b) cross-platform and (c) easy to get working! Java, although slow, has become popular because it will work on any computer, and Python is gaining popularity for the same reason. C++ is another desirable language, but only if written using only cross-platform libraries! With C#, Objective-C, Visual Basic and a host of other languages, you may be setting yourself up for failure! Many academics will test their code on only one computer, and never consider that within cell biology the use of Mac, Windows and various breeds of Unix systems (plus a mixture of 32- and 64-bit machines) is very common. In any given lab you are likely to find a big mixture, hence for software to become popular and code to be easily reusable it's imperative that you compile and test your program regularly on multiple platforms - right from the outset! Choosing the right libraries is the first step, but unless you regularly test on Mac, Windows and Linux, you could easily find yourself with a Java program which (without you knowing it) crashes on Mac, and you don't know why.
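A couple of these cross-platform habits can be shown in a short Python sketch (the file name below is invented): build paths with pathlib rather than hard-coding separators, and read/write binary data with an explicit byte order and field size rather than whatever the current machine happens to use.

```python
import struct
import sys
from pathlib import Path

# Portable path building: pathlib inserts "/" or "\\" as appropriate,
# so the same line works on Mac, Windows and Linux.
data_file = Path.home() / "microscopy" / "cell01.mrc"  # hypothetical file
print(data_file)

# Portable binary I/O: "<ii" pins little-endian, 4-byte integers, so
# this 8-byte record means the same thing on 32- and 64-bit machines.
record = struct.pack("<ii", 1024, 768)       # e.g. image width, height
width, height = struct.unpack("<ii", record)
print(width, height, len(record))            # 1024 768 8

# Know what you're actually running on before blaming the code.
print(f"{struct.calcsize('P') * 8}-bit Python on {sys.platform}")
```

Neither habit costs anything up front, but both remove an entire class of "works on my machine" bugs that only surface when a collaborator on a different platform opens your files.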
These are just a few of the many problems plaguing our field and preventing us from increasing our useful output to the levels which would be expected in industry. If you worked in industry and made a software tool which (a) doesn't integrate with/write/read existing data files, (b) nobody else finds useful/valuable, (c) nobody else can use/install/learn easily or (d) already exists... you would be fired! The idea of microscopystandards.org will be to help programmers in our field (especially new ones) quickly learn what is useful, and what expectations/standards they must follow to ensure their software contributes in a meaningful way to the field of cell biology!
NOTE: Something I deliberately haven't listed above is the continual rise of new technologies and flavors of the month. Once upon a time Cobol and Fortran were all the rage! After that came C, then C++, then C# and most recently Objective-C (used to program Apple products like iPhones and iPads). More recently again has been the introduction of new web technologies: WebGL and HTML5. Many people swear by procedural programming, many swear by object-oriented programming and other methodologies... some swear by Java while others swear by mathematically oriented languages like Python. The "flavor of the month" will probably always change as languages fall in and out of fashion... this problem, however, is not specific to academia. Many businesses and websites suffer the same problem; perhaps the only difference is academia is sometimes (even) slower to adapt, and so many of the programs we use are still legacy code!
Guidelines: An Introduction to the Field
This section is designed to introduce important concepts, building up a master "list" of standards. Since the section was getting too long I've moved it here:
The List: A List of Gold Standards
I've moved the list of standards I've been drafting up (and would like feedback on) to here:
This section contains important terms the reader must understand. I've moved this list of terminology here:
Plan of Attack
These are just rough steps, but represent a "path to success".
- Step 1: assemble lots of people (eg: Maryann, James, Dmitry) who are enthusiastic about this idea
- Step 2: formulate and agree on basic gold standards / recommendations (the tricky part)!
- Step 3: register and create a website (ideally one which has a nice front page, but also a MediaWiki where anyone/everyone can add code etc.)
- Step 4: e-mail everyone - send an e-mail to every cell biology lab on earth we can find, asking what program(s) they use and what gold standards they think they meet (almost like a checkbox survey) - good for publicity and to get people thinking about how they can get more stars.
- Step 5: write a nice publication. One so helpful and (refreshingly) honest it is a pleasure to read!
- Step 6: try to get publicity, and hopefully make some high-level presentations (especially if we can target the extreme frustration of scientists over the "reinventing the wheel" problem)
- Step 7: continue to maintain the wiki, but also hope it gets integrated into INCF... or we find a way to get funded
The end dream of this project is that whenever a new programmer wants to create a new tool - even if he just hints at the idea - his supervisor should tell him not to even think of typing a single line of code before visiting microscopystandards.org, and that "MicroscopyStandards.org: The Gold Standards for Microscopy Software" is the very first paper he should read. If any software developer presents his work to biologists, we want the biologists to ask the question "does your software adhere to the microscopystandards.org standards?". If this is done properly, any software developer who's never heard of these standards should be made to feel silly!
NOTE: Feel free to edit this page! What I'm hoping is that this page won't just get the ball rolling, but also help get a paper started! I imagine the "Problems" section might map nicely to an "Introduction", the "Guidelines" (advice) almost represents a "Discussion", and "The List" is presented almost like results.
- IMOD - naming objects - a page I wrote which talks about the importance of using ontologies and naming objects correctly.
- EMDataBank.org - A unified data resource for Cryo-Electron Microscopy data, run jointly by the Protein Data Bank in Europe (PDBe), the Research Collaboratory for Structural Bioinformatics (RCSB), and the National Center for Macromolecular Imaging (NCMI). This site covers a variety of techniques, including single-particle analysis, electron tomography, and electron (2D) crystallography. They've also made a publication:
- A 2005 publication from J.B.Heymann, M.Chagoyen and D.M.Belnap entitled 'Common Conventions for Interchange and Archiving of Three-Dimensional Electron Microscopy Information in Structural Biology', with corrigendum, addresses the issue of conventions for exchange of cryoEM data.
- This publication is a little more technical/narrow than the one we'd hope to write, but definitely a great start and an important reference to read and build on.
- Biosharing.org - an international push to develop catalogs and data sharing policies. Apparently in Europe there is a big push that scientists, as employees of the government, should be required to publish their data.
- Sites which provide libraries for greater file interoperability:
- CCP4 - CCP4 ("Collaborative Computational Project No. 4" - Software for Macromolecular X-Ray Crystallography) is a project which aims to produce and support a world-leading, integrated suite of programs that allows researchers to determine macromolecular structures by X-ray crystallography and other biophysical techniques.
- BSoft - Bsoft ("Bernard's Software Package") is a collection of programs and a platform for development of software for image and molecular processing in structural biology. Problems in structural biology are approached with a highly modular design, allowing fast development of new algorithms without the burden of issues such as file I/O. It provides an easily accessible interface - a resource that can be, and has been, used in other packages.