ISMB 2014: BioinfoCoreWorkshopWriteUp

Revision as of 08:16, 15 July 2014 by Simon andrews (talk | contribs)

Introduction

The annual bioinfo-core workshop ran successfully at the 2014 ISMB conference. We had good attendance despite the workshop clashing exactly with the World Cup final, and we're very grateful to everyone who chose to come along.

We changed the format of the workshop slightly from previous years. In the past we had always had two sets of presentations followed by a moderated group discussion. This year we had only one formal session with presentations, with the second half of the workshop taken up by a larger group discussion covering a number of topics collated from suggestions on the bioinfo-core mailing list.

The group discussions were very lively and a large number of people contributed to them. For those who weren't able to attend, we will try to summarise the main points of the discussions below. As these were all active discussion sessions we don't have a great number of notes to work from, so if you remember points which have been missed, please fill them in.

Introduction to the Workshop by Brent Richter

Earlier in the conference Brent had already presented to a special session which had introduced bioinfo-core as one of the new ISCB COSIs (communities of special interest). He was able to summarise the rationale for having bioinfo-core as a group and talk about the activities the group performs. Hopefully the increased exposure the group receives through becoming a COSI will help bring us to the attention of some people who might not have known about us before.


Topic 1: Core on a Budget vs Enterprise

Moderated by Matt Eldridge and David Sexton

Speakers:

  • Alastair Kerr - Edinburgh Uni
  • Mike Poidinger - A*Star

The purpose of this session was to look at whether it is possible to run a core facility on a limited budget, and to explore what becomes possible when you have a larger amount of money to spend.

Alastair Kerr led the session, talking about his small core (2 people) at Edinburgh University. His core is almost completely self-reliant and has to cover all of the hardware, storage, software and analysis infrastructure required for the full range of users he supports. Alastair described how his infrastructure is built on a number of key open source components: ZFS-based storage systems providing 0.5PB of storage for a fraction of the cost of commercial systems, pipelining and workflow systems built within Galaxy, and user-friendly analysis scripts provided to users through the R Shiny system.

Alastair described how he actively avoids the use of commercial software within his group, citing occasions in the past where the adoption of initially useful commercial packages had ultimately had negative impacts when the software later changed its licensing fees or became unsupported. The only commercial package they still use is Lasergene, for basic molecular biology manipulations, and this is mostly for historical reasons and for lack of a suitable open alternative.

Mike Poidinger then went on to present the contrary case. His group is very well funded by his supporting institution and is somewhat larger than the Edinburgh group, with 9 members. Mike's initial contention was that it should be a requirement when setting up a core that sufficient funding be provided, and that it would be reasonable to refuse to head up a core where suitable funding for an appropriate infrastructure was not forthcoming. Mike stressed that open source software played a major role in the operation of his core, with much of the analysis of data being provided by these types of packages, which are generally much more agile and capable than their commercial equivalents. However, he made a strong case for two particular pieces of commercial support software which now form a key part of his infrastructure - Pipeline Pilot and Spotfire.

Mike's contention was that whilst open source packages are very good at performing individual analyses, they can be difficult to work with because of the difficulty in collating and understanding the wide variety of output files and formats they generate. His group uses Pipeline Pilot to abstract away a lot of the 'dirty' parts of an analysis, leaving the commercial system to store and retrieve appropriate data and to handle the format conversions required to pass data through several parts of a larger pipeline. Having this type of collation system in place means that all of the analysis can be done in the form of pipelines, and a complete record of all analyses is preserved and can be reproduced or reused very easily.

The other package heavily used within his group is Spotfire. This is a data presentation and integration package which makes it easy for users to explore the quantitative data coming out of the end of an analysis pipeline. It competes with simple solutions such as Excel, or more complex analyses and visualisations in R, but provides a friendly and powerful interface to the data. Mike's team have linked these packages to other tools such as the MoinMoin wiki to provide a combined system which keeps a complete record of analyses, presents them back to the original users in a friendly way, and provides an interface through which users can themselves manipulate and explore the data further.

Overall it was Mike's contention that the use of these commercial products within his group added around 20% to the efficiency of his staff, and also allowed new members to get started much more quickly. The cost of the licensing for these packages was therefore outweighed by the efficiency improvements which his group gained from their use.

Discussion

Much of the initial discussion for the session focussed on whether there were commercial packages, or groups of packages, for which there wasn't a suitable free and open alternative. The major area which came up was packages providing functional annotation, with the main contenders being Ingenuity IPA and GeneGo MetaCore. Several sites are paying for these types of packages, and the consensus was that what you're paying for isn't the software but rather the expanded set of gene lists and connections which have been manually mined from the literature by these companies.

These types of system are generally liked by users as they provide an easy way into the biology of an analysis result. They offer some advantages over equivalent open source products, but their major open competitors such as DAVID and GeneMania are also very good and well used.

There was a general opinion that the costs and licensing terms for the commercial annotation packages were quite severe. This was especially the case for IPA, where some sites had started to do cost recovery for the licence and found that many of the previous users weren't prepared to pay. MetaCore licensing was more flexible, with the ability to buy licences for a given number of simultaneous users, which fitted many people's use cases better.

Comments were also made about the utility of these systems. There was some concern that although these systems are popular, they may not be all that biologically informative. Some groups had found that people tended to pick and choose hits from a functional analysis in the same way that they picked from gene hit lists, to reinforce an idea they already had rather than to formulate novel hypotheses.

Another case for commercial packages was made for situations where you want to enter a new area of science quickly and don't have the resources available to build up an in-house platform from open tools. This often happens when there is an important but likely transient interest in a new area. The example cited was the use of DNAnexus for variant calling, which may not be the absolute best in class, but is likely good enough for new users and is a well researched and validated platform. Setting this up takes minimal time and effort and can provide a cost effective solution for cores without the time or experience to develop a more tailored pipeline.