ISMB 2014: BioinfoCoreWorkshopWriteUp
The annual bioinfo-core workshop ran successfully at the 2014 ISMB conference. We had good attendance for the meeting despite the workshop clashing exactly with the World Cup final, and we're very grateful to everyone who chose to come along.
We changed the format of the workshop slightly from previous years. In the past we had always had two sets of presentations followed by a moderated group discussion. This year we had only one formal session with presentations, with the second half of the workshop being taken up with a larger group discussion covering a number of topics which were collated from suggestions taken from the bioinfo-core mailing list.
The group discussions were very lively and we had a large number of people contributing to them. For those who weren't able to attend, we will try to summarise some of the main points of the discussions below. As these were all active discussion sessions we don't have a great number of notes to work from, so if you remember other points please fill in anything which has been missed.
Introduction to the Workshop by Brent Richter
Earlier in the conference Brent had already presented to a special session which had introduced bioinfo-core as one of the new ISCB COSIs (communities of special interest). He was able to summarise the rationale for having bioinfo-core as a group and talk about the activities the group performs. Hopefully the increased exposure the group receives through becoming a COSI will help bring us to the attention of some people who might not have known about us before.
Topic 1: Core on a Budget vs Enterprise
Moderated by Matt Eldridge and David Sexton
The purpose of this session was to look at whether it is possible to run a core facility on a limited budget, and to explore what becomes possible when you have a larger amount of money to spend.
Alastair Kerr led the session, talking about his small core (2 people) at Edinburgh University. His core is almost completely self-reliant and has to cover all of the hardware, storage, software and analysis infrastructure required for the full range of users he supports. Alastair described how his infrastructure is built on a number of key open source components: ZFS-based storage systems which provide 0.5PB of storage for a fraction of the cost of commercial systems, pipelining and workflow systems built within Galaxy, and user-friendly analysis scripts provided to users through the R Shiny system.
Alastair described how he actively avoids the use of commercial software within his group and described occasions in the past where their adoption of initially useful commercial packages had ultimately had negative impacts when the software later changed its licensing fees or became unsupported. The only commercial package they still have is Lasergene for basic molecular biology manipulations and this is mostly for historical reasons and for the lack of a suitable open alternative.
A key benefit of this choice of open source infrastructure is that the process of data analysis, from raw data to paper figures, can be shared and used by anyone. Alastair's group facilitates this process by ensuring that any scripts developed in-house are shared in either the Galaxy Tool Shed or GitHub.
Mike Poidinger then went on to present the contrary case. His group is very well funded by his supporting institution and is somewhat larger than the Edinburgh group with 9 members. Mike's initial contention was that it should be a requirement when setting up a core that sufficient funding be provided and that it would be reasonable to refuse to head up a core where suitable funding to provide an appropriate infrastructure was not forthcoming. Mike stressed that open source software played a major role in the operation of his core, with much of the analysis of data being provided by these types of packages, which are generally much more agile and able than their commercial equivalents. However, he made a strong case for two particular pieces of commercial support software which now form a key part of his infrastructure - Pipeline Pilot and Spotfire.
Mike's contention was that whilst open source packages are very good at performing individual analyses, they can be difficult to work with due to the difficulty in collating and understanding the wide variety of output files and formats they generate. His group uses Pipeline Pilot to abstract away a lot of the 'dirty' parts of an analysis so that they can leave the commercial system to store and retrieve appropriate data and to handle the format conversions required to pass data through several parts of a larger pipeline. Having this type of collation system in place means that all of the analysis can be done in the form of pipelines and a complete record of all analyses is preserved and can be reproduced or reused very easily.
The other package heavily used within his group is Spotfire. This is a data presentation and integration package which makes it easy for users to explore the quantitative data coming out of the end of an analysis pipeline. It would compete with simple solutions such as Excel, or more complex analyses and visualisations in R, but provides a friendly and powerful interface to the data. Mike's team have linked these packages to other tools such as the MoinMoin wiki to provide a combined system which keeps a complete record of analyses, presents it back to the original users in a friendly way and provides an interface through which they can themselves manipulate and explore the data further.
Overall it was Mike's contention that the use of these commercial products within his group added around 20% to the efficiency of his staff, and also allowed new members to get started much more quickly. The cost of the licensing for these packages was therefore outweighed by the efficiency improvements which his group gained from their use.
There were some questions about the talks which had been presented. There was some lively discussion about the financial benefits of using commercial software, with some people arguing that the amount of money spent on a big commercial system would fund an additional FTE and that this would be a more productive use of the funding. Whilst a consensus was not really reached on this point, it seems that the merits of this, and possibly of several other commercial versus open decisions, depend on the scale of your group. Smaller cores are more able to support their own infrastructure but as the size of the group or the community expands then the support of infrastructure becomes more of a burden. At this point getting commercial support for storage, pipelining or data management becomes more attractive and allows the core to focus on the science rather than the specifics of the platforms being used.
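The licence-versus-extra-FTE argument can be made concrete with a rough break-even calculation. The sketch below is illustrative only: the 9-person team and the figure of around 20% are those quoted in Mike's talk, but the function names, the 2-person comparison and the idea of expressing a licence cost in units of one FTE's cost are assumptions introduced for this example. It also treats an efficiency gain as directly equivalent to extra staff time, which is a simplification.

```python
def fte_equivalent_gain(team_size: int, efficiency_gain: float) -> float:
    """Extra capacity, in FTEs, from a fractional efficiency gain across a team.

    efficiency_gain is a fraction, e.g. 0.2 for a 20% improvement.
    """
    if team_size < 0 or efficiency_gain < 0:
        raise ValueError("team_size and efficiency_gain must be non-negative")
    return team_size * efficiency_gain


def licence_is_worthwhile(team_size: int, efficiency_gain: float,
                          licence_cost_in_ftes: float) -> bool:
    """True if the capacity gained exceeds the licence cost.

    licence_cost_in_ftes is a hypothetical figure: the annual licence cost
    divided by the annual cost of one staff member.
    """
    return fte_equivalent_gain(team_size, efficiency_gain) > licence_cost_in_ftes


if __name__ == "__main__":
    # Compare a large and a small team for a licence costing one FTE (assumed).
    for size in (9, 2):
        gain = fte_equivalent_gain(size, 0.2)
        verdict = licence_is_worthwhile(size, 0.2, 1.0)
        print(f"team of {size}: gain of {gain:.1f} FTE, worthwhile: {verdict}")
```

Under these assumptions a 20% gain across 9 people is worth roughly 1.8 FTEs, so a licence costing one FTE would pay for itself, whereas the same gain across 2 people is worth only 0.4 FTE and would not. This is one way of seeing why the answer may depend on the scale of the group.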
A suggestion which came out of this discussion was that bioinfo-core could try to collate some ideas about what infrastructure would be useful to put in place when establishing a new core. The idea would be that we could generate a basic check list of the types of components you would want and give some options for available solutions for each area and add comments about the merits of each. To this end we've set up a basic template page which we can expand after further discussion on the list.
Much of the subsequent discussion for the session focussed on whether there were individual or groups of commercial packages for which there wasn't a suitable free and open alternative. The major area which came up was for packages providing functional annotation, with the main contenders being Ingenuity IPA and GeneGo MetaCore. Several sites are paying for these types of packages and the consensus was that what you're paying for isn't the software but rather the expanded set of gene lists and connections which have been manually mined from the literature by these companies.
These types of system are generally liked by users as they provide an easy way into the biology of an analysis result. They offer some advantages over equivalent open source products, but their major open competitors such as DAVID, GOrilla and GeneMANIA are also very good and well used.
There was a general opinion that the costs and licensing terms for the commercial annotation packages were quite severe. This was especially the case for IPA, where some sites had started to do cost recovery for the licence and found that many of the previous users weren't prepared to pay the costs for this. MetaCore licensing was more flexible, with the ability to buy licences for a given number of simultaneous users, which fitted better for many people's use cases.
Comments were also made about the utility of these systems. There was some concern that although these systems are popular they may not be all that biologically informative. Some groups had experienced that people tended to pick and choose hits from functional analysis in the same way that they picked from gene hit lists to try to reinforce an idea they already had, rather than trying to formulate novel hypotheses.
Another case for commercial packages was made for situations where you want to quickly enter into a new area of science and you don't have the resources available to build up an in-house platform for open tools. This can often happen if there is an important but likely transient interest in a new area of science. The example cited was the use of DNAnexus for variant calling, which may not be the absolute best in class, but is likely good enough for new users and is a well researched and validated platform. Setting this up takes minimal time and effort and can provide a cost effective solution for cores without the time or experience to develop a more tailored pipeline.
Session 2: Community suggested open discussions
Moderated by Simon Andrews and Brent Richter
For this session the workshop organisers had put out a request on the mailing list for topics the group would like to discuss. There were a large number of responses which were then collated to pull out the most common topics for the discussion session. Other suggestions will either be put back to the list or will be used as part of one of the forthcoming conference calls.
There were 3 major areas which were selected for coverage within the session:
- Using pipelines within a core
- Managing workloads
- Funding your core
Using pipelines within a core
The motivation for this topic was to see how many cores had already introduced automation within their core and to look at the factors which influenced their choice of pipelining system. We had already heard from a group which was heavily involved in the commercial Pipeline Pilot system and one which had used Galaxy to construct workflows. We then heard from a couple of other groups - one had started from the GATK Queue system but had found that this wasn't directly usable on their system. Another group had developed a new pipelining system, ClusterFlow, since they found that none of the available systems immediately fitted well with their existing infrastructure, and that it was as easy to develop their own system from scratch as to try to tailor an existing system to fit their needs.
Several people said that they had developed pipelines but without using a traditional pipelining system. Rather they had simply produced specific scripts which ran an analysis end to end. Sometimes they had split this up into several modular scripts which were chained together, but without the benefits in parallelisation and scheduling which could have been provided by a formal system. These types of scripted pipelines are common in that they naturally develop out of individual analyses but there was some concern about how well they could scale in future.
In practical terms the factors which had influenced or limited the adoption of pipelines were things such as:
- The ability to configure the settings used for steps within a pipeline. There were competing theories about this - some people preferred steps which were very static once set up so that it was easier to maintain consistent operations. Others wanted the ability to tweak all settings easily to have maximum flexibility within the pipeline. A lot of people had found that the overhead of writing suitable wrappers for individual programs within a pipeline was quite high, especially if all options needed to be encoded in the wrapper to allow them to be changed.
- Fit with existing infrastructure. Many people implementing pipelines will not be building a new system from scratch but will need it to integrate with an existing cluster, so the ability of the pipeline to support the scheduling system being used, the software management system, the nature of the filesystem and various other factors all matter.
- Recording and reproducibility. There is quite a lot of variability in how the results and settings for pipelines are recorded and what information is retained. Some groups need the ability to easily query and collate results from factors within a large set of pipelines and depending on how the data is recorded this may be more or less easy.
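Two of the factors above, configurable step settings and the recording of what was run, can be illustrated with a minimal sketch. This is not how Galaxy, Pipeline Pilot, GATK Queue or ClusterFlow work; the `Step` class and `run_pipeline` function are hypothetical names invented for this example. It runs steps one after another with no scheduling or parallelisation, much like the chained scripts described earlier.

```python
import json
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Step:
    """A wrapper around one program: an argument template plus default settings."""
    name: str
    template: list                     # e.g. ["tool", "--threads", "{threads}"]
    defaults: dict = field(default_factory=dict)

    def render(self, overrides=None):
        """Merge defaults with per-run overrides and fill in the template."""
        settings = {**self.defaults, **(overrides or {})}
        argv = [part.format(**settings) for part in self.template]
        return argv, settings


def run_pipeline(steps, overrides=None, log_path=None):
    """Run steps in order, stop at the first failure, and record what was run."""
    overrides = overrides or {}
    records = []
    for step in steps:
        argv, settings = step.render(overrides.get(step.name))
        result = subprocess.run(argv, capture_output=True, text=True)
        records.append({
            "step": step.name,
            "argv": argv,
            "settings": settings,
            "returncode": result.returncode,
            "stdout": result.stdout.strip(),
        })
        if result.returncode != 0:
            break                      # later steps depend on this one
    if log_path is not None:
        # Keep a queryable record of the settings used for each step.
        with open(log_path, "w") as handle:
            json.dump(records, handle, indent=2)
    return records


if __name__ == "__main__":
    # A stand-in step that calls the Python interpreter instead of a real tool.
    demo = [Step("double", ["{python}", "-c", "print({n} * 2)"],
                 {"python": sys.executable, "n": 3})]
    for record in run_pipeline(demo, overrides={"double": {"n": 5}}):
        print(record["step"], record["settings"]["n"], record["stdout"])
```

Keeping defaults in the wrapper and overrides per run is one possible compromise between the two preferences described above: a step behaves consistently unless a setting is changed deliberately, and any change ends up in the recorded settings.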