8th Discussion-1 March 2010
Brief Description and Continuing Discussion:
Analysing Large Scale Image Data (led by Simon Andrews)
Although much attention has recently been focused on the high throughput datasets coming from next generation sequencers there is another potentially large source of data on the horizon. Automated imaging systems are now being scaled up such that they will be able to routinely produce data in a volume which matches that of the larger sequencing experiments. In contrast to sequence data, which is generally easy to manipulate and analyse, extracting relevant information from images is difficult. This section of the call aims to find out whether many core facilities are encountering this data, and how they are handling it.
Recruiting and retaining bioinformatics staff (led by David Sexton)
One of the biggest challenges for any bioinformatics core facility is recruiting and retaining good staff. Many cores employ only a small number of people and tackle a wide range of jobs, so finding people who are competent, independent and adaptable is key to the success of the core.
This section of the call will look at how different cores recruit new staff. What qualifications do they ask for? Where do they advertise? How do they interview? It will provide an opportunity for people to tell of their successes and failures and see if we can all benefit from the experience of others.
Transcript of Minutes
Most of the recent discussion about the handling of large datasets has centred on next generation sequencing technology. However, advances in imaging technology mean that imaging systems could potentially generate datasets of equal size. Unlike sequencing, the analysis of imaging data is often non-trivial and carries a high computational load. It's therefore possible that imaging could present the biggest challenge on the horizon for the workload of bioinformatics core facilities.
The aim of this session was to gauge whether imaging data was starting to have an impact on the workload of bioinformatics facilities, and how they were coping with this type of data.
[Simon Andrews – SA] asked whether people were seeing large imaging datasets and what impact these were having on facilities.
[Charlie Whitacre – CW] said that his users were running image analyses built around custom Matlab programs or other open source packages. These analyses were placing a large computational strain on their clusters, to the point that they were becoming overloaded. Jobs generally ran in a single thread, with each job requiring a single core for ~1 hour. With thousands of jobs being submitted, this was enough to tie up the Sun Grid Engine.
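To put the load described above in rough numbers, the following is a minimal sketch of how long a queue of single-threaded, one-hour jobs would occupy a cluster. The job count (5000) and core count (256) are hypothetical values chosen for illustration, not figures reported on the call, and the model ignores scheduler overhead and competing work.

```python
import math


def wall_clock_hours(n_jobs, hours_per_job, n_cores):
    """Estimate the hours needed to clear a queue of single-threaded jobs.

    Assumes one job per core, equal run times, and no scheduler overhead.
    """
    if n_jobs < 0 or n_cores < 1:
        raise ValueError("need n_jobs >= 0 and n_cores >= 1")
    waves = math.ceil(n_jobs / n_cores)  # full passes over the cluster
    return waves * hours_per_job


# Hypothetical figures: 5000 jobs of ~1 hour each on a 256-core cluster.
hours = wall_clock_hours(5000, 1.0, 256)
print(f"{hours:.0f} hours of wall-clock time")
```

Under these assumptions a single submission of this size would occupy every core for most of a day, which is consistent with the overload being described.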
[Fran Lewitter – FL] Observed that groups purchasing new imaging equipment weren't considering the requirement for the storage and analysis of data.
[David Sexton – DS] said that his institution requires the informatics group to be notified of all grants which could possibly generate large amounts of data, to ensure the analysis load can be handled. [Joe Andresen – JA] asked how this was enforced. [DS] said that it was enforced by the central administration through whom grants were submitted.
[SA] asked whether people were generally relying on the software which came with imaging systems or using either external programs or custom developed solutions.
[Brent Richter – BR] said that for the data they dealt with, the software from the imaging systems wasn't used at all for the analysis. All images were put into Matlab or custom tools based around ScaleMP. The data was usually medical imaging (MRI etc.) and required sophisticated analysis.
[Dawei Lin – DL] wondered what sizes of images or movies people were getting.
[BR] said that it was the number of images which was an issue rather than the size of individual images. A single dataset often comprised multiple tiled images.
[SA] added that the biggest potential problem was automated imaging systems which were starting to replace traditional biochemical assays with assays based on high-content imaging. These systems could image hundreds of samples, generating time series of hundreds of images for each one. Scientists generally weren't interested in the image data itself but just the extracted data from the features in the images.
[Thomas ? – T?] asked how long images from these types of systems were being stored.
[SA] replied that although scientists weren't interested in the images directly they still wanted them stored since they were the primary data collected in the experiment. [BR] said that storage requirements for their imaging were considerable. They had put in place a 2 PB storage system. They stored everything if they possibly could.
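As a rough illustration of why storage adds up, the sketch below multiplies out a high-content screen of the kind described earlier (hundreds of samples, hundreds of images each) and compares it with a 2 PB store. The counts of 300 and the 8 MB per-image size are hypothetical; no such figures were given on the call.

```python
MB = 10**6
PB = 10**15


def run_bytes(n_samples, images_per_sample, bytes_per_image):
    """Raw image volume produced by one screening run."""
    return n_samples * images_per_sample * bytes_per_image


def runs_that_fit(capacity_bytes, bytes_per_run):
    """Number of complete runs a store can hold if nothing is deleted."""
    return capacity_bytes // bytes_per_run


# Hypothetical run: 300 samples x 300 time points x 8 MB per image.
one_run = run_bytes(300, 300, 8 * MB)
print(one_run / 10**9, "GB per run")
print(runs_that_fit(2 * PB, one_run), "runs fit in 2 PB")
```

The point of the sketch is only that the image count, rather than the size of any single image, drives the total.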
[Matthew Eldridge – ME] asked whether the software which came with the imaging systems was flawed, or whether specialised software was required.
[SA] said that they had had lots of problems getting the software which came with the imaging systems to work as expected. Scientists often believed that the supplied software would be able to handle their analysis, but found when they came to it that they couldn't quite get what they wanted. They then approached bioinformatics for a solution. It wasn't that the supplied software didn't work, but just that it was often so generic that it couldn't be tailored well enough to a specific purpose.
[SA] asked whether other sites had dedicated imaging facilities which undertook much of the image analysis work, or whether it was passed on to the bioinformatics group.
[BR] said that more work was coming back to the informatics core. Radiology and They were slowly moving toward analysis applications though.
[FL] said that a lot of people asked about image analysis but that most analysis was done with software already available within the imaging core. Generally the informatics core would just look for a more specialised group able to help with a particular imaging application and point the user to them rather than trying to do the analysis themselves.
[DS] said that image storage was the main problem he faced.
[SA] added that in addition to total storage some high speed acquisition systems also caused strain on the network between the imaging facility and the main storage due to the rate at which data could be acquired. It was possible that systems would have to wait for data to transfer before being run again.
[BR] said that his capturing systems had been OK up to now. Most were slow enough that data could be transferred off quickly enough.
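The network concern raised by SA comes down to comparing the acquisition rate with the usable bandwidth to main storage. The sketch below shows that comparison; the 200 MB/s acquisition rate, the link speeds and the 70% link efficiency are all hypothetical values, not measurements from any facility on the call.

```python
def usable_bytes_per_second(link_gbit, efficiency=0.7):
    """Approximate sustained throughput of a network link in bytes/s.

    `efficiency` is an assumed allowance for protocol and disk overhead.
    """
    return link_gbit * 1e9 / 8 * efficiency


def backlog_bytes(hours, acquire_bytes_per_second, link_gbit, efficiency=0.7):
    """Data left waiting on the instrument after `hours` of acquisition."""
    surplus = acquire_bytes_per_second - usable_bytes_per_second(
        link_gbit, efficiency
    )
    return max(0.0, surplus) * hours * 3600


# Hypothetical instrument writing 200 MB/s for one hour.
for link in (1, 10):
    waiting = backlog_bytes(1, 200e6, link)
    print(f"{link} Gbit link: {waiting / 1e9:.0f} GB still to transfer")
```

With these assumed numbers the 1 Gbit link falls behind by roughly 400 GB per hour while the 10 Gbit link keeps up, which is the situation in which an instrument would have to pause between runs.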
[SA] asked if any groups were getting actively involved in the image analysis directly rather than providing computational infrastructure.
The consensus was that at this point few people were actively involved in direct analysis of images.
[SA] asked if anyone had tried any of the publicly available analysis toolkits. He had experience of ImageJ and wondered whether anyone else had used this either as a stand alone program or as a development platform.
[A few people] mentioned that they used ImageJ as a viewer.
[Someone] said that they had looked into using ImageJ as a development platform but hadn't done any work on this yet and wondered whether this was something worth pursuing.
[SA] said that ImageJ had been working very well for them as a development platform. It was possible to use it either directly by writing plugins for the program or indirectly as a data model in a larger standalone program. He also recommended the Bio-Formats project, which provides access to a range of proprietary image formats used on commercial imaging systems which are otherwise not well supported in other imaging libraries.
[SA] asked about how people were using Matlab for image analysis.
[Someone from Uni Mass] said that they were using the image processing toolbox, which in turn required the signal processing toolbox. Together these two provided a simple way to do complex analyses such as a fast Fourier transform.
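For readers unfamiliar with the operation mentioned here, the following is a minimal pure-Python sketch of a two-dimensional fast Fourier transform applied to a toy image. It is not the Matlab toolbox implementation; it is a textbook radix-2 Cooley-Tukey FFT, restricted to power-of-two sizes, and the 8x8 striped test image is invented for illustration.

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(values)
    if n == 0 or (n > 1 and n % 2):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft2(image):
    """2D FFT: transform every row, then every column of the result."""
    rows = [fft(row) for row in image]
    cols = [fft(list(col)) for col in zip(*rows)]
    # Transpose back so the result is indexed [vertical][horizontal].
    return [list(row) for row in zip(*cols)]


# Toy 8x8 image: vertical stripes two pixels wide (period of 4 pixels).
image = [[1.0 if x % 4 < 2 else 0.0 for x in range(8)] for _ in range(8)]
spectrum = fft2(image)

# Horizontal-frequency magnitudes along the zero vertical-frequency row.
print([round(abs(value), 3) for value in spectrum[0]])
```

Because the stripes repeat every 4 pixels across an 8-pixel row, the energy away from the zero-frequency term sits at horizontal frequency 2 and its mirror at 6, which is the kind of periodic structure an FFT is used to pick out of an image.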
[Deanne Taylor – DT] said that they had been using Octave as an open source alternative to Matlab and that this supported an image analysis package.
[Joe Anderson – JA] said that a downside of using Matlab was the licensing model. The use of multiple toolboxes could make it expensive and these dependencies made it difficult to compartmentalise the limited use of different toolboxes.
[Someone – DL?] said they got 5 Matlab licences and 4 licence packs for toolboxes, at $200 for each core licence and $120 for each additional module. There was a 20% annual charge for maintenance.
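The figures quoted can be totalled as below. This is only one reading of the minutes: it assumes each of the 4 toolbox packs is a single $120 module and that the 20% maintenance charge applies to the whole purchase price, neither of which was confirmed on the call.

```python
def purchase_cost(n_core, core_price, n_modules, module_price):
    """Up-front cost of core licences plus add-on toolbox modules."""
    return n_core * core_price + n_modules * module_price


def annual_maintenance(purchase, rate_percent=20):
    """Yearly maintenance, assumed to be a flat percentage of purchase."""
    return purchase * rate_percent / 100


# Figures from the minutes: 5 core licences at $200, 4 modules at $120.
upfront = purchase_cost(5, 200, 4, 120)
yearly = annual_maintenance(upfront)
print(f"up-front ${upfront}, maintenance ${yearly:.0f} per year")
```

If the toolbox price were instead charged per module per core licence, the module term would be multiplied by the number of licences and the totals would be correspondingly higher.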
[BR] said that he used the Image Analysis and Signal Analysis Statistics toolboxes for the licences they held on the nodes in their cluster.
At this point this session was ended.