ISMB 2014: InfrastructureForNewCores
This page was created in response to a suggestion made at the 2014 ISMB workshop. One of the discussions there centred on configuring resources for a core facility and the merits of commercial packages. Following on from this, several people suggested it would be useful to document which core pieces of infrastructure people have within their facilities, which hardware and software the group considers useful when setting up a new core, and some specific packages which could fulfil each of these functions.
- Hardware: buy or rent (AWS)?
- Assume: ~20% usage for own hardware (my own average for the past year)
- Assume: equivalent pricing on S3 vs institutional storage ($0.03/GB/mo).
- Who does the sysadmin work for your own hardware?
- "Pathway Analysis" software. IPA is ~$6,000/year, or ~$20,000/year for concurrent license. Cheaper/free alternatives?
- "Plasmid drawing / in silico cloning" tools: commercial (CLC Bio/Vector NTI) or free, open-source solutions?
- "Sample tracking" and LIMS
- "workflow managers"
- Time tracking: spreadsheet (easy for startup, hard to scale) vs enterprise solution (more cumbersome initially (?), scales better)
- Hire personnel
- core people vs. embedded bioinformatician
- Interaction with existing groups
- other core facilities (functional genomics, mass spec, etc)
- computational / statistical biology research groups
- Setting up a culture of collaboration with the wet lab research groups
- make sure they talk to you before they start the experiment
- co-authorship vs acknowledgement
- offering training courses
- co-supervision of students
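The buy-vs-rent assumptions above can be turned into a quick back-of-the-envelope comparison. The 20% utilisation and $0.03/GB/mo storage rate come from the notes; the hardware price, lifetime, instance rate and storage volume below are purely illustrative placeholders you would replace with your own quotes:

```python
# Rough buy-vs-rent sketch. Only the 20% utilisation and $0.03/GB/month
# storage rate come from the notes above; everything else is a placeholder.

HW_COST = 30_000          # hypothetical server purchase price ($)
HW_LIFETIME_YEARS = 4     # hypothetical depreciation period
UTILISATION = 0.20        # ~20% usage, as assumed above
EC2_RATE_PER_HOUR = 1.00  # hypothetical on-demand rate for a comparable instance

STORAGE_TB = 50           # hypothetical storage requirement
S3_RATE_GB_MONTH = 0.03   # equivalent pricing assumed in the notes

hours_per_year = 365 * 24
compute_hours_needed = hours_per_year * UTILISATION

own_hw_per_year = HW_COST / HW_LIFETIME_YEARS
cloud_compute_per_year = compute_hours_needed * EC2_RATE_PER_HOUR
storage_per_year = STORAGE_TB * 1000 * S3_RATE_GB_MONTH * 12

print(f"Own hardware (amortised):  ${own_hw_per_year:,.0f}/yr")
print(f"Cloud compute at 20% duty: ${cloud_compute_per_year:,.0f}/yr")
print(f"Storage (same either way): ${storage_per_year:,.0f}/yr")
```

Note that at low utilisation the cloud compute figure drops in proportion to usage while the purchase cost does not, which is why the utilisation estimate matters so much to the decision.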
Notes from the call 14-Nov-2014
Need to be careful when buying hardware - make sure that you have the infrastructure in place to host the servers you buy: a suitable machine room with adequate cooling and power.
Make sure you match the hardware to the tasks. Clusters are not generic: a lot of modelling or chemistry work wants fast CPUs with modest memory and storage, but bioinformatics jobs often need large storage and high IO. Work out what sort of jobs you're going to run and buy appropriately.
You will need a lot of storage, which may be the biggest problem. Be sure you're confident in your admin skills if you're going to take this on yourself.
It might be possible to re-purpose existing hardware if you have something available, so find out what is already in your institution. There may be groups with infrastructure you can tap into, which can get you up and running quickly and let you decide later whether to create a joint system or whether you will ultimately need to go it alone.
You can do a lot with a single large SMP box. Custom distributions like BioLinux can help. If you look to scale up then get advice from a group or company who have done this many times before. Companies like BioTeam can help to put together a design and will think of things you’ve not considered.
Once you have significant infrastructure you really need to look towards having a full-time sysadmin. Maintaining the hardware, backups, storage, software and data pipelines is a huge task. Ideally you can involve central core IT services at your institution to allow your people to focus on the analysis work.
Once you start running a multi-user cluster, managing the software, hardware, queues etc. becomes a full-time job, and without it you will continually fall behind on security patches and upgrades. A nice fallback is to set up a consultancy agreement with an individual or company, which allows a smooth transition to having a permanent position within your group.
Pretty much no one has gone with cloud services. Although this seems attractive, the practical problems of maintaining an instance to your specification make it expensive and difficult. It might be worth looking at OpenStack as an alternative.
Managing interactions with existing groups
There are often people working in computational biology or statistics groups, and it's important to establish good relationships with these people from the start. Make sure people are aware that your group is being started and what its purpose is. Try to talk to all interested parties up front and be aware of any political problems which might exist. Confront issues early and don't wait for them to fester. Try to collaborate with groups rather than compete with them - in the end there's always more than enough work to go around.
Try to be the place that people come to get pointed to other experts. Don’t try to do everything yourself but be quick to forward people to other groups when you know they have more specific experience than your group does. This will provide a better service and will not alienate people.
Your first hire
Don’t rush into it. When there are only two of you, you will need to be completely confident in both the skill set and the personal qualities of the first person you hire. Ideally find someone you already know, or someone who comes recommended by someone you trust.
For their skills, you generally want someone with good problem-solving ability. When you’re a small core everyone needs to be a jack of all trades, so don’t focus only on their existing skills but try to judge how well they’re likely to pick up new areas, since things will change quickly.
Make personal interactions with the informaticians who are already in the institution.
It's often a good idea to have some sort of tracking in place from an early stage so you can justify the time you are spending. Even if you don't have to charge for the work, it can still be useful to know where your time is going.
This could be something as simple as a spreadsheet: projects, who is working on them, and so on.
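A minimal sketch of the spreadsheet approach, assuming a flat CSV log with made-up column names and example data, plus the kind of per-project collation that pays off later:

```python
# Minimal time-tracking sketch: a CSV of entries plus a per-project summary.
# The column names and example rows are invented for illustration.
import csv
import io
from collections import defaultdict

LOG = """project,person,date,hours
RNAseq-SmithLab,alice,2014-11-03,4.5
RNAseq-SmithLab,bob,2014-11-04,2.0
ChIPseq-JonesLab,alice,2014-11-04,6.0
"""

def hours_per_project(log_text):
    """Collate total hours per project from the CSV log."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(log_text)):
        totals[row["project"]] += float(row["hours"])
    return dict(totals)

print(hours_per_project(LOG))
# e.g. {'RNAseq-SmithLab': 6.5, 'ChIPseq-JonesLab': 6.0}
```

The same file can later be imported into Redmine, Jira or similar systems, so starting simple does not lock you in.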
There are also lots of project management systems, such as Redmine, which can do the same sort of thing. Atlassian Jira can be useful in this area, and ClickTime is an online system which covers some of it. Most of them have some kind of time-tracking capability. It really pays off in the long term when you can collate statistics, and you can extend these systems from project tracking to a help desk or other uses.
Don’t let this become a barrier to people. Always make it easy for people to come and talk to you, and don’t try to track or bill these conversations. You want to make yourself as useful as possible and make your group the place people think of first. Don’t make them book appointments - try to have an open-door policy.
Even if you have big projects these systems still work well and you can link tasks together.
The really important thing is to keep track of the work you’ve done. All of your data is electronic, so you need a way to store data, scripts, notes etc.
When your project list grows these systems are also useful to be able to flag up problems in your workflow when jobs have waited too long or have gone on longer than expected.
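The flagging described above can be as simple as scanning the task list for open items older than a threshold. A sketch, with invented field names and example data:

```python
# Flag overdue work: report any open task that has waited longer than a
# threshold. Field names and the example tasks are made up for illustration.
from datetime import date, timedelta

tasks = [
    {"project": "RNAseq-SmithLab", "opened": date(2014, 10, 1), "closed": None},
    {"project": "ChIPseq-JonesLab", "opened": date(2014, 11, 10), "closed": date(2014, 11, 12)},
]

def overdue(tasks, today, max_age=timedelta(days=30)):
    """Return open tasks that have been waiting longer than max_age."""
    return [t for t in tasks if t["closed"] is None and today - t["opened"] > max_age]

for task in overdue(tasks, today=date(2014, 11, 14)):
    print(f"OVERDUE: {task['project']} (opened {task['opened']})")
```

Most project management systems provide this kind of report out of the box, but the logic is the same.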
What software do you need?
Galaxy is a good place to start. Easy way to offer both datasets and tools to people and can be useful for teaching.
You should have a revision control system to keep track of all the scripts and software you write; it also makes it easy to share code. If you use something like git you can use a public repository or keep things private, depending on your needs. There is a GitHub for science, and there are also projects like Gitorious if you want to host this yourself.
You need some way to share data. Set up an FTP server or similar.
Using simple apache web directories can be useful for bigwig and bed files for use in genome browsers.
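As a lightweight stand-in for an Apache directory listing, Python's standard library can serve a data directory over HTTP, so a genome browser that accepts a URL to a bigWig or BED file can point straight at it. The directory path and port below are placeholders (the `directory` argument needs Python 3.7+):

```python
# Serve a data directory over HTTP with the standard library, as a
# quick-and-dirty alternative to an Apache web directory.
# "/data/shared" and port 8000 are placeholders for your own setup.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = partial(SimpleHTTPRequestHandler, directory="/data/shared")
server = HTTPServer(("0.0.0.0", 8000), handler)
# server.serve_forever()  # uncomment to run; blocks until Ctrl-C
```

This has no authentication at all, so like a plain Apache directory it is only suitable for data you are happy to expose on your network.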
You can use Dropbox for small files, or set up ownCloud for larger, more manageable storage.
Have some way to share notes and experience. Evernote works well, but you could use a local wiki or a system such as Confluence. The main thing is not to lose track of your hard-won experience.
For sending data to users you can use Shiny, which is a nice way to share tabular data and interactive graphs. Any Shiny app can be set up to read file directories or SQL databases for its source data and let users download selected data or graphs via simple controls in the user interface.