
URC Mothur poster


Improving the scalability of Mothur for large metagenomic studies

Kristin Muterspaw ’15, Tara Urner ’16, Nicholas Arnold ’17, Ashutosh Rai ’17, Charlie Peck ’84

What is 16S

The Goal

Mothur is an open-source bioinformatics software package for analyzing microbial DNA sequences. Mothur takes raw 16S sequencer data as input and returns a list of the organisms present at the desired taxonomic level (for example species, family, or class).

Acknowledgements: Chris Smith, Shodor Foundation, Blue Waters UIUC, NCSA

How we use it

Mothur Workflow

The Problem

The Solution

16S is the name given to the portion of prokaryotic DNA that codes for the 16S ribosomal RNA component of prokaryotic ribosomes (pictured right). All prokaryotic organisms possess the 16S rRNA gene, but its sequence differs enough between organisms that the gene can be used for identification, in many cases down to the species level.

The ‘Big Three’ commands in the Mothur source code are parallelized with fork(), the POSIX system call that creates a copy of the running process. The work is split among child processes on the same physical machine, and the results are recombined after every process has finished. Compared with other shared-memory and distributed-memory approaches, fork() makes inefficient use of currently available hardware.
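The fork-and-recombine pattern described above can be sketched in a few lines. This is only an illustration, not Mothur's code: it is written in Python rather than Mothur's C++, the names `fork_map` and `count_gc` are hypothetical, and it assumes a POSIX system where `os.fork()` is available.

```python
import os
import pickle


def count_gc(seq):
    """Toy per-sequence task (hypothetical): count G and C bases."""
    return sum(1 for base in seq if base in "GC")


def fork_map(func, items, workers=2):
    """Apply func to every item using fork()ed children on one machine."""
    children = []
    for w in range(workers):
        chunk = items[w::workers]          # round-robin share for worker w
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child process: compute its share, send it back, then exit.
            status = 1
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump([func(x) for x in chunk], out)
                status = 0
            finally:
                os._exit(status)
        # Parent process: keep only the read end of this child's pipe.
        os.close(write_fd)
        children.append((pid, read_fd))

    # Recombine: collect each child's partial result in the original order.
    results = [None] * len(items)
    for w, (pid, read_fd) in enumerate(children):
        with os.fdopen(read_fd, "rb") as inp:
            results[w::workers] = pickle.load(inp)
        os.waitpid(pid, 0)
    return results
```

Every child here lives on the same machine as its parent, so this pattern cannot use the memory or cores of additional cluster nodes.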

Our goal is to improve the scalability of Mothur for use in metagenomic studies. This means enabling Mothur to handle larger data sets and to make better use of modern hardware.

We use Mothur at Earlham as part of projects in Biology and Computer Science to process the 16S rRNA sequencer output generated from soil and leaf samples collected from Wayne County farm fields, archeological sites in Iceland, and coffee farms in Nicaragua.

Strong Scaling

Strong scaling is how the solution time varies when the number of processors used to solve a problem changes but the problem size does not. Using analysis tools to profile run times on our smaller cluster, the Blue Waters supercomputer, and XSEDE resources, we determined how Mothur commands scale as computational resources increase.

Weak Scaling

Weak scaling is how the solution time varies when the problem size changes in proportion to the number of processors. We are now investigating weak scaling of the ‘Big Three’ in Mothur.
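Both kinds of scaling are usually summarized with a few standard ratios: speedup T(1)/T(p) and efficiency T(1)/(p·T(p)) for strong scaling, and efficiency T(1)/T(p) for weak scaling. The sketch below computes them from run times. The function names are illustrative, and the timings are made-up placeholders, not measurements from this project.

```python
def strong_scaling(times):
    """times maps processor count -> seconds for a FIXED problem size.

    Returns {p: (speedup, efficiency)}; the table must include p = 1.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / (p * t)) for p, t in sorted(times.items())}


def weak_scaling(times):
    """times maps processor count -> seconds when the problem size grows
    in proportion to the processor count.

    Returns {p: efficiency}; 1.0 means the run time stayed constant.
    """
    t1 = times[1]
    return {p: t1 / t for p, t in sorted(times.items())}


# Hypothetical timings in seconds, for illustration only.
strong_times = {1: 100.0, 2: 50.0, 4: 40.0}
weak_times = {1: 100.0, 2: 100.0, 4: 125.0}

strong = strong_scaling(strong_times)
weak = weak_scaling(weak_times)
```

Under ideal strong scaling the speedup equals the processor count. Under ideal weak scaling the run time stays flat, so the efficiency stays at 1.0.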

What is Mothur

We run Mothur on computers that have large amounts of random access memory (RAM) and are optimized for running scientific programs. The Mothur workflow is very memory- and time-intensive: three of its commands, the ‘Big Three’, take up the most time and memory in the workflow (see below).

Scaling of the ‘Big Three’

Our current aim is to modify the ‘Big Three’ and replace less efficient attempts at parallelism, such as fork(), with a combination of shared-memory and distributed-memory parallel programming. Options we are considering include OpenMP threading and MPI.
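Whichever option is chosen, the input sequences have to be divided among threads or processes. Below is a minimal sketch of one common choice, contiguous block partitioning, in plain Python. The name `block_range` is hypothetical, `rank` and `size` simply follow MPI naming convention, and nothing here calls OpenMP or MPI or reflects Mothur's actual code.

```python
def block_range(n_items, size, rank):
    """Return the half-open index range [start, stop) owned by one worker.

    n_items: total number of sequences to process
    size:    number of workers (threads or processes)
    rank:    this worker's id, from 0 to size - 1
    The first n_items % size workers each receive one extra item.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 10 sequences split across 4 workers.
ranges = [block_range(10, 4, rank) for rank in range(4)]
```

Each worker can compute its own range from just `n_items`, `size`, and `rank`, so the workers do not need to communicate to agree on who processes which sequences.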