Bioconductor: open software development for computational biology and

Authors

  • Robert C Gentleman
  • Vincent J Carey
  • Douglas M Bates
  • Ben Bolstad
  • Sandrine Dudoit
  • Byron Ellis
  • Laurent Gautier
  • Yongchao Ge
  • Jeff Gentry
  • Kurt Hornik
  • Torsten Hothorn
  • Wolfgang Huber
  • Stefano Iacus
  • Rafael Irizarry
  • Friedrich Leisch
  • Cheng Li
  • Martin Maechler
  • Anthony J Rossini
  • Gunther Sawitzki
  • Colin Smith
  • Gordon Smyth
  • Luke Tierney
  • Jean YH Yang
  • Jianhua Zhang
Abstract

The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.

Published: 15 September 2004
Genome Biology 2004, 5:R80
Received: 19 April 2004
Revised: 1 July 2004
Accepted: 3 August 2004

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2004/5/10/R80

Background

The Bioconductor project [1] is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics (CBB). Biology, molecular biology in particular, is undergoing two related transformations. First, there is a growing awareness of the computational nature of many biological processes and that computational and statistical models can be used to great benefit. Second, developments in high-throughput data acquisition produce requirements for computational and statistical sophistication at each stage of the biological research pipeline. The main goal of the Bioconductor project is creation of a durable and flexible software development and deployment environment that meets these new conceptual, computational and inferential challenges. We strive to reduce barriers to entry to research in CBB.
A key aim is simplification of the processes by which statistical researchers can explore and interact fruitfully with data resources and algorithms of CBB, and by which working biologists obtain access to and use of state-of-the-art statistical methods for accurate inference in CBB.

Among the many challenges that arise for both statisticians and biologists are tasks of data acquisition, data management, data transformation, data modeling, combining different data sources, making use of evolving machine learning methods, and developing new modeling strategies suitable to CBB. We have emphasized transparency, reproducibility, and efficiency of development in our response to these challenges. Fundamental to all these tasks is the need for software; ideas alone cannot solve the substantial problems that arise.

The primary motivations for an open-source computing environment for statistical genomics are transparency, pursuit of reproducibility and efficiency of development.

Transparency

High-throughput methodologies in CBB are extremely complex, and many steps are involved in the conversion of information from low-level information structures (for example, microarray scan images) to statistical databases of expression measures coupled with design and covariate data. It is not possible to say a priori how sensitive the ultimate analyses are to variations or errors in the many steps in the pipeline. Credible work in this domain requires exposure of the entire process.

Pursuit of reproducibility

Experimental protocols in molecular biology are fully published lists of ingredients and algorithms for creating specific substances or processes. Accuracy of an experimental claim can be checked by complete obedience to the protocol. This standard should be adopted for algorithmic work in CBB. Portable source code should accompany each published analysis, coupled with the data on which the analysis is based.

Efficiency of development

By development, we refer not only to the development of the specific computing resource but to the development of computing methods in CBB as a whole. Software and data resources in an open-source environment can be read by interested investigators, and can be modified and extended to achieve new functionalities. Novices can use the open sources as learning materials. This is particularly effective when good documentation protocols are established. The open-source approach thus aids in recruitment and training of future generations of scientists and software developers.

The rest of this article is devoted to describing the computing science methodology underlying Bioconductor. The main sections detail design methods and specific coding and deployment approaches, describe specific unmet challenges and review limitations and future aims. We then consider a number of other open-source projects that provide software solutions for CBB and end with an example of how one might use Bioconductor software to analyze microarray data.

Results and discussion

Methodology

The software development strategy we have adopted has several precedents. In the mid-1980s Richard Stallman started the Free Software Foundation and the GNU project [2] as an attempt to provide a free and open implementation of the Unix operating system. One of the major motivations for the project was the idea that for researchers in computational sciences "their creations/discoveries (software) should be available for everyone to test, justify, replicate and work on to boost further scientific innovation" [3]. Together with the Linux kernel, the GNU/Linux combination sparked the huge open-source movement we know today. Open-source software is no longer viewed with prejudice; it has been adopted by major information technology companies and has changed the way we think about computational sciences.
A large body of literature exists on how to manage open-source software projects: see Hill [4] for a good introduction and a comprehensive bibliography. One of the key success factors of the Linux kernel is its modular design, which allows for independent and parallel development of code [5] in a virtual decentralized network [3]. Developers are not managed within the hierarchy of a company, but are directly responsible for parts of the project and interact directly (where necessary) to build a complex system [6]. Our organization and development model has attempted to follow these principles, as well as those that have evolved from the R project [7,8].

In this section, we review seven topics important to establishment of a scientific open source software project and discuss them from a CBB point of view: language selection, infrastructure resources, design strategies and commitments,

Similar resources

BeadArray Expression Analysis Using Bioconductor

Illumina whole-genome expression BeadArrays are a popular choice in gene profiling studies. Aside from the vendor-provided software tools for analyzing BeadArray expression data (GenomeStudio/BeadStudio), there exists a comprehensive set of open-source analysis tools in the Bioconductor project, many of which have been tailored to exploit the unique properties of this platform. In this article,...

A Quick Guide to Teaching R Programming to Computational Biology Students

The name "R" refers to the computational environment initially created by Robert Gentleman and Robert Ihaka, similar in nature to the "S" statistical environment developed at Bell Laboratories (http://www.r-project.org/about.html) [1]. It has since been developed and maintained by a strong team of core developers (R-core), who are renowned researchers in computational disciplines. R has ga...

iFlow: A Graphical User Interface for Flow Cytometry Tools in Bioconductor

Flow cytometry (FCM) has become an important analysis technology in health care and medical research, but the large volume of data produced by modern high-throughput experiments has presented significant new challenges for computational analysis tools. The development of an FCM software suite in Bioconductor represents one approach to overcome these challenges. In the spirit of the R programmin...

Bioconductor: an open source framework for bioinformatics and computational biology.

This chapter describes the Bioconductor project and details of its open source facilities for analysis of microarray and other high-throughput biological experiments. Particular attention is paid to concepts of container and workflow design, connections of biological metadata to statistical analysis products, support for statistical quality assessment, and calibration of inference uncertainty m...

RDBMS in Bioinformatics: The Bioconductor Experience

Bioconductor (http://www.bioconductor.org/) is an open source collection of resources aimed at transparently advancing the theory and practice of bioinformatics, with a focus on expression arrays and the R statistical computing environment. I will sketch the key data structures and data flow processes addressed in Bioconductor thus far. I will review the role played by RDBMS in the development ...

Publication date: 2017