Bioinformatics Services

Bioinformatics Services (Company):

Forensic Bioinformatics

A to Z of Bioinformatics Services, EBI

Scionics Computer Innovation GmbH

BIDMC Genomics and Proteomics Center

Expression Analysis

Sequentix

Craic Computing

Bioinformatics - recent issues

Bioinformatics - RSS feed of recent issues (covers the latest 3 issues, including the current issue)

Summary: Recently, several high-profile studies collected cell viability data from panels of cancer cell lines treated with many drugs applied at different concentrations. Such drug sensitivity data for cancer cell lines can suggest treatments for different types and subtypes of cancer. Visualizing these datasets can reveal patterns that are not obvious from examining the data directly. Here we introduce Drug/Cell-line Browser (DCB), an online interactive HTML5 data visualization tool for interacting with three recently published datasets of cancer cell line/drug-viability studies. DCB uses clustering and canvas visualization of the drugs and the cell lines, as well as a bar graph that summarizes drug effectiveness for the tissue of origin or the cancer subtypes for single or multiple drugs. DCB can help in understanding drug response patterns and prioritizing drug/cancer cell line interactions by tissue of origin or cancer subtype.

Availability and implementation: DCB is an open source Web-based tool that is freely available at: http://www.maayanlab.net/LINCS/DCB

Contact: avi.maayan@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Duan, Q., Wang, Z., Fernandez, N. F., Rouillard, A. D., Tan, C. M., Benes, C. H., Ma'ayan, A.
Posted: November 5, 2014, 11:27 am

Summary: Automated analysis of imaged phenotypes enables fast and reproducible quantification of biologically relevant features. Despite recent developments, recordings of complex networked structures, such as leaf venation patterns, cytoskeletal structures or traffic networks, remain challenging to analyze. Here we illustrate the applicability of img2net to automatically analyze such structures by reconstructing the underlying network, computing relevant network properties and statistically comparing networks of different types or under different conditions. The software can be readily used for analyzing image data of arbitrary 2D and 3D network-like structures.

Availability and Implementation: img2net is open-source software under the GPL and can be downloaded from http://mathbiol.mpimp-golm.mpg.de/img2net/, where supplementary information and datasets for testing are provided.

Contact: breuer@mpimp-golm.mpg.de

Author: Breuer, D., Nikoloski, Z.
Posted: November 5, 2014, 11:27 am

Summary: Because cancer has heterogeneous clinical behaviors due to the progressive accumulation of multiple genetic and epigenetic alterations, the identification of robust molecular signatures for predicting cancer outcome is profoundly important. Here, we introduce the APPEX Web-based analysis platform as a versatile tool for identifying prognostic molecular signatures that predict cancer diversity. We incorporated most of the statistical methods commonly used for survival analysis and implemented seven survival analysis workflows, including CoxSingle, CoxMulti, IntransSingle, IntransMulti, SuperPC, TimeRoc and multivariate. A total of 236 publicly available datasets were collected, processed and stored to support easy independent validation of prognostic signatures. Two case studies, on disease recurrence and bladder cancer progression, are described using different combinations of the seven workflows.

Availability and implementation: APPEX is freely available at http://www.appex.kr.

Contact: kimsy@kribb.re.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Kim, S.-K., Hwan Kim, J., Yun, S.-J., Kim, W.-J., Kim, S.-Y.
Posted: November 5, 2014, 11:27 am

Summary: The application of protein–protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of >97% strong scaling.

Availability and Implementation: MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.

Contact: akiyama@cs.titech.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ohue, M., Shimoda, T., Suzuki, S., Matsuzaki, Y., Ishida, T., Akiyama, Y.
Posted: November 5, 2014, 11:27 am

Summary: Non-targeted metabolomics technologies often yield data in which abundance for any given metabolite is observed and quantified for some samples and reported as missing for other samples. Apparent missingness can be due to true absence of the metabolite in the sample or presence at a level below detectability. Mixture-model analysis can formally account for metabolite ‘missingness’ due to absence or undetectability, but software for this type of analysis in the high-throughput setting is limited. The R package metabomxtr has been developed to facilitate mixture-model analysis of non-targeted metabolomics data in which only a portion of samples have quantifiable abundance for certain metabolites.

Availability and implementation: metabomxtr is available through Bioconductor. It is released under the GPL-2 license.

Contact: dscholtens@northwestern.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Nodzenski, M., Muehlbauer, M. J., Bain, J. R., Reisetter, A. C., Lowe, W. L., Scholtens, D. M.
Posted: November 5, 2014, 11:27 am

Summary: AliView is an alignment viewer and editor designed to meet the requirements of next-generation sequencing era phylogenetic datasets. AliView handles alignments of unlimited size in the formats most commonly used, i.e. FASTA, Phylip, Nexus, Clustal and MSF. The intuitive graphical interface makes it easy to inspect, sort, delete, merge and realign sequences as part of the manual filtering process of large datasets. AliView also works as an easy-to-use alignment editor for small as well as large datasets.

Availability and implementation: AliView is released as open-source software under the GNU General Public License, version 3.0 (GPLv3), and is available at GitHub (www.github.com/AliView). The program is cross-platform and extensively tested on Linux, Mac OS X and Windows systems. Downloads and help are available at http://ormbunkar.se/aliview

Contact: anders.larsson@ebc.uu.se

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Larsson, A.
Posted: November 5, 2014, 11:27 am

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads of averaged kilobases in length.
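ropebwt2's incremental, implicitly sorting construction is far more involved, but the transform it builds on can be shown in a few lines. The sketch below is the textbook rotation-sort construction of the Burrows-Wheeler transform that underlies the FM-index; it is O(n² log n) and purely illustrative:

```python
def bwt(text):
    """Naive Burrows-Wheeler transform via sorted rotations.

    The FM-index is built on this transform; real tools such as
    ropebwt2 construct it incrementally without materializing
    rotations. This sketch is for illustration only.
    """
    s = text + "$"  # unique, lexicographically smallest sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

For example, `bwt("banana")` yields `"annb$aa"`, the last column of the sorted rotation matrix.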

Availability and implementation: https://github.com/lh3/ropebwt2

Contact: hengli@broadinstitute.org

Author: Li, H.
Posted: November 5, 2014, 11:27 am

Motivation: Kotai Antibody Builder is a Web service for tertiary structural modeling of antibody variable regions. It consists of three main steps: hybrid template selection by sequence alignment and canonical rules, 3D rendering of alignments and CDR-H3 loop modeling. For the last step, in addition to rule-based heuristics used to build the initial model, a refinement option is available that uses fragment assembly followed by knowledge-based scoring. Using targets from the Second Antibody Modeling Assessment, we demonstrate that Kotai Antibody Builder generates models with an overall accuracy equal to that of the best-performing semi-automated predictors using expert knowledge.

Availability and implementation: Kotai Antibody Builder is available at http://kotaiab.org

Contact: standley@ifrec.osaka-u.ac.jp

Author: Yamashita, K., Ikeda, K., Amada, K., Liang, S., Tsuchiya, Y., Nakamura, H., Shirai, H., Standley, D. M.
Posted: November 5, 2014, 11:27 am

Summary: There have been numerous applications developed for decoding and visualization of ab1 DNA sequencing files for Windows and Mac platforms, yet none exists for the increasingly popular smartphone operating systems. The ability to decode sequencing files cannot easily be carried out using browser-accessed Web tools. To overcome this hurdle, we have developed a new native app called DNAApp that can decode and display ab1 sequencing files on Android and iOS. In addition to in-built analysis tools such as reverse complementation, protein translation and searching for specific sequences, we have incorporated convenient functions that facilitate the harnessing of online Web tools for a full range of analysis. Given the high usage of Android/iOS tablets and smartphones, such bioinformatics apps would raise productivity and help meet the high demand for analyzing sequencing data in biomedical research.

Availability and implementation: The Android version of DNAApp is available in Google Play Store as ‘DNAApp’, and the iOS version is available in the App Store. More details on the app can be found at www.facebook.com/APDLab; www.bii.a-star.edu.sg/research/trd/apd.php

The DNAApp user guide is available at http://tinyurl.com/DNAAppuser, and a video tutorial is available on Google Play Store and App Store, as well as on the Facebook page.

Contact: samuelg@bii.a-star.edu.sg

Author: Nguyen, P.-V., Verma, C. S., Gan, S. K.-E.
Posted: November 5, 2014, 11:27 am

Summary: Basic4Cseq is an R/Bioconductor package for basic filtering, analysis and subsequent near-cis visualization of 4C-seq data. The package processes aligned 4C-seq raw data stored in binary alignment/map (BAM) format and maps the short reads to a corresponding virtual fragment library. Functions are included to create virtual fragment libraries providing chromosome position and further information on 4C-seq fragments (length and uniqueness of the fragment ends, and blindness of a fragment) for any BSGenome package. An optional filter is included for BAM files to remove invalid 4C-seq reads, and further filter functions are offered for 4C-seq fragments. Additionally, basic quality controls based on the read distribution are included. Fragment data in the vicinity of the experiment's viewpoint are visualized as a coverage plot based on a running median approach and a multi-scale contact profile. Wig files or csv files of the fragment data can be exported for further analyses and visualizations of interactions with other programs.

Availability and implementation: Basic4Cseq is implemented in R and available at http://www.bioconductor.org/. A vignette with detailed descriptions of the functions is included in the package.

Contact: Carolin.Walter@uni-muenster.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Walter, C., Schuetzmann, D., Rosenbauer, F., Dugas, M.
Posted: November 5, 2014, 11:27 am

Summary: Next-generation sequencing (NGS) has large potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in recent years. However, although highly accurate, these computational models lack the computational efficiency needed to reach their full potential.

In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model named gCUP, parallelized and optimized for GPU, is highly accurate and can classify >175 000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step to enable NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction.

Availability and implementation: The source code can be downloaded at http://www.heiderlab.de

Contact: d.heider@wz-straubing.de

Author: Olejnik, M., Steuwer, M., Gorlatch, S., Heider, D.
Posted: November 5, 2014, 11:27 am

Motivation: Gene models from draft genome assemblies of metazoan species are often incorrect, missing exons or entire genes, particularly for large gene families. Consequently, labour-intensive manual curation is often necessary. We present Figmop (Finding Genes using Motif Patterns) to help with the manual curation of gene families in draft genome assemblies. The program uses a pattern of short sequence motifs to identify putative genes directly from the genome sequence. Using a large gene family as a test case, Figmop was found to be more sensitive and specific than a BLAST-based approach. The visualization used allows the validation of potential genes to be carried out quickly and easily, saving hours if not days from an analysis.

Availability and implementation: Source code of Figmop is freely available for download at https://github.com/dave-the-scientist, implemented in C and Python and is supported on Linux, Unix and MacOSX.

Contact: curran.dave.m@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Curran, D. M., Gilleard, J. S., Wasmuth, J. D.
Posted: November 5, 2014, 11:27 am

Motivation: The ability to accurately read the order of nucleotides in DNA and RNA is fundamental for modern biology. Errors in next-generation sequencing can lead to many artifacts, from erroneous genome assemblies to mistaken inferences about RNA editing. Uneven coverage in datasets also contributes to false corrections.

Results: We introduce Trowel, a massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short read sequencing applications. The latency in the data path has been significantly reduced because of efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size, read length and fragment size.
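The k-mer spectrum idea can be sketched compactly: count all k-mers in the read set, call a k-mer "solid" if it is seen often enough, and repair a base when a substitution makes every k-mer covering that position solid. This is a toy, stdlib-only version of spectrum-based correction; Trowel's actual algorithm additionally exploits base qualities and is heavily parallelized:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers across a read set (the k-mer spectrum)."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, solid=2):
    """Greedy single-substitution correction: replace a base when some
    substitution makes every k-mer covering that position solid
    (count >= solid). Illustrative only, not Trowel's algorithm."""
    read = list(read)
    for i in range(len(read)):
        window = range(max(0, i - k + 1), min(i, len(read) - k) + 1)
        if all(counts["".join(read[j:j + k])] >= solid for j in window):
            continue  # every covering k-mer is solid; leave base alone
        for base in "ACGT":
            trial = read[:i] + [base] + read[i + 1:]
            if all(counts["".join(trial[j:j + k])] >= solid for j in window):
                read = trial
                break
    return "".join(read)
```

With three correct copies of `"ACGTACGT"` and one read carrying a single error (`"ACGAACGT"`), the erroneous read's weak k-mers pinpoint the bad base and the substitution `A→T` restores all-solid coverage.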

Availability and implementation: Trowel is written in C++ and is provided under the General Public License v3.0 (GPLv3). It is available at http://trowel-ec.sourceforge.net.

Contact: euncheon.lim@tue.mpg.de or weigel@tue.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Lim, E.-C., Muller, J., Hagmann, J., Henz, S. R., Kim, S.-T., Weigel, D.
Posted: November 5, 2014, 11:27 am

Motivation: Whole-exome sequencing (WES) has opened up previously unheard of possibilities for identifying novel disease genes in Mendelian disorders, only about half of which have been elucidated to date. However, interpretation of WES data remains challenging.

Results: Here, we analyze protein–protein association (PPA) networks to identify candidate genes in the vicinity of genes previously implicated in a disease. The analysis, using a random-walk with restart (RWR) method, is adapted to the setting of WES by developing a composite variant-gene relevance score based on the rarity, location and predicted pathogenicity of variants and the RWR evaluation of genes harboring the variants. Benchmarking using known disease variants from 88 disease-gene families reveals that the correct gene is ranked among the top 10 candidates in ≥50% of cases, a figure which we confirmed using a prospective study of disease genes identified in 2012 and PPA data produced before that date. We implement our method in a freely available Web server, ExomeWalker, that displays a ranked list of candidates together with information on PPAs, frequency and predicted pathogenicity of the variants to allow quick and effective searches for candidates that are likely to reward closer investigation.
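The network-propagation step can be sketched on its own. Below is a minimal random-walk-with-restart on an adjacency-list graph: at each step the walker either follows an edge or restarts at a seed (known disease) gene, and the stationary visiting probabilities rank candidate genes by proximity to the seeds. This illustrates only the RWR component, not ExomeWalker's composite variant-gene score:

```python
def random_walk_restart(adj, seeds, restart=0.5, iters=100):
    """Random walk with restart on an undirected graph.

    adj:   {node: [neighbours]} adjacency lists.
    seeds: genes already implicated in the disease.
    Returns a dict of steady-state visiting probabilities per node.
    """
    nodes = list(adj)
    p = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            share = p[n] / len(adj[n]) if adj[n] else 0.0
            for m in adj[n]:
                nxt[m] += (1 - restart) * share  # follow an edge
        for s in seeds:
            nxt[s] += restart / len(seeds)       # restart at a seed
        p = nxt
    return p
```

Genes close to the seeds accumulate probability mass; distant genes receive little, which is what makes the score useful for ranking exome candidates.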

Availability and implementation: http://compbio.charite.de/ExomeWalker

Contact: peter.robinson@charite.de

Author: Smedley, D., Kohler, S., Czeschik, J. C., Amberger, J., Bocchini, C., Hamosh, A., Veldboer, J., Zemojtel, T., Robinson, P. N.
Posted: November 5, 2014, 11:27 am

Motivation: This article presents Thresher, an improved technique for finding peak height thresholds for automated rRNA intergenic spacer analysis (ARISA) profiles. We argue that thresholds must be sample dependent, taking community richness into account. In most previous fragment analyses, a common threshold is applied to all samples simultaneously, ignoring richness variations among samples and thereby compromising cross-sample comparison. Our technique solves this problem, and at the same time provides a robust method for outlier rejection, selecting for removal any replicate pairs that are not valid replicates.

Results: Thresholds are calculated individually for each replicate in a pair, and separately for each sample. The thresholds are selected to be the ones that minimize the dissimilarity between the replicates after thresholding. If a choice of threshold results in the two replicates in a pair failing a quantitative test of similarity, either that threshold or that sample must be rejected. We compare thresholded ARISA results with sequencing results, and demonstrate that the Thresher algorithm outperforms conventional thresholding techniques.
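The selection criterion above can be sketched directly: for each candidate threshold, zero out peaks at or below it in both replicates and keep the threshold that minimizes a dissimilarity measure between the thresholded profiles. The sketch below uses Bray-Curtis dissimilarity as the replicate-comparison metric; this is an assumption for illustration, not necessarily the measure Thresher uses, and it omits the outlier-rejection step:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two peak-height profiles,
    given as {fragment_length: height} dicts."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total if total else 0.0

def best_threshold(rep1, rep2, candidates=range(0, 6)):
    """Pick the peak-height threshold minimizing dissimilarity between
    two replicates after thresholding (per-sample, as in Thresher)."""
    def cut(profile, t):
        return {k: h for k, h in profile.items() if h > t}
    return min(candidates,
               key=lambda t: bray_curtis(cut(rep1, t), cut(rep2, t)))
```

Noise peaks that appear in only one replicate inflate the dissimilarity, so the chosen threshold is the smallest one that removes them while keeping the shared signal peaks.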

Availability and Implementation: The software is implemented in R, and the code is available at http://verenastarke.wordpress.com or by contacting the author.

Contact: vstarke@ciw.edu or http://verenastarke.wordpress.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Starke, V., Steele, A.
Posted: November 5, 2014, 11:27 am

Motivation: Elementary flux mode (EFM) is a useful tool in constraint-based modeling of metabolic networks. The property that every flux distribution can be decomposed as a weighted sum of EFMs allows certain applications of EFMs to studying flux distributions. The existence of biologically infeasible EFMs and the non-uniqueness of the decomposition, however, undermine the applicability of such methods. Efforts have been made to find biologically feasible EFMs by incorporating information from transcriptional regulation and thermodynamics. Yet, no attempt has been made to distinguish biologically feasible EFMs by considering their graphical properties. A previous study on the transcriptional regulation of metabolic genes found that distinct branches at a branch point metabolite usually belong to distinct metabolic pathways. This suggests an intuitive property of biologically feasible EFMs, i.e. minimal branching.

Results: We developed the concept of minimal branching EFM and derived the minimal branching decomposition (MBD) to decompose flux distributions. Testing in the core Escherichia coli metabolic network indicated that MBD can distinguish branches at branch points and greatly reduced the solution space in which the decomposition is often unique. An experimental flux distribution from a previous study on mouse cardiomyocyte was decomposed using MBD. Comparison with decomposition by a minimum number of EFMs showed that MBD found EFMs more consistent with established biological knowledge, which facilitates interpretation. Comparison of the methods applied to a complex flux distribution in Lactococcus lactis similarly showed the advantages of MBD. The minimal branching EFM concept underlying MBD should be useful in other applications.

Contact: sinhu@bio.dtu.dk or p.ji@polyu.edu.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Chan, S. H. J., Solem, C., Jensen, P. R., Ji, P.
Posted: November 5, 2014, 11:27 am

Motivation: Recent studies on human disease have revealed that aberrant interaction between proteins probably underlies a substantial number of human genetic diseases. This suggests a need to investigate disease inheritance mode using interaction data and, on that basis, to revisit our conceptual understanding of a series of properties regarding the inheritance mode of human disease.

Results: We observed a strong correlation between the number of protein interactions and the likelihood of a gene causing any dominant diseases or multiple dominant diseases, whereas no correlation was observed between protein interaction and the likelihood of a gene causing recessive diseases. We found that dominant diseases are more likely to be associated with disruption of important interactions. These findings suggest that inheritance mode should be understood in terms of protein interaction. We therefore reviewed the previous studies and refined an interaction model of inheritance mode, and then confirmed that this model is largely reasonable using new evidence. With these findings, we found that the inheritance mode of human genetic diseases can be predicted using protein interaction. By integrating the systems biology perspectives with the classical disease genetics paradigm, our study provides some new insights into genotype–phenotype correlations.

Contact: haodapeng@ems.hrbmu.edu.cn or biofomeng@hotmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hao, D., Li, C., Zhang, S., Lu, J., Jiang, Y., Wang, S., Zhou, M.
Posted: November 5, 2014, 11:27 am

Motivation: A popular method for classification of protein domain movements apportions them into two main types: those with a ‘hinge’ mechanism and those with a ‘shear’ mechanism. The intuitive assignment of domain movements to these classes has limited the number of domain movements that can be classified in this way. Furthermore, whether intended or not, the term ‘shear’ is often interpreted to mean a relative translation of the domains.

Results: Numbers of occurrences of four different types of residue contact changes between domains were optimally combined by logistic regression using the training set of domain movements intuitively classified as hinge and shear to produce a predictor for hinge and shear. This predictor was applied to give a 10-fold increase in the number of examples over the number previously available with a high degree of precision. It is shown that overall a relative translation of domains is rare, and that there is no difference between hinge and shear mechanisms in this respect. However, the shear set contains significantly more examples of domains having a relative twisting movement than the hinge set. The angle of rotation is also shown to be a good discriminator between the two mechanisms.

Availability and implementation: Results are free to browse at http://www.cmp.uea.ac.uk/dyndom/interface/.

Contact: sjh@cmp.uea.ac.uk.

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Taylor, D., Cawley, G., Hayward, S.
Posted: November 5, 2014, 11:27 am

Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test—a score test—with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods.

Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test—up to 23 more associations—whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500.

Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/.

Contact: heckerma@microsoft.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Lippert, C., Xiang, J., Horta, D., Widmer, C., Kadie, C., Heckerman, D., Listgarten, J.
Posted: November 5, 2014, 11:27 am

Motivation: Individuals in each family are genetically more homogeneous than unrelated individuals, and family-based designs are often recommended for the analysis of rare variants. However, despite the importance of analyzing family-based samples, few statistical methods for rare variant association analysis are available.

Results: In this report, we propose a FAmily-based Rare Variant Association Test (FARVAT). FARVAT is based on the quasi-likelihood of whole families, and is statistically and computationally efficient for extended families. FARVAT assumes that families were ascertained based on the disease status of family members, and incorporating the estimated genetic relationship matrix into the proposed method provides robustness in the presence of population substructure. Depending on the choice of working matrix, our method can be a burden test or a variance component test, and can be extended to the SKAT-O-type statistic. FARVAT was implemented in C++, and application of the proposed method to schizophrenia data and simulated data for GAW17 illustrates its practical importance.
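The "burden" choice of working matrix has a simple intuition: collapse each subject's rare minor-allele counts into a single burden score and test whether cases carry a higher mean burden than controls. The stdlib sketch below shows that collapsing step with a crude z-like statistic; FARVAT's quasi-likelihood additionally models family relatedness, which this toy deliberately ignores:

```python
import statistics

def burden_statistic(genotypes, phenotypes):
    """Toy burden test: collapse per-variant minor-allele counts into
    one burden per subject, then compare case vs control means with a
    pooled-variance z-like statistic.

    genotypes:  list of per-subject lists of minor-allele counts.
    phenotypes: 1 = case, 0 = control.
    """
    burdens = [sum(g) for g in genotypes]
    cases = [b for b, y in zip(burdens, phenotypes) if y == 1]
    ctrls = [b for b, y in zip(burdens, phenotypes) if y == 0]
    diff = statistics.mean(cases) - statistics.mean(ctrls)
    pooled = statistics.pvariance(burdens)
    se = (pooled * (1 / len(cases) + 1 / len(ctrls))) ** 0.5
    return diff / se if se else 0.0
```

A variance component test (the other choice of working matrix) would instead test the variance of per-variant effects, which retains power when effects go in both directions.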

Availability: The software calculates various statistics for the analysis of related samples, and it is freely downloadable from http://healthstats.snu.ac.kr/software/farvat.

Contact: won1@snu.ac.kr or tspark@stats.snu.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Choi, S., Lee, S., Cichon, S., Nothen, M. M., Lange, C., Park, T., Won, S.
Posted: November 5, 2014, 11:27 am

Motivation: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, shift and addition. Bit-parallelism has been successfully applied to the longest common subsequence (LCS) and edit-distance problems, producing fast algorithms in practice.

Results: We have developed BitPAl, a bit-parallel algorithm for general, integer-scoring global alignment. Integer-scoring schemes assign integer weights for match, mismatch and insertion/deletion. The BitPAl method uses structural properties in the relationship between adjacent scores in the scoring matrix to construct classes of efficient algorithms, each designed for a particular set of weights. In timed tests, we show that BitPAl runs 7–25 times faster than a standard iterative algorithm.
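For flavor, here is the classic unit-cost special case that BitPAl generalizes: the Myers/Hyyrö bit-parallel recurrence for edit distance, in which each DP column is encoded as +1/-1 delta bit vectors and updated with a handful of word operations. This is not BitPAl's integer-scoring construction, only the well-known ancestor of the approach:

```python
def bitparallel_edit_distance(p, t):
    """Unit-cost Levenshtein distance via the Myers/Hyyro bit-parallel
    recurrence. Vertical +1/-1 deltas of a DP column are stored in the
    bit vectors pv/mv and updated per text character with logic ops."""
    if not p:
        return len(t)
    m = len(p)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}  # per-character match bitmask over the pattern
    for i, c in enumerate(p):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m
    for c in t:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask
        mh = pv & xh
        if ph & high:        # bottom cell increased
            score += 1
        elif mh & high:      # bottom cell decreased
            score -= 1
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

Each text character costs a constant number of word operations regardless of pattern length (up to the word size), which is where the speedup over the cell-by-cell iterative algorithm comes from.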

Availability and implementation: Source code is freely available for download at http://lobstah.bu.edu/BitPAl/BitPAl.html. BitPAl is implemented in C and runs on all major operating systems.

Contact: jloving@bu.edu or yhernand@bu.edu or gbenson@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Loving, J., Hernandez, Y., Benson, G.
Posted: November 5, 2014, 11:27 am

Motivation: The clonal theory of adaptive immunity proposes that immunological responses are encoded by increases in the frequency of lymphocytes carrying antigen-specific receptors. In this study, we measure the frequency of different T-cell receptors (TcR) in CD4+ T-cell populations of mice immunized with a complex antigen, killed Mycobacterium tuberculosis, using high-throughput parallel sequencing of the TcRβ chain. Our initial hypothesis that immunization would induce repertoire convergence proved to be incorrect, and therefore an alternative approach was developed that allows accurate stratification of TcR repertoires and provides novel insights into the nature of CD4+ T-cell receptor recognition.

Results: To track the changes induced by immunization within this heterogeneous repertoire, the sequence data were classified by counting the frequency of different clusters of short (3 or 4) continuous stretches of amino acids within the antigen binding complementarity determining region 3 (CDR3) repertoire of different mice. Both unsupervised (hierarchical clustering) and supervised (support vector machine) analyses of these different distributions of sequence clusters differentiated between immunized and unimmunized mice with 100% efficiency. The CD4+ TcR repertoires of mice 5 and 14 days postimmunization were clearly different from that of unimmunized mice but were not distinguishable from each other. However, the repertoires of mice 60 days postimmunization were distinct both from naive mice and the day 5/14 animals. Our results reinforce the remarkable diversity of the TcR repertoire, resulting in many diverse private TcRs contributing to the T-cell response even in genetically identical mice responding to the same antigen. However, specific motifs defined by short stretches of amino acids within the CDR3 region may determine TcR specificity and define a new approach to TcR sequence classification.
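The feature representation behind this classification can be sketched simply: slide a window of 3-4 residues over every CDR3 sequence in a repertoire and tabulate the resulting k-mer frequencies. The sketch below counts raw amino-acid k-mers; the study goes further by grouping similar k-mers into clusters before the clustering/SVM step:

```python
from collections import Counter

def cdr3_motif_profile(cdr3_sequences, k=3):
    """Relative frequency of short amino-acid stretches (k-mers)
    across a repertoire's CDR3 sequences - the per-mouse feature
    vector fed to hierarchical clustering or an SVM."""
    profile = Counter()
    for seq in cdr3_sequences:
        for i in range(len(seq) - k + 1):
            profile[seq[i:i + k]] += 1
    total = sum(profile.values())
    return {kmer: n / total for kmer, n in profile.items()}
```

Two repertoires can then be compared by any vector distance over these profiles, independent of which exact private TcR clones each mouse happens to carry.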

Availability and implementation: The analysis was implemented in R and Python, and source code can be found in Supplementary Data.

Contact: b.chain@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Thomas, N., Best, K., Cinelli, M., Reich-Zeliger, S., Gal, H., Shifrut, E., Madi, A., Friedman, N., Shawe-Taylor, J., Chain, B.
Posted: November 5, 2014, 11:27 am

Motivation: A proper target or marker is essential in any diagnosis (e.g. an infection or cancer). An ideal diagnostic target should be both conserved in and unique to the pathogen. Currently, these targets can only be identified manually, which is time-consuming and usually error-prone. Because of the increasingly frequent occurrences of emerging epidemics and multidrug-resistant ‘superbugs’, a rapid diagnostic target identification process is needed.

Results: A new method that can identify uniquely conserved regions (UCRs) as candidate diagnostic targets for a selected group of organisms solely from their genomic sequences has been developed and successfully tested. Using a sequence-indexing algorithm to identify UCRs and a k-mer integer-mapping model for computational efficiency, this method has successfully identified UCRs within the bacteria domain for 15 test groups, including pathogenic, probiotic, commensal and extremophilic bacterial species or strains. Based on the identified UCRs, new diagnostic primer sets were designed, and their specificity and efficiency were tested by polymerase chain reaction amplifications from both pure isolates and samples containing mixed cultures.
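The k-mer integer-mapping idea is standard and easy to illustrate: each DNA base fits in 2 bits, so a k-mer maps to a single integer that can be used directly as an array index or hash key. This sketch shows the encoding only; the paper's UCR-finding index is built on top of such a mapping:

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"

def kmer_to_int(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value

def int_to_kmer(value, k):
    """Inverse mapping: recover the k-mer from its integer code."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[value & 3])  # low 2 bits = last base
        value >>= 2
    return "".join(reversed(bases))
```

Integer codes make k-mer lookup a constant-time array or hash operation instead of a string comparison, which is where the computational efficiency comes from.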

Availability and implementation: The UCRs identified for the 15 bacterial species are now freely available at http://ucr.synblex.com. The source code of the programs used in this study is accessible at http://ucr.synblex.com/bacterialIdSourceCode.d.zip

Contact: yazhousun@synblex.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhang, Y., Sun, Y.
Posted: November 5, 2014, 11:27 am

Motivation: Clustering methods can be useful for automatically grouping documents into meaningful clusters, improving human comprehension of a document collection. Although there are clustering algorithms that can achieve the goal for relatively large document collections, they do not always work well for small and homogenous datasets.

Methods: In this article, we present Retro—a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. Further, the statistical significance of candidate clusters is tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering.
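The hypergeometric test behind key-phrase discovery can be sketched as follows: of N documents overall, K contain a candidate phrase; a candidate cluster of n documents contains k of them, and a small upper-tail probability indicates enrichment. This is illustrative only; the paper's exact statistic and multiple-testing correction may differ:

```python
from math import comb

def hypergeom_sf(N, K, n, k):
    # P(X >= k) for X ~ Hypergeometric(N, K, n).
    # math.comb returns 0 when the second argument exceeds the first,
    # which silently handles impossible terms in the sum.
    denom = comb(N, n)
    return sum(
        comb(K, x) * comb(N - K, n - x)
        for x in range(k, min(K, n) + 1)
    ) / denom
```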

Results: We test our system on five disease datasets from OMIM® and evaluate the results based on MeSH® term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of quality of clusters. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene® database, a resource in PubMed®. Empirical evaluation confirms the method is useful for small homogenous datasets in producing meaningful clusters with descriptive titles.

Availability and implementation: A web-based demonstration of the algorithm applied to a collection of sets from the HomoloGene database is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/CLUSTERING_HOMOLOGENE/index.html.

Contact: lana.yeganova@nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Yeganova, L., Kim, W., Kim, S., Wilbur, W. J.
Posted: November 5, 2014, 11:27 am

Summary: The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest+, a web-based interactive knowledge exploration platform with significant advances over its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest+ enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest+ addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest+ through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human conditions, such as obesity and ageing.

Availability: The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest.

Contact: g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Papanikolaou, N., Pavlopoulos, G. A., Pafilis, E., Theodosiou, T., Schneider, R., Satagopam, V. P., Ouzounis, C. A., Eliopoulos, A. G., Promponas, V. J., Iliopoulos, I.
Posted: November 5, 2014, 11:27 am

Motivation: Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30–60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts.

Results: In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package ‘MLbias’ and all source files are publicly available.
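A simplified version of learning-curve fitting by inverse power law might look like the sketch below: the error rate is modeled as err(n) ≈ a·n^(−b) and fitted by least squares in log-log space, which also permits extrapolation to larger sample sizes. The MLbias package's actual model may differ (e.g. by including an asymptote term); names here are illustrative:

```python
import math

def fit_inverse_power_law(ns, errors):
    # Fit err(n) ~ a * n**(-b): linear regression of log(err) on log(n).
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

def extrapolate(a, b, n):
    # Predicted error rate at a (possibly larger) sample size n.
    return a * n ** (-b)
```

The fitted curve supports exactly the practical question named in the abstract: whether recruiting more samples is likely to improve the classifier.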

Availability and implementation: tsenglab.biostat.pitt.edu/software.htm.

Contact: ctseng@pitt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ding, Y., Tang, S., Liao, S. G., Jia, J., Oesterreich, S., Lin, Y., Tseng, G. C.
Posted: November 5, 2014, 11:27 am

Motivation: The identification of active transcriptional regulatory elements is crucial to understand regulatory networks driving cellular processes such as cell development and the onset of diseases. It has recently been shown that chromatin structure information, such as DNase I hypersensitivity (DHS) or histone modifications, significantly improves cell-specific predictions of transcription factor binding sites. However, no method has so far successfully combined both DHS and histone modification data to perform active binding site prediction.

Results: We propose here a method based on hidden Markov models to integrate DHS and histone modification occupancy for the detection of open chromatin regions and active binding sites. We have created a framework that includes treatment of genomic signals, model training and genome-wide application. In a comparative analysis, our method achieved a good trade-off between sensitivity and specificity, and higher area-under-the-curve statistics than competing methods. Moreover, our technique does not require further training or sequence information to generate binding location predictions. Therefore, the method can be easily applied to new cell types and allows flexible downstream analysis such as de novo motif finding.

Availability and implementation: Our framework is available as part of the Regulatory Genomics Toolbox. The software information and all benchmarking data are available at http://costalab.org/wp/dh-hmm.

Contact: ivan.costa@rwth-aachen.de or eduardo.gusmao@rwth-aachen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Gusmao, E. G., Dieterich, C., Zenke, M., Costa, I. G.
Posted: November 5, 2014, 11:27 am

Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. However, because such experiments begin with very small amounts of starting material, the amount of information obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol and to variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material.

Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries.
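A classical ingredient of this kind of extrapolation is the Good–Toulmin estimator of how many new distinct molecules a t-fold deeper sequencing run would reveal, computed from the counts histogram of the shallow run. The sketch below shows only that ingredient; the alternating series diverges for t > 1, and stabilizing it (e.g. with rational-function approximations, as in preseq) is the hard part that is not shown:

```python
def good_toulmin(counts_hist, t):
    # counts_hist[j]: number of distinct molecules observed exactly j times
    # in the initial shallow run. Returns the Good-Toulmin estimate of the
    # number of NEW distinct molecules seen with t-fold extra sequencing.
    return sum((-1) ** (j + 1) * (t ** j) * fj for j, fj in counts_hist.items())
```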

Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq.

Contact: andrewds@usc.edu

Supplementary information: Supplementary material is available at Bioinformatics online.

Author: Daley, T., Smith, A. D.
Posted: November 5, 2014, 11:27 am

Summary: Sequencing oligosaccharides by exoglycosidases, either sequentially or in an array format, is a powerful tool to unambiguously determine the structure of complex N- and O-link glycans. Here, we introduce GlycoDigest, a tool that simulates exoglycosidase digestion, based on controlled rules acquired from expert knowledge and experimental evidence available in GlycoBase. The tool allows the targeted design of glycosidase enzyme mixtures by allowing researchers to model the action of exoglycosidases, thereby validating and improving the efficiency and accuracy of glycan analysis.

Availability and implementation: http://www.glycodigest.org.

Contact: matthew.campbell@mq.edu.au or frederique.lisacek@isb-sib.ch

Author: Gotz, L., Abrahams, J. L., Mariethoz, J., Rudd, P. M., Karlsson, N. G., Packer, N. H., Campbell, M. P., Lisacek, F.
Posted: October 17, 2014, 1:05 pm

Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application.

Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third ‘Quest for Orthologs’ meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes.

The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking.

Availability and implementation: All such materials are available at http://questfororthologs.org.

Contact: erik.sonnhammer@scilifelab.se or c.dessimoz@ucl.ac.uk

Author: Sonnhammer, E. L. L., Gabaldon, T., Sousa da Silva, A. W., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P. D., Dessimoz, C., the Quest for Orthologs consortium
Posted: October 17, 2014, 1:05 pm

Summary: The fluorescence in situ hybridization (FISH) method has been providing valuable information on physical distances between loci (via image analysis) for several decades. Recently, high-throughput data on nearby chemical contacts between and within chromosomes became available with the Hi-C method. Here, we present FisHiCal, an R package for an iterative FISH-based Hi-C calibration that exploits in full the information coming from these methods. We describe here our calibration model and present 3D inference methods that we have developed for increasing its usability, namely, 3D reconstruction through local stress minimization and detection of spatial inconsistencies. We next confirm our calibration across three human cell lines and explain how the output of our methods could inform our model, defining an iterative calibration pipeline, with applications for quality assessment and meta-analysis.

Availability and implementation: FisHiCal v1.1 is available from http://cran.r-project.org/.

Contact: ys388@cam.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Shavit, Y., Hamey, F. K., Lio, P.
Posted: October 17, 2014, 1:05 pm

Motivation: A large number of experimental studies on ageing focus on the effects of genetic perturbations of the insulin/insulin-like growth factor signalling pathway (IIS) on lifespan. Short-lived invertebrate laboratory model organisms are extensively used to quickly identify ageing-related genes and pathways. It is important to extrapolate this knowledge to longer lived mammalian organisms, such as mouse and eventually human, where such analyses are difficult or impossible to perform. Computational tools are needed to integrate and manipulate pathway knowledge in different species.

Results: We performed a literature review and curation of the IIS and target of rapamycin signalling pathways in Mus musculus. We compare this pathway model to the equivalent models in Drosophila melanogaster and Caenorhabditis elegans. Although generally well-conserved, they exhibit important differences. In general, the worm and mouse pathways include a larger number of feedback loops and interactions than the fly pathway. We identify ‘functional orthologues’ that share similar molecular interactions but have only moderate sequence similarity. We then incorporate the mouse model into the web-service NetEffects and perform in silico gene perturbations of IIS components and analyses of experimental results. We identify sub-paths that, given a mutation in an IIS component, could potentially antagonize the primary effects on ageing via FOXO in mouse and via SKN-1 in worm. Finally, we explore the effects of FOXO knockouts in three different mouse tissues.

Availability and implementation: http://www.ebi.ac.uk/thornton-srv/software/NetEffects

Contact: ip8@sanger.ac.uk or thornton@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Papatheodorou, I., Petrovs, R., Thornton, J. M.
Posted: October 17, 2014, 1:05 pm
Author: Zhang, L., Pei, Y.-F., Fu, X., Lin, Y., Wang, Y.-P., Deng, H.-W.
Posted: October 17, 2014, 1:05 pm

Motivation: Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, the resulting reads are short, making their analysis challenging. Recently, a new technology, single molecule real-time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. However, its broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations place great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in large sequencing projects.

Results: Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high-performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread-corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing.

Availability and implementation: proovread is available at the following URL: http://proovread.bioapps.biozentrum.uni-wuerzburg.de

Contact: frank.foerster@biozentrum.uni-wuerzburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hackl, T., Hedrich, R., Schultz, J., Forster, F.
Posted: October 17, 2014, 1:05 pm

Motivation: The increasing availability of mitochondria-targeted and off-target sequencing data in whole-exome and whole-genome sequencing studies (WXS and WGS) has raised the demand for effective pipelines to accurately measure heteroplasmy and to easily recognize the most functionally important mitochondrial variants among a huge number of candidates. To this purpose, we developed MToolBox, a highly automated pipeline to reconstruct and analyze human mitochondrial DNA from high-throughput sequencing data.

Results: MToolBox implements an effective computational strategy for mitochondrial genome assembly and haplogroup assignment, also including a prioritization analysis of detected variants. MToolBox provides a Variant Call Format file featuring, for the first time, allele-specific heteroplasmy, and annotation files with prioritized variants. MToolBox was tested on simulated samples and applied to 1000 Genomes WXS datasets.

Availability and implementation: MToolBox package is available at https://sourceforge.net/projects/mtoolbox/.

Contact: marcella.attimonelli@uniba.it

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Calabrese, C., Simone, D., Diroma, M. A., Santorsola, M., Gutta, C., Gasparre, G., Picardi, E., Pesole, G., Attimonelli, M.
Posted: October 17, 2014, 1:05 pm

Motivation: Riboswitches are short sequences of messenger RNA that can change their structural conformation to regulate the expression of adjacent genes. Computational prediction of putative riboswitches can provide direction to molecular biologists studying riboswitch-mediated gene expression.

Results: The Denison Riboswitch Detector (DRD) is a new computational tool with a Web interface that can quickly identify putative riboswitches in DNA sequences on the scale of bacterial genomes. Riboswitch descriptions are easily modifiable and new ones are easily created. The underlying algorithm converts the problem to a ‘heaviest path’ problem on a multipartite graph, which is then solved using efficient dynamic programming. We show that DRD can achieve ~88–99% sensitivity and >99.99% specificity on 13 riboswitch families.
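The heaviest-path computation on a layered (multipartite) graph is a standard dynamic program: the best score of each node in a layer is the maximum over its predecessors in the previous layer. The generic sketch below illustrates that idea only (it is not the DRD source code, and node identifiers are assumed unique across layers):

```python
def heaviest_path(layers, edge_weight):
    # layers: list of lists of nodes; edges run only between adjacent layers.
    # edge_weight(u, v): weight of the edge from u (previous layer) to v.
    best = {v: 0.0 for v in layers[0]}
    back = {}  # back-pointers for path reconstruction
    for prev, cur in zip(layers, layers[1:]):
        new_best = {}
        for v in cur:
            u = max(prev, key=lambda u: best[u] + edge_weight(u, v))
            new_best[v] = best[u] + edge_weight(u, v)
            back[v] = u
        best = new_best
    end = max(best, key=best.get)
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    return best[end], path[::-1]
```

With one layer per riboswitch element and weights scoring how well a sequence window matches that element, the heaviest path corresponds to the best overall match.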

Availability and implementation: DRD is available at http://drd.denison.edu.

Contact: havill@denison.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Havill, J. T., Bhatiya, C., Johnson, S. M., Sheets, J. D., Thompson, J. S.
Posted: October 17, 2014, 1:05 pm

Summary: Circleator is a Perl application that generates circular figures of genome-associated data. It leverages BioPerl to support standard annotation and sequence file formats and produces publication-quality SVG output. It is designed to be both flexible and easy to use. It includes a library of circular track types and predefined configuration files for common use-cases, including. (i) visualizing gene annotation and DNA sequence data from a GenBank flat file, (ii) displaying patterns of gene conservation in related microbial strains, (iii) showing Single Nucleotide Polymorphisms (SNPs) and indels relative to a reference genome and gene set and (iv) viewing RNA-Seq plots.

Availability and implementation: Circleator is freely available under the Artistic License 2.0 from http://jonathancrabtree.github.io/Circleator/ and is integrated with the CloVR cloud-based sequence analysis Virtual Machine (VM), which can be downloaded from http://clovr.org or run on Amazon EC2.

Contact: jcrabtree@som.umaryland.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Crabtree, J., Agrawal, S., Mahurkar, A., Myers, G. S., Rasko, D. A., White, O.
Posted: October 17, 2014, 1:05 pm

Motivation: Brownian models have been introduced in phylogenetics for describing variation in substitution rates through time, with applications to molecular dating or to the comparative analysis of variation in substitution patterns among lineages. Thus far, however, the Monte Carlo implementations of these models have relied on crude approximations, in which the Brownian process is sampled only at the internal nodes of the phylogeny or at the midpoints along each branch, and the unknown trajectory between these sampled points is summarized by simple branchwise average substitution rates.

Results: A more accurate Monte Carlo approach is introduced, explicitly sampling a fine-grained discretization of the trajectory of the (potentially multivariate) Brownian process along the phylogeny. Generic Monte Carlo resampling algorithms are proposed for updating the Brownian paths along and across branches. Specific computational strategies are developed for efficient integration of the finite-time substitution probabilities across branches induced by the Brownian trajectory. The mixing properties and the computational complexity of the resulting Markov chain Monte Carlo sampler scale reasonably with the discretization level, allowing practical applications with up to a few hundred discretization points along the entire depth of the tree. The method can be generalized to other Markovian stochastic processes, making it possible to implement a wide range of time-dependent substitution models with well-controlled computational precision.
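One building block of such samplers is drawing a fine-grained discretization of a Brownian path along a single branch, conditioned on the values at its two endpoints (a Brownian bridge). The minimal sketch below shows only that step; the paper's resampling moves along and across branches are more involved:

```python
import random

def brownian_bridge(x0, x1, T, sigma, n_points, rng=None):
    # Discretize a Brownian motion with diffusion coefficient sigma on a
    # branch of length T at n_points interior points, conditioned on the
    # values x0 and x1 at the branch's two ends.
    rng = rng or random.Random(0)
    path = [x0]
    t, x = 0.0, x0
    dt = T / (n_points + 1)
    for _ in range(n_points):
        remaining = T - t - dt
        # Bridge transition: mean interpolates toward x1, variance shrinks
        # as the remaining time to the constrained endpoint shrinks.
        mean = x + (x1 - x) * dt / (T - t)
        var = sigma ** 2 * dt * remaining / (T - t)
        x = rng.gauss(mean, var ** 0.5)
        path.append(x)
        t += dt
    path.append(x1)
    return path
```

With sigma = 0 the bridge degenerates to linear interpolation between the endpoints, which is a convenient sanity check.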

Availability: The program is freely available at www.phylobayes.org

Contact: nicolas.lartillot@univ-lyon1.fr

Author: Horvilleur, B., Lartillot, N.
Posted: October 17, 2014, 1:05 pm

Summary: Single nucleotide variations (SNVs) located within a reading frame can result in single amino acid polymorphisms (SAPs), leading to alteration of the corresponding amino acid sequence as well as function of a protein. Accurate detection of SAPs is an important issue in proteomic analysis at the experimental and bioinformatic level. Herein, we present sapFinder, an R software package, for detection of the variant peptides based on tandem mass spectrometry (MS/MS)-based proteomics data. This package automates the construction of variation-associated databases from public SNV repositories or sample-specific next-generation sequencing (NGS) data and the identification of SAPs through database searching, post-processing and generation of HTML-based report with visualized interface.

Availability and implementation: sapFinder is implemented as a Bioconductor package in R. The package and the vignette can be downloaded at http://bioconductor.org/packages/devel/bioc/html/sapFinder.html and are provided under a GPL-2 license.

Contact: siqiliu@genomics.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Wen, B., Xu, S., Sheynkman, G. M., Feng, Q., Lin, L., Wang, Q., Xu, X., Wang, J., Liu, S.
Posted: October 17, 2014, 1:05 pm

Motivation: Biological system behaviors are often the outcome of complex interactions among a large number of cells and their biotic and abiotic environment. Computational biologists attempt to understand, predict and manipulate biological system behavior through mathematical modeling and computer simulation. Discrete agent-based modeling (in combination with high-resolution grids to model the extracellular environment) is a popular approach for building biological system models. However, the computational complexity of this approach forces computational biologists to resort to coarser resolution approaches to simulate large biological systems. High-performance parallel computers have the potential to address the computing challenge, but writing efficient software for parallel computers is difficult and time-consuming.

Results: We have developed Biocellion, a high-performance software framework, to solve this computing challenge using parallel computers. To support a wide range of multicellular biological system models, Biocellion asks users to provide their model specifics by filling in the bodies of pre-defined model routines. Using Biocellion, modelers without parallel computing expertise can efficiently exploit parallel computers with less effort than writing sequential programs from scratch. We simulate cell sorting, microbial patterning and a bacterial system in soil aggregate as case studies.

Availability and implementation: Biocellion runs on x86 compatible systems with the 64 bit Linux operating system and is freely available for academic use. Visit http://biocellion.com for additional information.

Contact: seunghwa.kang@pnnl.gov

Author: Kang, S., Kahan, S., McDermott, J., Flann, N., Shmulevich, I.
Posted: October 17, 2014, 1:05 pm

Motivation: Boolean network models are suitable for simulating gene regulatory networks (GRNs) in the absence of detailed kinetic information. However, reducing the biological reality implies making assumptions on how genes interact (interaction rules) and how their state is updated during the simulation (update scheme). The exact choice of the assumptions largely determines the outcome of the simulations. In most cases, however, the biologically correct assumptions are unknown. An ideal simulation thus implies testing different rules and schemes to determine those that best capture an observed biological phenomenon. This is not trivial, because most current methods to simulate Boolean network models of GRNs and to compute their attractors impose specific assumptions that cannot be easily altered, as they are built into the system.

Results: To allow for a more flexible simulation framework, we developed ASP-G. We show the correctness of ASP-G in simulating Boolean network models and obtaining attractors under different assumptions by successfully recapitulating the detection of attractors of previously published studies. We also provide an example of how simulating network models under different settings helps determine the assumptions under which a certain conclusion holds. The main added value of ASP-G is its modularity and declarativity, making it more flexible and less error-prone than traditional approaches. The declarative nature of ASP-G comes at the expense of being slower than more dedicated systems, but it still achieves good efficiency with respect to computational time.
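For concreteness, attractor detection under one particular assumption — a synchronous update scheme — can be sketched imperatively in a few lines. ASP-G itself is declarative (answer set programming) and supports swapping rules and schemes; this sketch only illustrates what an attractor is under the synchronous scheme:

```python
def find_attractor(rules, state):
    # rules: dict gene -> function(state_dict) -> bool, applied synchronously
    # (all genes updated at once). Iterate until a state repeats; the states
    # from the first repeat onward form the attractor (fixed point or cycle).
    seen = {}
    trajectory = []
    key = tuple(sorted(state.items()))
    while key not in seen:
        seen[key] = len(trajectory)
        trajectory.append(dict(key))
        state = {g: f(trajectory[-1]) for g, f in rules.items()}
        key = tuple(sorted(state.items()))
    return trajectory[seen[key]:]
```

Under an asynchronous scheme the same rules can yield different attractors, which is precisely why the choice of update scheme matters.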

Availability and implementation: The source code of ASP-G is available at http://bioinformatics.intec.ugent.be/kmarchal/Supplementary_Information_Musthofa_2014/asp-g.zip.

Contact: Kathleen.Marchal@UGent.be or Martine.DeCock@UGent.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Mushthofa, M., Torres, G., Van de Peer, Y., Marchal, K., De Cock, M.
Posted: October 17, 2014, 1:05 pm

Motivation: The ability to accurately model protein structures at the atomistic level underpins efforts to understand protein folding, to engineer natural proteins predictably and to design proteins de novo. Homology-based methods are well established and produce impressive results. However, these are limited to structures presented by and resolved for natural proteins. Addressing this problem more widely and deriving truly ab initio models requires mathematical descriptions for protein folds; the means to decorate these with natural, engineered or de novo sequences; and methods to score the resulting models.

Results: We present CCBuilder, a web-based application that tackles the problem for a defined but large class of protein structure, the α-helical coiled coils. CCBuilder generates coiled-coil backbones, builds side chains onto these frameworks and provides a range of metrics to measure the quality of the models. Its straightforward graphical user interface provides broad functionality that allows users to build and assess models, in which helix geometry, coiled-coil architecture and topology and protein sequence can be varied rapidly. We demonstrate the utility of CCBuilder by assembling models for 653 coiled-coil structures from the PDB, which cover >96% of the known coiled-coil types, and by generating models for rarer and de novo coiled-coil structures.
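The underlying geometric idea — a minor α-helix wound around a superhelical axis, in the spirit of Crick's coiled-coil parameterization — can be sketched as below. This is a deliberately simplified Cα trace that omits the pitch-angle correction of the full Crick equations; the parameter values (superhelical radius ~5 Å, α-helix radius ~2.26 Å, heptad repeat, 1.51 Å rise) are generic illustrative numbers, not CCBuilder's model:

```python
from math import cos, sin, pi, hypot

def coiled_coil_ca_trace(n_res, r0=5.0, r1=2.26,
                         w0=-2 * pi / 100, w1=4 * pi / 7, rise=1.51):
    # Minor helix (radius r1, frequency w1: two turns per 7 residues) wound
    # around a left-handed superhelix (radius r0, frequency w0).
    # Simplified sketch: the tilt of the minor helix relative to the
    # superhelical axis is ignored.
    trace = []
    for t in range(n_res):
        x = r0 * cos(w0 * t) + r1 * cos(w0 * t) * cos(w1 * t)
        y = r0 * sin(w0 * t) + r1 * sin(w0 * t) * cos(w1 * t)
        z = rise * t + r1 * sin(w1 * t)
        trace.append((x, y, z))
    return trace
```

Varying a handful of such parameters is what makes exhaustive enumeration of coiled-coil models tractable compared with general protein structures.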

Availability and implementation: CCBuilder is freely available, without registration, at http://coiledcoils.chm.bris.ac.uk/app/cc_builder/

Contact: D.N.Woolfson@bristol.ac.uk or Chris.Wood@bristol.ac.uk

Author: Wood, C. W., Bruning, M., Ibarra, A. A., Bartlett, G. J., Thomson, A. R., Sessions, R. B., Brady, R. L., Woolfson, D. N.
Posted: October 17, 2014, 1:05 pm

Motivation: Oncogenes are known drivers of cancer phenotypes and targets of molecular therapies; however, the complex and diverse signaling mechanisms regulated by oncogenes and potential routes to targeted therapy resistance remain to be fully understood. To this end, we present an approach to infer regulatory mechanisms downstream of the HER2 driver oncogene in SUM-225 metastatic breast cancer cells from dynamic gene expression patterns using a succession of analytical techniques, including a novel MP grammars method to mathematically model putative regulatory interactions among sets of clustered genes.

Results: Our method highlighted regulatory interactions previously identified in the cell line and a novel finding that the HER2 oncogene, as opposed to the proto-oncogene, upregulates expression of the E2F2 transcription factor. By targeted gene knockdown we show the significance of this, demonstrating that cancer cell-matrix adhesion and outgrowth were markedly inhibited when E2F2 levels were reduced, validating in this context that upregulation of E2F2 represents a key intermediate event in a HER2 oncogene-directed, gene expression-based signaling circuit. This work demonstrates how predictive modeling of longitudinal gene expression data combined with multiple systems-level analyses can be used to accurately predict downstream signaling pathways. Here, our integrated method was applied to reveal insights into how the HER2 oncogene drives a specific cancer cell phenotype, but it is adaptable to investigate other oncogenes and model systems.

Availability and implementation: Accessibility of various tools is listed in methods; the Log-Gain Stoichiometric Stepwise algorithm is accessible at http://www.cbmc.it/software/Software.php.

Contact: bollig@karmanos.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Bollig-Fischer, A., Marchetti, L., Mitrea, C., Wu, J., Kruger, A., Manca, V., Drăghici, S.
Posted: October 17, 2014, 1:05 pm

Motivation: Most approaches used to identify cancer driver genes focus, true to their name, on entire genes and assume that a gene, treated as one entity, has a specific role in cancer. This approach may be correct to describe effects of gene loss or changes in gene expression; however, mutations may have different effects, including their relevance to cancer, depending on which region of the gene they affect. Except for rare and well-known exceptions, there are not enough data for reliable statistics for individual positions, but an intermediate level of analysis, between an individual position and the entire gene, may give us better statistics than the former and better resolution than the latter approach.

Results: We have developed e-Driver, a method that exploits the internal distribution of somatic missense mutations between the protein’s functional regions (domains or intrinsically disordered regions) to find those that show a bias in their mutation rate as compared with other regions of the same protein, providing evidence of positive selection and suggesting that these proteins may be actual cancer drivers. We have applied e-Driver to a large cancer genome dataset from The Cancer Genome Atlas and compared its performance with that of four other methods, showing that e-Driver identifies novel candidate cancer drivers and, because of its increased resolution, provides deeper insights into the potential mechanism of cancer driver genes identified by other methods.

Availability and implementation: A Perl script with e-Driver and the files to reproduce the results described here can be downloaded from https://github.com/eduardporta/e-Driver.git

Contact: adam@godziklab.org or eppardo@sanfordburnham.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Porta-Pardo, E., Godzik, A.
Posted: October 17, 2014, 1:05 pm

Motivation: Asymmetry is frequently observed in the empirical distribution of test statistics that results from the analysis of gene expression experiments. This asymmetry indicates an asymmetry in the distribution of effect sizes. A common method for identifying differentially expressed (DE) genes in a gene expression experiment while controlling false discovery rate (FDR) is Storey’s q-value method. This method ranks genes based solely on the P-values from each gene in the experiment.

Results: We propose a method that alters and improves upon the q-value method by taking the sign of the test statistics, in addition to the P-values, into account. Through two simulation studies (one involving independent normal data and one involving microarray data), we show that the proposed method, when compared with the traditional q-value method, generally provides a better ranking for genes as well as a higher number of truly DE genes declared to be DE, while still adequately controlling FDR. We illustrate the proposed method by analyzing two microarray datasets, one from an experiment of thale cress seedlings and the other from an experiment of maize leaves.
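The central idea, taking the sign of each test statistic into account by ranking genes within each sign group separately, can be sketched as follows. This is an illustration only: it uses the Benjamini-Hochberg step-up procedure as a simpler stand-in for Storey's q-value estimator, and the function names are ours, not the authors'.

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg step-up q-values, used here as a simpler
    stand-in for Storey's q-value estimator."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(q, 1.0)
    return out

def signed_qvalues(p, stat):
    """Compute q-values separately within each sign group of the test
    statistics, so an asymmetric effect-size distribution is handled
    per direction rather than pooled."""
    p, stat = np.asarray(p, dtype=float), np.asarray(stat, dtype=float)
    q = np.empty_like(p)
    for mask in (stat >= 0, stat < 0):
        if mask.any():
            q[mask] = bh_qvalues(p[mask])
    return q
```

Splitting by sign means a direction with many small P-values no longer dilutes the ranking of the other direction, which is the intuition behind exploiting the observed asymmetry.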

Availability and implementation: The R code and data files for the proposed method and examples are available at Bioinformatics online.

Contact: megan.orr@ndsu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Orr, M., Liu, P., Nettleton, D.
Posted: October 17, 2014, 1:05 pm

The Affymetrix Axiom genotyping standard and ‘best practice’ workflow for Linux and Mac users consists of three stand-alone executable programs (Affymetrix Power Tools) and an R package (SNPolisher). Currently, SNP analysis has to be performed in a step-by-step procedure. Manual intervention and/or programming skills are required at each intermediate point, as Affymetrix Power Tools programs do not produce input files for the next program in line. An additional problem is that the output format of genotypes is not compatible with most analysis software currently available. AffyPipe solves all the above problems by automating both standard and ‘best practice’ workflows for any species genotyped with the Axiom technology. AffyPipe does not require programming skills and performs all the steps necessary to obtain a final genotype file. Furthermore, users can directly edit SNP probes and export genotypes in PLINK format.

Availability and implementation: https://github.com/nicolazzie/AffyPipe.git.

Contact: ezequiel.nicolazzi@tecnoparco.org

Author: Nicolazzi, E. L., Iamartino, D., Williams, J. L.
Posted: October 17, 2014, 1:05 pm

Motivation: A rapid progression of esophageal squamous cell carcinoma (ESCC) causes a high mortality rate because of the propensity for metastasis driven by genetic and epigenetic alterations. The identification of prognostic biomarkers would help prevent or control metastatic progression. Expression analyses have been used to find such markers, but do not always validate in separate cohorts. Epigenetic marks, such as DNA methylation, are a potential source of more reliable and stable biomarkers. Importantly, the integration of both expression and epigenetic alterations is more likely to identify relevant biomarkers.

Results: We present a new analysis framework, using an ESCC progression-associated gene regulatory network (GRNescc), to identify differentially methylated CpG sites prognostic of ESCC progression. From the CpG loci differentially methylated in 50 tumor–normal pairs, we selected 44 CpG loci most highly associated with survival and located in the promoters of genes more likely to belong to GRNescc. Using an independent ESCC cohort, we confirmed that 8/10 CpG loci in the promoters of GRNescc genes significantly correlated with patient survival. In contrast, 0/10 CpG loci in the promoters of genes outside the GRNescc correlated with patient survival. We further characterized the GRNescc network topology and observed that the genes with methylated CpG loci associated with survival deviated from the center of mass and were less likely to be hubs in the GRNescc. We postulate that our analysis framework improves the identification of bona fide prognostic biomarkers from DNA methylation studies, especially with partial genome coverage.

Contact: tsengsm@mail.ncku.edu.tw or ycw5798@mail.ncku.edu.tw

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Cheng, C.-P., Kuo, I.-Y., Alakus, H., Frazer, K. A., Harismendy, O., Wang, Y.-C., Tseng, V. S.
Posted: October 17, 2014, 1:05 pm

Summary: STAMP is a graphical software package that provides statistical hypothesis tests and exploratory plots for analysing taxonomic and functional profiles. It supports tests for comparing pairs of samples or samples organized into two or more treatment groups. Effect sizes and confidence intervals are provided to allow critical assessment of the biological relevancy of test results. A user-friendly graphical interface permits easy exploration of statistical results and generation of publication-quality plots.
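As one concrete example of the kind of effect size and confidence interval STAMP reports when comparing a pair of samples, the difference between two feature proportions with a normal-approximation 95% CI can be computed as below. This is an illustrative sketch under our own naming, not STAMP's exact implementation.

```python
import math

def proportion_diff_ci(a1, n1, a2, n2, z=1.96):
    """Difference between two proportions (e.g. a taxon's relative
    abundance in two samples) with a normal-approximation 95% CI."""
    p1, p2 = a1 / n1, a2 / n2
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d, (d - z * se, d + z * se)
```

Reporting the interval alongside the P-value is what allows the "critical assessment of biological relevancy" mentioned above: a tiny but statistically significant difference is exposed as such by its CI.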

Availability and implementation: STAMP is licensed under the GNU GPL. Python source code and binaries are available from our website at: http://kiwi.cs.dal.ca/Software/STAMP

Contact: donovan.parks@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Parks, D. H., Tyson, G. W., Hugenholtz, P., Beiko, R. G.
Posted: October 17, 2014, 1:05 pm

Motivation: The successful translation of genomic signatures into clinical settings relies on good discrimination between patient subgroups. Many sophisticated algorithms have been proposed in the statistics and machine learning literature, but in practice simpler algorithms are often used. However, few simple algorithms have been formally described or systematically investigated.

Results: We give a precise definition of a popular simple method we refer to as más-o-menos, which calculates prognostic scores for discrimination by summing standardized predictors, weighted by the signs of their marginal associations with the outcome. We study its behavior theoretically, in simulations and in an extensive analysis of 27 independent gene expression studies of bladder, breast and ovarian cancer, altogether totaling 3833 patients with survival outcomes. We find that despite its simplicity, más-o-menos can achieve good discrimination performance. It performs no worse, and sometimes better, than popular and much more CPU-intensive methods for discrimination, including lasso and ridge regression.
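The más-o-menos score as defined above (sum the standardized predictors, each weighted by the sign of its marginal association with the outcome) can be sketched directly. This is an illustration, not the survHD implementation, and it uses plain correlation sign rather than a survival-model coefficient.

```python
import numpy as np

def mas_o_menos(X_train, y_train, X_new):
    """Score samples by summing standardized predictors, each weighted
    by the sign of its marginal association with the outcome."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    Z = (X_train - mu) / sd
    # sign of each predictor's marginal association with the outcome
    signs = np.sign(Z.T @ (y_train - y_train.mean()))
    # standardize new samples with the training parameters, then sum
    Z_new = (X_new - mu) / sd
    return Z_new @ signs
```

Because only the signs (not magnitudes) of the associations are used, the method has essentially no tuning parameters, which is part of why it transfers well across the independent cohorts studied here.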

Availability and Implementation: Más-o-menos is implemented for survival analysis as an option in the survHD package, available from http://www.bitbucket.org/lwaldron/survhd and submitted to Bioconductor.

Contact: sdzhao@illinois.edu

Author: Zhao, S. D., Parmigiani, G., Huttenhower, C., Waldron, L.
Posted: October 17, 2014, 1:05 pm

Motivation: Recent breakthroughs in protein residue–residue contact prediction have made reliable de novo prediction of protein structures possible. The key was to apply statistical methods that can distinguish direct couplings between pairs of columns in a multiple sequence alignment from merely correlated pairs, i.e. to separate direct from indirect effects. Two classes of such methods exist, relying either on regularized inversion of the covariance matrix or on pseudo-likelihood maximization (PLM). Although PLM-based methods offer clearly higher precision, available tools are not sufficiently optimized and are written in interpreted languages that introduce additional overhead. This inflates runtimes and impedes large-scale contact prediction for larger protein families, multi-domain proteins and protein–protein interactions.

Results: Here we introduce CCMpred, our performance-optimized PLM implementation in C and CUDA C. Using graphics cards in the price range of current six-core processors, CCMpred can predict contacts for typical alignments 35–113 times faster and with the same precision as the most accurate published methods. For users without a CUDA-capable graphics card, CCMpred can also run in a CPU mode that is still 4–14 times faster. Thanks to our speed-ups, contacts for typical protein families can be predicted in 15–60 s on a consumer-grade GPU and 1–6 min on a six-core CPU.

Availability and implementation: CCMpred is free and open-source software under the GNU Affero General Public License v3 (or later) available at https://bitbucket.org/soedinglab/ccmpred

Contact: johannes.soeding@mpibpc.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Seemayer, S., Gruber, M., Soding, J.
Posted: October 17, 2014, 1:05 pm

Motivation: MicroRNAs (miRNAs) play crucial roles in complex cellular networks by binding to the messenger RNAs (mRNAs) of protein coding genes. It has been found that miRNA regulation is often condition-specific. A number of computational approaches have been developed to identify miRNA activity specific to a condition of interest using gene expression data. However, most of the methods only use the data in a single condition, and thus, the activity discovered may not be unique to the condition of interest. Additionally, these methods are based on statistical associations between the gene expression levels of miRNAs and mRNAs, so they may not be able to reveal real gene regulatory relationships, which are causal relationships.

Results: We propose a novel method to infer condition-specific miRNA activity by considering (i) the difference between the regulatory behavior that an miRNA has in the condition of interest and its behavior in the other conditions; (ii) the causal semantics of miRNA–mRNA relationships. The method is applied to the epithelial–mesenchymal transition (EMT) and multi-class cancer (MCC) datasets. The validation by the results of transfection experiments shows that our approach is effective in discovering significant miRNA–mRNA interactions. Functional and pathway analysis and literature validation indicate that the identified active miRNAs are closely associated with the specific biological processes, diseases and pathways. More detailed analysis of the active miRNAs implies that some show different regulation types in different conditions, whereas others keep the same regulation type and differ across conditions only in the strength of regulation.

Availability and implementation: The R and Matlab scripts are in the Supplementary materials.

Contact: jiuyong.li@unisa.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhang, J., Le, T. D., Liu, L., Liu, B., He, J., Goodall, G. J., Li, J.
Posted: October 17, 2014, 1:05 pm

Summary: The linear mixed model is the state-of-the-art method to account for the confounding effects of kinship and population structure in genome-wide association studies (GWAS). Current implementations test the effect of one or more genetic markers while including prespecified covariates such as sex. Here we develop an efficient implementation of the linear mixed model that allows composite hypothesis tests to consider genotype interactions with variables such as other genotypes, environment, sex or ancestry. Our R package, lrgpr, allows interactive model fitting and examination of regression diagnostics to facilitate exploratory data analysis in the context of the linear mixed model. By leveraging parallel and out-of-core computing for datasets too large to fit in main memory, lrgpr is applicable to large GWAS datasets and next-generation sequencing data.

Availability and implementation: lrgpr is an R package available from lrgpr.r-forge.r-project.org

Contact: gabriel.hoffman@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hoffman, G. E., Mezey, J. G., Schadt, E. E.
Posted: October 17, 2014, 1:05 pm

Motivation: The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data.

Results: We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data.
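Idea (i), compressing an LD block by storing each SNP as its differences from a reference SNP, can be illustrated with a toy encoder. This shows only the principle; it is not the published file format, and the function names are ours.

```python
def delta_encode_block(block):
    """Encode an LD block (list of genotype vectors, 0/1/2 coding) as the
    reference SNP plus, for every other SNP, only the positions where it
    differs from the reference."""
    ref = block[0]
    encoded = [("ref", ref)]
    for snp in block[1:]:
        diffs = [(i, g) for i, (g, r) in enumerate(zip(snp, ref)) if g != r]
        encoded.append(("diff", diffs))
    return encoded

def delta_decode_block(encoded):
    """Invert delta_encode_block: rebuild each SNP from the reference."""
    ref = encoded[0][1]
    out = [list(ref)]
    for _, diffs in encoded[1:]:
        snp = list(ref)
        for i, g in diffs:
            snp[i] = g
        out.append(snp)
    return out
```

Within an LD block, neighboring SNPs are highly correlated, so the difference lists stay short and the per-SNP storage cost drops well below one entry per subject, which is what makes reference-based encoding effective here.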

Availability and implementation: Our compression and decompression algorithms are implemented in a C++ library, are released under the GNU General Public License and are freely downloadable from http://www.dei.unipd.it/~sambofra/snpack.html.

Contact: sambofra@dei.unipd.it or cobelli@dei.unipd.it.

Author: Sambo, F., Di Camillo, B., Toffolo, G., Cobelli, C.
Posted: October 17, 2014, 1:05 pm

Summary: NetPathMiner is a general framework for mining, from genome-scale networks, paths that are related to specific experimental conditions. NetPathMiner interfaces with various input formats including KGML, SBML and BioPAX files and allows for manipulation of networks in three different forms: metabolic, reaction and gene representations. NetPathMiner ranks the obtained paths and applies Markov model-based clustering and classification methods to the ranked paths for easy interpretation. NetPathMiner also provides static and interactive visualizations of networks and paths to aid manual investigation.

Availability: The package is available through Bioconductor and from Github at http://github.com/ahmohamed/NetPathMiner

Contact: mohamed@kuicr.kyoto-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Mohamed, A., Hancock, T., Nguyen, C. H., Mamitsuka, H.
Posted: October 17, 2014, 1:05 pm

Motivation: Diseases and adverse drug reactions are frequently caused by disruptions in gene functionality. Gaining insight into the global system properties governing the relationships between genotype and phenotype is thus crucial for understanding, and intervening in, perturbations of complex organisms such as disease states.

Results: We present a systematic analysis of phenotypic information of 5047 perturbations of single genes in mice, 4766 human diseases and 1666 drugs that examines the relationships between different gene properties and the phenotypic impact at the organ system level in mammalian organisms. We observe that while single gene perturbations and alterations of nonessential, tissue-specific genes or those with low betweenness centrality in protein–protein interaction networks often show organ-specific effects, multiple gene alterations resulting, e.g., from complex disorders and drug treatments have a more widespread impact. Interestingly, certain cellular localizations are distinctly associated with systemic effects in monogenic disease genes and mouse gene perturbations, such as the lumen of intracellular organelles and transcription factor complexes, respectively. In summary, we show that the broadness of the phenotypic effect is clearly related to certain gene properties and is an indicator of the severity of perturbations. This work contributes to the understanding of gene properties influencing the systemic effects of diseases and drugs.

Contact: monica.campillos@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Vogt, I., Prinz, J., Worf, K., Campillos, M.
Posted: October 17, 2014, 1:05 pm

Summary: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services.

Availability and implementation: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu.

Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Gafni, E., Luquette, L. J., Lancaster, A. K., Hawkins, J. B., Jung, J.-Y., Souilmi, Y., Wall, D. P., Tonellato, P. J.
Posted: October 3, 2014, 1:33 pm

Summary: Conserved water molecules play a crucial role in protein structure, stabilization of secondary structure, protein activity, flexibility and ligand binding. Clustering of water molecules in superimposed protein structures, obtained by X-ray crystallography at high resolution, is an established method to identify consensus water molecules in all known protein structures of the same family. PyWATER is an easy-to-use PyMOL plug-in that identifies conserved water molecules in a protein structure of interest. PyWATER can be installed via the user interface of PyMOL. No programming or command-line knowledge is required for its use.

Availability and Implementation: PyWATER and a tutorial are available at https://github.com/hiteshpatel379/PyWATER. PyMOL is available at http://www.pymol.org/ or http://sourceforge.net/projects/pymol/.

Contact: stefan.guenther@pharmazie.uni-freiburg.de

Author: Patel, H., Gruning, B. A., Gunther, S., Merfort, I.
Posted: October 3, 2014, 1:33 pm

Motivation: Runs of homozygosity (ROH) are sizable chromosomal stretches of homozygous genotypes, ranging in length from tens of kilobases to megabases. ROHs can be relevant for population and medical genetics, playing a role in predisposition to both rare and common disorders. ROHs are commonly detected by single nucleotide polymorphism (SNP) microarrays, but attempts have been made to use whole-exome sequencing (WES) data. Currently available methods developed for the analysis of uniformly spaced SNP-array maps do not fit easily to the analysis of the sparse and non-uniform distribution of the WES target design.

Results: To meet the need for an approach specifically tailored to WES data, we developed H3M2, an original algorithm based on a heterogeneous hidden Markov model that incorporates inter-marker distances to detect ROH from WES data. We evaluated the ability of H3M2 to correctly identify ROHs on synthetic chromosomes and examined its accuracy in detecting ROHs of different lengths (short, medium and long) from real 1000 Genomes Project data. H3M2 turned out to be more accurate than GERMLINE and PLINK, two state-of-the-art algorithms, especially in the detection of short and medium ROHs.
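For contrast with the HMM approach, the simplest ROH baseline is a run-length scan over homozygous calls with marker-count and physical-length thresholds, the kind of fixed-threshold detection that distance-aware models improve upon in sparse WES maps. The coding and thresholds below are illustrative, not those of H3M2, GERMLINE or PLINK.

```python
def candidate_rohs(genotypes, positions, min_markers=3, min_length=1000):
    """Find runs of homozygous genotypes (0 or 2 in allele-count coding)
    and report runs exceeding both a marker count and a physical length,
    as (start_bp, end_bp, n_markers) tuples."""
    runs, start = [], None
    for i, g in enumerate(genotypes):
        hom = g in (0, 2)
        if hom and start is None:
            start = i  # open a new homozygous run
        if (not hom or i == len(genotypes) - 1) and start is not None:
            end = i if hom else i - 1  # close the run
            n = end - start + 1
            length = positions[end] - positions[start]
            if n >= min_markers and length >= min_length:
                runs.append((positions[start], positions[end], n))
            start = None
    return runs
```

A fixed scan like this treats a 100 bp gap and a 100 kb gap between exome targets identically, which is exactly the weakness that motivates incorporating inter-marker distances into the emission/transition model.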

Availability and implementation: H3M2 is a collection of bash, R and Fortran scripts and codes and is freely available at https://sourceforge.net/projects/h3m2/.

Contact: albertomagi@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Magi, A., Tattini, L., Palombo, F., Benelli, M., Gialluisi, A., Giusti, B., Abbate, R., Seri, M., Gensini, G. F., Romeo, G., Pippucci, T.
Posted: October 3, 2014, 1:33 pm

Summary: Clustered regularly interspaced short palindromic repeats (CRISPR)-based technologies have revolutionized human genome engineering and opened countless possibilities to basic science, synthetic biology and gene therapy. Despite the enormous potential of these tools, their performance is far from perfect, and it is essential to perform a careful post hoc analysis of each gene editing experiment. However, there are no computational tools for genome editing assessment yet, and current experimental tools lack sensitivity and flexibility.

We present a platform to assess the quality of a genome editing experiment with only three mouse clicks. The method evaluates next-generation sequencing data to quantify and characterize insertions, deletions and homologous recombination. CRISPR Genome Analyzer provides a report for the locus selected, which includes a quantification of the edited site and the analysis of the different alterations detected. The platform maps the reads, estimates and locates insertions and deletions, computes the allele replacement efficiency and provides a report integrating all the information.

Availability and implementation: CRISPR-GA Web is available at http://crispr-ga.net. Documentation on CRISPR-GA instructions can be found at http://crispr-ga.net/documentation.html

Contact: mguell@genetics.med.harvard.edu

Author: Guell, M., Yang, L., Church, G. M.
Posted: October 3, 2014, 1:33 pm

Motivation: In chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and other short-read sequencing experiments, a considerable fraction of the short reads align to multiple locations on the reference genome (multi-reads). Inferring the origin of multi-reads is critical for accurately mapping reads to repetitive regions. Current state-of-the-art multi-read allocation algorithms rely on the read counts in the local neighborhood of the alignment locations and ignore the variation in the copy numbers of these regions. Copy-number variation (CNV) can directly affect the read densities and, therefore, bias allocation of multi-reads.

Results: We propose cnvCSEM (CNV-guided ChIP-Seq by expectation-maximization algorithm), a flexible framework that incorporates CNV in multi-read allocation. cnvCSEM eliminates the CNV bias in multi-read allocation by initializing the read allocation algorithm with CNV-aware initial values. Our data-driven simulations illustrate that cnvCSEM leads to higher read coverage with satisfactory accuracy and lower loss in read-depth recovery (estimation). We evaluate the biological relevance of the cnvCSEM-allocated reads and the resultant peaks with the analysis of several ENCODE ChIP-seq datasets.
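The core allocation idea, an EM loop that fractionally assigns each multi-read among its candidate loci, with the CNV-aware initialization described above, can be sketched in miniature. This is a toy model under our own naming, not the cnvCSEM implementation: loci stand in for alignment neighborhoods, and mappability is assumed uniform.

```python
def allocate_multireads(align_locs, copy_number, n_iter=20):
    """Toy EM for multi-read allocation. Each read lists its candidate
    loci; locus weights are initialized from copy-number estimates
    (the CNV-aware start) instead of uniformly."""
    total = sum(copy_number)
    weights = [c / total for c in copy_number]  # CNV-aware initial values
    for _ in range(n_iter):
        # E-step: responsibility of each locus for each read
        expected = [0.0] * len(copy_number)
        for locs in align_locs:
            z = sum(weights[l] for l in locs)
            for l in locs:
                expected[l] += weights[l] / z
        # M-step: re-estimate locus weights from expected read counts
        n_reads = len(align_locs)
        weights = [e / n_reads for e in expected]
    return weights
```

With two reads unique to locus 0, one unique to locus 1 and one read mapping to both, the EM converges to weights of roughly 2/3 and 1/3, i.e. the ambiguous read is split in proportion to the evidence from the unique reads; a CNV-biased uniform start would instead let duplicated regions soak up reads they do not own.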

Availability and implementation: Available at http://www.stat.wisc.edu/~qizhang/

Contact: qizhang@stat.wisc.edu or keles@stat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhang, Q., Keleş, S.
Posted: October 3, 2014, 1:33 pm

Motivation: In proteomes of higher eukaryotes, many alternative splice variants can only be detected by their shared peptides. This makes it highly challenging to use peptide-centric mass spectrometry to distinguish and to quantify protein isoforms resulting from alternative splicing events.

Results: We have developed two complementary algorithms based on linear mathematical models to efficiently compute a minimal set of shared and unique peptides needed to quantify a set of isoforms and splice variants. Further, we developed a statistical method to estimate the splice variant abundances based on stable isotope labeled peptide quantities. The algorithms and databases are integrated in a web-based tool, and we have experimentally tested the limits of our quantification method using spiked proteins and cell extracts.

Availability and implementation: The TAPAS server is available at URL http://davinci.crg.es/tapas/.

Contact: luis.serrano@crg.eu or christina.kiel@crg.eu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Yang, J.-S., Sabido, E., Serrano, L., Kiel, C.
Posted: October 3, 2014, 1:33 pm

Motivation: Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors (TFs), nucleosomes and polymerases. A critical step in deciphering transcriptional regulation is to infer, and eventually predict, the precise locations of these interactions, along with their strength and frequency. While recent datasets yield great insight into these interactions, individual data sources often provide only partial information regarding one aspect of the complete interaction landscape. For example, chromatin immunoprecipitation (ChIP) reveals the binding positions of a protein, but only for one protein at a time. In contrast, nucleases like MNase and DNase can be used to reveal binding positions for many different proteins at once, but cannot easily determine the identities of those proteins. Currently, few statistical frameworks jointly model these different data sources to reveal an accurate, holistic view of the in vivo protein–DNA interaction landscape.

Results: Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of the in vivo protein–DNA interaction landscape. We show that our framework learns an interaction landscape with increased accuracy, explaining multiple sets of data in accordance with thermodynamic principles of competitive DNA binding. The resulting model of genomic occupancy provides a precise mechanistic vantage point from which to explore the role of protein–DNA interactions in transcriptional regulation.

Availability and implementation: The C source code for compete and Python source code for MCMC-based inference are available at http://www.cs.duke.edu/~amink.

Contact: amink@cs.duke.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhong, J., Wasson, T., Hartemink, A. J.
Posted: October 3, 2014, 1:33 pm

Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples.

Availability and implementation: The software package is available under the GNU General Public License v3 and is freely available, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It uses R (http://www.r-project.org/), GNU parallel (http://www.gnu.org/software/parallel/) and samtools (https://github.com/samtools/samtools).

Contact: bammds-users@nongnu.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Malaspinas, A.-S., Tange, O., Moreno-Mayar, J. V., Rasmussen, M., DeGiorgio, M., Wang, Y., Valdiosera, C. E., Politis, G., Willerslev, E., Nielsen, R.
Posted: October 3, 2014, 1:33 pm

Motivation: Small non-coding RNAs (sRNAs) have major roles in post-transcriptional regulation in prokaryotes. Because only a relatively small number of sRNAs have been experimentally validated, and in only a few species, computational algorithms are needed that can robustly encode the available knowledge and use it to predict sRNAs within and across species.

Results: We present a novel methodology designed to identify bacterial sRNAs by incorporating the knowledge encoded by different sRNA prediction methods and optimally aggregating them as potential predictors. Because some of these methods emphasize specificity, whereas others emphasize sensitivity while detecting sRNAs, their optimal aggregation constitutes trade-off solutions between these two contradictory objectives that enhance their individual merits. Many non-redundant optimal aggregations uncovered by using multiobjective optimization techniques are then combined into a multiclassifier, which ensures robustness during detection and prediction even in genomes with distinct nucleotide composition. By training with sRNAs in Salmonella enterica Typhimurium, we were able to successfully predict sRNAs in Sinorhizobium meliloti, as well as in multiple and poorly annotated species. The proposed methodology, akin to a meta-analysis, may lay a foundation for developing robust predictive methods across a wide spectrum of genomic variability.

Availability and implementation: Scripts created for the experimentation are available at http://m4m.ugr.es/SupInfo/sRNAOS/sRNAOSscripts.zip.

Contact: delval@decsai.ugr.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Arnedo, J., Romero-Zaliz, R., Zwir, I., del Val, C.
Posted: October 3, 2014, 1:33 pm

Summary: Targeting peptides are N-terminal sorting signals in proteins that promote their translocation to mitochondria through the interaction with different protein machineries. We recently developed TPpred, a machine learning-based method that scores among the best available for predicting the presence of a targeting peptide in a protein sequence and its cleavage site. Here we introduce TPpred2, which improves on TPpred in the task of identifying the cleavage site of the targeting peptide. TPpred2 is now available as a web interface and as a stand-alone version that users can freely download and adopt for processing large volumes of sequences.

Availability and implementation: TPpred2 is available both as a web server and as a stand-alone version at http://tppred2.biocomp.unibo.it.

Contact: gigi@biocomp.unibo.it

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Savojardo, C., Martelli, P. L., Fariselli, P., Casadio, R.
Posted: October 3, 2014, 1:33 pm

Motivation: Studies of the biochemical functions and activities of uncultivated microorganisms in the environment require analysis of DNA sequences for phylogenetic characterization and for the development of sequence-based assays for the detection of microorganisms. The numbers of sequences for genes that are indicators of environmentally important functions such as nitrogen (N2) fixation have been rapidly growing over the past few decades. Obtaining these sequences from the National Center for Biotechnology Information’s GenBank database is problematic because of annotation errors, nomenclature variation and paralogues; moreover, GenBank’s structure and tools are not conducive to searching solely by function. For some genes, such as the nifH gene commonly used to assess community potential for N2 fixation, manual collection and curation are becoming intractable because of the large number of sequences in GenBank and the large number of highly similar paralogues. If analysis is to keep pace with sequence discovery, an automated retrieval and curation system is necessary.

Results: ARBitrator uses a two-step process: a broad collection of potential homologues, followed by screening with a best-hit strategy against conserved domains. A total of 34 420 nifH sequences had been identified in GenBank as of November 20, 2012. The false-positive rate is ~0.033%. ARBitrator rapidly updates a public nifH sequence database, and we show that it can be adapted to other genes.

Availability and implementation: Java source and executable code are freely available to non-commercial users at http://pmc.ucsc.edu/~wwwzehr/research/database/.

Contact: zehrj@ucsc.edu

Supplementary information: Supplementary information is available at Bioinformatics online.

Author: Heller, P., Tripp, H. J., Turk-Kubo, K., Zehr, J. P.
Posted: October 3, 2014, 1:33 pm

Summary: Unraveling transcriptional circuits controlling embryonic stem cell maintenance and fate has great potential for improving our understanding of normal development as well as disease. To facilitate this, we have developed a novel web tool called ‘TRES’ that predicts the likely upstream regulators for a given gene list. This is achieved by integrating transcription factor (TF) binding events from 187 ChIP-sequencing and ChIP-on-chip datasets in murine and human embryonic stem (ES) cells with over 1000 mammalian TF sequence motifs. Using 114 TF perturbation gene sets, as well as 115 co-expression clusters in ES cells, we validate the utility of this approach.

Availability and implementation: TRES is freely available at http://www.tres.roslin.ed.ac.uk.

Contact: Anagha.Joshi@roslin.ed.ac.uk or bg200@cam.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Pooley, C., Ruau, D., Lombard, P., Gottgens, B., Joshi, A.
Posted: October 3, 2014, 1:33 pm

Motivation: Structural information of macromolecular complexes provides key insights into the way they carry out their biological functions. The reconstruction process leading to the final 3D map requires an approximate initial model. Generation of an initial model is still an open and challenging problem in single-particle analysis.

Results: We present a fast and efficient approach to obtain a reliable, low-resolution estimation of the 3D structure of a macromolecule, without any a priori knowledge, addressing the well-known issue of initial volume estimation in the field of single-particle analysis. The input of the algorithm is a set of class average images obtained from individual projections of a biological object at random and unknown orientations in transmission electron microscopy micrographs. The proposed method is based on an initial non-linear dimensionality reduction step, which automatically selects small representative sets of class average images that capture most of the structural information of the particle under study. These reduced sets are then used to generate volumes from random orientation assignments. The best volume is determined from these guesses using a random sample consensus (RANSAC) approach. We have tested the proposed algorithm, which we term 3D-RANSAC, with simulated and experimental data, obtaining satisfactory results under the low signal-to-noise conditions typical of cryo-electron microscopy.
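The consensus idea behind RANSAC is easier to see on a toy problem than on 3D volumes. The sketch below, a plain illustration and not the 3D-RANSAC implementation, fits a line to data contaminated by gross outliers: propose models from random minimal samples and keep the model with the largest consensus set.

```python
import random

def ransac_line(points, n_iter=200, tol=0.5, seed=1):
    """Classic RANSAC: propose models from random minimal samples and keep
    the one with the largest consensus set (inliers within `tol`)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot define a line
        a = (y2 - y1) / (x2 - x1)   # slope
        b = y1 - a * x1             # intercept
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < tol]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus two gross outliers
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40.0), (7, -25.0)]
model, inliers = ransac_line(pts)
```

In 3D-RANSAC the "models" are candidate volumes generated from random orientation assignments, and consensus is measured against the class averages rather than against 2D points.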

Availability: The algorithm is freely available as part of the Xmipp 3.1 package [http://xmipp.cnb.csic.es].

Contact: jvargas@cnb.csic.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Vargas, J., Alvarez-Cabrera, A.-L., Marabini, R., Carazo, J. M., Sorzano, C. O. S.
Posted: October 3, 2014, 1:33 pm

Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite great efforts in the evaluation of variant calling methods.

Results: We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10–15 kb, but the error rate of post-filtered calls is reduced to 1 in 100–200 kb without significantly compromising sensitivity.

Availability and implementation: BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data are at https://github.com/lh3/varcmp.

Contact: hengli@broadinstitute.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Li, H.
Posted: October 3, 2014, 1:33 pm

Liquid chromatography coupled to mass spectrometry (LC/MS) has become widely used in metabolomics. Several artefacts arise during the acquisition step of large LC/MS metabolomics experiments, including ion suppression, carryover and changes in sensitivity and intensity, and several sources have been identified as responsible for these effects. Among them, drift in peak intensity is one of the most frequent and may even constitute the main source of variance in the data, leading to misleading statistical results when the samples are analysed. In this article, we propose a methodology based on a common variance analysis applied before data normalization to address this issue. The methodology was tested and compared with four other methods by calculating the Dunn and Silhouette indices of the quality control classes, and it performed better than any of the other four. To our knowledge, this is the first time such an approach has been applied in the metabolomics context.
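To make the drift problem concrete, the sketch below shows a much simpler stand-in than the common-variance approach of intCor: per-feature linear detrending over injection order, estimated from repeated quality-control (QC) injections. The function and data layout are illustrative assumptions, not the package's API.

```python
def correct_drift(intensities, order, qc_idx):
    """Fit a least-squares line to the QC injections (injection order vs
    intensity) and subtract that trend from every sample, preserving the
    overall mean level. A toy stand-in for intensity drift correction."""
    xs = [order[i] for i in qc_idx]
    ys = [intensities[i] for i in qc_idx]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    level = sum(intensities) / len(intensities)
    return [y - (slope * t + intercept) + level
            for y, t in zip(intensities, order)]

order = list(range(10))
raw = [100.0 + 2.0 * t for t in order]       # pure linear drift, no signal
flat = correct_drift(raw, order, qc_idx=list(range(10)))
```

After correction, a feature whose variation was purely drift becomes flat across the run, so downstream statistics no longer pick up injection order as a dominant factor.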

Availability and implementation: The source code of the methods is available as the R package intCor at http://b2slab.upc.edu/software-and-downloads/intensity-drift-correction/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Fernandez-Albert, F., Llorach, R., Garcia-Aloy, M., Ziyatdinov, A., Andres-Lacueva, C., Perera, A.
Posted: October 3, 2014, 1:33 pm

Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation.

Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the development of efficient software. The library is based on a recent optimized de Bruijn graph implementation that allows complex genomes to be processed on desktop computers using fast algorithms with low memory footprints.
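The k-mer structure underlying a de Bruijn graph can be illustrated in a few lines; this is only the textbook construction in Python, whereas GATB itself is an optimized C++ implementation with compact data structures.

```python
from collections import defaultdict

def debruijn_graph(reads, k):
    """Textbook de Bruijn graph: nodes are (k-1)-mers; each k-mer in a read
    adds a directed edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Two overlapping reads collapse onto one path through the graph
g = debruijn_graph(["ACGTG", "CGTGA"], k=3)
```

Walking the resulting edges (AC→CG→GT→TG→GA) recovers the assembled sequence, which is the basic operation assemblers build on.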

Availability and implementation: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license.

Contact: lavenier@irisa.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Drezen, E., Rizk, G., Chikhi, R., Deltel, C., Lemaitre, C., Peterlongo, P., Lavenier, D.
Posted: October 3, 2014, 1:33 pm

Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available.

Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ² association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
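The core of Gaussian imputation from summary statistics is a conditional multivariate normal expectation: z-scores at untyped SNPs are predicted from typed ones through reference-panel LD. The sketch below shows that formula with a simple ridge term; the function name and the regularization constant are illustrative assumptions, and the published method's treatment of the finite panel size is more refined.

```python
import numpy as np

def impute_z(z_typed, R_uo, R_oo, lam=0.1):
    """E[z_untyped | z_typed] = R_uo (R_oo + lam*I)^(-1) z_typed, where R_uo
    and R_oo are LD (correlation) blocks from a reference panel and `lam`
    regularizes for the panel's finite sample size (assumed constant)."""
    n = R_oo.shape[0]
    w = np.linalg.solve(R_oo + lam * np.eye(n), z_typed)
    return R_uo @ w

# With uncorrelated typed SNPs and lam = 0, an untyped SNP in perfect LD
# with the first typed SNP simply inherits its z-score.
z_imp = impute_z(np.array([3.0, -1.0]), np.array([[1.0, 0.0]]), np.eye(2), lam=0.0)
```

The quality of the imputation therefore degrades gracefully with LD strength: the weaker the correlation between untyped and typed SNPs, the more the imputed z-score shrinks toward zero.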

Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/.

Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu

Supplementary information: Supplementary materials are available at Bioinformatics online.

Author: Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., Hirschhorn, J., Strachan, D. P., Patterson, N., Price, A. L.
Posted: October 3, 2014, 1:33 pm

Non-invasive prenatal testing (NIPT) of fetal aneuploidy using cell-free fetal DNA is becoming part of routine clinical practice. RAPIDR (Reliable Accurate Prenatal non-Invasive Diagnosis R package) is an easy-to-use open-source R package that implements several published NIPT analysis methods. The input to RAPIDR is a set of sequence alignment files in the BAM format, and the outputs are calls for aneuploidy, including trisomies 13, 18, 21 and monosomy X as well as fetal sex. RAPIDR has been extensively tested with a large sample set as part of the RAPID project in the UK. The package contains quality control steps to make it robust for use in the clinical setting.

Availability and implementation: RAPIDR is implemented in R and can be freely downloaded via CRAN from here: http://cran.r-project.org/web/packages/RAPIDR/index.html.

Contact: kitty.lo@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Lo, K. K., Boustred, C., Chitty, L. S., Plagnol, V.
Posted: October 3, 2014, 1:33 pm

Motivation: Unique modeling and computational challenges arise in locating the geographic origin of individuals based on their genetic backgrounds. Single-nucleotide polymorphisms (SNPs) vary widely in informativeness, allele frequencies change non-linearly with geography and reliable localization requires evidence to be integrated across a multitude of SNPs. These problems become even more acute for individuals of mixed ancestry. It is hardly surprising that matching genetic models to computational constraints has limited the development of methods for estimating geographic origins. We attack these related problems by borrowing ideas from image processing and optimization theory. Our proposed model divides the region of interest into pixels and operates SNP by SNP. We estimate allele frequencies across the landscape by maximizing a product of binomial likelihoods penalized by nearest neighbor interactions. Penalization smooths allele frequency estimates and promotes estimation at pixels with no data. Maximization is accomplished by a minorize–maximize (MM) algorithm. Once allele frequency surfaces are available, one can apply Bayes’ rule to compute the posterior probability that each pixel is the pixel of origin of a given person. Placement of admixed individuals on the landscape is more complicated and requires estimation of the fractional contribution of each pixel to a person’s genome. This estimation problem also succumbs to a penalized MM algorithm.
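The final localization step described above, applying Bayes' rule over pixels once allele frequency surfaces are available, can be sketched directly. The data layout (`freqs[p][s]` for pixel p, SNP s) and the uniform prior are illustrative assumptions; estimation of the frequency surfaces themselves (the penalized MM part) is not shown.

```python
import math

def pixel_posterior(genotypes, freqs, prior=None):
    """Posterior over pixels of origin: Binomial(2, f) genotype likelihoods
    per SNP combined with a (uniform by default) prior via Bayes' rule.
    freqs[p][s] is the allele frequency estimate at pixel p, SNP s."""
    n_pix = len(freqs)
    prior = prior or [1.0 / n_pix] * n_pix
    log_post = []
    for p in range(n_pix):
        ll = math.log(prior[p])
        for g, f in zip(genotypes, freqs[p]):
            ll += (math.log(math.comb(2, g))
                   + g * math.log(f) + (2 - g) * math.log(1.0 - f))
        log_post.append(ll)
    m = max(log_post)                      # log-sum-exp for stability
    w = [math.exp(l - m) for l in log_post]
    s = sum(w)
    return [x / s for x in w]

# A homozygous-alternate genotype at two SNPs strongly favours the pixel
# where the allele is common (f = 0.9) over the one where it is rare.
post = pixel_posterior([2, 2], freqs=[[0.9, 0.9], [0.1, 0.1]])
```

Working in log space is essential in practice: with thousands of SNPs the raw likelihood products underflow long before the posterior ratio becomes uninformative.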

Results: We applied the model to the Population Reference Sample (POPRES) data. The model gives better localization for both unmixed and admixed individuals than existing methods despite using just a small fraction of the available SNPs. Computing times are comparable with the best competing software.

Availability and implementation: Software will be freely available as the OriGen package in R.

Contact: ranolaj@uw.edu or klange@ucla.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ranola, J. M., Novembre, J., Lange, K.
Posted: October 3, 2014, 1:33 pm

Motivation: Identification of single nucleotide polymorphisms that are associated with a phenotype in more than one study is of great scientific interest in genome-wide association studies (GWAS) research. The empirical Bayes approach for discovering whether results have been replicated across studies was shown to be a reliable method, and close to optimal in terms of power.

Results: The R package repfdr provides a flexible implementation of the empirical Bayes approach for replicability analysis and meta-analysis, to be used when several studies examine the same set of null hypotheses. The usefulness of the package for the GWAS community is discussed.

Availability and implementation: The R package repfdr can be downloaded from CRAN.

Contact: ruheller@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Heller, R., Yaacoby, S., Yekutieli, D.
Posted: October 3, 2014, 1:33 pm

Motivation: The emergence of network medicine not only offers more opportunities for better and more complete understanding of the molecular complexities of diseases, but also serves as a promising tool for identifying new drug targets and establishing new relationships among diseases that enable drug repositioning. Computational approaches for drug repositioning by integrating information from multiple sources and multiple levels have the potential to provide great insights to the complex relationships among drugs, targets, disease genes and diseases at a system level.

Results: In this article, we have proposed a computational framework based on a heterogeneous network model and applied the approach on drug repositioning by using existing omics data about diseases, drugs and drug targets. The novelty of the framework lies in the fact that the strength between a disease–drug pair is calculated through an iterative algorithm on the heterogeneous graph that also incorporates drug-target information. Comprehensive experimental results show that the proposed approach significantly outperforms several recent approaches. Case studies further illustrate its practical usefulness.
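The abstract does not spell out the iterative algorithm, so the sketch below uses a generic random-walk-with-restart propagation, a common choice for scoring disease–drug proximity on heterogeneous networks. It is an illustration of the idea, not the authors' method; node names and parameters are invented.

```python
def propagate(adj, seeds, alpha=0.5, n_iter=50):
    """Random walk with restart on a graph {node: set_of_neighbours}:
    at each step, mass `alpha` returns to the seed nodes and mass
    (1 - alpha) diffuses along edges, weighted by neighbour degree."""
    restart = {v: (1.0 if v in seeds else 0.0) for v in adj}
    score = dict(restart)
    for _ in range(n_iter):
        score = {v: alpha * restart[v]
                    + (1 - alpha) * sum(score[u] / len(adj[u]) for u in adj[v])
                 for v in adj}
    return score

# A drug-target-disease chain: the disease linked (via a target) to the
# seeded drug receives a nonzero, distance-decayed score.
adj = {"drug": {"target"}, "target": {"drug", "disease"}, "disease": {"target"}}
score = propagate(adj, seeds={"drug"})
```

The heterogeneous aspect enters through the edge set: drug–target, target–gene and gene–disease edges all live in one graph, so the iteration naturally integrates the different data sources.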

Availability and implementation: http://cbc.case.edu

Contact: jingli@cwru.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Wang, W., Yang, S., Zhang, X., Li, J.
Posted: October 3, 2014, 1:33 pm

Summary: Here we introduce ccSOL omics, a web server for large-scale calculations of protein solubility. Our method allows (i) proteome-wide predictions; (ii) identification of soluble fragments within each sequence; (iii) exhaustive single-point mutation analysis.

Results: Using coil/disorder, hydrophobicity, hydrophilicity, β-sheet and α-helix propensities, we built a predictor of protein solubility. Our approach shows an accuracy of 79% on the training set (36 990 Target Track entries). Validation on three independent sets indicates that ccSOL omics discriminates soluble and insoluble proteins with an accuracy of 74% on 31 760 proteins sharing <30% sequence similarity.

Availability and implementation: ccSOL omics can be freely accessed on the web at http://s.tartaglialab.com/page/ccsol_group. Documentation and tutorial are available at http://s.tartaglialab.com/static_files/shared/tutorial_ccsol_omics.html.

Contact: gian.tartaglia@crg.es

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Agostini, F., Cirillo, D., Livi, C. M., Delli Ponti, R., Tartaglia, G. G.
Posted: October 3, 2014, 1:33 pm

Motivation: Biological network alignment aims to identify similar regions between networks of different species. Existing methods compute node similarities to rapidly identify, from the possible alignments, those scoring highest with respect to overall node similarity. However, alignment accuracy is then evaluated with a measure different from the node similarity used to construct the alignments; typically, one measures the amount of conserved edges. Thus, existing methods align similar nodes between networks hoping to conserve many edges (after the alignment is constructed!).

Results: Instead, we introduce MAGNA to directly ‘optimize’ edge conservation while the alignment is constructed, without decreasing the quality of node mapping. MAGNA uses a genetic algorithm and our novel function for ‘crossover’ of two ‘parent’ alignments into a superior ‘child’ alignment to simulate a ‘population’ of alignments that ‘evolves’ over time; the ‘fittest’ alignments survive and proceed to the next ‘generation’, until the alignment accuracy cannot be optimized further. While we optimize our new and superior measure of the amount of conserved edges, MAGNA can optimize any alignment accuracy measure, including a combined measure of both node and edge conservation. In systematic evaluations against state-of-the-art methods (IsoRank, MI-GRAAL and GHOST), on both synthetic networks and real-world biological data, MAGNA outperforms all of the existing methods, in terms of both node and edge conservation as well as both topological and biological alignment accuracy.
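The simplest version of the edge conservation objective MAGNA optimizes is the fraction of edges mapped onto edges; a plain sketch (MAGNA's actual measures, including its combined node-and-edge variants, are richer than this):

```python
def edge_conservation(edges_g1, edges_g2, alignment):
    """Fraction of G1 edges whose endpoints map, under `alignment`
    (a dict G1-node -> G2-node), onto an edge of G2."""
    e2 = {frozenset(e) for e in edges_g2}
    hits = sum(frozenset((alignment[u], alignment[v])) in e2
               for u, v in edges_g1)
    return hits / len(edges_g1)

# One of G1's two edges lands on a G2 edge under this node mapping
s = edge_conservation([(1, 2), (2, 3)], [("a", "b")], {1: "a", 2: "b", 3: "c"})
```

In the genetic algorithm, this score plays the role of fitness: crossover combines two parent alignments, and children that conserve more edges survive into the next generation.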

Availability: Software: http://nd.edu/~cone/MAGNA

Contact: tmilenko@nd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Saraph, V., Milenković, T.
Posted: October 3, 2014, 1:33 pm

Summary: Today's graphics processing units (GPUs) compose the scene from individual triangles. As about 320 triangles are needed to approximate a single sphere—an atom—in a convincing way, visualizing larger proteins with atomic details requires tens of millions of triangles, far too many for smooth interactive frame rates. We describe a new approach to solve this ‘molecular graphics problem’, which shares the work between GPU and multiple CPU cores, generates high-quality results with perfectly round spheres, shadows and ambient lighting and requires only OpenGL 1.0 functionality, without any pixel shader Z-buffer access (a feature which is missing in most mobile devices).
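The per-pixel mathematics that replaces those ~320 triangles is the analytic ray–sphere intersection: each screen pixel casts a ray and solves a quadratic against the atom's sphere. The Python below is for illustration only (YASARA's renderer is GPU/CPU code, and this sketch assumes a unit-length ray direction):

```python
import math

def ray_sphere(origin, direction, center, radius):
    """Nearest intersection distance of a ray (unit `direction`) with a
    sphere, or None on a miss -- the per-pixel test behind sphere impostors."""
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    dx, dy, dz = direction
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * c            # quadratic coefficient a == 1 here
    if disc < 0:
        return None                   # the ray misses the sphere
    t = (-b - math.sqrt(disc)) / 2.0  # nearer of the two roots
    return t if t >= 0 else None

t_hit = ray_sphere((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0)   # hits at z = 4
t_miss = ray_sphere((0, 0, 0), (0, 0, 1), (10, 0, 5), 1.0)
```

Because the silhouette is computed analytically per pixel, the sphere is perfectly round at any zoom level, with no tessellation cost growing with the number of atoms.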

Availability and implementation: YASARA View, a molecular modeling program built around the visualization algorithm described here, is freely available (including commercial use) for Linux, MacOS, Windows and Android (Intel) from www.YASARA.org.

Contact: elmar@yasara.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Krieger, E., Vriend, G.
Posted: October 3, 2014, 1:33 pm

Motivation: Peak detection is a key step in the preprocessing of untargeted metabolomics data generated from high-resolution liquid chromatography-mass spectrometry (LC/MS). The common practice is to use filters with predetermined parameters to select peaks in the LC/MS profile. This rigid approach can cause suboptimal performance when the choice of peak model and parameters do not suit the data characteristics.

Results: Here we present a method that learns directly from various data features of the extracted ion chromatograms (EICs) to differentiate true peak regions from noise regions in the LC/MS profile. It utilizes knowledge of known metabolites, as well as robust machine learning approaches. Unlike currently available methods, this new approach does not assume a parametric peak shape model and allows maximum flexibility. We demonstrate the superiority of the new approach using real data. Because matching to known metabolites entails uncertainties and cannot be considered a gold standard, we also developed a probabilistic receiver-operating characteristic (pROC) approach that can incorporate uncertainties.
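One way to picture ROC analysis under uncertain labels, the situation the pROC approach addresses, is to weight every ordered pair of items by its probability of being a (positive, negative) pair. This is a hypothetical simplification for intuition, not the authors' pROC formulation:

```python
def prob_auc(scores, p_true):
    """ROC AUC when labels are probabilistic: each ordered pair (i, j)
    counts with weight P(i positive) * P(j negative); ties score 0.5."""
    num = den = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if i == j:
                continue
            w = p_true[i] * (1.0 - p_true[j])
            den += w
            if scores[i] > scores[j]:
                num += w
            elif scores[i] == scores[j]:
                num += 0.5 * w
    return num / den

auc = prob_auc([3.0, 2.0, 1.0], [1.0, 0.0, 0.0])   # certain labels: plain AUC
```

With certain labels (probabilities 0 or 1) the weighting reduces to the ordinary AUC, so the probabilistic version is a strict generalization.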

Availability and implementation: The new peak detection approach is implemented as part of the apLCMS package available at http://web1.sph.emory.edu/apLCMS/

Contact: tyu8@emory.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Yu, T., Jones, D. P.
Posted: October 3, 2014, 1:33 pm

Summary: Several methods and computational tools have been developed to design novel metabolic pathways. A major challenge is evaluating the metabolic efficiency of the designed pathways in the host organism. Here we present FindPath, a unified system to predict and rank possible pathways according to their metabolic efficiency in the cellular system. This tool uses a chemical reaction database to generate possible metabolic pathways and exploits constraint-based models (CBMs) to identify the most efficient synthetic pathway to achieve the desired metabolic function in a given host microorganism. FindPath can be used with common tools for CBM manipulation and uses the standard SBML format for both input and output files.

Availability and implementation: http://metasys.insa-toulouse.fr/software/findpath/.

Contact: heux@insa-toulouse.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Vieira, G., Carnicer, M., Portais, J.-C., Heux, S.
Posted: October 3, 2014, 1:33 pm

Motivation: Several technical challenges in metagenomic data analysis, including assembling metagenomic sequence data or identifying operational taxonomic units, are both significant and well known. These forms of analysis are increasingly cited as conceptually flawed, given the extreme variation within traditionally defined species and rampant horizontal gene transfer. Furthermore, computational requirements of such analysis have hindered content-based organization of metagenomic data at large scale.

Results: In this article, we introduce the Amordad database engine for alignment-free, content-based indexing of metagenomic datasets. Amordad places the metagenome comparison problem in a geometric context, and uses an indexing strategy that combines random hashing with a regular nearest neighbor graph. This framework allows refinement of the database over time by continual application of random hash functions, with the effect of each hash function encoded in the nearest neighbor graph. This eliminates the need to explicitly maintain the hash functions in order for query efficiency to benefit from the accumulated randomness. Results on real and simulated data show that Amordad can support logarithmic query time for identifying similar metagenomes even as the database size reaches into the millions.
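The random hashing ingredient can be illustrated with random-hyperplane locality-sensitive hashing, one standard family in which nearby vectors collide with high probability; the abstract does not specify Amordad's exact hash family, so this is a representative sketch only.

```python
import random

def hyperplane_hash(vec, n_bits=8, seed=0):
    """Random-hyperplane LSH: each bit is the sign of the dot product with
    a fixed random Gaussian direction, so similar vectors share most bits."""
    rng = random.Random(seed)
    bits = 0
    for _ in range(n_bits):
        plane = [rng.gauss(0.0, 1.0) for _ in vec]
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | int(dot >= 0)
    return bits

# Positive scaling never flips a sign, so these two vectors collide
h1 = hyperplane_hash([1.0, 2.0, 3.0])
h2 = hyperplane_hash([2.0, 4.0, 6.0])
```

Amordad's refinement idea is that each newly applied hash function's effect is folded into the nearest neighbor graph, so the functions themselves need not be retained for queries.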

Availability and implementation: Source code, licensed under the GNU general public license (version 3) is freely available for download from http://smithlabresearch.org/amordad

Contact: andrewds@usc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Behnam, E., Smith, A. D.
Posted: October 3, 2014, 1:33 pm

Summary: We present a new C implementation of an advanced Markov chain Monte Carlo (MCMC) method for sampling ordinary differential equation (ODE) model parameters. The software mcmc_clib uses the simplified manifold Metropolis-adjusted Langevin algorithm (SMMALA), which is locally adaptive; it uses the parameter manifold’s geometry (the Fisher information) to make efficient moves. This adaptation does not diminish with chain length, which is highly advantageous compared with adaptive Metropolis techniques when the parameters have large correlations and/or posteriors that differ substantially from multivariate Gaussians. The software is standalone (not a toolbox), though dependencies include the GNU Scientific Library and the SUNDIALS libraries for ODE integration and sensitivity analysis.
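The flavour of a metric-preconditioned Langevin proposal can be sketched in a few lines. This shows only the drift-plus-shaped-noise step with a fixed metric; full SMMALA evaluates the metric at the current position and, unlike full manifold MALA, drops the curvature correction terms. Names and defaults here are illustrative, not mcmc_clib's API.

```python
import numpy as np

def mmala_proposal(theta, grad_log_post, metric, eps=0.5, rng=None):
    """One Langevin proposal preconditioned by the inverse metric G
    (e.g. the Fisher information): drift 0.5*eps^2*G^-1*grad, noise with
    covariance eps^2*G^-1. Curvature terms of full SMMALA are omitted."""
    rng = rng or np.random.default_rng(0)
    G_inv = np.linalg.inv(metric)
    mean = theta + 0.5 * eps ** 2 * G_inv @ grad_log_post(theta)
    return rng.multivariate_normal(mean, eps ** 2 * G_inv)

# Standard-normal target: grad log p = -theta, Fisher information = I
step = mmala_proposal(np.zeros(2), lambda t: -t, np.eye(2), eps=0.001)
```

When parameters are strongly correlated, the Fisher information stretches the proposal along the posterior's principal axes, which is exactly where plain Metropolis random walks waste their moves.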

Availability and implementation: The source code and binary files are freely available for download at http://a-kramer.github.io/mcmc_clib/. This also includes example files and data. A detailed documentation, an example model and user manual are provided with the software.

Contact: andrei.kramer@ist.uni-stuttgart.de

Author: Kramer, A., Stathopoulos, V., Girolami, M., Radde, N.
Posted: October 3, 2014, 1:33 pm