Bioinformatics Services

Bioinformatics Services (Company):

Forensic Bioinformatics – Link

A to Z of Bioinformatics Services, EBI – Link

Scionics Computer Innovation GmbH – Link

BIDMC Genomics and Proteomics Center – Link

Expression Analysis – Link

Sequentix – Link

Craic Computing – Link

Bioinformatics - recent issues

Bioinformatics - RSS feed of recent issues (covers the latest 3 issues, including the current issue)

Summary: I present MR_predictor, a simulation engine designed to guide the development and interpretation of statistical tests of causality between phenotypes using genetic instruments. MR_predictor provides a framework to model either individual traits or complex scenarios where multiple phenotypes are correlated or dependent on each other. Crucially, MR_predictor can incorporate the effects of multiple biallelic loci (linked or unlinked) contributing genotypic variability to one or more simulated phenotypes. The software has a range of options for sample generation, and output files generated by MR_predictor port into commonly used analysis tools (e.g. PLINK, R), facilitating analyses germane for Mendelian Randomization studies. Benchmarks for speed and power calculations for summary statistic-based Mendelian Randomization analyses are presented and compared with analytical expectation.

Availability and implementation: The simulation engine is implemented in PERL, and the associated scripts can be downloaded from github.com. Online documentation, a tutorial and example datasets are available at http://coruscant.itmat.upenn.edu/mr_predictor.

Contact: bvoight@upenn.edu

Supplementary information: Supplementary derivations are available at Bioinformatics online.

Author: Voight, B. F.
Posted: November 25, 2014, 7:34 am

Summary: The Illumina 450k array is a frequently used platform for large-scale genome-wide DNA methylation studies, i.e. epigenome-wide association studies. Currently, quality control of 450k data can be performed with Illumina’s GenomeStudio and is part of a limited number of 450k analysis pipelines. However, GenomeStudio cannot handle large-scale studies, and existing pipelines provide limited options for quality control and do not support interactive exploration by the user.

To make the detection of bad-quality samples in large-scale genome-wide DNA methylation studies as flexible and transparent as possible, we have developed MethylAid, a visual and interactive Web application using RStudio’s shiny package. Bad-quality samples are detected using sample-dependent and sample-independent quality control probes present on the array and user-adjustable thresholds. In-depth exploration of bad-quality samples can be performed using several interactive diagnostic plots. Furthermore, plots can be annotated with user-provided metadata, for example, to identify outlying batches. Our new tool makes quality assessment of 450k array data interactive, flexible and efficient and is, therefore, expected to be useful for both data analysts and core facilities.

Availability and implementation: MethylAid is implemented as an R/Bioconductor package (www.bioconductor.org/packages/3.0/bioc/html/MethylAid.html). A demo application is available from shiny.bioexp.nl/MethylAid.

Contact: m.van_iterson@lumc.nl

Author: van Iterson, M., Tobi, E. W., Slieker, R. C., den Hollander, W., Luijk, R., Slagboom, P. E., Heijmans, B. T.
Posted: November 25, 2014, 7:34 am

Summary: We present a tool, diCal-IBD, for detecting identity-by-descent (IBD) tracts between pairs of genomic sequences. Our method builds on a recent demographic inference method based on the coalescent with recombination, and is able to incorporate demographic information as a prior. A simulation study shows that diCal-IBD has significantly higher recall and precision than existing single-nucleotide polymorphism–based IBD detection methods, while retaining reasonable accuracy for IBD tracts as small as 0.1 cM.

Availability: http://sourceforge.net/projects/dical-ibd

Contact: yss@eecs.berkeley.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Tataru, P., Nirody, J. A., Song, Y. S.
Posted: November 25, 2014, 7:34 am

Motivation: Efficient simulation of population genetic samples under a given demographic model is a prerequisite for many analyses. Coalescent theory provides an efficient framework for such simulations, but simulating longer regions and higher recombination rates remains challenging. Simulators based on a Markovian approximation to the coalescent scale well, but do not support simulation of selection. Gene conversion is not supported by any published coalescent simulators that support selection.

Results: We describe cosi2, an efficient simulator that supports both exact and approximate coalescent simulation with positive selection. cosi2 improves on the speed of existing exact simulators, and permits further speedup in approximate mode while retaining support for selection. cosi2 supports a wide range of demographic scenarios, including recombination hot spots, gene conversion, population size changes, population structure and migration.

cosi2 implements coalescent machinery efficiently by tracking only a small subset of the Ancestral Recombination Graph, sampling only relevant recombination events, and using augmented skip lists to represent tracked genetic segments. To preserve support for selection in approximate mode, the Markov approximation is implemented not by moving along the chromosome but by performing a standard backwards-in-time coalescent simulation while restricting coalescence to node pairs with overlapping or near-overlapping genetic material. We describe the algorithms used by cosi2 and present comparisons with existing selection simulators.

Availability and implementation: A free C++ implementation of cosi2 is available at http://broadinstitute.org/mpg/cosi2.

Contact: ilya@broadinstitute.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Shlyakhter, I., Sabeti, P. C., Schaffner, S. F.
Posted: November 25, 2014, 7:34 am

Motivation: Experimental reproducibility is fundamental to the progress of science. Irreproducible research decreases the efficiency of basic biological research and drug discovery and impedes experimental data reuse. A major contributing factor to irreproducibility is difficulty in interpreting complex experimental methodologies and designs from written text and in assessing variations among different experiments. Current bioinformatics initiatives are focused either on computational research reproducibility (i.e. data analysis) or on laboratory information management systems. Here, we present a software tool, ProtocolNavigator, which addresses the largely overlooked challenges of interpretation and assessment. It provides a biologist-friendly open-source emulation-based tool for designing, documenting and reproducing biological experiments.

Availability and implementation: ProtocolNavigator was implemented in Python 2.7, using the wx module to build the graphical user interface. It is platform-independent software and is freely available from http://protocolnavigator.org/index.html under the GPL v2 license.

Contact: wpciak@cf.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Khan, I. A., Fraser, A., Bray, M.-A., Smith, P. J., White, N. S., Carpenter, A. E., Errington, R. J.
Posted: November 25, 2014, 7:34 am

Summary: Cordova is an out-of-the-box solution for building and maintaining an online database of genetic variations integrated with pathogenicity prediction results from popular algorithms. Our primary motivation for developing this system is to aid researchers and clinician–scientists in determining the clinical significance of genetic variations. To achieve this goal, Cordova provides an interface to review and manually or computationally curate genetic variation data as well as share it for clinical diagnostics and the advancement of research.

Availability and implementation: Cordova is open source under the MIT license and is freely available for download at https://github.com/clcg/cordova.

Contact: sean.ephraim@gmail.com or terry-braun@uiowa.edu

Author: Ephraim, S. S., Anand, N., DeLuca, A. P., Taylor, K. R., Kolbe, D. L., Simpson, A. C., Azaiez, H., Sloan, C. M., Shearer, A. E., Hallier, A. R., Casavant, T. L., Scheetz, T. E., Smith, R. J. H., Braun, T. A.
Posted: November 25, 2014, 7:34 am

Summary: MicroRNAs (miRNAs) represent an important class of small non-coding RNAs regulating gene expression in eukaryotes. Present algorithms typically rely on genomic data to identify miRNAs and require extensive installation procedures. Niche model organisms lacking genomic sequences cannot be analyzed by such tools. Here we introduce the MIRPIPE application enabling rapid and simple browser-based miRNA homology detection and quantification. MIRPIPE features automatic trimming of raw RNA-Seq reads originating from various sequencing instruments, processing of isomiRs and quantification of detected miRNAs versus public- or user-uploaded reference databases.

Availability and implementation: The Web service is freely available at http://bioinformatics.mpi-bn.mpg.de. MIRPIPE was implemented in Perl and integrated into Galaxy. An offline version for local execution is also available from our Web site.

Contact: Mario.Looso@mpi-bn.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Kuenne, C., Preussner, J., Herzog, M., Braun, T., Looso, M.
Posted: November 25, 2014, 7:33 am

Motivation: Genome-wide association studies (GWAS) have identified many loci implicated in disease susceptibility. Integration of GWAS summary statistics (P-values) and functional genomic datasets should help to elucidate mechanisms.

Results: We extended a non-parametric SNP set enrichment method to test for enrichment of GWAS signals in functionally defined loci to a situation where only GWAS P-values are available. The approach is implemented in VSEAMS, a freely available software pipeline. We use VSEAMS to identify enrichment of type 1 diabetes (T1D) GWAS associations near genes that are targets for the transcription factors IKZF3, BATF and ESRRA. IKZF3 lies in a known T1D susceptibility region, while BATF and ESRRA overlap other immune disease susceptibility regions, validating our approach and suggesting novel avenues of research for T1D.

Availability and implementation: VSEAMS is available for download (http://github.com/ollyburren/vseams).

Contact: chris.wallace@cimr.cam.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Burren, O. S., Guo, H., Wallace, C.
Posted: November 25, 2014, 7:33 am

Motivation: It is commonly assumed in pattern recognition that cross-validation error estimation is ‘almost unbiased’ as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.

Results: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an ‘almost unbiased’ theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
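The bias described above can be illustrated with simple arithmetic. The sketch below is not the authors' estimator; it only contrasts the classical pooled error rate with one that weights class-specific error rates by population priors that are assumed to be known. The function names, class labels, counts and priors are all hypothetical.

```python
def pooled_error(errors_by_class, sizes_by_class):
    """Classical estimate: misclassified held-out points over all points."""
    return sum(errors_by_class.values()) / sum(sizes_by_class.values())


def prior_weighted_error(errors_by_class, sizes_by_class, priors):
    """Weight each class-specific error rate by its assumed population prior."""
    return sum(priors[c] * errors_by_class[c] / sizes_by_class[c]
               for c in priors)


# Hypothetical study: cases and controls sampled separately, 50 each,
# although cases are assumed to make up only 10% of the population.
sizes = {"case": 50, "control": 50}
errors = {"case": 20, "control": 5}      # held-out misclassifications
priors = {"case": 0.1, "control": 0.9}   # assumed known, not estimated

pooled = pooled_error(errors, sizes)
weighted = prior_weighted_error(errors, sizes, priors)
print(f"pooled: {pooled:.2f}, prior-weighted: {weighted:.2f}")
```

With these made-up numbers the pooled estimate implicitly treats the classes as equally likely, so it differs from the prior-weighted value whenever the sampling ratio and the true prior disagree.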

Availability and implementation: The source code in C++, along with the Supplementary Materials, is available at: http://gsp.tamu.edu/Publications/supplementary/zollanvari13/.

Contact: ulisses@ece.tamu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Braga-Neto, U. M., Zollanvari, A., Dougherty, E. R.
Posted: November 25, 2014, 7:33 am

Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. To draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes and RNA immunoprecipitated with proteins, not only from bacteria but also from eukaryotes and archaea.

Availability and implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Contact: cynthia.sharma@uni-wuerzburg.de and konrad.foerstner@uni-wuerzburg.de

Author: Forstner, K. U., Vogel, J., Sharma, C. M.
Posted: November 25, 2014, 7:33 am

Motivation: Knowing the subcellular location of proteins is critical for understanding their function and developing accurate networks representing eukaryotic biological processes. Many computational tools have been developed to predict proteome-wide subcellular location, and abundant experimental data from green fluorescent protein (GFP) tagging or mass spectrometry (MS) are available in the model plant, Arabidopsis. None of these approaches is error-free, and thus, results are often contradictory.

Results: To help unify these multiple data sources, we have developed the SUBcellular Arabidopsis consensus (SUBAcon) algorithm, a naive Bayes classifier that integrates 22 computational prediction algorithms, experimental GFP and MS localizations, protein–protein interaction and co-expression data to derive a consensus call and probability. SUBAcon classifies protein location in Arabidopsis more accurately than single predictors.
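As a rough illustration of how a naive Bayes classifier can combine conflicting calls into a consensus with a probability, consider the sketch below. It is not the SUBAcon code: the source names, the two locations, the priors and every likelihood value are hypothetical, and the sources are assumed to be conditionally independent given the true location.

```python
def naive_bayes_consensus(calls, priors, likelihoods):
    """Return posterior P(location | calls), assuming independent sources.

    calls:       {source: location called by that source}
    priors:      {location: prior probability}
    likelihoods: {source: {true_location: {called_location: probability}}}
    """
    scores = {}
    for location, prior in priors.items():
        score = prior
        for source, called in calls.items():
            score *= likelihoods[source][location][called]
        scores[location] = score
    total = sum(scores.values())
    return {location: s / total for location, s in scores.items()}


# Hypothetical numbers for one protein with two contradictory calls.
priors = {"plastid": 0.5, "mitochondrion": 0.5}
likelihoods = {
    "predictor_a": {
        "plastid": {"plastid": 0.7, "mitochondrion": 0.3},
        "mitochondrion": {"plastid": 0.4, "mitochondrion": 0.6},
    },
    "gfp": {
        "plastid": {"plastid": 0.9, "mitochondrion": 0.1},
        "mitochondrion": {"plastid": 0.2, "mitochondrion": 0.8},
    },
}
calls = {"predictor_a": "mitochondrion", "gfp": "plastid"}

posterior = naive_bayes_consensus(calls, priors, likelihoods)
consensus = max(posterior, key=posterior.get)
print(consensus, round(posterior[consensus], 3))
```

In this toy setting the more reliable source outweighs the weaker one, which is the behaviour a consensus classifier is meant to provide.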

Availability: SUBAcon is a useful tool for recovering proteome-wide subcellular locations of Arabidopsis proteins and is displayed in the SUBA3 database (http://suba.plantenergy.uwa.edu.au). The source code and input data are available through the SUBA3 server (http://suba.plantenergy.uwa.edu.au//SUBAcon.html) and the Arabidopsis SUbproteome REference (ASURE) training set can be accessed using the ASURE web portal (http://suba.plantenergy.uwa.edu.au/ASURE).

Contact: cornelia.hooper@uwa.edu.au or ian.castleden@uwa.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hooper, C. M., Tanz, S. K., Castleden, I. R., Vacher, M. A., Small, I. D., Millar, A. H.
Posted: November 25, 2014, 7:33 am

Summary: pViz.js is a visualization library for displaying protein sequence features in a Web browser. By simply providing a sequence and the locations of its features, this lightweight, yet versatile, JavaScript library renders an interactive view of the protein features. Interactive exploration of protein sequence features over the Web is a common need in Bioinformatics. Although many Web sites have developed viewers to display these features, their implementations are usually focused on data from a specific source or use case. Some of these viewers can be adapted to fit other use cases but are not designed to be reusable. pViz makes it easy to display features as boxes aligned to a protein sequence with zooming functionality but also includes predefined renderings for secondary structure and post-translational modifications. The library is designed to further customize this view. We demonstrate such applications of pViz using two examples: a proteomic data visualization tool with an embedded viewer for displaying features on protein structure, and a tool to visualize the results of the variant_effect_predictor tool from Ensembl.

Availability and implementation: pViz.js is a JavaScript library, available on github at https://github.com/Genentech/pviz. This site includes examples and functional applications, installation instructions and usage documentation. A Readme file, which explains how to use pViz with examples, is available as Supplementary Material A.

Contact: masselot.alexandre@gene.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Mukhyala, K., Masselot, A.
Posted: November 25, 2014, 7:33 am

Motivation: Knowledge of drug–drug interactions (DDIs) is crucial for health-care professionals to avoid adverse effects when co-administering drugs to patients. As most newly discovered DDIs are made available through scientific publications, automatic DDI extraction is highly relevant.

Results: We propose a novel feature-based approach to extract DDIs from text. Our approach consists of three steps. First, we apply text preprocessing to convert input sentences from a given dataset into structured representations. Second, we map each candidate DDI pair from that dataset into a suitable syntactic structure. Based on that, a novel set of features is used to generate feature vectors for these candidate DDI pairs. Third, the obtained feature vectors are used to train a support vector machine (SVM) classifier. When evaluated on two DDI extraction challenge test datasets from 2011 and 2013, our system achieves F-scores of 71.1% and 83.5%, respectively, outperforming any state-of-the-art DDI extraction system.

Availability and implementation: The source code is available for academic use at http://www.biosemantics.org/uploads/DDI.zip

Contact: q.bui@erasmusmc.nl

Author: Bui, Q.-C., Sloot, P. M. A., van Mulligen, E. M., Kors, J. A.
Posted: November 25, 2014, 7:33 am

Summary: GlycoPattern is a Web-based bioinformatics resource to support the analysis of glycan array data for the Consortium for Functional Glycomics. This resource includes algorithms and tools to discover structural motifs, a heatmap visualization to compare multiple experiments, hierarchical clustering of Glycan Binding Proteins with respect to their binding motifs and a structural search feature on the experimental data.

Availability and implementation: GlycoPattern is freely available on the Web at http://glycopattern.emory.edu with all major browsers supported.

Contact: sanjay.agravat@emory.edu

Author: Agravat, S. B., Saltz, J. H., Cummings, R. D., Smith, D. F.
Posted: November 25, 2014, 7:33 am

Large datasets can be screened for sequences from a specific organism, quickly and with low memory requirements, by a data structure that supports time- and memory-efficient set membership queries. Bloom filters offer such queries but require that false positives be controlled. We present BioBloom Tools, a Bloom filter-based sequence-screening tool that is faster than BWA, Bowtie 2 (popular alignment algorithms) and FACS (a membership query algorithm). It delivers accuracies comparable with these tools, controls false positives and has low memory requirements.
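To make the idea of a Bloom filter membership query concrete, here is a minimal sketch in Python. It is not the BioBloom Tools implementation: the class, the k-mer length, the bit-array size, the toy reference sequence and the hashing scheme (double hashing derived from a SHA-256 digest) are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings (illustrative; not BioBloom Tools code)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def fraction_matching(read, bloom, k):
    """Fraction of a read's k-mers that the filter reports as present."""
    read_kmers = list(kmers(read, k))
    if not read_kmers:
        return 0.0
    return sum(km in bloom for km in read_kmers) / len(read_kmers)


K = 8  # hypothetical k-mer length
reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACA"  # made-up sequence
bloom = BloomFilter()
for km in kmers(reference, K):
    bloom.add(km)

read_from_reference = reference[5:30]
print(fraction_matching(read_from_reference, bloom, K))  # prints 1.0
```

A read drawn from the indexed sequence always scores 1.0 because a Bloom filter has no false negatives. An unrelated read can still score above zero through false positives, which is why the false-positive rate has to be controlled by the choice of filter size and number of hash functions.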

Availability and implementation: www.bcgsc.ca/platform/bioinfo/software/biobloomtools

Contact: cjustin@bcgsc.ca or ibirol@bcgsc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Chu, J., Sadeghi, S., Raymond, A., Jackman, S. D., Nip, K. M., Mar, R., Mohamadi, H., Butterfield, Y. S., Robertson, A. G., Birol, I.
Posted: November 25, 2014, 7:33 am

Motivation: Sharing genomic data is crucial to support scientific investigation such as genome-wide association studies. However, recent investigations suggest the privacy of the individual participants in these studies can be compromised, leading to serious concerns and consequences, such as overly restricted access to data.

Results: We introduce a novel cryptographic strategy to securely perform meta-analysis for genetic association studies in large consortia. Our methodology is useful for supporting joint studies among disparate data sites, where privacy or confidentiality is of concern. We validate our method using three multisite association studies. Our research shows that genetic associations can be analyzed efficiently and accurately across substudy sites, without leaking information on individual participants and site-level association summaries.

Availability and implementation: Our software for secure meta-analysis of genetic association studies, SecureMA, is publicly available at http://github.com/XieConnect/SecureMA. Our customized secure computation framework is also publicly available at http://github.com/XieConnect/CircuitService

Contact: b.malin@vanderbilt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Xie, W., Kantarcioglu, M., Bush, W. S., Crawford, D., Denny, J. C., Heatherly, R., Malin, B. A.
Posted: November 25, 2014, 7:33 am

Motivation: The tried and true approach of flow cytometry data analysis is to manually gate on each biomarker separately, which is feasible for a small number of biomarkers, e.g. fewer than five. However, this rapidly becomes confusing as the number of biomarkers increases. Furthermore, multivariate structure is not taken into account. Recently, automated gating algorithms have been implemented, all of which rely on unsupervised learning methodology. However, all unsupervised learning outputs suffer the same difficulties in validation in the absence of external knowledge, regardless of application domain.

Results: We present a new semi-automated algorithm for population discovery that is based on comparison to fluorescence-minus-one controls, thus transferring the problem into that of one-class classification, as opposed to being an unsupervised learning problem. The novel one-class classification algorithm is based on common principal components and can accommodate complex mixtures of multivariate densities. Computational time is short, and the simple nature of the calculations means the algorithm can easily be adapted to process large numbers of cells (10^6). Furthermore, we are able to find rare cell populations as well as populations with low biomarker concentration, both of which are inherently hard to do in an unsupervised learning context without prior knowledge of the samples’ composition.

Availability and implementation: R scripts are available via https://fccf.mpiib-berlin.mpg.de/daten/drfz/bioinformatics/ with {username,password}={bioinformatics,Sar=Gac4}.

Contact: kristen.feher@drfz.de or kaiser@drfz.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Feher, K., Kirsch, J., Radbruch, A., Chang, H.-D., Kaiser, T.
Posted: November 25, 2014, 7:33 am

Motivation: Numerous public microarray datasets are valuable resources for the scientific communities. Several online tools have made great steps to use these data by querying related datasets with users’ own gene signatures or expression profiles. However, dataset annotation and result exhibition still need to be improved.

Results: ExpTreeDB is a database that allows for queries on human and mouse microarray experiments from Gene Expression Omnibus with gene signatures or profiles. Compared with similar applications, ExpTreeDB pays more attention to dataset annotations and result visualization. We introduced a multiple-level annotation system to depict and organize original experiments. For example, a tamoxifen-treated cell line experiment is hierarchically annotated as ‘agent->drug->estrogen receptor antagonist->tamoxifen’. Consequently, retrieved results are exhibited as interactive tree-structured graphics, which provide an overview of related experiments and might enlighten users on key items of interest.

Availability and implementation: The database is freely available at http://biotech.bmi.ac.cn/ExpTreeDB. The Web site is implemented in Perl, PHP, R, MySQL and Apache.
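The multiple-level annotation can be pictured as a tree built from '->'-separated paths. The sketch below is only an illustration of that data structure and is not ExpTreeDB code; the function names, the experiment identifiers and the second (fulvestrant) annotation are hypothetical.

```python
def add_annotation(tree, path, experiment_id):
    """Insert one '->'-separated annotation path into a nested dict."""
    node = tree
    for label in path.split("->"):
        node = node.setdefault(label, {})
    node.setdefault("_experiments", []).append(experiment_id)


def count_experiments(node):
    """Count experiments attached at this node or anywhere below it."""
    own = len(node.get("_experiments", []))
    below = sum(count_experiments(child)
                for label, child in node.items() if label != "_experiments")
    return own + below


def render(tree, depth=0):
    """Return indented text lines, one per annotation level."""
    lines = []
    for label in sorted(k for k in tree if k != "_experiments"):
        child = tree[label]
        lines.append("  " * depth + f"{label} ({count_experiments(child)})")
        lines.extend(render(child, depth + 1))
    return lines


tree = {}
add_annotation(tree, "agent->drug->estrogen receptor antagonist->tamoxifen",
               "experiment_1")  # path taken from the abstract; ID is made up
add_annotation(tree, "agent->drug->estrogen receptor antagonist->fulvestrant",
               "experiment_2")  # hypothetical second annotation

print("\n".join(render(tree)))
```

Each line of the rendered tree carries the number of experiments beneath it, which is one simple way to give an overview of related experiments at every annotation level.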

Contact: boxc@bmi.ac.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ni, M., Ye, F., Zhu, J., Li, Z., Yang, S., Yang, B., Han, L., Wu, Y., Chen, Y., Li, F., Wang, S., Bo, X.
Posted: November 25, 2014, 7:33 am

Summary: Current sequence alignment browsers allow visualization of large and complex next-generation sequencing datasets. However, most of these tools provide inadequate display of insertions and can be cumbersome to use on large datasets. I implemented PyBamView, a lightweight Web application for visualizing short read alignments. It provides an easy-to-use Web interface for viewing alignments across multiple samples, with a focus on accurate visualization of insertions.

Availability and implementation: PyBamView is available as a standard Python package. The source code is freely available under the MIT license at https://mgymrek.github.io/pybamview.

Contact: mgymrek@mit.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Gymrek, M.
Posted: November 25, 2014, 7:33 am

Motivation: Set-based network similarity metrics are increasingly used to productively analyze genome-wide data. Conventional approaches, such as mean shortest path and clique-based metrics, have been useful but are not well suited to all applications. Computational scientists in other disciplines have developed communicability as a complementary metric. Network communicability considers all paths of all lengths between two network members. Given the success of previous network analyses of protein–protein interactions, we applied the concepts of network communicability to this problem. Here we show that our communicability implementation has advantages over traditional approaches. Overall, analyses suggest network communicability has considerable utility in analysis of large-scale biological networks.
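For readers unfamiliar with the metric, communicability between two nodes is commonly defined in the network-science literature as the corresponding entry of the matrix exponential of the adjacency matrix, i.e. a sum over walks of every length k, each weighted by 1/k!. The sketch below follows that common definition on a made-up four-node path network; it is an assumption that the authors' R package uses the same formulation, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def communicability(adj, terms=30):
    """Approximate exp(adj) by the truncated series sum_k adj^k / k!."""
    n = len(adj)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]  # adj^0 / 0!
    for k in range(1, terms):
        # term becomes adj^k / k! by multiplying the previous term by adj / k.
        term = [[x / k for x in row] for row in mat_mul(term, adj)]
        total = [[t + x for t, x in zip(t_row, x_row)]
                 for t_row, x_row in zip(total, term)]
    return total


# Toy undirected network: a path 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
G = communicability(adjacency)
for target in (1, 2, 3):
    print(f"communicability(0, {target}) = {G[0][target]:.3f}")
```

Unlike a shortest-path distance, the value for a pair of nodes reflects every route between them, with longer routes contributing progressively less.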

Availability and implementation: We provide our method as an R package for use in both human protein–protein interaction network analyses and analyses of arbitrary networks along with a tutorial at http://www.shawlab.org/NetComm/.

Contact: cashaw@bcm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Campbell, I. M., James, R. A., Chen, E. S., Shaw, C. A.
Posted: November 25, 2014, 7:33 am

Summary: We introduce PHOXTRACK (PHOsphosite-X-TRacing Analysis of Causal Kinases), a user-friendly freely available software tool for analyzing large datasets of post-translational modifications of proteins, such as phosphorylation, which are commonly obtained by mass spectrometry detection. In contrast to other currently applied data analysis approaches, PHOXTRACK uses full sets of quantitative proteomics data and applies non-parametric statistics to calculate whether defined kinase-specific sets of phosphosite sequences indicate statistically significant concordant differences between various biological conditions. PHOXTRACK is an efficient tool for extracting post-translational information from comprehensive proteomics datasets to decipher key regulatory proteins and to infer biologically relevant molecular pathways.

Availability: PHOXTRACK will be maintained over the next years and is freely available as an online tool for non-commercial use at http://phoxtrack.molgen.mpg.de. Users will also find a tutorial at this Web site and can additionally give feedback at https://groups.google.com/d/forum/phoxtrack-discuss.

Contact: sauer@molgen.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Weidner, C., Fischer, C., Sauer, S.
Posted: November 25, 2014, 7:33 am

Summary: For practical and robust de novo identification of genomic fusions and breakpoints from targeted paired-end DNA sequencing data, we developed Fusion And Chromosomal Translocation Enumeration and Recovery Algorithm (FACTERA). Our method has minimal external dependencies, works directly on a preexisting Binary Alignment/Map file and produces easily interpretable output. We demonstrate FACTERA’s ability to rapidly identify breakpoint-resolution fusion events with high sensitivity and specificity in patients with non-small cell lung cancer, including novel rearrangements. We anticipate that FACTERA will be broadly applicable to the discovery and analysis of clinically relevant fusions from both targeted and genome-wide sequencing datasets.

Availability and implementation: http://factera.stanford.edu.

Contact: arasha@stanford.edu or diehn@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Newman, A. M., Bratman, S. V., Stehr, H., Lee, L. J., Liu, C. L., Diehn, M., Alizadeh, A. A.
Posted: November 25, 2014, 7:33 am

Motivation: RNA-seq has become the method of choice to quantify genes and exons, discover novel transcripts and detect fusion genes. However, reliable variant identification from RNA-seq data remains challenging because of the complexities of the transcriptome, the challenges of accurately mapping exon boundary spanning reads and the bias introduced during the sequencing library preparation.

Method: We developed RVboost, a novel method specific for RNA variant prioritization. RVboost uses several attributes unique to the process of RNA library preparation, sequencing and RNA-seq data analyses. It uses a boosting method to train a model of ‘good quality’ variants using common variants from HapMap, and prioritizes and calls the RNA variants based on the trained model. We packaged RVboost in a comprehensive workflow, which integrates tools of variant calling, annotation and filtering.

Results: RVboost consistently outperforms the variant quality score recalibration from the Genome Analysis Tool Kit and the RNA-seq variant-calling pipeline SNPiR in 12 RNA-seq samples using ground-truth variants from paired exome sequencing data. Several RNA-seq–specific attributes were identified as critical to differentiate true and false variants, including the distance of the variant positions to exon boundaries, and the percent of the reads supporting the variant in the first six base pairs. The latter identifies false variants introduced by the random hexamer priming during the library construction.

Availability and implementation: The RVboost package is implemented to readily run in Mac or Linux environments. The software and user manual are available at http://bioinformaticstools.mayo.edu/research/rvboost/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Wang, C., Davila, J. I., Baheti, S., Bhagwate, A. V., Wang, X., Kocher, J.-P. A., Slager, S. L., Feldman, A. L., Novak, A. J., Cerhan, J. R., Thompson, E. A., Asmann, Y. W.
Posted: November 25, 2014, 7:33 am

Motivation: Increasing attention has been devoted to estimation of species-level phylogenetic relationships under the coalescent model. However, existing methods either use summary statistics (gene trees) to carry out estimation, ignoring an important source of variability in the estimates, or involve computationally intensive Bayesian Markov chain Monte Carlo algorithms that do not scale well to whole-genome datasets.

Results: We develop a method to infer relationships among quartets of taxa under the coalescent model using techniques from algebraic statistics. Uncertainty in the estimated relationships is quantified using the nonparametric bootstrap. The performance of our method is assessed with simulated data. We then describe how our method could be used for species tree inference in larger taxon samples, and demonstrate its utility using datasets for Sistrurus rattlesnakes and for soybeans.

Availability and implementation: The method to infer the phylogenetic relationship among quartets is implemented in the software SVDquartets, available at www.stat.osu.edu/~lkubatko/software/SVDquartets.

Contact: lkubatko@stat.osu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Chifman, J., Kubatko, L.
Posted: November 25, 2014, 7:33 am

We introduce Pepper (Protein complex Expansion using Protein–Protein intERactions), a Cytoscape app designed to identify protein complexes as densely connected subnetworks from seed lists of proteins derived from proteomic studies. Pepper identifies a connected subgraph by using multi-objective optimization involving two functions: (i) the coverage: a solution must contain as many proteins from the seed as possible; and (ii) the density: the proteins of a solution must be as connected as possible, using only interactions from a proteome-wide interaction network. Comparisons based on gold standard yeast and human datasets showed Pepper’s integrative approach to be superior to standard protein complex discovery methods. The visualization and interpretation of the results are facilitated by an automated post-processing pipeline based on topological analysis and data integration about the predicted complex proteins. Pepper is a user-friendly tool that can be used to analyse any list of proteins.
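The two objectives named in the Pepper summary can be illustrated with a small sketch. The exact formulas Pepper optimizes are not given in the abstract, so the definitions below (coverage as the fraction of seed proteins recovered, density as the fraction of possible pairs that interact) are assumptions, and the function names are hypothetical.

```python
def coverage(solution, seed):
    """Fraction of seed proteins that appear in the candidate solution (assumed definition)."""
    seed = set(seed)
    return len(set(solution) & seed) / len(seed)


def density(solution, edges):
    """Fraction of possible protein pairs in the solution that are joined by an interaction.

    `edges` is an iterable of 2-tuples from a hypothetical proteome-wide network;
    self-loops and edges leaving the solution are ignored.
    """
    nodes = set(solution)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = {frozenset(e) for e in edges if len(set(e)) == 2 and set(e) <= nodes}
    return len(internal) / (n * (n - 1) / 2)


# Toy example: a four-protein seed and a candidate complex that swaps D for X.
seed = {"A", "B", "C", "D"}
network = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "X"), ("D", "Y")]
candidate = {"A", "B", "C", "X"}
print(coverage(candidate, seed), density(candidate, network))
```

A multi-objective optimizer would keep candidates for which neither score can be improved without lowering the other; that search step is not shown here.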

Availability: Pepper is available from the Cytoscape plug-in manager or online (http://apps.cytoscape.org/apps/pepper) and released under GNU General Public License version 3.

Contact: mohamed.elati@issb.genopole.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Winterhalter, C., Nicolle, R., Louis, A., To, C., Radvanyi, F., Elati, M.
Posted: November 25, 2014, 7:33 am

Motivation: Functional relationship networks, which summarize the probability of co-functionality between any two genes in the genome, could complement the reductionist focus of modern biology for understanding diverse biological processes in an organism. One major limitation of the current networks is that they are static, while one might expect functional relationships to consistently reprogram during the differentiation of a cell lineage. To address this potential limitation, we developed a novel algorithm that leverages both differentiation stage-specific expression data and large-scale heterogeneous functional genomic data to model such dynamic changes. We then applied this algorithm to the time-course RNA-Seq data we collected for ex vivo human erythroid cell differentiation.

Results: Through computational cross-validation and literature validation, we show that the resulting networks correctly predict the (de)-activated functional connections between genes during erythropoiesis. We identified known critical genes, such as HBD and GATA1, and functional connections during erythropoiesis using these dynamic networks, while the traditional static network was not able to provide such information. Furthermore, by comparing the static and the dynamic networks, we identified novel genes (such as OSBP2 and PDZK1IP1) that are potential drivers of erythroid cell differentiation. This novel method of modeling dynamic networks is applicable to other differentiation processes where time-course genome-scale expression data are available, and should assist in generating greater understanding of the functional dynamics at play across the genome during development.

Availability and implementation: The network described in this article is available at http://guanlab.ccmb.med.umich.edu/stageSpecificNetwork.

Contact: gyuanfan@umich.edu or engel@umich.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhu, F., Shi, L., Li, H., Eksi, R., Engel, J. D., Guan, Y.
Posted: November 25, 2014, 7:33 am

Motivation: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment.

Results: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth.
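As a rough illustration of subsampling to a lower depth, the sketch below thins a read-count matrix by keeping each read independently with a fixed probability. This is written in Python rather than R, the function name is hypothetical, and whether it matches the internals of the subSeq package is an assumption.

```python
import random


def subsample_counts(counts, proportion, seed=0):
    """Binomially thin a count matrix: each read is kept with probability `proportion`.

    `counts` is a list of rows (genes) of integer read counts (samples).
    """
    rng = random.Random(seed)
    return [
        [sum(rng.random() < proportion for _ in range(c)) for c in row]
        for row in counts
    ]


# Toy matrix: 3 genes x 2 samples, thinned to roughly 10% of the original depth.
matrix = [[1000, 1200], [50, 40], [0, 3]]
for p in (0.1, 0.5, 1.0):
    print(p, subsample_counts(matrix, p))
```

Running the downstream analysis at each proportion and plotting the number of significant genes against depth shows where power begins to saturate.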

Availability and Implementation: The subSeq R package is available at http://github.com/StoreyLab/subSeq/.

Contact: dgrtwo@princeton.edu or jstorey@princeton.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Robinson, D. G., Storey, J. D.
Posted: November 25, 2014, 7:33 am

Motivation: The S/TQ cluster domain (SCD) constitutes a new type of protein domain that is not defined by sequence similarity but by the presence of multiple S/TQ motifs within a variable stretch of amino acids. SCDs are recognized targets for DNA damage response (DDR) kinases like ATM and ATR. Characterizing DDR targets is of significant interest. The aim of this work was to develop a web-based tool to allow for easy identification and visualization of SCDs within specific proteins or in whole proteome sets, a feature not supported by current domain and motif search tools.

Results: We have developed an algorithm that (i) generates a list of all proteins in an organism containing at least one user-defined SCD within their sequence, or (ii) identifies and renders a visual representation of all user-defined SCDs present in a single sequence or batch of sequences.
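A minimal sketch of how a user-defined SCD search might work on a single sequence: locate every SQ or TQ motif, then report stretches where at least `min_motifs` motifs fall within `max_span` residues. The default values and the merging of overlapping stretches are assumptions for illustration, not the tool's documented behavior.

```python
import re


def find_scds(seq, min_motifs=3, max_span=100):
    """Return (start, end) pairs, 0-based and end-exclusive, of S/TQ cluster regions."""
    starts = [m.start() for m in re.finditer(r"[ST]Q", seq)]
    domains = []
    for i in range(len(starts) - min_motifs + 1):
        first = starts[i]
        last_end = starts[i + min_motifs - 1] + 2  # end of the last motif in the window
        if last_end - first > max_span:
            continue
        if domains and first <= domains[-1][1]:
            # Window overlaps the previous domain: extend it.
            domains[-1] = (domains[-1][0], max(domains[-1][1], last_end))
        else:
            domains.append((first, last_end))
    return domains


protein = "AASQAATQAASQ" + "A" * 200 + "SQ"
print(find_scds(protein))
```

Applying the same function to every sequence in a proteome file and keeping those with a non-empty result corresponds to mode (i) in the abstract.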

Availability and implementation: The application was developed using Perl and Python, and is available at the following URL: http://ustbioinfo.webfactional.com/scd/.

Contact: ribesza@stthom.edu or lariosm@stthom.edu

Author: Cara, L., Baitemirova, M., Duong, F., Larios-Sanz, M., Ribes-Zamora, A.
Posted: November 25, 2014, 7:33 am

HPG Aligner applies suffix arrays for DNA read mapping. This implementation produces a highly sensitive and extremely fast mapping of DNA reads that scales up almost linearly with read length. The approach presented here is faster (over 20x for long reads) and more sensitive (over 98% in a wide range of read lengths) than the current state-of-the-art mappers. HPG Aligner is not only an optimal alternative for current sequencers but also the only solution available to cope with longer reads and growing throughputs produced by forthcoming sequencing technologies.
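For readers unfamiliar with suffix arrays, the following sketch shows the core idea of exact-match lookup by binary search over sorted suffixes. It is a textbook construction with naive sorting, not HPG Aligner's implementation, and it handles neither mismatches nor gaps.

```python
def build_suffix_array(text):
    """Indices of all suffixes of `text`, sorted lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of exact matches of `pattern` in `text`."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


reference = "ACGTACGTGA"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "ACG"))
```

Each lookup needs a number of comparisons logarithmic in the reference length, which is why the index is attractive for mapping many reads against one genome.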

Availability and implementation: https://github.com/opencb/hpg-aligner.

Contact: jdopazo@cipf.es or imedina@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Tarraga, J., Arnau, V., Martinez, H., Moreno, R., Cazorla, D., Salavert-Torres, J., Blanquer-Espert, I., Dopazo, J., Medina, I.
Posted: November 25, 2014, 7:33 am

Motivation: Nanopore sequencing may be the next disruptive technology in genomics, owing to its ability to detect single DNA molecules without prior amplification, lack of reliance on expensive optical components, and the ability to sequence long fragments. The MinION™ from Oxford Nanopore Technologies (ONT) is the first nanopore sequencer to be commercialized and is now available to early-access users. The MinION™ is a USB-connected, portable nanopore sequencer that permits real-time analysis of streaming event data. Currently, the research community lacks a standardized toolkit for the analysis of nanopore datasets.

Results: We introduce poretools, a flexible toolkit for exploring datasets generated by nanopore sequencing devices from MinION™ for the purposes of quality control and downstream analysis. Poretools operates directly on the native FAST5 (an application of the HDF5 standard) file format produced by ONT and provides a wealth of format conversion utilities and data exploration and visualization tools.

Availability and implementation: Poretools is open-source software and is written in Python as both a suite of command line utilities and a Python application programming interface. Source code is freely available on GitHub at https://www.github.com/arq5x/poretools

Contact: n.j.loman@bham.ac.uk and aaronquinlan@gmail.com

Supplementary information: An IPython notebook demonstrating the functionality of poretools is available on GitHub. Complete documentation is available at http://poretools.readthedocs.org.

Author: Loman, N. J., Quinlan, A. R.
Posted: November 25, 2014, 7:33 am

Motivation: The human leukocyte antigen (HLA) gene cluster plays a crucial role in adaptive immunity and is thus relevant in many biomedical applications. While next-generation sequencing data are often available for a patient, deducing the HLA genotype is difficult because of substantial sequence similarity within the cluster and exceptionally high variability of the loci. Established approaches, therefore, rely on specific HLA enrichment and sequencing techniques, coming at an additional cost and extra turnaround time.

Result: We present OptiType, a novel HLA genotyping algorithm based on integer linear programming, capable of producing accurate predictions from NGS data not specifically enriched for the HLA cluster. We also present a comprehensive benchmark dataset consisting of RNA, exome and whole-genome sequencing data. OptiType significantly outperformed previously published in silico approaches with an overall accuracy of 97% enabling its use in a broad range of applications.

Contact: szolek@informatik.uni-tuebingen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Szolek, A., Schubert, B., Mohr, C., Sturm, M., Feldhahn, M., Kohlbacher, O.
Posted: November 25, 2014, 7:33 am

Motivation: Identifying somatic changes from tumor and matched normal sequences has become a standard approach in cancer research. More specifically, this requires accurate detection of somatic point mutations with low allele frequencies in impure and heterogeneous cancer samples. Although haplotype phasing information derived by using heterozygous germ line variants near candidate mutations would improve accuracy, no somatic mutation caller that uses such information is currently available.

Results: We propose a Bayesian hierarchical method, termed HapMuC, in which power is increased by using available information on heterozygous germ line variants located near candidate mutations. We first constructed two generative models (the mutation model and the error model). In the generative models, we prepared candidate haplotypes, considering a heterozygous germ line variant if available, and the observed reads were realigned to the haplotypes. We then inferred the haplotype frequencies and computed the marginal likelihoods using a variational Bayesian algorithm. Finally, we derived a Bayes factor for evaluating the possibility of the existence of somatic mutations. We also demonstrated that our algorithm has superior specificity and sensitivity compared with existing methods, as determined based on a simulation, the TCGA Mutation Calling Benchmark 4 datasets and data from the COLO-829 cell line.

Availability and implementation: The HapMuC source code is available from http://github.com/usuyama/hapmuc.

Contact: imoto@ims.u-tokyo.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Usuyama, N., Shiraishi, Y., Sato, Y., Kume, H., Homma, Y., Ogawa, S., Miyano, S., Imoto, S.
Posted: November 25, 2014, 7:33 am

Motivation: Researchers now have access to large volumes of genome sequences for comparative analysis, some generated by the plethora of public sequencing projects and, increasingly, from individual efforts. It is not possible, or necessarily desirable, that the public genome browsers attempt to curate all these data. Instead, a wealth of powerful tools is emerging to empower users to create their own visualizations and browsers.

Results: We introduce a pipeline to easily generate collections of Web-accessible UCSC Genome Browsers interrelated by an alignment. It is intended to democratize our comparative genomic browser resources, serving the broad and growing community of evolutionary genomicists and facilitating easy public sharing via the Internet. Using the alignment, all annotations and the alignment itself can be efficiently viewed with reference to any genome in the collection, symmetrically. A new, intelligently scaled alignment display makes it simple to view all changes between the genomes at all levels of resolution, from substitutions to complex structural rearrangements, including duplications. To demonstrate this work, we create a comparative assembly hub containing 57 Escherichia coli and 9 Shigella genomes and show examples that highlight their unique biology.

Availability and implementation: The source code is available as open source at https://github.com/glennhickey/progressiveCactus. The E. coli and Shigella genome hub is now a public hub listed on the UCSC browser public hubs Web page.

Contact: benedict@soe.ucsc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Nguyen, N., Hickey, G., Raney, B. J., Armstrong, J., Clawson, H., Zweig, A., Karolchik, D., Kent, W. J., Haussler, D., Paten, B.
Posted: November 25, 2014, 7:33 am

Summary: Automated analysis of imaged phenotypes enables fast and reproducible quantification of biologically relevant features. Despite recent developments, recordings of complex networked structures, such as leaf venation patterns, cytoskeletal structures or traffic networks, remain challenging to analyze. Here we illustrate the applicability of img2net to automatically analyze such structures by reconstructing the underlying network, computing relevant network properties and statistically comparing networks of different types or under different conditions. The software can be readily used for analyzing image data of arbitrary 2D and 3D network-like structures.

Availability and Implementation: img2net is open-source software under the GPL and can be downloaded from http://mathbiol.mpimp-golm.mpg.de/img2net/, where supplementary information and datasets for testing are provided.

Contact: breuer@mpimp-golm.mpg.de

Author: Breuer, D., Nikoloski, Z.
Posted: November 5, 2014, 11:27 am

Summary: Recently, several high profile studies collected cell viability data from panels of cancer cell lines treated with many drugs applied at different concentrations. Such drug sensitivity data for cancer cell lines provide suggestive treatments for different types and subtypes of cancer. Visualization of these datasets can reveal patterns that may not be obvious by examining the data without such efforts. Here we introduce Drug/Cell-line Browser (DCB), an online interactive HTML5 data visualization tool for interacting with three of the recently published datasets of cancer cell lines/drug-viability studies. DCB uses clustering and canvas visualization of the drugs and the cell lines, as well as a bar graph that summarizes drug effectiveness for the tissue of origin or the cancer subtypes for single or multiple drugs. DCB can help in understanding drug response patterns and prioritizing drug/cancer cell line interactions by tissue of origin or cancer subtype.

Availability and implementation: DCB is an open source Web-based tool that is freely available at: http://www.maayanlab.net/LINCS/DCB

Contact: avi.maayan@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Duan, Q., Wang, Z., Fernandez, N. F., Rouillard, A. D., Tan, C. M., Benes, C. H., Ma'ayan, A.
Posted: November 5, 2014, 11:27 am

Summary: Non-targeted metabolomics technologies often yield data in which abundance for any given metabolite is observed and quantified for some samples and reported as missing for other samples. Apparent missingness can be due to true absence of the metabolite in the sample or presence at a level below detectability. Mixture-model analysis can formally account for metabolite ‘missingness’ due to absence or undetectability, but software for this type of analysis in the high-throughput setting is limited. The R package metabomxtr has been developed to facilitate mixture-model analysis of non-targeted metabolomics data in which only a portion of samples have quantifiable abundance for certain metabolites.

Availability and implementation: metabomxtr is available through Bioconductor. It is released under the GPL-2 license.

Contact: dscholtens@northwestern.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Nodzenski, M., Muehlbauer, M. J., Bain, J. R., Reisetter, A. C., Lowe, W. L., Scholtens, D. M.
Posted: November 5, 2014, 11:27 am

Summary: Because cancer has heterogeneous clinical behaviors due to the progressive accumulation of multiple genetic and epigenetic alterations, the identification of robust molecular signatures for predicting cancer outcome is profoundly important. Here, we introduce the APPEX Web-based analysis platform as a versatile tool for identifying prognostic molecular signatures that predict cancer diversity. We incorporated most of the statistical methods for survival analysis and implemented seven survival analysis workflows, including CoxSingle, CoxMulti, IntransSingle, IntransMulti, SuperPC, TimeRoc and multivariate. A total of 236 publicly available datasets were collected, processed and stored to support easy independent validation of prognostic signatures. Two case studies including disease recurrence and bladder cancer progression were described using different combinations of the seven workflows.

Availability and implementation: APPEX is freely available at http://www.appex.kr.

Contact: kimsy@kribb.re.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Kim, S.-K., Hwan Kim, J., Yun, S.-J., Kim, W.-J., Kim, S.-Y.
Posted: November 5, 2014, 11:27 am

Summary: The application of protein–protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of >97% strong scaling.

Availability and Implementation: MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.

Contact: akiyama@cs.titech.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ohue, M., Shimoda, T., Suzuki, S., Matsuzaki, Y., Ishida, T., Akiyama, Y.
Posted: November 5, 2014, 11:27 am

Motivation: Kotai Antibody Builder is a Web service for tertiary structural modeling of antibody variable regions. It consists of three main steps: hybrid template selection by sequence alignment and canonical rules, 3D rendering of alignments and CDR-H3 loop modeling. For the last step, in addition to rule-based heuristics used to build the initial model, a refinement option is available that uses fragment assembly followed by knowledge-based scoring. Using targets from the Second Antibody Modeling Assessment, we demonstrate that Kotai Antibody Builder generates models with an overall accuracy equal to that of the best-performing semi-automated predictors using expert knowledge.

Availability and implementation: Kotai Antibody Builder is available at http://kotaiab.org

Contact: standley@ifrec.osaka-u.ac.jp

Author: Yamashita, K., Ikeda, K., Amada, K., Liang, S., Tsuchiya, Y., Nakamura, H., Shirai, H., Standley, D. M.
Posted: November 5, 2014, 11:27 am

Summary: AliView is an alignment viewer and editor designed to meet the requirements of next-generation sequencing era phylogenetic datasets. AliView handles alignments of unlimited size in the formats most commonly used, i.e. FASTA, Phylip, Nexus, Clustal and MSF. The intuitive graphical interface makes it easy to inspect, sort, delete, merge and realign sequences as part of the manual filtering process of large datasets. AliView also works as an easy-to-use alignment editor for small as well as large datasets.

Availability and implementation: AliView is released as open-source software under the GNU General Public License, version 3.0 (GPLv3), and is available at GitHub (www.github.com/AliView). The program is cross-platform and extensively tested on Linux, Mac OS X and Windows systems. Downloads and help are available at http://ormbunkar.se/aliview

Contact: anders.larsson@ebc.uu.se

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Larsson, A.
Posted: November 5, 2014, 11:27 am

Summary: We present a new method to incrementally construct the FM-index for both short and long sequence reads, up to the size of a genome. It is the first algorithm that can build the index while implicitly sorting the sequences in the reverse (complement) lexicographical order without a separate sorting step. The implementation is among the fastest for indexing short reads and the only one that practically works for reads averaging kilobases in length.
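To make the FM-index concrete, the sketch below builds a Burrows-Wheeler transform by naive suffix sorting and counts pattern occurrences with backward search. It does not reproduce the incremental construction that is the contribution of this work; it only illustrates what the finished index supports, and the function names are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform of `text` with a '$' terminator (naive suffix sorting)."""
    text += "$"
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in order)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of `pattern` using FM-index backward search."""
    counts = Counter(bwt_string)
    first_row = {}  # first_row[c] = number of characters smaller than c
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_string)
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # Rank queries are done by direct counting here; real indexes precompute them.
        lo = first_row[c] + bwt_string[:lo].count(c)
        hi = first_row[c] + bwt_string[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGTGA")
print(index, count_occurrences(index, "ACG"))
```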

Availability and implementation: https://github.com/lh3/ropebwt2

Contact: hengli@broadinstitute.org

Author: Li, H.
Posted: November 5, 2014, 11:27 am

Summary: Next-generation sequencing (NGS) has great potential in HIV diagnostics, and genotypic prediction models have been developed and successfully tested in recent years. However, although highly accurate, these computational models lack the computational efficiency to reach their full potential.

In this study, we demonstrate the use of graphics processing units (GPUs) in combination with a computational prediction model for HIV tropism. Our new model named gCUP, parallelized and optimized for GPU, is highly accurate and can classify >175 000 sequences per second on an NVIDIA GeForce GTX 460. The computational efficiency of our new model is the next step to enable NGS technologies to reach clinical significance in HIV diagnostics. Moreover, our approach is not limited to HIV tropism prediction, but can also be easily adapted to other settings, e.g. drug resistance prediction.

Availability and implementation: The source code can be downloaded at http://www.heiderlab.de

Contact: d.heider@wz-straubing.de

Author: Olejnik, M., Steuwer, M., Gorlatch, S., Heider, D.
Posted: November 5, 2014, 11:27 am

Summary: There have been numerous applications developed for decoding and visualization of ab1 DNA sequencing files for Windows and Mac platforms, yet none exists for the increasingly popular smartphone operating systems. Decoding sequencing files cannot easily be carried out using browser-accessed Web tools. To overcome this hurdle, we have developed a new native app called DNAApp that can decode and display ab1 sequencing files on Android and iOS. In addition to in-built analysis tools such as reverse complementation, protein translation and searching for specific sequences, we have incorporated convenient functions that would facilitate the harnessing of online Web tools for a full range of analysis. Given the high usage of Android/iOS tablets and smartphones, such bioinformatics apps would raise productivity and facilitate the high demand for analyzing sequencing data in biomedical research.
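The in-built analyses mentioned above (reverse complementation and protein translation) are simple to express in code. The sketch below uses the standard genetic code and is only an illustration in Python; it says nothing about how DNAApp itself is implemented, and it does not parse ab1 files.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate frame 1 of a DNA sequence; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


read = "ATGGCCTAA"
print(reverse_complement(read), translate(read))
```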

Availability and implementation: The Android version of DNAApp is available in Google Play Store as ‘DNAApp’, and the iOS version is available in the App Store. More details on the app can be found at www.facebook.com/APDLab; www.bii.a-star.edu.sg/research/trd/apd.php

The DNAApp user guide is available at http://tinyurl.com/DNAAppuser, and a video tutorial is available on Google Play Store and App Store, as well as on the Facebook page.

Contact: samuelg@bii.a-star.edu.sg

Author: Nguyen, P.-V., Verma, C. S., Gan, S. K.-E.
Posted: November 5, 2014, 11:27 am

Summary: Basic4Cseq is an R/Bioconductor package for basic filtering, analysis and subsequent near-cis visualization of 4C-seq data. The package processes aligned 4C-seq raw data stored in binary alignment/map (BAM) format and maps the short reads to a corresponding virtual fragment library. Functions are included to create virtual fragment libraries providing chromosome position and further information on 4C-seq fragments (length and uniqueness of the fragment ends, and blindness of a fragment) for any BSGenome package. An optional filter is included for BAM files to remove invalid 4C-seq reads, and further filter functions are offered for 4C-seq fragments. Additionally, basic quality controls based on the read distribution are included. Fragment data in the vicinity of the experiment’s viewpoint are visualized as coverage plot based on a running median approach and a multi-scale contact profile. Wig files or csv files of the fragment data can be exported for further analyses and visualizations of interactions with other programs.

Availability and implementation: Basic4Cseq is implemented in R and available at http://www.bioconductor.org/. A vignette with detailed descriptions of the functions is included in the package.

Contact: Carolin.Walter@uni-muenster.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Walter, C., Schuetzmann, D., Rosenbauer, F., Dugas, M.
Posted: November 5, 2014, 11:27 am

Motivation: The ability to accurately read the order of nucleotides in DNA and RNA is fundamental for modern biology. Errors in next-generation sequencing can lead to many artifacts, from erroneous genome assemblies to mistaken inferences about RNA editing. Uneven coverage in datasets also contributes to false corrections.

Result: We introduce Trowel, a massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short read sequencing applications. The latency in the data path has been significantly reduced because of efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size, read length and fragment size.
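The k-mer spectrum referred to here is the table of how often each length-k substring occurs across all reads. The sketch below builds such a table and flags rare k-mers as likely errors. It is count-based only: Trowel also uses base quality values to decide which k-mers to trust, which this illustration omits, and the `min_count` cutoff is an arbitrary assumption.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count=2):
    """k-mers seen fewer than `min_count` times, which are candidates for sequencing errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


reads = ["ACGTA", "ACGTA", "ACGTT"]
spectrum = kmer_spectrum(reads, k=4)
print(spectrum, weak_kmers(spectrum))
```

A corrector would then try single-base substitutions that turn a weak k-mer into a well-supported one.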

Availability and implementation: Trowel is written in C++ and is provided under the General Public License v3.0 (GPLv3). It is available at http://trowel-ec.sourceforge.net.

Contact: euncheon.lim@tue.mpg.de or weigel@tue.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Lim, E.-C., Muller, J., Hagmann, J., Henz, S. R., Kim, S.-T., Weigel, D.
Posted: November 5, 2014, 11:27 am

Motivation: Gene models from draft genome assemblies of metazoan species are often incorrect, missing exons or entire genes, particularly for large gene families. Consequently, labour-intensive manual curation is often necessary. We present Figmop (Finding Genes using Motif Patterns) to help with the manual curation of gene families in draft genome assemblies. The program uses a pattern of short sequence motifs to identify putative genes directly from the genome sequence. Using a large gene family as a test case, Figmop was found to be more sensitive and specific than a BLAST-based approach. The visualization used allows the validation of potential genes to be carried out quickly and easily, saving hours if not days from an analysis.

Availability and implementation: Figmop is implemented in C and Python and is supported on Linux, Unix and MacOSX. The source code is freely available for download at https://github.com/dave-the-scientist.

Contact: curran.dave.m@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Curran, D. M., Gilleard, J. S., Wasmuth, J. D.
Posted: November 5, 2014, 11:27 am

Motivation: Elementary flux mode (EFM) is a useful tool in constraint-based modeling of metabolic networks. The property that every flux distribution can be decomposed as a weighted sum of EFMs allows certain applications of EFMs to studying flux distributions. The existence of biologically infeasible EFMs and the non-uniqueness of the decomposition, however, undermine the applicability of such methods. Efforts have been made to find biologically feasible EFMs by incorporating information from transcriptional regulation and thermodynamics. Yet, no attempt has been made to distinguish biologically feasible EFMs by considering their graphical properties. A previous study on the transcriptional regulation of metabolic genes found that distinct branches at a branch point metabolite usually belong to distinct metabolic pathways. This suggests an intuitive property of biologically feasible EFMs, i.e. minimal branching.

Results: We developed the concept of minimal branching EFM and derived the minimal branching decomposition (MBD) to decompose flux distributions. Testing in the core Escherichia coli metabolic network indicated that MBD can distinguish branches at branch points and greatly reduced the solution space in which the decomposition is often unique. An experimental flux distribution from a previous study on mouse cardiomyocyte was decomposed using MBD. Comparison with decomposition by a minimum number of EFMs showed that MBD found EFMs more consistent with established biological knowledge, which facilitates interpretation. Comparison of the methods applied to a complex flux distribution in Lactococcus lactis similarly showed the advantages of MBD. The minimal branching EFM concept underlying MBD should be useful in other applications.

Contact: sinhu@bio.dtu.dk or p.ji@polyu.edu.hk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Chan, S. H. J., Solem, C., Jensen, P. R., Ji, P.
Posted: November 5, 2014, 11:27 am

Motivation: This article presents Thresher, an improved technique for finding peak height thresholds for automated rRNA intergenic spacer analysis (ARISA) profiles. We argue that thresholds must be sample dependent, taking community richness into account. In most previous fragment analyses, a common threshold is applied to all samples simultaneously, ignoring richness variations among samples and thereby compromising cross-sample comparison. Our technique solves this problem, and at the same time provides a robust method for outlier rejection, selecting for removal any replicate pairs that are not valid replicates.

Results: Thresholds are calculated individually for each replicate in a pair, and separately for each sample. The thresholds are selected to be the ones that minimize the dissimilarity between the replicates after thresholding. If a choice of threshold results in the two replicates in a pair failing a quantitative test of similarity, either that threshold or that sample must be rejected. We compare thresholded ARISA results with sequencing results, and demonstrate that the Thresher algorithm outperforms conventional thresholding techniques.
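The threshold selection described above can be sketched as a grid search over candidate thresholds for the two replicates, scoring each pair by the dissimilarity of the thresholded profiles. The abstract does not name the dissimilarity measure or the search strategy, so Bray-Curtis dissimilarity, the exhaustive grid and all function names below are assumptions; the sketch is in Python rather than R and omits the similarity test used for outlier rejection.

```python
from itertools import product


def apply_threshold(profile, threshold):
    """Set peaks below `threshold` to zero."""
    return [h if h >= threshold else 0.0 for h in profile]


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity; None when both profiles are empty."""
    total = sum(a) + sum(b)
    if total == 0:
        return None
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def best_thresholds(rep1, rep2, candidates):
    """Return (dissimilarity, t1, t2) minimizing dissimilarity after thresholding."""
    best = None
    for t1, t2 in product(candidates, repeat=2):
        d = bray_curtis(apply_threshold(rep1, t1), apply_threshold(rep2, t2))
        if d is None:
            continue  # both replicates emptied: not a usable threshold pair
        if best is None or d < best[0]:
            best = (d, t1, t2)
    return best


# Two replicate peak-height profiles that differ mainly in low-level noise peaks.
replicate_a = [100, 80, 5, 0]
replicate_b = [95, 85, 0, 6]
print(best_thresholds(replicate_a, replicate_b, candidates=[0, 10]))
```

Because very high thresholds can trivially reduce dissimilarity by discarding almost every peak, a practical version would bound the candidate range or penalize lost signal.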

Availability and Implementation: The software is implemented in R, and the code is available at http://verenastarke.wordpress.com or by contacting the author.

Contact: vstarke@ciw.edu or http://verenastarke.wordpress.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Starke, V., Steele, A.
Posted: November 5, 2014, 11:27 am

Motivation: The identification of active transcriptional regulatory elements is crucial to understand regulatory networks driving cellular processes such as cell development and the onset of diseases. It has recently been shown that chromatin structure information, such as DNase I hypersensitivity (DHS) or histone modifications, significantly improves cell-specific predictions of transcription factor binding sites. However, no method has so far successfully combined both DHS and histone modification data to perform active binding site prediction.

Results: We propose here a method based on hidden Markov models to integrate DHS and histone modifications occupancy for the detection of open chromatin regions and active binding sites. We have created a framework that includes treatment of genomic signals, model training and genome-wide application. In a comparative analysis, our method obtained a good trade-off between sensitivity and specificity and better area under the curve statistics than competing methods. Moreover, our technique does not require further training or sequence information to generate binding location predictions. Therefore, the method can be easily applied to new cell types and allows flexible downstream analysis such as de novo motif finding.

Availability and implementation: Our framework is available as part of the Regulatory Genomics Toolbox. The software information and all benchmarking data are available at http://costalab.org/wp/dh-hmm.

Contact: ivan.costa@rwth-aachen.de or eduardo.gusmao@rwth-aachen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Gusmao, E. G., Dieterich, C., Zenke, M., Costa, I. G.
Posted: November 5, 2014, 11:27 am

Motivation: Whole-exome sequencing (WES) has opened up previously unheard of possibilities for identifying novel disease genes in Mendelian disorders, only about half of which have been elucidated to date. However, interpretation of WES data remains challenging.

Results: Here, we analyze protein–protein association (PPA) networks to identify candidate genes in the vicinity of genes previously implicated in a disease. The analysis, using a random-walk with restart (RWR) method, is adapted to the setting of WES by developing a composite variant-gene relevance score based on the rarity, location and predicted pathogenicity of variants and the RWR evaluation of genes harboring the variants. Benchmarking using known disease variants from 88 disease-gene families reveals that the correct gene is ranked among the top 10 candidates in ≥50% of cases, a figure which we confirmed using a prospective study of disease genes identified in 2012 and PPA data produced before that date. We implement our method in a freely available Web server, ExomeWalker, that displays a ranked list of candidates together with information on PPAs, frequency and predicted pathogenicity of the variants to allow quick and effective searches for candidates that are likely to reward closer investigation.
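The random walk with restart mentioned above can be illustrated with a minimal sketch. It assumes an unweighted, undirected network stored as an adjacency dictionary and a hypothetical restart probability of 0.7; the function name and parameters are illustrative and are not taken from ExomeWalker, whose composite variant-gene score is not reproduced here.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adj     : dict mapping node -> list of neighbouring nodes (undirected,
              every node is assumed to have at least one neighbour)
    seeds   : set of nodes the walker restarts from (e.g. known disease genes)
    restart : probability of jumping back to a seed at each step (assumed value)
    """
    nodes = sorted(adj)
    # Restart vector: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart component.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk component: each node spreads its mass evenly over its neighbours.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy chain network A - B - C - D with A as the only seed gene.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, {"A"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy chain, nodes closer to the seed receive higher scores, which is the property that lets candidate genes near known disease genes rank highly.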

Availability and implementation: http://compbio.charite.de/ExomeWalker

Contact: peter.robinson@charite.de

Author: Smedley, D., Kohler, S., Czeschik, J. C., Amberger, J., Bocchini, C., Hamosh, A., Veldboer, J., Zemojtel, T., Robinson, P. N.
Posted: November 5, 2014, 11:27 am

Motivation: Set-based variance component tests have been identified as a way to increase power in association studies by aggregating weak individual effects. However, the choice of test statistic has been largely ignored even though it may play an important role in obtaining optimal power. We compared a standard statistical test, a score test, with a recently developed likelihood ratio (LR) test. Further, when correction for hidden structure is needed, or gene–gene interactions are sought, state-of-the-art algorithms for both the score and LR tests can be computationally impractical. Thus we develop new computationally efficient methods.

Results: After reviewing theoretical differences in performance between the score and LR tests, we find empirically on real data that the LR test generally has more power. In particular, on 15 of 17 real datasets, the LR test yielded at least as many associations as the score test—up to 23 more associations—whereas the score test yielded at most one more association than the LR test in the two remaining datasets. On synthetic data, we find that the LR test yielded up to 12% more associations, consistent with our results on real data, but also observe a regime of extremely small signal where the score test yielded up to 25% more associations than the LR test, consistent with theory. Finally, our computational speedups now enable (i) efficient LR testing when the background kernel is full rank, and (ii) efficient score testing when the background kernel changes with each test, as for gene–gene interaction tests. The latter yielded a factor of 2000 speedup on a cohort of size 13 500.

Availability: Software available at http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/.

Contact: heckerma@microsoft.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Lippert, C., Xiang, J., Horta, D., Widmer, C., Kadie, C., Heckerman, D., Listgarten, J.
Posted: November 5, 2014, 11:27 am

Motivation: Individuals in each family are genetically more homogeneous than unrelated individuals, and family-based designs are often recommended for the analysis of rare variants. However, despite the importance of family-based samples analysis, few statistical methods for rare variant association analysis are available.

Results: In this report, we propose a FAmily-based Rare Variant Association Test (FARVAT). FARVAT is based on the quasi-likelihood of whole families, and is statistically and computationally efficient for extended families. FARVAT assumed that families were ascertained with the disease status of family members, and incorporation of the estimated genetic relationship matrix to the proposed method provided robustness in the presence of population substructure. Depending on the choice of working matrix, our method could be a burden test or a variance component test, and could be extended to the SKAT-O-type statistic. FARVAT was implemented in C++, and application of the proposed method to schizophrenia data and simulated data for GAW17 illustrated its practical importance.

Availability: The software calculates various statistics for the analysis of related samples, and it is freely downloadable from http://healthstats.snu.ac.kr/software/farvat.

Contact: won1@snu.ac.kr or tspark@stats.snu.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Choi, S., Lee, S., Cichon, S., Nothen, M. M., Lange, C., Park, T., Won, S.
Posted: November 5, 2014, 11:27 am

Motivation: A popular method for classification of protein domain movements apportions them into two main types: those with a ‘hinge’ mechanism and those with a ‘shear’ mechanism. The intuitive assignment of domain movements to these classes has limited the number of domain movements that can be classified in this way. Furthermore, whether intended or not, the term ‘shear’ is often interpreted to mean a relative translation of the domains.

Results: Numbers of occurrences of four different types of residue contact changes between domains were optimally combined by logistic regression using the training set of domain movements intuitively classified as hinge and shear to produce a predictor for hinge and shear. This predictor was applied to give a 10-fold increase in the number of examples over the number previously available with a high degree of precision. It is shown that overall a relative translation of domains is rare, and that there is no difference between hinge and shear mechanisms in this respect. However, the shear set contains significantly more examples of domains having a relative twisting movement than the hinge set. The angle of rotation is also shown to be a good discriminator between the two mechanisms.

Availability and implementation: Results are free to browse at http://www.cmp.uea.ac.uk/dyndom/interface/.

Contact: sjh@cmp.uea.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Taylor, D., Cawley, G., Hayward, S.
Posted: November 5, 2014, 11:27 am

Motivation: The clonal theory of adaptive immunity proposes that immunological responses are encoded by increases in the frequency of lymphocytes carrying antigen-specific receptors. In this study, we measure the frequency of different T-cell receptors (TcR) in CD4+ T-cell populations of mice immunized with a complex antigen, killed Mycobacterium tuberculosis, using high throughput parallel sequencing of the TcRβ chain. Our initial hypothesis that immunization would induce repertoire convergence proved to be incorrect, and therefore an alternative approach was developed that allows accurate stratification of TcR repertoires and provides novel insights into the nature of CD4+ T-cell receptor recognition.

Results: To track the changes induced by immunization within this heterogeneous repertoire, the sequence data were classified by counting the frequency of different clusters of short (3 or 4) continuous stretches of amino acids within the antigen binding complementarity determining region 3 (CDR3) repertoire of different mice. Both unsupervised (hierarchical clustering) and supervised (support vector machine) analyses of these different distributions of sequence clusters differentiated between immunized and unimmunized mice with 100% efficiency. The CD4+ TcR repertoires of mice 5 and 14 days postimmunization were clearly different from that of unimmunized mice but were not distinguishable from each other. However, the repertoires of mice 60 days postimmunization were distinct both from naive mice and the day 5/14 animals. Our results reinforce the remarkable diversity of the TcR repertoire, resulting in many diverse private TcRs contributing to the T-cell response even in genetically identical mice responding to the same antigen. However, specific motifs defined by short stretches of amino acids within the CDR3 region may determine TcR specificity and define a new approach to TcR sequence classification.
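As a rough illustration of the first step described above, the sketch below counts every contiguous stretch of k amino acids across a set of CDR3 sequences. It is a simplification: the study counts clusters of such stretches, whereas this sketch counts the raw stretches only, and the function name and example sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(cdr3_seqs, k=3):
    """Count every contiguous stretch of k amino acids across CDR3 sequences."""
    counts = Counter()
    for seq in cdr3_seqs:
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical CDR3 fragments, for illustration only.
example = kmer_counts(["CASSL", "CASSQ"], k=3)
```

The resulting count vectors, one per mouse, are the kind of feature representation that hierarchical clustering or a support vector machine could then take as input.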

Availability and implementation: The analysis was implemented in R and Python, and source code can be found in Supplementary Data.

Contact: b.chain@ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Thomas, N., Best, K., Cinelli, M., Reich-Zeliger, S., Gal, H., Shifrut, E., Madi, A., Friedman, N., Shawe-Taylor, J., Chain, B.
Posted: November 5, 2014, 11:27 am

Motivation: A proper target or marker is essential in any diagnosis (e.g. an infection or cancer). An ideal diagnostic target should be both conserved in and unique to the pathogen. Currently, these targets can only be identified manually, which is time-consuming and usually error-prone. Because of the increasingly frequent occurrences of emerging epidemics and multidrug-resistant ‘superbugs’, a rapid diagnostic target identification process is needed.

Results: A new method that can identify uniquely conserved regions (UCRs) as candidate diagnostic targets for a selected group of organisms solely from their genomic sequences has been developed and successfully tested. Using a sequence-indexing algorithm to identify UCRs and a k-mer integer-mapping model for computational efficiency, this method has successfully identified UCRs within the bacteria domain for 15 test groups, including pathogenic, probiotic, commensal and extremophilic bacterial species or strains. Based on the identified UCRs, new diagnostic primer sets were designed, and their specificity and efficiency were tested by polymerase chain reaction amplifications from both pure isolates and samples containing mixed cultures.
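The idea of mapping k-mers to integers can be sketched as follows, assuming a 2-bit encoding of A, C, G and T. The helper names are hypothetical, and the exact-match set operations below only illustrate the notion of "conserved in the target group and absent elsewhere"; they are not the sequence-indexing algorithm used in the study.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_ints(seq, k):
    """Return the integer code of every k-mer in seq, using 2 bits per base.

    k-mers that contain a character other than A, C, G or T are skipped.
    """
    mask = (1 << (2 * k)) - 1
    value, valid, out = 0, 0, []
    for base in seq.upper():
        code = _CODE.get(base)
        if code is None:
            # Ambiguous base: restart the rolling window.
            value, valid = 0, 0
            continue
        # Shift in the new base and drop the base that left the window.
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            out.append(value)
    return out


def unique_conserved_kmers(targets, background, k):
    """k-mers present in every target sequence and in no background sequence."""
    conserved = set(kmer_ints(targets[0], k))
    for seq in targets[1:]:
        conserved &= set(kmer_ints(seq, k))
    for seq in background:
        conserved -= set(kmer_ints(seq, k))
    return conserved
```

Storing each k-mer as a single integer keeps set membership tests cheap, which is the kind of computational saving the integer mapping is meant to provide.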

Availability and implementation: The UCRs identified for the 15 bacterial species are now freely available at http://ucr.synblex.com. The source code of the programs used in this study is accessible at http://ucr.synblex.com/bacterialIdSourceCode.d.zip

Contact: yazhousun@synblex.com

Supplementary Information: Supplementary data are available at Bioinformatics online.

Author: Zhang, Y., Sun, Y.
Posted: November 5, 2014, 11:27 am

Motivation: Mapping of high-throughput sequencing data and other bulk sequence comparison applications have motivated a search for high-efficiency sequence alignment algorithms. The bit-parallel approach represents individual cells in an alignment scoring matrix as bits in computer words and emulates the calculation of scores by a series of logic operations composed of AND, OR, XOR, complement, shift and addition. Bit-parallelism has been successfully applied to the longest common subsequence (LCS) and edit-distance problems, producing fast algorithms in practice.

Results: We have developed BitPAl, a bit-parallel algorithm for general, integer-scoring global alignment. Integer-scoring schemes assign integer weights for match, mismatch and insertion/deletion. The BitPAl method uses structural properties in the relationship between adjacent scores in the scoring matrix to construct classes of efficient algorithms, each designed for a particular set of weights. In timed tests, we show that BitPAl runs 7–25 times faster than a standard iterative algorithm.
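To make the bit-parallel idea concrete, here is a minimal sketch of the well-known bit-vector computation of the LCS length, the simpler problem mentioned in the Motivation. It is not BitPAl itself, which handles general integer scoring; the function name is an assumption, and Python's unbounded integers stand in for machine words.

```python
def lcs_length_bitparallel(a, b):
    """Length of the longest common subsequence of a and b, bit-parallel."""
    m = len(a)
    mask = (1 << m) - 1
    # match[c] has bit i set when a[i] == c.
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: no matches recorded yet
    for ch in b:
        u = v & match.get(ch, 0)
        # One addition, one subtraction and one OR update a whole matrix row.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit in v marks a position where the LCS length increased.
    return m - bin(v).count("1")


example_length = lcs_length_bitparallel("ABCBDAB", "BDCABA")  # 4
```

Each character of the second sequence is processed with a constant number of word operations, rather than one operation per cell of the scoring matrix, which is where the speedup comes from.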

Availability and implementation: Source code is freely available for download at http://lobstah.bu.edu/BitPAl/BitPAl.html. BitPAl is implemented in C and runs on all major operating systems.

Contact: jloving@bu.edu or yhernand@bu.edu or gbenson@bu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Loving, J., Hernandez, Y., Benson, G.
Posted: November 5, 2014, 11:27 am

Motivation: Single-cell DNA sequencing is necessary for examining genetic variation at the cellular level, which remains hidden in bulk sequencing experiments. But because such experiments begin with very small amounts of starting material, the amount of information that is obtained from a single-cell sequencing experiment is highly sensitive to the choice of protocol employed and variability in library preparation. In particular, the fraction of the genome represented in single-cell sequencing libraries exhibits extreme variability due to quantitative biases in amplification and loss of genetic material.

Results: We propose a method to predict the genome coverage of a deep sequencing experiment using information from an initial shallow sequencing experiment mapped to a reference genome. The observed coverage statistics are used in a non-parametric empirical Bayes Poisson model to estimate the gain in coverage from deeper sequencing. This approach allows researchers to know statistical features of deep sequencing experiments without actually sequencing deeply, providing a basis for optimizing and comparing single-cell sequencing protocols or screening libraries.

Availability and implementation: The method is available as part of the preseq software package. Source code is available at http://smithlabresearch.org/preseq.

Contact: andrewds@usc.edu

Supplementary information: Supplementary material is available at Bioinformatics online.

Author: Daley, T., Smith, A. D.
Posted: November 5, 2014, 11:27 am

Motivation: Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30–60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts.

Results: In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package ‘MLbias’ and all source files are publicly available.
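The learning-curve idea can be sketched with a simplified inverse power law, err(n) = a * n^(-b), fitted by least squares on the log-log scale. This two-parameter form, the function names and the example numbers are assumptions made for illustration; the sketch is not the MLbias implementation and omits any asymptotic error term.

```python
import math


def fit_inverse_power_law(sizes, errors):
    """Fit err(n) = a * n**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope  # b is the negated slope


def predict_error(a, b, n):
    """Extrapolate the fitted curve to a sample size n."""
    return a * n ** (-b)


# Hypothetical error rates that follow an exact power law, for illustration.
sizes = [10, 20, 40, 80]
errors = [2.0 * n ** -0.5 for n in sizes]
a_hat, b_hat = fit_inverse_power_law(sizes, errors)
```

Once a and b are estimated, predict_error can be evaluated at a larger n to judge whether recruiting more samples is likely to lower the error rate.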

Availability and implementation: tsenglab.biostat.pitt.edu/software.htm.

Contact: ctseng@pitt.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Ding, Y., Tang, S., Liao, S. G., Jia, J., Oesterreich, S., Lin, Y., Tseng, G. C.
Posted: November 5, 2014, 11:27 am

Summary: The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest+, a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest+ enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest+ addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest+ through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing.

Availability: The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest.

Contact: g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Papanikolaou, N., Pavlopoulos, G. A., Pafilis, E., Theodosiou, T., Schneider, R., Satagopam, V. P., Ouzounis, C. A., Eliopoulos, A. G., Promponas, V. J., Iliopoulos, I.
Posted: November 5, 2014, 11:27 am

Motivation: Clustering methods can be useful for automatically grouping documents into meaningful clusters, improving human comprehension of a document collection. Although there are clustering algorithms that can achieve the goal for relatively large document collections, they do not always work well for small and homogenous datasets.

Methods: In this article, we present Retro—a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. Further, the statistical significance of candidate clusters is tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering.
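A hypergeometric test for phrase enrichment can be sketched as below. The setting is an assumption made for illustration: N documents in a background corpus, K of which contain the phrase, and a collection of n documents in which the phrase is observed k times. The function name is hypothetical and the sketch does not reproduce Retro's full key-phrase procedure.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N : documents in the background corpus
    K : background documents that contain the phrase
    n : documents in the collection under study
    k : documents in the collection that contain the phrase
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probability mass from k up to the largest possible count.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


p_value = hypergeom_upper_tail(2, 10, 3, 4)  # 70 / 210 = 1/3
```

A small tail probability indicates that the phrase occurs in the collection more often than chance would predict, which makes it a candidate cluster title.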

Results: We test our system on five disease datasets from OMIM® and evaluate the results based on MeSH® term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of quality of clusters. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene® database, a resource in PubMed®. Empirical evaluation confirms the method is useful for small homogenous datasets in producing meaningful clusters with descriptive titles.

Availability and implementation: A web-based demonstration of the algorithm applied to a collection of sets from the HomoloGene database is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/CLUSTERING_HOMOLOGENE/index.html.

Contact: lana.yeganova@nih.gov

Supplementary Information: Supplementary data are available at Bioinformatics online.

Author: Yeganova, L., Kim, W., Kim, S., Wilbur, W. J.
Posted: November 5, 2014, 11:27 am

Motivation: Recent studies on human disease have revealed that aberrant interaction between proteins probably underlies a substantial number of human genetic diseases. This suggests a need to investigate disease inheritance mode in terms of protein interaction and, on that basis, to refresh our conceptual understanding of a series of properties of the inheritance mode of human disease.

Results: We observed a strong correlation between the number of protein interactions and the likelihood of a gene causing any dominant diseases or multiple dominant diseases, whereas no correlation was observed between protein interaction and the likelihood of a gene causing recessive diseases. We found that dominant diseases are more likely to be associated with disruption of important interactions. These findings suggest that inheritance mode should be understood using protein interaction. We therefore reviewed the previous studies and refined an interaction model of inheritance mode, and then confirmed that this model is largely reasonable using new evidence. With these findings, we found that the inheritance mode of human genetic diseases can be predicted using protein interaction. By integrating the systems biology perspectives with the classical disease genetics paradigm, our study provides some new insights into genotype–phenotype correlations.

Contact: haodapeng@ems.hrbmu.edu.cn or biofomeng@hotmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hao, D., Li, C., Zhang, S., Lu, J., Jiang, Y., Wang, S., Zhou, M.
Posted: November 5, 2014, 11:27 am

Motivation: Most approaches used to identify cancer driver genes focus, true to their name, on entire genes and assume that a gene, treated as one entity, has a specific role in cancer. This approach may be correct to describe effects of gene loss or changes in gene expression; however, mutations may have different effects, including their relevance to cancer, depending on which region of the gene they affect. Except for rare and well-known exceptions, there are not enough data for reliable statistics for individual positions, but an intermediate level of analysis, between an individual position and the entire gene, may give us better statistics than the former and better resolution than the latter approach.

Results: We have developed e-Driver, a method that exploits the internal distribution of somatic missense mutations between the protein’s functional regions (domains or intrinsically disordered regions) to find those that show a bias in their mutation rate as compared with other regions of the same protein, providing evidence of positive selection and suggesting that these proteins may be actual cancer drivers. We have applied e-Driver to a large cancer genome dataset from The Cancer Genome Atlas and compared its performance with that of four other methods, showing that e-Driver identifies novel candidate cancer drivers and, because of its increased resolution, provides deeper insights into the potential mechanism of cancer driver genes identified by other methods.

Availability and implementation: A Perl script with e-Driver and the files to reproduce the results described here can be downloaded from https://github.com/eduardporta/e-Driver.git

Contact: adam@godziklab.org or eppardo@sanfordburnham.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Porta-Pardo, E., Godzik, A.
Posted: October 17, 2014, 1:05 pm

Motivation: A large number of experimental studies on ageing focus on the effects of genetic perturbations of the insulin/insulin-like growth factor signalling pathway (IIS) on lifespan. Short-lived invertebrate laboratory model organisms are extensively used to quickly identify ageing-related genes and pathways. It is important to extrapolate this knowledge to longer lived mammalian organisms, such as mouse and eventually human, where such analyses are difficult or impossible to perform. Computational tools are needed to integrate and manipulate pathway knowledge in different species.

Results: We performed a literature review and curation of the IIS and target of rapamycin signalling pathways in Mus musculus. We compare this pathway model to the equivalent models in Drosophila melanogaster and Caenorhabditis elegans. Although generally well-conserved, they exhibit important differences. In general, the worm and mouse pathways include a larger number of feedback loops and interactions than the fly. We identify ‘functional orthologues’ that share similar molecular interactions, but have moderate sequence similarity. Finally, we incorporate the mouse model into the web-service NetEffects and perform in silico gene perturbations of IIS components and analyses of experimental results. We identify sub-paths that, given a mutation in an IIS component, could potentially antagonize the primary effects on ageing via FOXO in mouse and via SKN-1 in worm. Finally, we explore the effects of FOXO knockouts in three different mouse tissues.

Availability and implementation: http://www.ebi.ac.uk/thornton-srv/software/NetEffects

Contact: ip8@sanger.ac.uk or thornton@ebi.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Papatheodorou, I., Petrovs, R., Thornton, J. M.
Posted: October 17, 2014, 1:05 pm

Summary: The linear mixed model is the state-of-the-art method to account for the confounding effects of kinship and population structure in genome-wide association studies (GWAS). Current implementations test the effect of one or more genetic markers while including prespecified covariates such as sex. Here we develop an efficient implementation of the linear mixed model that allows composite hypothesis tests to consider genotype interactions with variables such as other genotypes, environment, sex or ancestry. Our R package, lrgpr, allows interactive model fitting and examination of regression diagnostics to facilitate exploratory data analysis in the context of the linear mixed model. By leveraging parallel and out-of-core computing for datasets too large to fit in main memory, lrgpr is applicable to large GWAS datasets and next-generation sequencing data.

Availability and implementation: lrgpr is an R package available from lrgpr.r-forge.r-project.org

Contact: gabriel.hoffman@mssm.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hoffman, G. E., Mezey, J. G., Schadt, E. E.
Posted: October 17, 2014, 1:05 pm

Motivation: Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, resulting reads are short, making their analyses challenging. Recently, a new technology, single molecule real-time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. However, its broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial number of reads. This limits their usability considerably, especially in the case of large sequencing projects.

Results: Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high-performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread-corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing.

Availability and implementation: proovread is available at the following URL: http://proovread.bioapps.biozentrum.uni-wuerzburg.de

Contact: frank.foerster@biozentrum.uni-wuerzburg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Hackl, T., Hedrich, R., Schultz, J., Forster, F.
Posted: October 17, 2014, 1:05 pm

Summary: STAMP is a graphical software package that provides statistical hypothesis tests and exploratory plots for analysing taxonomic and functional profiles. It supports tests for comparing pairs of samples or samples organized into two or more treatment groups. Effect sizes and confidence intervals are provided to allow critical assessment of the biological relevancy of test results. A user-friendly graphical interface permits easy exploration of statistical results and generation of publication-quality plots.

Availability and implementation: STAMP is licensed under the GNU GPL. Python source code and binaries are available from our website at: http://kiwi.cs.dal.ca/Software/STAMP

Contact: donovan.parks@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Parks, D. H., Tyson, G. W., Hugenholtz, P., Beiko, R. G.
Posted: October 17, 2014, 1:05 pm

Motivation: Riboswitches are short sequences of messenger RNA that can change their structural conformation to regulate the expression of adjacent genes. Computational prediction of putative riboswitches can provide direction to molecular biologists studying riboswitch-mediated gene expression.

Results: The Denison Riboswitch Detector (DRD) is a new computational tool with a Web interface that can quickly identify putative riboswitches in DNA sequences on the scale of bacterial genomes. Riboswitch descriptions are easily modifiable and new ones are easily created. The underlying algorithm converts the problem to a ‘heaviest path’ problem on a multipartite graph, which is then solved using efficient dynamic programming. We show that DRD can achieve ~88–99% sensitivity and >99.99% specificity on 13 riboswitch families.
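A 'heaviest path' through a layered (multipartite) graph can be found by dynamic programming, as in the sketch below. The formulation is an assumption made for illustration: each layer holds candidate matches for one element of a riboswitch description as (position, weight) pairs, and a valid path picks one candidate per layer with strictly increasing positions. The function name and data layout are hypothetical and are not taken from DRD.

```python
def heaviest_path(layers):
    """Return (total_weight, path) for the heaviest valid path, or None.

    layers : list of layers, each a list of (position, weight) candidates.
    A valid path takes one candidate from every layer, in layer order,
    with strictly increasing positions.
    """
    if not layers:
        return None
    # prev holds, for each candidate of the current layer, the best
    # (total weight, path) among paths that end at that candidate.
    prev = [(w, [(pos, w)]) for pos, w in layers[0]]
    for layer in layers[1:]:
        cur = []
        for pos, w in layer:
            # Only predecessors located before this candidate are allowed.
            options = [(t, path) for t, path in prev if path[-1][0] < pos]
            if options:
                t, path = max(options, key=lambda item: item[0])
                cur.append((t + w, path + [(pos, w)]))
        prev = cur
    if not prev:
        return None
    return max(prev, key=lambda item: item[0])


# Hypothetical candidates for a three-element description.
toy_layers = [
    [(1, 2.0), (5, 3.0)],
    [(3, 1.0), (8, 2.0)],
    [(6, 4.0), (9, 1.0)],
]
best = heaviest_path(toy_layers)
```

Because each layer only needs the best partial paths of the layer before it, the search avoids enumerating every combination of candidates.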

Availability and implementation: DRD is available at http://drd.denison.edu.

Contact: havill@denison.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Havill, J. T., Bhatiya, C., Johnson, S. M., Sheets, J. D., Thompson, J. S.
Posted: October 17, 2014, 1:05 pm

Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application.

Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third ‘Quest for Orthologs’ meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes.

The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking.

Availability and implementation: All such materials are available at http://questfororthologs.org.

Contact: erik.sonnhammer@scilifelab.se or c.dessimoz@ucl.ac.uk

Author: Sonnhammer, E. L. L., Gabaldon, T., Sousa da Silva, A. W., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P. D., Dessimoz, C., the Quest for Orthologs consortium
Posted: October 17, 2014, 1:05 pm

Motivation: Brownian models have been introduced in phylogenetics for describing variation in substitution rates through time, with applications to molecular dating or to the comparative analysis of variation in substitution patterns among lineages. Thus far, however, the Monte Carlo implementations of these models have relied on crude approximations, in which the Brownian process is sampled only at the internal nodes of the phylogeny or at the midpoints along each branch, and the unknown trajectory between these sampled points is summarized by simple branchwise average substitution rates.

Results: A more accurate Monte Carlo approach is introduced, explicitly sampling a fine-grained discretization of the trajectory of the (potentially multivariate) Brownian process along the phylogeny. Generic Monte Carlo resampling algorithms are proposed for updating the Brownian paths along and across branches. Specific computational strategies are developed for efficient integration of the finite-time substitution probabilities across branches induced by the Brownian trajectory. The mixing properties and the computational complexity of the resulting Markov chain Monte Carlo sampler scale reasonably with the discretization level, allowing practical applications with up to a few hundred discretization points along the entire depth of the tree. The method can be generalized to other Markovian stochastic processes, making it possible to implement a wide range of time-dependent substitution models with well-controlled computational precision.
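The difference between a fine-grained discretization and a single branchwise average can be sketched as follows. This is an illustration only, not the program's sampler: it assumes a univariate Brownian process on the log substitution rate, and the function names, the `sigma` parameter and the trapezoidal averaging are assumptions made for the example.

```python
import math
import random
from typing import List


def sample_brownian_path(
    start: float, sigma: float, branch_length: float, n_steps: int, rng: random.Random
) -> List[float]:
    """Sample the log-rate at n_steps + 1 equally spaced points along one branch."""
    dt = branch_length / n_steps
    path = [start]
    for _ in range(n_steps):
        # Brownian increment over an interval dt has variance sigma^2 * dt
        path.append(path[-1] + rng.gauss(0.0, sigma * math.sqrt(dt)))
    return path


def mean_rate(log_rates: List[float], branch_length: float) -> float:
    """Trapezoidal approximation of the average rate exp(log-rate) along the branch."""
    n = len(log_rates) - 1
    dt = branch_length / n
    rates = [math.exp(x) for x in log_rates]
    integral = sum(0.5 * (rates[i] + rates[i + 1]) * dt for i in range(n))
    return integral / branch_length


rng = random.Random(1)  # fixed seed so the sketch is reproducible
fine = sample_brownian_path(0.0, sigma=0.5, branch_length=1.0, n_steps=100, rng=rng)
# Crude summary that only uses the two end points of the same trajectory
endpoint_average = 0.5 * (math.exp(fine[0]) + math.exp(fine[-1]))
fine_average = mean_rate(fine, branch_length=1.0)
```

The two averages generally differ, and the gap is what a finer discretization is meant to control.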

Availability: The program is freely available at www.phylobayes.org

Contact: nicolas.lartillot@univ-lyon1.fr

Author: Horvilleur, B., Lartillot, N.
Posted: October 17, 2014, 1:05 pm

The Affymetrix Axiom genotyping standard and ‘best practice’ workflow for Linux and Mac users consists of three stand-alone executable programs (Affymetrix Power Tools) and an R package (SNPolisher). Currently, SNP analysis has to be performed in a step-by-step procedure. Manual intervention and/or programming skills are required of the user at each intermediate point, as Affymetrix Power Tools programs do not produce input files for the next program in line. An additional problem is that the output format of genotypes is not compatible with most analysis software currently available. AffyPipe solves all the above problems by automating both standard and ‘best practice’ workflows for any species genotyped with the Axiom technology. AffyPipe does not require programming skills and performs all the steps necessary to obtain a final genotype file. Furthermore, users can directly edit SNP probes and export genotypes in PLINK format.

Availability and implementation: https://github.com/nicolazzie/AffyPipe.git.

Contact: ezequiel.nicolazzi@tecnoparco.org

Author: Nicolazzi, E. L., Iamartino, D., Williams, J. L.
Posted: October 17, 2014, 1:05 pm

Motivation: The ability to accurately model protein structures at the atomistic level underpins efforts to understand protein folding, to engineer natural proteins predictably and to design proteins de novo. Homology-based methods are well established and produce impressive results. However, these are limited to structures presented by and resolved for natural proteins. Addressing this problem more widely and deriving truly ab initio models requires mathematical descriptions for protein folds; the means to decorate these with natural, engineered or de novo sequences; and methods to score the resulting models.

Results: We present CCBuilder, a web-based application that tackles the problem for a defined but large class of protein structure, the α-helical coiled coils. CCBuilder generates coiled-coil backbones, builds side chains onto these frameworks and provides a range of metrics to measure the quality of the models. Its straightforward graphical user interface provides broad functionality that allows users to build and assess models, in which helix geometry, coiled-coil architecture and topology and protein sequence can be varied rapidly. We demonstrate the utility of CCBuilder by assembling models for 653 coiled-coil structures from the PDB, which cover >96% of the known coiled-coil types, and by generating models for rarer and de novo coiled-coil structures.

Availability and implementation: CCBuilder is freely available, without registration, at http://coiledcoils.chm.bris.ac.uk/app/cc_builder/

Contact: D.N.Woolfson@bristol.ac.uk or Chris.Wood@bristol.ac.uk

Author: Wood, C. W., Bruning, M., Ibarra, A. A., Bartlett, G. J., Thomson, A. R., Sessions, R. B., Brady, R. L., Woolfson, D. N.
Posted: October 17, 2014, 1:05 pm

Motivation: Recent breakthroughs in protein residue–residue contact prediction have made reliable de novo prediction of protein structures possible. The key was to apply statistical methods that can distinguish direct couplings between pairs of columns in a multiple sequence alignment from merely correlated pairs, i.e. to separate direct from indirect effects. Two classes of such methods exist, either relying on regularized inversion of the covariance matrix or on pseudo-likelihood maximization (PLM). Although PLM-based methods offer clearly higher precision, available tools are not sufficiently optimized and are written in interpreted languages that introduce additional overheads. This impedes the runtime and large-scale contact prediction for larger protein families, multi-domain proteins and protein–protein interactions.

Results: Here we introduce CCMpred, our performance-optimized PLM implementation in C and CUDA C. Using graphics cards in the price range of current six-core processors, CCMpred can predict contacts for typical alignments 35–113 times faster and with the same precision as the most accurate published methods. For users without a CUDA-capable graphics card, CCMpred can also run in a CPU mode that is still 4–14 times faster. Thanks to our speed-ups, contacts for typical protein families can be predicted in 15–60 s on a consumer-grade GPU and 1–6 min on a six-core CPU.

Availability and implementation: CCMpred is free and open-source software under the GNU Affero General Public License v3 (or later) available at https://bitbucket.org/soedinglab/ccmpred

Contact: johannes.soeding@mpibpc.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Seemayer, S., Gruber, M., Soding, J.
Posted: October 17, 2014, 1:05 pm

Motivation: Oncogenes are known drivers of cancer phenotypes and targets of molecular therapies; however, the complex and diverse signaling mechanisms regulated by oncogenes and potential routes to targeted therapy resistance remain to be fully understood. To this end, we present an approach to infer regulatory mechanisms downstream of the HER2 driver oncogene in SUM-225 metastatic breast cancer cells from dynamic gene expression patterns using a succession of analytical techniques, including a novel MP grammars method to mathematically model putative regulatory interactions among sets of clustered genes.

Results: Our method highlighted regulatory interactions previously identified in the cell line and a novel finding that the HER2 oncogene, as opposed to the proto-oncogene, upregulates expression of the E2F2 transcription factor. By targeted gene knockdown we show the significance of this, demonstrating that cancer cell-matrix adhesion and outgrowth were markedly inhibited when E2F2 levels were reduced. This validates, in this context, that upregulation of E2F2 represents a key intermediate event in a HER2 oncogene-directed gene expression-based signaling circuit. This work demonstrates how predictive modeling of longitudinal gene expression data combined with multiple systems-level analyses can be used to accurately predict downstream signaling pathways. Here, our integrated method was applied to reveal insights as to how the HER2 oncogene drives a specific cancer cell phenotype, but it is adaptable to investigate other oncogenes and model systems.

Availability and implementation: Accessibility of various tools is listed in methods; the Log-Gain Stoichiometric Stepwise algorithm is accessible at http://www.cbmc.it/software/Software.php.

Contact: bollig@karmanos.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Bollig-Fischer, A., Marchetti, L., Mitrea, C., Wu, J., Kruger, A., Manca, V., Drăghici, S.
Posted: October 17, 2014, 1:05 pm

Summary: NetPathMiner is a general framework for mining, from genome-scale networks, paths that are related to specific experimental conditions. NetPathMiner interfaces with various input formats including KGML, SBML and BioPAX files and allows for manipulation of networks in three different forms: metabolic, reaction and gene representations. NetPathMiner ranks the obtained paths and applies Markov model-based clustering and classification methods to the ranked paths for easy interpretation. NetPathMiner also provides static and interactive visualizations of networks and paths to aid manual investigation.

Availability: The package is available through Bioconductor and from Github at http://github.com/ahmohamed/NetPathMiner

Contact: mohamed@kuicr.kyoto-u.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Mohamed, A., Hancock, T., Nguyen, C. H., Mamitsuka, H.
Posted: October 17, 2014, 1:05 pm

Motivation: Asymmetry is frequently observed in the empirical distribution of test statistics that results from the analysis of gene expression experiments. This asymmetry indicates an asymmetry in the distribution of effect sizes. A common method for identifying differentially expressed (DE) genes in a gene expression experiment while controlling false discovery rate (FDR) is Storey’s q-value method. This method ranks genes based solely on the P-values from each gene in the experiment.
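The P-value-only ranking described above can be sketched in a few lines. This is a simplified baseline, not the sign-aware method proposed in the paper: it assumes the proportion of true null hypotheses is fixed at 1, in which case the q-value reduces to the Benjamini-Hochberg adjusted P-value. The function name `bh_q_values` is hypothetical.

```python
from typing import List, Sequence


def bh_q_values(p_values: Sequence[float]) -> List[float]:
    """q-values under the assumption pi0 = 1 (Benjamini-Hochberg adjusted P-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


example_p = [0.01, 0.04, 0.03, 0.5]
example_q = bh_q_values(example_p)
# Genes are ranked by P-value alone; the sign of the test statistic plays no role here
ranking = sorted(range(len(example_p)), key=lambda i: example_p[i])
```

Two genes with equal P-values but test statistics of opposite sign receive the same q-value in this baseline, which is the limitation the asymmetry argument targets.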

Results: We propose a method that alters and improves upon the q-value method by taking the sign of the test statistics, in addition to the P-values, into account. Through two simulation studies (one involving independent normal data and one involving microarray data), we show that the proposed method, when compared with the traditional q-value method, generally provides a better ranking for genes as well as a higher number of truly DE genes declared to be DE, while still adequately controlling FDR. We illustrate the proposed method by analyzing two microarray datasets, one from an experiment of thale cress seedlings and the other from an experiment of maize leaves.

Availability and implementation: The R code and data files for the proposed method and examples are available at Bioinformatics online.

Contact: megan.orr@ndsu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Orr, M., Liu, P., Nettleton, D.
Posted: October 17, 2014, 1:05 pm

Motivation: Biological system behaviors are often the outcome of complex interactions among a large number of cells and their biotic and abiotic environment. Computational biologists attempt to understand, predict and manipulate biological system behavior through mathematical modeling and computer simulation. Discrete agent-based modeling (in combination with high-resolution grids to model the extracellular environment) is a popular approach for building biological system models. However, the computational complexity of this approach forces computational biologists to resort to coarser resolution approaches to simulate large biological systems. High-performance parallel computers have the potential to address the computing challenge, but writing efficient software for parallel computers is difficult and time-consuming.

Results: We have developed Biocellion, a high-performance software framework, to solve this computing challenge using parallel computers. To support a wide range of multicellular biological system models, Biocellion asks users to provide their model specifics by filling the function body of pre-defined model routines. Using Biocellion, modelers without parallel computing expertise can efficiently exploit parallel computers with less effort than writing sequential programs from scratch. We simulate cell sorting, microbial patterning and a bacterial system in soil aggregate as case studies.

Availability and implementation: Biocellion runs on x86 compatible systems with the 64 bit Linux operating system and is freely available for academic use. Visit http://biocellion.com for additional information.

Contact: seunghwa.kang@pnnl.gov

Author: Kang, S., Kahan, S., McDermott, J., Flann, N., Shmulevich, I.
Posted: October 17, 2014, 1:05 pm

Motivation: A rapid progression of esophageal squamous cell carcinoma (ESCC) causes a high mortality rate because of the propensity for metastasis driven by genetic and epigenetic alterations. The identification of prognostic biomarkers would help prevent or control metastatic progression. Expression analyses have been used to find such markers, but do not always validate in separate cohorts. Epigenetic marks, such as DNA methylation, are a potential source of more reliable and stable biomarkers. Importantly, the integration of both expression and epigenetic alterations is more likely to identify relevant biomarkers.

Results: We present a new analysis framework, using an ESCC progression-associated gene regulatory network (GRNescc), to identify differentially methylated CpG sites prognostic of ESCC progression. From the CpG loci differentially methylated in 50 tumor–normal pairs, we selected 44 CpG loci most highly associated with survival and located in the promoters of genes more likely to belong to GRNescc. Using an independent ESCC cohort, we confirmed that 8/10 CpG loci in the promoters of GRNescc genes significantly correlated with patient survival. In contrast, 0/10 CpG loci in the promoters of genes outside the GRNescc were correlated with patient survival. We further characterized the GRNescc network topology and observed that the genes with methylated CpG loci associated with survival deviated from the center of mass and were less likely to be hubs in the GRNescc. We postulate that our analysis framework improves the identification of bona fide prognostic biomarkers from DNA methylation studies, especially with partial genome coverage.

Contact: tsengsm@mail.ncku.edu.tw or ycw5798@mail.ncku.edu.tw

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Cheng, C.-P., Kuo, I.-Y., Alakus, H., Frazer, K. A., Harismendy, O., Wang, Y.-C., Tseng, V. S.
Posted: October 17, 2014, 1:05 pm

Motivation: The increasing availability of mitochondria-targeted and off-target sequencing data in whole-exome and whole-genome sequencing studies (WXS and WGS) has raised the demand for effective pipelines to accurately measure heteroplasmy and to easily recognize the most functionally important mitochondrial variants among a huge number of candidates. To this end, we developed MToolBox, a highly automated pipeline to reconstruct and analyze human mitochondrial DNA from high-throughput sequencing data.

Results: MToolBox implements an effective computational strategy for mitochondrial genome assembly and haplogroup assignment, and also includes a prioritization analysis of detected variants. MToolBox provides a Variant Call Format file featuring, for the first time, allele-specific heteroplasmy and annotation files with prioritized variants. MToolBox was tested on simulated samples and applied to 1000 Genomes WXS datasets.

Availability and implementation: MToolBox package is available at https://sourceforge.net/projects/mtoolbox/.

Contact: marcella.attimonelli@uniba.it

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Calabrese, C., Simone, D., Diroma, M. A., Santorsola, M., Gutta, C., Gasparre, G., Picardi, E., Pesole, G., Attimonelli, M.
Posted: October 17, 2014, 1:05 pm

Motivation: The successful translation of genomic signatures into clinical settings relies on good discrimination between patient subgroups. Many sophisticated algorithms have been proposed in the statistics and machine learning literature, but in practice simpler algorithms are often used. However, few simple algorithms have been formally described or systematically investigated.

Results: We give a precise definition of a popular simple method we refer to as más-o-menos, which calculates prognostic scores for discrimination by summing standardized predictors, weighted by the signs of their marginal associations with the outcome. We study its behavior theoretically, in simulations and in an extensive analysis of 27 independent gene expression studies of bladder, breast and ovarian cancer, altogether totaling 3833 patients with survival outcomes. We find that despite its simplicity, más-o-menos can achieve good discrimination performance. It performs no worse, and sometimes better, than popular and much more CPU-intensive methods for discrimination, including lasso and ridge regression.
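The scoring rule described above (sum the standardized predictors, each weighted by the sign of its marginal association with the outcome) can be sketched as follows. This is not the survHD implementation: it assumes a continuous outcome and uses the sign of the covariance as a stand-in for the marginal association, whereas a survival analysis would take the sign from a univariate survival model. For brevity the scoring step also standardizes on the data being scored. All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Center a predictor and scale it to unit standard deviation."""
    m, s = mean(column), pstdev(column)
    if s == 0:
        return [0.0] * len(column)  # a constant predictor carries no information
    return [(x - m) / s for x in column]


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def mas_o_menos_fit(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[int]:
    """Return one weight in {-1, 0, +1} per predictor (column of X)."""
    y_mean = mean(y)
    signs = []
    for col in zip(*X):
        z = standardize(col)
        # Assumption: covariance sign stands in for the marginal association
        cov = sum(zi * (yi - y_mean) for zi, yi in zip(z, y))
        signs.append(sign(cov))
    return signs


def mas_o_menos_score(X: Sequence[Sequence[float]], signs: Sequence[int]) -> List[float]:
    """Score each sample as the signed sum of its standardized predictors."""
    columns = [standardize(col) for col in zip(*X)]
    return [sum(s * col[i] for s, col in zip(signs, columns)) for i in range(len(X))]


# Toy data: predictor 0 rises with the outcome, predictor 1 falls with it
X_toy = [[1.0, 5.0], [2.0, 4.0], [3.0, 3.0], [4.0, 2.0]]
y_toy = [1.0, 2.0, 3.0, 4.0]
toy_signs = mas_o_menos_fit(X_toy, y_toy)
toy_scores = mas_o_menos_score(X_toy, toy_signs)
```

The only quantities estimated from the training data are the signs, which is why the method is so much cheaper than lasso or ridge regression.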

Availability and Implementation: Más-o-menos is implemented for survival analysis as an option in the survHD package, available from http://www.bitbucket.org/lwaldron/survhd and submitted to Bioconductor.

Contact: sdzhao@illinois.edu

Author: Zhao, S. D., Parmigiani, G., Huttenhower, C., Waldron, L.
Posted: October 17, 2014, 1:05 pm

Summary: The fluorescence in situ hybridization (FISH) method has been providing valuable information on physical distances between loci (via image analysis) for several decades. Recently, high-throughput data on nearby chemical contacts between and within chromosomes became available with the Hi-C method. Here, we present FisHiCal, an R package for an iterative FISH-based Hi-C calibration that exploits in full the information coming from these methods. We describe here our calibration model and present 3D inference methods that we have developed for increasing its usability, namely, 3D reconstruction through local stress minimization and detection of spatial inconsistencies. We next confirm our calibration across three human cell lines and explain how the output of our methods could inform our model, defining an iterative calibration pipeline, with applications for quality assessment and meta-analysis.

Availability and implementation: FisHiCal v1.1 is available from http://cran.r-project.org/.

Contact: ys388@cam.ac.uk

Supplementary information: Supplementary Data is available at Bioinformatics online.

Author: Shavit, Y., Hamey, F. K., Lio, P.
Posted: October 17, 2014, 1:05 pm

Motivation: MicroRNAs (miRNAs) play crucial roles in complex cellular networks by binding to the messenger RNAs (mRNAs) of protein coding genes. It has been found that miRNA regulation is often condition-specific. A number of computational approaches have been developed to identify miRNA activity specific to a condition of interest using gene expression data. However, most of the methods only use the data in a single condition, and thus, the activity discovered may not be unique to the condition of interest. Additionally, these methods are based on statistical associations between the gene expression levels of miRNAs and mRNAs, so they may not be able to reveal real gene regulatory relationships, which are causal relationships.

Results: We propose a novel method to infer condition-specific miRNA activity by considering (i) the difference between the regulatory behavior that an miRNA has in the condition of interest and its behavior in the other conditions; (ii) the causal semantics of miRNA–mRNA relationships. The method is applied to the epithelial–mesenchymal transition (EMT) and multi-class cancer (MCC) datasets. Validation against the results of transfection experiments shows that our approach is effective in discovering significant miRNA–mRNA interactions. Functional and pathway analysis and literature validation indicate that the identified active miRNAs are closely associated with the specific biological processes, diseases and pathways. More detailed analysis of the active miRNAs implies that some show different regulation types in different conditions, whereas others keep the same regulation type and differ across conditions only in the strength of regulation.

Availability and implementation: The R and Matlab scripts are in the Supplementary materials.

Contact: jiuyong.li@unisa.edu.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Zhang, J., Le, T. D., Liu, L., Liu, B., He, J., Goodall, G. J., Li, J.
Posted: October 17, 2014, 1:05 pm

Summary: Circleator is a Perl application that generates circular figures of genome-associated data. It leverages BioPerl to support standard annotation and sequence file formats and produces publication-quality SVG output. It is designed to be both flexible and easy to use. It includes a library of circular track types and predefined configuration files for common use-cases, including: (i) visualizing gene annotation and DNA sequence data from a GenBank flat file, (ii) displaying patterns of gene conservation in related microbial strains, (iii) showing Single Nucleotide Polymorphisms (SNPs) and indels relative to a reference genome and gene set and (iv) viewing RNA-Seq plots.

Availability and implementation: Circleator is freely available under the Artistic License 2.0 from http://jonathancrabtree.github.io/Circleator/ and is integrated with the CloVR cloud-based sequence analysis Virtual Machine (VM), which can be downloaded from http://clovr.org or run on Amazon EC2.

Contact: jcrabtree@som.umaryland.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Crabtree, J., Agrawal, S., Mahurkar, A., Myers, G. S., Rasko, D. A., White, O.
Posted: October 17, 2014, 1:05 pm

Motivation: The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data.

Results: We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data.
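Idea (i) above, storing a SNP as its differences from a reference SNP in the same linkage disequilibrium block, can be sketched as follows. This is an illustration only and not the library's file format: it assumes genotypes coded as 0, 1 or 2 per subject and stores each difference as an `(index, genotype)` pair. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def encode_against_reference(
    reference: Sequence[int], snp: Sequence[int]
) -> List[Tuple[int, int]]:
    """Keep only the subjects whose genotype differs from the reference SNP."""
    if len(reference) != len(snp):
        raise ValueError("reference and snp must cover the same subjects")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, snp)) if r != g]


def decode_against_reference(
    reference: Sequence[int], diffs: Sequence[Tuple[int, int]]
) -> List[int]:
    """Rebuild the full genotype vector from the reference and the stored differences."""
    genotypes = list(reference)
    for i, g in diffs:
        genotypes[i] = g
    return genotypes


# Two SNPs in strong linkage disequilibrium differ in a single subject
reference_snp = [0, 1, 2, 0, 0, 1]
linked_snp = [0, 1, 2, 0, 1, 1]
stored = encode_against_reference(reference_snp, linked_snp)
restored = decode_against_reference(reference_snp, stored)
```

The stronger the linkage disequilibrium within a block, the shorter the difference list, so the saving grows with the correlation between SNPs.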

Availability and implementation: Our compression and decompression algorithms are implemented in a C++ library, are released under the GNU General Public License and are freely downloadable from http://www.dei.unipd.it/~sambofra/snpack.html.

Contact: sambofra@dei.unipd.it or cobelli@dei.unipd.it.

Author: Sambo, F., Di Camillo, B., Toffolo, G., Cobelli, C.
Posted: October 17, 2014, 1:05 pm

Summary: Sequencing oligosaccharides by exoglycosidases, either sequentially or in an array format, is a powerful tool to unambiguously determine the structure of complex N- and O-link glycans. Here, we introduce GlycoDigest, a tool that simulates exoglycosidase digestion, based on controlled rules acquired from expert knowledge and experimental evidence available in GlycoBase. The tool allows the targeted design of glycosidase enzyme mixtures by allowing researchers to model the action of exoglycosidases, thereby validating and improving the efficiency and accuracy of glycan analysis.

Availability and implementation: http://www.glycodigest.org.

Contact: matthew.campbell@mq.edu.au or frederique.lisacek@isb-sib.ch

Author: Gotz, L., Abrahams, J. L., Mariethoz, J., Rudd, P. M., Karlsson, N. G., Packer, N. H., Campbell, M. P., Lisacek, F.
Posted: October 17, 2014, 1:05 pm

Motivation: Boolean network models are suitable to simulate gene regulatory networks (GRNs) in the absence of detailed kinetic information. However, reducing the biological reality implies making assumptions on how genes interact (interaction rules) and how their state is updated during the simulation (update scheme). The exact choice of the assumptions largely determines the outcome of the simulations. In most cases, however, the biologically correct assumptions are unknown. An ideal simulation thus implies testing different rules and schemes to determine those that best capture an observed biological phenomenon. This is not trivial because most current methods to simulate Boolean network models of GRNs and to compute their attractors impose specific assumptions that cannot be easily altered, as they are built into the system.

Results: To allow for a more flexible simulation framework, we developed ASP-G. We show the correctness of ASP-G in simulating Boolean network models and obtaining attractors under different assumptions by successfully recapitulating the detection of attractors of previously published studies. We also provide an example of how performing simulation of network models under different settings helps determine the assumptions under which a certain conclusion holds. The main added value of ASP-G is in its modularity and declarativity, making it more flexible and less error-prone than traditional approaches. The declarative nature of ASP-G comes at the expense of being slower than the more dedicated systems but still achieves a good efficiency with respect to computational time.
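The roles of interaction rules, update scheme and attractors can be made concrete with a small imperative sketch. This is not ASP-G and does not use its declarative encoding: it assumes a synchronous update scheme, represents each interaction rule as a plain Python function of the current state, and finds attractors by exhaustive enumeration, which is only feasible for very small networks. All names are hypothetical.

```python
from itertools import product
from typing import Callable, FrozenSet, Sequence, Set, Tuple

State = Tuple[bool, ...]
Rule = Callable[[State], bool]


def synchronous_step(state: State, rules: Sequence[Rule]) -> State:
    """Update scheme: every gene is updated at once from the current state."""
    return tuple(rule(state) for rule in rules)


def find_attractors(rules: Sequence[Rule]) -> Set[FrozenSet[State]]:
    """Follow every start state until it repeats and collect the resulting cycles."""
    attractors: Set[FrozenSet[State]] = set()
    for start in product((False, True), repeat=len(rules)):
        seen = {}  # state -> step at which it was first visited
        state: State = start
        while state not in seen:
            seen[state] = len(seen)
            state = synchronous_step(state, rules)
        # Everything visited from the first repeated state onward is the attractor
        cycle = frozenset(s for s, step in seen.items() if step >= seen[state])
        attractors.add(cycle)
    return attractors


# Toy two-gene network: each gene copies the other gene's previous value
toy_rules = [lambda s: s[1], lambda s: s[0]]
toy_attractors = find_attractors(toy_rules)
```

Swapping `synchronous_step` for another update function is how a different update scheme would be tested in this sketch, and doing so can change the set of attractors found.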

Availability and implementation: The source code of ASP-G is available at http://bioinformatics.intec.ugent.be/kmarchal/Supplementary_Information_Musthofa_2014/asp-g.zip.

Contact: Kathleen.Marchal@UGent.be or Martine.DeCock@UGent.be

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Mushthofa, M., Torres, G., Van de Peer, Y., Marchal, K., De Cock, M.
Posted: October 17, 2014, 1:05 pm

Summary: Single nucleotide variations (SNVs) located within a reading frame can result in single amino acid polymorphisms (SAPs), leading to alteration of the corresponding amino acid sequence as well as function of a protein. Accurate detection of SAPs is an important issue in proteomic analysis at the experimental and bioinformatic level. Herein, we present sapFinder, an R software package, for detection of the variant peptides based on tandem mass spectrometry (MS/MS)-based proteomics data. This package automates the construction of variation-associated databases from public SNV repositories or sample-specific next-generation sequencing (NGS) data and the identification of SAPs through database searching, post-processing and generation of HTML-based report with visualized interface.

Availability and implementation: sapFinder is implemented as a Bioconductor package in R. The package and the vignette can be downloaded at http://bioconductor.org/packages/devel/bioc/html/sapFinder.html and are provided under a GPL-2 license.

Contact: siqiliu@genomics.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Wen, B., Xu, S., Sheynkman, G. M., Feng, Q., Lin, L., Wang, Q., Xu, X., Wang, J., Liu, S.
Posted: October 17, 2014, 1:05 pm

Motivation: Diseases and adverse drug reactions are frequently caused by disruptions in gene functionality. Gaining insight into the global system properties governing the relationships between genotype and phenotype is thus crucial to understand and interfere with perturbations in complex organisms such as disease states.

Results: We present a systematic analysis of phenotypic information of 5047 perturbations of single genes in mice, 4766 human diseases and 1666 drugs that examines the relationships between different gene properties and the phenotypic impact at the organ system level in mammalian organisms. We observe that while single gene perturbations and alterations of nonessential, tissue-specific genes or those with low betweenness centrality in protein–protein interaction networks often show organ-specific effects, multiple gene alterations resulting, e.g., from complex disorders and drug treatments have a more widespread impact. Interestingly, certain cellular localizations are distinctly associated with systemic effects in monogenic disease genes and mouse gene perturbations, such as the lumen of intracellular organelles and transcription factor complexes, respectively. In summary, we show that the breadth of the phenotypic effect is clearly related to certain gene properties and is an indicator of the severity of perturbations. This work contributes to the understanding of gene properties influencing the systemic effects of diseases and drugs.

Contact: monica.campillos@helmholtz-muenchen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author: Vogt, I., Prinz, J., Worf, K., Campillos, M.
Posted: October 17, 2014, 1:05 pm
Author: Zhang, L., Pei, Y.-F., Fu, X., Lin, Y., Wang, Y.-P., Deng, H.-W.
Posted: October 17, 2014, 1:05 pm