September 2019 Newsletter
Our September newsletter includes forty-four new titles, eight updated titles and an additional ten R packages in the latest available version of R (3.5.1). Nine new workshops are available on the Harvard Longwood campus.
We've got a couple of operating system alerts for those of you looking to buy new computers or update existing Mac or Linux machines:
• BioGrids is not compatible with MacOS 10.15 "Catalina", the next version of Apple's MacOS due out in October with significant changes. With Catalina, Apple will enforce strict filesystem controls that will break BioGrids installations. Apple has also dropped support for 32-bit binaries that many older applications require. We strongly recommend against upgrading to 10.15 on any Mac with BioGrids software. A solution is in the works, but it will take some time before we can fully implement the changes to officially support 10.15. We are also compiling a list of 32-bit software applications that will not be available after machines are updated to version 10.15.
• CentOS 8 support coming soon: CentOS 8 was released on 9/24 and we are testing BioGrids with plans to add support for this Linux distribution in the near future. We are not there yet, so hold off on those updates for now, but we expect the transition to be relatively smooth. With the addition of support for CentOS 8, we will move our build pipeline to CentOS 7 and begin phasing out support for new titles on CentOS 6.
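For anyone who wants to check a Mac before accepting an OS update, the following is a minimal sketch, not a BioGrids tool. The helper name is_catalina_or_later is hypothetical; sw_vers -productVersion is the standard macOS command that prints the installed OS version.

```shell
# Hypothetical helper (not part of BioGrids): succeeds if the version
# string passed as $1 is macOS 10.15 or later.
is_catalina_or_later() {
    ver_major=${1%%.*}            # text before the first dot
    ver_rest=${1#*.}              # text after the first dot
    if [ "$ver_rest" = "$1" ]; then
        ver_minor=0               # no dot present, e.g. "11"
    else
        ver_minor=${ver_rest%%.*}
    fi
    [ "$ver_major" -gt 10 ] || { [ "$ver_major" -eq 10 ] && [ "$ver_minor" -ge 15 ]; }
}

# Only query the OS version on a Mac; other platforms are skipped.
if [ "$(uname -s)" = "Darwin" ]; then
    if is_catalina_or_later "$(sw_vers -productVersion)"; then
        echo "This Mac runs 10.15 or later: BioGrids is not yet supported."
    else
        echo "This Mac runs a release earlier than 10.15."
    fi
fi
```

The comparison is done on the major and minor components only, so patch releases such as 10.14.6 or 10.15.1 are classified by their first two numbers.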
If your use of BioGrids-supplied software was an important element in your publication, please include the following statement in your work:
"Software used in the project was installed and configured by BioGrids
(cite: eLife 2013;2:e01456, Collaboration gets the most out of software.)"
See our Grant Support page for additional details.
BioGrids is available to all Harvard affiliates on a trial basis for the 2019 calendar year.
Register here to try out our software installer, which allows users to choose from over 200 bioinformatics tools that can be installed as ready-to-run applications on Mac or Linux machines with the click of a button or a short command from the CLI. No need to worry about dependencies or compilation.
BioGrids is supported by a team of scientists and engineers at HMS. We provide direct support to BioGrids members. This includes all aspects of software installation and management. If you need assistance of any kind please send a note to: help@biogrids.org.
Software Highlight: dDocent
dDocent is a simple bash wrapper to QC, assemble, map, and call SNPs from almost any kind of RAD sequencing. It is in widespread use with over 100 citations, has excellent documentation, and is available now in BioGrids.
From the dDocent website:
dDocent employs a series of data reduction techniques, alignment-based clustering (using CD-hit), and, for PE assembly, a specialized RAD assembly software (rainbow). This combination allows for accurate and efficient de novo assembly.
A comparison among pipelines
This comparison shows 1,000 simulated ddRAD loci assembled across a variety of parameters for each pipeline.
Overclustering leads to bias
Above is a figure depicting the relative bias of pairwise FST values generated by different RADseq bioinformatic pipelines. Circles are sized according to the magnitude of bias, (Observed - Expected)/Expected, and are colored according to the percentage of INDEL variation simulated in the data set: blue for 1%, red for 5%, and green for 10%. Simulations consisted of four populations in a stepping stone model with decreasing migration rates.
BioGrids Installer
The BioGrids Installer is an easy to use application that makes installing and managing life sciences software simple and quick.
A command line version is also available for Macs and Linux. Download using the link button above and register here for activation.
The BioGrids team provides support, infrastructure and testing for scientific software packages. We currently provide over 200 titles in five categories and an additional 1,500 R, python and perl packages and modules. The collection grows weekly. Learn more here: About BioGrids
BioGrids QuickStart
If you are new to BioGrids and would like to quickly get started with the command line version, follow the instructions below:
1: Download the BioGrids Installer command line version
Linux CLI
curl -kLO https://biogrids.org/wiki/downloads/biogrids-1.0.694-Linux.tgz
tar zxf biogrids-1.0.694-Linux.tgz
cd biogrids-1.0.694-Linux
OSX CLI
curl -kLO https://biogrids.org/wiki/downloads/biogrids-1.0.694-Darwin.tgz
tar zxf biogrids-1.0.694-Darwin.tgz
cd biogrids-1.0.694-Darwin
2: Activate biogrids
./biogrids activate SITE_NAME USER_NAME ACTIVATION_KEY
Replace the site name, user name and activation key with your own credentials.
3: Install software with BioGrids
./biogrids install fastqc trimmomatic samtools star subread igv
When finished, verify applications are installed:
./biogrids installed
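Step 1 differs between platforms only in the archive name, so the choice can be scripted. The following is a minimal sketch under stated assumptions: the helper name biogrids_archive is hypothetical, the version string is the one quoted in this newsletter and will change with later releases, and the script only prints the download command rather than running it.

```shell
# Hypothetical helper: map a kernel name (as printed by `uname -s`) to the
# installer archive name shown in step 1 of the QuickStart.
biogrids_archive() {
    bg_version="1.0.694"   # version quoted in this newsletter; an assumption for later releases
    case "$1" in
        Linux)  echo "biogrids-${bg_version}-Linux.tgz" ;;
        Darwin) echo "biogrids-${bg_version}-Darwin.tgz" ;;
        *)      echo "unsupported platform: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the download command for the current machine.
if bg_archive=$(biogrids_archive "$(uname -s)"); then
    echo "curl -kLO https://biogrids.org/wiki/downloads/${bg_archive}"
fi
```

Printing the command first makes it easy to confirm the archive name before downloading anything.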
Software Updates
Anvi’o is an open-source, community-driven analysis and visualization platform for ‘omics data. It brings together many aspects of today’s cutting-edge genomic, metagenomic, metatranscriptomic, pangenomic, and phylogenomic analysis practices to address a wide array of needs.
Version: 5.5
slingshot is an R package that provides functions for inferring continuous, branching lineage structures in low-dimensional data. Slingshot was designed to model developmental trajectories in single-cell RNA sequencing data and serve as a component in an analysis pipeline after dimensionality reduction and clustering. It is flexible enough to handle arbitrarily many branching events and allows for the incorporation of prior knowledge through supervised graph construction.
Version: 1.3.2
BLASR (Basic Local Alignment with Successive Refinement) maps Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error.
Version: 5.3.3
GCpp takes an alignment in the form of a BAM file and polishes the references with the provided subreads from the alignment. It uses the Arrow algorithm in multi-molecule consensus setting and can reach up to QV60 at coverage 100. GCpp is the machine-code successor of the venerable GenomicConsensus suite which has reached EOL.
Version: 1.0.0 | Linux only
The pbbam software package provides components to create, query, and edit PacBio BAM files and associated indices. These components include a core C++ library, bindings for additional languages, and command-line utilities.
Version: 1.0.6
KMC (K-mer Counter) is a utility designed for counting k-mers (sequences of consecutive k symbols) in a set of reads from genome sequencing projects.
Version: 3.0.0
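To make the idea of k-mer counting concrete, here is a naive illustration using awk. It is not how KMC works internally and is not suitable for real sequencing data; the helper name count_kmers is hypothetical, and the input is assumed to be one plain sequence per line rather than FASTQ.

```shell
# Hypothetical helper: count_kmers K
# Reads one sequence per line on stdin and prints "kmer count" pairs,
# sorted by k-mer. Naive in-memory counting for illustration only.
count_kmers() {
    awk -v k="$1" '
        {
            seq = toupper($0)
            # Slide a window of width k across the sequence.
            for (i = 1; i + k - 1 <= length(seq); i++)
                counts[substr(seq, i, k)]++
        }
        END {
            for (kmer in counts)
                print kmer, counts[kmer]
        }
    ' | sort
}

# Two toy reads, counted with k = 3.
printf 'ACGTAC\nCGTA\n' | count_kmers 3
```

For the two toy reads, the 3-mers CGT and GTA each occur twice, while ACG and TAC occur once.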
miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.
Version: 0.3_r179
preseq is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
Version: 2.0.3
VarScan is a platform-independent mutation caller for targeted, exome, and whole-genome resequencing data generated on Illumina, SOLiD, Life/PGM, Roche/454, and similar instruments.
Version: 2.4.4
SeqBuster integrates multiple analyses modules in a unique platform offering a deep characterization of miRNA variants (isomiRs).
Version: 3.5.0
Platypus is a tool designed for efficient and accurate variant detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data; it has been run on very large datasets as part of the Thousand Genomes and WGS500 projects and is being used in clinical sequencing trials in the Mainstreaming Cancer Genetics programme.
Version: 0.8.1.2
R is a free software environment for statistical computing and graphics.
Ten updated packages in version 3.5.1 include:
annotate copynumber DESeq2 genefilter geneplotter locfit
princurve RcppArmadillo slingshot squash
bcbio-variation-recall provides parallel merging, squaring off and ensemble calling for genomic variants. It offers a general framework meant to combine multiple variant calls, either from single individuals, batched family calls, or multiple approaches on the same sample. It splits inputs based on shared genomic regions without variants, allowing independent processing of smaller regions with variant calls.
Version: 0.2.6
bcbio-prioritize prioritizes small variants, structural variants and coverage based on biological inputs. The goal is to use pre-existing knowledge of relevant genes, domains and pathways involved with a disease to extract the most interesting signal from a set of high quality small or structural variant calls. Given information on coverage, it will be able to identify poorly covered regions in potential genes of interest.
Version: 0.0.8
Crass is designed to identify and reconstruct CRISPR loci from raw metagenomic data without the need for assembly or prior knowledge of CRISPR in the data set.
Version: 1.0.1
OpenJDK is a free and open source implementation of the Java Platform, Standard Edition (Java SE).
Version: jdk12.0.2
MinCED is a program to find Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in full genomes or environmental datasets such as assembled contigs from metagenomes.
Version: 0.4.2
dDocent is a simple bash wrapper to QC, assemble, map, and call SNPs from almost any kind of RAD sequencing. If you have a reference already, dDocent can be used to call SNPs from almost any type of NGS data set.
Version: 2.7.8
grabix leverages the fantastic BGZF library in samtools to provide random access into text files that have been compressed with bgzip. grabix creates its own index (.gbi) of the bgzipped file. Once indexed, one can extract arbitrary lines from the file with the grab command. Or choose random lines with the, well, random command.
Version: 0.1.8
CRISPRCasFinder enables the easy detection of CRISPRs and cas genes in user-submitted sequence data (sequences up to 50 MB are accepted; otherwise, download the standalone program). This is an update of the CRISPRFinder program with improved specificity and an indication of the CRISPR orientation. MacSyFinder is used to identify cas genes and the CRISPR-Cas type and subtype.
Version: 4.2.19
csvtk is a set of tools for manipulation of CSV/TSV files. It is convenient for rapid data investigation and integration into analysis pipelines.
Version: 0.18.2
bmtool is part of BMTagger aka Best Match Tagger, for removing human reads from metagenomics datasets.
Version: 3.101
Exonerate is a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, with either exhaustive dynamic programming or a variety of heuristics.
Version: 2.4.0
PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood.
Version: 4.9
GraphMap2 is an update containing tuning of alignments specific for long RNA reads.
Version: 0.6.1
GraphMap is a novel mapper targeted at aligning long, error-prone third-generation sequencing data. It is designed to handle Oxford Nanopore MinION 1d and 2d reads with very high sensitivity and accuracy, and also presents a significant improvement over the state-of-the-art for PacBio read mappers.
Version: 0.5.2
minialign is a fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms: the minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.
Version: 0.6.0
BEDOPS is an open-source command-line toolkit that performs highly efficient and scalable Boolean and other set operations, statistical calculations, archiving, conversion and other management of genomic data of arbitrary scale. Tasks can be easily split by chromosome for distributing whole-genome analyses across a computational cluster.
Version: 2.4.36
vt is a tool set for short variant discovery in genetic sequence data.
Version: 0.57721
SINA aligns nucleotide sequences to match a pre-existing MSA using a graph-based alignment algorithm similar to PoA. The graph approach allows SINA to incorporate information from many reference sequences without blurring highly variable regions. While pure NAST implementations depend highly on finding a good match in the reference database, SINA is able to align sequences relatively distant to references with good quality and will yield a robust result for query sequences with many close references.
Version: 1.6.0
SortMeRNA is a program for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for fast and sensitive analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data.
Version: 2.1b
goleft is a collection of bioinformatics tools written in Go and distributed together as a single binary under a liberal (MIT) license.
Version: 0.2.1
Starcode is a DNA sequence clustering software. Starcode clustering is based on all pairs search within a specified Levenshtein distance (allowing insertions and deletions), followed by a clustering algorithm: Message Passing, Spheres or Connected Components. Typically, a file containing a set of DNA sequences is passed as input, jointly with the desired clustering distance and algorithm. Starcode returns the canonical sequence of the cluster, the cluster size, the set of different sequences that compose the cluster and the input line numbers of the cluster components.
Version: 1.3
MethylDackel will process a coordinate-sorted and indexed BAM or CRAM file containing some form of BS-seq alignments and extract per-base methylation metrics from them. MethylDackel requires an indexed fasta file containing the reference genome as well.
Version: 0.4.0
The WiggleTools package allows genomewide data files to be manipulated as numerical functions, equipped with all the standard functional analysis operators (sum, product, product by a scalar, comparators), and derived statistics (mean, median, variance, stddev, t-test, Wilcoxon's rank sum test, etc).
Version: 1.2.3
xAtlas is a fast and retrainable small variant caller that has been developed at the Baylor College of Medicine Human Genome Sequencing Center.
Version: 0.1
Delly is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome.
Version: 0.8.1
Octopus is a mapping-based variant caller that implements several calling models within a unified haplotype-aware framework. Octopus takes inspiration from particle filtering by constructing a tree of haplotypes and dynamically pruning and extending the tree based on haplotype posterior probabilities in a sequential manner. This allows octopus to implicitly consider all possible haplotypes at a given locus in reasonable time.
Version: 0.5.2b
BAMscale is a one-step tool for either 1) quantifying and normalizing the coverage of peaks or 2) generating scaled BigWig files for easy visualization of commonly used DNA-seq capture based methods.
Version: 0.0.5
bedtools is a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic. Bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
Version: 2.29.0
BUStools is a program for manipulating BUS files for single cell RNA-Seq datasets. It can be used to error correct barcodes, collapse UMIs, produce gene count or transcript compatibility count matrices, and is useful for many other tasks.
Version: 0.39.3
AlignStats produces various alignment, whole genome coverage, and capture coverage metrics for sequence alignment files in SAM, BAM, and CRAM format.
Version: 0.8
DIAMOND is a high-throughput program for aligning a file of short DNA sequencing reads against a protein reference database such as NR, at 20,000 times the speed of BLASTX, with high sensitivity.
Version: 0.9.26
PHESANT, the PHEnome Scan ANalysis Tool, runs a phenome scan (pheWAS, Mendelian randomisation (MR)-pheWAS etc.) in UK Biobank.
Version: 0.18
ARAGORN identifies tRNA and tmRNA genes. The program employs heuristic algorithms to predict tRNA secondary structure, based on homology with recognized tRNA consensus sequences and ability to form a base‐paired cloverleaf.
Version: 1.2.38
SeqPrep is a program to merge paired end Illumina reads that are overlapping into a single longer read. It may also just be used for its adapter trimming feature without doing any paired end overlap.
Version: 1.3.2
Bowtie is an ultrafast, memory-efficient short read aligner for short DNA sequences (reads) from next-gen sequencers.
Version: 1.2.1.1
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.
Version: 2.0.2
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
Version: 2.3.5.1
Bowtie is an ultrafast, memory-efficient short read aligner for short DNA sequences (reads) from next-gen sequencers.
Version: 1.2.3
CRISPRCasFinder enables the easy detection of CRISPRs and cas genes in user-submitted sequence data (sequences up to 50 MB are accepted; otherwise, download the standalone program). This is an update of the CRISPRFinder program with improved specificity and an indication of the CRISPR orientation. MacSyFinder is used to identify cas genes and the CRISPR-Cas type and subtype.
Version: 2.0.2
SMALT aligns DNA sequencing reads with a reference genome.
Reads from a wide range of sequencing platforms can be processed, for example Illumina, Roche-454, Ion Torrent, PacBio or ABI-Sanger. Paired reads are supported. There is no support for SOLiD reads.
Version: 0.7.6
STAR (Spliced Transcripts Alignment to a Reference) is an ultrafast universal RNA-seq aligner.
Version: 2.7.2b
Vmatch is a versatile software tool for efficiently solving large scale sequence matching tasks. Vmatch subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements.
Version: 2.3.0
MacSyFinder is a program to model and detect macromolecular systems, genetic pathways... in protein datasets. In prokaryotes, these systems have often evolutionarily conserved properties: they are made of conserved components, and are encoded in compact loci (conserved genetic architecture). The user models these systems with MacSyFinder to reflect these conserved features, and to allow their efficient detection.
Version: 1.0.5
Tcl/Tk (Tool Command Language) is a very powerful but easy to learn dynamic programming language, suitable for a very wide range of uses, including web and desktop applications, networking, administration, testing and many more. Tk is a graphical user interface toolkit that takes developing desktop applications to a higher level than conventional approaches.
Version: 8.5.18
CNVkit is a command-line toolkit and Python library for detecting copy number variants and alterations genome-wide from high-throughput sequencing.
Version: 0.9.6
Software Training
Training sessions available to HMS trainees:
HMS Research Computing
Intro to R 10/02/2019 3-5p Countway 403
Intro to O2 10/09/2019 3-5p Countway 403
Intermediate O2 10/16/2019 3-5p Countway 403
Intro to Git/Github 10/30/2019 3-5p TMEC 128
The Harvard Chan Bioinformatics Core
Introduction to R
October 3rd & 4th 1.5 day None
Introduction to differential gene expression analysis - bulk RNA-seq (counts -> DE genes)
October 21st & 22nd 2 days
Setting up for success: Everything you need to know when planning for an (bulk) RNA-seq analysis Part II
October 25th 1 PM HSPH Kresge G1
Introduction to R Basic
November 19th & 20th 1.5 day
Introduction to single-cell RNA-seq
December 2nd & 3rd 2 days
Bioinformatics Support
Need help getting software installed on new machines? Have you been planning to try Amazon Web Services (AWS) cloud computing?
BioGrids can help you get started. We have expertise in bioinformatics, programming, workflow development and high performance computing.
We improve the collection with feedback from the community.
Want to see a new application in BioGrids?
Let us know: help@biogrids.org
BioGrids is supported by Harvard Medical School and Boston Children's Hospital and relies on a framework that was developed by SBGrid.