October 2019 Newsletter
Our October newsletter includes fifteen new titles, three updated titles and an additional fourteen R packages in the latest available version of R (3.5.1). Thirteen new workshops are available on the Harvard Longwood campus.
BioGrids is not compatible with MacOS 10.15
BioGrids is not compatible with MacOS 10.15 Catalina. Apple enforces strict filesystem controls that will break existing BioGrids installations. We are working on a solution. We recommend not upgrading to 10.15 on any Mac with BioGrids already installed.
Apple has also dropped support for 32-bit binaries. We are compiling a list of 32-bit software applications that will not be available in MacOS 10.15.
If BioGrids-supplied software was an important element in your publication, please include the following statement in your work:
"Software used in the project was installed and configured by BioGrids
(cite: eLife 2013;2:e01456, Collaboration gets the most out of software.)"
See our Grant Support page for additional details.
BioGrids is available to all Harvard affiliates on a trial basis for the 2019 calendar year.
Register here to try out our software installer, which allows users to choose from over 200 bioinformatics tools that can be installed as ready-to-run applications on Mac or Linux machines with the click of a button or a short command from the CLI. No need to worry about dependencies or compilation.
BioGrids is supported by a team of scientists and engineers at HMS. We provide direct support to BioGrids members. This includes all aspects of software installation and management. If you need assistance of any kind please send a note to: help@biogrids.org.
Software Highlight: BEDOPS
BEDOPS is a suite of tools to address common questions raised in genomic studies — mostly with regard to overlap and proximity relationships between data sets. It aims to be scalable and flexible, facilitating the efficient and accurate analysis and management of large-scale genomic data.
From the BEDOPS website:
BEDOPS is an open-source command-line toolkit that performs highly efficient and scalable Boolean and other set operations, statistical calculations, archiving, conversion and other management of genomic data of arbitrary scale.
The suite includes tools for set and statistical operations (bedops, bedmap and closest-features) and compression of large inputs into a novel lossless format (starch) that can provide greater space savings and faster data extractions than current alternatives. BEDOPS offers native support for this deep compression format, in addition to BED.
BEDOPS also offers logarithmic time search to per-chromosome regions in sorted BED data (in bedextract and core BEDOPS tools). This feature makes whole-genome analyses “embarrassingly parallel”, in that per-chromosome computations can be distributed onto separate work nodes, with results collated at the end in map-reduce fashion.
Sorting arbitrarily large BED files is easy with sort-bed, which easily scales beyond available system memory, as needed. We also offer portable conversion scripts that transform data in common genomic formats (SAM/BAM, GFF/GTF, PSL, WIG, and VCF) to sorted BED data that are ready to use with core BEDOPS utilities.
All of these tools are made to be glued together with common UNIX input and output streams. This helps make your pipeline design and maintenance easy, fast and flexible.
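To make "sorted BED" concrete, here is a minimal sketch that approximates the expected ordering with POSIX sort instead of sort-bed. It assumes the ordering is chromosome name (bytewise), then start, then end coordinate (numeric). The helper name sort_bed_like and the coordinates are made up for illustration, and unlike sort-bed this sketch makes no attempt to validate the input.

```shell
#!/bin/sh
# Sketch only: approximates sort-bed's ordering with POSIX sort.
# sort_bed_like is a hypothetical helper, not part of BEDOPS.
sort_bed_like() {
    # Field 1 (chromosome) bytewise, then fields 2 and 3 (start, end) numerically.
    # With no file arguments, sort reads from standard input.
    LC_ALL=C sort -k1,1 -k2,2n -k3,3n "$@"
}

# Made-up, unsorted three-column BED records piped through the helper,
# in the same stream-oriented style the BEDOPS tools are designed for.
printf 'chr2\t50\t80\nchr1\t400\t475\nchr1\t100\t200\nchr1\t100\t150\n' |
    sort_bed_like
```

The output of such a step can be piped straight into the next tool in a pipeline, which is the "glue" style described above.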
Operation Example:
Element-of (-e, --element-of)
The --element-of operation shows the elements of the first (“reference”) file that overlap elements in the second and subsequent “query” files by the specified length (in bases) or by percentage of length.
For example, we can search for elements in the reference set A that overlap elements in query set B by at least one base.
Elements that are returned are always from the reference set (in this case, set A).
The argument to --element-of is a value that specifies the degree of overlap for elements. The value is either integral for per-base overlap, or fractional for overlap measured by length.
Here is a demonstration of the use of --element-of 1 on two sorted sets First.bed and Second.bed, which looks for elements in the First set that overlap elements in the Second set by one or more bases:
$ more First.bed
chr1 100 200
chr1 150 160
chr1 200 300
chr1 400 475
chr1 500 550
$ more Second.bed
chr1 120 125
chr1 150 155
chr1 150 160
chr1 460 470
chr1 490 500
$ bedops --element-of 1 First.bed Second.bed > Result.bed
$ more Result.bed
chr1 100 200
chr1 150 160
chr1 400 475
Learn more from the excellent BEDOPS documentation.
BioGrids Installer
The BioGrids Installer is an easy-to-use application that makes installing and managing life sciences software simple and quick.
A command line version is also available for Macs and Linux. Download using the link button above and register here for activation.
The BioGrids team provides support, infrastructure and testing for scientific software packages. We currently provide over 200 titles in five categories and an additional 1,500 R, Python and Perl packages and modules. The collection grows weekly. Learn more here: About BioGrids
BioGrids QuickStart
If you are new to BioGrids and would like to quickly get started with the command line version, follow the instructions below:
1: Download the BioGrids Installer command line version
Linux CLI
curl -kLO https://biogrids.org/wiki/downloads/biogrids-1.0.694-Linux.tgz
tar zxf biogrids-1.0.694-Linux.tgz
cd biogrids-1.0.694-Linux
OSX CLI
curl -kLO https://biogrids.org/wiki/downloads/biogrids-1.0.694-Darwin.tgz
tar zxf biogrids-1.0.694-Darwin.tgz
cd biogrids-1.0.694-Darwin
2: Activate biogrids
./biogrids activate biogrid-production jvinent1 70rYFTDnmCr93VUklfbf1s3M4jdyC9bFVYHew==
Replace the site name, user name and activation key with your own credentials.
3: Install software with BioGrids
./biogrids install fastqc trimmomatic samtools star subread igv
When finished, verify applications are installed:
./biogrids installed
Software Updates
bam2fastx provides conversion of PacBio BAM files into gzipped fasta and fastq files, including splitting of barcoded data.
Version: 1.3.0
bcbio-variation is a toolkit to analyze genome variation data, built on top of the Genome Analysis Toolkit (GATK) with Clojure. It enables validation of variants and exploration of algorithm differences between calling methods by automating the process involved with comparing two sets of variants. For users, this integrates with the bcbio-nextgen framework to automate variant calling and validation.
Version: 0.2.6
Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing. Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu operates in three phases: correction, trimming and assembly. The correction phase will improve the accuracy of bases in reads.
Version: 1.8
Circlator is a tool that circularizes genome assemblies. The input is a genome assembly in FASTA format and corrected PacBio or nanopore reads in FASTA or FASTQ format. Circlator will attempt to identify each circular sequence and output a linearised version of it. It does this by assembling all reads that map to contig ends and comparing the resulting contigs with the input assembly.
Version: 1.5.5
Fastool is a simple and quick tool to read huge FastQ and FastA files (both normal and gzipped) and manipulate them.
Version: 0.1.4
GROOT is a tool to type Antibiotic Resistance Genes (ARGs) in metagenomic samples (a.k.a. Resistome Profiling). It combines variation graph representation of gene sets with an LSH indexing scheme to allow for fast classification of metagenomic reads. Subsequent hierarchical local alignment of classified reads against graph traversals facilitates accurate reconstruction of full-length gene sequences using a simple scoring scheme.
Version: 0.8.5
MapCaller is an efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes.
Version: 0.9.9.17
Oncofuse is a framework designed to estimate the oncogenic potential of de novo discovered gene fusions. It uses several hallmark features and employs a Bayesian classifier to provide the probability of a given gene fusion being a driver mutation.
Version: 1.1.1
Parafly takes a file containing a list of Unix commands and uses multithreading to process the commands in parallel on a single server. Success or failure of each command is captured, and failed commands are retained and reported.
Version: r2013_01_21
pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specific filtering criteria, and converts the output to either the SAM format or PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if --forQuiver option is specified.
Version: 0.4.1
pbmm2 is a SMRT C++ wrapper for minimap2's C API. Its purpose is to support native PacBio in- and output, provide sets of recommended parameters, generate sorted output on-the-fly, and postprocess alignments. Sorted output can be used directly for polishing using GenomicConsensus, if BAM has been used as input to pbmm2.
Version: 1.1.0
Pilon is a software tool which can be used to automatically improve draft assemblies and find variation among strains, including large event detection.
Version: 1.23
PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) and variational inference (VI) algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.
Version: 3.7
RStudio is an integrated development environment (IDE) for R that includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
Version: 1.2.5001
rust-bio-tools is a set of ultra fast and robust command line utilities for bioinformatics tasks based on Rust-Bio.
Version: 0.6.0
slclust is a utility that performs single-linkage clustering with the option of applying a Jaccard similarity coefficient to break weakly bound clusters into distinct clusters.
Version: 02022010
SPAdes (St. Petersburg genome assembler) is a genome assembly algorithm designed for single-cell and multi-cell bacterial data sets.
Version: 3.13.1
TensorFlow is an open source software library for high performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.
Version: 2.0.0
Software Training
Training sessions available to HMS trainees:
HMS Research Computing
Intro to Parallel Computing 11/06/2019 2-4p Countway 506 Minot Room
Intro to Python 11/13/2019 3:15-5p Countway 403
R/Biostatistics
R/Biostatistics is a multi-part course covering the basics of RNA-seq analysis with R. This biostatistics course covers standard supervised approaches and functional enrichment analyses of a breast cancer RNA-seq dataset. Topics include edgeR for differential analysis, GOSeq for functional enrichment analyses, and KEGG pathway analysis. Deep learning applications will be discussed. High-throughput data visualization techniques are emphasized. Each class includes a lecture and R practicum, and registration is for all courses. Laptops are encouraged.
R/Biostatistics Part I 10/25/2019 1-3p Countway 403 32
R/Biostatistics Part II 11/01/2019 1-3p Countway 403 32
R/Biostatistics Part III 11/08/2019 1-3p Countway 403 32
The Harvard Chan Bioinformatics Core
Setting up for Success: Everything you need to know to make your data analysis reproducible
Nov. 15th 1 PM HSPH Kresge G2
Introduction to R
Nov. 19th & 20th 1.5 day
Introduction to single-cell RNA-seq
Dec. 2nd & 3rd 2 days
Setting up for Success: Introduction to Version Control (Git)
Dec. 13th 1 PM HSPH Kresge G2
Countway Library of Medicine
EndNote Essentials
Learn how to save time by using EndNote bibliographic software. Topics include working with EndNote libraries, importing references from databases (e.g. PubMed), and writing with EndNote.
Nov. 5 1:00pm - 1:50pm Kresge 200
Zotero: An Alternative to EndNote
Zotero is a free bibliographic management tool, which provides users with a powerful alternative to EndNote or Refworks.
Nov. 7 1:00pm - 1:50pm Kresge 200
Practical Presentation Skills
This informal group is for students & postdocs who would like to practice their oral presentation & conversational skills.
Nov. 3 5:30pm - 6:45pm Countway L2: Room 025
Publishing Your Research: Considerations and Options
This session will provide guidance in selecting the best scholarly journal to reach your intended audience, and ensure your published research generates high impact.
Nov. 14 1:00pm - 1:50pm Kresge 200
Best Practices for Keeping a Lab Notebook (Fall 2019 Research Data Management Seminar Series)
Whether you’re working at the bench, in the field, or in theory, you need to keep track of what you do so that you—and others—can refer back to it whenever necessary.
Nov. 22 1:00pm - 2:00pm Countway Classroom 403
Practical Presentation Skills
This informal group is for students & postdocs who would like to practice their oral presentation & conversational skills.
Nov. 27 5:30pm - 6:45pm Countway L2: Room 025
Bioinformatics Support
Need help getting software installed on new machines? Have you been planning to try Amazon Web Services (AWS) cloud computing?
BioGrids can help you get started. We have expertise in bioinformatics, programming, workflow development and high performance computing.
We improve the collection with feedback from the community.
Want to see a new application in BioGrids?
Let us know: help@biogrids.org
BioGrids is supported by Harvard Medical School and Boston Children's Hospital and relies on a framework that was developed by SBGrid.