Bioinformatics

Bioinformatics Support at CCR-SF:

The CCR-SF uses high-throughput sequencing technologies to enrich cancer research and ensure that the NCI community can remain at the leading edge of NGS. Since its inception in 2009, the CCR-SF Bioinformatics group (CCR-SF IFX) has provided a broad range of bioinformatics support services to CCR investigators and their collaborators. Our team has diverse expertise in bioinformatics pipeline development and NGS data analysis support. Our mission is to provide the highest quality of sequencing data to our customers. We work closely with investigators to help get their NGS projects off the ground.

Requests for Bioinformatics support should be discussed during the sequencing project consultation. Please contact Yongmei Zhao (yongmei.zhao@nih.gov, 301-360-3455) or CCRSF_IFX@nih.gov for any assistance.

Main Area of Support:

Project Consultation – we offer experimental design consultation, including sequencing technology recommendations, library protocol consultation, and sequencing coverage and cost estimates. Proper consultation ensures our customers receive the best sequencing strategy for their project in the most cost-effective manner.
Data Analysis & Interpretation – we perform QC, secondary and tertiary data analysis based on application types for sequencing data generated from Illumina, PacBio, Oxford Nanopore, and Bionano platforms, as well as single cell technologies.
Bioinformatics Research and Development – we develop robust and reproducible analysis workflows and pipelines based on application types and sequencing technologies. We share our software pipelines through open-source platforms such as GitHub (https://github.com/CCRSF-IFX).
Collaboration – we work closely with CCR-SF laboratory scientists to support the adaptation of new sequencing protocols and the development of new technologies. We provide customized data analysis through the development or adoption of new bioinformatics workflows. We actively engage with scientific communities to develop best practices in NGS quality control and to benchmark bioinformatics methods, NGS protocols, and sequencing technologies.
Data Management – we perform data management of all NGS data generated at CCR-SF and provide investigators easy access to their sequencing project data and analysis results using the High-Performance Computing Data Management Environment (HPC DME) system. We help investigators to submit their sequencing project data to public databases.
Bioinformatics Training – we provide training to customers on NGS technology and data analysis. We offer advice regarding tools and methods suitable for further analysis and result interpretation.

Analysis Services:

Transcriptome Analysis – support both short-read sequencing using Illumina RNA-seq and long-read sequencing using the PacBio and Oxford Nanopore platforms for discovery of full-length transcripts and novel splice variants.
Whole Genome Sequencing Analysis – utilize both short- and long-read technologies such as Illumina, PacBio, Oxford Nanopore, or Bionano Optical Genome Mapping technology to detect variations or rearrangements in the structure of chromosomes as well as copy number variations.
Whole Genome Methylation Analysis – support the analysis of 5mC and 5hmC using protocols for Illumina short-read sequencing, DNA base modification detection via PacBio Single Molecule Real Time (SMRT) sequencing (including 5mC and both endogenous and exogenous m6A-marked bases), along with direct DNA and RNA base modifications (4mC, 5mC, 5hmC, 6mA in DNA, and m6A in RNA) using Oxford Nanopore sequencing.
Targeted Sequencing Analysis – utilize Oxford Nanopore adaptive sampling to enrich or deplete any regions of interest (ROI) in the genome and obtain high-quality, efficient coverage of the ROI; utilize PacBio targeted RNA Iso-seq with hybridization capture to enrich full-length transcripts of interest; utilize Xdrop Sort for whole genome structural variation detection or viral integration site detection.
Single cell Analysis – support both whole transcriptome and droplet-based single cell technologies such as 10X Genomics, Fluent Biosciences PIPseq, and Scale Bio single-cell RNA sequencing; single-cell multi-omics (targeted DNA, DNA+Protein) analysis using the Mission Bio Tapestri platform; and spatial transcriptomics technologies such as 10X Genomics Visium and Curio Bioscience Curio Seeker. Additionally, we provide analysis support for PacBio MAS-Seq or Oxford Nanopore sequencing of full-length transcripts from 10X Genomics single cell captures.
16S amplicon analysis – utilize PacBio SMRT long-read technology to sequence the entire 16S rRNA gene in one read.

General Questions for Bioinformatics

What analyses does the CCR-SF Bioinformatics group perform?
Sequencing depth and experimental design questions?
What is required to assure timely processing and delivery of my data?
What types of analysis workflows does the CCR-SF use to perform analyses?
What types of data formats will I receive from CCR-SF?
How do I analyze the data?
How large are the data delivery files?
How are the data files delivered?
How long is the data made available to download?
What is the yield per run for different sequencing platforms?

Answers for General Questions

What analyses does the CCR-SF Bioinformatics group perform?

Currently we offer primary and secondary analyses for all NGS projects, including initial base-calling, demultiplexing, data quality control, and reference genome alignment of NGS reads. We also offer tertiary analyses on a limited basis for collaboration projects, which may include de novo assembly, whole genome small variant and large structural variant analysis, whole genome methylation analysis, long-read full-length transcriptomic splice variant detection, and single cell analysis. For all projects, we ensure that every sequencing run we deliver meets our high standards for yield, base-call quality, and base alignment percentage, as well as the application-specific metrics that we have established.

Sequencing depth and experimental design questions?

Coverage requirements vary by application, library protocol, sequencing platform, and project-specific considerations. To provide the best approach for your project, a meeting is set up between you and representatives from our sequencing facility in order to make recommendations on sequencing platform, library protocol, and other needs.
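As a rough illustration of how coverage relates to read count, a commonly used estimate is mean coverage = (number of reads x read length) / genome size. The Python sketch below applies that formula. The function names and any numbers passed to them are illustrative assumptions, not CCR-SF recommendations; actual requirements should come out of the project consultation.

```python
import math


def estimated_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads needed to reach a target mean coverage, rounded up."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)
```

For example, reads_needed(30, 150, 3_100_000_000) estimates the number of 150 bp reads for 30x mean coverage of a genome of about 3.1 Gb. The estimate ignores duplicates, unmapped reads, and uneven coverage, so it is best treated as a lower bound.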

For assistance in planning your experiment or to discuss specifics of your project please contact Bao Tran (tranb2@mail.nih.gov) or ccrsfhelp@mail.nih.gov.

For bioinformatics consultation please contact Yongmei Zhao (yongmei.zhao@nih.gov) or our group email CCRSF_IFX@nih.gov.

What is required to assure timely processing and delivery of my data?

We recommend an initial consultation with the CCR-SF Bioinformatics group to discuss data analysis requirements and to establish expectations. It is also important to specify the reference genome version and annotation build for each project. In addition to model organisms, we also perform data analysis for non-model organisms. For non-model organism sequencing projects, you will need to provide the reference sequences (FASTA file format or weblink).
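Before sending a custom reference, it can help to confirm that the file parses as FASTA and to note the sequence names and lengths. The following is a minimal sketch using only the Python standard library; the function name summarize_fasta and the checks it performs are assumptions made for illustration, not a CCR-SF submission requirement.

```python
import io
from typing import Dict, TextIO


def summarize_fasta(handle: TextIO) -> Dict[str, int]:
    """Return {sequence_name: length}; raise ValueError on malformed input."""
    lengths: Dict[str, int] = {}
    name = None
    for line_no, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {line_no}: empty header")
            name = fields[0]  # name is the text up to the first whitespace
            if name in lengths:
                raise ValueError(f"line {line_no}: duplicate name {name!r}")
            lengths[name] = 0
        else:
            if name is None:
                raise ValueError(f"line {line_no}: sequence before first header")
            lengths[name] += len(line)
    if not lengths:
        raise ValueError("no sequences found")
    return lengths


if __name__ == "__main__":
    demo = io.StringIO(">contig1 example\nACGTAC\nGT\n>contig2\nGGCC\n")
    for seq_name, seq_len in summarize_fasta(demo).items():
        print(seq_name, seq_len)
```

To check a file on disk, pass an open text handle instead of the in-memory demo, for example open("reference.fa") (a hypothetical file name).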

If you have any questions regarding your preferred data processing options, please contact Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov.

What types of analysis workflows does the CCR-SF use to perform analyses?

We currently provide analyses based on sequencing application type. We have designed and implemented in-house data analysis pipelines that integrate platform- and vendor-specific data analysis tools with open-source software tools. We have released our pipelines to the public on GitHub at the following link:

https://github.com/CCRSF-IFX

Please contact us via email at CCRSF_IFX@nih.gov if you have any questions.

Currently available data analysis pipelines:

Illumina Sequencing
- ATAC-seq
- ChIP-seq
- Exome-seq
- Whole Genome Sequencing for SNVs, CNVs and SVs
- RNA-Seq
- miRNA-Seq
- Whole genome methyl-seq
PacBio Long-read Sequencing
- Iso-seq and single cell MAS Iso-seq
- De novo assembly
- Whole genome sequencing for mutations (SNVs) and structural variant (SV) analysis
- DNA base modification
- 16S amplicon
Oxford Nanopore Sequencing
- Direct RNA or full-length transcript sequencing
- Adaptive sampling for regions of interest analysis or virus integration sites detection
- Single cell full-length transcriptomic sequencing analysis along with SNV analysis
- Whole genome sequencing for mutations (SNVs), structural variant (SV), and Copy Number Variation (CNV) analysis
Bionano Optical Genome Mapping
- Whole genome structural variant (SV) analysis
- Copy Number Variation (CNV) profiling
- De novo assembly
Single Cell Analysis
- Single Cell RNA-seq and CITE-seq
- Single Cell Immune Profiling
- Single Cell Multiome

What types of data formats will I receive from CCR-SF?

For projects using the Illumina sequencing platform, you will receive a PDF report containing a summary of the sequencing project (i.e., library and sequencing protocols, sequencing result summary, application-based QC metrics, and software details) and an Excel file containing the detailed data analysis results. Depending on the application, you will also receive an HTML QC report file that contains detailed QC statistics and plots for the analysis workflows included for that specific application. In addition, you will receive the pass-filtered raw sequence reads in FASTQ format and the reference alignment data in BAM format. BAM files contain base-call and quality score information for all pass-filtered reads, as well as alignment information for reads that have mapped to the reference genome. Additional application-specific data files are listed in the deliverable data file types below.
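For readers new to the FASTQ format, each read occupies four lines: a header beginning with "@", the sequence, a separator line beginning with "+", and a quality string with one character per base. The sketch below reads such records and computes a mean Phred score, assuming the standard Phred+33 encoding used by current Illumina output. The function names are hypothetical, and the code is only an illustration, not part of the CCR-SF pipelines.

```python
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding by default)."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality) / len(quality)
```

Delivered FASTQ files are often gzip-compressed; in that case a text handle can be obtained with gzip.open(path, "rt") from the Python standard library before calling read_fastq.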

For projects using the PacBio sequencing platform, the data delivery choice is driven by the specific needs of the project. For example, when circular consensus processing is performed, the raw subreads BAM file, run definition XML files, and the consensus reads (CCS) are included in the data delivery package. If alignment and variant calling are performed, the resulting data are provided within BAM and VCF files. Files containing the intermediate results of pipeline processing (such as the read-to-cluster mapping for IsoSeq) are sometimes included as well. Beyond that, we are happy to deliver any of the files produced by our processing upon request. The content of the data delivery package should be discussed at project definition time.

For standard projects, the deliverable data file types are:

Sequencing FASTQ/FASTA files
Alignment BAM files or assembly files
Data QC statistics reports
Mapping or variant calling statistics

For projects with secondary and application specific analysis, the deliverable data file types are:

Exome-seq or WGS Structural Variants Discovery:
- Raw FASTQ files
- Alignment BAM files
- SNP/Indel and structural variant call VCF files
- Structural variant call BED file
- Variant annotation files
- QC and variant analysis statistics reports
RNA-Seq:
- Raw FASTQ files
- STAR-2pass alignment BAM files
- RSEM gene and transcript quantification count matrix files
- QC and RNA analysis statistics reports
Nanopore Iso-seq:
- Raw data: raw POD5 and FASTQ
- QC and Statistics reports: MultiQC report, SQANTI3 report and Kraken contamination check
- Analysis data: high quality clustered isoforms, full length cDNAs, SQANTI3 results including BAM, GTF, classification table as well as html reports
Nanopore De novo Assembly:
- Raw data: raw POD5 and FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: mapped BAM file, CNV, SNV and SV VCF file
Nanopore Adaptive Sampling:
- Raw data: raw POD5 and FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: mapped BAM files, SNV and SV VCF file
Nanopore Single Cell Sequencing:
- Raw data: raw POD5 and FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: mapped BAM files, transcriptome matrix, gene matrix, tagged BAM file, SQANTI3 filtered results including tagged BAM, GTF, classification table as well as html reports
- The count matrix files to run tertiary analysis with Seurat
- Seurat clustering, SingleR annotations in html report
Nanopore RNA/DNA base modification:
- Raw data: raw POD5 and FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: mapped BAM file along with a tagged base modification and BED file
PacBio Iso-seq/MAS-SC Iso-seq:
- Raw data: CCS/HiFi reads BAM or FASTQ
- QC and Statistics reports: MultiQC report, SQANTI3 report and Kraken contamination check
- Analysis data: high quality clustered isoforms, full length cDNAs, SQANTI3 results including BAM, GTF, classification table as well as html reports
- The count matrix files to run tertiary analysis with Seurat
- Seurat clustering, SingleR annotations in html report
PacBio WGS Sequencing:
- Raw data: CCS/HiFi reads BAM or FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: mapped BAM file, variant call VCF file
PacBio De novo Assembly:
- Raw data: CCS reads BAM or FASTQ
- QC and Statistics reports: MultiQC report, assembly report and Kraken contamination check
- Analysis data: polished contigs
PacBio Long Amplicon Sequencing:
- Raw data: CCS reads BAM or FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: Clustered long amplicon consensus, Phasing and variant analysis file
PacBio 16S Amplicon Sequencing:
- Raw data: CCS reads BAM or FASTQ
- QC and Statistics reports: MultiQC report, Kraken contamination check
- Analysis data: an amplicon sequence variant (ASV) table, which records the number of times each exact amplicon sequence variant was observed in each sample
Single Cell RNA:
- Cell Ranger output
- Seurat clustering, SingleR annotations in html report
Single Cell ATAC:
- Cell Ranger output
- Signac clustering
Single Cell Multiome:
- Cell Ranger output
Single Cell Immune Profiling:
- Cell Ranger output
Single Cell Fixed RNA Profiling:
- Cell Ranger output
Single Cell CNV:
- Cell Ranger DNA output
Single Cell PIPseq:
- PIPseeker output
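To make the amplicon sequence variant (ASV) table listed under PacBio 16S Amplicon Sequencing more concrete, the sketch below counts how many times each exact sequence occurs in each sample. It is a simplified illustration only: real ASV pipelines first denoise the reads to remove sequencing errors, which is not shown here, and the function name and input layout are assumptions rather than the format CCR-SF delivers.

```python
from collections import Counter
from typing import Dict, Iterable


def build_asv_table(reads_by_sample: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, int]]:
    """Return {sequence: {sample: count}}, counting exact sequence matches per sample."""
    samples = list(reads_by_sample)
    counts = {sample: Counter(seqs) for sample, seqs in reads_by_sample.items()}
    # Collect every distinct sequence seen in any sample.
    all_sequences = sorted(set().union(*counts.values()))
    # Counter returns 0 for sequences absent from a sample.
    return {
        seq: {sample: counts[sample][seq] for sample in samples}
        for seq in all_sequences
    }


if __name__ == "__main__":
    demo = {
        "sample_A": ["ACGTACGT", "ACGTACGT", "TTGGCCAA"],
        "sample_B": ["TTGGCCAA"],
    }
    for sequence, per_sample in build_asv_table(demo).items():
        print(sequence, per_sample)
```

Each row of the resulting table corresponds to one exact sequence and each column to one sample, which matches the description of the ASV table above.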

How do I analyze the data?

The SF typically provides primary and secondary analysis for all applications, which includes delivery of the FASTQ pass-filtered raw read files and alignment BAM files. For tertiary data analysis for RNA-seq or whole genome sequencing, we offer gene quantification count files or variant analysis (both SNVs and SVs) VCF files to the customer. Investigators are expected to provide for any downstream analyses not offered by the SF bioinformatics group. For investigators interested in performing their own bioinformatics in-house, there are several commercial software options from Illumina, PacBio, and third-party vendors. In addition, many open-source NGS software tools are freely available from Biowulf and other online computing sources.

For investigators in need of assistance with downstream NGS data analyses, the CCR Collaborative Bioinformatics Resource (CCBR) provides expert bioinformatics data analysis for the Center for Cancer Research at the NCI free of charge. To contact the CCBR, please submit a request through the CCBR Project Submission Form at https://ccbr.ccr.cancer.gov/project-support/.

How large are the data delivery files?

Because NGS is still a rapidly evolving field, this answer changes regularly. Please contact the bioinformatics group (CCRSF_IFX@nih.gov) for current data delivery file size information.

How are the data files delivered?

Please contact Yongmei Zhao (yongmei.zhao@nih.gov) to discuss your options. The original sequence, alignment, and analysis files are available to download through the NCI DME system. To access your project data on the DME system, please email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov to get your NIH account linked to the DME system. You will need to register an account for each lab member planning to log in. Please follow the DME tutorial (https://wiki.nci.nih.gov/display/DMEdoc) to access and download your project data.

If you or your collaborator does not have an NIH account, we recommend registering an account at GlobusFTP (https://www.globus.org/) in order to transfer data via the GlobusFTP site. Please see the following tutorial on registering an account and transferring data:
https://helix.nih.gov/Documentation/globus.html

If you have any issues setting up a Globus account or transferring data via the shared endpoint, please contact us via email CCRSF_IFX@nih.gov.

How long is the data made available to download?

How long data files remain on the NCI DME system depends on the data life cycle defined by the data policy; currently, data are available online for 5 years after the initial project data generation. For data files uploaded to Globus, we make data available for up to 2 weeks starting from the date of our data delivery email announcement. It is the responsibility of the investigator, laboratory contact, or bioinformatics contact to ensure that they retrieve their data promptly. To maintain sufficient data storage for upcoming projects, the analysis files are then archived and stored for an additional four weeks for Globus data transfer.
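The retention windows above can be turned into concrete dates with simple date arithmetic. The sketch below is one reading of the policy (2 weeks of Globus availability from the delivery email, 4 further weeks in archive, and 5 years on DME from data generation); the function name and the returned keys are hypothetical, and the actual dates for a project should be confirmed with the bioinformatics group.

```python
from datetime import date, timedelta
from typing import Dict


def retention_dates(delivery_email: date, data_generated: date) -> Dict[str, date]:
    """Estimate retention milestones from the delivery and data generation dates."""
    globus_until = delivery_email + timedelta(weeks=2)
    archive_until = globus_until + timedelta(weeks=4)
    try:
        dme_until = data_generated.replace(year=data_generated.year + 5)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        dme_until = data_generated.replace(year=data_generated.year + 5, day=28)
    return {
        "globus_download_until": globus_until,
        "globus_archive_until": archive_until,
        "dme_available_until": dme_until,
    }


if __name__ == "__main__":
    for label, day in retention_dates(date(2024, 1, 1), date(2023, 12, 15)).items():
        print(label, day.isoformat())
```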

If your data is no longer available for download, please contact the SF bioinformatics group and we can re-run the data processing and alignment as necessary. However, please note that it may take longer to receive the re-analyzed data due to resource conflicts with current production runs. Whenever possible, it is best to download the data in a timely manner after receipt of the delivery notice.

What is the yield per run for different sequencing platforms?

The following table links to the vendor specification pages, which give example yields for each sequencing instrument type supported at CCR-SF, by vendor-supported chemistry and flow cell type. Actual performance parameters may vary based on sample type, sample quality, and clusters passing filter.

Sequencing Platform	Specification Website
Illumina NovaSeq X Plus	https://www.illumina.com/systems/sequencing-platforms/novaseq-x-plus/specifications.html
Illumina NextSeq 2000	https://www.illumina.com/systems/sequencing-platforms/nextseq-1000-2000/specifications.html
Illumina MiSeq	https://www.illumina.com/systems/sequencing-platforms/miseq/specifications.html
PacBio Revio System	https://www.pacb.com/wp-content/uploads/Revio-specification-sheet.pdf
Oxford Nanopore PromethION	https://nanoporetech.com/products/promethion

For further questions, please email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov.