Bioinformatics FAQ and Answers – Cancer Research Technology Program (CRTP)

Bioinformatics Questions

What analyses does the CCR-SF Bioinformatics group perform?

Currently we offer primary and secondary analyses for all NGS projects, including initial base-calling, demultiplexing, data quality control, and reference genome alignment of NGS reads. We also offer tertiary analyses on a limited basis for R&D projects, which may include de novo assembly, whole genome structural variant analysis, full-length transcriptomic splice variant detection, and single cell analysis. For all projects, we ensure that every sequencing run we deliver meets our high standards for yield, base-call quality, and base alignment percentage, as well as the application-specific metrics that we have established.

Sequencing depth and experimental design questions?

Coverage requirements vary by application, library protocol, sequencing platform, and project-specific considerations. To identify the best approach for your project, we set up a meeting between you and representatives from our sequencing facility to make recommendations on sequencing platform, library protocol, and other needs.
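
As a rough illustration of how depth relates to read count, mean coverage is commonly estimated as total sequenced bases divided by genome size. The Python sketch below implements only that back-of-the-envelope estimate. The function names and the example numbers (a 3.1 Gb genome, 2 x 150 bp reads) are hypothetical and are not CCR-SF recommendations, and the estimate ignores duplicates, unmapped reads, and uneven coverage.

```python
import math


def estimated_coverage(num_fragments, read_length, genome_size, paired=True):
    """Estimate mean depth as total sequenced bases / genome size.

    num_fragments: number of reads (single-end) or read pairs (paired-end).
    read_length:   length of one read in bases.
    genome_size:   size of the target genome or region in bases.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    bases_per_fragment = read_length * (2 if paired else 1)
    return num_fragments * bases_per_fragment / genome_size


def fragments_needed(target_coverage, read_length, genome_size, paired=True):
    """Invert the estimate: fragments required to reach a target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(target_coverage * genome_size / bases_per_fragment)


# Hypothetical example values, for illustration only.
GENOME_SIZE = 3_100_000_000  # assumed approximate human genome size in bases
depth = estimated_coverage(400_000_000, 150, GENOME_SIZE)
pairs = fragments_needed(30, 150, GENOME_SIZE)
print(f"estimated mean depth: {depth:.1f}x; read pairs for 30x: {pairs}")
```

The appropriate target depth for a given application is still best settled in the consultation described above.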

For assistance in planning your experiment or to discuss the specifics of your project, please contact Bao Tran (bao.tran@nih.gov). For bioinformatics consultation, please contact Yongmei Zhao (yongmei.zhao@nih.gov).

Please go to the following web links for experimental design best practices:

ATAC-seq Best Practices: https://informatics.fas.harvard.edu/atac-seq-guidelines.html
ChIP-seq, RNA-seq, and Exome and Whole Genome-seq: https://bioinformatics.ccr.cancer.gov/ccbr/project-support/experimental-design-best-practices/
Whole genome sequencing and Structural Variation Detection Best Practices: coming soon
Single cell RNA-seq: coming soon

What is required to assure timely processing and delivery of my data?

We recommend an initial consultation with the CCR-SF Bioinformatics group to discuss data analysis requirements and to establish expectations. It is also important to specify the reference genome version and annotation build for projects with human or mouse genome mapping requirements. For other reference-based sequencing projects, you will need to provide us with the reference sequences (FASTA file format or weblink).

If you have any questions regarding your preferred data processing options, please contact Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov.

What types of analysis workflows does the CCR-SF use to perform analyses?

We currently provide analyses based on sequencing application type. We have designed and implemented in-house data analysis pipelines that integrate platform/vendor specific data analysis tools with popular open-source software.

Currently available data analysis pipelines:

Illumina Sequencing

ATAC-seq
ChIP-seq
Exome-seq
Whole Genome Sequencing for SNVs, CNVs and SVs
RNA-Seq
miRNA-Seq
Whole genome bisulfite-seq

PacBio Long-read Sequencing

16S amplicon
Iso-seq
De novo Assembly
Whole genome sequencing for mutations (SNVs) and structural variant (SV) analysis
DNA base modification
HLA Genotyping

Single Cell Analysis

Single Cell RNA-seq and CITE-seq
Single Cell Immune Profiling
Single Cell Multiome

Oxford Nanopore Sequencing

Direct RNA or full-length transcript sequencing
Adaptive sampling for regions of interest analysis or virus integration sites detection
Single cell full-length transcriptomic sequencing analysis

Bionano Optical Genome Mapping

Whole genome structural variant (SV) analysis and copy number variation (CNV) profiling

What types of data formats will I receive from CCR-SF?

For projects using the Illumina sequencing platform, you will receive a PDF report containing a summary of the sequencing project (i.e., library and sequencing protocols, sequencing result summary, application-based QC metrics, and software details) and an Excel file containing the detailed data analysis results. Depending on the application, you will also receive an HTML QC report file that contains detailed QC statistics and plots for the analysis workflows included for that specific application. In addition, you will receive the pass-filtered raw sequence reads in FASTQ format and the reference alignment data in BAM format. BAM files contain base-call and quality score information for all pass-filtered reads, as well as alignment information for reads that have mapped to the reference genome. Additional application-specific data files are specified in the deliverable data file types listed below.
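
To make the FASTQ deliverable more concrete, here is a minimal Python sketch that reads four-line FASTQ records and computes the mean Phred quality per read. It assumes uncompressed, unwrapped FASTQ text with the Phred+33 quality encoding used by current Illumina software. The function names and the two inline records are made up for illustration, and this is not the facility's QC pipeline.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":          # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()         # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near: {header.strip()!r}")
        yield header[1:].strip(), seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming the given ASCII offset."""
    return mean(ord(ch) - offset for ch in quality)


# Two invented records, for illustration only.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCA\n+\n!!II\n"
)
for name, seq, qual in read_fastq(example):
    print(name, len(seq), mean_phred(qual))
```

For delivered files that are gzip-compressed, the same functions can be given a handle opened with `gzip.open(path, "rt")` from the standard library.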

For projects using the PacBio sequencing platform, the data delivery choice is driven by the specific needs of the project. For example, when circular consensus processing is performed, the raw subreads BAM file, run definition XML files, and the consensus reads (CCS) are included in the data delivery package. If alignment and variant calling are performed, the resulting data are provided as BAM and VCF files. Files containing the intermediate results of pipeline processing (such as the read-to-cluster mapping for Iso-seq) are sometimes included as well. Beyond that, we are happy to deliver any of the files produced by our processing upon request. The content of the data delivery package should be discussed at project definition time.

For standard projects, the deliverable data file types are:

Sequencing FASTQ/FASTA files
Alignment BAM files or assembly files
Data QC statistics reports
Mapping or variant calling statistics

For projects with secondary and application specific analysis, the deliverable data file types are:

Exome-seq or WGS Structural Variants Discovery:

Raw FASTQ files
Alignment BAM files
SNP/Indel and structural variant call VCF files
Structural variant call BED file
Variant annotation files
QC and variant analysis statistics reports

RNA-Seq:

Raw FASTQ files
STAR-2pass alignment BAM files
RSEM gene and transcript quantification count matrix files
QC and RNA analysis statistics reports

PacBio Iso-seq:

Raw data: CCS/HiFi reads BAM or FASTQ
QC and Statistics reports: MultiQC report, SQANTI3 report and Kraken contamination check
Analysis data: high quality clustered isoforms, full-length cDNAs, and SQANTI3 filtered results, including BAM and GTF files as well as a classification table.

PacBio De novo Assembly:

Raw data: CCS reads BAM or FASTQ
QC and Statistics reports: MultiQC report, assembly report and Kraken contamination check
Analysis data: polished contigs

PacBio Long Amplicon Sequencing:

Raw data: CCS reads BAM or FASTQ
QC and Statistics reports: MultiQC report, Kraken contamination check
Analysis data: Clustered long amplicon consensus, phasing and variant analysis file

PacBio WGS Sequencing:

Raw data: CCS reads BAM or FASTQ
QC and Statistics reports: MultiQC report, Kraken contamination check
Analysis data: mapped BAM file, SV VCF file

PacBio HLA Genotyping:

Raw data: CCS reads BAM or FASTQ
QC and Statistics reports: MultiQC report, Kraken contamination check
Analysis data: mapped BAM files, standard HLA genotyping reports

Single Cell RNA:

Cell Ranger output
Seurat clustering
SingleR annotations
Nozzle report

Single Cell ATAC:

Cell Ranger output
Signac clustering

Single Cell Multiome:

Cell Ranger output

Single Cell Immune Profiling:

Cell Ranger output

Single Cell Fixed RNA Profiling:

Cell Ranger output

Single Cell CNV:

Cell Ranger DNA output

Single Cell PIPseq:

PIPseeker output

How do I analyze the data?

The SF typically provides primary, secondary, and sometimes tertiary data analysis, which includes delivery of the pass-filtered raw read FASTQ files and alignment BAM files, gene quantification count files, or variant analysis VCF files to the customer. Investigators are expected to arrange their own downstream analyses not offered by the SF bioinformatics group. For investigators interested in performing their own bioinformatics in-house, there are several commercial software options from Illumina, PacBio, and third-party vendors. In addition, many open-source NGS software tools are freely available on Biowulf and from other online computing sources.
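
As one small example of a first downstream look at a delivered file, the Python sketch below tallies variant records in a VCF by broad type using only the standard library. It assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, ...). The function names, the type labels, and the inline records are illustrative assumptions, not part of the SF deliverables, and dedicated tools are better suited to real analyses.

```python
from collections import Counter


def classify(ref, alt):
    """Assign a broad, illustrative type label to one REF/ALT pair."""
    if alt in (".", "*"):
        return "other"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic/SV"      # e.g. <DEL> or breakend notation
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "indel"


def tally_variants(lines):
    """Count variant alleles by type from an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue              # skip meta-information, header, blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):
            counts[classify(ref, alt)] += 1
    return counts


# Invented records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.\n",
    "chr2\t300\t.\tC\t<DEL>\t.\tPASS\tSVTYPE=DEL\n",
]
print(dict(tally_variants(example_vcf)))
```

A file on disk can be passed directly as an open text handle, or via `gzip.open(path, "rt")` if it is gzip-compressed.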

For investigators in need of assistance with downstream NGS data analyses, the CCR Collaborative Bioinformatics Resource (CCBR) provides expert bioinformatics data analysis for the Center for Cancer Research at the NCI free of charge. To contact the CCBR, please submit a request through the CCBR Project Submission Form at https://bioinformatics.ccr.cancer.gov/ccbr/project-support/

How large are the delivery files?

Because NGS is still a rapidly evolving field, this answer changes regularly. Please contact the bioinformatics group for current data delivery file size information.

How are the data files delivered?

Please contact the bioinformatics group to discuss your options. The original sequence, alignment, and analysis files are available to download through the CBIIT DME system. To access your project data in the DME system, please email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov to have your NIH account linked to the DME system. You will need to register an account for each lab member planning to log in. Please follow the DME tutorials to access and download your project data.

If you or your collaborator does not have an NIH account, we recommend that you register an account at GlobusFTP (https://www.globus.org/) in order to transfer data via the GlobusFTP site. Please see the following tutorial on registering an account and transferring data: https://hpc.nih.gov/docs/globus/

If you have any issues setting up a Globus account or transferring data via the shared endpoint, please contact us at CCRSF_IFX@nih.gov.

How long is the data made available to download?

How long data files remain on the CBIIT DME system currently depends on the data life cycle defined by the data policy implemented at CBIIT. The data are available online for up to 5 years after the initial project data generation. For data files uploaded to the Globus system, we make data available for up to 2 weeks starting from the date of our data delivery email announcement. It is the responsibility of the investigator, laboratory contact, or bioinformatics contact to ensure that they have retrieved their data promptly. To maintain sufficient data storage for upcoming projects, the analysis files are then archived and stored for an additional four weeks for Globus data transfer.

If your data is no longer available for download, please contact the SF bioinformatics group and we can re-run the data processing and alignment as necessary. However, please note that it may take longer to receive the re-analyzed data due to resource conflicts with current production runs. Whenever possible, it is best to download the data in a timely manner after receipt of the delivery notice.

How do I obtain a LIMS account and submit an order in the CCR-SF LIMS?

Instructions on how to initiate the LIMS account set-up process for your group, as well as instructions on how to submit an order once your account is authorized by CCR-SF, are available in the LIMS user guide.

To have your account authorized, the PI should, after successfully completing the account creation steps in the LIMS user guide, email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov with a list of group members requiring LIMS account access.

What is the yield per run for different sequencing platforms?

The following table links to the vendor specification pages, which list example yields per sequencing instrument type for the vendor-supported chemistry and flow cell types used in applications supported at CCR-SF. Actual performance may vary based on sample type, sample quality, and clusters passing filter.

Sequencing Platform	Specification Website
Illumina NextSeq 2000	https://www.illumina.com/systems/sequencing-platforms/nextseq-1000-2000/specifications.html
Illumina NovaSeq 6000	https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html
Illumina NovaSeq X Plus	https://www.illumina.com/systems/sequencing-platforms/novaseq-x-plus/specifications.html
PacBio Sequel System	https://www.pacb.com/technology/hifi-sequencing/sequel-system/
PacBio Revio System	https://www.pacb.com/wp-content/uploads/Revio-specification-sheet.pdf
Oxford Nanopore GridION	https://nanoporetech.com/products/gridion
Oxford Nanopore PromethION	https://nanoporetech.com/products/promethion

For further questions, please contact the SF Bioinformatics Team at CCRSF_IFX@nih.gov.