Supported by CCR Office of Science and Technology Resources (OSTR)

Bioinformatics FAQ and Answers

Bioinformatics Questions

Currently we offer primary and secondary analyses for all NGS projects, including initial base-calling, demultiplexing, data quality control, and reference genome alignment of NGS reads. We also offer tertiary analyses on a limited basis for R&D projects, which may include de novo assembly, whole genome structural variant analysis, full-length transcriptomic splice variant detection, and single cell analysis. For all projects, we ensure that every sequencing run we deliver meets our high standards for yield, base-call quality, and base alignment percentage, as well as the application-specific standard metrics that we have established.

Coverage requirements vary by application, library protocol, sequencing platform, and project-specific considerations. To determine the best approach for your project, we set up a meeting between you and representatives from our sequencing facility to make recommendations on sequencing platform, library protocol, and other needs.

For assistance in planning your experiment or to discuss specifics of your project, please contact Bao Tran (bao.tran@nih.gov). For bioinformatics consultation, please contact Yongmei Zhao (yongmei.zhao@nih.gov).

Please go to the following web links for experimental design best practices:

We recommend an initial consultation with the CCR-SF Bioinformatics group to discuss data analysis requirements and to establish expectations. It is also important to specify the reference genome version and annotation build for projects with human or mouse genome mapping requirements. For other reference-based sequencing projects, you will need to provide us with the reference sequences (FASTA file format or web link).

If you have any questions regarding your preferred data processing options, please contact Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov.

What types of analysis workflows does the CCR-SF use to perform analyses?

We currently provide analyses based on sequencing application type. We have designed and implemented in-house data analysis pipelines that integrate platform/vendor specific data analysis tools with popular open-source software.

Currently available data analysis pipelines:

  • Illumina Sequencing
    • ATAC-seq
    • ChIP-seq
    • Exome-seq
    • Whole Genome Sequencing for SNVs, CNVs and SVs
    • RNA-Seq
    • miRNA-Seq
    • Whole genome bisulfite-seq
  • PacBio Long-read Sequencing
    • 16S amplicon
    • Iso-seq
    • De novo Assembly
    • Whole genome sequencing for mutations (SNVs) and structural variant (SV) analysis
    • DNA base modification
    • HLA Genotyping
  • Single Cell Analysis
    • Single Cell RNA-seq and CITE-seq
    • Single Cell Immune Profiling
    • Single Cell Multiome
  • Oxford Nanopore Sequencing
    • Direct RNA or full-length transcript sequencing
    • Adaptive sampling for region-of-interest analysis or virus integration site detection
    • Single cell full-length transcriptomic sequencing analysis
  • Bionano Optical Genome Mapping
    • Whole genome structural variant (SV) analysis and copy number variation (CNV) profiling

For projects using the Illumina sequencing platform, you will receive a PDF report containing a summary of the sequencing project (i.e., library and sequencing protocols, sequencing result summary, application-based QC metrics, and software details) and an Excel file containing the detailed data analysis results. Depending on the application, you will also receive an HTML QC report file containing detailed QC statistics and plots for the analysis workflows included for that specific application. In addition, you will receive the pass-filtered raw sequence reads in FASTQ format and the reference alignment data in BAM format. BAM files contain base-call and quality score information for all pass-filtered reads, as well as alignment information for reads that have mapped to the reference genome. Additional application-specific data files are listed under the deliverable data file types.
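
As a quick sanity check on a delivered FASTQ file, the read count, total bases, and mean base quality can be computed with a few lines of Python. This is only a minimal sketch, not part of the CCR-SF pipelines: the function name `fastq_summary` is hypothetical, and it assumes uncompressed four-line FASTQ records with Phred+33 quality encoding.

```python
import io


def fastq_summary(handle, phred_offset=33):
    """Return (read_count, total_bases, mean_quality) for a FASTQ text stream.

    Assumes four-line records and Phred+33 quality encoding (an assumption,
    not something stated by the facility).
    """
    reads = 0
    bases = 0
    qual_sum = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {reads + 1}")
        reads += 1
        bases += len(seq)
        # Convert each quality character to its Phred score.
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    mean_quality = qual_sum / bases if bases else 0.0
    return reads, bases, mean_quality


# Tiny in-memory example with two made-up reads.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
print(fastq_summary(example))
```

For a gzip-compressed file, the same function can be given a handle opened with `gzip.open(path, "rt")` from the standard library.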

For projects using the PacBio sequencing platform, the data delivery choice is driven by the specific needs of the project. For example, when circular consensus processing is performed, the raw subreads BAM file, run definition XML files, and the consensus reads (CCS) are included in the data delivery package. If alignment and variant calling are performed, the resulting data are provided as BAM and VCF files. Files containing the intermediate results of pipeline processing (such as the read-to-cluster mapping for Iso-seq) are sometimes included. Beyond that, we are happy to deliver any of the files produced by our processing upon request. The content of the data delivery package should be discussed at project definition time.

For standard projects, the deliverable data file types are:

  • Sequencing FASTQ/FASTA files
  • Alignment BAM files or assembly files
  • Data QC statistics reports
  • Mapping or variant calling statistics

For projects with secondary and application specific analysis, the deliverable data file types are:

  • Exome-seq or WGS Structural Variants Discovery:
    • Raw FASTQ files
    • Alignment BAM files
    • SNP/Indel and structural variant call VCF files
    • Structural variant call BED file
    • Variant annotation files
    • QC and variant analysis statistics reports
  • RNA-Seq:
    • Raw FASTQ files
    • STAR-2pass alignment BAM files
    • RSEM gene and transcript quantification count matrix files
    • QC and RNA analysis statistics reports
  • PacBio Iso-seq:
    • Raw data: CCS/HiFi reads BAM or FASTQ
    • QC and Statistics reports: MultiQC report, SQANTI3 report and Kraken contamination check
    • Analysis data: high quality clustered isoforms, full length cDNAs, SQANTI3 filtered results including BAM, GTF as well as classification table.
  • PacBio De novo Assembly:
    • Raw data: CCS reads BAM or FASTQ
    • QC and Statistics reports: MultiQC report, assembly report and Kraken contamination check
    • Analysis data: polished contigs
  • PacBio Long Amplicon Sequencing:
    • Raw data: CCS reads BAM or FASTQ
    • QC and Statistics reports: MultiQC report, Kraken contamination check
    • Analysis data: Clustered long amplicon consensus, phasing and variant analysis file
  • PacBio WGS Sequencing:
    • Raw data: CCS reads BAM or FASTQ
    • QC and Statistics reports: MultiQC report, Kraken contamination check
    • Analysis data: mapped BAM file, SV VCF file
  • PacBio HLA Genotyping:
    • Raw data: CCS reads BAM or FASTQ
    • QC and Statistics reports: MultiQC report, Kraken contamination check
    • Analysis data: mapped BAM files, standard HLA genotyping reports
  • Single Cell RNA:
    • Cell Ranger output
    • Seurat clustering
    • SingleR annotations
    • Nozzle report
  • Single Cell ATAC:
    • Cell Ranger output
    • Signac clustering
  • Single Cell Multiome:
    • Cell Ranger output
  • Single Cell Immune Profiling:
    • Cell Ranger output
  • Single Cell Fixed RNA Profiling:
    • Cell Ranger output
  • Single Cell CNV:
    • Cell Ranger DNA output
  • Single Cell PIPseq:
    • PIPseeker output
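
When a delivery arrives, the lists above can be used as a checklist. The following is a minimal sketch of such a check in Python. The table `EXPECTED_SUFFIXES`, the function `missing_deliverables`, and the file name suffixes are all hypothetical choices made for illustration; actual file names in a delivery may differ and should be confirmed with the bioinformatics group.

```python
# Hypothetical mapping from application to expected file types and the
# file name suffixes assumed to identify them (suffixes are assumptions).
EXPECTED_SUFFIXES = {
    "Exome-seq or WGS": {
        "FASTQ": (".fastq", ".fastq.gz", ".fq", ".fq.gz"),
        "BAM": (".bam",),
        "VCF": (".vcf", ".vcf.gz"),
        "BED": (".bed",),
    },
    "RNA-Seq": {
        "FASTQ": (".fastq", ".fastq.gz", ".fq", ".fq.gz"),
        "BAM": (".bam",),
    },
}


def missing_deliverables(application, filenames):
    """Return the expected file types that have no matching file name."""
    expected = EXPECTED_SUFFIXES[application]
    names = [name.lower() for name in filenames]
    return [
        file_type
        for file_type, suffixes in expected.items()
        # str.endswith accepts a tuple of candidate suffixes.
        if not any(name.endswith(suffixes) for name in names)
    ]


# Made-up file names for illustration only.
delivered = ["sample1_R1.fastq.gz", "sample1.bam", "sample1.vcf.gz"]
print(missing_deliverables("Exome-seq or WGS", delivered))
```

Reports and annotation files are left out of the sketch because the source does not say which file formats they use.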

The SF typically provides primary, secondary and sometimes tertiary data analysis, which includes delivery of the FASTQ pass-filtered raw read files and alignment BAM files, gene quantification counting files, or variant analysis VCF files to the customer. Investigators are expected to provide for their own downstream analyses not offered by the SF bioinformatics group. For investigators interested in performing their own bioinformatics in-house, there are several commercial software options from Illumina, PacBio, and third-party vendors. In addition, many open-source NGS software tools are freely available from Biowulf and other online computing sources.

For investigators in need of assistance with downstream NGS data analyses, the CCR Collaborative Bioinformatics Resource (CCBR) provides expert bioinformatics data analysis for the Center for Cancer Research at the NCI free of charge. To contact the CCBR, please submit a request through the CCBR Project Submission Form at https://bioinformatics.ccr.cancer.gov/ccbr/project-support/

Because NGS is still a rapidly evolving field, this answer changes regularly. Please contact the bioinformatics group for current data delivery file size information.

Please contact the bioinformatics group to discuss your options. The original sequence, alignment, and analysis files are available to download through the CBIIT DME system. To access your project data on the DME system, please email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov to get your NIH account linked to the DME system. You will need to register an account for each lab member planning to log in. Please follow the DME tutorials to access and download your project data.

If you or your collaborator does not have an NIH account, we recommend that you register an account at GlobusFTP (https://www.globus.org/) in order to transfer data via the GlobusFTP site. Please see the following tutorial on registering an account and transferring data: https://hpc.nih.gov/docs/globus/

If you have any issues setting up a Globus account or transferring data via the shared endpoint, please contact us at CCRSF_IFX@nih.gov.

How long data files remain on the CBIIT DME system depends on the data life cycle defined by the data policy implemented at CBIIT. Currently, data are available online for up to 5 years after the initial project data generation. For data files uploaded to Globus, we make the data available for up to 2 weeks from the date of our data delivery email announcement. It is the responsibility of the investigator, laboratory contact, or bioinformatics contact to ensure that the data are retrieved promptly. To maintain sufficient data storage for upcoming projects, the analysis files are then archived and stored for an additional four weeks for Globus data transfer.
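
The Globus windows described above are simple date arithmetic, sketched here in Python for illustration. The function names `globus_windows` and `globus_status` are hypothetical, and two points are assumptions rather than stated policy: that the four-week archive period starts when the two-week window ends, and that both end dates are inclusive. The DME retention period is not modeled.

```python
from datetime import date, timedelta

# Durations taken from the text above.
GLOBUS_ACTIVE = timedelta(weeks=2)   # available after the delivery email
GLOBUS_ARCHIVE = timedelta(weeks=4)  # additional archived period


def globus_windows(delivery_date):
    """Return (active_until, archived_until) for a delivery email date."""
    active_until = delivery_date + GLOBUS_ACTIVE
    # Assumption: the archive period begins when the active window ends.
    archived_until = active_until + GLOBUS_ARCHIVE
    return active_until, archived_until


def globus_status(delivery_date, today):
    """Classify a date as 'available', 'archived', or 'expired'."""
    active_until, archived_until = globus_windows(delivery_date)
    if today <= active_until:
        return "available"
    if today <= archived_until:
        return "archived"
    return "expired"


# Example with a made-up delivery date.
delivered = date(2024, 3, 1)
active_until, archived_until = globus_windows(delivered)
print(active_until.isoformat(), archived_until.isoformat())
print(globus_status(delivered, date(2024, 4, 1)))
```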

If your data is no longer available for download, please contact the SF bioinformatics group and we can re-run the data processing and alignment as necessary. However, please note that it may take longer to receive the re-analyzed data due to resource conflicts with current production runs. Whenever possible, it is best to download the data in a timely manner after receipt of the delivery notice.

The LIMS user guide explains how to initiate the LIMS account set-up process for your group and how to submit an order once your account has been authorized by CCR-SF.

To have your account authorized, the PI should first complete the account creation steps in the LIMS user guide and then email Yongmei Zhao (yongmei.zhao@nih.gov) or the SF Bioinformatics Team at CCRSF_IFX@nih.gov with a list of group members requiring LIMS account access.

The following table lists the vendor specification page for each sequencing instrument type supported at CCR-SF, where example yields are given for the vendor-supported chemistry and flow cell types. Actual performance parameters may vary based on sample type, sample quality, and clusters passing filter.

Sequencing Platform          Specification Website
Illumina NextSeq 2000        https://www.illumina.com/systems/sequencing-platforms/nextseq-1000-2000/specifications.html
Illumina NovaSeq 6000        https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html
Illumina NovaSeq X Plus      https://www.illumina.com/systems/sequencing-platforms/novaseq-x-plus/specifications.html
PacBio Sequel System         https://www.pacb.com/technology/hifi-sequencing/sequel-system/
PacBio Revio System          https://www.pacb.com/wp-content/uploads/Revio-specification-sheet.pdf
Oxford Nanopore GridION      https://nanoporetech.com/products/gridion
Oxford Nanopore PromethION   https://nanoporetech.com/products/promethion

For further questions, please contact the SF Bioinformatics Team at CCRSF_IFX@nih.gov.