About the Isopedia

Isopedia is a scalable tool for analyzing hundreds to thousands of long-read transcriptome datasets simultaneously, using a read-level indexing approach. It provides two key capabilities:

Population-level transcript quantification and frequency profiling: with a single command, Isopedia can quantify isoform expression and estimate occurrence frequencies across large cohorts using minimal computational resources.

Isoform diversity exploration and visualization: Isopedia enables systematic analysis of fusion genes and specific splicing events across populations, offering insights into transcript diversity beyond individual samples.

Quick Start

Isopedia has two binaries: isopedia and isopedia-tools. The main binary, isopedia, provides all the main functions; isopedia-tools provides helper functions.

Download a pre-built index and run

# install isopedia from conda
conda install -y zhengxinchang::isopedia

# query transcripts
isopedia isoform -i index/ -g query.gtf -o out.isoform.tsv.gz

# query one fusion gene (two breakpoints)
isopedia fusion  -i index/ -p chr1:181130,chr1:201853853 -o out.fusion.tsv.gz

# query multiple fusion genes
isopedia fusion  -i index/ -P fusion_query.bed -o out.fusion.tsv.gz

# query gene regions and discover potential fusion events
isopedia fusion  -i index/ -G gene.gtf -o out.fusion.discovery.tsv.gz

# query a splice junction and visualize it
isopedia splice  -i index/ -s 17:7675236,17:7675993  -o out.splice.tsv.gz
python script/isopedia-splice-viz.py  -i out.splice.tsv.gz -g gencode.v47.basic.annotation.gtf  -t script/temp.html  -o isopedia-splice-view

For indexing GTF files, please refer to the Indexing GTF Files section.

How it works

[Figure: how-it-works]

The Isopedia workflow involves several key steps: isoform profiling, merging, indexing, and querying. Users start by profiling isoform signals from individual BAM files, then merge the results to build a comprehensive index. Once the index is ready, it can be used to query isoforms and fusion genes and to explore splice junctions across multiple samples. Users can also visualize specific splicing events using the provided visualization tools.

Isopedia comes with pre-built indexes from hundreds of publicly available long-read RNA-seq datasets, which can be used directly for isoform and fusion gene annotation.

[Figure: how-it-works2]

This figure depicts how Isopedia determines a positive hit for a query in different scenarios.

Download pre-built index

[place holder for index download link]

Build your own index

Isopedia supports building a local index from your own datasets. The prerequisites are listed below:

  1. The latest isopedia binaries
  2. A set of mapped BAM files (sorted BAM files are not required)
  3. A manifest file that describes the sample name, isoform file path, and other optional metadata in tabular format (tab-separated, with a header line).

Example manifest file:

sample_name   path   platform
HG002_pb_chr22   /path/to/hg002_pb_chr22.isoform.gz   PacBio
HG002_ont_chr22   /path/to/hg002_ont_chr22.isoform.gz   ONT
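A manifest like the one above can also be generated programmatically. The following is a minimal sketch, assuming only the layout described in the prerequisites (tab-separated, one header line, sample_name and path first, optional metadata columns after). The helper name write_manifest is hypothetical and not part of Isopedia.

```python
import csv


def write_manifest(rows, out_path):
    """Write a tab-separated manifest with a header line.

    rows: list of dicts; every dict needs 'sample_name' and 'path'.
    Any extra keys (e.g. 'platform') become optional metadata columns.
    Returns the list of column names that were written.
    """
    if not rows:
        raise ValueError("manifest needs at least one sample")
    for row in rows:
        if "sample_name" not in row or "path" not in row:
            raise ValueError("each row needs 'sample_name' and 'path'")
    # Required columns first, then metadata columns in first-seen order.
    columns = ["sample_name", "path"]
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=columns,
            delimiter="\t",
            restval="",  # samples lacking a metadata value get an empty cell
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Keeping sample_name and path as the first two columns matters because the merge step reads only those two.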

Example workflow

# make sure isopedia is in your $PATH, or use the absolute path to the binaries.

# download the toy example (toy_ex)
git clone https://github.com/zhengxinchang/isopedia && cd isopedia/toy_ex/

# profile isoform signals on each bam individually
isopedia profile -b ./chr22.pb.grch38.bam -o ./hg002_pb_chr22.isoform.gz
isopedia profile -b ./chr22.ont.grch38.bam -o ./hg002_ont_chr22.isoform.gz

# make a manifest.tsv (tab-separated) for the *.isoform.gz files. An example can be found at ./manifest.tsv

# merge; only the first two columns are read in this step.
isopedia merge -i manifest.tsv -o index/

# build the index. Provide the same manifest file; the remaining metadata columns are read in this step.
isopedia index  -i index/ -m manifest.tsv

# test your index by running a small annotation task.
isopedia isoform -i index/ -g gencode.v47.basic.chr22.gtf -o out.isoform.tsv.gz
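For cohorts with many BAM files, the profile, merge, and index steps above can be scripted. The sketch below only assembles the command lines, using the flags shown in the example workflow, and executes nothing. The function name build_index_commands and its defaults are assumptions made for illustration.

```python
import shlex


def build_index_commands(bams, manifest="manifest.tsv", index_dir="index/"):
    """Return the profile/merge/index command lines as argument lists.

    bams maps each input BAM path to the *.isoform.gz path it should produce.
    Nothing is executed here; each list can be passed to
    subprocess.run(cmd, check=True) once the paths have been reviewed.
    """
    commands = []
    # One profile call per BAM file.
    for bam, isoform_gz in bams.items():
        commands.append(["isopedia", "profile", "-b", bam, "-o", isoform_gz])
    # Merge reads the manifest; index reads it again for the metadata columns.
    commands.append(["isopedia", "merge", "-i", manifest, "-o", index_dir])
    commands.append(["isopedia", "index", "-i", index_dir, "-m", manifest])
    return commands


cmds = build_index_commands(
    {
        "./chr22.pb.grch38.bam": "./hg002_pb_chr22.isoform.gz",
        "./chr22.ont.grch38.bam": "./hg002_ont_chr22.isoform.gz",
    }
)
for cmd in cmds:
    print(shlex.join(cmd))
```

The manifest itself still has to exist before the merge command is run.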

Usage

Query (quantify) transcripts

Purpose:

Search for the transcripts in the input GTF file and report how many samples in the index have evidence for each.

Example:

isopedia isoform -i index/ -g query.gtf -o out.tsv.gz

Key parameters:

--min-read(-m) minimum number of supporting reads in a sample to define a positive sample

--flank(-f) number of flanking base pairs allowed when searching splice sites. A larger value slows down the run but tolerates more wobble in splice-site positions.
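The two parameters above can be read as simple predicates. The sketch below illustrates their documented meaning only and is not Isopedia's implementation: the inclusive window and all helper names are assumptions.

```python
def within_flank(query_pos, observed_pos, flank=10):
    """True if observed_pos lies within `flank` bases of query_pos.

    The window is treated as inclusive on both sides (an assumption).
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return abs(observed_pos - query_pos) <= flank


def junction_matches(query, observed, flank=10):
    """Both ends of a (donor, acceptor) pair must fall inside the window."""
    return all(within_flank(q, o, flank) for q, o in zip(query, observed))


def count_positive_samples(reads_per_sample, min_read=1):
    """Count samples whose supporting read count is >= min_read."""
    return sum(1 for reads in reads_per_sample if reads >= min_read)
```

With the default flank of 10, an observed splice site 6 bases away from the query still counts as a match, while one 11 bases away does not. Raising --min-read shrinks the set of positive samples.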

All parameters:
Usage: isopedia isoform [OPTIONS] --idxdir <IDXDIR> --gtf <GTF> --output <OUTPUT>

Options:
  -i, --idxdir <IDXDIR>
          Path to the index directory

  -g, --gtf <GTF>
          Path to the GTF file

  -f, --flank <FLANK>
          Flanking size (in bases) before and after the position
          
          [default: 10]

  -m, --min-read <MIN_READ>
          Minimum number of reads required to define a positive sample
          
          [default: 1]

  -o, --output <OUTPUT>
          Output file for search results

  -w, --warmup-mem <WARMUP_MEM>
          Memory size to use for warming up (in gigabytes). Example: 4GB. Increasing this will significantly improve performance; set it as large as your system allows
          
          [default: 4]

  -c, --cached_nodes <LRU_SIZE>
          Maximum number of cached nodes per tree
          
          [default: 10000]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Output

The output of the isoform command is a tab-separated file with the following columns:

| Column name | Description |
| --- | --- |
| chrom | Chromosome |
| start | Start position of the query transcript |
| end | End position of the query transcript |
| length | Length of the query transcript |
| exon_count | Number of exons in the query transcript |
| trans_id | Transcript ID |
| gene_id | Gene ID |
| confidence | Confidence value for detecting the query transcript in the index |
| detected | Whether at least one sample supports this transcript with ≥ --min-read reads |
| min_read | Minimum number of reads to define a positive sample |
| positive_count/sample_size | Positive count / sample size |
| attributes | Original attributes of the transcript from the input GTF file |
| FORMAT | Format of the values in each sample column |
| sample1 | Values |
| sampleN | Values |

A few columns can be used to filter the results.

detected: a binary value indicating whether at least one sample has evidence supporting the query transcript. It can be used to quickly filter out transcripts without evidence.

positive_count/sample_size: a combination of two values indicating how many samples have enough evidence (as defined by --min-read). It can be used to quickly select transcripts supported by at least a given number of samples in the index.

confidence: a value that summarizes the confidence of observing a transcript across the entire index.
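As an illustration of filtering on positive_count/sample_size, the sketch below keeps transcripts seen in at least a given fraction of samples. It rests on two assumptions not confirmed by the text above: the output file starts with a header line whose names match the column table, and the cell is formatted as two integers separated by a slash (for example 12/107). The function names are hypothetical.

```python
import csv
import gzip


def parse_positive_fraction(value):
    """Split a 'positive_count/sample_size' cell such as '12/107' into ints."""
    positive, total = value.split("/")
    return int(positive), int(total)


def filter_by_sample_support(tsv_gz_path, min_fraction=0.05):
    """Yield rows whose share of positive samples is at least min_fraction."""
    with gzip.open(tsv_gz_path, "rt", newline="") as handle:
        # Assumes the first line is a header naming the columns.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            positive, total = parse_positive_fraction(
                row["positive_count/sample_size"]
            )
            if total > 0 and positive / total >= min_fraction:
                yield row
```

Because the function is a generator, large result files are streamed row by row rather than loaded into memory.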

CPM values are provided in each sample column. CPM is defined as:

$$CPM=\frac{\text{Number of support reads for the query transcript}}{\text{Total number of valid reads in the sample}} \times 1{,}000{,}000$$

The confidence value $C$ is defined as:

$$C = \frac{k}{n} \times \left(\prod_{i=1}^{n}CPM_{i}\right)^{1/n} \times G$$

where $n$ is the total number of samples in the index, $k$ is the number of samples with evidence (at least 1 support read) for the query, and $CPM_{i}$ is the counts-per-million value of the transcript in sample $i$, defined as:

$$CPM_{i}=\frac{\text{Number of support reads for the query transcript}}{\text{Total number of valid reads in the sample }i} \times 1{,}000{,}000$$

$G$ is the Gini coefficient in positive samples $(i=0..k)$:

$$G = 2 \frac{\sum_{i=1}^{n} i \times CPM_{i}}{n \sum_{i=1}^{n} CPM_{i}} - \frac{n+1}{n}$$
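The formulas above can be worked through numerically. The following is a sketch under stated assumptions, not Isopedia's own code. The text does not fully pin down whether the product and the Gini sums run over all $n$ samples or only the $k$ positive ones. The sketch takes both over positive samples only, because a single zero CPM would otherwise force the product to zero. It also sorts the values in ascending order, as the rank-weighted form of the Gini coefficient requires. All function names are hypothetical.

```python
import math


def cpm(support_reads, total_valid_reads):
    """Counts per million for one sample."""
    return support_reads / total_valid_reads * 1_000_000


def geometric_mean(values):
    """Geometric mean of strictly positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gini(values):
    """Rank-weighted Gini coefficient; values are sorted ascending first."""
    ordered = sorted(values)
    m = len(ordered)
    weighted = sum(rank * v for rank, v in enumerate(ordered, start=1))
    return 2 * weighted / (m * sum(ordered)) - (m + 1) / m


def confidence(cpms):
    """cpms: one CPM per sample in the index (0 where no read supports it).

    Assumption: the geometric mean and G use the positive samples only.
    """
    n = len(cpms)
    positives = [v for v in cpms if v > 0]
    k = len(positives)
    if k == 0:
        return 0.0
    return k / n * geometric_mean(positives) * gini(positives)
```

For per-sample CPM values of 0, 0, 2, and 8, this gives $k/n = 0.5$, a geometric mean of 4, and $G = 0.3$, so $C = 0.6$ under these assumptions.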

Query fusion gene breakpoints

Purpose:

This command is used to search for evidence of specific gene fusion events in the index based on provided breakpoints.

Example:

# query a single fusion
isopedia fusion -i index/ -f 10 -p chr1:pos1,chr2:pos2 -o fusion.anno.bed.gz

# query multiple fusions at the same time
isopedia fusion -i index/ -f 10 -P fusion_breakpoints.bed -o fusion_all.anno.bed.gz

Key parameters:

--min-read(-m) minimum number of supporting reads in a sample to define a positive sample

--flank(-f) number of flanking base pairs allowed when searching breakpoints. A larger value slows down the run but tolerates more wobble in breakpoint positions.

All parameters:
Usage: isopedia fusion [OPTIONS] --idxdir <IDXDIR> --output <OUTPUT>

Options:
  -i, --idxdir <IDXDIR>
          index directory

  -p, --pos <POS>
          two breakpoints for gene fusion to be search(-p chr1:pos1,chr2:pos2)

  -P, --pos-bed <POS_BED>
          bed file that has the breakpoints for gene fusions. First four columns are chr1, pos1, chr2, pos2, and starts from the fifth column is the fusion id

  -G, --gene-gtf <GENE_GTF>
          bed file that has the start-end positions of the genes, used to find any possible gene fusions within the provided gene regions

  -f, --flank <FLANK>
          flank size for search, before and after the position
          
          [default: 10]

  -m, --min-read <MIN_READ>
          minimal reads to define a positive sample
          
          [default: 1]

  -o, --output <OUTPUT>
          output file for search results

      --debug
          debug mode

  -c, --cached_nodes <LRU_SIZE>
          number of cached nodes for each tree in maximal
          
          [default: 1000000]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
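The -P input described in the help text above (chr1, pos1, chr2, pos2, then the fusion id from the fifth column) can be written with a few lines of Python. This sketch assumes tab-separated fields and no header line, neither of which is stated explicitly. The function name and the fusion id used in the usage note are hypothetical.

```python
def write_fusion_bed(fusions, out_path):
    """Write breakpoint pairs for `isopedia fusion -P`.

    fusions: iterable of (chr1, pos1, chr2, pos2, fusion_id) tuples.
    Assumes tab-separated columns and no header line.
    """
    with open(out_path, "w") as handle:
        for chr1, pos1, chr2, pos2, fusion_id in fusions:
            # int() rejects non-numeric positions before they reach the file.
            fields = [chr1, str(int(pos1)), chr2, str(int(pos2)), fusion_id]
            handle.write("\t".join(fields) + "\n")
```

For example, write_fusion_bed([("chr1", 181130, "chr1", 201853853, "fusion_1")], "fusion_query.bed") writes the breakpoint pair from the Quick Start as a single line.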

Output

| Column name | Description |
| --- | --- |
| chr1 | Chromosome of the first breakpoint |
| pos1 | Position of the first breakpoint |
| chr2 | Chromosome of the second breakpoint |
| pos2 | Position of the second breakpoint |
| id | Event or transcript ID |
| min_read | Minimum number of reads required |
| sample_size | Total number of samples considered |
| positive_sample_count | Number of positive samples |
| sample1 | Value/status in sample1 |
| ... | ... |
| sampleN | Value/status in sampleN |

Find potential fusion genes

Purpose:

Query candidate fusion genes within specified gene regions. It identifies potential fusion events by examining all possible gene pairs within the provided regions and reporting those with supporting evidence in the index.

Example:


isopedia fusion -i index/  -G gene.gtf -o fusion.discovery.out.gz

Key parameters:

--gene-gtf(-G) a GTF file containing gene records. All other feature types are ignored.

--min-read(-m) minimum number of supporting reads in a sample to define a positive sample

--flank(-f) number of flanking base pairs allowed when searching splice sites. A larger value slows down the run but tolerates more wobble in splice-site positions.

Output

| Column name | Description |
| --- | --- |
| gene1_name | Name of gene 1 |
| gene1_id | ID of gene 1 |
| gene2_name | Name of gene 2 |
| gene2_id | ID of gene 2 |
| chr1 | Chromosome of gene 1 |
| start1 | Consensus start position for gene 1 mapped region |
| end1 | Consensus end position for gene 1 mapped region |
| chr2 | Chromosome of gene 2 |
| start2 | Consensus start position for gene 2 mapped region |
| end2 | Consensus end position for gene 2 mapped region |
| total_evidences | Total number of supporting evidences |
| total_samples | Total number of samples supporting the event |
| splice_junctions_count1 | Number of splice junctions supporting gene 1 |
| splice_junctions_count2 | Number of splice junctions supporting gene 2 |
| Sample1 | Number of supporting reads in sample1 |
| ... | ... |
| SampleN | Number of supporting reads in sampleN |

Query splice junctions and visualize isoforms

Purpose:

This command is designed for cases where you have a specific splice junction of interest and want to explore its isoform context in detail. It provides both tabular output and visualization.

Example:

isopedia splice  -i index/ -s 17:7675236,17:7675993  -o splice.out.gz
python script/isopedia-splice-viz.py  -i splice.out.gz -g gencode.v47.basic.annotation.gtf  -t script/temp.html  -o isopedia-splice-view

Key parameters (isopedia splice):

--min-read(-m) minimum number of supporting reads in a sample to define a positive sample

--flank(-f) number of flanking base pairs allowed when searching splice sites. A larger value slows down the run but tolerates more wobble in splice-site positions.

Key parameters (isopedia-splice-viz.py):

-g/--gtf a GTF file with gene annotations. It is used to annotate the splice junction and isoforms.

-t/--template a template HTML file used to generate the interactive visualization.

All parameters:

Usage: isopedia splice [OPTIONS] --idxdir <IDXDIR> --output <OUTPUT>

Options:
  -i, --idxdir <IDXDIR>
          Path to the index directory

  -s, --splice <SPLICE>
          Splice junction in 'chr1:pos1,chr2:pos2' format

  -S, --splice-bed <SPLICE_BED>
          Path to splice junction bed file

  -f, --flank <FLANK>
          Flanking size (in bases) before and after the position
          
          [default: 10]

  -m, --min-read <MIN_READ>
          Minimum number of reads required to define a positive sample
          
          [default: 1]

  -o, --output <OUTPUT>
          Output file for search results

  -w, --warmup-mem <WARMUP_MEM>
          Memory size to use for warming up (in gigabytes). Example: 4GB. Increasing this will significantly improve performance; Set it to 0(default) if you only have small query and want to skip warming up step
          
          [default: 0]

  -c, --cached_nodes <LRU_SIZE>
          Maximum number of cached nodes per tree
          
          [default: 100000]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
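The -s option expects a junction in 'chr1:pos1,chr2:pos2' format. A small validator can catch malformed queries, such as a region written as chr:start-end, before a run is launched. This is an illustrative sketch; parse_junction is a hypothetical helper, and the exact validation Isopedia itself performs is not documented here.

```python
def parse_junction(text):
    """Parse 'chr1:pos1,chr2:pos2' into ((chrom, pos), (chrom, pos))."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError(
            f"expected two comma-separated positions, got {text!r}"
        )
    ends = []
    for part in parts:
        # rpartition splits on the last ':' in case a contig name contains one.
        chrom, _, pos = part.strip().rpartition(":")
        if not chrom or not pos.isdigit():
            raise ValueError(f"malformed position {part!r}")
        ends.append((chrom, int(pos)))
    return tuple(ends)
```

The same string layout is used by the -p option of isopedia fusion, so the helper applies to breakpoint pairs as well.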

usage: isopedia-splice-viz.py [-h] -i INPUT [-g GTF] [-t TEMPLATE] -o OUTPUT

Visualize isoforms from isopedia-anno-splice output

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file, the file is the output from isopedia-anno-splice with single query mode.
  -g GTF, --gtf GTF     Reference GTF file
  -t TEMPLATE, --template TEMPLATE
                        Templates HTML file
  -o OUTPUT, --output OUTPUT
                        Output file

Output

The output is a gzip-compressed file containing detailed information about the splice junction and associated isoforms. Each isoform record includes:

| Field name | Description |
| --- | --- |
| #id | Isoform identifier |
| chr1, pos1 | Chromosome and position of splice donor |
| chr2, pos2 | Chromosome and position of splice acceptor |
| total_evidence | Total number of supporting reads |
| cpm | Normalized counts (CPM) |
| matched_sj_idx | Index of the matched splice junction |
| dist_to_matched_sj | Distance to the matched splice junction |
| n_exons | Number of exons in the isoform |
| start_pos_left | Leftmost starting position of isoform |
| start_pos_right | Rightmost starting position of isoform |
| end_pos_left | Leftmost ending position of isoform |
| end_pos_right | Rightmost ending position of isoform |
| splice_junctions | List of splice junctions in the isoform |
| format | Format of record |
| ENCSR*** | Per-sample evidence (columns for each dataset, e.g., ENCODE accessions) |

Visualization output example: https://zhengxinchang.github.io/isopedia/

[Figure: isopedia-splice-view]

Memory Usage

ENCODE long-read RNA-seq datasets (107 samples)

| Step | Peak Memory Usage (GB) |
| --- | --- |
| isopedia merge | 7.12 |
| isopedia-idx | 3.84 |
| isopedia isoform (158K transcripts from GENCODE) | 15.82 |

Installation

Install from conda

conda install zhengxinchang::isopedia

Check out the latest release

https://github.com/zhengxinchang/isopedia/releases

Note that isopedia-<version>.linux.tar.gz was compiled on Amazon Linux 2 with GCC 7.3, Glibc 2.26, and Binutils 2.29.1. It should work on most Linux distributions; if it does not run on yours, try isopedia-<version>.musl.tar.gz (statically linked with musl).

From source code

Rust, cargo, and musl are required for building the project from source.

git clone https://github.com/zhengxinchang/isopedia.git
cd isopedia
cargo build --release
cargo build --release --target x86_64-unknown-linux-musl

Contact

  • zhengxc93@gmail.com
  • fritz.sedlazeck@bcm.edu