DOCUMENTATION FOR OCPAT

Guozhen Liu, Munirul Islam, Monica Uddin, Derek Wildman

Center for Molecular Medicine and Genetics
Wayne State University School of Medicine
Detroit, MI 48201

Email: Derek Wildman, PhD


TABLE OF CONTENTS:
  1. OVERVIEW OF OCPAT
  2. DATA SOURCES
  3. OCPAT INPUT
  4. SAMPLE OCPAT OUTPUT
  5. HOW OCPAT WORKS
  6. RUNNING OCPAT FROM THE COMMAND LINE
  7. WEB PAGES
  8. REFERENCES
  9. OCPAT DATA

1. OVERVIEW OF OCPAT:

OCPAT is a pipeline that produces automated codon-preserved alignments of protein coding DNA sequences. Existing multiple alignment software, such as CLUSTALW and T-COFFEE, does not ensure codon integrity, so the resulting alignments often have to be curated by eye.

Other tools derive the reading frame from a reference species, but do not preserve the reading frame in the other taxa. These alignments are thus impractical for detecting natural selection and for inferring phylogenies. The release of draft genomes from more than one dozen vertebrate species now makes it practical to examine protein coding gene evolution rapidly and comprehensively. To address these issues, we developed OCPAT, a pipeline to automate the creation of codon-preserved alignments of putatively orthologous sequences. OCPAT can be run from the web interface or from the command line.


2. DATA SOURCES:

The current version of OCPAT aligns genes from Homo sapiens (human), Pan troglodytes (chimpanzee), Macaca mulatta (rhesus macaque), Mus musculus (mouse), Rattus norvegicus (rat), Oryctolagus cuniculus (rabbit), Canis familiaris (dog), Bos taurus (cow), Dasypus novemcinctus (armadillo), Loxodonta africana (elephant), Echinops telfairi (tenrec), Monodelphis domestica (opossum), Ornithorhynchus anatinus (platypus), Gallus gallus (chicken), and Xenopus tropicalis (frog). mRNA and/or cDNA files are downloaded from the RefSeq mRNA databases [2], the ENSEMBL cDNA databases [3], and the NR (non-redundant) mRNAs [4]. mRNA/cDNA sequences are then sorted by species, formatted, and indexed using the "formatdb" program [5]. The GenBank-formatted human mRNA and protein sequences are also downloaded from RefSeq [6]. For the analysis described, data were updated on November 2, 2006.


3. OCPAT INPUT:

The input to OCPAT is simply a list of human RefSeq gene IDs for the genes to be analyzed. Human RefSeq gene IDs have the format "NM_xxxxxx" or "XM_xxxxx", where each "x" is a digit (0-9) and the number of digits varies.

- Example Input File: RefSeq_IDs.data
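
The ID format described above can be checked before a list is submitted. The following is a minimal Python sketch, not part of OCPAT itself; the function name read_refseq_ids is hypothetical, and it assumes one ID per line with no version suffix (such as ".1").

```python
import re

# "NM_" or "XM_" followed by one or more digits (assumed: no version suffix).
REFSEQ_ID = re.compile(r"[NX]M_\d+")


def read_refseq_ids(lines):
    """Return the RefSeq IDs found in an iterable of lines, one ID per line.

    Blank lines are skipped; any other line that is not a valid ID raises
    ValueError so that typos are caught before a run is started.
    """
    ids = []
    for line_number, line in enumerate(lines, start=1):
        candidate = line.strip()
        if not candidate:
            continue
        if not REFSEQ_ID.fullmatch(candidate):
            raise ValueError(f"line {line_number}: not a RefSeq ID: {candidate!r}")
        ids.append(candidate)
    return ids
```

For an input file, the function can be applied to the open file handle, for example read_refseq_ids(open("RefSeq_IDs.data")).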


4. OCPAT OUTPUT FOR NM_020716:

You can also download the sample output files from here: OCPAT results for NM_020716

A sample output summary file from a whole genome analysis can be found here: OCPAT summary for all refseq genes with orthologs

You will receive two emails from www@homopan.med.wayne.edu: one giving access to all OCPAT results and one containing the OCPAT error file. OCPAT results remain available for download for 10 days. The error file contains the standard error output, which is used for debugging. If the run completes without problems, it contains no error messages (but you will still receive the error email).


5. HOW OCPAT WORKS:

1). When the list of human RefSeq IDs is submitted (either through the command line or the web interface), the human mRNAs, translated peptides, gene symbols, and CDS positions (within the mRNAs) are extracted directly from the human.rna.gbff file by looking up each RefSeq ID. mRNAs or CDSs from the other species are then retrieved by BLAST searches of the respective cDNA or mRNA databases, using the human CDS sequence as the query. When multiple sequences from one non-human organism show similarly high concordance to the human CDS (less than 5% difference in concordance, caused by paralogues or by similar gene predictions from ENSEMBL/RefSeq/Non-Redundant), the annotations of those genes are used together with the concordance ranking to make the best choice.

The concordance is calculated by the following formula:

Concordance = 2 * matched sequence length/(query sequence length + aligned subject sequence length)

Any aligned sequence shorter than half the length of the query sequence (in our case, the human gene sequence) is pre-eliminated.
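
The formula and the length filter can be illustrated with a short Python sketch. The function names are hypothetical and the example numbers are invented for illustration; treating the "5% difference" as an absolute difference of 0.05 in concordance is an assumption, not something the text above specifies.

```python
def concordance(matched_length, query_length, aligned_subject_length):
    """Concordance = 2 * matched / (query length + aligned subject length)."""
    return 2 * matched_length / (query_length + aligned_subject_length)


def passes_length_filter(aligned_length, query_length):
    """Hits shorter than half the query (human) sequence are discarded."""
    return aligned_length >= query_length / 2


def ambiguous_hits(scores, tolerance=0.05):
    """Return the hits whose concordance is within `tolerance` of the best.

    `scores` maps a hit name to its concordance. A tolerance of 0.05 is an
    assumed reading of the "less than 5% difference" rule.
    """
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if best - score < tolerance)
```

For example, a hit with 900 matched positions against a 1000-nucleotide query and a 950-nucleotide aligned subject has a concordance of 1800/1950, or about 0.92. If more than one hit is returned by ambiguous_hits, the gene annotations would be consulted as described in step 1.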

2). Human, chimpanzee, mouse, rat, cow, and dog CDS/mRNA sequences are pre-aligned using CLUSTALW (v1.83, Higgins et al., 1994). The alignments are then searched for places where only one sequence has a frame shift introduced by a nucleotide insertion or deletion while all the other genes are perfectly aligned to each other. Inserted nucleotides are deleted; conversely, deleted nucleotides (gaps in the multiple alignment) are filled with "N"s so that the subsequent translation does not shift frame. This pre-alignment and automatic correction of the collected CDSs allows full-length peptides to be translated correctly in the following step, and avoids partial peptide alignments caused by single-nucleotide insertion/deletion errors.
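
The correction in step 2 can be sketched in Python as follows. This is an illustrative simplification, not the OCPAT source: it assumes the pre-alignment is given as equal-length gapped strings with "-" as the gap character, it treats a run of single-sequence indel columns as a frame shift only when its length is not a multiple of three, and the function name is hypothetical.

```python
def correct_single_sequence_frameshifts(aligned):
    """Repair frame-shifting indels that occur in exactly one sequence.

    `aligned` maps sequence names to gapped sequences of equal length.
    A gap run in one sequence is filled with "N"; nucleotides present in
    only one sequence are removed. Runs whose length is a multiple of
    three keep the reading frame and are left untouched.
    """
    names = list(aligned)
    if len(names) < 3:
        raise ValueError("at least three aligned sequences are required")
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("aligned sequences must have equal length")

    # Label each column: ("del", name), ("ins", name), or None.
    labels = []
    for i in range(length):
        gapped = [name for name in names if aligned[name][i] == "-"]
        if len(gapped) == 1:
            labels.append(("del", gapped[0]))
        elif len(gapped) == len(names) - 1:
            present = next(name for name in names if aligned[name][i] != "-")
            labels.append(("ins", present))
        else:
            labels.append(None)

    corrected = {name: [] for name in names}
    i = 0
    while i < length:
        # Extend the run of consecutive columns sharing the same label.
        j = i
        while j < length and labels[j] == labels[i]:
            j += 1
        label = labels[i]
        frameshift = label is not None and (j - i) % 3 != 0
        for k in range(i, j):
            if frameshift and label[0] == "ins":
                continue  # drop the inserted nucleotide from every row
            for name in names:
                base = aligned[name][k]
                if frameshift and label[0] == "del" and name == label[1]:
                    base = "N"  # fill the deletion so the frame is kept
                corrected[name].append(base)
        i = j
    return {name: "".join(bases) for name, bases in corrected.items()}
```

With three sequences where only the chimpanzee row reads "ATGG-CAAA", the gap becomes "N"; an in-frame gap of three columns is left as it is.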

3). All the orthologous genes are translated into peptides, except for the human gene, whose peptide is taken directly from the human.rna.gbff file. The reading frame of each gene is determined by aligning its translation to the human peptide using the bl2seq program. Each sequence is then trimmed so that the reading frame begins with the first nucleotide.

4). The corresponding peptides for the orthologous genes are aligned using the CLUSTALW program. The aligned peptides are then "translated" back to their corresponding cDNA sequences by sequence mapping. Codon degeneracy is not an issue because the "reverse translation" is directed by the correlated CDS and peptide positions (e.g., following step 3, the Nth amino acid in the peptide maps back to nucleotides 3N-2, 3N-1, and 3N in the CDS). This translation, alignment, and "reverse translation" procedure generates alignments that preserve the codon frames.
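
The position mapping in step 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the OCPAT implementation; it assumes the CDS has already been trimmed to begin in frame (step 3) and that gaps in the peptide alignment are written as "-". The function name back_translate is hypothetical.

```python
def back_translate(aligned_peptide, cds):
    """Map a gapped peptide back onto its own in-frame CDS.

    Each residue is replaced by the codon it was translated from, and
    each peptide gap becomes a three-nucleotide gap, so codons are never
    split in the resulting nucleotide alignment.
    """
    residues = sum(1 for ch in aligned_peptide if ch != "-")
    if 3 * residues > len(cds):
        raise ValueError("CDS is too short for the peptide")
    codons = []
    aa_index = 0  # 0-based index of the next residue in the ungapped peptide
    for ch in aligned_peptide:
        if ch == "-":
            codons.append("---")
        else:
            # Residue N (1-based) covers nucleotides 3N-2 .. 3N.
            codons.append(cds[3 * aa_index:3 * aa_index + 3])
            aa_index += 1
    return "".join(codons)
```

For instance, the aligned peptide "M-AK" with the CDS "ATGGCCAAA" gives "ATG---GCCAAA".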

5). The codon-preserved alignments of cDNAs are then assessed for the core alignment region. The suboptimal alignments at the beginning and end of genes (due to poor predictions or sequence errors) are removed. The remaining core alignments always begin with the first nucleotide of a codon and end at the third nucleotide of a codon. The core alignments are then subject to PAML analysis.

The "core alignment" is determined in the following manner: A sliding window of three consecutive amino acids, beginning from the 5' end, is moved across the multiple alignment. The "identical count" is determined by counting the number of identical amino acids at each position in the three-amino-acid window. For a multiple alignment of N sequences, the maximum "identical count" per window is 3N. When the "identical count" reaches 2.2N, the first amino acid in the sliding window is marked as the start point of the "core alignment". The end of the "core alignment" is determined with the same strategy, sliding a window upstream from the 3' end.

We tested several thresholds corresponding to 50%, 60%, 70% and 80% of the maximum "identical count", and decided to use 2.2N, which is slightly over 70% identity in the alignments (2.2N/3N is about 73%).
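
The sliding-window rule can be sketched in Python as follows. This is one possible reading of the description, not the OCPAT code: it assumes that the number of identical amino acids in a column is the count of the most common residue in that column, that gaps ("-") do not count as identical, and that the 3' boundary is the last amino acid of the first qualifying window found from the 3' end. The function names are hypothetical.

```python
from collections import Counter


def column_identity(column):
    """Count of the most common residue in one column, ignoring gaps."""
    residues = [ch for ch in column if ch != "-"]
    if not residues:
        return 0
    return Counter(residues).most_common(1)[0][1]


def core_alignment_bounds(aligned_peptides, window=3, factor=2.2):
    """Return (start, end) of the core alignment in amino-acid columns.

    `start` is inclusive and `end` is exclusive. Returns None when no
    window reaches the threshold of factor * N identical residues.
    Multiply both values by three for nucleotide coordinates.
    """
    n = len(aligned_peptides)
    length = len(aligned_peptides[0])
    counts = [column_identity([seq[i] for seq in aligned_peptides])
              for i in range(length)]
    threshold = factor * n

    def qualifies(i):
        return sum(counts[i:i + window]) >= threshold

    # Slide from the 5' end to find the first qualifying window.
    start = next((i for i in range(length - window + 1) if qualifies(i)), None)
    if start is None:
        return None
    # Slide from the 3' end upstream; a qualifying window is known to exist.
    end = next(i + window for i in range(length - window, -1, -1) if qualifies(i))
    return start, end
```

Because the boundaries are expressed in whole amino-acid columns, the corresponding nucleotide region always begins at the first nucleotide of a codon and ends at the third nucleotide of a codon.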

The core alignments are then re-organized into the format required by the PAML program.
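
As an illustration of this re-formatting step, the Python sketch below writes a codon alignment in the sequential PHYLIP-style layout that PAML reads: a header line with the number of sequences and the alignment length, followed by one line per sequence with the name separated from the sequence by two spaces. The exact layout OCPAT produces is not stated here, so this is an assumption, and the function name is hypothetical.

```python
def to_paml_sequential(alignment):
    """Format a codon alignment (name -> gapped sequence) for PAML.

    All sequences must have the same length, and that length must be a
    multiple of three so that every column belongs to a complete codon.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    lines = [f" {len(alignment)} {length}"]
    for name, seq in alignment.items():
        # Two spaces separate the sequence name from the sequence.
        lines.append(f"{name}  {seq}")
    return "\n".join(lines) + "\n"
```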

6). The multiple alignments are written out in several formats.


6. RUNNING OCPAT FROM THE COMMAND LINE:

OCPAT currently runs only on Mac OS X and Unix/Linux platforms. It does not yet run on Windows. To run OCPAT from the command line, follow these steps:
  1. Make sure Perl is installed on your system. If not, you may download Perl from www.activestate.com. The ActivePerl 5.8 Online Documentation is an excellent resource for platform-specific installation instructions.
  2. Make sure the NCBI BLAST utilities and ClustalW are installed.
  3. Create a new directory named "ocpat".
  4. Download the OCPAT source code.
  5. Download the data (3.73 GB) and unpack it under the "ocpat" directory.
  6. Create an input file (example: RefSeq_IDs.data). Each line contains one human RefSeq gene ID to be analyzed. Human RefSeq gene IDs have the format "NM_xxxxxx" or "XM_xxxxx", where each "x" is a digit (0-9) and the number of digits varies.
  7. run the following command:

7. WEB PAGES:

OCPAT web server
http://homopan.wayne.edu/pise/ocpat

http://homopan.wayne.edu/OCPAT/index.html
(Contains links to other OCPAT-related pages)

8. REFERENCES:

1. Guozhen Liu, Monica Uddin, Juan C. Opazo, Munirul Islam, Roberto Romero, Lawrence I. Grossman, Morris Goodman, Derek E. Wildman. OCPAT: an online codon-preserved alignment tool for evolutionary genomic analysis of protein coding sequences. "BMC: Source Code for Medicine and Biology", submitted.

2. RefSeq mRNA databases: * is the organism name. [ ftp://ftp.ncbi.nih.gov/refseq/*/mRNA_Prot/]

3. ENSEMBL cDNA databases: * is the organism name [ ftp://ftp.ensembl.org/pub/release-41/*/data/fasta/cdna/]

4. non-redundant (NR) mRNA database [ ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz]

5. "formatdb" program (bundled with the blast package): * is the organism name. [ ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.13/blast2.2.13-*.tar.gz]

6. Human RefSeq database [ ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gbff.gz]


Copyright: no restriction to all users


Funding Sources

This work was supported in part by the Intramural Research Division of the National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, and by the National Science Foundation (BCS 0550209).