ArrayOligoSelector

Last updated October 1, 2003

Jingchun Zhu, Zbynek Bozdech, Joe DeRisi, DeRisi Lab, UCSF


The complete genomic sequences of an increasing number of organisms are becoming available. To exploit these new resources, we have developed a program, ArrayOligoSelector, to systematically design gene-specific long oligonucleotide probes for entire genomes, for the purpose of developing whole-genome microarrays. For each open reading frame, the program optimizes the oligo selection based on several parameters, including uniqueness in the genome, sequence complexity, lack of self-binding, GC content and proximity to the 3' end of the gene. We have used this program to generate oligonucleotides for the genome of Plasmodium falciparum. The malaria chip has been used to study the differential gene expression of the schizont and trophozoite stages and the transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum.

Publications:

download program here

License:
ArrayOligoSelector is freely available. However, if you intend to use this program for any commercial purpose AND to use the "blat" or "gfclient" options, you need to obtain a license yourself from the authors of those programs (kent@soe.ucsc.edu). If you do not have the appropriate license for blat or gfclient, please do not use those options.

Program work flow:
ArrayOligoSelector includes two sub-programs that run in series. The first sub-program is the "computation program", which calculates scores for uniqueness, sequence complexity, lack of self-binding and GC content for every candidate oligo. The detailed scoring schemes are as follows:

Scores generated by the first sub-program are stored in a series of output files, which are used by the second sub-program, the "selection program", to select oligos that are unique for the gene, have a low level of internal repeats and self-annealing tendency, and fall within a narrow range around the target GC percentage specified by the user.

ArrayOligoSelector chooses an optimal set of ranked oligos by the following means. The uniqueness filter passes all oligos that satisfy one of two criteria: a user-defined binding energy threshold or, alternatively, membership in the top 5% of candidate oligos within 5 kcal/mol of the candidate with the best (least stable) binding energy. In parallel, the sequence complexity filter and the self-binding filter each pass a given oligo if its score falls below the 33rd percentile of scores for the target open reading frame. The sets of sequences emerging from these two filters and from the uniqueness filter are then compared. If one or more sequences lie in the intersection of the three filters, those sequences pass on to final selection. If the intersection is empty, the self-binding and complexity filters are incrementally relaxed until an intersection becomes available. Candidate oligos present in the intersection set are subjected to the %GC filter. Initially, oligos pass if they meet the user-specified %GC. If no oligos pass, the target %GC range is relaxed by one percentage point in each direction until one or more oligos pass. As a final step, all remaining candidate oligos are ranked by their proximity to the 3' end of the gene and the optimum oligos are selected.
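The selection steps described above can be sketched in a few lines of modern Python 3. This is only an illustration under stated assumptions, not the code shipped with ArrayOligoSelector: the Candidate record, the function names, the nearest-rank percentile, the reading of "top 5% within 5 kcal/mol" as "both in the top 5% and within 5 kcal/mol of the best", and the one-point relaxation step for the percentile filters are all hypothetical choices made for the example. Binding energies are taken as negative kcal/mol values, so a higher value means a less stable off-target duplex.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Candidate:                 # hypothetical record, one per candidate oligo
    start: int                   # 0-based offset of the oligo in the ORF (5' to 3')
    energy: float                # off-target binding energy, kcal/mol (higher = less stable)
    complexity: float            # sequence complexity score, lower is better
    self_binding: float          # self-binding score, lower is better
    gc: float                    # GC content in percent


def percentile_cutoff(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the value at or below which pct% of values fall."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(len(ordered) * pct / 100)))
    return ordered[rank - 1]


def uniqueness_pass(cands: List[Candidate],
                    user_threshold: Optional[float] = None) -> Set[int]:
    """Start positions of oligos that pass the uniqueness filter."""
    ranked = sorted(cands, key=lambda c: c.energy, reverse=True)
    best = ranked[0].energy
    n_top = max(1, math.ceil(len(ranked) * 0.05))   # size of the top 5%
    passed = set()
    for i, c in enumerate(ranked):
        meets_user = user_threshold is not None and c.energy >= user_threshold
        near_best = i < n_top and best - c.energy <= 5.0
        if meets_user or near_best:
            passed.add(c.start)
    return passed


def select_oligos(cands: List[Candidate], target_gc: float,
                  gc_tolerance: float = 0.0,
                  user_threshold: Optional[float] = None) -> List[Candidate]:
    """Return the surviving candidates, closest to the 3' end first."""
    unique = uniqueness_pass(cands, user_threshold)

    # Intersect the three filters, relaxing the two percentile filters
    # one point at a time (assumed step size) until something survives.
    pct = 33
    while True:
        cx_cut = percentile_cutoff([c.complexity for c in cands], pct)
        sb_cut = percentile_cutoff([c.self_binding for c in cands], pct)
        pool = [c for c in cands
                if c.start in unique
                and c.complexity <= cx_cut
                and c.self_binding <= sb_cut]
        if pool or pct >= 100:
            break
        pct += 1

    # Widen the %GC window by one percentage point per round.
    tol = gc_tolerance
    final = [c for c in pool if abs(c.gc - target_gc) <= tol]
    while not final:
        tol += 1.0
        final = [c for c in pool if abs(c.gc - target_gc) <= tol]

    # A larger start offset means the oligo sits closer to the 3' end.
    return sorted(final, key=lambda c: c.start, reverse=True)
```

Calling select_oligos(candidates, target_gc=28.0) on a non-empty list of Candidate records for one open reading frame returns the surviving oligos ordered from the 3' end; taking the first entries of that list corresponds to the final ranking step.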

Parameters that can be manipulated by the user

What does the user need to provide?

Users need to provide two sequence files: the input sequences for which oligos are to be designed, and the complete genome. Both should be DNA sequences in FASTA format. The complete genome file can be either the complete set of genes in the genome (exons or ORFs) or the complete genomic sequence, including exons, introns and intergenic regions. ArrayOligoSelector provides two versions of the programs (an exon version and a contig version), one for each scenario. Please refer to the RUN section to find out which program you should use, and to the next section for the difference in implementation between the two versions. In the case of a partial genome, ArrayOligoSelector will find oligos that are unique within the incomplete genome. Users should bear in mind that such oligos might still have similar sequences in the rest of the genome.
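Before starting a long run it can be useful to check that both files really contain DNA in FASTA format. The helper below is a minimal sketch and is not part of ArrayOligoSelector; the function names and the accepted alphabet (A, C, G, T and N) are assumptions made for this example, and the program itself may accept other characters.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record name: uppercase sequence}."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""   # name = first word of the header
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in parts.items()}


def non_dna_records(records: Dict[str, str], alphabet: str = "ACGTN") -> List[str]:
    """Names of records containing characters outside the assumed DNA alphabet."""
    allowed = set(alphabet)
    return sorted(n for n, seq in records.items() if set(seq) - allowed)
```

With an open file handle f, non_dna_records(read_fasta(f)) lists the records that look like protein or otherwise non-DNA sequence.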

How are the "self" regions defined in the contig version of the program?

The contig version of the program identifies the "self" regions of each input sequence, i.e. the genomic locations from which the input sequence originates. The "self" regions are excluded from the uniqueness score calculations. To identify them, the program first uses blat or blastn to align the input sequence with the genomic sequences, and then picks out the alignment segments with 100% identical base pairing. All combinations of the identical segments are stitched back together through a heuristic approach. Only the combinations that regenerate the original input sequence in the correct order are classified as "self" regions. If an input sequence has multiple genomic "self" regions, all of them will be identified. The identified regions are recorded in the file "groupfile" in the program's root directory.
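The stitching idea can be illustrated with a small Python sketch. It is not the heuristic used by ArrayOligoSelector, whose details are not given here. It assumes that the perfect-match segments have already been extracted from the blat or blastn output, that they all lie on the same contig and strand, and that a valid "self" region is a chain of segments that tiles the input sequence end to end without overlap while moving forward along the genome (gaps on the genome side correspond to introns). The Segment layout and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# (query_start, query_end, genome_start), half-open query interval,
# each segment being a 100% identical, ungapped match.
Segment = Tuple[int, int, int]


def find_self_regions(segments: List[Segment], query_length: int) -> List[List[Segment]]:
    """Return every ordered chain of segments that rebuilds the whole query."""
    by_query_start: Dict[int, List[Segment]] = {}
    for seg in segments:
        if seg[1] > seg[0]:                      # ignore empty segments
            by_query_start.setdefault(seg[0], []).append(seg)

    chains: List[List[Segment]] = []

    def extend(chain: List[Segment], q_pos: int, g_pos: int) -> None:
        if q_pos == query_length:                # query fully covered
            chains.append(list(chain))
            return
        for seg in by_query_start.get(q_pos, []):
            q_start, q_end, g_start = seg
            if g_start >= g_pos:                 # genome order must follow query order
                chain.append(seg)
                extend(chain, q_end, g_start + (q_end - q_start))
                chain.pop()

    extend([], 0, 0)
    return chains
```

Each returned chain corresponds to one candidate "self" region; a sequence that occurs at several genomic locations yields several chains.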

Platform: Linux. The program has been tested on Red Hat Linux 6.1, 6.2, 8.0 and 9.0. In theory, the program can be ported to other UNIX environments. To do so, users need to replace the following executable files with versions built for their platform: blastall (NCBI), formatdb (NCBI), blat (UCSC) and gfclient (UCSC).

Python Interpreter: For ArrayOligoSelector version 3.2 and above, Python interpreter version 2.2 or above is REQUIRED. For earlier versions, it is recommended but not required. You can download the interpreter from http://www.python.org .

RUN

Output

Speed
The program takes 12 hours to design gene-specific 70mer oligos for the 12 MB of Plasmodium falciparum coding sequences on a dual-CPU 700 MHz Linux computer.

Reference

The project is hosted at SourceForge.net