ArrayOligoSelector
Last updated October 1, 2003
Jingchun Zhu, Zbynek Bozdech, Joe DeRisi, DeRisi Lab, UCSF
Publications:
Licence:
ArrayOligoSelector is freely available.  However, if you intend to use this program for any commercial
purpose AND to use the "blat" or "gfclient" options, you need to obtain a license from the authors (kent@soe.ucsc.edu).
Program work flow:
ArrayOligoSelector includes two sub-programs that run in series. The
first sub-program is the "computation program", which calculates scores for
uniqueness, sequence complexity, lack of self-binding and GC content for every
candidate oligo. The detailed scoring schemes are as follows:
Scores generated by the first sub-program are stored in a series of output files, which are used by the second sub-program, the "selection program", to select oligos that are unique for the gene, have a low level of internal repeats and self-annealing tendency, and fall within a narrow range of the target GC percentage specified by the user.
ArrayOligoSelector chooses an optimal set of ranked oligos by the following means. The uniqueness filter passes all oligos that satisfy one of two criteria: a user-defined binding energy threshold, or alternatively, the top 5% of candidate oligos within 5 kcal/mol of the candidate with the best (least stable) binding energy. In parallel, the sequence complexity filter and the self-binding filter each pass a given oligo if its score falls below the 33rd percentile of scores for the target open reading frame. The sets of sequences emerging from these two filters and from the uniqueness filter are then compared. If one or more sequences lie in the intersection of the three filters, those sequences pass on to final selection. If the intersection is empty, the self-binding and complexity filters are incrementally relaxed until an intersection becomes available.
Candidate oligos present in the intersection set are then subjected to the %GC filter. Initially, oligos pass if they meet the user-specified %GC. If no oligos pass, the target %GC range is relaxed by one percentage point in each direction until one or more oligos pass. As a final step, all remaining candidate oligos are ranked by their proximity to the 3' end of the gene and the optimum oligos are selected.
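The filter cascade described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's actual code: the `Oligo` fields, the function names, the convention that lower complexity and self-binding scores are better, the 5-point relaxation step, the zero starting %GC tolerance, and the reading of the "top 5% within 5 kcal/mol" rule as "in the top 5% by energy AND within 5 kcal/mol of the best" are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Oligo:
    start: int           # start position in the target sequence (larger = nearer the 3' end)
    energy: float        # binding energy in kcal/mol; higher (less stable) = more unique
    complexity: float    # sequence complexity score; lower assumed better
    self_binding: float  # self-binding score; lower assumed better
    gc: float            # percent GC


def percentile_cutoff(values, pct):
    """Nearest-rank percentile: the value at rank ceil(n * pct / 100)."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(len(ordered) * pct / 100)))
    return ordered[rank - 1]


def uniqueness_pass(oligos, user_threshold=None):
    """Pass oligos meeting the user threshold, or in the top 5% within 5 kcal/mol of the best."""
    ranked = sorted(oligos, key=lambda o: o.energy, reverse=True)
    best = ranked[0].energy
    top_starts = {o.start for o in ranked[:max(1, math.ceil(len(ranked) * 0.05))]}
    passed = []
    for o in oligos:
        meets_user = user_threshold is not None and o.energy >= user_threshold
        near_best = o.start in top_starts and best - o.energy <= 5.0
        if meets_user or near_best:
            passed.append(o)
    return passed


def select(oligos, target_gc, user_threshold=None, n_pick=1, step=5):
    """Return up to n_pick oligos, ranked by proximity to the 3' end."""
    if not oligos:
        return []
    unique = uniqueness_pass(oligos, user_threshold)

    # Complexity and self-binding filters start at the 33rd percentile and
    # are relaxed (by an assumed `step` points) until the intersection is non-empty.
    pct = 33
    while True:
        c_cut = percentile_cutoff([o.complexity for o in oligos], pct)
        s_cut = percentile_cutoff([o.self_binding for o in oligos], pct)
        pool = [o for o in unique
                if o.complexity <= c_cut and o.self_binding <= s_cut]
        if pool or pct >= 100:
            break
        pct = min(100, pct + step)

    # %GC filter: widen the window by one percentage point per round.
    tol = 0.0
    hits = []
    while pool and not hits:
        hits = [o for o in pool if abs(o.gc - target_gc) <= tol]
        tol += 1.0

    # Final ranking: oligos closest to the 3' end first.
    return sorted(hits, key=lambda o: o.start, reverse=True)[:n_pick]
```

With four hypothetical candidates and a user threshold of -15 kcal/mol, `select(candidates, target_gc=30.0, user_threshold=-15.0)` relaxes the percentile cut-off until at least one unique oligo survives, then widens the %GC window one point at a time.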
Parameters can be manipulated by the user
What does the user need to provide?
Users need to provide two sequence files: the input sequences for which oligos are to be designed, and the complete genome. Both should be DNA sequences in FASTA format. The complete genome sequences can be either the complete set of genes in the genome (exons or ORFs), or the complete genomic sequences, including exons, introns and intergenic regions.  Two different versions of the programs (the exon version and the contig version) are provided in ArrayOligoSelector, one for each scenario.  Please refer to the RUN section to find out which program you should use, and to the next section for the differences in implementation between the two versions.  In the case of partial genomes, ArrayOligoSelector will find oligos that are unique within the incomplete genome.  Users should bear in mind that such oligos might have similar sequences in the remaining part of the genome.
How are the "self" regions defined in the contig version of the program?
The contig version of the program identifies the "self" regions of each input sequence, i.e. the genomic locations from which the input sequence originates.  The "self" regions are excluded from the uniqueness score calculations.  They are identified by first using blat or blastn to align the input sequence with the genomic sequences, and then identifying the alignment segments with 100% identical base pairing.  All combinations of the identical segments are stitched back together through a heuristic approach.  Only the combinations that regenerate the original input sequence in the correct order are classified as "self" regions.  If an input sequence has multiple genomic "self" regions, all of them will be identified.  The identified regions are recorded in the file "groupfile" in the program's root directory.
Platform: Linux.  The program has been tested on Redhat Linux 6.1, 6.2, 8.0 and 9.0.  The program can theoretically be ported to other UNIX environments; to do so, users need to replace the following executable files: blastall (NCBI), formatdb (NCBI), blat (UCSC) and gfclient (UCSC).
Python Interpreter: For ArrayOligoSelector version 3.2 or above, Python interpreter version 2.2 or above is REQUIRED. For previous versions, it is recommended but not required.  You can download the interpreter from http://www.python.org .
RUN
To run the programs, type "./Pick70_script1" or "./Pick70_script1_contig" on the command line in the program's root directory, and the command line usage will be printed on the screen.  Three command line arguments are required: the filename of the input sequence file, the filename of the genome sequence file, and the length of the oligo (e.g. 70).  The first sub-program writes its results to disk as a series of output files called "output0, 1, 2, ... ". Two test files are provided with the release: "test_input" and "test_genome".  The following are examples of the usage: ./Pick70_script1 test_input test_genome 70 and ./Pick70_script1_contig test_input test_genome 70.
Since version 3.2, the contig version of the sub-program has changed the way it finds the cognate genomic locations of the input sequences.  Versions before 3.2 used BLAST to identify the genomic origins of each input sequence.  In version 3.2 or later, the user can choose BLAST, BLAT or "gfclient" for this step.  Both BLAT and gfclient are BLAST-like alignment tools ideal for quickly aligning exons to genomes (Kent 2002).  ArrayOligoSelector runs much faster if either BLAT or gfclient is used.  While BLAT and gfclient are essentially the same, gfclient requires setting up the gfServer in advance, and BLAT requires more memory.
The contig version of the first sub-program from version 3.2 onwards requires an additional (fourth) command line argument to specify the method used to identify self locations in the genome. The fourth argument takes one of the constant strings "blast", "blat" or "gfclient".  Here is an example of the usage: ./Pick70_script1_contig test_input test_genome 70 blat .
A new feature was added after version 3.4 (in both the contig and exon versions) of the first sub-program to exclude lower case sequences from the calculation process. Oligos containing such sequences are flagged in the outputs (output1, 2, ...) and are therefore also excluded from the selection process of the second sub-program. This feature can be used, for example, in combination with the popular repeat masking program "RepeatMasker" to exclude highly repetitive sequence segments such as the Alu element in the human genome, which speeds up the program dramatically.  An additional command line argument is required to specify whether lower case sequences will be excluded; it takes the constant string "yes" or "no".  Here is an example of the usage: ./Pick70_script1_contig test_input test_genome 70 no blat .
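The idea behind the lower case exclusion can be illustrated with a short Python sketch. It assumes that a candidate oligo is flagged as soon as any base in its window is lower case; the function name `flag_masked_windows` and the returned `(start, is_masked)` pairs are hypothetical and do not reflect the program's actual output format.

```python
def flag_masked_windows(seq, oligo_len=70):
    """Return (start, is_masked) for every candidate window of length oligo_len.

    A window counts as masked if it contains at least one lower case base,
    e.g. a base soft-masked by a repeat masking program.
    """
    flags = []
    for start in range(len(seq) - oligo_len + 1):
        window = seq[start:start + oligo_len]
        flags.append((start, any(base.islower() for base in window)))
    return flags


# Toy example with a 4-mer window: only windows free of lower case bases remain.
usable = [start for start, masked in flag_masked_windows("ACGTacgtACGT", 4)
          if not masked]
```

Skipping flagged windows before any alignment or energy calculation is what would make masking repetitive regions pay off in run time.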
Usage Examples:
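The invocations described in the RUN section can be collected in a small helper that assembles the argument list. This is a sketch, not part of ArrayOligoSelector: the function `build_command` and its parameters are hypothetical, and the position of the "yes"/"no" argument in the exon version is an assumption, since the text above only shows it for the contig version.

```python
def build_command(input_fasta, genome_fasta, oligo_len=70,
                  contig=False, mask_lowercase=None, method=None):
    """Assemble the argument list for the first sub-program.

    mask_lowercase: None for versions without the lower case option,
                    otherwise True ("yes") or False ("no").
    method:         None, or "blast", "blat" or "gfclient" (contig version only).
    """
    script = "./Pick70_script1_contig" if contig else "./Pick70_script1"
    args = [script, input_fasta, genome_fasta, str(oligo_len)]
    if mask_lowercase is not None:
        args.append("yes" if mask_lowercase else "no")
    if method is not None:
        if not contig:
            raise ValueError("the alignment method applies to the contig version only")
        if method not in ("blast", "blat", "gfclient"):
            raise ValueError("method must be 'blast', 'blat' or 'gfclient'")
        args.append(method)
    return args


# Reproduces: ./Pick70_script1_contig test_input test_genome 70 no blat
cmd = build_command("test_input", "test_genome", 70,
                    contig=True, mask_lowercase=False, method="blat")
```

The resulting list can be passed to `subprocess.run(cmd)` from the program's root directory.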
Output
Speed
The program takes 12 hours to design gene-specific
70mer oligos for the 12 MB of Plasmodium falciparum coding sequences on a
dual-CPU 700 MHz Linux computer.
Reference