Solexa Eland Export format to BED file
Here is a perl script which converts from Eland export format to BED format.
#!/usr/bin/perl # Program to convert eland export format to BED format # for running MACS # Chris Seidel, June 2009 # # Requires tab delim file of chromosome or contig names # (eland fa match files) in the format: # UCSC_chr_name chr_length eland_name # corrects for alignments that go off the ends of the chrs # negative bases are trimmed to 1, # bases > chr_length are set to chr_length # (I know the former exist, I don't know if the latter exist) # results are not sorted, but can be sorted in linux by: # sort -o infile.bed -k 1,1 -k 2,2n infile.bed # (sort in place, first column, then by second column numeric) die("usage: $0 chrmap.txt eland_export.txt") unless(scalar(@ARGV) == 2); # create output filename $outfile = $ARGV[1]; $outfile =~ s/\.txt$/\.bed/; open(FOUT, ">$outfile") || die("can't open output file: $outfile"); # get info on chromosomes open(cmap, $ARGV[0]) || die("no chromosome name mapping file!"); %chrmap = {}; while($line = <cmap>){ chomp($line); ($newval, $size, $oldval) = split(/\t/, $line); $chrmap{$oldval} = $newval; $chrsize{$oldval} = $size; } # open input file open(fp, $ARGV[1]) || die("can't open eland file"); $lines = 0; while(<fp>){ chop; ++$lines; @bits = split(/\t/); # skip reads that didn't pass filtering next if($bits[21] eq "N"); # get match name $seqname = $bits[10]; # skip No Matches or QC failures # next if($seqname =~ /NM|QC/); # skip repeat matches # next if($seqname =~ /\d+:\d+:\d+/); # we're only interested in sequences that match our chrs next unless(exists($chrmap{$seqname})); $seqlen = length($bits[8]); $start = $bits[12]; $end = $start + $seqlen - 1; $strand = $bits[13]; # parse match descriptor $n = ($bits[14] =~ tr/[ACGTN]/[ACGTN]/); # skip reads beyond a certain threshold next if($n > 2); $read_code = "U".$n; # correct for alignments off the chromosome ends if( $start <= 0 ){ print STDERR "start less than or equal to 0: ", $start, "\n"; print STDERR join("\t", @bits), "\n"; $start = 1; } if($end > $chrsize{$seqname}){ print STDERR "end greater than chr end $chrsize{$seqname}: $end, diff: ", $end - $chrsize{$seqname}, "\n"; print STDERR join("\t", @bits), "\n"; $end = $chrsize{$seqname}; } if($strand eq "F"){ $strand = "+"; $color = "0,0,255"; } else{ $strand = "-"; $color = "255,0,0"; } $score = 0; print FOUT join("\t", $chrmap{$seqname}, $start, $end, $read_code, $score, $strand, $start, $end, $color), "\n"; # give some feedback print STDERR "$lines processed\n" if(!($lines % 100000)); } close(FOUT); print STDERR "output file: $outfile\n";
it takes a companion file of chromosome descriptions. This file is used by the program to map the chromosome names used for the alignment - which are often big ugly names to nice short chromosome names. It also has the length of each chromosome. It's a tab delimited text file that you create for your organism to reflect the alignment values.
chr2L 23011544 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2L.fa chr2LHet 368872 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2LHet.fa chr2R 21146708 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2R.fa chr2RHet 3288761 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.2RHet.fa chr3L 24543557 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3L.fa chr3LHet 2555491 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3LHet.fa chr3R 27905053 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3R.fa chr3RHet 2517507 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.3RHet.fa chr4 1351857 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.4.fa chrX 22422827 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.X.fa chrXHet 204112 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.XHet.fa chrYHet 347038 Drosophila_melanogaster.BDGP5.4.54.dna.chromosome.YHet.fa