Transposon Insertion Finder (TIF) - An efficient detector for insertions of transposable elements from NGS data

TIF detects transpositions of transposable elements from NGS sequence data. This page explains how to use TIF, using the detection of transpositions of the rice endogenous retrotransposon Tos17 as an example.
Code: https://github.com/akiomiyao/tif
Paper: Nakagome M, Solovieva E, Takahashi A, Yasue H, Hirochika H, Miyao A (2014) Transposon Insertion Finder (TIF): a novel program for detection of de novo transpositions of transposable elements. BMC Bioinformatics 15:71.

Tos17 across the century

Tos17 is an endogenous retrotransposon of rice. It is active only in cultured cells of the Japonica rice cultivar Nipponbare (Hirochika et al. 1996). Taking advantage of this property, we created a mutant panel of about 50,000 lines from rice plants regenerated from cultured cells (Miyao et al. 2003). Insertion sites of Tos17 were determined by suppression PCR and TAIL-PCR and are published in the mutant panel database. Tos17 insertion mutant lines have been distributed to researchers around the world and are useful for functional analysis of genes.

Before the development of NGS technology, the DNA fragment containing the sequence flanking a Tos17 insertion point was amplified with either a primer set (one primer matching the sequence near the end of Tos17 and one primer with a special modification at its end) or a short primer combined with elaborate reaction cycles. Amplified fragments were separated by agarose gel electrophoresis, and DNA recovered from the gel block containing the fragment was used as a template to determine the flanking sequence of the Tos17 insertion point by the Sanger method.

We tried to determine the insertion positions of Tos17 in all 50,000 lines. This was costly and took several years. Furthermore, amplification of flanking sequences depends on the positions of restriction enzyme recognition sites or on chance primer annealing, which makes it difficult to cover all insertion sites.

Tos17 in this century with NGS

NGS technology has made it possible to sequence the entire genome of individual rice plants. The junctions created by a Tos17 insertion should be present among the large number of short reads, and reads spanning a junction can be selected simply by searching for the sequences at both ends of Tos17. Insertion of Tos17 is known to duplicate 5 bases at the target site. An insertion site can therefore be detected by finding a pair of 5'-end and 3'-end flanking sequences that share the same 5-base target site duplication (TSD). TIF selects short reads containing the 5'-end or 3'-end sequence of the transposable element, pairs 5'-end and 3'-end flanking sequences that have the same TSD, and then outputs the insertion position on the reference genome by mapping each pair of flanking sequences.
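The selection and pairing steps described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the code of tif.pl: the function names flanks and pair_by_tsd and the min_flank parameter are hypothetical, reads are searched on one strand only (real data also contain reverse-complemented reads), and the mapping of paired flanks to the reference genome is left out.

```python
from collections import defaultdict

HEAD = "TGTTAAATATATATACA"  # 5'-end (head) sequence of Tos17, from this page
TAIL = "TTGCAAGTTAGTTAAGA"  # 3'-end (tail) sequence of Tos17, from this page
TSD_SIZE = 5                # Tos17 duplicates 5 bases at the target site


def flanks(reads, head=HEAD, tail=TAIL, min_flank=20):
    """Collect genomic sequence before the head and after the tail.

    min_flank is a hypothetical cutoff that drops flanks too short to map.
    """
    head_flanks, tail_flanks = [], []
    for read in reads:
        i = read.find(head)
        if i >= min_flank:
            head_flanks.append(read[:i])  # this flank ends with the TSD
        j = read.find(tail)
        if j >= 0 and len(read) - (j + len(tail)) >= min_flank:
            tail_flanks.append(read[j + len(tail):])  # this flank starts with the TSD
    return head_flanks, tail_flanks


def pair_by_tsd(head_flanks, tail_flanks, tsd_size=TSD_SIZE):
    """Group flanks by TSD and keep TSDs seen on both sides of the element."""
    by_tsd = defaultdict(lambda: ([], []))
    for flank in head_flanks:
        by_tsd[flank[-tsd_size:]][0].append(flank)
    for flank in tail_flanks:
        by_tsd[flank[:tsd_size]][1].append(flank)
    return {tsd: pair for tsd, pair in by_tsd.items() if pair[0] and pair[1]}


if __name__ == "__main__":
    # Toy reads built from flanks of the chr03:741222 insertion listed in Table 1.
    toy_reads = [
        "CAACGTTGGTGGTGGCAGC" + HEAD + "AAAA",
        "GGGG" + TAIL + "GCAGCCAGGGTTCCATCCCTTC",
    ]
    heads, tails = flanks(toy_reads, min_flank=10)
    for tsd, (hfs, tfs) in pair_by_tsd(heads, tails).items():
        print(tsd, len(hfs), len(tfs))
```

A TSD shared by chance between unrelated flanks would also be paired here, which is why mapping both flanks to the reference is still needed to confirm a site.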

Detection of Tos17 insertion from NGS data

Once NGS data for a Tos17 insertion mutant are available, the insertion sites can be obtained easily with the TIF program. A computer running Unix (Ubuntu works well), the NGS data, and the rice reference genome sequence (IRGSP-1.0) are required.

The following is an example.

fastq-dump (from the NCBI SRA Toolkit) is required to download the short read sequences; see the fastq-dump setup instructions.
The git clone and the wget download of the reference are needed only for the first analysis.

$ git clone https://github.com/akiomiyao/tif.git
$ cd tif
$ mkdir ttm5
$ mkdir ttm5/read
$ cd ttm5/read
$ fastq-dump --split-files SRR556174   # fastq-dump takes a long time.
$ fastq-dump --split-files SRR556175

$ cd ../..
$ wget https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_genome.fasta.gz
$ perl tif.pl IRGSP-1.0_genome.fasta.gz ttm5 TGTTAAATATATATACA TTGCAAGTTAGTTAAGA
or
$ perl tif.pl ref=IRGSP-1.0_genome.fasta.gz,target=ttm5,head=TGTTAAATATATATACA,tail=TTGCAAGTTAGTTAAGA (New format for options)

'ttm5' is one of the Tos17 insertion lines (line name: ND6834). The NGS data of ttm5 were acquired for the analysis of somaclonal variation. ND6834 was harvested as a regenerated plant in October 1998. The plant was regenerated from Nipponbare callus that was induced in October 1997 and kept until March 1998 in liquid culture in N6 medium containing 2,4-dichlorophenoxyacetic acid at a concentration of 1 mg/L. The plant analyzed by NGS is of the M4 generation, two generations after the harvested seeds. SRR556174 and SRR556175 are the runs of ttm5 sequenced on the Illumina HiSeq 2000 system.

Running tif.pl without arguments prints its usage.
$ perl tif.pl
The '$' at the head of a line indicates the terminal prompt.

$ perl tif.pl reference target_directory 5'-end_of_TE_sequence 3'-end_of_TE_sequence
or
$ perl tif.pl ref=reference,target=target_directory,head=5'-end_of_TE_sequence,tail=3'-end_of_TE_sequence

In the new option format, spaces after the commas are not allowed.
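When tif.pl is launched from a script, the new-format option string can be assembled programmatically so that no stray space slips in. The sketch below is a hedged illustration: tif_options and tif_command are hypothetical helper names, and only the four keys shown on this page (ref, target, head, tail) are assumed to exist.

```python
def tif_options(ref, target, head, tail):
    """Join the four options as comma-separated key=value pairs without spaces."""
    values = {"ref": ref, "target": target, "head": head, "tail": tail}
    for key, value in values.items():
        # A space or comma inside a value would break the option string.
        if "," in value or any(ch.isspace() for ch in value):
            raise ValueError(f"{key} must not contain spaces or commas: {value!r}")
    return ",".join(f"{key}={value}" for key, value in values.items())


def tif_command(ref, target, head, tail):
    """Return an argument list suitable for subprocess.run()."""
    return ["perl", "tif.pl", tif_options(ref, target, head, tail)]


if __name__ == "__main__":
    print(" ".join(tif_command(
        "IRGSP-1.0_genome.fasta.gz", "ttm5",
        "TGTTAAATATATATACA", "TTGCAAGTTAGTTAAGA",
    )))
```

Passing the result as a list to subprocess.run avoids shell quoting issues; the command has to be run from the directory that contains tif.pl.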

For the analysis of rice, IRGSP-1.0_genome.fasta.gz is the reference genome sequence.
NGS fastq files should be saved in the target/read directory.
Tos17 5'-end (Head) sequence: TGTTAAATATATATACA
Tos17 3'-end (Tail) sequence: TTGCAAGTTAGTTAAGA

Results of tif.pl are saved in the target (ttm5) directory.

% ll
total 516
drwxrwxr-x  4 miyao miyao   4096 Jan  4 14:46 ./
drwxrwxr-x 11 miyao miyao 155648 Jan  4 23:50 ../
drwxrwxr-x  2 miyao miyao   4096 Jan  4 14:46 child/
lrwxrwxrwx  1 miyao miyao     54 Jan  3 19:24 read
-rw-rw-r--  1 miyao miyao 264216 Jan  3 19:35 tif_grep.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA
-rw-rw-r--  1 miyao miyao  43407 Jan  4 19:09 tif_log.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA
-rw-rw-r--  1 miyao miyao   2522 Jan  4 19:09 tif_result.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA
-rw-rw-r--  1 miyao miyao   2979 Jan  4 19:09 tif_result.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA.vcf
drwxrwxr-x  2 miyao miyao  36864 Jan  3 19:36 tmp/

tif_result.head_seq.tail_seq.vcf is the result in VCF format.
tif_grep.head_seq.tail_seq is the list of short reads containing head_seq or tail_seq.
tif_log.head_seq.tail_seq is the log of the run.
tif_result.head_seq.tail_seq is the result of the TIF analysis.
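The naming pattern can be written down as a small helper, which is handy when a script needs to pick up the outputs of a run. This is a sketch based only on the directory listing above; tif_output_names is a hypothetical function, not part of TIF.

```python
def tif_output_names(head, tail):
    """Return the output file names expected for a given head/tail pair."""
    suffix = f"{head}.{tail}"
    return {
        "grep": f"tif_grep.{suffix}",          # reads containing head or tail
        "log": f"tif_log.{suffix}",            # log of the run
        "result": f"tif_result.{suffix}",      # TAB-separated result
        "vcf": f"tif_result.{suffix}.vcf",     # result in VCF format
    }


if __name__ == "__main__":
    names = tif_output_names("TGTTAAATATATATACA", "TTGCAAGTTAGTTAAGA")
    for kind, name in names.items():
        print(kind, name)
```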

Result with vcf format

miyao:~/tif/ttm5$ cat tif_result.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA.vcf 
##fileformat=VCFv4.3
##fileDate=20220104
##source=<PROGRAM=tif.pl,target=ttm5,reference=IRGSP-1.0_genome.fasta>
##INFO=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=MEINFO,Number=9,Type=String,Description="Movile element info of the form ME_HEAD_SEQ,ME_TAIL_SEQ,JUNCTION_POS_OF_HEAD,JUNCTION_POS_OF_TAIL,TSD_SIZE,TSD_SEQUENCE,DIRECTION,COUNT_OF_READS_WITH_JUNCTION_OF_HEAD,COUNT_OF_READS_WITH_JUNCTION_OF_TAIL">
##ALT=<ID=INS,Description="Insertion of a mobile element">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=String,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
##created=<TIMESTAMP="2022-01-04 19:09:29 +0900">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ttm5
1       34453641        .       G       G<INS>  .       .       GT=1/1;AF=1;DP=23;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,34453641,34453645,5,CTTTG,reverse,3,20;SVTYPE=INS  GT:AD:DP   1/1:0,23:23
2       1004765 .       C       C<INS>  .       .       GT=1/0;AF=0.168;DP=89;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,1004765,1004769,5,ATACC,reverse,2,13;SVTYPE=INS        GT:AD:DP   1/0:74,15:89
2       31596632        .       T       T<INS>  .       .       GT=1/0;AF=0.47;DP=34;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,31596632,31596628,5,CTAAT,forward,2,14;SVTYPE=INS       GT:AD:DP    1/0:18,16:34
3       741222  .       C       C<INS>  .       .       GT=1/1;AF=1;DP=17;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,741222,741226,5,GCAGC,reverse,3,14;SVTYPE=INS      GT:AD:DP        1/1:0,17:17
3       8304674 .       A       A<INS>  .       .       GT=1/1;AF=1;DP=21;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,8304674,8304678,5,GAATA,reverse,2,19;SVTYPE=INS    GT:AD:DP        1/1:0,21:21
6       24967877        .       T       T<INS>  .       .       GT=1/1;AF=1;DP=29;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,24967877,24967881,5,TGCAT,reverse,4,25;SVTYPE=INS  GT:AD:DP   1/1:0,29:29
7       20064395        .       T       T<INS>  .       .       GT=1/0;AF=0.365;DP=82;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,20064395,20064391,5,CTTAT,forward,1,29;SVTYPE=INS      GT:AD:DP    1/0:52,30:82
7       20080556        .       T       T<INS>  .       .       GT=1/0;AF=0.365;DP=82;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,20080556,20080552,5,CTTAT,forward,1,29;SVTYPE=INS      GT:AD:DP    1/0:52,30:82
9       12970614        .       C       C<INS>  .       .       GT=1/1;AF=1;DP=24;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,12970614,12970618,5,CATGC,reverse,1,23;SVTYPE=INS  GT:AD:DP   1/1:0,24:24
10      19069889        .       G       G<INS>  .       .       GT=1/0;AF=0.194;DP=67;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,19069889,19069885,5,ACTTG,forward,3,10;SVTYPE=INS      GT:AD:DP    1/0:54,13:67
10      21583058        .       T       T<INS>  .       .       GT=1/0;AF=0.352;DP=34;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,21583058,21583054,5,CTTAT,forward,3,9;SVTYPE=INS       GT:AD:DP    1/0:22,12:34
12      2155895 .       C       C<INS>  .       .       GT=1/0;AF=0.18;DP=100;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,2155895,2155899,5,GGAAC,reverse,1,17;SVTYPE=INS        GT:AD:DP   1/0:82,18:100
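For downstream scripting, a record of this VCF can be split into its fields with plain string operations. The sketch below is an illustration, not part of TIF: parse_tif_vcf_line and the MEINFO key names are hypothetical, chosen to follow the order given in the ##INFO=<ID=MEINFO,...> header line. In the records above, the alternate allelic depth equals the sum of the head and tail junction read counts, and AF is close to that depth divided by DP.

```python
MEINFO_KEYS = [
    "head_seq", "tail_seq", "head_junction", "tail_junction",
    "tsd_size", "tsd", "direction", "head_reads", "tail_reads",
]
INT_KEYS = ("head_junction", "tail_junction", "tsd_size", "head_reads", "tail_reads")


def parse_tif_vcf_line(line):
    """Parse one data line of the TIF VCF output into a dictionary."""
    # No field contains whitespace, so split() handles tabs and spaces alike.
    chrom, pos, _id, _ref, _alt, _qual, _filter, info, fmt, sample = line.split()
    info_map = dict(item.split("=", 1) for item in info.split(";"))
    meinfo = dict(zip(MEINFO_KEYS, info_map["MEINFO"].split(",")))
    for key in INT_KEYS:
        meinfo[key] = int(meinfo[key])
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    ref_depth, alt_depth = (int(x) for x in sample_map["AD"].split(","))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "genotype": sample_map["GT"],
        "ref_depth": ref_depth,
        "alt_depth": alt_depth,
        "depth": int(sample_map["DP"]),
        "af": float(info_map["AF"]),
        "meinfo": meinfo,
    }


if __name__ == "__main__":
    line = ("2 1004765 . C C<INS> . . "
            "GT=1/0;AF=0.168;DP=89;MEINFO=TGTTAAATATATATACA,TTGCAAGTTAGTTAAGA,"
            "1004765,1004769,5,ATACC,reverse,2,13;SVTYPE=INS GT:AD:DP 1/0:74,15:89")
    record = parse_tif_vcf_line(line)
    print(record["chrom"], record["pos"], record["meinfo"]["tsd"], record["genotype"])
```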
Table 1. Tos17 insertion positions in ttm5. The TAB-separated TIF output 'tif_result.TGTTAAATATATATACA.TTGCAAGTTAGTTAAGA' was converted to a table. Each head flanking sequence ends with the TSD and each tail flanking sequence begins with it (shown in red in the original HTML table). Genotype M: homozygous mutant, H: heterozygous mutant.
Chr | Junction of head | Junction of tail | TSD size | TSD | Direction | Head flanking sequence (HFS) | Tail flanking sequence (TFS) | Detected number of HFS | Detected number of TFS | Detected number of wild type reads | Genotype
chr01 | 34453641 | 34453645 | 5 | CTTTG | reverse | TAGCAAGAAGTATCTCATCAGAGCCACCTTCACTTATGGAAACTACGATGGGCTCAACTCGTCAGAGAAGGGTTCTTTG | CTTTGTTTATCTTTGGACTCCATATCGGTGTCAACTTCTGGACGACGGTAAACTTGACAAAGTGGGATCCATCGA | 3 | 20 | 0 | M
chr02 | 1004765 | 1004769 | 5 | ATACC | reverse | GCACATTTTGCAAAATTTGAGATATACCATCCTGACCCACATGTCATTGACTCATGTGGGTTCTACATGTCATTGATATACC | ATACCAATGGCATATCTCAAACTTTGCAAAATTATAATGGCATGGCTCCAATTTACCCTTTTCTTTAAGAAATCAACA | 2 | 13 | 74 | H
chr02 | 31596632 | 31596628 | 5 | CTAAT | forward | CAAACACCGTGTAATTTAGAGGGATGCTATATAACGGAATCTCGTTCAGACGGGATGATACCTACGAGGTATAAGTTCTAAT | CTAATACCTGTGAGAGTATCGACTTCAGATACCGGTGAGTATTGAATGATTCATGCGAGCTATCAGGTGATAC | 2 | 14 | 18 | H
chr03 | 741222 | 741226 | 5 | GCAGC | reverse | TAGAACCGGCTGTCGGATGTGTTTACTGTGTTTGCTGGACATTTGGCAATCAACGTTGGTGGTGGCAGC | GCAGCCAGGGTTCCATCCCTTCCAGCTCAGGTGGATCGAGCTGTAGAATCTCTGGTTCACCATAAGTTAGT | 3 | 14 | 0 | M
chr03 | 8304674 | 8304678 | 5 | GAATA | reverse | GCATAAAGAGTGAATCTCATTGAAGGAACAAAATGAACATTTCAATTTTTTTCAGCAGTCTCGGTAATCAAAGTTATGAATA | GAATATACCATTGCTCTATAGCACCTCTCTCTGGCAAAATATTAGCATTTTGTCTCAAGACCTCCTTATGACACTATATTA | 2 | 19 | 0 | M
chr06 | 24967877 | 24967881 | 5 | TGCAT | reverse | TCTGGAAATGATAGAGTTGCCAGAATTGGCAGAATGGTCTTCTGTGGACTGTTTTTTTCCTGCTCTTCTTGAGGTGTGCAT | TGCATCAGAGGATGCCCCAAGCTGAAACAATTGCCACCAGTTGTTCTCCCACCTGTAAGAATGAGTATATATGTATCTACC | 4 | 25 | 0 | M
chr07 | 20064395 | 20064391 | 5 | CTTAT | forward | ACACACTCGCTCACAGTGGAGGACATTGTGGACAATGCAATCGTACTCCTGACTGCCGGTTATGGGAATTCGGCTGTTCTTAT | CTTATCACCTTCTTGCTCCGGTACCTAGCTAATGATCCAGACATCCTTGGCAAGATAACCGAGGGTGAGCACCTAGTACTCTA | 1 | 29 | 52 | H
chr07 | 20080556 | 20080552 | 5 | CTTAT | forward | ACACACTCGCTCACAGTGGAGGACATTGTGGACAATGCAATCGTACTCCTGACTGCCGGTTATGGGAATTCGGCTGTTCTTAT | CTTATCACCTTCTTGCTCCGGTACCTAGCTAATGATCCAGACATCCTTGGCAAGATAACCGAGGGTGAGCACCTAGTACTCTA | 1 | 29 | 52 | H
chr09 | 12970614 | 12970618 | 5 | CATGC | reverse | ACATTTGTATTAATTTCTATGGCCTATGGGGTTAGTGTTTGGACTGTCAAGGTGTGAGGCATCTTCTCAGTGACTGTGCATGC | CATGCATGGTGGGGTGTAGTAGTAGTTTATCAGGACCAAAGAAAAGAACAAACTGCAAGCTAGCTGTAAGCTGCAAAGTATC | 1 | 23 | 0 | M
chr10 | 19069889 | 19069885 | 5 | ACTTG | forward | GCACAAACAGAAATTCCAAAGTTAAAAGATTTTCTCACCAATAACCTATGGTATCTACCATTATATATCATCAGCACACTTG | ACTTGTACTCTGTGAAAATAGATTTCAGTAGTATGGTTATTGACACTGACAGAGCTGCTTGCTCAATACAATGTGTTTC | 3 | 10 | 54 | H
chr10 | 21583058 | 21583054 | 5 | CTTAT | forward | GCTTATGGAGGTCGAGGTGGAGTTGCCAACAGCTCATTCTGTAAATCTATCTTCAAGTCCGAGCTCGAGGTACTTAT | CTTATCATATCCTCAAATTAGCTAGCCCCCCTATCGAGAATAGCTTAGCATATTTGAGGACCTGTTGGGAGATTGGGAGGT | 3 | 9 | 22 | H
chr12 | 2155895 | 2155899 | 5 | GGAAC | reverse | AACATTAAATTTAGCTGAGATAATCTTGTTTTGCGACAAAGTTGATGTATTATAGAATGATTCCTTGAATTGCAGCTTGGAAC | GGAACAATTTGGCTTTGTGTAGCACATAAATAAAACATCTTACTATTTAATGTGAACCAGCCAAACTTTGACCCTTGATGT | 1 | 17 | 82 | H

Book

Plant Transposable Elements, Methods in Molecular Biology, Cho, Jungnam (Ed.), 2021, Springer.

Contact

Akio Miyao, Ph.D (miyao@affrc.go.jp)
Institute of Crop Science / NARO
2-1-2, Kannondai, Tsukuba, Ibaraki, 305-8518, Japan