Cancer is an inevitable outcome of the potential of cells to reproduce and to adapt to their environment; their environment is usually limited to a single host, but cancers can sometimes escape from their hosts and become parasitic clonal lineages.

  Genome Sequencing and Analysis of the Tasmanian Devil and Its Transmissible Cancer
Here we have described a whole-genome analysis of such a cancer, and our studies have provided insights into the genetic identity of the individual that founded the DFTD clone, as well as patterns of ongoing DFTD somatic evolution and clonal dynamics. Understanding the interaction between the genomes of DFTD and its host and the identification of patterns of disease spread and host response may provide information that will assist with the conservation of the Tasmanian devil.

In addition, short-insert libraries were constructed from DNA extracted from the liver of a male Tasmanian devil. The genome of the female devil was assembled with the Phusion2 Assembly Pipeline (Mullikin and Ning). K-tuples were merged and sorted into a table, and shared kmer words were linked in a relation matrix.


RPono, a package in the Phusion2 pipeline, was then used to build supercontigs with mate-pair sequences. Transcriptome sequencing of pooled RNA from 12 devil tissues was used to assist with annotation (more details are in the Extended Experimental Procedures). Tasmanian devil (Sarcophilus harrisii) samples were collected from captive animals or from wild animals, either in the field or at postmortems undertaken at the Tasmanian Department of Primary Industries Animal Health Laboratories.

Cell lines were established from two DFTD tumors and from the skin of a normal captive-bred five-year-old female Tasmanian devil. The female fibroblast cell line developed trisomy 6 during cell culture. We developed the Phusion2 genome assembly pipeline to assemble large eukaryotic genomes using Illumina short sequence reads. K-tuples were then merged and sorted into a table so that kmer words shared by different reads could be linked. A relation matrix was used to record the shared kmer words among all the reads. Setting a minimum threshold of shared k-tuples, short reads can then be clustered into groups using kmer sharing information in the relational matrix.
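The clustering idea described above (a k-mer table, a relation matrix of shared words, and a minimum sharing threshold) can be illustrated with a toy sketch. This is not the Phusion2 implementation: the function names, the default parameter values, the use of union-find for grouping, and the way word occurrence is counted (number of distinct reads containing the word) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of k-mer words found in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=5, min_shared=2, max_occurrence=35):
    """Group reads that share at least `min_shared` k-mer words.

    All defaults are illustrative assumptions, not values from the pipeline.
    """
    # k-mer table: word -> indices of the reads that contain it
    table = defaultdict(set)
    for idx, seq in enumerate(reads):
        for word in kmers(seq, k):
            table[word].add(idx)

    # Sparse relation matrix: (read a, read b) -> number of shared words.
    # Words seen in too many reads are treated as repetitive and skipped.
    shared = defaultdict(int)
    for idxs in table.values():
        if len(idxs) > max_occurrence:
            continue
        for a, b in combinations(sorted(idxs), 2):
            shared[(a, b)] += 1

    # Union-find: merge reads whose shared-word count passes the threshold.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())
```

Each returned group is a list of read indices that could then be passed to a local assembler. A real pipeline would use a compact sorted table rather than Python dictionaries, since the pairwise counting here does not scale to billions of reads.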

For the normal female Tasmanian devil genome, we generated 2. Any kmer word occurring more than 35 times was not used in further analysis, thus removing the repetitive reads. After read clustering, we obtained , read groups containing 1. Using Phrap, we generated an assembly with 2. We removed chimeric read pairs by excluding read pairs if one or two ends were found in the middle of the contig, but the edge length was larger than the insert size. To minimize the effect of read pairs overlapping biotin junctions (the circularization junction point that is introduced during mate pair library construction), we only used those pairs with full-length genome alignment.

We then used RPono, a package in the Phusion2 pipeline, to build supercontigs.

At the end of this process, we had an assembly of 3. To further improve our assembly, we aligned our supercontigs with the opossum genome assembly Monodelphis5. This version of the genome assembly Tasmanian devil genome assembly version 7. The Tasmanian devil mitochondrial genome was amplified from DNA extracted from the female Tasmanian devil fibroblast cell line.

The control loop region could not be completely sequenced due to its repetitive structure. We define the genome size $G$ as the total number of effective kmer words divided by the kmer depth, or the kmer occurrence number at the peak kmer frequency, $D_p$: $G = (K_n - K_s)/D_p$. Here $K_n$ is the total number of kmer words and $K_s$ is the number of single or unique kmer words. A total of 50, to , events were acquired for each chromosome preparation and displayed on a bivariate plot of HO versus CA3 fluorescence. Data collected from the experiments were analyzed using the Summit analysis software (Beckman Coulter).
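The kmer-based genome size definition above can be written as a short calculation. The sketch below assumes that "effective kmer words" means the total number of kmer words minus the single (unique) words, and that the peak depth is the most common occurrence count greater than one. The function name and the histogram input format are illustrative choices, not part of the original pipeline.

```python
def estimate_genome_size(histogram):
    """Estimate genome size from a k-mer occurrence histogram.

    `histogram` maps an occurrence count (depth) to the number of
    distinct k-mer words observed with that count.
    """
    # K_n: total number of k-mer words, counting every occurrence
    k_n = sum(depth * n_words for depth, n_words in histogram.items())
    # K_s: words seen exactly once, assumed to be mostly sequencing errors
    k_s = histogram.get(1, 0)
    # D_p: depth at the peak of the histogram, ignoring single words
    d_p = max((d for d in histogram if d > 1), key=lambda d: histogram[d])
    return (k_n - k_s) / d_p
```

For example, a histogram with 1,000 single words and a peak of 200 distinct words at depth 10 is dominated by the peak: the single words are subtracted before dividing by the peak depth. The function raises `ValueError` if no word occurs more than once.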

This approach involved acquiring data from a mixed human-Tasmanian devil chromosome suspension and calculating the mean HO and CA3 fluorescence intensity for a few selected human chromosomes and each of the Tasmanian devil chromosome peaks along the human DNA line. The genome size of the Tasmanian devil was obtained by summing the chromosomal DNA content of all the measured chromosome peaks (Table S1).

DNA cell suspensions were prepared from lysed whole blood isolated from four different Tasmanian devils. Four different human lymphoblastoid cell lines were analyzed together using propidium iodide (PI) staining. The DNA cell samples from fixed lysed whole blood of Tasmanian devil and human lymphoblastoid cell lines were stained with PI solution before analysis on a flow cytometer (Darzynkiewicz and Juan). A total of 10, events were acquired per DNA cell sample.

The cells were gated on PI fluorescence area versus PI fluorescence width to discriminate any doublets and clumps. The Tasmanian devil genome size was estimated as follows.

The genome size of the Tasmanian devil was calculated using the formula. The average genome size of Tasmanian devil was computed and is shown in Table S1. The genome size for human was calculated using the formulas above and compared to the value obtained from current sequence estimation. To determine the chromosomal locations of our assembled supercontigs, each of the seven Tasmanian devil chromosomes was individually sorted from the female devil fibroblast cell line using a flow cytometer.

The PCR product was cleaned using 0. We first aligned all of the flow-sorted chromosome reads with the assembled contigs. We calculated the total number of mapped reads and the numbers of mapped reads from each chromosome library for each contig. Since the sizes of the chromosomes are different, we introduced an effective number of mapped reads for each contig: Here $C_{i,s}$ is the size of chromosome $i$; $N_{i,c}$ is the number of raw flow-sorted chromosome sequence reads aligning to chromosome $i$; $N_i$ is the number of mapped reads from flow-sorted chromosome sequence library $i$.

For a given contig, we first calculated the maximum and the second maximum values of the effective mapping numbers, $N_{m,e}$ and $N_{m-1,e}$. If the ratio was larger than 0. For these small units, we examined blocks one by one for a transition from one chromosome library to another; such a transition is strong evidence for misassembly. This method allowed us to detect and correct 2, misassembled contigs.
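The assignment logic above can be illustrated with a small sketch. The exact normalization formula and the ratio threshold are not recoverable from the text, so both are assumptions here: the effective count is taken to be the raw count scaled by chromosome size over library size, the ratio is taken as second-best over best, and `max_ratio=0.5` is a hypothetical cutoff. The function name and return labels are likewise invented for illustration.

```python
def assign_contig(raw_counts, chrom_sizes, library_totals, max_ratio=0.5):
    """Assign one contig to a chromosome from flow-sorted read counts.

    raw_counts[i]     : reads from chromosome library i mapped to this contig
    chrom_sizes[i]    : size of chromosome i
    library_totals[i] : total mapped reads in chromosome library i
    Requires at least two chromosome libraries.
    """
    # Assumed normalization: correct for library depth per unit chromosome size
    effective = {
        chrom: raw_counts.get(chrom, 0) * chrom_sizes[chrom] / library_totals[chrom]
        for chrom in chrom_sizes
    }
    ranked = sorted(effective, key=effective.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if effective[best] == 0:
        return None, "unassigned"
    # A strong second-best signal suggests a chimeric (misassembled) contig
    ratio = effective[runner_up] / effective[best]
    if ratio > max_ratio:
        return best, "ambiguous"
    return best, "assigned"
```

Contigs flagged as ambiguous would be the candidates for the block-by-block inspection described above, where a switch from one chromosome library to another marks a likely misassembly breakpoint.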

Once we had an assembly in which the majority of contigs had been assigned to chromosomes and potential assembly errors had been corrected, we produced supercontigs using mate pair sequence reads and synteny with opossum. In these two steps, small unassigned contigs could be assigned to a chromosome together with other assigned contigs within one supercontig.

Supercontigs that could be assigned to a chromosome but could not be aligned with the opossum assembly were placed at the end of each chromosome assembly. Transcriptome sequencing was performed on cDNA libraries constructed from pooled RNA from 12 devil tissues (heart, liver, skin, spleen, testis, brain, kidney, lung, bone marrow, pancreas, adrenal gland and salivary gland). Brief methods used for the preparation of each library, and details of the sequencing and analysis, are provided below.

Purified RNA was fragmented with divalent cations under elevated temperature.

We assembled , transcripts in , Mb. RepeatMasker and Dust combined masked Passing only Genscan results to BLAST is an effective way of reducing the search space and therefore the computational resources required. The devil sequences were mapped to the genome using Pmatch (Slater and Birney). Two sets of models were produced, one with only consensus splice sites and one where non-consensus splices were allowed. Where a single protein sequence had generated two different coding models at the same locus, the BestTargeted module was used to select the coding model that most closely matched the source protein to take through to the next stage of the gene annotation process.

Due to the paucity of devil-specific protein and cDNA evidence, the majority of the gene models were based on proteins from other species. Proteins from other PE levels were used where no other evidence was available; similarly, mammalian proteins were favored over non-mammalian. Coding models from the Similarity stage were filtered using modules such as TranscriptConsensus. RNA-Seq spliced alignments supporting introns were used to help filter the set. The largest set of devil-specific evidence was from Illumina paired-end RNA-Seq; this was used where appropriate to help inform our gene annotation.

A set of 1. The split reads and the processed BWA alignments were combined to produce 41, transcript models in total, one transcript per locus. The resulting models were added into the gene set where they produced a novel model or splice variant; in total 5, models were added. The gene set was screened for potential pseudogenes. Stable identifiers were assigned to each gene, transcript, exon and translation. When annotating a species for the first time, these identifiers are auto-generated. In all subsequent annotations the stable identifiers are propagated based on comparison of the new gene set to the previous gene set.

The final gene set consists of 18, protein coding genes containing 22, transcripts, pseudogenes, 1, ncRNAs including miRNAs. In order to determine our power for calling variants in the cancer and immune genes, we measured average sequencing depth in the two DFTD genomes and two normal devil genomes across each exon for each gene. It is possible that we missed variants in these exons due to lack of sequencing coverage. It is also possible that some genes were incompletely annotated due to assembly errors or gaps in the assembly.

To make chromosome-specific paint probes, biotin- (Roche), Cy5-, Cy3- (Enzo), Green- (Abbotts) and Texas red- (Invitrogen) dUTPs were incorporated into the WGA2 products via a round of reamplification using a modified protocol adapted specially for labeling probes. The thermocycling followed the program suggested by the manufacturer, except that the number of cycles was increased from 14 to 18. Karyotype analysis was performed according to standard protocols. Biotin-labeled probes were detected using Cy5.

Each of the seven Tasmanian devil chromosomes was individually sorted from the female devil fibroblast cell line with a flow cytometer. Alignment of chromosome-derived reads with contigs was used to assign contigs to chromosomes and to correct assembly errors.

Sorted chromosomes were also used as probes for chromosome painting (Extended Experimental Procedures). Variants from each genome were compared and subtracted to identify the set of variants that were unique to each genome. Discordant pairs that mapped with an unexpected insert distance or orientation, or to different supercontigs, were identified and clustered to form regions of interest. Structural variants were filtered for those that were specific to individual samples, and a subset were validated with PCR, gel electrophoresis, and sequencing from both ends with an ABI xl DNA analyzer.

A subset of copy number variants were validated with quantitative real-time PCR. All authors at Illumina (see the affiliations) are employees of Illumina Inc. All authors at Illumina receive stocks as part of their compensation.
