Analysis pipelines for cancer genome sequencing in mice

Mouse models of human cancer have transformed our ability to link genetics, molecular mechanisms and phenotypes. Both reverse and forward genetics in mice are currently gaining momentum through advances in next-generation sequencing (NGS). Methodologies to analyze sequencing data were, however, developed for humans and hence do not account for species-specific differences in genome structures and experimental setups. Here, we describe standardized computational pipelines specifically tailored to the analysis of mouse genomic data. We present novel tools and workflows for the detection of different alteration types, including single-nucleotide variants (SNVs), small insertions and deletions (indels), copy-number variations (CNVs), loss of heterozygosity (LOH) and complex rearrangements, such as in chromothripsis. Workflows have been extensively validated and cross-compared using multiple methodologies. We also give step-by-step guidance on the execution of individual analysis types, provide advice on data interpretation and make the complete code available online. The protocol takes 2–7 d, depending on the desired analyses.

Here, the authors present standardized computational pipelines tailored specifically to the analysis of cancer genome sequencing data from mice. The protocol enables detection of single-nucleotide variants, indels, copy-number variations, loss of heterozygosity and complex rearrangements such as those of chromothripsis.


The mouse as a model organism has been used in cancer research for almost a century. In the 1920s, the first inbred 'isogenic' mouse lines were generated to establish cancer models that developed different malignancies either spontaneously or after treatment with carcinogens [1]. Transgenesis, embryonic stem cell technology and gene targeting opened the way for the development of genetically engineered mouse models of cancer, revolutionizing our ability to link genes, molecular mechanisms and organismal phenotypes [2]. Mouse models were used to elucidate many of the most fundamental biological principles that have since been discovered [3]. Through CRISPR-based genome engineering, it has now become possible to edit genomes, even somatically in living animals. Fast and scalable in vivo CRISPR applications are substantially changing our ability to perform complex manipulations and functional genomic studies in mice [4]. These and other developments contribute to a growing importance of mouse models in basic and translational cancer research.

In humans, cancer genomics has been revolutionized by NGS. With sequencing costs constantly dropping, NGS has also begun to influence the arena of mouse cancer genomics. As a consequence, the demand for sequencing of mouse cancers is increasing, as is the need for robust analysis pipelines.

There is a high degree of gene orthology between human and mouse: 80% of human protein-coding genes have one-to-one mouse orthologs. The remaining 20% (i) are in one-to-many or many-to-many orthologous relationships; (ii) are members of gene families that have undergone species-specific expansions or reductions; or (iii) contain species-specific open reading frames [5]. Nevertheless, comparative analyses of mouse and human genomes have also revealed some differences between the two species [6,7].
For example, in mice, segmental duplications are typically arranged in clusters, forming contiguous blocks of structural variations, whereas in humans [...] are hotspots of recombination, leading to diversification both between mouse strains and 'de novo' between individuals of the same strain [9]. Other differences between mouse and human genomes are not well studied, and it is unclear how such differences affect the accuracy of genomic analyses. The development of analytical tools and bioinformatics pipelines was focused on humans, and such tools have so far not been systematically validated in the mouse context.

Another limitation in mouse genomic analyses is the smaller size and more limited availability of genomic data resources for rodents. For example, single-nucleotide or copy-number databases comprise orders of magnitude more entries in humans than in mice [10,11]. Moreover, large data resources linking mutations to various phenotypes (cancer, Mendelian disorders) exist for human data but are mostly unavailable for the mouse. As an exception to this, a few mutations have been modeled in mice (e.g., Trp53 point mutants) to dissect functionality at the organismal level.

Finally, the use of inbred strains for mouse cancer studies, which can affect different aspects of data analysis, represents a significant difference from the human situation. For example, the type and [...]

Our protocol describes computational workflows for each analysis type. It extensively cross-compares, validates and recommends tools for the analysis of SNVs and CNVs, and contributes novel analytical methods and pipelines for the detection of LOH and chromothripsis. We provide all scripts, as well as guidance on their use. The protocol also gives recommendations for a broad spectrum of analytical details, such as parameter settings in various analytical and research contexts. Finally, each section also contains advice on data interpretation.
This work benefited from our extensive collection of various mouse tumor entities, including pancreatic, colon, stomach and hematopoietic cancers. The collection encompasses both tumors derived from genetically engineered mice and cancers triggered by environmental factors such as inflammation. Importantly, we developed primary cancer cell cultures from these mouse tumors, allowing accurate multi-layered analyses and validation approaches. For example, M-FISH using metaphase spreads facilitated the development, refinement and validation of pipelines for the detection of CNVs, LOH or chromothripsis.

We used the workflow (overview in Fig. 1) described in this protocol to analyze mouse cancers from different cancer entities [16,17]. Comparative analysis using matched human cancers revealed [...] N-nitroso-N-methylurea [28], respectively. Before the era of NGS, genetic studies in such models were substantially hampered by the low throughput of traditional approaches to cancer genome analysis. Recent studies showed, however, that chemical perturbation of genomes, combined with NGS of cancers in these mice, is a powerful approach for gene discovery and evolutionary studies [29–32].

Comparison with alternative methods

[...] mice (n = 38) and human pancreatic ductal adenocarcinomas (n = 51 patients for SNVs, indels, CNVs (data from ref. 63) and n = 24 cell lines for translocations). **P = 0.002, ***P ≤ 0.001, two-sided Mann-Whitney test; bars, median. c, Representative examples of M-FISH karyotypes from pancreatic cancers. Top, highly aneuploid human karyotype (70 chromosomes) with multiple translocations; middle, diploid mouse karyotype (40 chromosomes); bottom, complex mouse karyotype (77 chromosomes, 4 translocations). CNA, copy-number alteration. a-c adapted with permission from ref. 17.

[...] respectively), we noticed substantial discrepancies in precision, with Mutect1 and Mutect2 consistently reporting fewer false-positive calls than the other algorithms. These differences between callers were evident over the whole range of variant allele frequencies.
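Sensitivity and precision, the two metrics used here to compare callers, can be computed from the overlap between a caller's output and a set of validated variants. The following Python sketch illustrates the arithmetic; the variant tuples, the caller names and the toy data are hypothetical and are not taken from the protocol's scripts. It also shows that taking the union of two call sets cannot lower sensitivity, although it may lower precision.

```python
from typing import Iterable, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternative allele).
Variant = Tuple[str, int, str, str]


def sensitivity_precision(calls: Iterable[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of a call set against validated variants."""
    call_set = set(calls)
    true_positives = len(call_set & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(call_set) if call_set else 0.0
    return sensitivity, precision


# Toy data (hypothetical, for illustration only).
truth = {
    ("chr6", 100, "C", "T"),
    ("chr11", 200, "G", "A"),
    ("chr4", 300, "A", "G"),
    ("chr1", 400, "T", "C"),
}
caller_a = {("chr6", 100, "C", "T"), ("chr11", 200, "G", "A")}
caller_b = {("chr11", 200, "G", "A"), ("chr4", 300, "A", "G"), ("chr2", 500, "G", "T")}

sens_a, prec_a = sensitivity_precision(caller_a, truth)
sens_b, prec_b = sensitivity_precision(caller_b, truth)
# Union of both callers: every true positive of either caller is retained,
# but so is every false positive.
sens_union, prec_union = sensitivity_precision(caller_a | caller_b, truth)
```

In this toy example, the union recovers three of four validated variants, whereas each caller alone recovers two, at the cost of carrying over the single false positive of the second caller.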
Figure 3c shows the cumulative performance in relation to the frequency of analyzed variant alleles. As expected, the confidence of calls was lowest at low mutant allele frequencies. For example, the sensitivity and precision for Mutect2 were 0.7 and 0.61, respectively, at mutant allele frequencies of 0.1–0.2 but increased to 0.89 and 0.95, respectively, at allele frequencies between 0.4 and 0.5. We noted that when combining results from Mutect1 and Mutect2, the increase in sensitivity was more pronounced than the decrease in precision (red curves in Fig. 3c), which could be exploited in projects in which high sensitivity is the key requirement.

[...] for the interchange between computer programs and is therefore not the best choice as a final output. We therefore export the final results in tabular format. An explanation of all relevant fields can be found in Box 1, and exemplary data are shown in [...]

[...] +0.25 were regarded as copy-number neutral. This relatively low cutoff was used to account for intratumoral heterogeneity and the frequent presence of aneuploidy/polyploidy in our cohort.

We used CopywriteR and CNVKit to determine copy-number aberrations for each gene from WES data in this cohort. For each tool, sensitivity and precision were determined using aCGH as a reference. For CopywriteR, the weighted mean sensitivity and precision were 94% and 93%, respectively (Fig. 5a), whereas for CNVKit both were 90% (Supplementary Fig. 1).

CNVKit, which uses on- and off-target reads, can be advantageous when looking at small exonic regions. As an example, Supplementary Fig. 2 shows an isolated small intragenic deletion of the EGFR gene, which was detected by CNVKit but not by CopywriteR. However, this [...] for calling copy number-altered segments). This oscillation around our cutoff value is the cause of the decreased concordance between aCGH and CopywriteR.
Importantly, when we raised our segment-calling threshold to ±0.3, concordance increased considerably (Fig. 5c).

The analysis of an extensive series of cancers allowed us to systematically search for limitations inherent to CopywriteR. Notably, we found that, in very aneuploid samples, CopywriteR assigns incorrect log2 ratios to called segments, which is due to incorrect centering to the 'zero baseline' (see Chr11–13 in Fig. 5d). Figure 5e shows the M-FISH karyotype for such a sample. Because this phenomenon was strongly dependent on the degree of aneuploidy, we suspected that CopywriteR's normalization method, which uses the median absolute deviation as a location parameter, was the cause. To verify this hypothesis, we adopted the normalization strategy used in CNVKit, which uses the mode derived from a Gaussian kernel estimator as the location parameter, for re-centering called segments from CopywriteR. This allowed us to correct faulty annotations, resulting in substantially increased concordance with M-FISH and aCGH data, even in highly aneuploid samples (Fig. 5f,g). This re-normalization is implemented in the Procedure.

[...] be generated (Fig. 6a). Moreover, to identify regions of the genome that are significantly amplified or deleted across samples, we use GISTIC2 (Fig. 6b) [...]

Copy-number changes can be inferred from WES data using CopywriteR with precision similar to that of aCGH. a, Sensitivity and precision of CopywriteR (median on-target coverage of 75×; from SureSelect XT Mouse All Exon kit; 49.6 Mb) in primary pancreatic cancer cell cultures (n = 38). CNV calls were benchmarked with corresponding reference aCGH data (Agilent SurePrint G3 Mouse CGH; 240K) by gene-wise comparison. Called segments with a log2 ratio between −0.25 and +0.25 were regarded as copy-number neutral. Samples were sorted by the fraction of the genome affected by CNVs. Two samples (*) performed significantly worse than the rest of the cohort, owing to a large degree of intratumoral heterogeneity and aneuploidy/polyploidy (see also c and Supplementary Figs. 3–5). b, Copy-number profiles of Chr4 from one primary pancreatic cancer cell culture sample (S821) detected by aCGH (top), WES using CopywriteR (middle) or WGS using HMMCopy (bottom) show high concordance. c, Effect of increasing the log2 cutoff on the performance of CopywriteR, as compared to aCGH, in polyploid cancers with substantial intratumoral heterogeneity (Supplementary Figs. 3–5). d, Copy-number profile estimated by CopywriteR for aneuploid sample R1035. For centering, CopywriteR uses the median absolute deviation (MAD), which incorrectly centers copy-number states in highly aneuploid cancer genomes. Note that the shift of the log2 ratio for chromosomes 1, 3, 9, 11 and 12, which would indicate a subclonal loss, was not confirmed by M-FISH (e). e, Representative M-FISH karyotype for the same sample. In total, ten separate karyotypes for this sample were analyzed: +2 (2/10 analyzed karyotypes), +5 (10/10), +6 (10/10), +7 (10/10), +8 (7/10), +14 (5/10), +17 (10/10), and +19 (5/10). f, Re-centering of the copy-number profile estimated by CopywriteR for sample R1035. Using the mode, estimated by a Gaussian kernel estimator of the called segments, results in expected log2 ratios for all chromosomes. Mode centering results in a shift of the log2 ratio of +0.16. g, Performance of CopywriteR using MAD or mode estimator for centering. After correction using the mode estimator, the performance of CopywriteR improves for the samples with the highest CNV load.

[...] backcrossing. In the mouse cohorts typically used in cancer research, these effects can be substantial. For example, in a cohort of mouse pancreatic cancers induced by pancreas-specific Kras LSL-G12D expression, some mice were kept on a mixed Sv/129;C57BL/6 background (Fig. 8b), whereas others were backcrossed to C57BL/6 to varying degrees (Fig. 8c,d).

In animals on mixed backgrounds or outbred mice, the distribution of germline variants allows LOH analysis at most genetic loci (Fig. 8b). By contrast, the frequent use of inbred genetic backgrounds in cancer research poses significant challenges to LOH analyses. Genomes of inbred mice have only a few nucleotides at which the maternal and paternal alleles differ. Figure [...]

[...] 7). Each dot corresponds to the variant allele frequency of an individual SNP and its position in the mouse genome. For a diploid genome, the distribution of allele frequencies is expected to peak at 0.5 (heterozygous, variant allele inherited from one parent) and 1 (homozygous, inherited from both parents). Only heterozygous variants (informative positions) can be used for the detection of LOH. a, In the human germline, both hetero- and homozygous variants are distributed evenly throughout the genome. The plot on the right shows a zoom-in on Chr17. b, Mouse genome with mixed C57BL/6 and Sv/129 background. Although the absolute number of variants is comparable to those of human genomes, informative positions are not evenly distributed across the genome. Stretches of heterozygous variants are interrupted by blocks of homozygous variants, allowing the study of the LOH of most but not all genetic loci. The zoom-in on the right shows the distribution of germline variants on Chr16. c, Mouse genome with mixed C57BL/6 and Sv/129 background backcrossed to C57BL/6 background for 13 generations. Backcrossing resulted in extensive loss of informative germline variants, thus rendering LOH analysis impossible. d, Mouse genome with mixed C57BL/6 and Sv/129 background, partially backcrossed to C57BL/6. Note the strong variation of germline variant density at different chromosomes (e.g., Chr6 versus Chr3). Chr, chromosome; gen., generation.

[...] states is independent of the number of alterations (Fig. 11c).
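The notion of informative positions used for LOH analysis above (heterozygous germline variants with an allele frequency near 0.5) can be sketched in Python as follows. This is a minimal illustration, not the protocol's implementation: the record layout, the minimum depth and the heterozygosity window of 0.3 to 0.7 are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class GermlineSnp(NamedTuple):
    chrom: str
    pos: int
    ref_reads: int  # reads supporting the reference allele in the normal sample
    alt_reads: int  # reads supporting the variant allele in the normal sample


def is_informative(snp: GermlineSnp, min_depth: int = 20,
                   het_low: float = 0.3, het_high: float = 0.7) -> bool:
    """True if the SNP looks heterozygous in the germline (hypothetical thresholds)."""
    depth = snp.ref_reads + snp.alt_reads
    if depth < min_depth:
        return False  # too few reads to estimate the allele frequency reliably
    vaf = snp.alt_reads / depth
    return het_low <= vaf <= het_high


def informative_per_chromosome(snps: Iterable[GermlineSnp]) -> Dict[str, int]:
    """Count informative (heterozygous) positions on each chromosome."""
    return dict(Counter(s.chrom for s in snps if is_informative(s)))


# Toy data (hypothetical): chr3 carries heterozygous variants, chr6 only homozygous ones.
snps = [
    GermlineSnp("chr3", 1000, 24, 26),   # VAF 0.52, informative
    GermlineSnp("chr3", 5000, 30, 22),   # VAF 0.42, informative
    GermlineSnp("chr6", 2000, 0, 45),    # VAF 1.00, homozygous
    GermlineSnp("chr6", 9000, 1, 50),    # VAF 0.98, homozygous
    GermlineSnp("chr16", 700, 5, 6),     # depth 11, below the depth cutoff
]
counts = informative_per_chromosome(snps)
```

A chromosome that is absent from `counts`, such as chr6 in this toy example, offers no informative positions, which mirrors the situation in extensively backcrossed animals.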
Interspersed loss and retention of heterozygosity. In a diploid genome, loss of a chromosomal fragment leads to LOH of the corresponding region, which is irreversible. Therefore, in chromothripsis, there is high-level concordance between CNV and LOH patterns (Fig. 11b; concordance level reflected by Jaccard index).

[...] independently reordered and absolute rank differences were calculated to generate a random background distribution (n = 1,000 simulations). For sample S821, the observed value is located within the null model of random distribution, making it unlikely that the observed segment order arose in a progressive model. Two-sided P = 0.78. f, Ability to walk the derivative chromosome: rearrangement graph of Chr4 from sample S821 (n = 146 rearrangements). Each fragment is represented by two blocks, indicating the read orientations (3′ and 5′ in gray and red, respectively) for the start and the end of each chromosome segment when mapped to the reference genome (Fig. 12b). In a chromothriptic model, the read orientations will be alternating, resulting in a gray-red-gray-red pattern. The Wald-Wolfowitz test is used to test this alternating 3′-to-5′ pattern of paired-end read orientation (P < 10⁻¹²). The connections between fragments are visualized above and below the blocks (line color indicates fragment join type; see [...]

[...] categories: deletion type, duplication type and two different inversion types (Fig. 12a). Each of these categories is characterized by a unique pattern of read orientations between two paired-end reads when these are mapped onto the reference genome. In the literature, multiple different nomenclatures for these structural variants can be found [55,56,59].

The current assumption is that, during reassembly after chromothripsis, there is no preference for the type of join between two fragments. Therefore, each category should occur in 25% of the rearrangements. A χ² test can be used to test whether the observed distribution of joins significantly differs from the expected distribution (Fig. 11d).

[...] The orientation of paired-end reads relative to the reference genome is altered by rearrangements and is specific for the rearrangement type (as proposed by Stephens et al. [54]). b, On a rearranged chromosome, each rearranged fragment contains loose 3′ and 5′ ends that are covered by reads spanning the breakpoints in the 3′ and 5′ directions (colored arrows). In the case of a chromothriptic chromosome, mapping of breakpoint-flanking reads to the reference genome results in an alternating 3′-to-5′ read orientation pattern. c, The alternating pattern of 3′-to-5′ read orientations is disturbed by nested deletions or duplications originating from sequential accumulation of CNVs or by rearrangements, which are not detected by sequencing (Missing observations). Asterisks indicate missing read support for rearrangements, which remain undetected. orient., orientations.
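The χ² test of join types described above (four categories, each expected in 25% of rearrangements) can be sketched in Python using only the standard library. This is an illustrative sketch, not the protocol's script: the category keys and the toy counts are hypothetical. The P value uses the closed-form survival function of the χ² distribution with three degrees of freedom, which applies because there are four categories.

```python
import math
from typing import Mapping, Tuple

# Hypothetical labels for the four join categories (deletion type,
# duplication type and the two inversion types).
JOIN_TYPES = ("deletion", "duplication", "inversion_a", "inversion_b")


def chi_square_uniform_joins(counts: Mapping[str, int]) -> Tuple[float, float]:
    """Test observed join-type counts against equal expected frequencies.

    Returns (chi-square statistic, P value) for 3 degrees of freedom.
    """
    observed = [counts.get(join, 0) for join in JOIN_TYPES]
    total = sum(observed)
    if total == 0:
        raise ValueError("no rearrangements supplied")
    expected = total / len(JOIN_TYPES)  # 25% of all joins per category
    statistic = sum((obs - expected) ** 2 / expected for obs in observed)
    # Survival function of the chi-square distribution with 3 degrees of freedom:
    # erfc(sqrt(x/2)) + 2*sqrt(x/(2*pi)) * exp(-x/2)
    half = statistic / 2.0
    p_value = math.erfc(math.sqrt(half)) + 2.0 * math.sqrt(half / math.pi) * math.exp(-half)
    return statistic, p_value


# Toy counts (hypothetical): a roughly even and a strongly skewed distribution.
even_stat, even_p = chi_square_uniform_joins(
    {"deletion": 30, "duplication": 28, "inversion_a": 22, "inversion_b": 20})
skewed_stat, skewed_p = chi_square_uniform_joins(
    {"deletion": 70, "duplication": 10, "inversion_a": 10, "inversion_b": 10})
```

A large P value, as for the even toy counts, means the data do not contradict random joining; a small P value, as for the skewed counts, indicates a preference for particular join types.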
The order in which the fragments are reassembled after chromothripsis is random and independent of the types of joins between two segments. To test this, we order each segment according to its start position and assign ranks for both the start and the end positions. For a perfectly ordered chromosome, the difference between these ranks is 0. By contrast, for a chromothriptic chromosome, the difference in ranks is >0 and increases with the randomness of fragment order.

For statistical testing, we implemented a Monte Carlo approach by randomly reassigning the observed start and end positions 1,000 times and re-calculating the (absolute) mean rank difference for each simulation. The results of this test are shown in the histogram in Fig. 11e. If the observed mean rank difference is larger than the 5th percentile of this distribution, there is strong evidence that the observed fragment order originates from a random reassembly process.

Ability to walk the derivative chromosome. After a chromothriptic event, each chromosome fragment contains loose 3′ and 5′ ends, to which other fragments are joined during reassembly. In a paired-end sequencing approach, each breakpoint is supported by a read facing in the 3′ direction and a read facing in the 5′ direction (Fig. 12b). Mapping of these read orientations onto the reference chromosome results in an alternating 3′/5′ pattern, as shown in Fig. 11f [...]

To illustrate the use of the Docker container, the following command will process the WES FASTQ [...] specifying -e USERID='id -u' -e GRPID='id -g'.
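The Monte Carlo test for random fragment order described above for chromothripsis can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the protocol's code: the function names, the fixed random seed and the toy coordinates are hypothetical, and the 5th percentile is approximated by the value at index 0.05 × n of the sorted simulations.

```python
import random
from typing import List, Sequence, Tuple


def ranks(values: Sequence[int]) -> List[int]:
    """Rank of each value (0 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def mean_abs_rank_difference(starts: Sequence[int], ends: Sequence[int]) -> float:
    """Mean absolute difference between start ranks and end ranks."""
    pairs = zip(ranks(starts), ranks(ends))
    return sum(abs(s - e) for s, e in pairs) / len(starts)


def random_order_test(starts: Sequence[int], ends: Sequence[int],
                      n_sim: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Return (observed statistic, approximate 5th percentile of the null).

    The null distribution is built by randomly reassigning end positions
    to start positions n_sim times.
    """
    rng = random.Random(seed)
    observed = mean_abs_rank_difference(starts, ends)
    shuffled = list(ends)
    null = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        null.append(mean_abs_rank_difference(starts, shuffled))
    null.sort()
    return observed, null[int(0.05 * n_sim)]


# Toy coordinates (hypothetical): twelve joins on one chromosome.
starts = [i * 100 for i in range(12)]
ordered_ends = [s + 50 for s in starts]          # perfectly ordered chromosome
inverted_ends = list(reversed(ordered_ends))     # maximally disordered ranks

obs_ordered, threshold = random_order_test(starts, ordered_ends)
obs_inverted, _ = random_order_test(starts, inverted_ends)
```

Following the criterion stated in the text, an observed value above the returned threshold is read as compatible with random reassembly, whereas the perfectly ordered toy chromosome (statistic 0) falls below it.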

By default, Docker containers cannot access files located on the machine on which they run. Therefore, local folders need to be mapped to folders inside the container using -v local_ [...]

[...] degradation, immediately place the cross-section into a histology cassette and fixative for use in Step 2B. Both remaining tumor ends can be collected in RNAlater for direct DNA isolation (Step 2A) and/or used for the establishment of primary cultures (Step 2C [...]

8 Now, create the folders for the output of the pipeline. Note that ~200 GB of data will be generated per pipeline run for WES, of which ~10 GB will be the raw read files, ~30 GB will be the results, including the mapped BAM files, and ~170 GB will be temporary files (located inside the temp folder), which can be deleted afterward.

Quality control of raw reads after trimming • Timing 5 min

12 Run fastqc to collect quality data using the trimmed reads as follows. The data from both pre- (Step 9) and post-trimming can be summarized using multiqc (Step 19 [...]

14 In each of the following steps, processed files will be deleted once they are not needed anymore. Remove the trimmed raw files as follows: [...]

[...] improve the overall runtime (Table 5). [...]

[...] Step 37, a copy-number profile for the complete genome is generated (Fig. 14a) [...]

Because this region carries only a few heterozygous germline variants in this mouse, focal amplification cannot be easily seen in the LOH plot. Figure 15c and Fig. 15d show two cancers displaying Kras G12D dosage gain by copy-number-neutral (CN)-LOH (Kras G12D homozygosity, acquired uniparental disomy, loss of wild-type Kras). CN-LOH can affect either the whole chromosome (Fig. 15c; arising through chromosomal missegregation) or only parts of Chr6 (Fig. 15d; through mitotic recombination). Discriminating between these two scenarios is possible only through LOH analyses (bottom panels in Fig. 14c,d).
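The distinction drawn above between whole-chromosome and partial CN-LOH can be sketched in Python by combining segment-level copy-number and LOH calls. This is only an illustration under assumptions: the segment record, the chromosome length used in the toy data and the 90% coverage threshold for calling a whole-chromosome event are hypothetical; the ±0.25 log2 cutoff for copy-number-neutral segments is the one used elsewhere in this protocol.

```python
from typing import NamedTuple, Sequence


class Segment(NamedTuple):
    start: int         # segment start on the chromosome (bp)
    end: int           # segment end on the chromosome (bp)
    log2_ratio: float  # tumor versus normal copy-number log2 ratio
    loh: bool          # True if the segment shows loss of heterozygosity


def cn_loh_fraction(segments: Sequence[Segment], chrom_length: int,
                    neutral_cutoff: float = 0.25) -> float:
    """Fraction of the chromosome covered by copy-number-neutral LOH."""
    covered = sum(s.end - s.start for s in segments
                  if s.loh and abs(s.log2_ratio) < neutral_cutoff)
    return covered / chrom_length


def classify_cn_loh(segments: Sequence[Segment], chrom_length: int,
                    whole_chrom_fraction: float = 0.9) -> str:
    """Label a chromosome by the extent of CN-LOH (hypothetical 90% threshold)."""
    fraction = cn_loh_fraction(segments, chrom_length)
    if fraction == 0:
        return "no CN-LOH"
    if fraction >= whole_chrom_fraction:
        return "whole-chromosome CN-LOH"
    return "partial CN-LOH"


# Toy data (hypothetical chromosome length of 150 Mb).
LENGTH = 150_000_000
whole = [Segment(0, LENGTH, 0.02, True)]
partial = [Segment(0, 100_000_000, 0.0, False),
           Segment(100_000_000, LENGTH, 0.05, True)]
gain_with_loh = [Segment(0, LENGTH, 0.6, True)]  # LOH, but not copy-number neutral

labels = [classify_cn_loh(s, LENGTH) for s in (whole, partial, gain_with_loh)]
```

The third toy case shows why the copy-number profile alone is insufficient: a segment with LOH and a copy-number gain is not counted as CN-LOH, and a copy-number-neutral profile without LOH calls would look identical to the first two cases.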
Alterations at prototype tumor suppressors

Examples of different types of tumor suppressor alterations are shown for Trp53 (Fig. 16a) and Cdkn2a (Fig. 16b-d). One cancer has a somatic Trp53 point mutation on Chr11 (Fig. 16a). Three [...]

Figure 16b shows a heterozygous loss of Chr4, which harbors Cdkn2a, an important tumor suppressor locus in pancreatic cancer. In a different tumor (Fig. 16c), Cdkn2a is inactivated by two independent copy-number alterations: loss of one Chr4 and focal Cdkn2a deletion on the remaining chromosome. Finally, another cancer (Fig. 16d) [...]

Journal: NPROT

Software and code

Data collection
The information is provided in the manuscript.

Data analysis
The information is included in the manuscript and available online (https://github.com/roland-rad-lab/MoCaSeq).
