5_sampleEvolTimes module

Process the posterior evolutionary times for each window of a given pairwise alignment, then sample evolutionary times from each window. The sampled evolutionary times are then used to derive distributions for evolutionary time and PCS size at the whole genome level as well as for each chromosome.

  • Use:

    python3 5_sampleEvolTimes.py -sp_ucsc_name [UCSC name] -alpha [any number > 1] -cores [nb. of cores] [--overwrite, optional]
    
  • Example of usage (human as reference genome, aligned with mouse):

    python3 ~/code/5_sampleEvolTimes.py -sp_ucsc_name mm39 -alpha 1.1 -cores 1 --overwrite
    python3 ~/code/5_sampleEvolTimes.py -sp_ucsc_name mm39 -alpha 1.1 -cores 80 --overwrite
    
  • Input Parameters (mandatory, except --overwrite):

-sp_ucsc_name:

UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).

-alpha:

Controls how frequently longer indels occur. alpha can take any value strictly greater than 1, i.e. in the open interval (1, ∞). The closer alpha is to 1, the more likely larger indels are. If alpha is above 5 (a hard upper limit set internally with max_alpha), the model becomes a substitution-only model.

-cores:

Number of cores used to run the script. If cores=1, the script runs serially, which may take several hours to complete. Using as many of the available cores as possible is advisable.

--overwrite:

Optional flag. If this flag is specified, any existing output files will be overwritten during the run.
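
The command-line interface described above can be sketched with argparse. This is a minimal illustration, not the script's actual parser: the names alpha_type, build_parser and MAX_ALPHA are assumptions introduced here, and MAX_ALPHA = 5 only mirrors the max_alpha limit mentioned in the description of -alpha.

```python
import argparse

# Assumption: mirrors the internal max_alpha limit described above.
MAX_ALPHA = 5.0


def alpha_type(text):
    """Parse alpha and enforce the open interval (1, inf)."""
    value = float(text)
    if not value > 1.0:
        raise argparse.ArgumentTypeError("alpha must be strictly greater than 1")
    return value


def build_parser():
    """Hypothetical parser reproducing the documented flags."""
    parser = argparse.ArgumentParser(description="Sample evolutionary times per window.")
    parser.add_argument("-sp_ucsc_name", required=True,
                        help="UCSC name of the aligned species, e.g. mm39")
    parser.add_argument("-alpha", required=True, type=alpha_type,
                        help="any number > 1")
    parser.add_argument("-cores", required=True, type=int,
                        help="number of cores")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing output files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-sp_ucsc_name", "mm39", "-alpha", "1.1", "-cores", "80", "--overwrite"]
    )
    # Above max_alpha the model is described as substitution-only.
    substitution_only = args.alpha > MAX_ALPHA
    print(args.sp_ucsc_name, args.alpha, args.cores, args.overwrite, substitution_only)
```

Rejecting alpha values at parse time means a value such as 1.0 fails immediately with a usage error rather than partway through a long run.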

  • Other Parameters taken from dataset.py:

refsp_ucscName:

UCSC name of the reference species that is being aligned (e.g. hg38 for human).

dirWindows:

Directory where windows and their PCS size distributions are saved (input files).

dirEstEvolTimes:

Directory where the estimates of evolutionary time are saved (input files).

dirSampEvolTimes:

Directory where (1) sampled PCS size distribution and (2) sampled evolutionary time distribution will be saved (output files).

Note

Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.

Pre-requisites

Before using this script, make sure all the required files were pre-computed:

a) Window file for all chromosomes

Make sure to run 2_computeWindows.py for all chromosomes of the reference genome (human in our case: hg38).

b) Files with estimated evolutionary times per window

Make sure to run 4_estimateEvolTimes.py for the alpha value specified in the input parameter -alpha.
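
A quick pre-flight check can catch missing inputs before a multi-hour run. The sketch below is an assumption-laden illustration: the Dataset class is a hypothetical stand-in for the settings in utils/dataset.py (only the attribute names come from this page), and because the input file naming is not documented here, it only verifies that the two input directories exist and are not empty.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for the settings in utils/dataset.py."""
    refsp_ucscName: str
    dirWindows: str
    dirEstEvolTimes: str
    dirSampEvolTimes: str


def check_directories(ds):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    # Only the two input directories are checked.
    for name in ("dirWindows", "dirEstEvolTimes"):
        path = getattr(ds, name)
        if not os.path.isdir(path):
            problems.append(f"{name}: missing directory {path}")
        elif not os.listdir(path):
            problems.append(f"{name}: directory is empty {path}")
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory tree with placeholder paths.
    with tempfile.TemporaryDirectory() as root:
        ds = Dataset(
            refsp_ucscName="hg38",
            dirWindows=os.path.join(root, "windows"),
            dirEstEvolTimes=os.path.join(root, "estimates"),
            dirSampEvolTimes=os.path.join(root, "samples"),
        )
        for problem in check_directories(ds):
            print(problem)
```

A passing check does not guarantee that every chromosome or the requested alpha value was processed; it only rules out the most common setup mistakes.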

Cluster resources

To run this Python script stand-alone on a cluster that uses Slurm to manage jobs, first request an interactive session:

srun -p compute -t 4:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=80 --pty bash

Otherwise you can use the script ../cluster/5_sampleEvolTimes_runAll.py to run this script for all 40 species used in this study (whole genome).

Time, Memory & Disk space

For reference, we report the runtime, memory usage, and disk space required to run this script on each pairwise alignment of the 40-vertebrate dataset examined in our study. All runs used 80 cores.

Desc.        Parameter α=1.1              Parameter α=10
UCSC name    Time      Memory  Disk       Time      Memory  Disk
panPan3      05:55:54  23GB    0.064GB    04:42:16  24GB    0.064GB
panTro6      05:59:13  12GB    0.066GB    04:42:32  12GB    0.066GB
gorGor6      06:08:48  13GB    0.064GB    04:40:49  13GB    0.064GB
ponAbe3      05:51:36  10GB    0.057GB    04:32:41  9GB     0.057GB
papAnu4      05:47:22  11GB    0.053GB    04:10:04  9GB     0.053GB
macFas5      05:30:09  9GB     0.053GB    04:04:21  9GB     0.053GB
rhiRox1      05:47:51  9GB     0.056GB    04:32:31  8GB     0.056GB
chlSab2      05:59:08  9GB     0.056GB    04:17:24  9GB     0.056GB
nasLar1      05:28:40  11GB    0.046GB    03:43:41  8GB     0.046GB
rheMac10     05:48:28  9GB     0.053GB    04:09:44  12GB    0.053GB
calJac4      04:59:18  12GB    0.048GB    03:58:28  8GB     0.048GB
tarSyr2      04:54:38  14GB    0.047GB    04:03:42  9GB     0.047GB
micMur2      03:56:58  9GB     0.039GB    03:12:04  9GB     0.039GB
galVar1      04:39:45  10GB    0.046GB    04:06:06  9GB     0.046GB
mm39         03:26:26  8GB     0.031GB    02:41:15  6GB     0.031GB
oryCun2      03:58:40  8GB     0.037GB    03:10:44  7GB     0.037GB
rn7          03:17:01  7GB     0.031GB    02:31:41  7GB     0.031GB
vicPac2      04:40:23  9GB     0.043GB    03:36:43  9GB     0.043GB
bisBis1      04:05:15  8GB     0.039GB    03:17:50  8GB     0.039GB
felCat9      04:35:20  9GB     0.043GB    03:30:10  9GB     0.043GB
manPen1      04:32:14  11GB    0.041GB    03:44:48  8GB     0.041GB
bosTau9      03:33:13  8GB     0.033GB    02:50:09  8GB     0.033GB
canFam6      04:22:27  9GB     0.041GB    03:21:08  8GB     0.041GB
musFur1      04:37:25  9GB     0.044GB    03:42:45  8GB     0.044GB
neoSch1      04:59:10  9GB     0.046GB    03:49:58  9GB     0.046GB
equCab3      04:58:28  10GB    0.047GB    03:48:17  10GB    0.047GB
myoLuc2      03:33:09  8GB     0.033GB    02:54:18  11GB    0.033GB
susScr11     04:05:26  8GB     0.039GB    03:10:22  8GB     0.039GB
enhLutNer1   04:31:27  9GB     0.044GB    03:44:19  9GB     0.044GB
triMan1      04:43:29  9GB     0.043GB    03:35:42  9GB     0.043GB
macEug2      01:25:03  7GB     0.011GB    01:06:42  7GB     0.011GB
ornAna2      01:04:26  7GB     0.009GB    00:53:59  6GB     0.009GB
aptMan1      00:54:14  7GB     0.007GB    00:46:38  7GB     0.007GB
galGal6      00:49:03  6GB     0.006GB    00:39:58  6GB     0.006GB
thaSir1      00:41:32  8GB     0.005GB    00:32:17  7GB     0.005GB
aquChr2      00:52:14  9GB     0.007GB    00:43:08  7GB     0.007GB
melGal5      00:49:41  6GB     0.006GB    00:41:30  7GB     0.006GB
xenLae2      00:40:21  6GB     0.005GB    00:32:04  6GB     0.005GB
xenTro10     00:41:31  6GB     0.005GB    00:32:09  7GB     0.005GB
danRer11     00:34:22  6GB     0.003GB    00:30:25  7GB     0.003GB

Time per Run: Details

Timing of a single run (human-mouse alignment, all chromosomes, 80 cores): ~3 hours 18 minutes. Details on computational time are available in the log of the run.

Step                                         Time (s)
[chr1] Load data (obs. PCSs + estimates)        45.22
[chr1] Sample results                          967.04
[chr2] Load data (obs. PCSs + estimates)        49.74
[chr2] Sample results                         1050.78
[chr3] Load data (obs. PCSs + estimates)        33.68
[chr3] Sample results                          717.64
[chr4] Load data (obs. PCSs + estimates)        40.90
[chr4] Sample results                          614.35
[chr5] Load data (obs. PCSs + estimates)        34.67
[chr5] Sample results                          756.51
[chr6] Load data (obs. PCSs + estimates)        33.89
[chr6] Sample results                          722.64
[chr7] Load data (obs. PCSs + estimates)        26.56
[chr7] Sample results                          566.90
[chr8] Load data (obs. PCSs + estimates)        29.50
[chr8] Sample results                          599.79
[chr9] Load data (obs. PCSs + estimates)        22.56
[chr9] Sample results                          452.59
[chr10] Load data (obs. PCSs + estimates)       24.31
[chr10] Sample results                         491.95
[chr11] Load data (obs. PCSs + estimates)       28.77
[chr11] Sample results                         599.64
[chr12] Load data (obs. PCSs + estimates)       24.69
[chr12] Sample results                         519.79
[chr13] Load data (obs. PCSs + estimates)       21.11
[chr13] Sample results                         362.63
[chr14] Load data (obs. PCSs + estimates)       19.26
[chr14] Sample results                         414.56
[chr15] Load data (obs. PCSs + estimates)       17.94
[chr15] Sample results                         349.07
[chr16] Load data (obs. PCSs + estimates)       19.27
[chr16] Sample results                         323.72
[chr17] Load data (obs. PCSs + estimates)       14.70
[chr17] Sample results                         256.59
[chr18] Load data (obs. PCSs + estimates)       16.56
[chr18] Sample results                         325.18
[chr19] Load data (obs. PCSs + estimates)       12.01
[chr19] Sample results                         166.99
[chr20] Load data (obs. PCSs + estimates)       14.11
[chr20] Sample results                         310.32
[chr21] Load data (obs. PCSs + estimates)        8.08
[chr21] Sample results                         156.26
[chr22] Load data (obs. PCSs + estimates)        8.82
[chr22] Sample results                         142.33
[chrX] Load data (obs. PCSs + estimates)        28.60
[chrX] Sample results                          403.82
[chrY] Load data (obs. PCSs + estimates)         7.52
[chrY] Sample results                           30.64
Total time                                   11884.55
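
As a quick check, the total of 11884.55 seconds can be converted to hours and minutes to confirm the ~3 hours 18 minutes quoted for this run. The helper name split_seconds is introduced here for illustration only.

```python
def split_seconds(total_seconds):
    """Split a duration in seconds into (hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return int(hours), int(minutes), seconds


hours, minutes, seconds = split_seconds(11884.55)
print(f"{hours} h {minutes} min {seconds:.2f} s")  # 3 h 18 min 4.55 s
```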

Storage per Run: Details

Total size of output files (80 files: one per pairwise alignment for each of α=1.1 and α=10.0): 3 GB.

Details of each output file (α=1.1), including the file size and filename:

UCSC name    Size     Filename
panPan3      66 MB    pcsDistrib-samp.panPan3.alpha1.1.pickle
panTro6      68 MB    pcsDistrib-samp.panTro6.alpha1.1.pickle
gorGor6      66 MB    pcsDistrib-samp.gorGor6.alpha1.1.pickle
ponAbe3      59 MB    pcsDistrib-samp.ponAbe3.alpha1.1.pickle
papAnu4      55 MB    pcsDistrib-samp.papAnu4.alpha1.1.pickle
macFas5      54 MB    pcsDistrib-samp.macFas5.alpha1.1.pickle
rhiRox1      58 MB    pcsDistrib-samp.rhiRox1.alpha1.1.pickle
chlSab2      58 MB    pcsDistrib-samp.chlSab2.alpha1.1.pickle
nasLar1      47 MB    pcsDistrib-samp.nasLar1.alpha1.1.pickle
rheMac10     55 MB    pcsDistrib-samp.rheMac10.alpha1.1.pickle
calJac4      50 MB    pcsDistrib-samp.calJac4.alpha1.1.pickle
tarSyr2      48 MB    pcsDistrib-samp.tarSyr2.alpha1.1.pickle
micMur2      41 MB    pcsDistrib-samp.micMur2.alpha1.1.pickle
galVar1      48 MB    pcsDistrib-samp.galVar1.alpha1.1.pickle
mm39         32 MB    pcsDistrib-samp.mm39.alpha1.1.pickle
oryCun2      39 MB    pcsDistrib-samp.oryCun2.alpha1.1.pickle
rn7          32 MB    pcsDistrib-samp.rn7.alpha1.1.pickle
vicPac2      44 MB    pcsDistrib-samp.vicPac2.alpha1.1.pickle
bisBis1      41 MB    pcsDistrib-samp.bisBis1.alpha1.1.pickle
felCat9      44 MB    pcsDistrib-samp.felCat9.alpha1.1.pickle
manPen1      42 MB    pcsDistrib-samp.manPen1.alpha1.1.pickle
bosTau9      34 MB    pcsDistrib-samp.bosTau9.alpha1.1.pickle
canFam6      43 MB    pcsDistrib-samp.canFam6.alpha1.1.pickle
musFur1      46 MB    pcsDistrib-samp.musFur1.alpha1.1.pickle
neoSch1      48 MB    pcsDistrib-samp.neoSch1.alpha1.1.pickle
equCab3      48 MB    pcsDistrib-samp.equCab3.alpha1.1.pickle
myoLuc2      34 MB    pcsDistrib-samp.myoLuc2.alpha1.1.pickle
susScr11     41 MB    pcsDistrib-samp.susScr11.alpha1.1.pickle
enhLutNer1   46 MB    pcsDistrib-samp.enhLutNer1.alpha1.1.pickle
triMan1      44 MB    pcsDistrib-samp.triMan1.alpha1.1.pickle
macEug2      12 MB    pcsDistrib-samp.macEug2.alpha1.1.pickle
ornAna2      9.4 MB   pcsDistrib-samp.ornAna2.alpha1.1.pickle
aptMan1      7.5 MB   pcsDistrib-samp.aptMan1.alpha1.1.pickle
galGal6      6.4 MB   pcsDistrib-samp.galGal6.alpha1.1.pickle
thaSir1      5.3 MB   pcsDistrib-samp.thaSir1.alpha1.1.pickle
aquChr2      7.2 MB   pcsDistrib-samp.aquChr2.alpha1.1.pickle
melGal5      6.6 MB   pcsDistrib-samp.melGal5.alpha1.1.pickle
xenLae2      4.9 MB   pcsDistrib-samp.xenLae2.alpha1.1.pickle
xenTro10     4.8 MB   pcsDistrib-samp.xenTro10.alpha1.1.pickle
danRer11     3.6 MB   pcsDistrib-samp.danRer11.alpha1.1.pickle
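
The filenames listed above follow the pattern pcsDistrib-samp.[UCSC name].alpha[alpha].pickle, so an output file can be located and opened with the standard pickle module. The sketch below makes two assumptions: the function names are introduced here for illustration, and alpha is passed as a string (e.g. "1.1") because this page does not show how other alpha values are formatted in the filename. The structure of the unpickled object is not documented here, so the demo only round-trips a placeholder.

```python
import os
import pickle
import tempfile


def sampled_output_filename(sp_ucsc_name, alpha_label):
    """Build an output filename following the pattern in the table above."""
    return f"pcsDistrib-samp.{sp_ucsc_name}.alpha{alpha_label}.pickle"


def load_sampled_output(directory, sp_ucsc_name, alpha_label):
    """Unpickle the sampled output for one pairwise alignment."""
    path = os.path.join(directory, sampled_output_filename(sp_ucsc_name, alpha_label))
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Demo with a placeholder object, not real output of the script.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, sampled_output_filename("mm39", "1.1"))
        with open(target, "wb") as handle:
            pickle.dump({"placeholder": True}, handle)
        print(type(load_sampled_output(tmp, "mm39", "1.1")))
```

In practice, directory would be the dirSampEvolTimes path configured in utils/dataset.py. Pickle files can execute code when loaded, so only open files produced by your own runs or by a trusted source.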

Function details

Only the relevant functions are listed below. For more details on any function, check the comments in the source code.

5_sampleEvolTimes.initParallelInputs(my_dataset, prefixTarget, qChrom, alpha, nbcores)
5_sampleEvolTimes.initializeOutputs(chromLst, nbSamplesPerWin)
5_sampleEvolTimes.processIteration(chrom, result, info_saved)
5_sampleEvolTimes.sampleEvolTimes(parallelInput)
5_sampleEvolTimes.sample_pcs(model, solver, ts_sampled, winPcsObs, nbSamplesPerWin)