2_computeWindows module

Computes the PCS size distribution for every window in a chromosome, considering all species present in the dataset.

  • Use:

    python3 2_computeWindows.py -chr [chromosome name]
    
  • Example of Usage (vertebrate dataset with 40 species):

    python3 ~/code/2_computeWindows.py -chr chr16
    
  • Input Parameter (mandatory):

-chr:

Chromosome name in the reference species (e.g. chr1, chr2, …, chrX, chrY).

  • Other Parameters taken from dataset.py:

-refsp_ucscname:

UCSC name of the reference species that is being aligned (e.g. hg38 for human).

-win_size:

Size of windows in base pairs.

-species_UCSC_names:

List of UCSC names of all species included in the dataset (40 vertebrates in our study).

-pcs_dir:

Directory where the PCSs for each pairwise alignment included in the dataset are saved.

-win_dir:

Directory where windows and their PCS size distributions will be saved.

Note

Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.

  • Output:

    One .pickle file for the specified chromosome with a dictionary. In this dictionary, the keys are the UCSC names of the species, while the values are dictionaries that map window coordinates on the chromosome to their PCS size distributions. These distributions are also organized in a dictionary format, where the keys indicate size and the values denote counts.
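
The nested layout described above can be illustrated with a small sketch. The dictionary below is a toy stand-in with invented species entries, window coordinates, and counts; it is not real output, and the helper names count_pcs and total_pcs_size are hypothetical and not part of the module.

```python
import pickle

# Toy stand-in for the content of the output file (all values are invented):
# species UCSC name -> {(begPos, endPos) -> {PCS size -> count}}
toy_output = {
    "mm39": {(0, 1000): {5: 3, 12: 1}, (1000, 2003): {7: 2}},
    "galGal6": {(0, 1000): {5: 1}, (1000, 2003): {}},
}

# Round-trip through pickle, as would happen when the real file is written and read.
windows_by_species = pickle.loads(pickle.dumps(toy_output))


def count_pcs(size_distribution):
    """Number of PCSs recorded in one window."""
    return sum(size_distribution.values())


def total_pcs_size(size_distribution):
    """Sum of the sizes of all PCSs recorded in one window."""
    return sum(size * count for size, count in size_distribution.items())


for species, windows in windows_by_species.items():
    for (beg_pos, end_pos), dist in sorted(windows.items()):
        print(species, beg_pos, end_pos, count_pcs(dist), total_pcs_size(dist))
```

With the real output, the toy dictionary would be replaced by the object returned from pickle.load on the .pickle file of the chromosome.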

Pre-requisites

Before using this script, make sure all the required files were computed:

a) PCSs from each pairwise alignment in the dataset

Make sure to run 1_extractPCS.py for every species included in the attribute speciesUCSCnames declared in the file dataset.py. In our study, this list contains the UCSC names of the 40 vertebrate species used in our analysis, whose data (fasta files and chain files) were retrieved from the UCSC website.

Cluster resources

In case you want to run this Python script stand-alone:

srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash

Otherwise you can use the script ../cluster/2_computeWindows_runAll.py to run this script for all 24 chromosomes (chr1, chr2, …, chrX, chrY).

Time, Memory & Disk space

For reference, we report upper limits on the runtime, memory usage, and disk space required to run this script on the 40-vertebrate dataset examined in our study.

Chromosome   Time (hh:mm:ss)   Memory   Disk
chr1         01:09:13          9GB      0.340GB
chr2         01:29:36          11GB     0.360GB
chr3         00:59:03          7GB      0.280GB
chr4         01:01:40          7GB      0.280GB
chr5         00:54:51          7GB      0.260GB
chr6         01:00:43          5GB      0.230GB
chr7         00:51:09          6GB      0.190GB
chr8         00:47:49          5GB      0.200GB
chr9         00:32:51          3GB      0.150GB
chr10        00:36:37          4GB      0.170GB
chr11        00:29:28          5GB      0.190GB
chr12        00:30:16          5GB      0.190GB
chr13        00:20:57          3GB      0.140GB
chr14        00:19:24          3GB      0.140GB
chr15        00:17:15          3GB      0.120GB
chr16        00:17:42          3GB      0.120GB
chr17        00:18:00          3GB      0.100GB
chr18        00:21:40          3GB      0.110GB
chr19        00:12:39          2GB      0.080GB
chr20        00:18:12          3GB      0.100GB
chr21        00:10:03          2GB      0.060GB
chr22        00:07:28          1GB      0.050GB
chrX         00:35:28          5GB      0.200GB
chrY         00:04:18          1GB      0.030GB

Time per Run: Details

Runtime of a single run (chr16, 40 vertebrate species): ~15 minutes.

More details on computational time can be found in the log of the run.

Step                              Time (s)
Merging PCSs                      715.09147525
Computing windows                 0.57280803
Computing PCS size distribution   172.84002709
Total time                        902.55868173

Storage per Run: Details

Size of output files with all windows of one chromosome (chr16): ~134 MB.

Output files                     Size
hg38.chr16.1000.windows.pickle   119M
hg38.chr16.mergedPCSs.pickle     15M

Note

The temporary file hg38.chr16.mergedPCSs.pickle can be removed after the output file hg38.chr16.1000.windows.pickle is successfully computed.

Function details

Only the most relevant functions are documented below. For more details on any function, check the comments in the source code.

2_computeWindows.checkInputFiles(qChrom, my_dataset)
2_computeWindows.computeWindows(pcs_lst, windowSize)

This function computes the coordinates of windows in the human/reference genome. The window coordinates are defined so that no PCS found in any species of the dataset is disrupted, i.e. split across two windows.

returns:

A list of window coordinates, each approximately equal to the specified windowSize (either exactly that size or slightly larger). The coordinates are formatted as [begPos, endPos), where endPos is excluded from the interval.
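
One way to obtain windows with this property is sketched below. It is a minimal illustration under stated assumptions, not the module's implementation: PCSs are assumed to be sorted, non-overlapping, half-open intervals, the field names of the Pcs named tuple are hypothetical, and windows are assumed to start at position 0 and to continue until the last PCS is covered.

```python
from collections import namedtuple

# Hypothetical field names; the real Pcs named tuple may differ.
Pcs = namedtuple("Pcs", ["begPos", "endPos"])


def compute_windows_sketch(pcs_lst, window_size):
    """Return half-open windows [begPos, endPos) that never cut through a PCS.

    Assumes pcs_lst is sorted by position and its PCSs do not overlap.
    """
    if not pcs_lst:
        return []
    last_end = pcs_lst[-1].endPos
    windows = []
    beg = 0
    i = 0
    while beg < last_end:
        end = beg + window_size
        # Skip PCSs that lie entirely before the tentative window end.
        while i < len(pcs_lst) and pcs_lst[i].endPos <= end:
            i += 1
        # If the next PCS straddles the tentative end, stretch the window
        # so that it finishes where that PCS finishes.
        if i < len(pcs_lst) and pcs_lst[i].begPos < end:
            end = pcs_lst[i].endPos
            i += 1
        windows.append((beg, end))
        beg = end
    return windows


example_pcs = [Pcs(10, 20), Pcs(95, 110), Pcs(250, 260)]
print(compute_windows_sketch(example_pcs, 100))
```

In this toy input, the PCS at [95, 110) would be cut by a boundary at position 100, so the first window is stretched to end at 110, which is why windows can be slightly larger than the requested size.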

2_computeWindows.consistencyCheck(pcs_lst)
2_computeWindows.distribPCS_all(qChrom, win_lst, my_dataset)

This method computes the PCS size distribution for every window in a chromosome, considering all species present in the dataset.

2_computeWindows.distribPCS_single(win_lst, pcs_lst)

This method computes the PCS size distribution for each window given in a list.

Parameters:
  • win_lst – List of positions representing non-overlapping consecutive windows. Each position is a tuple in the format (begPos, endPos), where begPos and endPos are positive integers indicating absolute positions on the chromosome. The list is sorted in ascending order by these positions.

  • pcs_lst (list of named tuples Pcs) – List of PCSs, containing their positions on the chromosome. The list should be sorted by position.

Returns:

A dictionary that maps tuples of window coordinates (begPos, endPos) on the chromosome to another dictionary. This inner dictionary contains the distribution of PCS sizes, with PCS sizes as keys and their respective counts as values.
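
Because both lists are sorted, the distribution can be computed in a single sweep. The sketch below illustrates this idea only; it assumes hypothetical field names for Pcs, half-open coordinates, that the size of a PCS is endPos minus begPos, and that every PCS lies entirely inside one window.

```python
from collections import Counter, namedtuple

# Hypothetical field names; the real Pcs named tuple may differ.
Pcs = namedtuple("Pcs", ["begPos", "endPos"])


def distrib_pcs_single_sketch(win_lst, pcs_lst):
    """Map each window (begPos, endPos) to a {PCS size: count} dictionary."""
    distrib = {}
    i = 0
    for beg, end in win_lst:
        counts = Counter()
        # Ignore PCSs that start before this window (not expected when
        # windows were built to respect PCS boundaries).
        while i < len(pcs_lst) and pcs_lst[i].begPos < beg:
            i += 1
        # Count every PCS that ends inside the current window.
        while i < len(pcs_lst) and pcs_lst[i].endPos <= end:
            pcs = pcs_lst[i]
            counts[pcs.endPos - pcs.begPos] += 1
            i += 1
        distrib[(beg, end)] = dict(counts)
    return distrib


example_windows = [(0, 110), (110, 210)]
example_pcs = [Pcs(10, 20), Pcs(30, 40), Pcs(95, 110), Pcs(150, 175)]
print(distrib_pcs_single_sketch(example_windows, example_pcs))
```

The index i only moves forward, so the sweep visits each PCS once regardless of the number of windows.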

2_computeWindows.isContiguous(pcs1, pcs2)
2_computeWindows.isOverlap(pcs1, pcs2)
2_computeWindows.mergePCS_all(qChrom, my_dataset)
2_computeWindows.mergePCS_pairwise(pcs_lst_cur, pcs_lst_new)

This function adds a list of new PCSs to the current list of already-merged PCSs.

A new PCS can be:

  • Discarded, if it is encompassed by another PCS in the list;

  • Added, if its position does not overlap any PCS position in the list;

  • Merged, if its position overlaps one or more PCSs in the list.
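
The three outcomes listed above correspond to taking the union of two sets of intervals. The sketch below shows that outcome with a plain sort-and-sweep, which is a different technique from the incremental insertion performed by the module's helper functions. It assumes PCSs are represented as half-open (begPos, endPos) tuples and that intervals which merely touch are not merged; both are assumptions made for illustration.

```python
def merge_pcs_sketch(pcs_lst_cur, pcs_lst_new):
    """Return the sorted, non-overlapping union of two lists of intervals."""
    merged = []
    for beg, end in sorted(pcs_lst_cur + pcs_lst_new):
        if merged and beg < merged[-1][1]:
            # Overlap with the last kept interval: extend it if needed.
            # An interval fully encompassed by it leaves it unchanged,
            # which corresponds to the "discarded" case.
            last_beg, last_end = merged[-1]
            merged[-1] = (last_beg, max(last_end, end))
        else:
            # No overlap: the interval is simply added.
            merged.append((beg, end))
    return merged


current = [(10, 50), (100, 120)]
new = [(20, 30), (60, 70), (110, 130)]
print(merge_pcs_sketch(current, new))
```

In this toy input, (20, 30) is discarded because (10, 50) encompasses it, (60, 70) is added as is, and (110, 130) is merged with (100, 120).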

2_computeWindows.mergePCS_pairwise_check(pcs_lst_cur, idx_to_add)

This function checks whether the new PCS was properly added to the list. Two constraints are checked:

  1. The new PCS should not overlap the previous or the next PCS;

  2. The list must keep its property of being sorted by position.
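
A check of these two constraints could look like the sketch below. It is only an illustration: PCSs are assumed to be half-open (begPos, endPos) tuples, and the function name and boolean return value are hypothetical choices.

```python
def is_properly_inserted(pcs_lst, idx):
    """Check the neighbours of the PCS at index idx.

    Returns True if the PCS overlaps neither neighbour and the list
    is still sorted by begin position around idx.
    """
    beg, end = pcs_lst[idx]
    if idx > 0:
        prev_beg, prev_end = pcs_lst[idx - 1]
        # The previous PCS must start no later and end no later than beg.
        if prev_beg > beg or prev_end > beg:
            return False
    if idx + 1 < len(pcs_lst):
        next_beg, _ = pcs_lst[idx + 1]
        # The next PCS must start no earlier than beg and no earlier than end.
        if next_beg < beg or next_beg < end:
            return False
    return True


print(is_properly_inserted([(10, 20), (25, 30), (40, 50)], 1))
print(is_properly_inserted([(10, 20), (15, 30), (40, 50)], 1))
```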

2_computeWindows.mergePCS_pairwise_findPos(p_new, pcs_lst_cur, idx_to_add)

This function determines the index at which to insert the new PCS while keeping the list sorted and the PCSs non-overlapping.
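
For a list that is already sorted and non-overlapping, such an index can be found with a binary search. The sketch below uses the standard bisect module as an illustration; it is not necessarily how the module searches, and it assumes PCSs are (begPos, endPos) tuples and that the new PCS overlaps no existing PCS.

```python
import bisect


def find_insert_index_sketch(p_new, pcs_lst_cur):
    """Return the index at which p_new keeps pcs_lst_cur sorted by begPos."""
    begin_positions = [beg for beg, _ in pcs_lst_cur]
    return bisect.bisect_left(begin_positions, p_new[0])


pcs_lst = [(10, 20), (40, 50), (80, 90)]
p_new = (55, 60)
idx = find_insert_index_sketch(p_new, pcs_lst)
pcs_lst.insert(idx, p_new)
print(idx, pcs_lst)
```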

2_computeWindows.mergePCS_pairwise_updLst(p_new, pcs_lst_cur, idx_to_add, nb_pcs_merged)

This function updates the PCS list with a new PCS (if needed).

2_computeWindows.printPCS(p)