1_extractPCS module

Reads a chain file (extension ‘.all.chain’) downloaded from the UCSC website and outputs one pickle file per chromosome with all PCSs of a pair of species.

To make sure that PCSs are being properly extracted, this script also needs the FASTA files of both species, which can be downloaded in the UCSC website.

  • Use:

    python3 1_extractPCS.py -sp_ucsc_name [UCSC name]
    
  • Example of Usage (human (reference genome) and mouse):

    python3 ~/code/1_extractPCS.py -sp_ucsc_name mm39
    
  • Input Parameter (mandatory):

-sp_ucsc_name:

UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).

  • Other Parameters taken from dataset.py:

refsp_ucscName:

UCSC name of the reference species that is being aligned (e.g. hg38 for human).

chain_file:

file with extension .all.chain downloaded from UCSC (e.g.: hg38.papAnu4.all.chain.gz)

refsp_fastadir:

directory with FASTA files following the format genomeName.chrName.bXXXeXXX.fa (e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.

othsp_fastadir:

directory with FASTA files following the format genomeName.chrName.bXXXeXXX.fa (e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.

pcs_dir:

Directory where files with PCSs are going to be saved.

pcs_minsize:

minimum size (nb. of base pairs) for the PCS to be considered.

Note

Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.

  • Output:

    A separate .pickle file for each chromosome, each including all PCSs positioned on the corresponding chromosome of the reference genome. The pickle file saves a list of Pcs named tuples, where each tuple has the size of a PCS and its coordinates in both genomes.

Note

Required Directory Structure for FASTA files: a single FASTA file from UCSC (e.g. mm39.fa) must be split in many files, following the pattern genomeName.chrName.bXXXeXXX.fa (see Pre-Requisites below).

Pre-requisites

Before using this script, make sure all the required files were downloaded:

a) Chain files from UCSC

1. Download the “chain” file for the analyzed pairwise species (chained lastz alignments) in the UCSC website. For example, for human and mouse (hg38 and mm39), go to:

https://hgdownload.soe.ucsc.edu/goldenPath/mm39/vsHg38/

and download the file mm39.hg38.all.chain.gz.

  1. Unzip the download file:

    gzip -d mm39.hg38.all.chain.gz
    

3. The full directory of the unziped chain file (e.g. mm39.hg38.all.chain) will be given to this script through the parameter -chain_file.

b) FASTA files from UCSC

4. Download the FASTA files for the analyzed pairwise species in the UCSC website (e.g. mm39.fa.gz). The FASTA files are needed to validate the PCS coordinates obtained from the chain file.

5. Run the script breakFasta.py using the unziped FASTA file (e.g. mm39.fa). This script will break the unziped FASTA file into smaller files, following the pattern:

genomeName.chrName.bXXXeXXX.fa
  1. These files must be saved in a directory structure organized as follows:

    genomeName/chr/chrName/genomeName.chrName.bXXXeXXX.fa
    

Cluster resources

In case you want to run this Python script stand-alone in a cluster that uses Slurm to manage jobs:

srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash

Otherwise you can use the script ../cluster/1_extractPCS_runAll.py to run this script for all 40 species used in this study.

Time, Memory & Disk space

For reference, here we include an upper limit on runtime, memory usage, and disk space required for running this script on each pairwise alignment of the 40 vertebrate dataset examined in our study.

UCSC name

Time

Memory

Disk

panPan3

01:01:19

5GB

0.95GB

panTro6

00:59:09

6GB

0.98GB

gorGor6

00:50:02

5GB

1.14GB

ponAbe3

01:20:13

5GB

1.98GB

papAnu4

06:27:13

38GB

2.76GB

macFas5

02:42:13

20GB

2.71GB

rhiRox1

03:29:32

23GB

2.98GB

chlSab2

00:59:52

6GB

2.99GB

nasLar1

04:26:24

15GB

2.25GB

rheMac10

01:01:55

6GB

2.80GB

calJac4

02:50:19

36GB

3.14GB

tarSyr2

04:57:09

42GB

2.62GB

micMur2

02:24:55

17GB

2.25GB

galVar1

09:48:31

40GB

2.56GB

mm39

00:58:45

9GB

1.10GB

oryCun2

01:23:21

19GB

1.67GB

rn7

01:27:01

24GB

1.07GB

vicPac2

01:23:54

17GB

2.19GB

bisBis1

02:50:09

24GB

1.86GB

felCat9

01:57:19

28GB

2.19GB

manPen1

03:18:57

18GB

1.99GB

bosTau9

01:10:49

15GB

1.59GB

canFam6

01:00:37

18GB

2.07GB

musFur1

01:00:55

17GB

2.21GB

neoSch1

01:20:42

19GB

2.46GB

equCab3

01:30:38

17GB

2.60GB

myoLuc2

01:46:55

20GB

1.60GB

susScr11

01:20:55

18GB

1.93GB

enhLutNer1

01:27:55

18GB

2.26GB

triMan1

01:20:32

20GB

2.16GB

macEug2

05:58:51

18GB

0.22GB

ornAna2

01:20:04

13GB

0.17GB

aptMan1

00:45:15

3GB

0.12GB

galGal6

00:45:15

3GB

0.09GB

thaSir1

01:30:08

5GB

0.07GB

aquChr2

00:30:03

3GB

0.12GB

melGal5

01:15:00

3GB

0.09GB

xenLae2

01:15:46

6GB

0.06GB

xenTro10

02:40:30

40GB

0.06GB

danRer11

04:08:26

14GB

0.04GB

Time per Run: Details

Stats on time of a single run (human-mouse alignment, whole-genome): ~65 minutes Details on computational time are available in the log of the run.

Step

Time (s)

Reading chains

164.45332742

Compatible chains

236.70236850

PCSs from chr1

246.49562716

PCSs from chr2

283.03650451

PCSs from chr3

218.50386620

PCSs from chr4

198.70389938

PCSs from chr5

206.89367223

PCSs from chr6

199.71864271

PCSs from chr7

204.75429463

PCSs from chr8

199.65774012

PCSs from chr9

150.57288957

PCSs from chr10

173.72448373

PCSs from chr11

156.16801929

PCSs from chr12

168.79578829

PCSs from chr13

113.74493265

PCSs from chr14

93.78934121

PCSs from chr15

99.19000816

PCSs from chr16

91.97542357

PCSs from chr17

85.33002925

PCSs from chr18

83.99309540

PCSs from chr19

102.53493857

PCSs from chr20

72.90905333

PCSs from chr21

37.42282581

PCSs from chr22

44.24664927

PCSs from chrX

223.61548066

PCSs from chrY

21.73898244

Total time

3949.23640776

Storage per Run: Details

Size of output files with PCSs for the human and mouse alignment (whole-genome): ~1085 MB.

Output file

Size

hg38.mm39.chr1.pcs.pickle

98M

hg38.mm39.chr2.pcs.pickle

104M

hg38.mm39.chr3.pcs.pickle

72M

hg38.mm39.chr4.pcs.pickle

59M

hg38.mm39.chr5.pcs.pickle

74M

hg38.mm39.chr6.pcs.pickle

66M

hg38.mm39.chr7.pcs.pickle

56M

hg38.mm39.chr8.pcs.pickle

55M

hg38.mm39.chr9.pcs.pickle

44M

hg38.mm39.chr10.pcs.pickle

46M

hg38.mm39.chr11.pcs.pickle

65M

hg38.mm39.chr12.pcs.pickle

50M

hg38.mm39.chr13.pcs.pickle

33M

hg38.mm39.chr14.pcs.pickle

39M

hg38.mm39.chr15.pcs.pickle

34M

hg38.mm39.chr16.pcs.pickle

30M

hg38.mm39.chr17.pcs.pickle

24M

hg38.mm39.chr18.pcs.pickle

30M

hg38.mm39.chr19.pcs.pickle

14M

hg38.mm39.chr20.pcs.pickle

28M

hg38.mm39.chr21.pcs.pickle

12M

hg38.mm39.chr22.pcs.pickle

11M

hg38.mm39.chrX.pcs.pickle

41M

hg38.mm39.chrY.pcs.pickle

387K

Genome Browser

Here there are instructions in case you wish to check the corresponding DNA sequences of a PCS in UCSC’s Genome Browser. Take for example the following PCS found in the pairwise alignment between human (hg38) and bonobo (panPan3):

Pcs(size=45, tChrom='chr10', tStrand='+', tPosBeg=87682, qChrom='chr16', qStrand='-', qPosBeg=21852)

Here, qChrom, qStrand, and qPosBeg correspond to the chromosome, strand, and starting coordinate in the query genome. In our dataset, the query genome is always the human genome (hg38). Similarly, tChrom, tStrand, and tPosBeg refer to the same information in the target genome (panPan3 in this example).

To compute the coordinates, first add +1 to tPosBeg and qPosBeg, as the coordinates in Genome Browser start at 1, whereas here they start at 0.

Next, by summing the size of the PCS (size=45) to the initial coordinates (tPosBeg+1 and qPosBeg+1), it is possible to retrieve the DNA sequence in Genome Browser for both genomes.

For this example, by accessing:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38

and typing chr16:21853-21897, we obtain:

TAATGAGTGTACTGCTCCTAAACAAGGAGTATCTGCATTTATTTA

If it is indicated that sequence should be read in the negative strand (qStrand=’-‘), the reverse complement should be computed:

TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA

Similarly for the bonobo genome, by accessing:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=panPan3

and typing chr10:87683-87727, we obtain:

TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA

Function details

Only relevant functions have been documented below. For more details on any function, check the comments in the souce code.

class 1_extractPCS.PcsSingleCheck(desc, cntUB, pcsSingleCheckFunc)

Bases: object

This class makes sure that PCSs are being correctly processed. For every new PCS, it determines whether the PCS is a good candidate to be checked into details. Good candidates are long PCSs, PCSs in the reverse strand of one of the genomes, among others. It keeps a history of how many PCSs have been checked so far.

Parameters:
  • desc (str) – Brief description of which type of check is being performed.

  • cntUB (int) – Maximum number of PCSs that will be checked in details.

  • pcsSingleCheckFunc (function) – Function that receives a named tuple PCS and returns True (PCS should be checked) or False.

checkPCS(pcs)
print()
reset()
1_extractPCS.checkPCSrevQ(pcs)
1_extractPCS.checkPCSrevT(pcs)
1_extractPCS.checkPCSsize(pcs)
1_extractPCS.checkSpecies(dataset, ucscName)
1_extractPCS.computePCScoords(begAbs, posRel, isRevComp, sizePCS, sizeMis)
1_extractPCS.computePos(strand, size, start, end)
1_extractPCS.initializePCSchecks(cntUB=100, prefixTarget='tar', prefixQuery='qry')
1_extractPCS.isCompatible(pos, pos_lst)

Check if a sequence is compatible (non-intersecting) to a list of sequences.

1_extractPCS.performPCSchecks(pcs, checks)
1_extractPCS.readChains(chainsFilename)

This function parses the chains from a chain file (check the UCSC website for more information on the format).

1_extractPCS.readDNA(dirFASTA, prefix, chrName, strand, posBeg, posEnd, debug=False)

This function reads a DNA segment in a FASTA file.

1_extractPCS.reverseComplement(seq)

This function returns a reverse complement of a DNA.