1_extractPCS module

Reads a chain file (extension ‘.all.chain’) downloaded from the UCSC website and outputs one pickle file per chromosome with all PCSs of a pair of species.

To make sure that PCSs are being properly extracted, this script also needs the FASTA files of both species, which can be downloaded in the UCSC website.

Use:

python3 1_extractPCS.py -sp_ucsc_name [UCSC name]

Example of Usage (human (reference genome) and mouse):

python3 ~/code/1_extractPCS.py -sp_ucsc_name mm39

Input Parameter (mandatory):

-sp_ucsc_name:: UCSC name of the species that is being aligned with the reference species (e.g. mm39 for mouse).

Other Parameters taken from dataset.py:

refsp_ucscName:: UCSC name of the reference species that is being aligned (e.g. hg38 for human).
chain_file:: file with extension .all.chain downloaded from UCSC (e.g.: hg38.papAnu4.all.chain.gz)
refsp_fastadir:: directory with FASTA files following the format genomeName.chrName.bXXXeXXX.fa (e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.
othsp_fastadir:: directory with FASTA files following the format genomeName.chrName.bXXXeXXX.fa (e.g. mm39.chrY.b51040000e51060000.fa). FASTA files can be downloaded in the UCSC website, and then split into smaller files for more efficient reading.
pcs_dir:: Directory where files with PCSs are going to be saved.
pcs_minsize:: minimum size (nb. of base pairs) for the PCS to be considered.

Note

Make sure that the required parameters described above are correctly defined in the file utils/dataset.py.

Output:
A separate .pickle file for each chromosome, each including all PCSs positioned on the corresponding chromosome of the reference genome. The pickle file saves a list of Pcs named tuples, where each tuple has the size of a PCS and its coordinates in both genomes.

Note

Required Directory Structure for FASTA files: a single FASTA file from UCSC (e.g. mm39.fa) must be split in many files, following the pattern genomeName.chrName.bXXXeXXX.fa (see Pre-Requisites below).

Pre-requisites

Before using this script, make sure all the required files were downloaded:

a) Chain files from UCSC

1. Download the “chain” file for the analyzed pairwise species (chained lastz alignments) in the UCSC website. For example, for human and mouse (hg38 and mm39), go to:

https://hgdownload.soe.ucsc.edu/goldenPath/mm39/vsHg38/

and download the file mm39.hg38.all.chain.gz.

Unzip the download file:
```
gzip -d mm39.hg38.all.chain.gz
```

3. The full directory of the unziped chain file (e.g. mm39.hg38.all.chain) will be given to this script through the parameter -chain_file.

b) FASTA files from UCSC

4. Download the FASTA files for the analyzed pairwise species in the UCSC website (e.g. mm39.fa.gz). The FASTA files are needed to validate the PCS coordinates obtained from the chain file.

5. Run the script breakFasta.py using the unziped FASTA file (e.g. mm39.fa). This script will break the unziped FASTA file into smaller files, following the pattern:

genomeName.chrName.bXXXeXXX.fa

These files must be saved in a directory structure organized as follows:
```
genomeName/chr/chrName/genomeName.chrName.bXXXeXXX.fa
```

Cluster resources

In case you want to run this Python script stand-alone in a cluster that uses Slurm to manage jobs:

srun -p compute -t 2:00:00 --mem 20G --nodes=1 --ntasks=1 --cpus-per-task=1 --pty bash

Otherwise you can use the script ../cluster/1_extractPCS_runAll.py to run this script for all 40 species used in this study.

Time, Memory & Disk space

For reference, here we include an upper limit on runtime, memory usage, and disk space required for running this script on each pairwise alignment of the 40 vertebrate dataset examined in our study.

UCSC name	Time	Memory	Disk
panPan3	01:01:19	5GB	0.95GB
panTro6	00:59:09	6GB	0.98GB
gorGor6	00:50:02	5GB	1.14GB
ponAbe3	01:20:13	5GB	1.98GB
papAnu4	06:27:13	38GB	2.76GB
macFas5	02:42:13	20GB	2.71GB
rhiRox1	03:29:32	23GB	2.98GB
chlSab2	00:59:52	6GB	2.99GB
nasLar1	04:26:24	15GB	2.25GB
rheMac10	01:01:55	6GB	2.80GB
calJac4	02:50:19	36GB	3.14GB
tarSyr2	04:57:09	42GB	2.62GB
micMur2	02:24:55	17GB	2.25GB
galVar1	09:48:31	40GB	2.56GB
mm39	00:58:45	9GB	1.10GB
oryCun2	01:23:21	19GB	1.67GB
rn7	01:27:01	24GB	1.07GB
vicPac2	01:23:54	17GB	2.19GB
bisBis1	02:50:09	24GB	1.86GB
felCat9	01:57:19	28GB	2.19GB
manPen1	03:18:57	18GB	1.99GB
bosTau9	01:10:49	15GB	1.59GB
canFam6	01:00:37	18GB	2.07GB
musFur1	01:00:55	17GB	2.21GB
neoSch1	01:20:42	19GB	2.46GB
equCab3	01:30:38	17GB	2.60GB
myoLuc2	01:46:55	20GB	1.60GB
susScr11	01:20:55	18GB	1.93GB
enhLutNer1	01:27:55	18GB	2.26GB
triMan1	01:20:32	20GB	2.16GB
macEug2	05:58:51	18GB	0.22GB
ornAna2	01:20:04	13GB	0.17GB
aptMan1	00:45:15	3GB	0.12GB
galGal6	00:45:15	3GB	0.09GB
thaSir1	01:30:08	5GB	0.07GB
aquChr2	00:30:03	3GB	0.12GB
melGal5	01:15:00	3GB	0.09GB
xenLae2	01:15:46	6GB	0.06GB
xenTro10	02:40:30	40GB	0.06GB
danRer11	04:08:26	14GB	0.04GB

Time per Run: Details

Stats on time of a single run (human-mouse alignment, whole-genome): ~65 minutes Details on computational time are available in the log of the run.

Step	Time (s)
Reading chains	164.45332742
Compatible chains	236.70236850
PCSs from chr1	246.49562716
PCSs from chr2	283.03650451
PCSs from chr3	218.50386620
PCSs from chr4	198.70389938
PCSs from chr5	206.89367223
PCSs from chr6	199.71864271
PCSs from chr7	204.75429463
PCSs from chr8	199.65774012
PCSs from chr9	150.57288957
PCSs from chr10	173.72448373
PCSs from chr11	156.16801929
PCSs from chr12	168.79578829
PCSs from chr13	113.74493265
PCSs from chr14	93.78934121
PCSs from chr15	99.19000816
PCSs from chr16	91.97542357
PCSs from chr17	85.33002925
PCSs from chr18	83.99309540
PCSs from chr19	102.53493857
PCSs from chr20	72.90905333
PCSs from chr21	37.42282581
PCSs from chr22	44.24664927
PCSs from chrX	223.61548066
PCSs from chrY	21.73898244
Total time	3949.23640776

Storage per Run: Details

Size of output files with PCSs for the human and mouse alignment (whole-genome): ~1085 MB.

Output file	Size
hg38.mm39.chr1.pcs.pickle	98M
hg38.mm39.chr2.pcs.pickle	104M
hg38.mm39.chr3.pcs.pickle	72M
hg38.mm39.chr4.pcs.pickle	59M
hg38.mm39.chr5.pcs.pickle	74M
hg38.mm39.chr6.pcs.pickle	66M
hg38.mm39.chr7.pcs.pickle	56M
hg38.mm39.chr8.pcs.pickle	55M
hg38.mm39.chr9.pcs.pickle	44M
hg38.mm39.chr10.pcs.pickle	46M
hg38.mm39.chr11.pcs.pickle	65M
hg38.mm39.chr12.pcs.pickle	50M
hg38.mm39.chr13.pcs.pickle	33M
hg38.mm39.chr14.pcs.pickle	39M
hg38.mm39.chr15.pcs.pickle	34M
hg38.mm39.chr16.pcs.pickle	30M
hg38.mm39.chr17.pcs.pickle	24M
hg38.mm39.chr18.pcs.pickle	30M
hg38.mm39.chr19.pcs.pickle	14M
hg38.mm39.chr20.pcs.pickle	28M
hg38.mm39.chr21.pcs.pickle	12M
hg38.mm39.chr22.pcs.pickle	11M
hg38.mm39.chrX.pcs.pickle	41M
hg38.mm39.chrY.pcs.pickle	387K

Genome Browser

Here there are instructions in case you wish to check the corresponding DNA sequences of a PCS in UCSC’s Genome Browser. Take for example the following PCS found in the pairwise alignment between human (hg38) and bonobo (panPan3):

Pcs(size=45, tChrom='chr10', tStrand='+', tPosBeg=87682, qChrom='chr16', qStrand='-', qPosBeg=21852)

Here, qChrom, qStrand, and qPosBeg correspond to the chromosome, strand, and starting coordinate in the query genome. In our dataset, the query genome is always the human genome (hg38). Similarly, tChrom, tStrand, and tPosBeg refer to the same information in the target genome (panPan3 in this example).

To compute the coordinates, first add +1 to tPosBeg and qPosBeg, as the coordinates in Genome Browser start at 1, whereas here they start at 0.

Next, by summing the size of the PCS (size=45) to the initial coordinates (tPosBeg+1 and qPosBeg+1), it is possible to retrieve the DNA sequence in Genome Browser for both genomes.

For this example, by accessing:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38

and typing chr16:21853-21897, we obtain:

TAATGAGTGTACTGCTCCTAAACAAGGAGTATCTGCATTTATTTA

If it is indicated that sequence should be read in the negative strand (qStrand=’-‘), the reverse complement should be computed:

TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA

Similarly for the bonobo genome, by accessing:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=panPan3

and typing chr10:87683-87727, we obtain:

TAAATAAATGCAGATACTCCTTGTTTAGGAGCAGTACACTCATTA

Function details

Only relevant functions have been documented below. For more details on any function, check the comments in the souce code.

class 1_extractPCS.PcsSingleCheck(desc, cntUB, pcsSingleCheckFunc)

Bases: object

This class makes sure that PCSs are being correctly processed. For every new PCS, it determines whether the PCS is a good candidate to be checked into details. Good candidates are long PCSs, PCSs in the reverse strand of one of the genomes, among others. It keeps a history of how many PCSs have been checked so far.

Parameters:

desc (str) – Brief description of which type of check is being performed.
cntUB (int) – Maximum number of PCSs that will be checked in details.
pcsSingleCheckFunc (function) – Function that receives a named tuple PCS and returns True (PCS should be checked) or False.

checkPCS(pcs)

print()

reset()

1_extractPCS.checkPCSrevQ(pcs)

1_extractPCS.checkPCSrevT(pcs)

1_extractPCS.checkPCSsize(pcs)

1_extractPCS.checkSpecies(dataset, ucscName)

1_extractPCS.computePCScoords(begAbs, posRel, isRevComp, sizePCS, sizeMis)

1_extractPCS.computePos(strand, size, start, end)

1_extractPCS.initializePCSchecks(cntUB=100, prefixTarget='tar', prefixQuery='qry')

1_extractPCS.isCompatible(pos, pos_lst): Check if a sequence is compatible (non-intersecting) to a list of sequences.

1_extractPCS.performPCSchecks(pcs, checks)

1_extractPCS.readChains(chainsFilename): This function parses the chains from a chain file (check the UCSC website for more information on the format).

1_extractPCS.readDNA(dirFASTA, prefix, chrName, strand, posBeg, posEnd, debug=False): This function reads a DNA segment in a FASTA file.

1_extractPCS.reverseComplement(seq): This function returns a reverse complement of a DNA.