Split VCF into n pieces by coordinate

(Re)edited: 2025-10-24

1 Introduction

vcf is short for variant call format. It is a text format which stores differences from the “reference” genome.

bcftools is a core command line tool for parsing and editing vcf files.

bcftools view -r 1:40000-50000 vcf.gz will output (to stdout) a vcf containing the header and the variants on chromosome 1 between coordinates 40,000 and 50,000 base pairs. The -r option requires the vcf to be bgzip-compressed and indexed.

I need to break down a large vcf into smaller pieces to dramatically speed up annotation. Let’s try 100 pieces.

The human genome is approximately 3 gigabases or 3e9 base pairs.

\[ \frac{3 \times 10^9\ \text{base pairs}}{100\ \text{pieces}} = 3 \times 10^7\ \text{base pairs per piece} \]

That’s our target size.

This is a bit tricky because the genome is laid out by chromosome, so we have to break each chromosome into pieces of roughly 3e7 base pairs. There are also many contigs, most of which are well under 3e7 in size. Those can be processed as a single group with bcftools by listing them separated by commas.

Let’s read in the header. It contains chromosome (and contig) sizes, which I’ve extracted from the vcf with zcat EGAD00001002656.GATK.vcf.gz | head -n 1000 | grep ^## > /home/mcgaugheyd/git/OGVFB_one_offs/mcgaughey/split_VCFs_into_n_pieces/EGAD00001002656.header
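The head -n 1000 cutoff assumes the header is shorter than 1,000 lines. As a hedged alternative, the sketch below collects the ## lines directly in R and stops at the first chunk that contains a non-header line. read_vcf_meta and its chunk argument are hypothetical names (not part of the original workflow), and the sketch assumes R's gzfile() can read the compressed vcf.

```r
# Hypothetical helper: collect the "##" meta lines from a (gzipped) vcf.
read_vcf_meta <- function(path, chunk = 1000L) {
  con <- gzfile(path, "r")   # gzfile() also reads uncompressed text
  on.exit(close(con))
  meta <- character()
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0) break          # end of file
    is_meta <- startsWith(lines, "##")
    meta <- c(meta, lines[is_meta])
    if (!all(is_meta)) break               # reached #CHROM or variant lines
  }
  meta
}

# Small self-made file, only to show the call
demo_vcf <- tempfile(fileext = ".vcf.gz")
con <- gzfile(demo_vcf, "w")
writeLines(c("##fileformat=VCFv4.2",
             "##contig=<ID=1,length=249250621,assembly=b37>",
             "#CHROM\tPOS\tID\tREF\tALT",
             "1\t100\t.\tA\tG"), con)
close(con)
demo_meta <- read_vcf_meta(demo_vcf)
```

Because it returns whole lines, header entries that contain spaces stay intact.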

2 Read in vcf header

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(stringr)
vcf_header = scan('../data/EGAD00001002656.header', what='character')
vcf_header[grepl('contig',vcf_header)]

 [1] "##contig=<ID=1,length=249250621,assembly=b37>"      
 [2] "##contig=<ID=2,length=243199373,assembly=b37>"      
 [3] "##contig=<ID=3,length=198022430,assembly=b37>"      
 [4] "##contig=<ID=4,length=191154276,assembly=b37>"      
 [5] "##contig=<ID=5,length=180915260,assembly=b37>"      
 [6] "##contig=<ID=6,length=171115067,assembly=b37>"      
 [7] "##contig=<ID=7,length=159138663,assembly=b37>"      
 [8] "##contig=<ID=8,length=146364022,assembly=b37>"      
 [9] "##contig=<ID=9,length=141213431,assembly=b37>"      
[10] "##contig=<ID=10,length=135534747,assembly=b37>"     
[11] "##contig=<ID=11,length=135006516,assembly=b37>"     
[12] "##contig=<ID=12,length=133851895,assembly=b37>"     
[13] "##contig=<ID=13,length=115169878,assembly=b37>"     
[14] "##contig=<ID=14,length=107349540,assembly=b37>"     
[15] "##contig=<ID=15,length=102531392,assembly=b37>"     
[16] "##contig=<ID=16,length=90354753,assembly=b37>"      
[17] "##contig=<ID=17,length=81195210,assembly=b37>"      
[18] "##contig=<ID=18,length=78077248,assembly=b37>"      
[19] "##contig=<ID=19,length=59128983,assembly=b37>"      
[20] "##contig=<ID=20,length=63025520,assembly=b37>"      
[21] "##contig=<ID=21,length=48129895,assembly=b37>"      
[22] "##contig=<ID=22,length=51304566,assembly=b37>"      
[23] "##contig=<ID=X,length=155270560,assembly=b37>"      
[24] "##contig=<ID=Y,length=59373566,assembly=b37>"       
[25] "##contig=<ID=MT,length=16569,assembly=b37>"         
[26] "##contig=<ID=GL000207.1,length=4262,assembly=b37>"  
[27] "##contig=<ID=GL000226.1,length=15008,assembly=b37>" 
[28] "##contig=<ID=GL000229.1,length=19913,assembly=b37>" 
[29] "##contig=<ID=GL000231.1,length=27386,assembly=b37>" 
[30] "##contig=<ID=GL000210.1,length=27682,assembly=b37>" 
[31] "##contig=<ID=GL000239.1,length=33824,assembly=b37>" 
[32] "##contig=<ID=GL000235.1,length=34474,assembly=b37>" 
[33] "##contig=<ID=GL000201.1,length=36148,assembly=b37>" 
[34] "##contig=<ID=GL000247.1,length=36422,assembly=b37>" 
[35] "##contig=<ID=GL000245.1,length=36651,assembly=b37>" 
[36] "##contig=<ID=GL000197.1,length=37175,assembly=b37>" 
[37] "##contig=<ID=GL000203.1,length=37498,assembly=b37>" 
[38] "##contig=<ID=GL000246.1,length=38154,assembly=b37>" 
[39] "##contig=<ID=GL000249.1,length=38502,assembly=b37>" 
[40] "##contig=<ID=GL000196.1,length=38914,assembly=b37>" 
[41] "##contig=<ID=GL000248.1,length=39786,assembly=b37>" 
[42] "##contig=<ID=GL000244.1,length=39929,assembly=b37>" 
[43] "##contig=<ID=GL000238.1,length=39939,assembly=b37>" 
[44] "##contig=<ID=GL000202.1,length=40103,assembly=b37>" 
[45] "##contig=<ID=GL000234.1,length=40531,assembly=b37>" 
[46] "##contig=<ID=GL000232.1,length=40652,assembly=b37>" 
[47] "##contig=<ID=GL000206.1,length=41001,assembly=b37>" 
[48] "##contig=<ID=GL000240.1,length=41933,assembly=b37>" 
[49] "##contig=<ID=GL000236.1,length=41934,assembly=b37>" 
[50] "##contig=<ID=GL000241.1,length=42152,assembly=b37>" 
[51] "##contig=<ID=GL000243.1,length=43341,assembly=b37>" 
[52] "##contig=<ID=GL000242.1,length=43523,assembly=b37>" 
[53] "##contig=<ID=GL000230.1,length=43691,assembly=b37>" 
[54] "##contig=<ID=GL000237.1,length=45867,assembly=b37>" 
[55] "##contig=<ID=GL000233.1,length=45941,assembly=b37>" 
[56] "##contig=<ID=GL000204.1,length=81310,assembly=b37>" 
[57] "##contig=<ID=GL000198.1,length=90085,assembly=b37>" 
[58] "##contig=<ID=GL000208.1,length=92689,assembly=b37>" 
[59] "##contig=<ID=GL000191.1,length=106433,assembly=b37>"
[60] "##contig=<ID=GL000227.1,length=128374,assembly=b37>"
[61] "##contig=<ID=GL000228.1,length=129120,assembly=b37>"
[62] "##contig=<ID=GL000214.1,length=137718,assembly=b37>"
[63] "##contig=<ID=GL000221.1,length=155397,assembly=b37>"
[64] "##contig=<ID=GL000209.1,length=159169,assembly=b37>"
[65] "##contig=<ID=GL000218.1,length=161147,assembly=b37>"
[66] "##contig=<ID=GL000220.1,length=161802,assembly=b37>"
[67] "##contig=<ID=GL000213.1,length=164239,assembly=b37>"
[68] "##contig=<ID=GL000211.1,length=166566,assembly=b37>"
[69] "##contig=<ID=GL000199.1,length=169874,assembly=b37>"
[70] "##contig=<ID=GL000217.1,length=172149,assembly=b37>"
[71] "##contig=<ID=GL000216.1,length=172294,assembly=b37>"
[72] "##contig=<ID=GL000215.1,length=172545,assembly=b37>"
[73] "##contig=<ID=GL000205.1,length=174588,assembly=b37>"
[74] "##contig=<ID=GL000219.1,length=179198,assembly=b37>"
[75] "##contig=<ID=GL000224.1,length=179693,assembly=b37>"
[76] "##contig=<ID=GL000223.1,length=180455,assembly=b37>"
[77] "##contig=<ID=GL000195.1,length=182896,assembly=b37>"
[78] "##contig=<ID=GL000212.1,length=186858,assembly=b37>"
[79] "##contig=<ID=GL000222.1,length=186861,assembly=b37>"
[80] "##contig=<ID=GL000200.1,length=187035,assembly=b37>"
[81] "##contig=<ID=GL000193.1,length=189789,assembly=b37>"
[82] "##contig=<ID=GL000194.1,length=191469,assembly=b37>"
[83] "##contig=<ID=GL000225.1,length=211173,assembly=b37>"
[84] "##contig=<ID=GL000192.1,length=547496,assembly=b37>"
[85] "##contig=<ID=NC_007605,length=171823,assembly=b37>" 
[86] "##contig=<ID=hs37d5,length=35477943,assembly=b37>"

3 Parse out chr / contig sizes

# turn into data frame (well, a tibble)
contig_size <- vcf_header[grepl('contig', vcf_header)] %>% 
  data.frame() %>% 
  select(1, 'header' = 1) %>% 
  # separate by ,
  separate(header, c('contig','length','assembly'),',') %>% 
  # extract values by splitting against = and taking the last element (first after reversing)
  rowwise() %>% 
  mutate(contig = str_split(contig,'=')[[1]] %>% gsub('>','',.) %>% rev() %>% .[[1]],
         length = str_split(length,'=')[[1]] %>% gsub('>','',.) %>% rev() %>% .[[1]] %>% as.numeric(),
         assembly = str_split(assembly,'=')[[1]] %>% gsub('>','',.) %>% rev() %>% .[[1]])
contig_size

# A tibble: 86 × 3
# Rowwise: 
   contig    length assembly
   <chr>      <dbl> <chr>   
 1 1      249250621 b37     
 2 2      243199373 b37     
 3 3      198022430 b37     
 4 4      191154276 b37     
 5 5      180915260 b37     
 6 6      171115067 b37     
 7 7      159138663 b37     
 8 8      146364022 b37     
 9 9      141213431 b37     
10 10     135534747 b37     
# … with 76 more rows
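The same fields can also be pulled out with regular expressions. This is a minimal base R sketch, assuming every contig line has the ##contig=<ID=...,length=...,...> shape printed above; header_lines and parsed are illustrative names. Anchoring on ^##contig= avoids picking up other header lines that merely mention the word contig.

```r
# Illustrative input: two contig lines plus one unrelated header line
header_lines <- c(
  "##contig=<ID=1,length=249250621,assembly=b37>",
  "##contig=<ID=GL000207.1,length=4262,assembly=b37>",
  "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"contig depth\">"
)

# keep only true contig lines
contig_lines <- grep("^##contig=", header_lines, value = TRUE)

# capture the value after ID= and after length=
parsed <- data.frame(
  contig = sub(".*ID=([^,>]+).*", "\\1", contig_lines),
  length = as.numeric(sub(".*length=([0-9]+).*", "\\1", contig_lines)),
  stringsAsFactors = FALSE
)
parsed
```

The result is a plain (not rowwise) data frame, so it can be filtered and summarised without ungrouping first.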

4 Split chr above 3e7 base pairs into equal(ish) size pieces

ceiling rounds up the number of pieces per chromosome, so each interval comes out a bit smaller than 3e7. I would rather have more splits that are under the target size than fewer that are over it.

n_split <- function(size){
  pieces <- ceiling(size / 3e7)
  seq(1, size, size/pieces)
}
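As a quick worked example, here is what n_split returns for chromosome 1 (249,250,621 bp in this header). The function is copied from above so the snippet runs on its own; breaks is just an illustrative variable name.

```r
n_split <- function(size){
  pieces <- ceiling(size / 3e7)
  seq(1, size, size/pieces)
}

breaks <- n_split(249250621)

# 249250621 / 3e7 is about 8.3, so ceiling() gives 9 pieces
length(breaks)

# spacing between start points is size / 9, which is below the 3e7 target
max(diff(breaks)) < 3e7
```

The break points are not whole numbers at this stage; they are truncated to integers when the intervals are built.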

5 print coordinates given a chromosome / contig

n_printer <- function(chr) {
  # grab the length of chr or contig
  size <- contig_size %>% filter(contig == chr) %>% pull(length)
  # split into ~3e7 sized pieces
  sequence <- n_split(size)
  # add the max size to end (plus another base pair since the loop below reduces size by 1 to eliminate overlaps)
  sequence <- c(sequence, size+1)
  df <- data.frame()
  for(i in 1:length(sequence)){
    row <- cbind(chr, as.integer(sequence[max(i-1,1)]), # for first row, makes sure you don't pick the 0 position, which doesn't exist
                 as.integer(sequence[i]-1)) # decrements by one so you don't overlap
    df <- rbind(df, row)
  }
  colnames(df) <- c('chr','start','end')
  # skip first row which has dummy values
  df[-1,]
}
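The loop can also be written without growing a data frame row by row. Below is a minimal vectorised sketch in base R that takes the size directly instead of looking it up in contig_size; make_intervals and its target argument are hypothetical names, not the code used in this post. It also makes it easy to check that the intervals tile the chromosome with no gaps or overlaps.

```r
# Hypothetical vectorised variant of the interval builder
make_intervals <- function(chr, size, target = 3e7) {
  pieces <- ceiling(size / target)
  # start points, plus size + 1 as a sentinel for the last end
  breaks <- c(seq(1, size, size / pieces), size + 1)
  data.frame(
    chr   = chr,
    start = as.integer(head(breaks, -1)),      # truncate to whole positions
    end   = as.integer(tail(breaks, -1) - 1),  # one before the next start
    stringsAsFactors = FALSE
  )
}

iv <- make_intervals("19", 59128983)
iv

# every start is exactly one past the previous end
all(iv$start[-1] == iv$end[-nrow(iv)] + 1)
```

For chromosome 19 this gives 1-29564491 and 29564492-59128983.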

6 calculate coordinates

Contigs shorter than 3e7 are skipped here; they are written out later as one comma-separated group for bcftools view -r. The only non-chromosome contig above 3e7 is hs37d5, which I don’t process, so it is filtered out.

How many regions do we have? Should have a bit more than 100.

regions <- data.frame()
for (i in contig_size %>% filter(length > 3e7, contig != 'hs37d5') %>% pull(contig)){
  regions <- rbind(regions,(n_printer(i)))
}
regions %>% nrow()

[1] 115
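The count can be cross-checked without building the intervals: each chromosome contributes ceiling(length / 3e7) pieces. A small sketch using the chromosome lengths from the header above; chr_len and pieces are illustrative names.

```r
# Lengths of 1-22, X and Y as listed in the b37 header
chr_len <- c(
  "1" = 249250621, "2" = 243199373, "3" = 198022430, "4" = 191154276,
  "5" = 180915260, "6" = 171115067, "7" = 159138663, "8" = 146364022,
  "9" = 141213431, "10" = 135534747, "11" = 135006516, "12" = 133851895,
  "13" = 115169878, "14" = 107349540, "15" = 102531392, "16" = 90354753,
  "17" = 81195210, "18" = 78077248, "19" = 59128983, "20" = 63025520,
  "21" = 48129895, "22" = 51304566, "X" = 155270560, "Y" = 59373566
)

# pieces per chromosome, rounding up
pieces <- ceiling(chr_len / 3e7)

sum(pieces)  # 115, matching nrow(regions)
```

The total lands above 100 because every chromosome is rounded up separately.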

7 print ’em

regions %>% mutate(f = paste(paste(chr, start, sep =':'), end, sep='-')) %>% select(f)

                         f
2             1:1-27694513
3      1:27694514-55389026
4      1:55389027-83083540
5     1:83083541-110778053
6    1:110778054-138472567
7    1:138472568-166167080
8    1:166167081-193861594
9    1:193861595-221556107
10   1:221556108-249250621
21            2:1-27022152
31     2:27022153-54044305
41     2:54044306-81066457
51    2:81066458-108088610
61   2:108088611-135110762
71   2:135110763-162132915
81   2:162132916-189155067
91   2:189155068-216177220
101  2:216177221-243199373
22            3:1-28288918
32     3:28288919-56577837
42     3:56577838-84866755
52    3:84866756-113155674
62   3:113155675-141444592
72   3:141444593-169733511
82   3:169733512-198022430
23            4:1-27307753
33     4:27307754-54615507
43     4:54615508-81923261
53    4:81923262-109231014
63   4:109231015-136538768
73   4:136538769-163846522
83   4:163846523-191154276
24            5:1-25845037
34     5:25845038-51690074
44     5:51690075-77535111
54    5:77535112-103380148
64   5:103380149-129225185
74   5:129225186-155070222
84   5:155070223-180915260
25            6:1-28519177
35     6:28519178-57038355
45     6:57038356-85557533
55    6:85557534-114076711
65   6:114076712-142595889
75   6:142595890-171115067
26            7:1-26523110
36     7:26523111-53046221
46     7:53046222-79569331
56    7:79569332-106092442
66   7:106092443-132615552
76   7:132615553-159138663
27            8:1-29272804
37     8:29272805-58545608
47     8:58545609-87818413
57    8:87818414-117091217
67   8:117091218-146364022
28            9:1-28242686
38     9:28242687-56485372
48     9:56485373-84728058
58    9:84728059-112970744
68   9:112970745-141213431
29           10:1-27106949
39    10:27106950-54213898
49    10:54213899-81320848
59   10:81320849-108427797
69  10:108427798-135534747
210          11:1-27001303
310   11:27001304-54002606
410   11:54002607-81003909
510  11:81003910-108005212
610 11:108005213-135006516
211          12:1-26770379
311   12:26770380-53540758
411   12:53540759-80311137
511  12:80311138-107081516
611 12:107081517-133851895
212          13:1-28792469
312   13:28792470-57584939
412   13:57584940-86377408
512  13:86377409-115169878
213          14:1-26837385
313   14:26837386-53674770
413   14:53674771-80512155
513  14:80512156-107349540
214          15:1-25632848
314   15:25632849-51265696
414   15:51265697-76898544
514  15:76898545-102531392
215          16:1-22588688
315   16:22588689-45177376
415   16:45177377-67766064
515   16:67766065-90354753
216          17:1-27065070
316   17:27065071-54130140
416   17:54130141-81195210
217          18:1-26025749
317   18:26025750-52051498
417   18:52051499-78077248
218          19:1-29564491
318   19:29564492-59128983
219          20:1-21008506
319   20:21008507-42017013
418   20:42017014-63025520
220          21:1-24064947
320   21:24064948-48129895
221          22:1-25652283
321   22:25652284-51304566
222           X:1-25878426
322    X:25878427-51756853
419    X:51756854-77635280
516   X:77635281-103513706
612  X:103513707-129392133
77   X:129392134-155270560
223           Y:1-29686783
323    Y:29686784-59373566

8 output ’em for python input (Snakemake)

The second write command appends all of the chromosomes or contigs (in this case, just contigs) that are less than 3e7 in length to the output file. It comma separates them, which is how bcftools view -r takes in multiple chromosomes or contigs. The paste(., collapse=',') command at the end collapses the vector of contigs into a string with comma separation.

write(regions %>% mutate(f = paste(paste(chr, start, sep =':'), end, sep='-')) %>% pull(f), file='vcf_region_split_coords.txt')
write(contig_size %>% filter(length < 3e7, contig != 'hs37d5') %>% pull(contig) %>% paste(., collapse=','), file='vcf_region_split_coords.txt', append = TRUE)
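Each line of the output file is a ready-made argument for bcftools view -r. The sketch below shows one way the lines could be turned into commands; it only builds the command strings and does not run them. The file contents are a shortened stand-in, and input.vcf.gz and the piece_NNN.vcf.gz names are placeholders, not files from this post.

```r
# Shortened stand-in for vcf_region_split_coords.txt
coords_file <- tempfile(fileext = ".txt")
writeLines(c("1:1-27694513",
             "1:27694514-55389026",
             "MT,GL000207.1,GL000226.1"), coords_file)

coords <- readLines(coords_file)

# one bcftools command per line; -O z writes compressed vcf, -o names the output
cmds <- sprintf("bcftools view -r %s -O z -o piece_%03d.vcf.gz input.vcf.gz",
                coords, seq_along(coords))
cat(cmds, sep = "\n")
```

The last line covers all the small contigs in a single piece, since -r accepts a comma-separated list.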

9 rscript

I’ve wrapped the functions and handling into an Rscript that takes the header of a vcf as input and writes the regions to a user-given file. The script also lets you set the desired number of regions (you will almost always get a few more), the output file name, and the genome size (defaults to the human genome). The script is here.

10 Using the script output

I’m using it in a Snakemake pipeline. bcftools can read the output with -R (regions file) if you run the script like this (see source for comments): Rscript split_vcf_into_n_pieces.R yourVCF.header 200 vcf_region_split_200_coords.txt 3e9 bed
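For reference, BED coordinates are 0-based and half-open, while the chr:start-end strings above are 1-based and inclusive, so only the start shifts by one. The sketch below shows that conversion under this assumption; it is not taken from the script's source, and regions_demo, bed and bed_file are illustrative names. bcftools treats a -R file as BED only when its name ends in .bed or .bed.gz.

```r
# Two 1-based, inclusive regions as produced above
regions_demo <- data.frame(
  chr   = c("1", "1"),
  start = c(1L, 27694514L),
  end   = c(27694513L, 55389026L),
  stringsAsFactors = FALSE
)

# BED: start is 0-based, end is exclusive, so end stays the same number
bed <- data.frame(
  chrom      = regions_demo$chr,
  chromStart = regions_demo$start - 1L,
  chromEnd   = regions_demo$end
)

# tab-separated, no header, no quotes
bed_file <- tempfile(fileext = ".bed")
write.table(bed, file = bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
readLines(bed_file)
```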

11 sessionInfo()

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.2 (2022-10-31)
 os       macOS 14.7.4
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2025-10-24
 pandoc   3.6.3 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
 backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
 broom           1.0.3   2023-01-25 [1] CRAN (R 4.2.0)
 cachem          1.0.6   2021-08-19 [1] CRAN (R 4.2.0)
 callr           3.7.3   2022-11-02 [1] CRAN (R 4.2.0)
 cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
 cli             3.6.0   2023-01-09 [1] CRAN (R 4.2.0)
 colorspace      2.1-0   2023-01-23 [1] CRAN (R 4.2.0)
 crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
 DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
 dbplyr          2.3.0   2023-01-16 [1] CRAN (R 4.2.0)
 devtools        2.4.5   2022-10-11 [1] CRAN (R 4.2.0)
 digest          0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
 dplyr         * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
 ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
 evaluate        0.20    2023-01-17 [1] CRAN (R 4.2.0)
 fansi           1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
 fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
 forcats       * 1.0.0   2023-01-29 [1] CRAN (R 4.2.0)
 fs              1.6.1   2023-02-06 [1] CRAN (R 4.2.0)
 gargle          1.3.0   2023-01-30 [1] CRAN (R 4.2.0)
 generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
 ggplot2       * 3.4.0   2022-11-04 [1] CRAN (R 4.2.0)
 glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
 googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
 googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.0)
 gtable          0.3.1   2022-09-01 [1] CRAN (R 4.2.0)
 haven           2.5.1   2022-08-22 [1] CRAN (R 4.2.0)
 hms             1.1.2   2022-08-19 [1] CRAN (R 4.2.0)
 htmltools       0.5.4   2022-12-07 [1] CRAN (R 4.2.0)
 htmlwidgets     1.6.1   2023-01-07 [1] CRAN (R 4.2.0)
 httpuv          1.6.9   2023-02-14 [1] CRAN (R 4.2.0)
 httr            1.4.4   2022-08-17 [1] CRAN (R 4.2.0)
 jsonlite        1.8.4   2022-12-06 [1] CRAN (R 4.2.0)
 knitr           1.42    2023-01-25 [1] CRAN (R 4.2.0)
 later           1.3.0   2021-08-18 [1] CRAN (R 4.2.0)
 lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
 lubridate       1.9.1   2023-01-24 [1] CRAN (R 4.2.0)
 magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
 memoise         2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
 mime            0.12    2021-09-28 [1] CRAN (R 4.2.0)
 miniUI          0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
 modelr          0.1.10  2022-11-11 [1] CRAN (R 4.2.0)
 munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
 pillar          1.8.1   2022-08-19 [1] CRAN (R 4.2.0)
 pkgbuild        1.4.0   2022-11-27 [1] CRAN (R 4.2.0)
 pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
 pkgload         1.3.2   2022-11-16 [1] CRAN (R 4.2.0)
 prettyunits     1.1.1   2020-01-24 [1] CRAN (R 4.2.0)
 processx        3.8.0   2022-10-26 [1] CRAN (R 4.2.0)
 profvis         0.3.7   2020-11-02 [1] CRAN (R 4.2.0)
 promises        1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
 ps              1.7.2   2022-10-26 [1] CRAN (R 4.2.0)
 purrr         * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
 R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
 Rcpp            1.0.10  2023-01-22 [1] CRAN (R 4.2.0)
 readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
 readxl          1.4.1   2022-08-17 [1] CRAN (R 4.2.0)
 remotes         2.4.2   2021-11-30 [1] CRAN (R 4.2.0)
 reprex          2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
 rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
 rmarkdown       2.20    2023-01-19 [1] CRAN (R 4.2.0)
 rstudioapi      0.14    2022-08-22 [1] CRAN (R 4.2.0)
 rvest           1.0.3   2022-08-19 [1] CRAN (R 4.2.0)
 scales          1.2.1   2022-08-20 [1] CRAN (R 4.2.0)
 sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 shiny           1.7.4   2022-12-15 [1] CRAN (R 4.2.0)
 stringi         1.7.12  2023-01-11 [1] CRAN (R 4.2.0)
 stringr       * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
 tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.0)
 tidyr         * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
 tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
 tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
 timechange      0.2.0   2023-01-11 [1] CRAN (R 4.2.0)
 tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
 urlchecker      1.0.1   2021-11-30 [1] CRAN (R 4.2.0)
 usethis         2.1.6   2022-05-25 [1] CRAN (R 4.2.0)
 utf8            1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
 vctrs           0.5.2   2023-01-23 [1] CRAN (R 4.2.0)
 withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
 xfun            0.37    2023-01-31 [1] CRAN (R 4.2.0)
 xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
 xtable          1.8-4   2019-04-21 [1] CRAN (R 4.2.0)
 yaml            2.3.7   2023-01-23 [1] CRAN (R 4.2.0)

 [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────