This function accepts a vector of SNP IDs (rsID), retrieves their genomic positions and overlaps them with the peaks to extend the peak metadata (`GRN@data$peaks$counts_metadata`) by storing the number, positions and rsids of all overlapping SNPs per peak (new columns starting with `SNP_`). Optionally, SNPs in LD with the user-provided SNPs can be identified using the LDlinkR package. Note that only SNPs in LD are associated with a peak for those SNPs directly overlapping a peak. That is, if a user-provided SNP does not overlap with any peak, neither the SNP itself nor any of the SNPs in LD will be associated with any peak, even if an LD SNP overlaps another peak. The results of are stored in GRN@annotation$SNPs (full, unfiltered table) and GRN@annotation$SNPs_filtered (filtered table), and rapid re-filtering is possible without re-querying the database (time-consuming)

addSNPData(
  GRN,
  SNP_IDs,
  EnsemblVersion = NULL,
  add_SNPs_LD = FALSE,
  requeryLD = FALSE,
  population = "CEU",
  r2d = "r2",
  token = NULL,
  filter = "R2 > 0.8",
  forceRerun = FALSE
)

Arguments

GRN

Object of class GRN

SNP_IDs

Character vector. No default. Vector of SNP IDs (rsID) that should be integrated and overlapped with the peaks.

EnsemblVersion

NULL or Character(1). Default NULL. Only relevant if source is not set to custom, ignored otherwise. The Ensembl version to use for the retrieval of gene IDs from their provided database names (e.g., JASPAR) via biomaRt. By default (NULL), the newest version is selected for the most recent genome assembly versions is used (see biomaRt::listEnsemblArchives() for supported versions). This parameter can override this to use a custom (older) version instead.

add_SNPs_LD

TRUE or FALSE. Default FALSE. Should SNPs in LD with any of the user-provided SNPs that overlap a peak be identified and added to the peak? If set to TRUE, LDlinkR::LDproxy_batch will be used to identify SNPs in LD based on the user-provided SNP_IDs argument, a specific (set of) populations (argument population) and a value for r2d. The full table (stored in GRN@annotation$SNPs) is then subsequently filtered, see also the filter argument.

requeryLD

TRUE or FALSE. Default FALSE. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Should LDlinkR::LDproxy_batch be re-executed if already present? As this is very time-consuming, querying the database is only performed if this parameter is set to TRUE or if GRN@annotation$SNPs is missing.

population

Character vector. Default CEU. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Population code(s) from the 1000 Genomes project to be used for LDlinkR::LDproxy_batch. Multiple codes are allowed. For all valid codes, see https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README_populations.md

r2d

r2 or d. Default r2. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. See the help of the function LDlinkR::LDproxy_batch for more details.

token

Character or NULL. Default NULL. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. LDlink provided user token. Register for token at https://ldlink.nih.gov/?tab=apiaccess. Has to be done only once and is very simple and straight forward but unfortunately necessary. An example, non-functional token is 2c49a3b54g04.

filter

Character. Default R2 > 0.8. Only applicable if add_SNPs_LD = TRUE and ignored otherwise. Filter criteria for the output table as generated by LDlinkR::LDproxy_batch. dplyr::filter style is used to specify filters, and multiple filtering criteria can be used (e.g., R2 > 0.8 & MAF > 0.01). The filtered table is stored in GRN@annotation$SNPs_filtered. Note that re-filtering is quick without the need to re-query the database unless requeryLD = TRUE.

forceRerun

TRUE or FALSE. Default FALSE. Force execution, even if the GRN object already contains the result. Overwrites the old results.

Value

An updated GRN object, with additional information added from this function.

Details

`biomaRt` is used to retrieve genomic positions for the user-defined SNPs, which can take a long time depending on the number of SNPs provided. Similarly, querying the LDlink servers may take a long time.

Examples

# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> INFO [2023-08-22 12:30:59] Storing GRN@data$RNA$counts matrix as sparse matrix because fraction of 0s is > 0.1 (0.44)
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = addSNPData(GRN, SNP_IDs = c("rs7570219", "rs6445264", "rs12067275"), forceRerun = FALSE)
#> INFO [2023-08-22 12:30:59]  Quering biomaRt for 3 rsid annotation, this may take a while, particularly if the number of provided IDs is large
#> INFO [2023-08-22 12:31:01] Retrieving BioMart database succeeded
#> INFO [2023-08-22 12:31:02] Retrieving genome annotation succeeded
#> INFO [2023-08-22 12:31:02]   Retrieved annotation for a total of 3 SNPs out of 3 that were provided
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> INFO [2023-08-22 12:31:02]  Overlapping peaks and SNPs
#> Error in dplyr::summarise(., SNP_n = dplyr::n(), SNP_rsid = paste0(.data$RS_Number,     collapse = ","), SNP_start = paste0(.data$SNP_start, collapse = ","),     SNP_origin = paste0(.data$association, collapse = ",")):  In argument: `SNP_rsid = paste0(.data$RS_Number, collapse = ",")`.
#>  In group 1: `peakID = "chr1:25861303-25861554"`.
#> Caused by error in `.data$RS_Number`:
#> ! Column `RS_Number` not found in `.data`.