filterGRNAndConnectGenes.Rd
This is one of the main integrative functions of the GRaNIE package. It has two main purposes:
First, filtering both TF-peak and peak-gene connections according to different criteria such as FDR and other properties.
Second, joining the three major elements that an eGRN consists of (TFs, peaks, genes) into one data frame, with one row per unique TF-peak-gene connection.
After successful execution, the connections (along with additional feature metadata) can be retrieved with the function getGRNConnections.
Note that a previously stored eGRN graph is reset upon successful execution of this function, accompanied by a descriptive warning, and re-running the function build_eGRN_graph is necessary before any of the network functions of the package can be executed.
If the filtered connections changed, all network-related enrichment functions also have to be rerun.
Internally, before joining them, both TF-peak links and peak-gene connections are filtered separately for reasons of memory and computational efficiency: filtering out unwanted links first dramatically reduces the memory needed for the full eGRN.
Peak-gene p-value adjustment is only done after all filtering steps, on the remaining set of connections, to lower the statistical burden of multiple-testing adjustment. This may lead to initially counter-intuitive effects, such as particular connections no longer being included compared to a filtering based on different thresholds, or the FDR being different for the same reason.
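The effect of adjusting p-values only after filtering can be illustrated with base R alone. The following is a minimal sketch with invented raw p-values, not the package's internal code: the same raw p-value receives a different BH-adjusted value depending on how many other connections remain in the tested set.

```r
# Toy raw p-values for ten hypothetical peak-gene connections (invented values)
p_raw <- c(0.001, 0.004, 0.012, 0.030, 0.045, 0.20, 0.35, 0.50, 0.70, 0.90)

# Case 1: adjust across all ten connections
fdr_all <- p.adjust(p_raw, method = "BH")

# Case 2: assume earlier filters removed the last five connections,
# then adjust only across the remaining five
keep <- 1:5
fdr_filtered <- p.adjust(p_raw[keep], method = "BH")

# Compare the adjusted values for the connections present in both cases
comparison <- data.frame(
  p_raw        = p_raw[keep],
  fdr_all      = fdr_all[keep],
  fdr_filtered = fdr_filtered
)
print(comparison)
```

With an FDR threshold of 0.05, the fourth and fifth connections in this toy example pass only when the adjustment is done on the filtered set, which is the kind of difference described above.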
filterGRNAndConnectGenes(
GRN,
TF_peak.fdr.threshold = 0.2,
TF_peak.connectionTypes = "all",
peak.SNP_filter = list(min_nSNPs = 0, filterType = "orthogonal"),
peak_gene.p_raw.threshold = NULL,
peak_gene.fdr.threshold = 0.2,
peak_gene.fdr.method = "BH",
peak_gene.IHW.covariate = NULL,
peak_gene.IHW.nbins = "auto",
outputFolder = NULL,
gene.types = c("protein_coding"),
allowMissingTFs = FALSE,
allowMissingGenes = TRUE,
peak_gene.r_range = c(0, 1),
peak_gene.selection = "all",
peak_gene.maxDistance = NULL,
filterTFs = NULL,
filterGenes = NULL,
filterPeaks = NULL,
TF_peak_FDR_selectViaCorBins = FALSE,
filterLoops = TRUE,
resetGraphAndStoreInternally = TRUE,
silent = FALSE,
forceRerun = FALSE
)
GRN
Object of class GRN
TF_peak.fdr.threshold
Numeric[0,1]. Default 0.2. Maximum FDR for the TF-peak links. Set to 1 or NULL to disable this filter.
TF_peak.connectionTypes
Character vector. Default "all". TF-peak connection types to consider. The special keyword "all" denotes all connection types (e.g., expression and TFActivity) that are found in the GRN object. By default, only expression is present in the object, so "all" and expression are usually equivalent unless calculation of TF-peak links based on TF activity has also been enabled.
peak.SNP_filter
Named list. Default list(min_nSNPs = 0, filterType = "orthogonal"). Filters related to SNP data if they have been added with the function addSNPData, ignored otherwise. The named list must contain at least two elements:
First, min_nSNPs, an integer >= 0 that denotes the minimum number of SNPs a peak has to overlap with to pass the filter or be considered for inclusion.
Second, filterType, a character that must be either "orthogonal" or "extra" and denotes whether the SNP filter is orthogonal to the other filters (i.e., an alternative criterion for keeping a peak) or whether the SNP filter is applied in addition to all other filters.
For more help, see the Vignettes.
peak_gene.p_raw.threshold
Numeric[0,1]. Default NULL. Threshold for the peak-gene connections, based on the raw p-value. All peak-gene connections with a larger raw p-value will be filtered out.
peak_gene.fdr.threshold
Numeric[0,1]. Default 0.2. Threshold for the peak-gene connections, based on the FDR. All peak-gene connections with a larger FDR will be filtered out.
peak_gene.fdr.method
Character. Default "BH". One of: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none", "IHW".
Method for adjusting p-values for multiple testing.
If set to "IHW", independent hypothesis weighting will be performed. This requires the package IHW (as it is listed under Suggests, it may not be installed), and a suitable covariate has to be specified for the parameter peak_gene.IHW.covariate.
peak_gene.IHW.covariate
Character. Default NULL. Name of the covariate to use for IHW (column name from the table that is returned by the function getGRNConnections). Only relevant if peak_gene.fdr.method is set to "IHW". You have to make sure the specified covariate is suitable for IHW; see the diagnostic plots that are generated by this function. For many datasets, the peak-gene distance (called peak_gene.distance in the object) seems suitable.
peak_gene.IHW.nbins
Integer or "auto". Default "auto". Number of bins for IHW. Only relevant if peak_gene.fdr.method is set to "IHW".
outputFolder
Character or NULL. Default NULL. If set to NULL, the default output folder as specified when initializing the object in initializeGRN will be used. Otherwise, all output from this function will be put into the specified folder.
If a folder is provided, we recommend specifying an absolute path, although a relative one also works.
gene.types
Character vector of supported gene types. Default c("protein_coding").
Filter for gene types to retain; genes with gene types not listed here are filtered out. The special keyword "all" indicates no filter and retains all gene types.
The specified names must match the names as stored in the GRN object (see GRN@annotation$genes$gene.type) and correspond 1:1 to the gene type names as provided by biomaRt, with the exception of lncRNAs, which is internally renamed to lincRNAs when first fetching all gene types. This is done due to a recent change in biomaRt and aims at keeping backwards compatibility with GRN objects.
allowMissingTFs
TRUE or FALSE. Default FALSE. Should connections be returned for which the TF is NA (i.e., connections consisting only of peak-gene links)? If set to TRUE, this generally greatly increases the number of connections, but it may not be what you aim for.
allowMissingGenes
TRUE or FALSE. Default TRUE. Should connections be returned for which the gene is NA (i.e., connections consisting only of TF-peak links)? If set to TRUE, this generally increases the number of connections.
peak_gene.r_range
Numeric(2). Default c(0,1). Lower and upper limit of the correlation coefficient for the peak-gene links. Links are retained only if their correlation coefficient is within the specified interval. This filter is usually used to filter out negatively correlated peak-gene links.
peak_gene.selection
"all" or "closest". Default "all". Filter for the selection of genes for each peak. If set to "all", all previously identified peak-gene links are used, while "closest" only retains the closest gene for each peak that is still retained at the point the filter is applied.
peak_gene.maxDistance
Integer >0. Default NULL. Maximum peak-gene distance to retain a peak-gene connection.
filterTFs
Character vector. Default NULL. Vector of TFs (as named in the GRN object) to retain. All TFs not listed will be filtered out.
filterGenes
Character vector. Default NULL. Vector of gene IDs (as named in the GRN object) to retain. All genes not listed will be filtered out.
filterPeaks
Character vector. Default NULL. Vector of peak IDs (as named in the GRN object) to retain. All peaks not listed will be filtered out.
TF_peak_FDR_selectViaCorBins
TRUE or FALSE. Default FALSE. Use a modified procedure for selecting TF-peak links. Instead of selecting solely based on the user-specified FDR, this procedure first identifies the correlation bin closest to 0 that contains at least one significant TF-peak link according to the chosen TF_peak.fdr.threshold. This is done separately for both FDR directions. It then retains all TF-peak links whose correlation bin is at least as extreme as the identified one. For example, if the correlation bin [0.35,0.40] contains a significant TF-peak link while [0,0.05], [0.05,0.10], ..., [0.30,0.35] do not, all TF-peak links with a correlation of 0.35 or above are selected (i.e., bins [0.35,0.40], [0.40,0.45], ..., [0.95,1.00]). Thus, the final selection may also include links with a higher FDR but a more extreme correlation.
filterLoops
TRUE or FALSE. Default TRUE. If a TF regulates itself (i.e., the TF and the gene are the same entity), should such loops be filtered from the GRN?
resetGraphAndStoreInternally
TRUE or FALSE. Default TRUE. If set to TRUE, the stored eGRN graph (slot graph) is reset due to the potentially changed connections that would otherwise cause conflicts in the information stored in the object. Also, a GRN object is returned. If set to FALSE, only the new filtered connections are returned and the object is not altered.
silent
TRUE or FALSE. Default FALSE. Print progress messages and filter statistics.
forceRerun
TRUE or FALSE. Default FALSE. Force execution, even if the GRN object already contains the result. Overwrites the old results.
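The modified selection procedure described for TF_peak_FDR_selectViaCorBins can be sketched in base R. This is an illustrative sketch only and not the package's implementation: the function name select_via_cor_bins, the toy data, the bin width of 0.05, the use of <= for significance, and the interpretation of the two FDR directions as positive and negative correlations are all assumptions made for this example.

```r
# Toy TF-peak links: correlation (r) and FDR values are invented for illustration
links <- data.frame(
  r   = c(0.12, 0.33, 0.37, 0.42, 0.61, -0.08, -0.27, -0.52),
  fdr = c(0.60, 0.45, 0.15, 0.30, 0.05,  0.80,  0.70,  0.10)
)

# Hypothetical helper, not part of GRaNIE
select_via_cor_bins <- function(links, fdr_threshold, bin_width = 0.05) {
  # Index of the correlation bin, counted outwards from 0 in each direction
  bin <- floor(abs(links$r) / bin_width)
  positive <- links$r >= 0
  significant <- links$fdr <= fdr_threshold
  keep <- rep(FALSE, nrow(links))
  # Handle positive and negative correlations separately
  for (direction in list(positive, !positive)) {
    sig_bins <- bin[direction & significant]
    if (length(sig_bins) > 0) {
      # Bin closest to 0 that holds at least one significant link
      first_bin <- min(sig_bins)
      # Keep every link in this direction whose bin is at least as extreme
      keep <- keep | (direction & bin >= first_bin)
    }
  }
  links[keep, , drop = FALSE]
}

selected <- select_via_cor_bins(links, fdr_threshold = 0.2)
print(selected)
```

In this toy data, the link with r = 0.42 is retained although its FDR (0.30) exceeds the threshold, because a less extreme bin (the one containing r = 0.37) already holds a significant link.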
An updated GRN object, with additional information added from this function.
The filtered and merged TF-peak and peak-gene connections are stored in the slot GRN@connections$all.filtered and can be retrieved (along with other feature metadata) using the function getGRNConnections.
# See the Workflow vignette on the GRaNIE website for examples
GRN = loadExampleObject()
#> Downloading GRaNIE example object from https://git.embl.de/grp-zaugg/GRaNIE/-/raw/master/data/GRN.rds
#> INFO [2023-08-16 17:28:06] Storing GRN@data$RNA$counts matrix as sparse matrix because fraction of 0s is > 0.1 (0.44)
#> Finished successfully. You may explore the example object. Start by typing the object name to the console to see a summaty. Happy GRaNIE'ing!
GRN = filterGRNAndConnectGenes(GRN)
#> INFO [2023-08-16 17:28:06] Filter GRN network
#> INFO [2023-08-16 17:28:06] Data already exists in object or the specified file already exists. Set forceRerun = TRUE to regenerate and overwrite.
#> INFO [2023-08-16 17:28:06] Finished successfully. Execution time: 0 secs
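As the log above shows, the example object already contains filtered connections, so the call returns without recomputing anything. The following hedged sketch shows how one might rerun the filtering with stricter thresholds and then retrieve the result. The threshold values are arbitrary, and the argument type = "all.filtered" for getGRNConnections is an assumption based on the slot name mentioned above; the code is skipped entirely if GRaNIE is not installed.

```r
# Run only if GRaNIE is installed; loadExampleObject() downloads the example data
if (requireNamespace("GRaNIE", quietly = TRUE)) {
  GRN <- GRaNIE::loadExampleObject()

  # Stricter thresholds than the defaults (values chosen arbitrarily);
  # forceRerun = TRUE overwrites the filtered connections already stored
  GRN <- GRaNIE::filterGRNAndConnectGenes(
    GRN,
    TF_peak.fdr.threshold   = 0.1,
    peak_gene.fdr.threshold = 0.1,
    forceRerun              = TRUE
  )

  # Retrieve the filtered TF-peak-gene table
  # (the value passed to `type` is an assumption, see the text above)
  connections <- GRaNIE::getGRNConnections(GRN, type = "all.filtered")
  print(head(connections))
}
```

Because the stored eGRN graph is reset by this call, build_eGRN_graph would have to be run again before using any network functions.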