GRaNIE_packageDetails.Rmd
Abstract
This vignette introduces the GRaNIE
package and
explains the main features and additional methodological
background. We refer also to the Workflow Vignette for further
information and examples as well as the latest
preprint and methodological details therein.
Genetic variants associated with diseases often affect non-coding
regions, thus likely having a regulatory role. To understand the effects
of genetic variants in these regulatory regions, identifying genes that
are modulated by specific regulatory elements (REs) is crucial. The
effect of gene regulatory elements, such as enhancers, is often
cell-type specific, likely because the combinations of transcription
factors (TFs) that are regulating a given enhancer have cell-type
specific activity. This TF activity can be quantified with existing
tools such as diffTF
and captures differences in binding of
a TF in open chromatin regions. Collectively, this forms a
enhancer-mediated gene regulatory network (eGRN) with cell-type
and data-specific TF-RE and RE-gene links. Here, we reconstruct such a
eGRN using bulk RNA-seq and open chromatin (e.g., using
ATAC-seq or ChIP-seq for open chromatin marks) and optionally TF
activity data. Our network contains different types of links, connecting
TFs to regulatory elements, the latter of which is connected to genes in
the vicinity or within the same chromatin domain (TAD). We use
a statistical framework to assign empirical FDRs and weights to all
links using various statistical approaches.
In summary, we present a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types.
For the latest version of the paper, see the References.
GRaNIE
package and required dependency packages
GRaNIE
is available on Bioconductor. The package and
installation instructions can be found here and are very simple.
The basic installation installs all required packages but not
necessarily those that are listed under Suggests
unless you
installed the package with
BiocManager::install("GRaNIE", dependencies = TRUE)
.
However, we also provide instructions on how to install the newest version of the package outside of Bioconductor for users who do not use the newest version of Bioconductor and/or do not have the devel version of Bioconductor installed. For details, see the instructions here.
Note that from the additional packages, only some of the genome annotation packages are actually required, which of them depends on your genome assembly version (see below).
Overall, we tried to minimize the installation burden and only require packages if they are absolutely necessary for the main workflow. If you want to pre-install also all optional packages, please consider the following options:
BiocManager::install("GRaNIE", dependencies = TRUE)
which however installs all annotation packages for all supported
genomes, which may take a longer time due to the overall download size
of multiple Gb.Genome annotation packages (only one of the four is optionally needed for some functions, see section below, which of them depends on your genome assembly version):
hg38
:
BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene"))
hg19
:
BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene"))
mm10
:
BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm10", "TxDb.Mmusculus.UCSC.mm10.knownGene"))
mm9
:
BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm9", "TxDb.Mmusculus.UCSC.mm9.knownGene"))
rn6
:
BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn6", "TxDb.Rnorvegicus.UCSC.rn6.refGene"))
rn7
:
BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn7", "TxDb.Rnorvegicus.UCSC.rn7.refGene"))
dm6
:
BiocManager::install(c("org.Dm.eg.db", "BSgenome.Dmelanogaster.UCSC.dm6", "TxDb.Dmelanogaster.UCSC.dm6.ensGene"))
If you want to use the JASPAR database for TF and TFBS,
you need the following extra packages:
BiocManager::install(c("JASPAR2022", "TFBSTools", "motifmatchr", "rbioapi"))
For all other optional packages, you may execute this line for
installing them in one go:
BiocManager::install(c("IHW", "WGCNA", "csaw", "EDASeq", "ChIPseeker", "variancePartition", "ReactomePA", "DOSE", "clusterProfiler", "BiocFileCache", "BiocParallel", "LDlinkR"))
GRaNIE
has a number of packages that enhance the
functionality for particular cases, may be required depending on certain
parameters, only when using a particular functionality or that generally
speed-up the computation. In detail, the following packages are
currently optional and may be needed context-specifically:
Genome assembly and annotation packages (only one of the four is optionally needed, which of them depends on your genome assembly version)
all of the following packages are ONLY needed for either
additional peak annotation in combination with ChIPseeker
(see below) or if the chosen peak normalization method is GC-based
(EDASeq_GC_peaks
, gcqn_peaks
)
org.Hs.eg.db
: Needed only for hg19
or
hg38
org.Mm.eg.db
: Needed only for mm9
or
mm10
org.Rn.eg.db
: Needed only for rn6
or
rn7
org.Dm.eg.db
: Needed only for dm6
BSgenome.Hsapiens.UCSC.hg19
,
TxDb.Hsapiens.UCSC.hg19.knownGene
: Needed only for
hg19
BSgenome.Hsapiens.UCSC.hg38
,
TxDb.Hsapiens.UCSC.hg38.knownGene
: Needed only for
hg38
BSgenome.Mmusculus.UCSC.mm10
,
TxDb.Mmusculus.UCSC.mm10.knownGene
: Needed only for
mm10
BSgenome.Mmusculus.UCSC.mm9
,
TxDb.Mmusculus.UCSC.mm9.knownGene
: Needed only for
mm9
BSgenome.Rnorvegicus.UCSC.rn6
,
TxDb.Rnorvegicus.UCSC.rn6.refGene
: Needed only for
rn6
BSgenome.Rnorvegicus.UCSC.rn7
,
TxDb.Rnorvegicus.UCSC.rn7.refGene
: Needed only for
rn7
BSgenome.Dmelanogaster.UCSC.dm6
,
TxDb.Dmelanogaster.UCSC.dm6.ensGene
: Needed only for
dm6
ChIPseeker
: Not needed strictly
speaking, but used only for the function addData()
to
provide additional peak annotation, fully optional otherwise.
TF and TFBS related packages
JASPAR2022
,
TFBSTools
,
motifmatchr
,
rbioapi
: Needed only when
source = "JASPAR"
for addTFBS()
for using
JASPAR 2022 TFs and TFBSAdditional statistical packages and for enhanced functionality
IHW
: Needed only for the function
filterGRNAndConnectGenes()
for p-value adjustment of the
raw peak-gene p-values when using the IHW
framework
(parameter peak_gene.fdr.method
)csaw
: Needed only for a cyclic
LOESS normalization in addData()
, which is the default
normalization currently for adding TF activity data via
addData_TFActivity()
EDASeq
: Provides additional, GC-aware
normalization schemes for the peak data in in `addData().variancePartition
: Needed only for the
function add_featureVariation()
to quantify multiple
sources of biological and technical variation for features (TFs, peaks,
and genes)WGCNA
: Needed only when setting
corMethod = "bicor"
in multiple supported functions for
enabling another type of robust correlations as alternative to Spearman
(an experimental, largely undocumented feature so far).Enrichment-associated packages
topGO
,
ReactomePA
,
DOSE
,
clusterProfiler
: Needed only for all
enrichment functions for GO
, Reactome
,
DO
, and KEGG
respectivelySNP-related extra packages
LDlinkR
: Needed only for
addSNPData()
to add SNP LD information.Computational efficacy
BiocFileCache
: Needed only for
loadExampleObject()
, in cases when this function is
executed multiple times, caching is faster than re-downloading the file
anew every time the function is executed.BiocParallel
: Only used but highly
recommended in the functions addConnections_peak_gene()
,
add_TF_gene_correlation()
, and
overlapPeaksAndTFBS()
to speed-up computation when
requesting multiple cores.Check the workflow vignette for an example workflow with explanations, figures and results.
For user convenience, we provide the function
loadExampleObject()
that can load a supported GRN object
from the GRaNIE
package from an arbitrary URL into R. This
function can be used, for example, to load an example object that
contains a small number of TFs that we specifically prepared for package
use and exploring the package.
Simply run the following command:
GRN = loadExampleObject()
You can start by simply typing GRN
(i.e., the object
name) into the console, and a summary of the GRN object is printed.
The example object is always up to date with the most recent
version of the package and everything that the package can calculate is
contained in the object. Thus, you can run any function after loading
the example object.
For more details on where the data from the example object comes from, see the workflow vignette with explanations.
In our GRaNIE
approach, we integrate multiple data
modalities. Here, we describe them in detail and their required
format.
Open chromatin data may come from ATAC-seq, DNAse-seq or ChIP-seq data for particular histone modifications that associate with open chromatin such as histone acetylation (e.g., H3K27ac). They all capture open chromatin either directly or indirectly, and while we primarily tested and used ATAC-seq while developing the package, the others should also be applicable for our framework. From here on, we will refer to these regions simply as peaks.
For RNA-seq, the data represent expression counts per gene across samples.
Here is a quick graphical representation which format is required to be compatible with our framework:
peakID
while for
RNA-seq, we use EnsemblID
chr
denoting the chromosome, followed by :
,
and then start
, -
, and end
for
the peak start and end, respectively. Coordinates are assumed to be
zero-based exclusive, the standard for BED files, see here
or here
for more information. In short, the first base in a chromosome is
numbered 0, and the last base is not included in the feature. For
example, a peak with the coordinates chr1:100-150
starts at
position 100 (or the 101th base of the chromosome, as counting starts
with 0), spans 50 bp and ends at 149 (and NOT 150).GRaNIE
pipeline can work with it. Note that only the shared samples between
both data modalities are kept, however, so make sure that the sample
names match between them and share as many samples as possible. See the
methods part for guidelines on how many samples we recommend.Note that peaks should not overlap. If they do, an informative error message is thrown and the user is requested to modify the peak input data so that no overlaps exist among all peaks. This can be done by either merging overlapping peaks or deleting those that overlap with other peaks based on other criteria such as peak signal, by keeping only the strongest peak, for example.
For guidelines on how many peaks are necessary or recommended, see the section below.
For guidelines on whether and how to prepare single-cell 10x multiome data, see here.
TF and TFBS data is mandatory as input. As of March 2023, we offer
two different ways of how to import TF and TFBS data into
GRaNIE
.
First, we support the import of any TF and TFBS data irrespective of
how it was generated or what it represents, thereby offering full
flexibility for how our framework is used. The only drawback is that
preparing the TF and TFBS data is a little more involved. This was the
only and default way of how to use GRaNIE
until February
2023. In what follows, we explain more details.
Specifically, the package requires a bed
file per TF
with TF binding sites (TFBS). TFBS can either be in-silico predicted, or
experimentally verified, as long as genome-wide TFBS can be used. For
convenience and orientation, we provide TFBS predictions for
HOCOMOCO-based TF motifs that were used with PWMScan
for
hg19
, hg38
and mm10
. Check the workflow vignette for an example.
We recommend to download the full HOCOMOCO TFBS data from here.
In short, we provide example files for selected supported genome
assemblies (hg19, hg38 and mm10) that are
fully compatible with GRaNIE
as separate downloads. While
we also provide TFBS data for FIMO, we recommend using
HOCOMOCO.
However, you may also use your own TFBS data, and we provide full flexibility in doing so. Only some manual preparation is necessary. Briefly, if you decide to use your own TFBS data, you have to prepare the following:
bed
or
bed.gz
format, 6 columns{TF}{suffix}.{fileEnding}
, where
{TF}
specifies the name of the TF, {suffix}
an
optional and arbitrary string (we use _TFBS
, for example),
and {fileEnding}
the file type (supported are
bed
and bed.gz
).translationTable.csv
. This file must have the
following structure: 3 columns (tab-separated), called
SYMBOL
, ENSEMBL
, and HOCOID
. The
first column denotes a symbol or shortcut for the TF that is used
throughout the pipeline (e.g., AHR
), the second the ENSEMBL
ID (without the dot suffix; e.g., ENSG00000106546
) and the
third column the prefix of how the file is called with the TFBS (e.g.,
AHR.0.B
if the file for AHR
is called
AHR.0.B_TFBS.bed.gz
).GRaNIE
as separate downloads. For more
information, check here.For more methodological details, details on how to construct these
files, their exact format etc we refer to diffTF
paper and
website for details.
As of March 2023, we additionally a more user-friendly way of
importing TF and TFBS data from the various JASPAR databases that are
available in R/Bioconductor: JASPAR
2014, 2016, 2018, 2022,
and 2022. While we so far tested only with th newest
JASPAR2022
database / R package, the older ones should also
work - if not, please let us know. Using JASPAR2022
as TF
database is much easier and also quicker from a user perspective, and
none of the additional complexities that were mentioned above for the
custom import apply here. For details, see the addTFBS()
and overlapPeaksAndTFBS()
functions.
Note that we also implemented an option to customize which TFBS
collection to use from the JASPAR2022
database via the
function argument JASPAR_useSpecificTaxGroup
from
addTFBS()
. For example, this feature can be used to specify
the whole TF collection irrespective of the genome assembly the data is
in - which is useful for genomes such as rat or mouse, for which JASPAR
only finds a few dozen TF at most if the specific genome was specified
otherwise.
Be aware, though, that you may want to compare your eGRN
across different TF and TFBS data - we are currently investigating the
effect of using JASPAR2022
and cannot currently comment
much on it. We will, however, update the vignette here once we have more
results to share.
Providing sample metadata is optional, but highly recommended - if
available, the sample metadata is integrated into the PCA plots to
understand where the variation in the data comes from and whether any of
the metadata (e.g., age, sex, sequencing batch) is associated with the
PCs from a PC, indicating a batch effect that needs to be addressed
before running the GRaNIE
pipeline.
The integration of sample metadata can be achieved in the
addData()
function (click the link for more
information).
Integration of Hi-C data is optional and serves as alternative to identifying peak-gene pairs to test for correlation based on a predefined and fixed neighborhood size (see Methods).
If TAD domains are used, the neighborhood of a peak will be defined by the TAD the peak is located in, and all genes in the same TAD will be tested for correlation. If the TAD is very narrow, this consequently results in fewer genes to be tested as compared to a fixed 250 kb neighborhood (the default), for example, while the opposite is true for a large megabase-long TAD domain.
If Hi-C data are available, the pipeline expects a BED file format
with at least 3 columns: chromosome name
,
start
, and end
. An ID
column is
optional and assumed to be in the 4th column, all additional columns are
ignored.
For more details, see the R help
(?addConnections_peak_gene
) and the Methods.
Optionally, SNP data can also be integrated into our framework via
the function addSNPData()
. The main idea is to annotate
each peak with the information whether it overlaps any of the
user-provided SNPs (and how many), and use this extra annotation later
on for creating the TF-peak-gene eGRN within
filterGRNAndConnectGenes()
. For more information, see here.
Optionally, we now also support identifying SNPS that are in LD with
any of the user-provided SNPs to expand the set of SNPs that are
peak-associated. For more details, see addSNPData()
.
SNP IDs have to be provided with their corresponding rsIDs, and the
genomic position is retrieved automatically within
addSNPData()
by using biomaRt
.
We also plan to make it possible to manually import TF activity or any other TF-specific data that can be used for generating TF-peaks links, in addition to our default approach of using TF expression. Stay tuned!
For more details, see here.
In this section, we give methodological details and guidelines.
As everywhere in bioinformatics, proper data normalization for both
RNA and open chromatin (peaks) data is very important. Thus, the chosen
normalization method may have a large influence on the resulting
eGRN, so make sure the choice of normalization is reasonable.
Data normalization is performed when the data is imported into
GRaNIE
in the addData()
function and cannot be
changed thereafter.
In this section, we give details on the supported data normalization
schemes currently available in the GRaNIE
framework.
We currently support six choices of normalization of either peak or
RNA-Seq data: limma_quantile
,
DESeq2_sizeFactors
, limma_cyclicloess
,
limma_scale
, csaw_cyclicLoess_orig
,
csaw_TMM
plus none
to skip normalization
altogether. The default for RNA-Seq is a quantile normalization, while
for the open chromatin peak data, it is DESeq2_sizeFactors
(i.e., a “regular” DESeq2
size factor normalization).
Importantly, some normalization methods such as
DESeq2_sizeFactors
strictly require raw data, while others,
such as limma_quantile
, do not necessarily.
For peaks, two additional GC-based normalization schemes are offered:
EDASeq_GC_peaks
and gcqn_peaks
. We refer to
the R help for more details (?addData
).
We recommend raw data as input, although it is also possible to
provide pre-normalized data as input. Pre-normalizing data beforehand
may be useful in case there are known batch effects, for example.
Removing the batch effects from the counts before running
GRaNIE
therefore may be a worthwhile strategy, a procedure
that generally produces non-integer counts. In such a case, the user can
then either normalize the data with another normalization method (one
that is suitable also with non-integer values, see above) or skip
additional normalization and use the counts as they are (choosing
none
therefore for the normalization parameter).
To identify statistically significant TF-peak connections we implement a cell-type specific data-driven approach. In brief, we first calculate the Pearson correlation coefficients between the expression level of each TF and the open chromatin signal of each peak across all samples. We then use an empirical FDR procedure to identify statistically significant TF-peak connections. For this, for each TF, we split the peaks into two sets: A foreground set containing the peaks with a putative TFBS and a corresponding background set consisting of peaks without putative TFBS based on the TF-peak binding matrix calculated above. We then discretize the TF-peak correlation r into 40 bins in steps of 0.05 ranging from -1, -0.95, …, 0, …, 1 and calculate a bin-specific FDR value using two different directions. For more details on how we establish TF-peak links, please check the Supplement of the corresponding publication. See References for links.
We employ a shuffling-based approach to calculate background TF-peak links, as described in detail here.
TF-peak connections are found by correlating a suitable measure related to TFs with peak accessibility, and then identifying statistically significant links. By default, TF expression is used as TF measure, but we also implemented an additional measure for establishing these links based on a measure called TF activity (see below).
TFs and therefore also TF motifs can have very different GC
preferences. Some TFs are known to bind to very GC-rich regions such as
SP1. Thus, when calculating the TF-peak link FDRs, comparing a GC-rich
or GC-poor foreground set of peaks with the full background (that is,
all other peaks with no predicted site), the latter of which may exhibit
a very different GC distribution, is potentially biasing the FDR values.
We therefore offer an experimental feature to correct for this effect
and to use a corresponding GC-matching background. The GC correction is
enabled with useGCCorrection = TRUE
in the function
addConnections_TF_peak()
. Note that the GC
adjustment will take additional time and is currently much slower as
compared to not adjusting the background.
Importantly, we note again that this feature is experimental as of now and we are still exploring its effects and plausibility.
When the GC correction is activated, the following is performed:
percBackground_resample
is set to a value > 0
(default is 75), the background size is fixed to this value (i.e., 75
translates to a background size of 75% of the original size), and no
iterative procedure is run to automatically determine the background
size (see b).percBackground_resample = 0
, an automatic
iterative procedure is run that determines the maximum value of
percBackground_resample
(starting from a value of 100 - the
full size of the background - down to 5 maximum in steps of 5 per
iteration) so that all GC bins with a relative frequency of at least 5%
in the foreground can be matched in sufficient numbers with the
corresponding GC bin from the background. GC bins with a relative
frequency of less than 5% are ignored due to their low relevance. If the
background size as identified by the iterative procedure selects a
background that is too small (< 1000 peaks), the selected background
percentage is again increased in steps of 5% until at least 1000 peaks
can be selected from. This is done for statistical reasons and ensures a
minimum no of points in background, even if this sacrifices the
mimicking of the distributions as described above.percBackground_resample
is set to TRUE
,
then the background distribution will mimic the foreground distribution
perfectly even for GC bins that are very lowly abundant (i.e., those
with a relative frequency of < 0.05 in the foreground). In summary,
for GC bins for which not enough peaks are available in the background,
a resampling scheme is performed that selects the required number of
peaks by sampling with replacement from the background. If set to
FALSE
, however, no resampling is performed and the number
of peaks selected from the background is the maximum that can be
selected. In this case, the relative frequencies in foreground and
background may differ. Note that when
percBackground_resample = 0
, this is only relevant for GC
bins with a relative frequency of < 0.05, while otherwise, resampling
may occur for an GC bin depending on the chosen value of
percBackground_size
Rationale for choosing an appropriate number for
percBackground_resample
Ideally, one would want to select as many peaks in the background as possible to maximize the sample size and therefore randomness for better statistical sampling. However, this often fails because particular GC bins may be very dominant in the foreground for GC-rich TFs (e.g., the GC bin 71-80% has a relative frequency of 25%) while the background contains overall only 5% peaks with this GC bin. Thus, one cannot select 25% of all background peaks (as the relative frequency of the GC bins in the foreground is to be kept in the background) of the same GC bin without resampling. The only alternative is to not use the full background size as reference, but only a particular percentage of it so that the required absolute numbers of peaks to sample from decreases and eventually can be selected for all relevant GC bins. When choosing this option, the background size can be much smaller (e.g., 15% only), therefore also choosing only a relative small absolute number of peaks from the background. However, no resampling is done here, therefore not (by chance) giving particular peaks a higher weight.
You may ask which value to choose for
percBackground_size
and whether to use GC correction at
all, however? These are good questions, and while we cannot give any
guaranteed reply, we can give recommendations and our experiences from a
limited number of datasets:
GRaNIE
once with and
once without GC correction to compare the eGRNs. What we
sometimes see is that in the eGRN without a GC correction, the
most abundant TFs actually have a GC-rich motif, indicating that they
may be overrepresented in the eGRN.percBackground_size
, we recommend leaving it at the default
value of 75 first and potentially adjusting it after looking at the
diagnostic plots. Higher values will result in (1) a higher number of GC
bins that need resampling (and therefore potentially overrepresenting
some peaks or indirectly causing a bias due to the higher percentage of
resampled peaks) while (2) increasing the background size overall, which
is better from a statistical sampling point of view. We think
Judging how well the GC correction worked
We also provide detailed QC plots that summarize both the GC background selection as well as the differences for teh TF-peak connections with and without GC correction. For more details, see here.
Reproducibility of GC correction
Lastly, we need to emphasize that the GC correction, due to the random sampling from the background, is not deterministic and may yield a slightly different number of connections each time it is run. The more bins are resampled (see discussion above), the more differences may arise
As explained above, TF-peak connections are found by correlation TF
expression with peak accessibility. In addition to
expression, we also offer to identify statistically significant
TF-peak links based on TF Activity. The concept of TF Activity
is described in more detail in our diffTF
paper. In short,
we define TF motif activity, or TF activity for short, as the effect of
a TF on the state of chromatin as measured by chromatin accessibility or
active chromatin marks (i.e., ATAC-seq, DNase sequencing [DNase-seq], or
histone H3 lysine 27 acetylation [H3K27ac] ChIP-seq). A TF
Activity score is therefore needed for each TF and each
sample.
TF Activity information can either be calculated within the
GRaNIE
framework using a simplified and
empirical approach) or it can be calculated outside of our framework
using designated methods and then imported into our
framework. We now describe these two choices in more detail.
In our GRaNIE
approach, we empirically estimate TF
Activity for each TF with the following approach:
By default, we currently offer the four different types of normalizing the raw data for calculating TF Activity. Options 2 to 4 are described in more detail in the section Data normalization above, while option 1 is currently only available for TF Activity and therefore explained below (this may change in the future):
normOffsets
function of the csaw
package in R as opposed to using one size factor for each sample only as
with the regular size factor normalization in DeSeq
. For
each sample, a LOWESS (Locally Weighted Scatterplot Smoothing)
curve is fitted to the log-counts against the log-average count. The
fitted value for each bin pair is used as the generalized linear model
offset for that sample. The use of the average count provides more
stability than the average log-count when low counts are present. For
more details, see the csaw
package in R
and
the normOffsets
methods therein.DeSeq
Soon, it will also be possible to import TF Activity data into our framework as opposed to calculating it using the procedure as described above. This feature is currently in development and will be available soon.
Once TF Activity data is available, finding TF-peak links and assessing their significance is then done in complete analogy as for the TF expression data - just the input data is different (TF Activity as opposed to TF expression). The so-called connection type - expression or TF Activity, is stored in the GRN object and output tables and therefore allows to tailor and filter the resulting network accordingly. All output PDFs also contain the information whether a TF-peak link has been established via the TF expression or TF Activity.
In the GRaNIE
framework, we construct peak-gene pairs
completely independently and separately from the TF-peak links. For
this, the function addConnections_peak_gene()
is used. In
general, we offer two options to decide which peak-gene pairs to test
for correlation:
promoterRange
for details) to select which peak-gene pairs
to test. Thus, for a particular peak, all genes within 250 kb either up-
or downstream of it would be selected and tested for correlation.For both approaches, the user can select between two options of how
to exactly calculate the distance between a peak and a gene. That is,
which part of the gene is taken as reference point: the 5’ end (TSS, the
default) or the full gene including all exons and introns. For more
information see the R help (?addConnections_peak_gene
and
the parameter overlapTypeGene
in particular)
Importantly, as for TF-peaks, we also employ a shuffling-based approach to calculate background peak-gene links, as described in detail here.
As GRaNIE
models TF-peak and peak-gene links
independently from one another, the package function
filterGRNAndConnectGenes()
can be used to combine them to a
tripartite TF-peak-gene combination. In addition, filtering both
TF-peaks and peak-gene links according to different criteria is
important and can be user-adjusted flexibly. Importantly, for
statistical reasons, multiple testing correction of the peak-gene raw
p-values is only done after connecting them to TFs and filtering them
according to user-defined choices. We offer a wide range of different
adjustment methods, including Independent hypothesis weighting
(IHW), a multiple testing procedure that increases power compared to the
method of Benjamini and Hochberg by assigning data-driven weights to
each hypothesis. For more details, see, for example, here.
When SNPs have been integrated (see here), the user can choose whether to 1. additionally add a SNP filter to keep or discard TF-peak-gene connections (e.g. by requiring a minimum number of SNPs that have to overlap a peak) 2. keep peaks (either as part of TF-peak or peak-gene connections) if they overlap with a user-defined minimum number of SNPs (e.g, at least 1).
Thus, integrating SNPs can either reduce the size of eGRN by adding another filtering step or increase it by extending the criteria for which we find a peak relevant enough: In addition to being connected to a TF, a peak may also be kept just because it overlaps with a SNP (independent of any TF connection).
After successful execution, the filtered TF-peak-gene network is
stored in GRN@connections$all.filtered
(both real and
background eGRN). It can be easily retrieved, along with
additional options for which output columns to include, using
getGRNConnections()
.
After a filtered eGRN has been stored in the GRN object, the user can run all network related analysis such as enrichment or visualization. See the next sections for details.
Importantly, after successful execution of
filterGRNAndConnectGenes()
and re-population of the
filtered eGRN in
GRN@connections$all.filtered[["0"]]
, the eGRN
graph as produced by build_eGRN_graph()
(GRN@graph
) is reset to avoid inconsistencies with
mismatches between the filtered eGRN, the graph representation
of it and associated enrichment results.
In the GRaNIE
framework, in addition to the real
connections, we also calculate background connections in each
step based on randomized data. Calculating a background eGRN
follows exactly the same principles and methods as the real
eGRN, without exceptions. It is only the input data that is
shuffled in different ways, for both TF-peak links as well as peak-gene
links, as follows:
TF-peak links: First, the rows of the binary 0/1 TF-peak overlap table are shuffled separately for each sample (i.e., per column) to produce a truly random overlap matrix. In addition, the sample labels for the RNA counts are shuffled. Thus, subsequently, the counts that are correlated between RNA and peak data are not from the same pair of samples in fact.
peak-gene links: After creating the real and
true table for the peak-gene pairs that fulfil the user-specified
requirements for being tested for correlation (e.g., within the
specified distance from one another as defined by the parameter
promoterRange
), the peaks are shuffled. This consequently
means that the peak-gene pairs that are subsequently tested for
correlation are (usually) not in the specified proximity anymore, but
instead mostly on other chromosomes or at a very different location on
the same chromosome. Notably, this preserves the degree distribution for
both peaks and genes: that is, if a gene or peak is overrepresented,
this is also true for the shuffled version. Second, in analogy to the
background TF-peak links, the sample labels for the RNA counts are
shuffled before peak-gene correlations are calculated. The latter can be
controlled via the parameter shuffleRNACounts
in
addConnections_peak_gene()
. Empirically, we found that only
shuffling the peak-gene links is not sufficient to create a truly random
background, which is why the default behavior is to do the shuffling in
addition. Nevertheless, this can be user-controlled.
TF-peak-gene links (eGRN): Combining TF-peak and peak-gene links for the background data then again follows the exact same methods as the real eGRN. However, due to the randomized nature of both the TF-peak and peak-gene links, the background eGRN typically has very few or no statistically significant connections at all.
In summary, we want to emphasize that we currently do not simply permute the real eGRN network. Instead, we re-construct an eGRN using the very same methods but with randomized data, as outlined above. This allows us to judge and compare the connectivity for the real network as compared to a background eGRN.
In the GRN
object (GRN@connections
slot, to
be specific), we label the background data with a 1
, while
the original data is labeled as 0
. For example,
GRN@connections$TF_peaks[["0"]]
stores the original
connections, while GRN@connections$TF_peaks[["1"]]
stores
those arising from the background.
Relevant output files consequently contain the
background
label to designate that the corresponding
background has been used to generate the plot, while the original data
is labeled as original
.
After a filtered set of TF-peak-gene links (i.e., a full
eGRN) has been built with
filterGRNAndConnectGenes()
, the actual graph must be
constructed using the function build_eGRN_graph()
based on
the filtered eGRN (in
GRN@connections$all.filtered[["0"]]
).
Importantly, two types of networks are constructed:
filterGRNAndConnectGenes()
.allowLoops
).For subsequent functions, either of the two networks is used, and the
user can choose which type of network to use if both are possible (such
as for visualizeGRN
).
For more information, see the R help
(?build_eGRN_graph
).
After the graph has been built, three types of enrichment analysis
are available within the GRaNIE
framework:
calculateGeneralEnrichment
)calculateCommunitiesEnrichment()
)calculateTFEnrichment()
)All enrichment functions here use the TF-gene graph in the
GRN
object and therefore require the function
build_eGRN_graph()
to be run beforehand. They are, as
explained above, based on the filtered network as obtained by
filterGRNAndConnectGenes()
and include only the genes from
the corresponding (sub)network.
All three functions have corresponding plot functions to visualize the enrichment. For more information such as supported ontologies, see the corresponding R help for the functions, all details are explained there.
For an explanation of the output files from the plot functions, see here.
All results are stored in GRN@stats$Enrichment
. For
example, the results from the enrichment results of TF
E2F7.0.B
from the example GRN object (see function
loadExampleObject()
) for the GO
biological
process (BP) enrichment can be retrieved from
GRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$results
, while the
parameters that were used to run the enrichment are correspondingly
stored in
GRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$parameters
.
We also provide the functions visualizeGRN()
to
visualize a filtered eGRN network as well as
filterConnectionsForPlotting()
to filter a eGRN
just for visualization purposes. They are both very easy to invoke, but
provide many options to customize the output and the way the graph is
drawn. We recommend to explore the options in the R help
(?visualizeGRN
and
filterConnectionsForPlotting()
). Note that we use the
provided network visualization methods from the igraph
package for visualizeGRN()
.
Visualizing the whole network usually works well for small to medium-sized networks that contain up to a few hundred edges, but may fail for larger eGRNs. Drawing thousands of nodes and edges at once without losing readability is almost impossible. However, here are a few tips on what you can do if your eGRN cannot be drawn or it looks not nice enough:
maxEdgesToPlot
to
say 1000 or even more, and check whether the eGRN can be drawn.
It may also help to increase the dimensions of the PDF (if
plotAsPDF = TRUE
) by increasing pdf_width
and
pdf_height
.layout
parameter. Try all of the supported layouts (see
?visualizeGRN
for a list), maybe one looks good
enough.If none of the plotting-related parameters improve the visualization
or if you want to visually explore the network for particular features
of the network (such as specific TFs, peaks, or genes), you can use
filterConnectionsForPlotting()
to filter a eGRN
just for visualization purposes. This reduces the number of nodes and
edges to plot and gives the user ultimate flexibility of what to
visualize. For example, you can filter the network to just visualize the
part of the network that is connected to a specific set of TFs (i.e,
their regulons).
Alternatively, you can also re-filter the network again more
stringently using filterGRNAndConnectGenes()
, but then all
subsequent analyses have to be rerun as well (e.g.,
build_eGRN_graph()
and all enrichment functions). This may
reduce the multiple testing burden for peak-gene p-values and recover
connections that may not have passed the filter beforehand.
One of our latest features we added to the package is to use the
variancePartition
package to quantify and interpret the
multiple sources of biological and technical variation from the features
(TFs, peaks, and genes) in a GRN
object. This can be done
via the function add_featureVariation()
. This can be
helpful to identify where exactly the observed variation actually comes
from, which we believe is useful information when checking
eGRNs and particular elements of it. Note that this is still a
work of progress, however, and we did not yet test this feature for many
datasets.
In essence, we run the main function
fitExtractVarPartModel
of the package
variancePartition
. This fits a linear (mixed) model to
estimate the contribution of multiple sources of variation while
simultaneously correcting for all other variables for the features in a
GRN
object (TFs, peaks, and genes), given particular
metadata. The function reports the fraction of variance attributable to
each metadata variable. As input, the normalized count matrices are
used.
The formula used for model fitting can either be provided manually or
set to auto
. For the latter case, the formula will be build
automatically based on all metadata variables as specified with the
metadata
parameter. By default, numerical variables will be
modeled as fixed effects, while variables that are defined as factors or
can be converted to factors (characters and logical variables) are
modeled as random effects as recommended by the
variancePartition
package.
The user can also speciy whether to run the procedure only on the
features from the filtered set of connections (the eGRN, the
result of filterGRNAndConnectGenes()
) or on all filters
(that is, for all features irrespective of whether or not they are part
of the final eGRN).
For more information where the information is stored in the
GRN
object and which QC plots are available, see here.
Here, we describe the various output files that are produced by the pipeline. They are described in the order they are produced in the pipeline.
Our pipeline works and output a so-called GRN
object. The goal is simple: All information is stored in it, and by
keeping everything within one object and sharing it with others, they
have all the necessary data and information to run the
GRaNIE
workflow. A consistent and simple workflow logic
makes it easy and intuitive to work with it, similar to other packages
such as DESeq2
.
Technically speaking, it is an S4 object of class GRN
.
As you can see from the workflow
vignette, almost all GRaNIE
functions return a
GRN
object (with the notable exception of get
functions - i.e., all functions that start with get). Except
initializeGRN()
, which creates an empty GRN
object, they also all require a GRN
object as first
argument, which makes is easy and intuitive to work with.
GRN
objects contain all data and results necessary for
the various functions the package provides, and various extractor
functions allow to extract information out of an GRN
object
such as the various get
functions. In addition, printing a
GRN
object results in an object summary that is printed
(try it out and just type GRN
in the console if your
GRN
object is called like this!). In the future, we aim to
add more convenience functions. If you have specific ideas, please let
us know.
The slots of a GRN
object are described in the R help,
see ?GRN-class
for details. While we work on general
extractor functions for the various slots for optimal user experience,
we currently suggest to also access and explore the data directly with
the @
operator until we finalized it. For example, for a
GRN
object called GRN
, GRN@config
accesses the configuration slot that contains all parameters and object
metadata, and slotNames(GRN)
prints all available slots of
the object.
During a typical GRaNIE
analysis, almost all results are
automatically stored in GRN
object that is used as input
(exceptions are indicated in the corresponding sections) and therefore,
almost all functions return also the very same GRN
object,
containing in addition the results of the function.
The only exception is the function getGRNConnections()
that can be used to extract the resulting eGRN network from a
GRN
object and returns a separate data frame and
NOT a GRN
object. For more information and
an explanation about the various columns that are returned from the
function, see the corresponding R help (
?getGRNConnections
).
A summary of a GRN
object can be retrieved with the
function getGRNSummary()
. This returns a named list that
summarizes all relevant aspects for a GRN
object. For more
details, see the R help.
All user data is stored in GRN@data
. This slot contains
the following elements:
peaks
: Here, peaks data are stored, and the list
contains the following elements:
counts
stores the data after the user-selected
normalization scheme has been performed (even if set to
none
)counts_metadata
stores the peak metadata (peak
location, ID, and whether or not it is marked as filtered with the
current object settings)counts_raw
is only present when the user used
keepOriginalReadCounts = TRUE
when running the
addData()
function. It stores the original, user-provided
data before any normalization scheme. For memory reasons and because the
raw counts are never directly used in the GRaNIE
workflow,
they are not stored by default (anymore).RNA
: Here, RNA data are stored, and the list contains
the same elements as the peaks data, with one additional element:
GRN@data$RNA$counts_permuted_index
: A memory-saving
representation of how to perform sample shuffling when permuting the
data, which is the basis for constructing the background links for both
TF-peak and peak-gene connections.metadata
: Data frame with the user-provided
sample-specific metadata and a few additional metadata that were added
when running addData()
.TFs
: Here, TF-specific data are stored, and the list
contains the following elements:
TF_peak_overlap
stores the binary 0/1 TF-peak overlap
matrix that denotes whether a particular TF has a putative/predicted TF
binding site in a particular peak. This information comes from the
provided TF and TFBS data from the functions addTFBS()
and
overlapPeaksAndTFBS()
in particularclassification
stores the TF classification results and
is only present when the function
AR_classification_wrapper()
has been run successfully. For
more information, see here
Currently, the actual PCA result data are not stored in the
GRN
object, but we may change this. We will update the
Vignette once this is done and mention it in the News/Changelog.
The pipeline outputs PCA plots for both peaks and RNA as well as original (i..e, the counts the user provided as input) and normalized (i.e., the counts after normalizing them if any normalization method has been provided) data. Thus, in total, 4 different PCA plots are produced, 2 per data modality (peaks and RNA) and 2 per data type (original and normalized counts).
Each PDF consists of three parts: PCA results based on the top 500,
top 1000 and top 5000 features (see page headers, depending on the
parameter topn
in plotPCA_all()
). For each
part, different plot types are available and briefly explained in the
following:
The TF-peak connections are stored in
GRN@connections$TF_peaks
. This list is structured as
follows:
the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
inside both real and background links (i.e., within
GRN@connections$TF_peaks[["0"]]
and
GRN@connections$TF_peaks[["1"]]
), a list with two elements
is stored:
main
, containing a data frame with the actual TF-peak
connections along with additional propertiesconnectionStats
, containing a summary of the empirical
FDR procedure, stratified by TF, correlation bin,
FDR direction and TF connection type. This slots also
contains details on GC-specific information if activated (see parameter
useGCCorrection = TRUE
in
addConnections_TF_peak
)The slot GRN@stats$GC
is also populated:
GRN@stats$GC$peaks
.GRN@stats$GC
are stored:
TFs_GC_correction
: A data frame that summarized the GC
correction on the TF and GC bin levelTFs_GC_correction_plots
: For each TF, the GC plots for
each TF are stored, namely the same two plots that were part of the
output plots (see below). For example, to plot the first plot of the TF
AIRE.0.C
, simply type
GRN@stats$GC$TFs_GC_correction_plots$AIRE.0.C[[1]]
.We provide multiple QC plots for the TF-peak connections as part of the output, as explained in the following.
First, we provide an overview of the total number of TF-peak connections for a range of typically used FDR values for both real and background TF-peak links, stratified by the TF-peak correlation bin. Typically, one sees three things:
In addition, we provide TF-specific diagnostic plots for all TFs that
are included in the GRaNIE
analysis. They summarize the FDR
and number of connections, stratified by the connection type (see here for methodological
details), the FDR directionality and the TF-peak correlation bin for
both real and background links (see here for methodological details).
The TF name is indicated in the title, and each page shows two plots. In each plot, the TF-peak FDR for each correlation bin (ranging from -1 to 1 in bins of size 0.05) is shown. The only difference between the two plots is the directionality upon which the FDR is empirically derived from: the upper plot is for the positive and the lower plot for the negative direction (for more details, see the References). Each plot is also colored by the number of distinct TF-peak connections that fall into the particular bin. Mostly, correlation bins with smaller absolute correlation values have higher frequencies (i.e., more TF-peak links fall into them) while correlation bins with more extreme correlation values are less frequent. In the end, for the resulting TF-peak and ultimately the eGRN network, the directionality can be ignored and only those TF-peak links with small FDRs are kept, irrespective of the directionality.
For a particular (set of) TF-peak pair(s), the underlying TF-peak
data and the correlation can be visualized with
plotCorrelations()
. For more information, see the R help
for the function.
When GC background adjustment was activated
(useGCCorrection = TRUE
in the function
addConnections_TF_peak()
), additional QC plots can
optionally be generated with the function
plotDiagnosticPlots_TFPeaks_GC()
. They summarize all
relevant data and adjustments related to the GC adjustment. Note
that this function is not run automatically as part of the regular
TF-peak diagnostic plots at the moment when
plotDiagnosticPlots_TFPeaks()
is executed.
For each TF, 2 plots (1 per page in the corresponding PDF file) are produced. One shows the relative frequency of each of the GC bins, stratified by the peak set origin (foreground, background before and after GC adjustment), the second the absolute frequencies (log10 + 1 transformed). For the first plot, GC bins for which resampling was performed are additionally labeled with a percentage value that indicates which percentage of the required number of peaks was actually available in the background. Thus, a value of 95% indicates that only 5% of peaks where missing to have the required number of peaks available in the background (small number of resampled peaks), while a value of 5% indicates the opposite: a large resampling happened to select the required number of peaks, and a potential sampling bias from the small number of peaks to select from in the first place may have occurred.
Typically, though, resampling happens only for extreme GC bins that have a relatively low relative percentage in the foreground. Ideally, one should see a perfect match between foreground and GC-adjusted background in the plots, both in relative frequencies (page 1) and absolute numbers (page 2) for as many GC bins as possible. The number of GC bins for which resampling occurred should be as small as possible, on the other hand.
The pages after summarize further aspects of the GC adjustment in the form of heatmaps. Each page shows the heatmap for a particular variable, each row in the heatmap is a TF, and columns are either the peak GC bins (stratified into ten bins, see above) or the TF-peak correlation bins (40 bins, in steps of 0.05, see above). The color scale is either blue-white-red (for negative and positive scales), white-red (for only positive ones) or black-white (for binary variables).
Details for peak GC bins from either foreground or background
that are the basis for the bin-specific FDR calculation (labeled GC
adjustment details in the plot title) As the legend indicates,
variables are either foreground or background (n.bg.needed
,
n.bg.needed.ratio
, n.bg.needed.ratio.atLeast1
)
associated, belong to both (n
,
peak_width mean
, peak_width sd
,
n_rel
) or they capture the difference of foreground and
background (diff_fg_bg
). For each plot, TFs are clustered
according to their similarity, so the TF ordering differs for each
plot.
TF-peak connection summary from the actual connections before and
after the GC background adjustment (labeled GRN connections in
the plot title) as stored in GRN@connections$TF_peaks
. Four
plots are shown, three of which show absolute numbers
(n_beforeGC
, n_afterGC
,
n_diff_abs
) and one a relative number
(n_diff_rel
, calculated as n_afterGC
/
n_beforeGC
). For ratios that R deems as Infinite
like this (e.g, 1 connection after GC correction but no connection
without it - 1 / 0
), we set the ratio to a value of either
the maximum of all finite values in the heatmap or a hard-coded value of
10 - depending on what the maximum of these two values is. Unlike the
heatmaps above, TFs are not clustered here but appear in the same
alphabetical order on every plot.
It is also possible to extract the results from the AR classification
out of a GRN
object. Currently, this can only be done
manually, extractor functions are in the works that will further enhance
the user experience. The results are stored in the slot
GRN@data$TFs$classification[[permIndex]] [[connectionType]]$TF.classification
.
Here, permIndex
refers to the original, non-permuted
(0
) or permuted (1
) data, while
connectionType
here is either expression
or
TFActivity
, depending on whether the pipeline has also be
run for TF Activity in addition to expression (see function
addConnections_TF_peak()
). Thus, typically, the results for
the original data are stored in
GRN@data$TFs$classification[["0"]] [["expression"]]$TF.classification
.
If intermediate results from the classification have not been deleted
(the default is to delete them as they can occupy a large amount of
memory in the object, see the parameters of
AR_classification_wrapper
for details), they can be
accessed similarly within
GRN@data$TFs$classification[[permIndex]] [[connectionType]]
:
TF_cor_median_foreground
,
TF_cor_median_background
,
TF_peak_cor_foreground
,
TF_peak_cor_background
, and
act.rep.thres.l
.
The pipeline produces 3 different plot types related to the
activator-repressor (AR) classification that can optionally be run as
part of the GRaNIE
workflow. For each of the 3 types, plots
are produced for both the original (labeled as original
) as
well as the permuted (labeled as permuted
) data.
The AR classification is run for the RNA expression data (labeled as
expression
) and can additionally also be run for TF
activity data (labeled as TFActivity
, see the function
addConnections_TF_peak()
and its parameter options).
In the following, the 3 plot types are briefly explained:
Summary heatmaps (files starting with
TF_classification_summaryHeatmap
): This
is described in detail in the diffTF
documentation.
We explain and summarize this type of plot in the Workflow Vignette. Please check there for details.
Classification stringency summary plots (files starting
with TF_classification_stringencyThresholds
): This
is described in detail in the diffTF
documentation.
Density plots per TF (files starting with
TF_classification_densityPlotsForegroundBackground
):
Density plots for each TF, with one TF per page. The plot shows the
foreground (red, labeled as Motif
) and background (gray,
labeled as Non-motif
) densities of the correlation
coefficient (either Pearson or Spearman, see x-axis label) from peaks
with (foreground) or without (background) a (predicted) TFBS in the peak
for the particular TF. The numbers in the parenthesis summarize the
underlying total number of peaks.
The peak-gene connections are stored in
GRN@connections$peak_genes
. This list is structured as
follows:
GRN@connections$peak_genes[["0"]]
and
GRN@connections$peak_genes[["1"]]
), a data frame with the
peak-gene connections along with additional metadata
(peak.ID
, gene.ENSEMBL
,
peak_gene.distance
) and statistical properties are stored
(peak_gene.r
, peak_gene.p_raw
)The function plotDiagnosticPlots_peakGene()
or
indirectly the function addConnections_peak_gene()
(when
plotDiagnosticPlots = TRUE
) plots various diagnostic plots
for the peak-gene links that are imperative for understanding the
biological system and resulting eGRN. In what follows, we
describe them briefly, along with some notes on expected patterns,
implications etc. Note that this section is subject to regular
change.
The main QC summary figure on page 1 is divided into two rows: the
upper row focuses on the peak-gene raw p-value from the correlation
tests, while the lower row focuses on the peak-gene correlation
coefficient. The left side visualizes the data of the corresponding
metrics via density plots, while the right side bins the metrics and
visualizes them with barplots for highlighting differences between real
and background data as well as negatively and positively correlated
peak-gene links (denoted as r+
and r-
,
respectively).
We will now discuss and explain both of them in more detail.
First and most importantly, we focus on the distribution of the raw p-values from the correlation tests (i.e., peak accessibility correlated with gene expression) of all peak-gene links. We can investigate these from multiple perspectives.
Let’s start with a density plot. The upper left plot shows the raw p-value density for the particular gene type as indicated in the title (here: all gene types), stratified on two levels:
r-
, gray) and
positive (r+
, black) correlation coefficient,
respectivelyGenerally, we also consider r-
connections as
background. The reasoning for this is as follows: Since
chromatin accessibility in regulatory elements is generally associated
with active gene regulation and transcription, we only expect positive
correlations for functional peak-gene links. Notably, the same is still
true even for repressor-bound elements, where binding of most repressors
leads to loss of both accessibility and transcription. Negative
correlations have no clear biological meaning and therefore may indicate
remaining batch effects or random noise.
What we would like to see is:
r+
and r-
connectionsr+
links, a strong peak at
small p-values, and a (marginally) flat distribution for higher ones
(similar to a well-calibrated raw p-value distribution for any
hypothesis test such as differential expression). For r-
links, the peak at small p-values should be much smaller and ideally the
curve is completely flat. However, from the many datasets we examined,
this is rarely the case.If any of these quality controls fails, it indicates at least one assumption of the framework is violated. This could be due to an issue with data normalization, the underlying biology and what the eGRN represents, or an issue with data size or covariates influencing the results. Often, when the number of samples is small, the QC does not look satisfactory. Thus, it is important to use as many samples as possible, preferably at least 12 or so (see here for more details).
To quantify the difference between r+
and
r-
links within each connection type (background vs real),
we can also plot the results in form of ratios rather than densities for
either the r+
/ r-
or the real / background
dimension. These plots are shown in the upper right panel of the summary
plot.
For the r+
/ r-
dimension and background
data, the ratios should be close to 1 across all p-value bins, while for
the real data, a high ratio is typically seen for small p-values. In
general, the difference between the background and real bar should be
large for small p-values and close to 1 for larger ones.
For the real / background dimension, what we want to see is again a
high ratio for small p-value bins for the r+
links,
indicating that when comparing background vs real, there are many more
small p-value links in real data as compared to background. This usually
does not hold true for the r-
links, though, as can be seen
also from the plot: the gray bars are smaller and closer to 1 across the
whole binned p-value range.
We also provide a summary of the distribution of the peak-gene correlation coefficient. Typically, one sees that the distribution of the real links is wider (i.e., more links with high absolute values - in either direction) and less pronounced around 0 as compared to the background, an indication of the captured signal in the real data.
Lastly, we also provide additional, advanced QC plots in which we
stratify the aforementioned raw p-value distributions across real data
according to various metadata and additional biological and statistical
features that we calculate during the GRaNIE
workflow.
For each of these variables, two plots are shown based on the real
data only. First, density plots for the raw peak-gene p-value
distribution, stratified by r+
and r-, along with a summary
graph showing the ratio of r+
/ r-
in a
barplot. Second, a page with two barplots showing the same information
as before, just visually differently.
The features include, for example, the mean, median, or the
coefficient of variation (ratio of the standard deviation to the mean, a
unitless and standardized measure of dispersion/variability; abbreviated
CV in what follows) for both peak and RNA data. The individual values
are all calculated based on the normalized input data across all samples
(excluding the filtered peaks / genes, in analogy how the eGRN
is constructed) as stored in GRN@annotation
. We also
stratify by peak annotation as identified by the ChIPSeeker
package, the GC content of the peaks as well as the combined CV of peak
and gene:
Lastly, a few density plots are shown that stratify by the
peak-gene distance. In total 10 equidistant bins (which
results in the bins containing a non-equal number of data points but
genomic distance is increased uniformly from bin to bin) are
constructed, ranging from 0 to the user-defined value of the maximum
peak-gene distance, which was defined as promoterRange
in
the function addConnections_peak_gene()
. In addition, a
background bin is shown, labeled as “randomized”, from the background
data (see above):
r+
/r-
)We generally (hope to) see that for smaller peak-gene distances (in
particular those that overlap, i.e., the peak and the gene are in direct
vicinity or even overlapping), the difference between r+
and r-
links is bigger as for more distant links. We also
include the random links, for which no difference between
r+
and r-
links is visible, as expected for a
well-calibrated background.
For a particular (set of) peak-gene pair(s), the underlying peak-gene
data and the correlation can be visualized with
plotCorrelations()
. For more information, see the R help
for the function.
The peak-gene connections are stored in
GRN@connections$all.filtered
. This list is structured as
follows:
GRN@connections$all.filtered[["0"]]
and
GRN@connections$all.filtered[["1"]]
), a data frame with the
TF-peak-gene connections along with additional metadata and statistical
properties are stored. This table is not to be used directly, however;
instead, the function getGRNConnections()
has to be used to
retrieve them and add additional information as needed along with
it.No plots are available here, but the eGRN can be visualized
with visualizeGRN()
.
The peak-gene connections are stored in
GRN@connections$$TF_genes.filtered
. This list is structured
as follows:
GRN@connections$$TF_genes.filtered[["0"]]
and
GRN@connections$$TF_genes.filtered[["1"]]
), a data frame
with the TF-gene connections along with additional metadata and
statistical properties are stored. This table is not to be used
directly, however; instead, the function
getGRNConnections()
with
include_TF_gene_correlations
has to be used to retrieve
them.For a particular (set of) TF-gene pair(s), the underlying TF-gene
data and the correlation can be visualized with
plotCorrelations()
. For more information, see the R help
for the function.
In addition, the eGRN can be visualized with
visualizeGRN()
.
The results of the function generateStatsSummary()
is
stored in GRN@stats$connections
. This table contains a lot
of information, including but not limited to the various parameters the
function iterated over to generate multiple eGRNs for different
parameter thresholds and combinations such as the number of distinct
TFs, peaks, genes, their connectivity, etc.
We currently offer two different connection summary PDFs, both of
which are produced from the function
plot_stats_connectionSummary()
. Both PDFs shows the number
of connections for each node type (TF, peak, and gene), while in the
boxplots, peaks are further differentiated into TF-peak and peak-gene
entities. They also iterate over various parameters and plot one plot
per page and parameter combination, as indicated in the title:
allowMissingTFs
: TRUE
or
FALSE
(i.e., allow TFs to be missing when summarizing the
eGRN network. If set to TRUE
, a valid connection
may consist of just peak to gene, with no TF being connected to the
peak. For more details, see the R
help for
plot_stats_connectionSummary()
)allowMissingGenes
: TRUE
or
FALSE
(i.e., allow genes to be missing when summarizing the
eGRN network. If set to TRUE
, a valid connection
may consist of just TF to peak, with no gene being connected to the
peak. For more details, see the R
help for
plot_stats_connectionSummary()
)TF_peak.connectionType
. Either expression
or TFActivity
to denote which connection type the summary
is based on.Both plot types compare the connectivity for the real and background
data (denoted as Network type
in the boxplot PDF), which
allows a better judgment of the connectivity from the real data.
An example page for the summary heatmap looks like this:
Here, two heatmaps are shown, one for real (top) and one for background (bottom) network. Each of them shows for different combinations of TF-peak and peak-gene FDRs (0.01 to 0.2) the number of unique node types for the given FDR combination (here: TFs).
Second, a multi-page summary PDF for the connections in form of a boxplot, as exemplified with the following Figure:
The actual graph structure is stored in GRN@graph
after
successful execution of the function build_eGRN_graph()
.It
contains two different kind of graphs, namely the TF-peak-gene
and the TF-gene graph. Each of them is associated with a list
element with the same name, which stores also the table that was used to
create the graph along with the graph itself and the used parameters for
creating it.
The function visualizeGRN()
can be used for
visualization of the eGRN graph.
The identified communities for each node are stored within
igraph
objects in
GRN@graph$TF_gene$clusterGraph
(retrievable via
GRN@graph$TF_gene$clusterGraph$membership
, for example) and
as vertex attributes in GRN@graph$TF_gene$graph
(retrievable via
igraph::vertex.attributes(GRN@graph$TF_gene$graph)$community
,
for example).
The function visualizeGRN()
can be used for
visualization of the identified communities in the eGRN graph.
The functions plotCommunitiesStats()
and
plotCommunitiesEnrichment()
can be used for plotting and
summarizing community properties and enrichments, respectively.
All enrichment results are stored inside
GRN@stats$Enrichment
. This contains a nested list with the
named elements general
, byCommunity
, and
byTF
for general, community-specific and TF-specific
enrichments, respectively.
Inside each of these elements, there is further nesting by the chosen
ontology (e.g, `GO_BP
), and whether results or the
parameters of the enrichment function should be retrieved.
The output is structured as follows: - one page per ontology for the whole eGRN with an enrichment summary that includes the ontology terms (y-axis), the gene ratio (the number of genes found divided by the total number of genes in the foreground) on the x-axis, the raw p-value (color) and the absolute number of genes found (dot size). The plot title gives some additional information such as the chosen ontology, the statistic used, the used background, the number of genes in foreground and background, respectively, and the chosen multiple testing adjustment (if applicable; in some cases, this is ignored due to the way the enrichment calculation works)
The output is structured as follows: - one page per ontology and
community with an enrichment summary (see general enrichment above for
explanations) - two pages in the end that compare the
community-enrichment with the general enrichment. Here, all ontology
terms that are deemed significant according to the chosen significance
threshold p
(see function parameters) in either the general
or the community-specific enrichment are included, along with their
-log10-transformed statistical confidence (either raw or adjusted
p-value). The top part shows the size of the general enrichment and the
community-specific subnetworks (in number of genes they include). In
contrast, the last page shows not all but only the top 10 enriched terms
per column.
The results are added to GRN@stats$variancePartition
as
well as within the various feature-related elements of
GRN@annotation
. The former slot and results can be used for
the various diagnostic and plotting functions from the
variancePartition
package.
The easiest way to retrieve and include the results into the final
eGRN is to rerun the function getGRNConnections()
and set include_variancePartitionResults = TRUE
, as the
results are NOT added automatically to
GRN@connections$all.filtered
either.
SNP data is added via the function addSNPData()
. When
add_SNPs_LD = TRUE
, additional information for SNPs in LD
with any of the user-provided input SNPs are stored within
GRN@annotation$SNPs
and
GRN@annotation$SNPs_filtered
for the unfiltered and
filtered SNP table, respectively, and the list of peak-associated SNPs
is extended to SNPs in LD with overlapping SNPs from the user input.
Final peak-SNP overlaps are stored in
GRN@data$peaks$counts_metadata
. For more details, see
?addSNPData
.
In this section, we provide a few guidelines and recommendations that may be helpful for your analysis.
In this section, we want explicitly mention the designated scope of
the GRaNIE
package, its limitations and additional /
companion packages that may be used subsequently or beforehand.
This section is still being completed, and we regularly extend it.
The number of samples is very important. Due to the nature of the
method (correlating sample data), a small number of samples is likely to
violate the assumptions of our framework, in particular the peak-gene
links (see here for more details). We
generally recommend at least 10, better 15 samples that are shared
between the RNA and peak data - remember that in the GRaNIE
framework, only samples for which both modalities are available can be
included in the analysis. Thus, the overlap of RNA and peak data should
be at least 10-15 samples, the more the better in general.
The number of peaks that is provided as input matters greatly for the resulting eGRN and its connectivity. From our experience, this number should be in a reasonable range so that there is enough data and TFBS overlap to build a eGRN, but also not too many so that the whole pipeline runs unnecessarily long. We have good experience with the number of peaks ranging between 50,000 and 200,000, although these are not hard thresholds but rather recommendations.
With respect to the recommended width of the peaks, we usually use
peaks that have a width of a couple of hundred base pairs until a few
kb, while the default is to filter peaks if they are wider than 10,000
bp (parameter maxSize_peaks
in the function
filterData()
). Remember: peaks are used to overlap them
with TFBS, so if a particular peak is too narrow, the likelihood of not
overlapping with any (predicted) TFBS from any TF increases, and such a
peak is subsequently essentially ignored.
The following list is subject to change and provides some rough guidelines for the RNA-Seq data:
filterData()
, see the
argument minNormalizedMeanRNA
for more information. You may
want to check beforehand how many gens have a row mean of >1. This
number is usually in the tens of thousands.TFBS are a crucial input for any GRaNIE
analysis. Our
GRaNIE
approach is very agnostic as to how these files are
generated - as long as one BED file per TF is provided with TFBS
positions, the TF can be integrated. As explained above, we usually work
with TFBS as predicted by PWMScan
based on
HOCOMOCO
TF motifs, while this is by no means a requirement
of the pipeline. Instead, JASPAR
TFBS or TFBS from any
other database can also be used. The total number of TF and TFBS per TF
seems more relevant here, due to the way we integrate TFBS: We create a
binary 0/1 overlap matrix for each peak and TF, with 0 indicating that
no TFBS for a particular TF overlaps with a particular peak, while 1
indicates that at least 1 TFBS from the TFBS input data does indeed
overlap with the particular peak by at least 1 bp. Thus, having more
TFBS in general also increases the number of 1s and therefore the
foreground of the TF (see the diagnostic plots) while it makes
the foreground also more noisy if the TFBS list contains too many false
positives. As always in biology, this is a trade-off.
This section is partially based on and inspired by this nice summary about association measures in the context of gene expression networks.
We currently offer 3 different choices for how to measure the correlation (generally defined as an association measure that is used to estimate the relationships between two random variables) between different pairs of features such as TF-peaks, TF-genes, and peak-genes. All correlation coefficients take on values between −1 and 1 where negative values indicate an inverse relationship.
The choice depends on the user, and is dependent on whether or not the data contain outlier points that may heavily affect the correlation measures. While we generally advise to remove obvious outlier samples, as identified in a PCA, for example, from the analysis altogether in case they come from a single sample, in other cases this may not be the best approach and a more robust correlation measure such as Spearman or bicor seems more appropriate. We are currently also testing and evaluating which of them are best under different criteria and data typoes. For single-cell derived data, this choice seems to be particularly relevant.
As outlined already above, the peak-gene diagnostic plots are very important to check and to verify that the assumptions of our framework and the underlying data are met. We highly recommend doing so, and we provide detailed recommendations and our experiences here.
We now also provide some preliminary guidelines on whether and how to
use GRaNIE
for single-cell eGRN inference. For
more details, see the designated
vignette for single-cell data.
A nice little helper feature for the GRaNIE
package is
that the object history, including all function calls that have been
applied to the object, including the function parameters and their
actual values (unless a variable has been provided, then only the
variable name is stored and not the actual value), the date, and the
actual call are stored in the actual GRN
object in a nested
list inside GRN@GRN@config$functionParameters
. This looks
like the following, for example, for the function
add_TF_gene_correlation
that has been executed at
2023-01-23 21:50:52 CET:
> GRN@config$functionParameters$add_TF_gene_correlation$`2023-01-23_21:50:52`
$call
add_TF_gene_correlation(GRN = GRN, nCores = 5)
$parameters
$parameters$GRN
GRN
$parameters$nCores
1] 5
[
$parameters$corMethod
1] "pearson"
[
$parameters$forceRerun
1] FALSE [
This can be very helpful for recapitulating at a later point which
functions have been applied to a GRN
object, and to verify
specific parameter assignments.
In this section, we will give an overview over CPU and memory
requirements when running or planning an analysis with
GRaNIE
.
Both CPU time and memory footprint primarily depend on similar factors, namely the number of
While the number of TFs is typically similar across analyses when the
default database is used (HOCOMOCO
+ PWMScan
),
the number of peaks and samples may vary greatly across analyses.
For optimized CPU times, make sure to have the package
BiocParallel
installed, which is not automatically
installed with GRANIE
, even though it should be already
installed for most Bioconductor installations as it is one of the core
packages.
A typical analysis runs within an hour or two on a standard machine with 2 to 4 cores or so. CPU time non-surprisingly depends primarily on the number of used cores for those functions that support multiple cores and that are time-consuming in nature.
The maximum memory footprint even for a large dataset during a
typical GRaNIE
workflow usually does not exceed a few GB in
R
. Instead, it is usually in the range of 1-2 GB only
maximum. Thus, GRaNIE
can usually be run on a normal laptop
without problems.
We recommend not using more than a few 100,000 or so peaks as the memory footprint as well as running time may otherwise increase notably. However, there is no package-defined upper limit, it all depends on the available system memory.
Given that our approach is correlation-based, it seems preferable to maximize the number of samples while retaining only peaks that carry a biological signal that is differing among samples.
The size of the GRN
object itself is typically in the
range of a few hundred MB, but can go up to 1-2 GB for large datasets.
If you have troubles with big datasets, let us know! We always look for
ways to further optimize the memory footprint. However, with the recent
optimizations to store the count matrices as sparse matrix if the
fraction of 0s is too large, the overall memory footprint significantly
reduced.
1. [GRaNIE and GRaNPA: inference and evaluation of enhancer-mediated gene regulatory networks](https://www.embopress.org/doi/full/10.15252/msb.202311627)
2. [Comparison of co-expression measures: mutual information, correlation, and model based indices](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586947/)