GRN
object.addData.Rd
This function adds both RNA and peak data to a GRN
object, along with data normalization.
In addition, and highly recommended, sample metadata can be optionally provided.
addData(
GRN,
counts_peaks,
normalization_peaks = "DESeq2_sizeFactors",
idColumn_peaks = "peakID",
counts_rna,
normalization_rna = "limma_quantile",
idColumn_RNA = "ENSEMBL",
sampleMetadata = NULL,
additionalParams.l = list(),
allowOverlappingPeaks = FALSE,
keepOriginalReadCounts = FALSE,
EnsemblVersion = NULL,
genomeAnnotationSource = "AnnotationHub",
forceRerun = FALSE
)
Object of class GRN
Data frame. No default. Counts for the peaks, with raw or normalized counts for each peak (rows) across all samples (columns).
In addition to the count data, it must also contain one ID column with a particular format, see the argument idColumn_peaks
below.
Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
Character. Default DESeq2_sizeFactors
. Normalization procedure for peak data.
Must be one of limma_cyclicloess
, limma_quantile
, limma_scale
, csaw_cyclicLoess_orig
, csaw_TMM
,
EDASeq_GC_peaks
, gcqn_peaks
, DESeq2_sizeFactors
, none
.
Character. Default peakID
. Name of the column in the counts_peaks data frame that contains peak IDs.
The required format must be chr:start-end", with chr denoting the abbreviated chromosome name, and start and end the begin and end
of the peak coordinates, respectively. End must be bigger than start. Examples for valid peak IDs are chr1:400-800
or chrX:20-25
.
Data frame. No default. Counts for the RNA-seq data, with raw or normalized counts for each gene (rows) across all samples (columns).
In addition to the count data, it must also contain one ID column with a particular format, see the argument idColumn_rna
below.
Row names are ignored, column names must be set to the sample names and must match those from the RNA counts and the sample metadata table.
Character. Default limma_quantile
. Normalization procedure for peak data.
Must be one of limma_cyclicloess
, limma_quantile
, limma_scale
, csaw_cyclicLoess_orig
, csaw_TMM
, DESeq2_sizeFactors
, none
.
Character. Default ENSEMBL
. Name of the column in the counts_rna
data frame that contains Ensembl IDs.
Data frame. Default NULL
. Optional, additional metadata for the samples, such as age, sex, gender etc.
If provided, the @seealso [plotPCA_all()] function can then incorporate and plot it. Sample names must match with those from both peak and RNA-Seq data. The first column is expected to contain the sample IDs, the actual column name is irrelevant.
Named list. Default list()
. Additional parameters for the chosen normalization method.
Currently, only the GC-aware normalization methods EDASeq_GC_peaks
and gcqn_peaks
are supported here.
Both support the parameters roundResults
(logical flag, TRUE
or FALSE
) and nBins
(Integer > 0), and EDASeq_GC_peaks
supports three additional parameters:
withinLane_method
(one of: "loess","median","upper","full") and betweenLane_method
(one of: "median","upper","full").
For more information, see the EDASeq vignette.
TRUE
or FALSE
. Default FALSE
. Should overlapping peaks be allowed (then only a warning is issued
when overlapping peaks are found) or (the default) should an error be raised?
TRUE
or FALSE
. Default FALSE
. Should the original read counts as provided to the function be kept in addition to
storing the rad counts after a (if any) normalization? This increases the memory footprint of the object because 2 additional count matrices have to be stored.
NULL
or Character(1). Default NULL
. The Ensembl version to use for genome annotation retrieval via biomaRt
, which is only used to populate the gene annotation metadata that is stored in GRN@annotation$genes
.
By default (NULL
), the newest version is selected for the most recent genome assembly versions is used (see biomaRt::listEnsemblArchives()
for supported versions). This parameter can override this to use a custom (older) version instead.
AnnotationHub
or biomaRt
. Default AnnotationHub
. Annotation source to retrieve genome annotation data from.
TRUE
or FALSE
. Default FALSE
. Force execution, even if the GRN object already contains the result. Overwrites the old results.
An updated GRN
object, with added data from this function(e.g., slots GRN@data$peaks
and GRN@data$RNA
)
If the ChIPseeker
package is installed, additional peak annotation is provided in the annotation slot and a peak annotation QC plot is produced as part of peak-gene QC.
This is fully optional, however, and has no consequences for downstream functions.
Normalizing the data sensibly is very important. When quantile
is chose, limma::normalizeQuantiles
is used, which in essence does the following:
Each quantile of each column is set to the mean of that quantile across arrays. The intention is to make all the normalized columns have the same empirical distribution.
This will be exactly true if there are no missing values and no ties within the columns: the normalized columns are then simply permutations of one another.
# See the Workflow vignette on the GRaNIE website for examples
# library(readr)
# rna.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/rna.tsv.gz")
# peaks.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/peaks.tsv.gz")
# meta.df = read_tsv("https://www.embl.de/download/zaugg/GRaNIE/sampleMetadata.tsv.gz")
# GRN = loadExampleObject()
# We omit sampleMetadata = meta.df in the following line, becomes too long otherwise
# GRN = addData(GRN, counts_peaks = peaks.df, counts_rna = rna.df, forceRerun = FALSE)