Introductory remarks

Welcome to the N-of-1-pathways analysis tutorial! The intent of this document is to familiarize users with the basic functions included in the nof1 R package through a demonstration of a typical analytic workflow.

Brief overview of N-of-1-pathways

The software transforms a pair of transcriptomes from a single subject to a personal profile of pathway dysregulation1. The algorithm quantifies dysregulation2 (effect size) within a gene set along with statistical significance (p-value) in high throughput. Lastly, personal dysregulation profiles are visualized for pattern recognition through novel ’omic rose plots3. Moreover, these paired calculations can be applied in related problems, such as single-cell RNA-seq^3. Please refer to the references provided in the footnotes for details on these methods. The schematic below illustrates the N-of-1-pathways workflow:

Scope of the tutorial

In this demonstration, we will apply N-of-1-pathways to study the transcriptomes from four breast cancer patients. As we are applying the framework to multiple patients in one workflow, the majority of nof1 function used will have a multi designation. For example, the function compute_multi_nof1_pathways will be employed vs. compute_nof1_pathways. A truly one subject analysis is rare in practice, but rest assured that the analogous functions behave in a similar manner. Moreover, compute_nof1_pathways (and other non-multi functions) allow for more advanced users to flexibly calculate in non-standard scenarios.

Analysis of TCGA breast cancer patients

Step 0: Installing and attaching the package

As the nof1 package is not available on a public repository, the package must first be downloaded to a local destination. Then install using the following command (after updating with the system-specific path and version):

install.packages("/path/to/source/nof1_X.X.X.tar.gz", repos = NULL, type="source")

Be sure to employ the usual library() call in any data analysis scripts after installation.

## Load and attach the nof1 package after installation.
library(nof1)
## Retrieve the library location to access files needed for this tutorial
my_path <- find.package("nof1")

Step 1: Read gene expression data and define sample pairings

Utility functions are available to read/write gene expression, gene set (pathway) definitions, gene set descriptions, and sample data pairing specifications.

1.1: Reading in the gene expression

The first step is to read gene expression data into the workspace. In this tutorial, we will employ matched normal/tumor pairs of RNA-seq data from TCGA4 breast cancer patients as example data. We have preprocessed four matched pairs (8 transcriptomes) into an R data file to maintain a reasonable package size. Samples are labeled as TCGA.XX.XXXX.N or TCGA.XX.XXXX.T, where XX.XXXX identifies the patient and N or T indicates normal or tumor, respectively.

Notably, gene expression data can also be read into R using utilities provided by the package. The format for an input text file is specified in the read_gene_set function documentation. In short, the rows are the genes and the columns are the samples in a tab-delimited file.

## Example syntax for a text file:
## Load compressed gene expression data for the tutorial
data(BRCA_TCGA_4_patients)
## Display a few genes for the first two pairs of transcriptomes.
BRCA_TCGA_4_patients[1:5, 1:4]
##       TCGA.A7.A0DB.N TCGA.A7.A0DB.T TCGA.AC.A2FM.N TCGA.AC.A2FM.T
## A1BG         19.0722       302.9294       355.1020      1299.5289
## A1CF          0.0000         0.7664         0.0000         0.3341
## A2BP1         0.0000         2.2991         2.8548         0.0000
## A2LD1       153.5670       108.7537        81.4478       108.5667
## A2ML1         9.7938         1.9159         2.4470         0.0000

1.2: Define sample pairings

N-of-1-pathways is a paired analysis framework. Two transcriptomes (usually from the same patient) are analyzed for differential expression at the level gene sets (pathways). As such, one must provide the software guidance as to which samples are to be compared and, in particular, which sample is considered the “baseline” and “case” samples.

It is important to note that in gene_data above that the sample names are in a highly specific order. compute_multi_nof1_pathways assumes that the column (sample) order provided is of the form BaselinePair1, CasePair1, BaselinePair2, CasePair2, etc. However, one can specify the pairings via use of the read_sample_data and write_sample_data functions for complete control of the pairing specifications.

Also, it is important to note that patient/sample IDs (columns) cannot include the ‘-’ character as pairs are automatically named using this to show the direction of subtraction.

## Define pairings using a text file (included in the package as an example).
## Display the how these pairs are defined using this format.
sample_data
##   PairNum     SampleName BaselineOrCase
## 1       1 TCGA.A7.A0DB.N       BASELINE
## 2       1 TCGA.A7.A0DB.T           CASE
## 3       2 TCGA.AC.A2FM.N       BASELINE
## 4       2 TCGA.AC.A2FM.T           CASE
## 5       3 TCGA.BH.A1EU.N       BASELINE
## 6       3 TCGA.BH.A1EU.T           CASE
## 7       4 TCGA.E2.A153.N       BASELINE
## 8       4 TCGA.E2.A153.T           CASE
## Provide the total num of pairs.
max(as.numeric(sample_data\$PairNum))
## [1] 4

Step 2: Define pathways via a knowledgebase

In gene set (i.e, pathway) analysis genes are annotated into functional groups through a variety of approaches. In the nof1 package, we include pathways defined by the Pathway Interaction Database (PID)5. These are signaling pathways that are often indicated in cancer. The choice of ontology (knowledgebase) is an important consideration and will vary depending on the context and research objectives. However, that discussion is outside the scope of this tutorial.

From a practical standpoint, the format required is a ‘long’ structure. A tab-delimited text file is expected with pathway ID as the first variable and the HUGO gene symbol as the second.

## An example of how to read in a pathway definitions.
head(pid_filtered)
##   path_id symbol
## 1  200002   SSPO
## 2  200002  CHEK1
## 3  200002  FANCG
## 4  200002  FANCA
## 5  200002  XRCC3
## 6  200002   RAD1

It is often useful to include pathway descriptions in the output of the N-of-1-pathways framework. The format is very similar to above, but with the second variable indicated the pathway description - not pathway membership.

## note that read_gene_set reads both gene set definitions and descriptions
head(pid_desc)
##   path_id                                   description
## 1  200002                        Fanconi anemia pathway
## 2  200003       Regulation of nuclear SMAD2/3 signaling
## 3  200004 Fc-epsilon receptor I signaling in mast cells
## 4  200005                                   Endothelins
## 5  200006                         BCR signaling pathway
## 6  200007              Signaling events mediated by PRL

Step 3: Compute pathway dysregulation profiles

Now that the gene expression data and pathway annotation steps are complete, the actual calculation of pathway differential expression and statistical significance can be performed. Again, the details of the methods are provided in the references below and outside of the scope of this document.

In the example below, N-of-1-pathways Mahalanobis distance is applied to the four breast cancer patients in one convenient function call. The result is a list of data frames containing pathway scores for each of the 187 PID signaling pathways.

Often, a clinician or researcher a priori selects pathways of interest to study for treatment decisions or hypothesis testing. Here we arbitrary select the top five dysregulated pathways from the first patient to compare to the other three patients. In this context, the effect size and p-value for each pathway represent dysregulation as tumor tissue is compared to similar, non-tumor tissue from the same patient.

## Run and store Nof1-pathways Mahalanobis distance for the 4 pairs of data.
pathway_scores <- compute_multi_nof1_pathways(gene_data, sample_data = sample_data,
gene_set = pid_filtered,
model = "md", gene_set_desc = pid_desc)
## Explore the top five dysregulated pathways for the first patient.
pathway_scores[[1]][1:5,]
##        pathway_score direction      p_value num_genes_annot
## 200038     1.2262335         1 5.226671e-09              39
## 200002     1.0583383         1 2.617160e-08              48
## 200004    -0.6767436        -1 1.399637e-05              60
## 200079    -0.6144119        -1 1.263655e-04              48
## 200147    -0.6270801        -1 1.388339e-04              45
##        num_genes_measured model    zscore
## 200038                 39    md  5.839790
## 200002                 44    md  5.565291
## 200004                 57    md -4.343918
## 200079                 48    md -3.833436
## 200147                 45    md -3.810237
## 200038                         ATR signaling pathway 9.773874e-07
## 200002                        Fanconi anemia pathway 2.447044e-06
## 200004 Fc-epsilon receptor I signaling in mast cells 8.724403e-04
## 200079 Angiopoietin receptor Tie2-mediated signaling 5.192388e-03
## 200147                 IL6-mediated signaling events 5.192388e-03
## Store the top five pathways ID ("path_id") to compare to other patients.
top_ids <- row.names(pathway_scores[[1]])[1:5]

Note for larger data sets and ontologies, memory issues/speed could preclude the computation. In that case, it is advised to break up the analysis into smaller pieces and/or use the parallelization parameter in the function call. Also, the use of high-performance computing clusters is applicable.

Step 4: Visualize top pathway scores across patients

Finally, the nof1 package includes novel visualization tools to study and compare pathway dysregulation profiles. One such plot a rose plot.

## Shorten the pair names for display.
names(pathway_scores) <- gsub("TCGA\\.", "", names(pathway_scores))
## Create multiple rose plots, paneled by patient.
create_multi_rose_plot(pathway_scores, which_pathways = top_ids)

Here the negative natural logarithm of the p-values corresponding the each of the top five dysregulated pathways for patient A7.A0DB determine the radii for the ‘petals’. The petal areas are properly scaled so that the areas are proportional to the statistical significance of dysregulation. Note that there are a total of five petals for each rose plot. Petals above the horizontal axis (in blue) represent pathways that are higher expressed in the tumor tissue and conversely for the pathways that are down-regulated (in brown).

Note striking heterogeneity for these four patients. Patient E2.A153 has all five pathways that are significantly down-regulated. Whereas, both BH.A1EU and AC.A2FM are only significantly up-regulated in ATR signaling and Fanconi anemia in these five pathways. Comparisons of patient characteristics and/or determination of treatment plans could lead to better patient outcomes using these visualizations.

Concluding remarks

That concludes this tutorial to the nof1 package to implement the N-of-1-pathways framework. The package is in heavy development with more features, methods, and tutorials forthcoming. Please refer to the package documentation for detailed explanations and examples on all the functions and data sets.

Also, we greatly appreciate your support by citing the N-of-1-pathways framework:

citation("nof1")

Please also cite any particular method within the framework such as the Mahlanobis distance.