# Software Development - Juan R. González

We have developed several R packages in collaboration with researchers from different institutions. Some of these packages are related to genetics and others to survival analysis with recurrent events. Most of the packages are available at Bioconductor or CRAN. Those that are under development can be installed using our installer. To see them, start R and enter:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.available()*

**Genetics**

**Package MEAL (under development)**

This package contains a set of tools to analyze and visualize methylation and gene expression data. This is joint work with Carlos Ruiz.

The package can be installed through Bioconductor:

*source("https://bioconductor.org/biocLite.R")*

*biocLite("MEAL")*

The latest version of the package can be installed using the creal.install function, as follows:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("MEAL")*

The following are the vignettes for **MEAL** and for the **MultiDataSet** class (included in MEAL): MEAL - Case Example and MEAL - MultiDataSet.

**Package rasp (under development)**

This is joint work with Roderic Guigó's group at the Bioinformatics and Genomics program, Center for Genomic Regulation (CRG). This R package is designed to compare the relative expression of transcripts across different conditions in RNA-seq experiments. Our approach is based on a distance-based non-parametric multivariate ANOVA method.
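To illustrate the general idea of a distance-based non-parametric multivariate ANOVA, here is a minimal base-R sketch of a permutation test on a pseudo-F statistic. It is not the rasp implementation: the function names `pseudo_f` and `permanova_sketch`, the Euclidean distance, and the simulated transcript proportions are all assumptions made for this example.

```r
# Pseudo-F statistic for a one-way design, computed from a distance matrix.
pseudo_f <- function(d, group) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  group <- factor(group)
  a <- nlevels(group)
  # Total sum of squares: squared distances over all pairs, divided by n
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # Within-group sum of squares, accumulated group by group
  ss_within <- 0
  for (g in levels(group)) {
    idx <- which(group == g)
    sub <- d2[idx, idx, drop = FALSE]
    ss_within <- ss_within + sum(sub[upper.tri(sub)]) / length(idx)
  }
  ss_between <- ss_total - ss_within
  (ss_between / (a - 1)) / (ss_within / (n - a))
}

# Permutation test: shuffle the condition labels and recompute the statistic.
permanova_sketch <- function(x, group, n_perm = 999) {
  d <- dist(x)  # Euclidean distance (an assumption for this sketch)
  observed <- pseudo_f(d, group)
  permuted <- replicate(n_perm, pseudo_f(d, sample(group)))
  p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

set.seed(1)
# Simulated data: relative expression of 3 transcripts of one gene,
# 10 samples per condition, with different proportions in A and B.
cond <- rep(c("A", "B"), each = 10)
props_a <- matrix(rep(c(0.6, 0.3, 0.1), each = 10), nrow = 10)
props_b <- matrix(rep(c(0.2, 0.3, 0.5), each = 10), nrow = 10)
expr <- rbind(props_a, props_b) + matrix(rnorm(60, sd = 0.03), nrow = 20)

result <- permanova_sketch(expr, cond)
print(result)
```

Because the test relies only on pairwise distances and label permutations, it makes no distributional assumption about the transcript proportions themselves.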

The Linux version of the package is currently under development at Bioconductor (http://www.bioconductor.org/). To install a beta version of *rasp*, start R and enter:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("rasp")*

The performance of our approach has been compared with two other existing R packages (DEXSeq and EBSeq) using data from The Cancer Genome Atlas (TCGA). Exon abundances from RNA-seq data were obtained for several individuals diagnosed with liver hepatocellular carcinoma [LIHC] and bladder urothelial carcinoma [BLCA]. We have created two experimental data packages (ExonCountDataLIHC and ExonCountDataBLCA, respectively) that will be available at Bioconductor (http://www.bioconductor.org/). For now, they can be installed by starting R and entering:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("ExonCountDataLIHC")*

*creal.install("ExonCountDataBLCA")*

The affy2sv package contains a set of tools to extract the files required for analyzing different structural variants (CNVs, mosaicisms, inversions) from Affymetrix .CEL files (Genome Wide, Axiom and CytoScan HD). This is joint work with Carles Hernandez-Ferrer (CREAL).

The methods are described in the manuscript:

To be supplied. Freely available here.

To install the latest version, use our installer:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("affy2sv")*

If you prefer to install the package yourself, you can find the download files here, and the installation process is described in the package's wiki. The wiki also contains three vignettes (one for GenomeWide SNP, a second for Axiom and a third for CytoScan).

Jointly with Alejandro Cáceres, we have developed a method for calling inversion genotypes in common GWAS data that accounts for population stratification when an appropriate reference population is not known. This method is particularly useful for inversion association studies in a GWAS context where population stratification may be present. To install invClust, start R and enter:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("invClust")*

If you experience any problems during this process, the source code of the package can be downloaded from here.

The methods are described in the manuscript:

Cáceres and González, J. R. (2015) Following the footprints of polymorphic inversions on SNP data: from detection to association tests. NAR doi:10.1093. Freely available here.

invClust is available from Bioconductor. We have created a vignette that illustrates how to analyze real data.

tweeDEseq is an R package for analyzing RNA-seq count data. It implements the Poisson-Tweedie family of distributions to model the distribution of count data. This family includes the Poisson and Negative Binomial distributions as particular cases. The testPT test is used to detect genes that are differentially expressed (DE).
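As a small illustration of why a family that goes beyond the Poisson can be useful for counts, the following base-R sketch compares the variance-to-mean ratio of simulated Poisson and Negative Binomial data. It does not use any tweeDEseq function; the sample size, mean and `size` parameter are arbitrary values chosen for this example.

```r
set.seed(42)
n <- 5000   # number of simulated counts (arbitrary)
mu <- 20    # common mean of both distributions (arbitrary)

# Poisson counts: the variance equals the mean
pois_counts <- rpois(n, lambda = mu)

# Negative Binomial counts: variance = mu + mu^2 / size, so overdispersed
nb_counts <- rnbinom(n, mu = mu, size = 2)

# Variance-to-mean ratio: close to 1 for Poisson, larger under overdispersion
dispersion <- function(x) var(x) / mean(x)
disp <- c(poisson = dispersion(pois_counts), negbin = dispersion(nb_counts))
print(round(disp, 2))
```

A ratio well above 1 signals overdispersion that a Poisson model cannot capture, which is the kind of situation a more flexible count family is meant to handle.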

The methods are described in the manuscript:

Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR. A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. *BMC Bioinformatics* 2013, **14**:254. Freely available here.

The manuscript illustrates the performance of our proposed method using a real RNA-seq data set comprising 69 Nigerian individuals. We have created an experimental data package (tweeDEseqCountData) that is available at Bioconductor (http://www.bioconductor.org/).

tweeDEseq is available from Bioconductor. We have created a vignette that uses the tweeDEseqCountData package to illustrate how to analyze real data.

inveRsion is an R package for the detection of genetic inversions using SNP-array data. This is a collaboration with Alejandro Cáceres (CREAL) and Suzanne Sindi (Center for Computational Molecular Biology, Brown University). inveRsion is available at Bioconductor (www.bioconductor.org), and a manuscript has been published in BMC Bioinformatics:

Cáceres A, Sindi SS, Raphael BJ, Cáceres M, González JR. Identification of polymorphic inversions from genotypes. BMC Bioinformatics. 2012 Feb 9;13:28.

Our aim is to use SNP-array data from large cohorts, for which phenotype information has been collected, to assess the association of inversions with disease. We also intend to use the tool to assist in the mapping of human inversions, a project headed by Mario Cáceres (Universitat Autònoma de Barcelona).

**MAD (Mosaic Alteration Detector)**

This is joint work with Benjamin Rodriguez-Santiago (qGenomics) and Luis Pérez-Jurado (UPF). MAD is a software tool to detect mosaic events from SNP arrays using BAF and LRR values. The algorithm is based on a segmentation procedure that uses the main features of GADA (an R package to detect CNVs). To install MAD, start R and enter:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("mad")*

The methodological paper has been accepted in BMC Bioinformatics and can be found here. An example of how to use the software is described in the vignette. The algorithm has been used to discover mosaic alterations in a large collaborative study:

Jacobs KB, Yeager M, Zhou W, ... **Gonzalez JR**, ... Rothman N, Pérez-Jurado LA, Chanock SJ. Detectable clonal mosaicism and its relationship to aging and cancer. Nat Genet. 2012 May 6;44(6):651-8

This is joint work with Juan J Abellan and Carlos Abellan from the Centre for Public Health Research (CSISP), Valencia, Spain. We are interested in identifying the genetic variants, in particular SNPs or CNVs, that are specific to different groups of individuals. This could help elucidate differences in disease predisposition and response to pharmaceutical treatments. We propose a Bayesian model designed to analyze thousands of variants where only a few of them are expected to be associated with a specific phenotype. The model can be used to analyze case-control studies with cases from different diseases, or studies in which only cases with different kinds of phenotypes are available. The method is implemented in an R package called bayesGen (tar.gz file) that includes functions for analyzing SNP or CNV data. Some examples are described in the vignette. The statistical model and two real data examples are described here:

JR Gonzalez, C Abellán, JJ Abellán. Bayesian model to detect phenotype-specific genes for copy number data. *BMC Bioinformatics* 2012, **13**:130

The data used in the paper can be loaded after installing the package by executing:

*data(armengol) # HapMap example*

*data(OV) # Ovarian Cancer example from TCGA*

We are interested in assessing association between CNVs and traits using information obtained from aCGH, Illumina, MLPA or any other platform that provides quantitative measurements of CNVs. We propose a class of latent models that incorporates uncertainty when copy number status is inferred.

The functions for assessing association are implemented in an R package (tar.gz file (Linux) or zip file (Windows)). The package requires the ‘mixdist’ and ‘mclust’ packages to be installed. We have included two real data sets to illustrate how the model works. Methods and examples are described in the vignette (the scripts can be downloaded here: MLPA example and aCGH example).

The package includes functions for analysing association under a range of study designs (case-control, cohort, etc.), with several types of response variable (class status, censored data, counts), adjusting for covariates and under various inheritance models. It also includes functions for inferring copy number (CNV genotype calling) from raw probability data. Our package can also accept call probabilities obtained using other, more sophisticated algorithms, such as Canary or CGHcall, among others. Various classes and generic functions (print, summary, plot, anova, ...) have been created to facilitate the analysis.

The R package, the manual and the formulas for analyzing count and survival data are described in the paper:

Subirana I, Diaz-Uriarte R, Lucas G, Gonzalez JR. CNVassoc: Association analysis of CNV data using R. Bioinformatics (submitted June 2010)

The statistical methods for case-control studies are described in the paper:

Gonzalez JR, Subirana I, Escaramis G, Peraza S, Caceres A, Estivill X, Armengol L. Latent Class Model to Assess Association between Copy Number and Disease. BMC Bioinformatics 2009, 10:172 (pdf)

This is joint work with Roger Pique-Regi. The package includes GADA, a fast and accurate method for detecting copy number alterations (CNA) from array data. *The Google group of the package is no longer active.*

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("gada")*

The current version of the R package for Windows is here.

Examples and the vignette are available here.

Multiplex ligation-dependent probe amplification (MLPA) is a potentially useful semi-quantitative method to detect copy number alterations in targeted regions. In this project we are developing statistical models and methods to determine the statistical significance of altered probes. The functions are implemented in an R package that includes an R GUI application. MLPAstats can be downloaded from CRAN (package source, MacOS X and Windows binary versions are available here). The package contains two real MLPA data sets (text files can also be obtained here: BRCA and Rare Diseases) that can be analyzed as described in the vignette. The script can be downloaded here.

The statistical methods are described in the paper:

Gonzalez JR, Carrasco JL, Armengol L, Villatoro S, Jover L, Yasui Y, Estivill X. Probe-specific mixed-model approach to detect copy number differences using multiplex ligation-dependent probe amplification (MLPA). BMC Bioinformatics 2008, 9:261 (pdf)

The R package is illustrated in the manuscript:

Caceres A, Armengol L, Villatoro S, Gonzalez JR. An R GUI package for the integrated analysis of copy number alterations using MLPA data. BMC Bioinformatics, under revision

The BayNet R package is a collection of structure discovery functions. This is a collaboration with Mikel Esnaola, who implemented C code to improve computational performance. The package uses Bayesian networks along with MCMC (including order space search) and heuristic methods to look for the structures that best explain the interdependencies between a set of variables. For better results, we recommend using it together with the Rgraphviz package, which can be found in the Bioconductor repositories. To install BayNet, start R and enter:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("BayNet")*

An example of how to use the software is described in the vignette.

This package was built when I was working at Xavier Estivill’s lab at the Center for Genomic Regulation, and it is written in collaboration with Victor Moreno and his colleagues. The R package SNPassoc contains classes and methods to help the analysis of whole genome association studies. SNPassoc utilizes S4 classes and extends the haplo.stats R package to facilitate haplotype analyses. The package is useful for carrying out the most common analyses in whole genome association studies. These analyses include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (for either quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation tests and related tests (sum statistic and truncated product) are also implemented.
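As an illustration of one of the analyses listed above, the following base-R sketch checks Hardy-Weinberg equilibrium for a single biallelic SNP with a chi-square goodness-of-fit test. It is not SNPassoc code and SNPassoc may use a different (for example exact) test; the function name `hwe_chisq` and the genotype counts are made up for this example.

```r
# Chi-square test of Hardy-Weinberg equilibrium from genotype counts.
hwe_chisq <- function(n_AA, n_AB, n_BB) {
  n <- n_AA + n_AB + n_BB
  # Estimated frequency of allele A
  p <- (2 * n_AA + n_AB) / (2 * n)
  q <- 1 - p
  observed <- c(AA = n_AA, AB = n_AB, BB = n_BB)
  # Genotype counts expected under equilibrium: p^2, 2pq, q^2
  expected <- n * c(AA = p^2, AB = 2 * p * q, BB = q^2)
  statistic <- sum((observed - expected)^2 / expected)
  # One degree of freedom: 3 genotype classes - 1 - 1 estimated allele frequency
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  list(expected = expected, statistic = statistic, p.value = p_value)
}

# Counts that match the equilibrium proportions for p = 0.6
in_hwe <- hwe_chisq(360, 480, 160)
# Counts with a strong deficit of heterozygotes
out_hwe <- hwe_chisq(500, 100, 400)

print(in_hwe$p.value)
print(out_hwe$p.value)
```

A very small p-value, as in the second call, suggests departure from equilibrium, which in practice is often used as a genotyping quality check before association analysis.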

The package is available from CRAN. A more recent Linux version can also be downloaded here. A more recent Windows version is available here. This link contains the vignette. The latest version is available through creal.install:

*source("http://www.creal.cat/media/upload/arxius/jr/CREAL_install/install.R")*

*creal.install("SNPassoc")*

The methodology is described in:

JR Gonzalez, L Armengol, X Sole, E Guino, JM Mercader, X Estivill, V Moreno (2007). SNPassoc: an R package to perform whole genome association studies. Bioinformatics, 23:644-5 (pdf)

and it has been used (among others) in:

Real E., et al. A brain-derived neurotrophic factor (BDNF) haplotype is strongly associated with therapeutic response in obsessive-compulsive disorder. Biological Psychiatry 2009 Oct 1;66(7):674-80.

Gratacos M., et al. Identification of new putative susceptibility genes for several psychiatric disorders by association analysis of regulatory and non-synonymous SNPs of 306 genes involved in neurotransmission and neurodevelopment. Am J Med Genet B 2009 Sep 5;150B(6):808-16.

Gratacos M., et al. Contribution of neurotrophic tyrosine kinase receptor type 3 (NTRK3) gene to genetic susceptibility to obsessive-compulsive hoarding. Genes Brain and Behaviour 2008 Oct;7(7):778-85

Mercader JM., et al. Association of NTRK3 and its interaction with NGF suggest an altered cross-regulation of the neurotrophin signaling pathway in eating disorders. Hum Mol Genet. 2008 May 1;17(9):1234-44.

de Cid R., et al. BDNF variability in opioid addicts and response to methadone treatment. Genes Brain Behav. 2008 Jul;7(5):515-22.

Alonso P., et al. Extensive genotyping of the BDNF and NTRK2 genes define protective haplotypes against obsessive-compulsive disorder. 2008 Mar 15;63(6):619-28

Gratacos M., et al. A brain-derived neurotrophic factor (BDNF) haplotype is associated with antidepressant treatment outcome in mood disorders. Pharmacogenomics J 8(2):101-12 (2008)

The function requires the BayesMendel R package, in particular a C program that computes the probability of observing the phenotypes for the whole pedigree (deaf or hearing) given the genotype of the proband. This package is available upon request from the BayesMendel lab at The Johns Hopkins University.

The methodology is described in:

Gonzalez JR, Wang W, Ballana E, Estivill X (2006). A Recessive Mendelian Model to Predict Carrier Probabilities of DFNB1 for Nonsyndromic Deafness. Human Mutation, 27(11):1135-1142.

**Survival Analysis (recurrent events)**

This package is written jointly with Virginie Rondeau. Frailtypack can be used to estimate the parameters in a shared gamma frailty model with potentially right-censored, left-truncated and stratified survival data, using maximum penalized likelihood estimation. Time-dependent explanatory variables and/or an extension of the Cox regression model to recurrent events are also allowed. The program can also be used simply to obtain smooth estimates of the baseline hazard function.

The methodology is described and the package used in:

V Rondeau, JR Gonzalez (2005). Frailtypack: a computer program for the analysis of correlated failure time data using penalized likelihood estimation. Computer Methods and Programs in Biomedicine, 80:154-64.

This package is available from CRAN: source code and manual

A new version of frailtypack is available at CRAN. This new version includes functions for analyzing hierarchical (nested) models and recurrent event data with a terminal event, as well as an additive frailty model to jointly model the random treatment × trial interaction and the random trial effect in an individual patient data meta-analysis.

This package is written jointly with Edsel A Peña and Elizabeth Slate. Gcmrec estimates the parameters involved in a general class of models for recurrent event data proposed by Peña and Hollander. The software also estimates a model designed for analyzing relapses in patients diagnosed with cancer, accounting for the effect of interventions after relapses, as described in Gonzalez JR, Peña E, Slate E (2005).

The methodology is described in:

E Peña, EH Slate, JR Gonzalez (2007). Semiparametric inference for a general class of models for recurrent event data. J Stat Planning Inference, 137:1727-1747.

JR Gonzalez, E Peña, E Slate (2005). Modelling intervention effects after cancer relapses. Stat Med, 24:3959-75.

This package is available here: Windows version / Linux version

This package is written jointly with Edsel A Peña and Robert Strawderman. Survrec is designed to estimate the survival function for recurrent event data using the Peña-Strawderman-Hollander and Wang-Chang estimators and maximum likelihood estimation under a gamma frailty model.

This package is available from CRAN: source code and manual