AltAnalyze Information and Instructions

Version 1.1 Beta

Section 1 - Introduction

1.1 Program Description

               AltAnalyze is a freely available, cross-platform application that allows you to take raw (CEL file) or processed microarray data and assess alternative splicing or alternative promoter usage and then view how these changes may affect protein sequence, domain composition, and microRNA targeting. This software requires no advanced knowledge of bioinformatics programs or scripting. All you need are your microarray files along with some simple descriptions of the conditions that you're analyzing. In addition to splicing sensitive microarrays, AltAnalyze can analyze conventional Affymetrix gene expression arrays (e.g., 3' arrays or Gene 1.0 arrays).

               AltAnalyze is composed of a set of modules designed to (A) organize, filter, and summarize transcript tiling data; (B) calculate scores for alternative splicing (AS), alternative promoter selection (APS) or transcript elongation; (C) annotate regulated alternative exon events; and (D) assess downstream predicted functional consequences at the level of protein domains and microRNA binding sites (miR-BS). The resulting data will be a series of text files (results and over-representation analyses) that you can directly open in a computer spreadsheet program for analysis and filtering. In addition, export files are created for the Cytoscape [1] program DomainGraph, to graphically view domain and miR-BS probe set-exon alignments and AltAnalyze statistics.

               Alternative exon analysis is currently compatible with the Affymetrix Exon 1.0 ST array and the custom Affymetrix AltMouse A array, however, it may be adapted to support other platforms on a per example basis (contact author) or by other developers. Analysis of conventional Affymetrix microarrays is supported for array normalization, calculation of array group statistics and dataset annotation.

1.2 Implementation

               AltAnalyze is provided as a stand-alone application that can be run on Windows or Mac OS X operating systems, without installation of any additional software. This software is composed of a set of distinct modules written in the programming language Python. Python is a cross-platform compatible language; therefore, AltAnalyze can be run on any operating system that has Python installed including Linux and any future Mac OS X operating systems (python is included standard). The AltAnalyze interface is an interactive graphic user interface with easy to use options. To run AltAnalyze from source-code, rather than through the stand-alone applications, see the AltAnalyze Google Groups page (http://groups.google.com/group/alt_predictions) for more information.

1.3 Requirements

               The basic installation of AltAnalyze requires a minimum of 1GB of hard-drive space for all required databases and components. This includes support for all species and currently supported arrays. Future versions may include an option for automated download of databases specific to the user specified array analyses.A minimum of 1GB of RAM and Intel Pentium III processor speed are further recommended. At least an additional 1GB of free hard-drive space is recommended for building the required output files.

1.4 Pre-Processing, External Files and Applications

               AltAnalyze can process raw Affymetrix image files (CEL files) using the RMA algorithm. This algorithm is provided through Affymetrix Power Tools (APT) binaries that are distributed with AltAnalyze in agreement with the GNU distribution license (see agreement in the APT directory). In addition, users can pre-process their data outside of AltAnalyze to obtain expression values using any desired method. Example methods for obtaining such data include ExpressionConsole (Affymetrix) and R (Bioconductor), either of which can be used if the user desires another normalization algorithm rather than RMA (e.g., GC-RMA, PLIER).

               If CEL files are processed directly by AltAnalyze (using APT and RMA), two files will be produced; an expression file (containing probe set and expression values for each array in your study) and a detection above background (DABG) p-value file (containing corresponding DABG p-values for each probe set).The results produced by AltAnalyze will be identical to those produced by APT or ExpressionConsole (http://www.affymetrix.com/products/software/specific/expression_console_software.affx).

               The only external files required by AltAnalyze are Affymetrix library files (for array summarization) and Affymetrix gene annotation files. Both of these files will be prompted for import in AltAnalyze along with links provided for downloading.

               In additional to expression analysis, users can optionally install the program R (http://www.cran.org). R is optional when performing an exon-junction array analysis using the algorithm Linear Regression with the rlm method (not needed for basic Linear Regression). Further directions are provided to interface R in the algorithms.

1.5 Help with AltAnalyze

Additional documentation, help, and user discussions are available at the AltAnalyze website or at the AltAnalyze Google Groups user forum:

http://groups.google.com/group/alt_predictions


Section 2 - Running AltAnalyze for Splicing Arrays

2.1 Where to Save Input Expression Files?

               When performing analyses in AltAnalyze, the user needs to store all of their Affymetrix CEL files in one directory. This directory can be placed anywhere on your computer and will be later selected in AltAnalyze. Example files can be downloaded here.

               If the user has already run normalization on their CEL files outside of AltAnalyze or have downloaded already analyzed expression data from another source, you can save the expression and DABG p-value file (optional) anywhere on your computer. These files should be tab delimited text files that only consist of probe sets, expression values and headers for each column. Example files can be downloaded here.

2.2 Beginning an AltAnalyze Run

Windows and Mac Directions:

               Once you have saved your CEL files or normalized expression values to your computer, open the AltAnalyze application folder and double-click on the executable file named "AltAnalyze.exe" (Windows) or "AltAnalyze" (Mac).This will open a set of user interface windows where you will be presented with a series of program options (see following sections).

 

Linux and Source Code Directions:

               On Linux, move the files from the folder named "Source_code" up one directory to the AltAnalyze application main folder. You can start AltAnalyze by opening a terminal window and going to the AltAnalyze main folder (e.g. "cd AltAnalyze_v1beta" from the program parent directory). Once in this directory, typing "python AltAnalyze.py" in the terminal window will begin to run AltAnalyze (you should see the AltAnalyze main menu within a matter of seconds).

 

AltAnalyze Interface Options:

               There are many options in AltAnalyze, which allow the user to customize their output, the types of analyses they run and the stringency of those analyses. The following sections show the sequential steps involved in running and navigating AltAnalyze. Interactive tutorials for different analyses are provided through the AltAnalyze website.

1)     Introduction Window - Upon Opening AltAnalyze, the user is presented with the AltAnalyze splash screen and additional information. To directly open the AltAnalyze download page, follow the hyperlink under "About AltAnalyze", otherwise select "Begin Analysis".

A

B

 

Figure 2.1. Select a Species for Analysis in AltAnalyze. A) Options for selecting currently listed species and B) adding information for a new species.

 

2)     Select species - Next, the user must select a species for analysis. Select the species name and click "Continue". For alternative exon analysis, AltAnalyze is only compatible with human, mouse and rat Affymetrix arrays. However, if the user wishes to analyze data from a non-splicing array, that species can be added by selecting "other" and "Continue".A window named "Add New Species Support" will pop-up, where you can add a two letter species code and full name that will be used for all current and future analyses (Figure 2.1 B). Further support for new gene expression arrays is added by downloading the Affymetrix CSV file for that array (see below sections).

Figure 2.2. Select the Analysis and Array Type. Options available for the select species. AltAnalyze filtered is only available for alternative exon analyses (e.g., Affymetrix Exon ST 1.0 array or AltMouse A).

 

3)     Select analysis options and array type - In this window, the user must select the type data being analyzed and the type of array they will be using. There are three main types of data; A) CEL files, B) Expression files and C) AltAnalyze filtered files. Processing of CEL files will produce the two file types (expression and DABG). Processing of Expression files, allows the user to select tab-delimited text files where the data has already been processed (e.g. RMA), which will also produce AltAnalyze filtered files. AltAnalyze filtered files are written for any splicing array analysis (not for gene expression only arrays). These files are the only required input for a splicing analysis. Since CEL file normalization and array filtering and summarization can take a considerable amount of time (depending on the number of arrays), if re-performing an alternative exon analysis with different parameters, it is recommended that the user select the "Expression files" or "AltAnalyze filtered files", depending on which options the user wants to change. There are four possible array types that can be selected. Be sure to select the proper type indicated by the array manufacturer and select "Continue".

 

A
B

Figure 2.3. Select Folder and File Locations. (A) To properly run RMA using the APT software (included with AltAnalyze), the user must select a valid folder containing CEL files. (B) For other Affymetrix arrays, you will be prompted to download a library file(s) for that array. For some arrays, you will be prompted for AltAnalyze to download these plus annotation files for you. Otherwise you will be prompted by the program for such files.

 

4)     Processing Affymetrix CEL Files - If you selected the "CEL files" data type from "Main Dataset Parameters", you will be presented with a new window for selecting the location of your CEL files, array library files and desired output directory. Clicking the "select folder" icon will allow you to browse your hard-drive to select the folder with your CEL files. You can double-check the correct directory is selected by looking at the adjacent text display. This window will be followed by an indicator window that will automatically download the library and annotation files for that array. If the array type is unrecognized and you do not already have Affymetrix library files for your array (e.g. PGF or CDF), you will need to download these files from the Affymetrix website. To do so, select the link at the bottom left side of this window named "Download Library Files". Select the array type being analyzed from the web page and select the appropriate library files to download and extract to your computer (requires an Affymetrix username and password) (Figure 2.3 B).

Figure 2.4. Select Summarization and Filtering Options. Users are presented with options for filtering probe sets for alternative exon analyses (DAGB and mean group expression) and options for how to derive gene expression values.

 

5)     Summarizing Gene Data and Filtering For Expression - After CEL file data summarization, a number of options are available for summarizing gene level expression data and filtering out probe sets prior to alternative exon analysis. For splicing arrays, AltAnalyze calculates a "gene-expression" value based on the mean expression of all constitutive probe sets that have a DABG p-value less than the user indicated threshold for each gene. These values are used to report predicted gene expression changes (independent of alternative splicing) for all user-defined comparisons (see following section). In addition, fold changes and ttest p-values are calculated for each of these group comparisons. These statistics along with several types of gene annotations exported to a file in the folder "ExpressionOutput" in the user-defined results directory. Along with this tab-delimited text file, a similar file with those values most appropriate for import into the pathway analysis program GenMAPP will also be produced (Figure 5.1). For splicing analyses, probe sets with user defined splicing cutoffs (expression and DABG p-values) will be retained for further analysis (see section 5.2 - ExpressionBuilder algorithm).


 

A

B

 

Figure 2.5. Select Summarization and Filtering Options. Splicing analysis options for (A) exon arrays and (B) junction arrays.

 

6)     Select Alternative Exon Analysis Parameters - If using an exon-sensitive (e.g., Human Affymetrix 1.0 ST Exon array) or junction-sensitive microarray, the user will be presented with specific options for that microarray (Figure 2.5). These options include alternative exon analysis methods, statistical thresholds, and options for additional analyses (e.g., MiDAS). The default options are recommended. These include restricting the analysis to a conservative set of probe sets (e.g., "core") and changing the threshold of splicing statistics. When complete, the user can select "Continue" in AltAnalyze to incorporate these statistics into the analysis.

 

A

B


Figure 2.6. Establish Groups and Comparisons
. (A) Enter a name for each group for all samples. (B) Enter all group comparisons for any possible pair-wise group comparisons (in this case there is only one).

 

7)     Assigning Groups to Samples - When analyzing a dataset for the first time, the user will need to establish which samples correspond to which groups. Type in the name of the group adjacent to each sample name from in your dataset (Figure 2.6 A).

8)     Establishing Comparisons between Groups - Once sample-to-group relationships are added, the user can list which comparisons they wish to be performed (Figure 2.6 B). For splicing and non-splicing arrays, folds and p-values will be calculated for each comparison for the gene expression summary file. For splicing arrays, each comparison will be run in AltAnalyze to identify alternative exons. Thus, the more pair-wise comparisons the longer the analysis.

Figure 2.7. Download Affymetrix Annotation CSV warning. Screen appears if no CSV file for that species is found in the appropriate AltAnalyze directory.

 

9)     Download Gene Annotations - When running a non-splicing analysis for the first time, the user will be prompted to download the proper probe set annotation file from Affymetrix. Follow the link and instructions to download the appropriate annotation CSV file for your microarray. NOTE: when an analyzing a splicing-array dataset, it is also recommended that you have at least one Affymetrix annotation CSV file in the indicated directory to link gene level results to Gene Ontology and pathway annotations. Any Affymetrix CSV file for that species is sufficient (genes not probe sets are matched).

Figure 2.8. AltAnalyze Status. The AltAnalyze status window will appear once all user options are defined. Analysis run-time will depend on the number of samples, comparisons and array type.

 

10) AltAnalyze Status Window - While the AltAnalyze program is running, several intermediate results files will be created, including probe set, gene and dataset level summaries (see section 2.3). The results window (Figure 2.8) will indicate the progress of each analysis as it is running. When finished, AltAnalyze will prompt the user that the analysis is finished and a new "Continue" button will appear. In addition to viewing the program report, this information is written to a log text file in the user-defined output directory.

2.3 AltAnalyze Analysis Options

There is a number of analysis options provided through the AltAnalyze interface. In this section will overview these options for the different compatible array analyses (gene expression arrays, exon arrays and junction arrays). For new users, we recommend first running the program with the pre-set defaults and then modifying the options as necessary.

 

Selecting The MicroArray Analysis Method

After the user has selected the species of interest, they must choose what type of data they will next be analyzing. Data can consist of; 1) Affymetrix CEL files, 2) an already processed expression text files or 3) properly formatted and filtered AltAnalyze expression input text file. If beginning with Affymetrix CEL files all three of these file types are produced in series (see following section) and automatically processed without any user intervention. If all CEL files from your study already been previously in AltAnalyze or in another program, the user can load this file be selecting the option "expression file" and choosing this text file from your computer. This file needs to contain data from arrays corresponding to at least two biological groups. Users may wish to re-analyze these files to change their expression filtering parameters to be more or less stringent. For the two or more biological groups (see how to define in Figure 2.6), AltAnalyze will segregate the raw data based on the user-defined pair-wise group comparisons and filter the containing probe sets based on whether they match the user-defined thresholds for inclusion and are associated with Ensembl genes (see the below section: Expression Analysis Parameters). These files will be saved to the folder "AltExpression" in the user-defined output directory. These files can be later selected by choosing the option "AltAnalyze filtered", if the user wishes to re-run or use different AltAnalyze alternative exon analysis options (see below section: Alternative Exon Analysis Parameters).


CEL File Summarization

CEL files are one of the file types produced after scanning an Affymetrix microarray. The CEL file is produced automatically from the DAT file (an image file, similar to a JPEG), by the Affymetrix software by overlaying a grid over the microarray florescent image and assigning a numeric value to each cell or probe. From this file, expression values for each probe set can be calculated and normalized for all arrays in the study using various algorithms.

When choosing to analyze CEL files in AltAnalyze, the user will be prompted to identify the folder containing the CEL files and the folder in which to save these other results to. The user will also need to assign a name to the dataset. These CEL files will be summarized using the RMA algorithm using the program Affymetrix Power Tools (APT). The APT C++ module "apt-probeset-summarize" is directly called by AltAnalyze when running AltAnalyze on a Mac, PC or Linux operating system. Unlike some other applications, APT is packaged with AltAnalyze and thus does not require separate installation. However, because it is a separate application there may be unknown compatibility issues that exist, depending on your specific system configuration and account privileges. For human and mouse exon arrays, AltAnalyze also allows for the masking of probes with cross-hybridization potential, prior to running RMA. This is performed through an experimental APT function (--kill-list), masking probes that are indicated in files produced for the MADS application (http://biogibbs.stanford.edu/~yxing/MADS/Annotation.html) that cross-hybridize to an off-target transcript within 3bp mismatches and a person correlation coefficient > 0.55, as per the MADS recommendations.

APT requires the presence of a library file(s) specific for that array. AltAnalyze will automatically determine the array type and can install these files if the user wishes (currently most human, rat and mouse arrays supported). If AltAnalyze does not recognize the specific array type or the user chooses to download these files themselves, they will need to select the appropriate files when prompted in AltAnalyze. For exon arrays, a PGF, CLF and antigenomic BGP file are required. These files will be automatically downloaded and installed if the user selects "Download" when prompted. For non-exon arrays, the appropriate CDF file will be downloaded. In addition to these library files, a NetAffx CSV annotation file will be downloaded that allows for addition of gene annotations (non-exon arrays) and Gene Ontology pathway annotations (all arrays). Once installed, AltAnalyze will recognize these files and automatically use them for all future analyses. Once the user selects the appropriate directories and files, the user will be prompted to select the remaining options in AltAnalyze, before APT is run. Once run, an tab-delimited text expression file will be produced for all probe sets on the array and a detection-above background (DABG) p-value file (exon array only).

 

Loading a Processed Expression File

If CEL files are processed outside of AltAnalyze, the user must save the resulting expression text file in tab-delimited format. It is alright if the first rows in the file have run information as long as they are followed by a pound sign (#).


Expression Analysis Parameters

The options presented in this interface (Figure 2.4) allow the user to determine what fields are present in the gene expression output file, what scale the data is in (e.g. logarithmic), which probe sets to use when calculating gene expression and how to filter probe sets for subsequent analyses.

1)     Analyze alternative exon and/or expression data - Selecting the option "expression" will halt the analysis after the gene expression result file has been written, such that no splicing analysis is performed. This option is only available for splicing-sensitive arrays.

2)     Select expression data format - Indicates the format in which the CEL file summarized data has been written. When the CEL files have been processed by AltAnalyze, ExpressionConsole, APT, RMAExpress or through R, the file format will be logarithmic base 2 (log). If the user designates "not-log", then expression values will be log base 2 (log2) transformed prior to analysis.

3)     Use all probe sets to derive gene expression versus constitutive only - For splicing sensitive microarrays, the user has the choice to alter the way in which gene expression values are calculated and how to filter their probe set expression files prior to alternative exon analysis. When "yes" is selected for this option, all probe sets linked to a unique gene will be used to calculate a measure of gene expression by taking the mean expression of all probe set values. When the "no" is selected, only those probe sets that have been annotated as constitutive or common to all isoforms will be used for gene expression calculation. In either case, only probe sets with at least one array possessing a DABG p-value less than the user threshold will be retained (if a DABG p-value file is present). In order to exclude this threshold, set the minimum DABG p-value equal to 1.

4)     Include replicate experimental values in the export - Instructs AltAnalyze whether to include the expression values associated with each CEL file in the output file. If not selected only the mean expression value of all CEL files for each biological group will be written.

5)     Enter maximum mean DABG p-value - When a DABG file has been produced (default when summarizing CEL files with AltAnalyze for exon-arrays), this option is applied. The default DABG p-value cutoff is p<0.05. This will filter out any non-constitutive probe set that has a mean DABG p>0.05 for both compared biological groups. For constitutive probe sets, both biological groups must have a DABG p < user-value. In order to exclude this option, you can remove the DABG file (contains the prefix "stats."), or set this value equal to 1.

6)     Enter minimum mean expression value (non-log) - This statistic is treated the same as the DABG p-value cutoff except in that probe sets with a mean expression value less than this cutoff will be excluded. The same rules apply to this value as to the DABG value, where both variables must be true for probe set inclusion (e.g., p<0.05 and mean expression > 70). To exclude this option, set the default value to 100,000 (greater than the maximum intensity value of most expression summaries).

 

Alternative Exon Analysis Parameters

The options presented in this menu (Figure 2.5) instruct AltAnalyze what statistical methods to use when determining alternative exon expression, which probe sets to select for analysis, what domain-level and miR-BS analyses to perform and what additional values to export for analyses in other tools. Details on each analysis algorithm are covered in detail in section 3.2.

1)     Select the alternative exon algorithm - For exon arrays, the splicing index method is the only method currently provided, however, for junction arrays, several methods are available. These methods are used to calculate an alternative exon score, relative to constitutive expression levels. The default value for splicing-index analysis is 2, indicating that an adjusted expression difference greater than two fold (up- or down-regulated) is required for the probe set to be reported. Based on the algorithm, different values and scales will apply. For junction-arrays, the ASPIRE algorithm default cutoff is 0.2, whereas the linear-regression algorithm is 2. For linear-regression (linearregres), a minimum value of 2 will select any linear-regression fold greater than 2 (result folds are reported in log 2 scale, however), up- or down-regulated, whereas ASPIRE's scores ranging from -1 to 1. See algorithm descriptions for more details (section 3.2).

2)     Enter minimum alternative exon score - This value will vary based on the alternative exon analysis method chosen (see above options).

3)     Enter maximum constitutive corrected probe set p-value - This is the p-value cutoff applied to MiDAS and splicing-index ttest p-values for exon array analyses. Currently, the user cannot set different p-value thresholds for these two statistics. More on MiDAS can be found below and in section 3.2.

4)     Select probe sets to include - This option is used to increase or decrease the stringency of the analysis. In particular, this option allows the user to restrict what type of probe sets are to be used to calculate an alternative exon score. In the case of junction arrays, this option includes the ability to merge the expression values of probe sets that measure the same differential inclusion of an exon (combined-junctions). For exon arrays, there are three options, "core", "extended" and "full". Although these are the same probe set class names used by Affymetrix to group probe sets, AltAnalyze uses a modification of these annotations. Specifically, probe sets with the core annotation include all Affymetrix core probe sets that specifically overlap with a single Ensembl gene [2] (based on genomic position) along with any probe set that overlaps with an Ensembl or UCSC exon [3]. Likewise, extended and full probe sets are those remaining probe sets that also align to a single Ensembl gene, with the Affymetrix extended or full annotation.

5)     Enter maximum gene-expression change (absolute) - This value indicates maximum constitutive gene expression fold change (non-log, up- or down-regulated) that is allowed for a gene to be reported as alternatively regulated. The default is 3-fold, up or down-regulated. This filter is used with assumption that alternative splicing is a less critical factor when a gene is highly differentially expressed.

6)     Perform permutation analysis - (junction arrays only) This analysis reports a p-value that represents the likelihood of the observed alternative exon score occurring by change, after randomizing the expression values of all samples.

7)     Enter maximum reciprocal exon-junction permutation p-value - (junction arrays only) This p-value cutoff applies to the permutation based alternative exon score p-values when performing ASPIRE or linearregress (see section 3.2).

8)     Export constitutive adjusted probeset expression values - This option can be used to compare alternative exon scores prior to filtering for biological multiple comparisons, outside of AltAnalyze. For example, if comparing multiple tissues, the user may wish to export all splicing-index scores for all tissue comparisons. The results will be stored to the AltResults folder in the user-defined output-directory.

9)     Calculate MiDAS p-values - This statistic is analogous to the ttest p-value calculated during the splicing-index analysis (see section 3.2 for more details). If not selected, then only the splicing-index fold and p-value will be used to filter alternative exon results.

10) splicing-index p-values - This is the p-value cutoff applied to splicing-index ttest p-values for exon array analyses. More on this statistic can be found below and in section 3.2.

11) Filter results for predicted AS - This option instructs AltAnalyze to only include regulated probe sets in the output that have been assigned a valid splicing annotation (e.g., alternative-cassette exon) provided by AltAnalyze. These annotations exclude probe sets with no annotations or those with only an alternative N-terminal exon or alternative promoter annotation.

12) Align probesets to protein domains using - This option is used to restrict the annotation source for domain/feature over-representation analysis. If "direct-alignment" is chosen, only those probe sets that overlap with the genomic coordinates of a protein domains/features will be included in the over-representation analysis, otherwise, the inferred method is used (see section 3.2 for more details).

13) Number of algorithms required for miRNA binding site reporting - This option is used to filter out miR-BS predictions that only occur in one of the four miR-BS databases examined. For more miR database information see section 6.5.

 

2.4 Overview of Analysis Results

AltAnalyze will output two main types of files:

1)     Gene expression (GE) summary

2)     Alternative exon summary

 

Gene Expression Summary Data

               The GE summary files are two files that contain all computed constitutive gene expression values from your dataset, saved to the folder "ExpressionOutput" in the user-defined output directory. The first is a file is a complete dataset summary file with the prefix "DATASET" followed by the user-defined dataset name containing all array expression values (gene-level for exon arrays), calculated group statistics (mean expression, folds, t-test p-values) and gene annotations (e.g., gene symbol, description, Gene Ontology, pathway and some custom groups). For exon arrays, the gene expression values are derived from probe sets that align to regions of a gene that are common to all transcripts and thus are informative for transcription (unless all probe sets are selected - see "Select expression analysis parameters", in Section 2.3). These probe sets are referred to as constitutive probe sets. When one or more constitutive probe sets have DABG p-values (for any array in the dataset) below the user defined threshold, these probe sets will be used to calculate gene expression, otherwise, all constitutive probe sets will be used.

               The second file, called the GenMAPP input file, contains a subset of columns from the dataset summary file for import into GenMAPP [4] or PathVisio [5] (http://www.pathvisio.org/PathVisio2). This file has the prefix "GenMAPP" and excludes all gene annotations and individual array expression values.

 

Alternative Exon Summary Data

               These results are produced from all probe sets that may suggest alternative splicing, alterative promoter regulation, or any other variation relative to the constitutive gene expression for that gene (derived from comparisons file). Each set of results corresponds to a single pair-wise comparison (e.g., cancer vs. normal) and will be named with the group names you assigned. Four sets of results files are produced in the end:

1)     Probe set-level - Probe set-level statistics, exon annotations, AS/APS annotations, and functional predictions (protein, domain and miRNA binding site).

2)     Gene-level - Gene-level summary of data in probe set-level file.

3)     Domain-level - Over-representation analysis of gene-level domain changes due alternative exon regulation.

4)     miRNA binding sites - Over-representation analysis of gene-level, predicted miRNA binding sites present in alternatively regulation exons.

5)     Summary statistics file - Global statistics, reporting the number of genes alternatively regulated, number differentially expressed and summary protein association information (e.g, mean regulated protein length).

 

               Each file is a tab delimited text file that can be opened, sorted and filtered in a spreadsheet program. These files are saved to the user-defined output directory under "AltResults/AlternativeOutput", all with the same prefix (pair-wise group comparisons). AltAnalyze will analyze all pair-wise comparisons in succession and combine the probe set-level and gene-level results into two additional separate files (named based on the splicing algorithm chosen).

 

Probe set- and Gene-Level Alternative Exon Result Files

The probe set-level file contains alternative exon data for either one probe set (exon-array) or reciprocal probe sets (junction array). This includes:

  • Gene and probe set annotations (e.g., description, symbol, probe set ID, probe set exon ID, transcript clusters, links to Ensembl/UCSC exons, ordered exon-region IDs).
  • Mean probe-set expression values for the regulated probe set(s).
  • Constitutive gene expression changes and baseline expression.
  • Statistical results (e.g., splicing-index score and p-value, MiDAS p-values, raw probe set p-value).
  • Alternative exon annotations (e.g., splicing-events, alternative promoters, alternative annotation confidence score).
  • Protein- and microRNA-level associations (e.g., associated IDs, sequence, pattern of regulation, regulated domains/microRNA binding sites).

The gene-level file contains a summary of the data at the gene level, with each row representing a unique gene. This file also includes:

·        Gene Ontology and pathway information for each gene extracted from any Affymetrix CSV annotation files for that species present in the directory "AltDatabase/Affymetrix/*species*".

 

Protein Feature and MicroRNA Binding Site Over Representation Files

Over-representation analyses, (files 3 and 4) have the same structure:

  • Column A is the name of the miR-BS or protein domain (e.g., term).
  • Column B is the number of unique genes associated with alternatively regulated probe sets for that term (aka Changed).
  • Column C is the number of genes analyzed for over-representation that correspond to that term (aka Measured).
  • Column D is the percentage Changed (Changed/Measured).
  • Column E is the over-representation z-score (see Section 3 - Algorithms) for all unique genes aligning to the feature that are alternative regulated by the analysis that correspond to that term. A value of 1.96 is approximate to a p-value of 0.05 assuming a normal distribution.
  • Column F is the permutation based p-value to assess the likelihood of the observed z-score occurring by chance over 2000 random permutations of measured probe sets.
  • Column G is the Benjamini-Hochberg adjusted p-value of F, to take into account multiple hypothesis correction.
  • Column H contains all gene symbols for all unique genes changed.

Section 3 - Algorithms

               Multiple algorithms are available in AltAnalyze to identify individual probe sets (for exon arrays (EA)) or reciprocal probe sets (exon-exon junction array (JA)) that are differentially regulated relative to constitutive gene expression changes. These include the splicing index method (EA), MiDAS (EA), ASPIRE (JA) and Linear Regression (JA). In addition to these statistical methods, several novel methods are used to predict which alternative proteins correspond to a regulated exon, which protein domain/features differ between these and which RNA regulatory sequences differ between the associated transcripts (e.g., miR-BS).

 

3.1 Default Methods

               The default options are stored in external text files in the folder "Config" as "defaults-expr.txt", "defaults-alt_exon.txt", and "defaults-funct.txt".

defaults-expr.txt

Default expression analysis options (Figure 2.4)

defaults-alt_exon.txt

Default alternative exon analysis options (Figure 2.5)

defaults-funct.txt

Default functional analysis options (Figure 2.5)

 

               These options correspond to those found in the configuration file "options.txt". The user is welcome to modify the defaults and theoretically even the options in the "options.txt" file, however, care is required to ensure that these options are supported the by the program. Since AltAnalyze is an open-source program, it is feasible for the user to add new species and array support or to do so with AltAnalyze support.

               The default algorithms for the currently supported arrays are as follows:

Exon

splicing-index (score > 2 and t-test p<0.05), no MiDAS

AltMouse

ASPIRE (score > 0.2 and permute p<0.05)

3' array

NA

 

3.2 Algorithm Descriptions

Splicing Index Method

               This algorithm is described in detail in the following publications:

[6,7]. In brief, the expression value of each probe sets for a condition is converted to log space (if necessary). For each probe sets examined, its expression (log2) is subtracted from mean expression of all constitutive aligning probe sets to create a constitutive corrected log expression ratio (subtract instead of divide when these values are in log space). This ratio is calculated for each microarray sample, using only data from that sample.To derive the splicing-index value, the group mean ratio of the control is subtracted from the experimental.This value is the change in exon-inclusion (delta I or dI). A t-test p-value is calculated (two tailed, assuming unequal variance) by comparing these ratios for all samples between the two experimental groups. A negative dI score of -1, thus indicates a two fold change in the expression of probe set, relative to the mean constitutive expression, with expression being higher in the experimental versus the control.

 

MiDAS

               The MiDAS statistic is described in detail in the white paper:

www.affymetrix.com/support/technical/whitepapers/exon_alt_transcript_analysis_whitepaper.pdf. This analysis method is available from the computer program APT, mentioned previously. APT uses a series of text files to examine the expression values of each probe set compared to the expression of user supplied constitutive probe sets.Since AltAnalyze uses only probe sets found to align to a single Ensembl gene, AltAnalyze creates it's own unique numerical gene identifiers (different than the Affymetrix transcript clusters). When written, a conversion file is also written that allows AltAnalyze to translate from this arbitrary numerical ID back to an Ensembl gene ID. These relationships are stored in the following files along with the probe set expression values:

meta-Hs_Exon_cancer_vs_normal.txt

Relates probe set to gene

gene-Hs_Exon_cancer_vs_normal.txt

Gene expression values (non-log)

exon-Hs_Exon_cancer_vs_normal.txt

Probe set expression values (non-log)

commands-Hs_Exon_cancer_vs_normal.txt

Contains user commands for APT

Celfiles-Hs_Exon_cancer_vs_normal.txt

Relates sample to group

probe set-conversion -Hs_Exon_cancer_vs_normal.txt

Relates arbitrary gene IDs back to Ensembl

 

               When the user selects the option "Calculate MiDAS p-values", AltAnalyze first exports expression data for selected probe sets (e.g., AltAnalyze "core" annotated - Figure 2.5) to these files for all pair-wise comparisons. Once exported, AltAnalyze will communicate with the APT binary files packaged with AltAnalyze to run the analysis remotely. These statistics will be clearly labeled in the results file for each probe set and used for filtering based on the user-defined p-value thresholds (Figure 2.5).

 

Note: Different versions of the APT MiDAS binary have been distributed. AltAnalyze is distributed with two versions that report slightly different p-values. The older version (1.4.0) tends to report larger p-values than the most recent distributed (1.10.1). Previous versions of AltAnalyze used version 1.4.0, while AltAnalyze version 1.1 uses MiDAS 1.10.1 that produces p-values that are typically equivalent to those as the AltAnalyze calculated splicing-index p-values (larger).

 

ASPIRE

               For exon-exon junction microarray data (e.g., AltMouseA), the algorithm analysis of splicing by isoform reciprocity or ASPIRE was adapted from the original report [8] for inclusion into AltAnalyze.This algorithm uses the expression of probe sets aligning to two competitive exon-exon junctions, or one exon-exon junction and an exon aligning probe set along with constitutive expression values calculated as described with the splicing-index method.These probe set relationships were derived using the Affymetrix exon or exon-exon junction names (e.g., E1-E3 and E2-E3 or E1-E3 and E2), obtained by the Affymetrix AltMerge transcript assembly program [9]. For exon-exon junctions and exons aligning to the same gene, reciprocal probe set pairs were extracted using the ExonAnnotate_module.py program in AltAnalyze using the identifyPutativeSpliceEvents function. Such splicing events are classified as mutually-exclusive (mx-mx) or exon-inclusion/exon-exclusion (ei-ex). Mutually-exclusive splicing events represent an exchange of one exon for another (e.g., when E2-E4 and E1-E3 are compared, the E2 and E3 are considered regulated exons). For ei-ex, the proximal exon in the compared junctions is considered the regulated exon (e.g., when E1-E2 and E1-E3 are compared, E2 is the regulated exon).

               Similar to the splicing-index method, for each reciprocal probe set, a ratio is calculated for expression of the probe set (non-log) divided by the mean of all constitutive aligning probe sets (non-log), for the baseline and experimental groups. The ASPIRE dI was then calculated for the inclusion (ratio1) and exclusion (ratio2) probe sets, as such:

Rin = baseline_ratio1/experimental_ratio1

ex = baseline_ratio2/experimental_ratio2

I1=baseline_ratio1/(baseline_ratio1+baseline_ratio2)

I2=experimental_ratio1/(experimental_ratio1+experimental_ratio2)

 

in1= ((Rex-1.0)*Rin)/(Rex-Rin)

in2= (Rex-1.0)/(Rex-Rin)

dI = ((in2-in1)+(I2-I1))/2.0

 

               If (Rin>1 and Rex<1) or (Rin<1 and Rex>1) and the absolute dI score is greater than the user supplied threshold (default is 0.2), then the dI is retained for the next step in the analysis. If designated by the user, this next step will be a permutation analysis of the raw input data to determine the likelihood of each ASPIRE score occurring by chance alone. This permutation p-value is calculated by first storing all possible combinations of the two group comparisons. For example, if there are 4 samples (A-D) corresponding to the control group and 5 (E-H) samples in the experimental group, then all possible combinations of 4 and 5 samples would be stored (e.g, [B, C, G, H] and [A, D, E, F]). For each permutation set, ASPIRE scores were re-calculated and stored for all of these combinations. The permutation p-value is the number of times that the absolute value of a permutation ASPIRE score is greater than or equal to the absolute value of the original ASPIRE score (value = x) divided by the number of possible permutations that produced a valid ASPIRE score ((Rin>1 and Rex<1) or (Rin<1 and Rex>1)). If this p-value is less than user defined threshold, or x<2 (since some datasets have a small number of samples and thus little power for this analysis), the reciprocal probe sets are reported.

 

Linear Regression

               When working with the same type of reciprocal probe set data as ASPIRE, a linear regression based approach can be used with similar results. This method is based on previously described approach [10]. This algorithm uses the same input ASPIRE (junction comparisons, constitutive adjusted expression ratios). To derive the slope for each of the two biological conditions (control and experimental), the constitutive corrected expression of all samples for both reciprocal junctions is plotted against each other to calculate a slope for each of the two biological groups using the least squared method. In each case, the slope is forced through the origin of the graph (model = y ~ x - 1 as opposed to y ~ x). The final linear regression score is the log2 ratio of the slope of the experimental group divided by the baseline group. This ratio is analogous to a fold change, where 1 is equivalent to a 2-fold change. When establishing cut-offs, select 2 to designate a minimum 2-fold change. The same permutation analysis used for ASPIRE is also available for this algorithm.

Note: For previous published analyses ([10] and Salomonis et al. in preparation), linear regression was implemented using the algorithm rlm, which is apart of the R mass package from bioconductor. The Python R interpreter rpy, was used to run these analyses (which requires installation of R). To use this option, select "linearregres-rlm" under "select the alternative exon algorithm" (Figure 2.5). If a related warning appears while running, you may need to load AltAnalyze.py and associated source code directly (requires installation of R version specific rpy - see Linux and Source Code AltAnalyze instructions).

 

Domain/miR-BS Over-Representation Analysis

               A z-score is calculated to assess over-representation of specific protein features (e.g., domains) and miR-BS's found to overlap with probe sets that are alternatively regulated according to the AltAnalyze user analysis. This z-score is calculated by subtracting the expected number of genes in with a protein feature or miR-BS meeting the criterion (alternatively regulated with the user supplied thresholds) from the observed number of genes and dividing by the standard deviation of the observed number of genes. This z-score is a normal approximation to the hypergeometric distribution. This equation is expressed as:

n = All genes associated with a given element

r = Alternatively regulated genes associated with a given element

N = All genes examined

R = All alternatively regulated genes

 

               Once z-scores have been calculated for all protein features and miR-BS linked to alternatively regulated probe sets, a permutation analysis is performed to determine the likelihood of observing these z-scores by chance. This is done by randomly selecting the same number of regulated probe sets from all probe sets examined and recalculating z-scores for all terms 2000 times. The likelihood of a z-score occurring by chance is calculated as the number of times a permutation z-score is greater than or equal to the original z-score divided by 2000. A Benjamini-Hochberg correction is used to transform this p-value to adjusted for multiple hypothesis testing.

 

3.3 Alternative Splicing Prediction

To predict whether or not a single probe set (EA) or reciprocal probe set-pair (JA) associates with an alternative splicing or alternative promoter sequence, AltAnalyze uses two strategies; 1) Identify alternative exons/introns based on de novo isoform comparison and 2) Incorporating splicing predictions from UCSC's "known_alt.txt" file.

De Novo Splicing Prediction and Exon Annotation

1'>            In order to identify exons with alternative splicing or promoters, AltAnalyze compares all available gene transcripts from UCSC and Ensembl to look for shared and different exons. To achieve this, all mRNA transcripts from UCSC's species-specific "mRNA.txt" file that have genomic coordinates aligning to a single Ensembl gene and all Ensembl transcripts from each Ensembl build are extracted.Only UCSC transcripts that have a distinct exon composition from Ensembl transcripts are used in this analysis, excluding those that have a distinct genomic start or stop position for the first and last exon respectively (typically differing 5' and 3' UTR agreement), but identical exon-structure.

               When assessing alternative splicing, cases of intron retention are identified first. These regions consist of a single exon that spans an two adjacent exons at least one another transcript for that same gene. These retained introns are stored for later analysis, but eliminated as annotated exons. Remaining exons are clustered based on whether their genomic positions overlap (e.g., alternate 5' or 3' start sites). Each exon cluster is considered an exon block with one or more regions, where each block and region is assigned a numerical ID based on genomic rank (e.g., E1.1, E1.2, E2.1, E3.1). For each exon in a transcript, the exon is annotated as corresponding to an exon block and region number (Figure 3.1).All possible pair-wise transcript comparisons for each gene are then performed to identify exon pairs that show evidence of alternative exon-cassettes, alternative 3' or 5' splice sites or alternative-N or -C terminal exons (Figure 3.1). All transcript exon pairs are considered except for those adjacent to a retained intron. This analysis is performed by comparing the exon block ID and region IDs of an exon and it's neighboring exons to the exon blocks and regions in the compared transcript. Ultimately, a custom heuristic assigns the appropriate annotation based on these transcript comparisons.

 

Figure 3.1. Comparison of mRNA Exon Composition. To determine alternative splicing and alternative promoter regulation, all analyzed transcripts (Ensembl and UCSC) were compared based on exon genomic positions and subsequently annotated by a custom heuristic. Exon block and regions definitions are shown. The different types of alternative exon events are illustrated by different colored exons among five theoretical transcripts for the same gene.

 

Incorporating UCSC Splicing Predictions

In addition to all de novo splicing annotations, additional splicing annotations are imported from the UCSC genome database and linked to existing exon blocks and regions based on genomic coordinate overlap.This comparison is performed by the alignToKnownAlt.py module of AltAnalyze (called from EnsemlbImport.py). De novo and UCSC splicing annotations are stored along with probe set Ensembl gene alignment data in the file <species>_Ensembl_probesets.txt. These annotations are used by AltAnalyze and DomainGraph.

 

Filtering for Alternative Splicing

 

As mentioned in previous sections, AltAnalyze includes the option restrict alternative exon results to only those probe sets or reciprocal probe set-pairs predicted to indicate alternative splicing. AltAnalyze considers alternative splicing as any alternative exon annotation other than an alternative N-terminal exon or alternative promoter annotation derived from de novo or UCSC genome database annotations.

 

3.4 Protein/RNA Inference Analysis

Identifying Alternative Proteins Protein Domains

               Probe set sequences present on exon or junction arrays are used to identify which proteins align to or are missing from transcripts for that gene. To do this, all Ensembl and UCSC mRNA transcripts are extracted for a gene that corresponds to a given probe set. For each transcript, all exon genomic coordinates are stored. For exon arrays, transcripts with exons that contain the probe set genomic coordinates are considered transcript matches, while all others are considered non-matches. For junction arrays, probe sets with sequence that directly aligns to a transcript is considered a match. If two reciprocal junction probe sets match to distinct isoforms, these relationships are stored rather than the matching and non-matching isoforms, however, if both isoforms do not match to distinct transcripts, then the one with a matching and non-matching set of transcripts is stored for further exon comparisons.

               If a set of matching and non-matching isoforms are identified for a probe set or reciprocal junction pair, all possible match and non-match pair-wise combinations are identified to find those pair-wise comparisons with the smallest difference in exon composition. This is accomplished by determining the number of different and common exons each transcript pair contains (based on genomic start and stop of the exon). When comparing the different transcript pairs, the most optimal pair is selected by first considering the combined number of distinct exons in both transcripts and second the number of common exons. Thus, if one transcript pair has 4 exons in common and 2 exons not in common, while a second pair has 5 exons in common and 3 exon not in common, the prior will be selected as the optimal since it contains less overall differences in exon composition (even though it has less common exons than the other pair). A theoretical example is illustrated in Figure 3.2.

Figure 3.2. Comparison of probe set matching and non-matching mRNAs. A theoretical gene is shown with 8 distinct exons and 5 distinct mRNA transcripts with different exon combinations. Four exons are alternatively expressed in different transcripts (blue exons) while five are common or constitutive to all transcripts (black exons). All possible pair-wise transcript combinations are shown (arrows) between mRNAs that contain and probe set aligning exon and those that do not. Ultimately, a single pair is selected that has the most common exons and the least uncommon exons. The same strategy is used for exon and junction arrays, however, for junction arrays, matching transcripts are identified by direct sequence alignment to the mRNAs prior to comparing exon composition as described above.

 

               Once a single optimal isoform pair has been identified, protein sequence is obtained for each by identifying protein IDs that correspond to the mRNA (Ensembl or NCBI) and if not available, a predicted protein sequence is derived based on in silico translation. Although such a protein sequence may not be valid, given that translation of the protein may not occur, these sequences provide AltAnalyze the basis for identifying conservative changes predictions for a change in protein size, sequence and domain composition. Domain/protein features are obtained directly from UniProt's sequence annotation features or from Ensembl's InterPro sequence annotations (alignment e-value <1) (see section 5 data extraction protocols). To compare domain sequence composition differences, the protein sequence that corresponds to the amino-acid start and stop positions of a domain for each transcript is searched for in each of the compared isoforms. If a domain is present in one but not another isoform, that domain is stored as differentially present. To identify differences in protein sequence (e.g., alternative-N-terminus, C-terminus, truncation coding sequence and protein length), the two protein sequences are directly compared for shared sequence in the first and last five residues and comparison of the entire sequences. If the N-terminal sequence is common to both isoforms but there is a reduction in more than 50% of the sequence length, the comparison is annotated as truncated. All of these annotations are stored for each probe set or junction pair for import into AltAnalyze, for each new database build.

               The above strategy allows AltAnalyze to identify predicted protein domains that are found in one isoform but not the other (aligning to a probe set and not aligning). In addition to these "inferred" domain predictions, that include protein domains that do not necessarily overlap with the regulated probe set, AltAnalyze includes a distinct set of annotations that only corresponds to probe set aligning domain predictions. In the AltAnalyze alternative exon results file (suffix "exon-inclusion-results.txt"), the column named "ens_overlapping_domains" reports Ensembl InterPro IDs that directly overlap with the regulated probe set (domain name proceeded by "direct"). Alignment of probe sets to InterPro IDs is achieved by comparing probe set genomic coordinates to InterPro genomic coordinates, where the probe set also directly aligns to the domain containing Ensembl protein. In addition to these direct InterPro aligning probe sets, probe sets that overlap with an InterPro's genomic coordinates, but that do not directly align to that protein (that the InterPro aligns to) are also reported as "indirect" aligning. These probe sets typically occur in the gene introns. While these associations are typically meaningful, false "indirect" associations are possible. To reduce the occurrences of these false positives, any probe set that aligns to the UTR of an Ensembl gene or that occurs in the first or last exon of an mRNA transcript are excluded from the "indirect" analysis. These heuristics were chosen after looking at specific examples that the authors considered to be potential false positives.

 

Identifying microRNA Binding Sites associated with Alternative Exons

               MicroRNA binding site (miR-BS) sequence is obtained and compared from four different microRNA databases (see section 6.5), and compared to identify miR-BSs in common to or distinct to different databases. Probe set sequences or exon sequences (aligning to two reciprocal junctions) are obtained from Affymetrix (see section 6.7). Each miR-BS sequence is searched for within probe set or exon sequence to identify a match. These relationships are stored with each new database build and are used by AltAnalyze and DomainGraph.

Section 4 - Using External Programs with AltAnalyze


While AltAnalyze is a largely a stand-alone program, some statistical analyses can be included that depend on external applications. These require prior installation of these tools using operating system installers and properly interfacing them with AltAnalyze.

 

4.1 Configuring R

               Although installation of R is not required for any of the AltAnalyze analyses, for users who wish to use more advanced statistics, it will be necessary. Currently, the only statistic that requires installation of R is the linear regression method rlm. rlm is a regression statistical method apart of the R package mss. This method is preferred by some over the alternative linear regression method provided by default in AltAnalyze. Both methods produce very similar statistics, with only a few probe sets differing between threshold parameters out of hundreds of results. However, since the rlm method was used in the published linear regression analyses, other users wishing to replicate these results, may wish to use this algorithm.

               To run the option "linearegress-rlm" from the "Alternative Exon Analysis Parameters" window, you will need to install a compatible version R (only version 2.1 has been extensively tested). Along with R, the user will need to install the R statistical package mss. R is interpreted by Python using the Python program Rpy, which is packaged with the compiled versions of AltAnalyze, but not the source code (http://rpy.sourceforge.net/download.html). With some compiled installations of AltAnalyze, you will receive a warning when running "linearegress-rlm". This error occurs due to an Rpy compiling error that is specific to the OS executable version of AltAnalyze, but should work fine when Rpy is installed for the associated version of R by the user (each version of Rpy corresponds to several versions of R).

               Whether dealing with a compiled or source version of AltAnalyze, if Python reports that it cannot find the current version of R, the user may need to update the computers Environment Variables setting Path (Windows only). This is accessed through by opening Control Panels>System Properties>Advanced>Environment Variables and selecting the Variable "Path" and entering the path location of R (for example, ";C:\Program Files\R\rw2010;" - No spaces before or after ;). Contact AltAnalyze support if problems persist.


Section 5 - Software Infrastructure

5.1 Overview

               AltAnalyze consists of more than 30 modules and over 10,000 lines of code. The core modules for AltAnalyze consist of the programs ExpressionBuilder and AltAnalyze, which can be used in tandem or separately through the AltAnalyze GUI. The user will never actually encounter these module names when running AltAnalyze, but these distinct core modules are used for different analysis functions (Figure 5.1).


Figure 5.1. AltAnalyze Analysis Pipeline. The AltAnalyze workflow is depicted. The transparent green box highlights functions performed by the ExpressionBuilder module of AltAnalyze whereas the transparent red box highlights the AltAnalyze module.(A) User microarray data (probe set expression values and DABG p-values) are imported into AltAnalyze via the ExpressionBuilder module, which separates data for different biological array groups into user designated pair-wise comparisons (e.g., cancer vs. normal). For each pair-wise comparison, probe set expression values and DABG p-values are exported to separate files, and then analyzed by the module FilterDABG to exclude probe sets with poor detection parameters. The resulting files are inputs for alternative exon analysis. In parallel, a gene expression summary file is produced with Ensembl gene level expression values (based on constitutive probe set expression) for each gene and array along with summary statistics (average expression, fold, and t-test p-value for all pair-wise comparisons) and annotations. (B) Using the ExpressionBuilder pair-wise comparison files, AltAnalyze re-calculates constitutive expression values, evaluates changes in probe set expression relative to constitutive (statistics module), and links probe sets with "significant" changes to aligning alternative protein sequence and predicted changes in protein and miR-BS architecture (ExonAnalyze and FeatureAlignment modules). The result is a series of probe set and gene summary files along with over-representation statistics for the regulation of protein and miR-BS features. Optionally, probe set and constitutive expression values can be exported to the bundled application Affymetrix Power Tools to calculate additional alternative exon statistics to be included in the AltAnalyze analysis.

 

               The ExpressionBuilder component builds constitutive gene expression summary files as well as filters the probe set expression data prior to alternative-exon analysis. The AltAnalyze module performs all of the alternative-exon analysis and MiDAS p-value calculations.

5.2 ExpressionBuilder Module

The ExpressionBuilder program is principally designed to perform the following tasks:

1)     Import user expression data from tab-delimited files.

2)     (Splicing Arrays Only) Exclude probe sets where no samples have a DABG p<user-threshold (only applicable when DABG p-values exist).

3)     Organize your data according to biological groups and comparisons (specified by the user from custom text files).

4)     (Exon Arrays Only) Calculate gene transcription levels for all Ensembl genes.

5)     Export calculated gene expression values along with folds, t-test p-values and gene annotations for all genes and all user indicated comparisons.

6)     (Splicing Arrays Only) Export all gene linked probe set expression and DABG data for all pair-wise comparisons (splicing array analysis is restricted to two conditions), for further filtering (next step).

7)     (Splicing Arrays Only) Filter the resulting probe set data using expression probabilities specific for the two pair-wise comparisons and user-defined thresholds (Figure 2.4).

               The above tasks are performed in order by the ExpressionBuilder module. Detection probabilities are assessed at two steps (2 and 7). In step 2, import of DABG p-values are for the purpose of calculating a transcription intensity value only for those constitutive probe sets (present in all or most transcripts) that show detection above background, since some probe sets will have weaker expression/hybridization profiles as others. If no probe sets have a DABG p-value less than the default or user supplied threshold (for at least one sample in your dataset), all selected probe sets will be used to calculate expression (constitutive aligning only if default is selected).

               In step 7, the probe set DABG p-values are examined to include or exclude probe sets for alternative splicing analysis. This step is important in minimizing false positive splicing calls. False positive splicing calls can occur when a probe set is expressed below detectable limits and results in a transcription-corrected expression value that artificially appears to be alternatively regulated. ExpressionBuilder output an expression file and DABG p-value file containing all probe set expression values that are linked to unique gene identifiers (e.g., aligns to a Ensmbl gene ID) for all user-defined pair-wise comparisons. The FilterDABG module uses these two files to identify constitutive probe sets (considered common to all transcripts) with a mean DABG p<user-defined threshold and mean expression > user-defined threshold for both biological groups. If a probe set is non-constitutive, then for at least one of the two biological groups, the same criterion must be true (mean DABG p-value and expression). These filters ensure that: 1) the gene is "expressed" in both conditions and 2) the probe set is "expressed" in at least one condition. The probe sets and expression values passing these user defined filters are exported to a new file that is ready to use for splicing analyses. This file can also be directly selected in future AltAnalyze runs as input for analysis ("AltAnalyze filtered file" - Figure 2.2).

               Runtime of ExpressionBuilder is dependent on the number of conditions and array type being analyzed (10 minutes plus for Affymetrix Exon 1.0 ST arrays). If multiple comparisons are present in a single expression file, input files for AltAnalyze will all be generated at once and thus runtimes will take longer.

Note 1: While this pipeline is mainly for use with exon or junction arrays, it is also compatible with a standard 3' Affymetrix microarray dataset to calculate folds, t-test p-values and assign annotations to this data.This is useful when you have many comparisons in your dataset and you don't wish to manually calculate these values.

 

Note 2: You do not need to run ExpressionBuilder if you have an alternative way of building AltAnalyze input files. To do so, your file headers for each array must have the name "group:sample_name", where your group names are different for each group and the denominator group is listed first and the numerator is listed second. Below the header line should only be probe set IDs and log2 expression values.

 

5.3 AltAnalyze Module

The AltAnalyze module is the primary software used for all alternative exon analyses. This software imports the filtered expression data and performs all downstream statistical and functional analyses.This program will analyze any number of input comparison files that are in the "AltExpression" results directory for that array type. The main analysis steps in this program are:

1)     Import exon or junction annotations, to determine which probe sets to analyze, which are predicted constitutive probe sets and which correspond to known AS or APS events.

2)     Import the user expression data for the pair-wise comparison.

3)     Store probe set level data for all probe sets corresponding to either a constitutive exon or for splicing event, while storing the group membership for each value.

4)     Calculate a constitutive expression value for each gene and each sample (used for the splicing score later on). OPTIONAL: If the user selected a cut-off for constitutive fold changes allowed to look at alternative exon regulation, then remove genes from the analysis that have a gene-expression difference between the two groups > cut-off (up or down).

5)     (Exon array only) OPTIONAL: exports input for the Affymetrix Power Tools (APT) program to calculate a MiDAS p-value for each probe set. If using this option.

6)     Calculate a splicing score and t-test p-value from the probe set and constitutive expression values. This calculation requires that splicing ratios are calculate for each sample (exon/constitutive expression) and then compared between groups. For exon arrays, the splicing index method (SI) is calculated for each probe set. For junction arrays, ASPIRE, Linear Regression are used with the pre-determined reciprocal junctions or alternatively are calculated for individual probe sets using the SI method.

7)     (Junction array only) OPTIONAL: performs a permutation analysis of the sample ASPIRE input values or Linear Regression values to calculate a likelihood p-value for all possible sample combinations.

8)     Retain only probe sets meeting the scoring thresholds for these statistics (splicing score, splicing t-test p, permutation p, MiDAS p - see Section 3.2).

9)     Import probe set-protein, probe set-domain and probe set-miRNA associations pre-built local from flat files (see Section 6).

10) For the remaining probe sets, import all protein domain and miR-BS to all pre-built probe set associations (see Section 3.3 for details). Import all probe set-domain and -miR-BS associations for all genes to calculate an over-representation z-score for all domains and miR-BS's along with a non-adjusted and adjusted p-value. Export the resulting statistics and annotations to tab-delimited files in the "AltResutls/AlternativeOutput" folder in the user-defined output directory.

11) (Junction array only) Import splicing and exon annotations for regulated exons corresponding to each set of reciprocal probe sets (e.g. for E1-E3 compared to E1-E2, E2 is the regulated exon). These annotation files are the same as those for exon arrays, except that the probe set is replaced by the exon predicted to be regulated by the reciprocal probe set-pair (see Section 6.2).

12) For protein domain and miR-BS annotations, reformat the direction/inclusion status of the annotation.For example, if a kinase domain is only found in a protein that aligns to a robe set, but was down-regulated, then the annotation is listed as (-) kinase domain, but if up-regulated is listed as (+) kinase domain.

13) Export the results from this analysis to the "AltResutls/AlternativeOutput" folder in the user-defined output directory.

14) Summarize the probe set or reciprocal junction data at the level of genes and export these results (along with Gene Ontology/Pathway annotations).

15) Export overall statistics from this run (e.g. number of genes regulated, splicing events).

16) (Junction array only) Combine and export the exported probe set and gene files for each comparison analyzed, to compare and contrast differences.

 

Result File Types

When finished AltAnalyze will have generated five files.

1)     name-scoringmethod-exon-inclusion-GENE-results.txt

2)     name-scoringmethod-exon-inclusion-results.txt

3)     name-scoringmethod-ft-domain-zscores.txt

4)     name-scoringmethod-miRNA-zscores.txt

5)     name-scoringmethod-DomainGraph.txt

 

               Here, "name" indicates the comparison file name from ExpressionBuilder, composed of the species + array_type + comparison_name (e.g. Hs_Exon_H9-CS-d40_vs_H9-ESC-d0), "scoringmethod" is the type of algorithm used (e.g. SI) and the suffix indicates the type of file.

               The annotation files used by AltAnalyze are pre-built using other modules with this application (see Section 6). Although the user should not need to re-build these files on their own, advanced users may wish to modify these tables manually or with programs provided (see Section 6.7).

               For protein-level functional annotations (e.g., domain changes), this software assumes that if an exon is up-regulated in a certain condition, that the protein domain is also up-regulated and indicates it as such. For example, for exon array data, if a probe set is up-regulated (relative to gene constitutive expression) in an experimental group and this domain is found in the protein aligning to this probe set, in the results file this will be annotated as (+) domain. If the probe set were down-regulated (and aligns as indicated), this would be annotated as (-) domain.

 


Section 6 - Building AltAnalyze Annotation Files

6.1 Splicing Annotations and Protein Associations

A number of annotation files are built prior to running AltAnalyze that are necessary for:

1)     Organizing exons and introns from discrete transcripts into consistently ordered sequence blocks (UCSCImport.py and EnsemblImport.py).

2)     Identifying which exons and introns align to alternative annotations (alignToKnownAlt.py and EnsemblImport.py).

3)     Identifying probe sets with likely constitutive annotations (ExonArrayAffyRules.py).

4)     Identifying which probe sets align to which exons and introns (ExonArrayEnsemblRules.py).

5)     Extracting out protein sequences with functional annotations (ExtractUniProtFunctAnnot.py and EnsemblSQL.py).

6)     Identifying probe sets that overlap with microRNA binding sites (MatchMiRTargetPredictions.py and ExonSeqModule.py).

7)     Matching probe set genomic coordinates to cDNA exon coordinates and identify the optimal matching and non-matching mRNA/protein for each probe set (IdentifyAltIsoforms.py and ExonAnalyze_module.py)

 

These annotation files are necessary for all exon and junction array analyses. Junction array analyses further require:

8)     Matching reciprocal junction probe sets to annotated exons or introns (JunctionArray.py, EnsemblImport.py and JunctionArrayEnsemblRules.py), creating a file analogous to (4) above.

9)     Matching reciprocal junction probe set sequence to microRNA binding sites (JunctionSeqSearch.py), creating a file analogous to (6) above.

10) Matching probe set sequence to cDNA sequences (mRNASeqAlign.py), prior to identification of optimal matching and non-matching mRNA/proteins (IdentifyAltIsoforms.py).

 

               With the creation/update of these files, the user is ready perform alternative exon analyses for the selected species and array type. Since many of these analyses utilize genomic coordinate alignment as opposed to direct sequence comparison, it is import to ensure that all files were derived from the same genomic assembly.

 

Note: Although all necessary files are available with the AltAnalyze program at installation and some files can be updated automatically from the AltAnalyze server, users can use these programs to adjust the content of these files, use the output for alternative analyses, or create custom databases for currently unsupported species.

 

6.2 Building Ensembl-Probe Set Associations


Exon Arrays

Affymetrix Exon 1.0 ST arrays are provided with probe set sequence, transcript cluster, probe set genomic location, and mRNA count annotations.Each of these annotations are used by AltAnalyze to provide detailed sub-gene associations. Although transcript clusters represent putative genes, the AltAnalyze pipeline derives new gene associations to Ensembl genes, so that each probe set aligns to a single gene from a single gene database. This annotation schema further allows AltAnalyze to determine which probe sets align to defined exons regions (with external exon annotations), introns, and untranslated regions (UTR).

               To begin this process, Ensembl exons (each with a unique ID) and their genomic location and transcript associations are downloaded for the most recent genomic assembly using the EnsemblSQL.py module, which parses various files on the Ensembl FTP SQL database server to assemble the required fields. This file is saved to the directory "AltDatabase/ensembl/*species*/" with the filename "*species*_Ensembl_transcript-annotations.txt". Since Ensembl transcript associations are typically conservative, transcript associations are further augmented with exon-transcript structure data from the UCSC genome database, from the file "all_mrna.txt" (Downloads/*species*/Annotation database/all_mrna.txt.gz). This file encodes genomic coordinates for exons in each transcript similar to Ensembl.Transcript genomic coordinates and genomic strand data from UCSC is matched to Ensembl gene coordinates to identify genes that specifically overlap with Ensembl genes with the Python program UCSCImport.py. Unique transcripts, with distinct exon structures from Ensembl, are exported to the folder "AltDatabase/ucsc/*species* "*species*_UCSC_transcript_structure_filtered_mrna.txt", with the same structure as the Ensembl_transcript-annotations file.

               Once both transcript-structure files have been saved to the appropriate directory, ExonArrayAffyRules.py calls the program EnsemblImport.py to perform the following steps:

1)     Imports these two files, stores exon-transcript associations identify exon regions to exclude from further annotations. These are exons that signify intron retention (overlapping with two adjacent spliced exons) and thus are excluded as valid exon IDs. These regions are also flagged as intron-retention regions for later probe set annotations.

2)     Assembles exons from all transcripts for a gene into discrete exon clusters.If an exon cluster contains multiple exons with distinct boundaries, the exon cluster is divided into regions that represent putative alternative splice sites (Figure 3.1). These splice sites explicitly annotated downstream. Each exon cluster is ordered and number from the first to the last exon cluster (e.g., E1, E2, E3, E4, E5), composed of one or more regions. These exon cluster/region coordinates and annotations are stored in memory for downstream probe set alignment in the module ExonArrayAffyRules.py.

3)     Identifies alternative splicing events (cassette-exon inclusion, alternative 3' or 5' splice sites, alternative N-terminal and C-terminal exons, and combinations there in) for all Ensembl and UCSC transcripts by comparing exon cluster and region numbers for all pairs of exons in each transcript (Figure 3.2). Alternative exons/exon-regions and corresponding exon-junctions are stored in memory for later probe set annotation and exported to summary files for creation of databases for the Cytoscape exon structure viewer, SubgeneViewer (currently in development).

 

Upon completion, ExonArrayAffyRules.py:

1)     Imports Affymetrix Exon 1.0 ST probe sets genomic locations, transcript cluster, and mRNA counts from the Affymetrix probe set.csv annotation file (e.g., HuEx-1_0-st-v2.na23.hg18.probeset.csv). Although transcript clusters will be disregarded at the end of the analysis, these are used initially to group probe sets.

2)     Transcript cluster genomic locations are matched to Ensembl genes genomic locations (gene start and stop) to identify single transcript clusters that align to only one Ensembl gene for the respective genomic strand. For transcript clusters aligning to more than one Ensembl gene, coordinates for each individual probe set are matched to aligning Ensembl genes, to identify unique matches. If multiple transcript clusters align to a single Ensembl gene, only probe sets with an Affymetrix annotated annotation corresponding to that Ensembl gene, from the probeset.csv file are stored as proper relationships. This ensures that if other genes, not annotated by Ensembl exist in the same genomic interval, that they will not be inaccurately combined with a nearby Ensembl gene. If multiple associations or other inconsistencies are found, match probe set coordinates directly to the exon cluster locations derived in EnsemblImport.py.

3)     Once unique probe set-to-Ensembl gene associations have been defined, constitutive probe sets are identified using the Affymetrix mRNA counts provided in the program ExonArrayEnsemblRules.py.The mRNA counts are distributed based on the types of mRNAs they align to (full-length, Ensembl, and EST), where the probe sets with the largest number of high quality mRNA associations are chosen as constitutive. Probe sets for a given gene are ranked based on the: A) number of Ensembl transcripts, B) full-length and C) ESTs associated, in descending order, where multiple associations are required for each annotation type. If all probe sets have the same number of Ensembl and full-length transcript associations, then the number of EST aligning are compared. If no difference in these mRNA assignments exists, no constitutive probe sets are annotated.

4)     Each probe sets is then aligned to exon clusters, regions, retained introns, and splicing annotations for that gene. In addition to splicing annotations from EnsemblImport.py, splicing annotations from the UCSC genome annotation file "knownAlt.txt" (found in the same server directory at UCSC as "all_mRNA.txt") using the program alignToKnownAlt.py, are aligned to these probe sets. If a probe set does not align to an Ensemblmport.py defined exon or intron and is upstream of the first exon or downstream of the last exon, the probe set is assigned a UTR annotation (e.g., U1.1). All aligning probe sets are annotated based on the exon cluster number and the relative position of that probe set in the exon cluster, based on relative 5' genomic start (e.g., E2.1). This can mean that probe set E2.1 actually aligns to the second exon cluster in that gene in any of the exon regions (not necessarily the first exon region), if it is the most 5' aligning.

5)     These probe set annotations are exported to the directory "AltDatabase/*species*/exon" with the filename "*species* _Ensembl_probe sets.txt". Exon block and region annotations for each probe set are designated in the AltAnalyze result file.


Junction Arrays

For the exon-junction array AltMouseA, the same process is applied to the highlighted exon(s) from all pre-determined reciprocal probe sets, exported by the program ExonAnnotate_module.py.A highlighted exon is an exon that is considered to be regulated as the result of two alternative junctions.For example, if examining the exon-junctions E1-E2 and E1-E3, E2 would be the highlighted exon. Alternatively, for the mutually-exclusive splicing event E2-E4 and E1-E3, E2 and E3 would be considered to be the highlighted exons. To obtain the genomic locations of these exons, sequences for each are obtained from a static build of the mouse AltMerge program (March 2002) (ExonAnalyze_module.py) and searched for in FASTA formatted sequence obtained from BioMart for all Ensembl genes with an additional 2 kb upstream and downstream sequence (JunctionArray.py and EnsemblImport.py).This allows for the export of an exon-coordinate file analogous to the exon probeset.csv file.The main difference in this file is that AltMouseA gene to Ensembl ID associations are obtained by comparing gene symbol names and external GenBank accession numbers in common, as opposed to coordinate comparisons, and constitutive exon annotations are directly lifted from the "Mm_Ensembl_transcript-annotations.txt" file, obtained from BioMart.Unlike the exon array, these constitutive exon annotations are not used to determine which probe sets are most likely constitutive, since specific probe sets have been designed for this array to probe predicted constitutive features, each aligning to multiple exons. The resulting highlighted exon file is named "Mm_Ensembl_AltMouse_probe sets.txt" and is saved to "AltDatabase/Mm/AltMouse", with the same structure as its exon array analogue.

 

6.3 Extracting UniProt Protein Domain Annotations Overview

               The UniProt protein database is a highly curated protein database that provides annotations for whole proteins as well as protein segments (protein features or domains). These protein feature annotations correspond to specific amino acid (AA) sequences that are annotated using a common vocabulary, including a class (feature key) and detailed description field. An example is the TCF7L1 protein (http://www.uniprot.org/uniprot/Q9HCS4), which has five annotated feature regions, ranging in size from 7 to 210 AA. One of these regions has the feature key annotation "DNA binding" and the description "HMG box". To utilize these annotations in AltAnalyze, these functional tags are extracted along with full protein sequence, and external annotations for each protein (e.g., Ensembl gene) from the "uniprot_sprot_*taxonomy*.dat" file using the ExtractUniProtFunctAnnot.py program. This program produces two files ("uniprot_feature_file.txt" and "uniprot_trembl_sequence.txt") that are saved to the appropriate "AltDatabase/uniprot" species directory. FTP file locations for the UniProt database file can be found in the file "Config/Default-file.csv" for each supported species. To improve Ensembl-UniProt annotations, these relationships are also downloaded from BioMart and stored in the folder "AltDatabase/uniprot/*species*" as "*species*_Ensembl-UniProt.txt", which are gathered at ExtractUniProtFunctAnnot.py runtime to include in the UniProt sequence annotation file. These files are saved to "AltDatabase/uniprot/*species*" as "uniprot_feature_file.txt" and "uniprot_sequence.txt". Runtime is approximately 5 minutes (not including downloads).

 

6.4 Extracting Ensembl Protein Domain Annotations Overview

               In addition to protein domains/features extracted from UniProt, protein features associated with specific Ensembl transcripts are extracted from the Ensembl database. One advantage of these annotations over UniProt, is that alternative exon changes that alter the sequence of a feature but not its inclusion will be reported as a gain and loss of the same feature, as opposed to just one with UniProt. This is because protein feature annotations in UniProt only typically exist for one isoform of a gene and thus, alteration of this feature in any way will result in this feature being called regulated. Although an Ensembl annotated feature with a reported gain and loss can be considered not changed at all, functional differences can exist do to a minor feature sequence change that would not be predicted if the gain and loss of the feature were not reported.

               Three separate annotation files are built to provide feature sequences and descriptions, "Ensembl_Protein", "Protein", and "ProteinFeatures" files ("AltDatabase/ensembl/*species*"). The "ProteinFeatures" contains relative AA positions for protein features for all Ensembl protein IDs, genomic start and end locations along with InterPro annotations and IDs. Only those InterPro domains/features with an alignment e-score < 1 are stored for alignment to regulated exons. The "Protein" file contains AA sequences for each Ensembl protein. The "Ensembl_Protein " provides Ensembl gene, transcript and protein ID associations. Data for these files are downloaded and extracted the previously mentioned EnsemblSQL.py module. The feature annotation source in these files is InterPro, which provides a description similar to UniProt. As an example, see: http://ensembl.genomics.org.cn/Homo_sapiens/protview?db=core;peptide=ENSP00000282111, which has similar feature descriptions to UniProt for the same gene, TCF7L1.

               

6.5 Extracting microRNA Binding Annotations Overview

               To examine the potential gain or loss of microRNA bindings sites as the direct result of exon-inclusion or exclusion, AltAnalyze requires putative microRNA sequences from multiple prediction algorithms. These binding site annotations are extracted from the following flat files:

  • TargetScan conserved predicted targets (http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_42). Gene symbol and putative microRNA associations are extracted (no sequence). The primary gene ID, gene-symbol, is linked to Ensembl based on BioMart downloaded gene-symbol to Ensembl gene annotations (from several Ensembl builds - outdated symbols are often used).
  • Miranda human centric with multi-species alignment information was obtained from target predictions organized by Ensembl gene ID (http://cbio.mskcc.org/research/sander/data/miRNA2003/mammalian/index.html). A larger set of associations was also pulled from species-specific files (http://www.microrna.org/microrna/getDownloads.do), where gene symbol was related to Ensembl gene. Both files provided target microRNA sequence.
  • Sanger center (miRBase) sequence was provided as a custom (requested) dump of their version 5 target predictions (http://microrna.sanger.ac.uk/targets/v5/), containing Ensembl gene IDs, microRNA names, and putative target sequences, specific for either mouse or human. Currently, rat has not been obtained.
  • PicTar conserved predicted targets were provided as supplementary data (Supplementary Table 3) at http://www.nature.com/ng/journal/v37/n5/suppinfo/ng1536_S1.html, with conservation in human, chimp, mouse, rat, and dog for a set of 168 microRNAs. For mouse, human gene symbols were searched for in the BioMart derived "Mm_ Ensembl_annotation.txt" table after converting these IDs to a mouse compatible format (e.g., TCF7L1 to Tcf7l1). The same was used for aligning to PicTar. The same strategy is used for rat.

 

               Ensembl gene to microRNA name and sequence are stored for all prediction algorithm flat files and directly compared to find genes with one or more lines of microRNA binding site evidence using the program MatchMiRTargetPredictions.py. The flat file produced from this program ("combined_gene-target-sequences.txt") was used by the program ExonSeqSearch.py to search for these putative microRNA binding site sequences among all probe sets from the "*species*_Ensembl_probe set.txt" file built by ExonArrayEnsemblRules.py and probe set sequence from the Affymetrix 1.0 ST probe set fasta sequence file (Affymetrix) or the reciprocal junction highlighted exon sequence file (see section 6.2). Two resulting files, one with any binding site predictions and another required to have evidence from at least two algorithms, are saved to "AltDatabase/*species*/*array_type*/" as "*species*_probe set_microRNAs_any.txt" and "*species*_probe set_microRNAs_multiple.txt", respectively.

 

6.6 Inferring Protein-Probe Set Associations Overview

               To obtain associations between specific probe sets and proteins, the programs IdentifyAltIsoforms.py (exon and junction arrays) and ExonSeqModule.py (junction arrays only) were written. The program IdentifyAltIsoforms.py grabs all gene mRNA transcripts and associated exon genomic coordinates from Ensembl and UCSC and compares these to probe set coordinates (or critical exon for junction arrays) to find pairs of transcripts (one containing the probe set or critical exon and the other not) that have the least number of differing exons (see Section 3.4). These transcript pairs are thus most likely to be similar with exception to the region containing the probe set or critical exon. Once these best matches are identified, corresponding protein sequences for each mRNA are downloaded from Ensembl using EnsemblSQL.py or are downloaded using NCBI webservices via BioPython's Entrez function.

               If protein sequences are unavailable for an mRNA accession, the mRNA sequence for that identifier is downloaded (using the above mentioned services) and translated using the custom IdentifyAltIsoforms.py function "BuildInSilicoTranslations", which uses functions from the BioPython module to translate an mRNA based on all possible start and stop sites to identify the longest putative translation that also shares either the first or last 5 AA of its sequence with the N-terminus or C-terminus (respectfully) of a UniProt protein.

               While this same protocol is used for junction arrays, prior to this analysis reciprocal probe sets are mapped to all possible aligning and non-aligning transcripts through direct probe set sequence comparison via ExonSeqModule.py.This module takes the consensus probe set sequence for all reciprocal probe set pairs (defined in the file
"AltMouse_junction_comparisons.txt") and searches for a match among mRNA transcript FASTA formatted sequences from Ensembl and UCSC mRNAs that correspond to that gene. Only a 100% sequence match is allowed, (matches may not occur do to polymorphisms between sequence sources and genomic assemblies). These associations are then used by
IdentifyAltIsoforms.py as described above.

At this point, only two mRNAs are matched to each probe set or junctions, matching and non-matching mRNAs. Next, differences in protein feature composition between these two proteins, alternative N or C-terminal sequences, coding sequence and protein length are assessed using ExonAnalyze_module.py and exported to text files (associations and protein sequences) for import when AltAnalyze is run. These two files have the suffix "exoncomp.txt". Two analogous files with the suffix "seqcomp.txt", are derived by identifying the best matching and non-matching mRNAs based on comparison of protein sequence as compared exon composition (e.g. least differences in protein domain composition), but are currently used only for internal analyses (contact the authors for the seqcomp files to replace exoncomp).

 

6.7 Required Files for Manual Update

Many external databases are required for the above build strategies. Advanced users may wish to update databases using the update.py module in the folder "Source_code" (currently requires moving all source files to the main directory and running Python through a terminal program). From there on, you will be presented with several options. To find or change the download location of any automated downloads, see the file "Config/Default-file.csv" and "Config/EnsemblSQL.txt". In the current "Default-file.csv" file, the web location of all Ensembl files to download is pointed towards build 49 of Ensembl.To update the data for a new build of Ensembl, these URL will need to be replaced with the address of the target build, identified by examining the URLs in the parent directories. These changes are all that should be required when downloading a new version of Ensembl, since AltAnalyze will download and parse these files using the update.py.

               When updating the Affymetrix, Ensembl, UniProt and UCSC genome database associations, the user must ensure that all of these databases are built off the same genomic build (e.g., NCBI36 aka HG18). Currently, two files must be downloaded manually from the Affymetrix website (probe set sequence and annotation files for exon arrays).Additional functionality and user-interface based control for obtaining AltAnalyze updates will be included in subsequent builds of AltAnalyze. For the AltMouseA array, full gene sequences in fasta format with 2kb upstream and downstream must be downloaded from BioMart in order to update genomic coordinate alignments for any new NCBI genomic build.

 


Section 7 - Evaluation of AltAnalyze Predictions

To assess AltAnalyze exon array analysis performance relative to other published approaches, we analyzed published experimental confirmation results for a dataset of splicing factor knockdown (mouse polypyrimidine tract binding protein (PTB) short-hairpin RNA (shRNA)). From two independent analyses [11,12], alternative splicing (AS) for 109 probe sets was assessed in the mouse PTB shRNA dataset by RT-PCR (Supplemental Tables 1,2 and 4 from the referenced study) [12]. Among these, 25 were false positives, one was a true negative, one undetermined and 81 were true positives.

 

Summary of Published MADS Results

In the analysis by Xing et al., alternative exons discovered by multiple splicing array platforms and RT-PCR were examined using a new algorithm named microarray analysis of differential splicing or MADS. MADS implements a modification of the splicing index method on gene expression values obtained using multiple scripts from GeneBase and filtering of probes predicted to hybridize to multiple genomic targets. Using this algorithm, the authors were able to verify AS detected by RT-PCR of mouse Affymetrix Exon 1.0 array data as well as predict and validate 27 novel splicing events by RT-PCR. Using the microarray CEL files posted by Xing and colleagues, we ran AltAnalyze using default options (same as used in the corresponding primary report), which includes quantile normalization via RMA-sketch using AltAnalyze's interface to Affymetrix Power Tools.

 

AltAnalyze Results

Of the 109 probe sets linked to splicing events characterized by RT-PCR, 78 were analyzed by AltAnalyze. Those probe sets not analyzed by AltAnalyze were either not apart of the AltAnalyze "core" probe sets or were excluded due to high detection p-value or low expression thresholds. A break-down of the number of RT-PCR true positive, false positive, undetermined and false negative (out of the 81 documented true positives) is shown for various AltAnalyze filters (Table 7.1).

Of the 78 analyzed probe sets, 26 were called by AltAnalyze to be alternatively regulated (using default parameters), out of 194 probe sets called by AltAnalyze as alternatively regulated. All 26 probe sets were annotated as true positives according to the published RT-PCR data. Although 17 probe sets were RT-PCR false positives among the 78 probe sets with RT-PCR data and analyzed by AltAnalyze, only one false positive was observed with any of the AltAnalyze filters alone (splicing-index p<0.05 alone without additional default options). The MADS algorithm was able to validate AS for 27 novel splice events corresponding to 41 probe sets. Of these 41 probe sets, 33 were analyzed by AltAnalyze and 23 were considered alternatively regulated.

 

Conclusions

This analysis suggests that AltAnalyze analysis using default parameters produces conservative results with high specificity (100% true positives in this analysis) with reasonable sensitivity(~42% that of MADS). For smaller datasets, such as the PTB knock-down comparison, the decreased sensitivity results will have a significant impact on the number of true splicing events detected, however, for larger datasets with thousands of regulated probe sets, AltAnalyze is likely to produce a reduction in the number of false positives and reduction in overall noise. It is important to note however, for the MADS analysis only a p-value threshold was used and for AltAnalyze, both a MiDAS and splicing-index p-value in addition to splicing-index fold change thresholds were used. For these analyses, we have not filtered probe sets based on association with annotated splicing events (see Materials and Methods for description), which should further decrease false positives. If probe sets with annotated splicing events are filtered out, 20 versus 26 true positives will remain.

Although a false positive rate of up-to 50% has been reported with the conventional splicing-index implementation (e.g., using the Affymetrix ExACT software) [12], AltAnalyze's analysis differs in several ways. First, in this analysis RMA-sketch was used as the method for quantile normalization. After obtaining expression values for probe sets (no low level filtering currently implemented), probe sets for each of the two biological groups are filtered based on two main parameters: DABG p-value and mean expression filters. Any probe set with a DABG p-value > 0.05 in both biological groups or mean expression < 70 are excluded.

Additional AltAnalyze specific parameters relate to how probe set to gene associations are obtained, which probe sets are selected for analysis and how probe sets are selected for calculation of gene expression. Unlike ExACT, probe set to gene association are via genomic coordinate alignment to Ensmebl/UCSC mRNA transcripts for unique Ensembl genes rather than to Affymetrix transcript clusters. This process ensures that a probe set only aligns to one Ensembl gene. Probe sets can align to an exon, intron or UTR of a gene. Any probe set aligning to an analyzed mRNA is used for analysis in the AltAnalyze "core" set along with any Affymetrix annotated "core" probe set. To determine gene expression from the exon-level a two-step method is employed. First, probe sets that are most over-represented among mRNAs or mRNAs and ESTs (associations from Affymetrix probe set annotation file) are selected as constitutive. Next if constitutive probe set is "expressed" in both biological groups (using the DABG and mean expression filters listed above), the probe set is retained for constitutive expression calculation. If more than one "expressed" constitutive probe set is present, the mean expression of all constitutive probe sets for each array is calculated.As a final step, probe sets with an associated gene expression difference between the two array groups greater than 3 are not reported. These analysis steps result in a unique splicing-index, splicing-index t-test p-value and MiDAS p-value calculation from other analysis methods.

Since this validation set provides a limited test case for analysis of type I (false positive) and type II (false negative) errors, the AltAnalyze algorithms will continue to be assessed as additional validation data is made available. Future implementations will likely include "low-level" analyses that reduce the occurrence of type I errors (e.g., elimination of expression data for specific probes that introduce additional noise). However, this data supports the concept that AltAnalyze produces conservative results with a level of confidence.

 

 

 

 

 

AltAnalyze RMA Analysis

 

All

MADS p<0.05

MADS p<0.01

All

SF>2

SP <0.05

SP|MP <0.05

TP

81

62

55

60

30

36

32

26

FP

25

6

2

17

0

1

0

0

UD

1

1

1

1

0

1

1

0

FN

0

19

26

21

51

45

49

55

 

Table 7.1 - Analysis of PTB shRNA verified splicing events with AltAnalyze. The number of probe sets matched to different RT-PCR absence or presence calls for AS from an analysis of mouse PTB shRNA knockdown of a neuroblastoma cell line compared to empty vector shRNA knockdown using the Affymetrix Mouse 1.0 Exon array [12]. Results are shown for all probe sets (All), MADS p<0.05 or p<0.01 (after removing cross-hybridizing probes), AltAnalyze "core" probe sets (All) analyzed by RT-PCR, splicing-index fold change >2, splicing-index t-test p (SP) <0.05, MiDAS p (MP) <0.05 or combination of all three options (SF>2 & SP & MP <0.05 - default option) along with gene expression filtering (GEF) of < 3 fold, for true positive (TP), false positive (FP), un-determined (UD), and false negative (FN) RT-PCR results. False positives consider probe sets are relative to all original true positives identified, independent of whether they were considered after filtering in AltAnalyze.

 

 


Section 8 - Analysis of AltAnalyze Results DomainGraph

 

Once alternative probe sets have been identified from an AltAnalyze exon array analysis, you can easily load this data in the Cytoscape plugin DomainGraph to:

1)     Visualize interaction networks between proteins and specific protein domains.

2)     Assess which probe sets overlap with a set of loaded genes and which specific protein domains at a high-level (protein/domain network/pathway) and low-level (domain/exon/probe set view).

3)     Immediately see which probe sets overlap with known/predicted alternative splicing events, alternative promoter selection or microRNA binding sites.

4)     View alternative probe set data in the context of WikiPathways and Reactome gene networks.

Figure 8.1. Visualization of AltAnalyze regulated probe sets along exons and protein domains. In the top panel, a loaded Cytoscape DomainGraph network is shown for the gene FYN, with relevant protein domain interactions shown between two alternative isoforms of each gene. Rounded boxes represent gene nodes and diamonds, protein domains and other functional elements. Greenish yellow nodes represent those containing AltAnalyze regulated probe sets, whereas green do not overlap with an AltAnalyze regulated probe set. The gene FYN has been selected in the main network that creates a domain architecture and exon structure view for the select FYN isoform in the Cytoscape "Data Panel". Domains (top), exons (middle) and probe sets (bottom) are shown that correspond to the FYN isoforms ENSP00000229470 and ENSP00000229471, with AltAnalyze down-regulated probe sets in green. Probe sets with a solid black border are associated with an alternative splicing (alternative cassette exon) or alternative promoter annotation. In this example, the probe set with an alternatively splicing annotation overlaps with exon 8 and the Protein Kinase domain of the protein. Probe sets with a blue bar beneath them overlap with predicted microRNA binding sites. Details about each domain, exon, probe set and microRNA binding site, including AltAnalyze statistics and functional annotations are accessible by mousing-over the respective feature and by left-clicking the object to link-out to resources on the web.

 

 

Installing DomainGraph

Please note, the below instructions pertain to DomainGraph 2.0 and will be updated with the next major release, including use of the new DomainGraph output file format produced by AltAnalyze 1.1.

1)     Prior to installing DomainGraph, go to http://www.cytoscape.org and download the latest version of Cytoscape.

2)     DomainGraph, like most other Cytoscape plugins, can be downloaded within Cytoscape, after installation by choosing the Plugins>Manage Plugins>Network Inference menu. DomainGraph will be listed under the "Available for Install" plugins. If you believe this plugin is outdated or you wish to access additional information on DomainGraph, go to http://domaingraph.bioinf.mpi-inf.mpg.de/.

3)     After installation, go to the menu Plugins>DomainGraph>ManageGaph database>Import data for selected species into database.

4)     Next, you will be asked to register your installation of DomainGraph (DomainGraph is currently free ONLY for non-commercial use). After filling out the form, choose to accept the user agreement and select OK. You will receive a registration key by email.

5)     Enter the registration key into the bottom of the registration form you filled out. If closed, you can access this again by performing step 3.

6)     Repeat step 3 and download the DomainGraph database for your species. AltAnalyze currently supports mouse and human analyses. Database download can take several minutes.

7)     Once finished, you are ready to import a gene network and AltAnalyze results.

 

Obtaining AltAnalyze Input for DomainGraph

DomainGraph analyses are compatible with data from Affymetrix Exon 1.0 arrays. To get AltAnalyze results, you must be analyzing data from this array platform. Download AltAnalyze from the AltAnalyze website. Of the several result files created by AltAnalyze for every pair-wise comparison (experimental group compared to a control group of arrays), the dataset file in the AltResults/DomainGraph folder of the user-defined output directory.

 

Importing and Running DomainGraph with AltAnalyze

Detailed and updated instructions for each new DomainGraph version are provided online at:

http://groups.google.com/group/alt_predictions/web/visualizing-altanalyze-results-in-domaingraph.

 


References

1. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, et al. (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2: 2366-2382.

2. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, et al. (2007) Ensembl 2007. Nucleic Acids Res 35: D610-617.

3. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, et al. (2008) The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 36: D773-779.

4. Salomonis N, Hanspers K, Zambon AC, Vranizan K, Lawlor SC, et al. (2007) GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8: 217.

5. van Iersel MP, Kelder T, Pico AR, Hanspers K, Coort S, et al. (2008) Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics 9: 399.

6. Srinivasan K, Shiue L, Hayes JD, Centers R, Fitzwater S, et al. (2005) Detection and measurement of alternative splicing using splicing-sensitive microarrays. Methods 37: 345-359.

7. Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, et al. (2006) Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 7: 325.

8. Ule J, Ule A, Spencer J, Williams A, Hu JS, et al. (2005) Nova regulates brain-specific splicing to shape the synapse. Nat Genet 37: 844-852.

9. Wheeler R (2002) A method of consolidating and combining EST and mRNA alignments to a genome to enumerate supported splice variants Algorithms in Bioinformatics: Second International Workshop,                Springer Berlin / Heidelberg Volume 2452/2002: 201-209.

10. Sugnet CW, Srinivasan K, Clark TA, O'Brien G, Cline MS, et al. (2006) Unusual intron conservation near tissue-regulated exons found by splicing microarrays. PLoS Comput Biol 2: e4.

11. Boutz PL, Stoilov P, Li Q, Lin CH, Chawla G, et al. (2007) A post-transcriptional regulatory switch in polypyrimidine tract-binding proteins reprograms alternative splicing in developing neurons. Genes Dev 21: 1636-1652.

12. Xing Y, Stoilov P, Kapur K, Han A, Jiang H, et al. (2008) MADS: a new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. Rna 14: 1470-1479.

 

 

 

 

Untitled Document <body bgcolor="#FFFFFF"> </body> Untitled Document

Google Groups Page
AltAnalyze Gene Expression Tutorial
AltAnalyze Alternative Splicing Tutorial
Section 1 - Introduction
1.1 Program Description
1.2 Implementation
1.3 Requirements
1.4 Pre-Processing, External Files and Applications
1.5 Help with AltAnalyze
Section 2 - Running AltAnalyze for Splicing Arrays
2.1 Where to Save Input Expression Files?
2.2 Beginning an AltAnalyze Run
2.3 AltAnalyze Analysis Options
2.4 Overview of Analysis Results
Section 3 - Algorithms
3.1 Default Methods
3.2 Algorithm Descriptions
3.3 Alternative Splicing Prediction
3.4 Protein/RNA Inference Analysis
Section 4 - Using External Programs with AltAnalyze
4.1 Configuring R
Section 5 - Software Infrastructure
5.1 Overview
5.2 ExpressionBuilder Module
Section 6 - Building AltAnalyze Annotation Files
6.1 Splicing Annotations and Protein Associations
6.2 Building Ensembl-Probe Set Associations
6.3 Extracting UniProt Protein Domain Annotations Overview
6.4 Extracting Ensembl Protein Domain Annotations Overview
6.5 Extracting microRNA Binding Annotations Overview
6.6 Inferring Protein-Probe Set Associations Overview
6.7 Required Files for Manual Update
Section 7 - Evaluation of AltAnalyze Predictions
Section 8 - Analysis of AltAnalyze Results DomainGraph