Preprocessing Affymetrix Microarray Data at the Probe Level

Open Script

This example shows how to use MATLAB® and Bioinformatics Toolbox™ for preprocessing Affymetrix® oligonucleotide microarray probe-level data with two preprocessing techniques, Robust Multi-array Average (RMA) and GC Robust Multi-array Average (GCRMA).

Introduction

With Affymetrix oligonucleotide microarray platforms, gene expression is measured using probe sets consisting of 11 to 20 perfect match (PM) probes (25 nucleotides in length) complementary to target mRNA sequences. Each probe set also has the same number of mismatch (MM) probes, in which the 13th nucleotide has been changed to its complement. The PM probes are designed for gene specific hybridization. The control MM probe measurements are thought to comprise most of the background non-specific binding, such as cross-hybridization. A PM probe and its corresponding MM probe are referred to as a probe pair.

The measured probe intensities and locations from a hybridized microarray are stored in a CEL file. For each Affymetrix microarray platform, the information relating probe pairs to probe set IDs, and to locations on the array is stored in a CDF library file. The probe sequence information is stored in a sequence file (FASTA or tab-separated format).

In general, preprocessing Affymetrix probe-level expression data consists of three steps: background adjustment, normalization, and summarization at the probe set level as a measure of the expression level of corresponding mRNA. Many methods exist for the statistical procedures of these three steps. Two popular techniques, RMA (Irizarry et al., 2003) and GCRMA (Wu et al., 2004), are used in this example.

Note: This example shows the RMA and GCRMA preprocessing procedures to compute expression values from input CEL files in step-by-step detail, using several functions. You can also complete the same RMA or GCRMA techniques in one function call by using the Bioinformatics Toolbox affyrma or affygcrma functions, respectively.

A publicly available dataset containing Affymetrix microarray measurements of 42 tumor tissues of the embryonal central nervous system (CNS, Pomeroy et al., 2002) is used for this example. You will import and access the probe level data of multiple arrays, and then perform expression level measurements with RMA and GCRMA preprocessing methods.

Importing Data

The CNS experiment was conducted using the Affymetrix HuGeneFL GeneChip® array, and the data were stored in CEL files. Information related to each probe is contained in the Affymetrix Hu6800 CDF library file.

The example uses the Hu6800 CDF library file, which you can download from here. Extract the Hu6800.CDF file into a directory, such as C:\Examples\affypreprocessdemo\libfiles. Note: You will have to register in order to access the library files, but you do not have to run the setup.exe file in the archive.

The CNS dataset (CEL files) is available here. To complete this example, download the CEL files of the CNS dataset into a directory, such as C:\Examples\affypreprocessdemo\data. Unzip the CEL file archives. Note: This dataset contains more CEL files than are needed for this example.

CNS_DataA_Sample_CEL.txt, a file provided with Bioinformatics Toolbox, contains a list of the 42 CEL filenames used for this example, and the samples (10 medulloblastomas, 10 rhabdoid, 10 malignant glioma, 8 supratentorial PNETS, and 4 normal human cerebella) to which they belong. Load this data into two MATLAB variables.

fid = fopen('CNS_DataA_Sample_CEL.txt','r');
ftext = textscan(fid,'%q%q');
fclose(fid);
samples = ftext{1};
cels = ftext{2};

Set the variables celPath and libPath to the paths of the CEL files and library directories.

celPath = 'C:\Examples\affypreprocessdemo\data';
libPath = 'C:\Examples\affypreprocessdemo\libfiles';

Rename the cel files so that each file name starts with the MG number that follows the underscore "_" in the original file name. For instance, GSM1688666_MG1999060202AA.CEL is renamed to MG1999060202AA.CEL. You do not need to run this code if the file names are already in the required format.

A = dir(fullfile(celPath,'*.cel'));
fileNames = string({A.name});
for iFile = 1:numel(A)
    newName = fullfile(celPath,extractAfter(fileNames(iFile),"_"));
    movefile(fullfile(celPath,fileNames(iFile)),newName);
end

The function celintensityread can read multiple CEL files and access a CDF library file. It returns a MATLAB structure containing the probe information and probe intensities. The matrices of PM and MM intensities from multiple CEL files are stored in the PMIntensities and MMIntensities fields. In each probe intensity matrix, the column indices correspond to the order in which the CEL files were read, and each row corresponds to a probe. Create a MATLAB structure of PM and MM probe intensities by loading data from the CEL files from the directory where the CEL files are stored, and pass in the path to where you stored the CDF library file. (Note: celintensityread will report the progress to the MATLAB command window. You can turn the progress report off by setting the input parameter VERBOSE to false.)

probeData = celintensityread(cels, 'Hu6800.CDF',...
                 'celpath', celPath, 'cdfpath', libPath, 'pmonly', false)

Reading CDF file: Hu6800.CDF
Reading file 1 of 42: MG2000040501AA
Reading file 2 of 42: MG2000040502AA
Reading file 3 of 42: MG2000040504AA
Reading file 4 of 42: MG2000040505AA
Reading file 5 of 42: MG2000040508AA
Reading file 6 of 42: MG2000040509AA
Reading file 7 of 42: MG2000040510AA
Reading file 8 of 42: MG2000040511AA
Reading file 9 of 42: MG2000040512AA
Reading file 10 of 42: MG2000040513AA
Reading file 11 of 42: MG2000051201AA
Reading file 12 of 42: MG2000051202AA
Reading file 13 of 42: MG2000051204AA
Reading file 14 of 42: MG2000051205AA
Reading file 15 of 42: MG2000051209AA
Reading file 16 of 42: MG2000071102AA
Reading file 17 of 42: MG2000051207AA
Reading file 18 of 42: MG2000051208AA
Reading file 19 of 42: MG2000051211AA
Reading file 20 of 42: MG2000051213AA
Reading file 21 of 42: MG2000061902AA
Reading file 22 of 42: MG2000061903AA
Reading file 23 of 42: MG2000061904AA
Reading file 24 of 42: MG2000061905AA
Reading file 25 of 42: MG2000061906AA
Reading file 26 of 42: MG2000070709AA
Reading file 27 of 42: MG2000070710AA
Reading file 28 of 42: MG2000070711AA
Reading file 29 of 42: MG2000070712AA
Reading file 30 of 42: MG2000070713AA
Reading file 31 of 42: MG1999112206AA
Reading file 32 of 42: MG2000033109AA
Reading file 33 of 42: MG2000033106AA
Reading file 34 of 42: MG2000033107AA
Reading file 35 of 42: MG1999112202AA
Reading file 36 of 42: MG1999112204AA
Reading file 37 of 42: MG2000011801AA
Reading file 38 of 42: MG2000031503AA
Reading file 39 of 42: MG2000032015AA
Reading file 40 of 42: MG2000030308AA
Reading file 41 of 42: MG2000011803AA
Reading file 42 of 42: MG2000011807AA

probeData = 

  struct with fields:

          CDFName: 'Hu6800.CDF'
         CELNames: {1×42 cell}
         NumChips: 42
     NumProbeSets: 7129
        NumProbes: 140983
      ProbeSetIDs: {7129×1 cell}
     ProbeIndices: [140983×1 uint8]
     GroupNumbers: [140983×1 uint8]
    PMIntensities: [140983×42 single]
    MMIntensities: [140983×42 single]

Determine the number of CEL files loaded.

nSamples = probeData.NumChips

nSamples =

    42

Determine the number of probe sets on a HuGeneFL array.

nProbeSets = probeData.NumProbeSets

nProbeSets =

        7129

Determine the number of probes on a HuGeneFL array.

nProbes = probeData.NumProbes

nProbes =

      140983

To perform GCRMA preprocessing, the probe sequence information of the HuGeneFL array is also required. The Affymetrix support site provides probe sequence information for most of the available arrays, either as FASTA formatted or tab-delimited files. This example assumes you have the HuGeneFL_probe_tab file in the library files directory. Use the function affyprobeseqread to parse the sequence file and return the probe sequences in an nProbes x 25 matrix of integers that represents the PM probe sequence bases, with rows corresponding to the probes on the chip and columns corresponding to the base positions of the 25-mer.

S = affyprobeseqread('HuGeneFL_probe_tab', 'Hu6800.CDF',...
                'seqpath', libPath, 'cdfpath', libPath, 'seqonly', true)

S = 

  struct with fields:

    SequenceMatrix: [140983×25 uint8]

Preprocessing Probe-Level Expression Data

The RMA procedure uses only PM probe intensities for background adjustment (Irizarry et al., 2003), while GCRMA adjusts background using probe sequence information and MM control probe intensities to estimate non-specific binding (Wu et al., 2004). Both RMA and GCRMA are preceded by quantile normalization (Bolstad et al., 2003) and median polish summarization (Irizarry et al., 2003) of PM intensities.

Using the RMA Procedure

The RMA background adjustment method corrects PM probe intensities chip by chip. The PM probe intensities are modeled as the sum of a normal noise component and an exponential signal component. Use rmabackadj to background adjust the PM intensities in the CNS data. You can inspect the intensity distribution histogram and the estimated background adjustment of a specific chip by setting the input parameter SHOWPLOT to the column index of the chip.

pms_bg = rmabackadj(probeData.PMIntensities, 'showplot', 1);

Several nonlinear normalization methods have been successfully applied to Affymetrix microarray data. The RMA procedure normalizes the probe-level data with a quantile normalization method. Use quantilenorm to normalize the background adjusted PM intensities in the CNS data. Note: If you are interested in a rank-invariant set normalization method, use the affyinvarsetnorm function instead.

pms_bgnorm = quantilenorm(pms_bg);

A median polish procedure is applied to the PM intensities in summarization. To calculate the expression values, use rmasummary to summarize probe intensities of each probe set across multiple chips. The expression values are the probe set intensity summaries on a log-2 scale.

cns_rma_exp = rmasummary(probeData.ProbeIndices, pms_bgnorm);

Using the GCRMA Procedure

The GCRMA procedure adjusts for optical noise and non-specific binding (NSB) taking into account the effect of the stronger bonding of G/C pairs (Naef et al., 2003, Wu et al., 2004). GCRMA uses probe sequence information to estimate probe affinities for computing non-specific binding. The probe affinity is modeled as a sum of the position-dependent base effects. Usually, the probe affinities are estimated from the MM intensities of an NSB experiment. If NSB data is not available, the probe affinities can still be estimated from sequence information and MM probe intensities normalized by the probe set median intensity (Naef et al., 2003).

For the CNS dataset, use the data from the microarray hybridized with the normal cerebella sample (Brain_Ncer_1) to compute the probe affinities for the HuGeneFL array. Use affyprobeaffinities to estimate the probe affinities of an Affymetrix microarray. Use the SHOWPLOT input parameter to inspect a plot showing the effects of base A, C, G, and T at the 25 positions.

figure
idx = find(strcmpi('Brain_Ncer_1', samples));
[pmAlpha, mmAlpha] = affyprobeaffinities(S.SequenceMatrix,...
                       probeData.MMIntensities(:, idx), 'showplot', true);

Note: There are 496 probes on a HuGeneFL array that do not have sequence information; the affinities for these probes were NaN.

With the probe affinities available, the amount of NSB can be estimated by fitting a LOWESS curve through MM probe intensities vs. MM probe affinities. The function gcrmabackadj performs optical and NSB corrections. The input parameter SHOWPLOT shows a plot of the optical noise adjusted MM intensities against its affinities, and the smooth fit of a specified chip. You can compute the background intensities with one of two estimation methods, Maximum Likelihood Estimate (MLE) and Empirical-Bayes (EB), which computes the posterior mean of specific binding given prior observed intensities. Here you will background adjust four arrays using both estimation methods. (Note: gcrmabackadj will report the progress to the MATLAB command window. You can turn the progress report off by setting the input parameter VERBOSE to false.)

Background adjust the first four chips using GCRMA-MLE method, and inspect the plot of intensity vs. affinity for data from the third array.

pms_MLE_bg = gcrmabackadj(probeData.PMIntensities(:,1:4),...
                            probeData.MMIntensities(:, 1:4),...
                            pmAlpha, mmAlpha, 'showplot', 3);

% Adjust YLIM for better view
ylim([4 16]);

Adjusting background for chip # 1 of 4 using MLE method.
Adjusting background for chip # 2 of 4 using MLE method.
Adjusting background for chip # 3 of 4 using MLE method.
Adjusting background for chip # 4 of 4 using MLE method.

Background adjust the first four chips using the GCRMA-EB method. Processing with this method is more computationally intensive and will take longer.

pms_EB_bg = gcrmabackadj(probeData.PMIntensities(:,1:4),...
                            probeData.MMIntensities(:, 1:4),...
                            pmAlpha, mmAlpha, 'method', 'EB');

Adjusting background for chip # 1 of 4 using EB method.
Adjusting background for chip # 2 of 4 using EB method.
Adjusting background for chip # 3 of 4 using EB method.
Adjusting background for chip # 4 of 4 using EB method.

You can continue the preprocessing with the quatilenorm and rmasummary functions, or use the gcrma function to do everything. The gcrma function performs background adjustment and returns expression measures of background adjusted PM probe intensities using the same normalization and summarization methods as RMA. You can also pass in the sequence matrix instead of affinities. The function will automatically compute the affinities in this case. (Note: gcrma will report the progress to the MATLAB command window. You can turn the progress report off by setting the input parameter VERBOSE to false.)

cns_mle_exp = gcrma(probeData.PMIntensities, probeData.MMIntensities,...
                    probeData.ProbeIndices, pmAlpha, mmAlpha);

Adjusting background for chip # 1 of 42 using MLE method.
Adjusting background for chip # 2 of 42 using MLE method.
Adjusting background for chip # 3 of 42 using MLE method.
Adjusting background for chip # 4 of 42 using MLE method.
Adjusting background for chip # 5 of 42 using MLE method.
Adjusting background for chip # 6 of 42 using MLE method.
Adjusting background for chip # 7 of 42 using MLE method.
Adjusting background for chip # 8 of 42 using MLE method.
Adjusting background for chip # 9 of 42 using MLE method.
Adjusting background for chip # 10 of 42 using MLE method.
Adjusting background for chip # 11 of 42 using MLE method.
Adjusting background for chip # 12 of 42 using MLE method.
Adjusting background for chip # 13 of 42 using MLE method.
Adjusting background for chip # 14 of 42 using MLE method.
Adjusting background for chip # 15 of 42 using MLE method.
Adjusting background for chip # 16 of 42 using MLE method.
Adjusting background for chip # 17 of 42 using MLE method.
Adjusting background for chip # 18 of 42 using MLE method.
Adjusting background for chip # 19 of 42 using MLE method.
Adjusting background for chip # 20 of 42 using MLE method.
Adjusting background for chip # 21 of 42 using MLE method.
Adjusting background for chip # 22 of 42 using MLE method.
Adjusting background for chip # 23 of 42 using MLE method.
Adjusting background for chip # 24 of 42 using MLE method.
Adjusting background for chip # 25 of 42 using MLE method.
Adjusting background for chip # 26 of 42 using MLE method.
Adjusting background for chip # 27 of 42 using MLE method.
Adjusting background for chip # 28 of 42 using MLE method.
Adjusting background for chip # 29 of 42 using MLE method.
Adjusting background for chip # 30 of 42 using MLE method.
Adjusting background for chip # 31 of 42 using MLE method.
Adjusting background for chip # 32 of 42 using MLE method.
Adjusting background for chip # 33 of 42 using MLE method.
Adjusting background for chip # 34 of 42 using MLE method.
Adjusting background for chip # 35 of 42 using MLE method.
Adjusting background for chip # 36 of 42 using MLE method.
Adjusting background for chip # 37 of 42 using MLE method.
Adjusting background for chip # 38 of 42 using MLE method.
Adjusting background for chip # 39 of 42 using MLE method.
Adjusting background for chip # 40 of 42 using MLE method.
Adjusting background for chip # 41 of 42 using MLE method.
Adjusting background for chip # 42 of 42 using MLE method.
Normalizing.
Calculating expression.

Inspecting the Background Adjustment Results

Use boxplots to inspect the PM intensity distributions of the first four chips with three background adjustment procedures.

figure
subplot(4,1,1)
maboxplot(log2(probeData.PMIntensities(:, 1:4)), samples(1:4),...
          'title','Raw Intensities', 'orientation', 'horizontal')
subplot(4,1,2)
maboxplot(log2(pms_bg(:,1:4)), samples(1:4),...
          'title','After RMA Background Adjustment','orient','horizontal')
subplot(4,1,3)
maboxplot(log2(pms_MLE_bg), samples(1:4),...
          'title','After GCRMA-MLE Background Adjustment','orient','horizontal')
subplot(4,1,4)
maboxplot(log2(pms_EB_bg), samples(1:4),...
          'title','After GCRMA-EB Background Adjustment','orient','horizontal')

Use boxplots to inspect the background corrected and normalized PM intensity distributions of the first four chips with three background adjustment procedures.

pms_MLE_bgnorm = quantilenorm(pms_MLE_bg);
pms_EB_bgnorm  = quantilenorm(pms_EB_bg);

figure
subplot(3,1,1)
maboxplot(log2(pms_bgnorm(:, 1:4)), samples(1:4),...
          'title','Normalized RMA Background Adjusted Intensity',...
          'orientation', 'horizontal')
subplot(3,1,2)
maboxplot(log2(pms_MLE_bgnorm), samples(1:4),...
          'title','Normalized GCRMA-MLE Background Adjusted Intensity',...
          'orientation', 'horizontal')
subplot(3,1,3)
maboxplot(log2(pms_EB_bgnorm), samples(1:4),...
          'title','Normalized GCRMA-EB Background Adjusted Intensity',...
          'orientation', 'horizontal')

Final Remarks

You can perform importing of data from CEL files and all three preprocessing steps of the RMA and GCRMA techniques shown in this example by using the affyrma and affygcrma functions respectively.

Affymetrix and GeneChip are registered trademarks of Affymetrix, Inc.

References

[1] Pomeroy, S.L., et al., "Prediction of central nervous system embryonal tumour outcome based on gene expression", Nature, 415(6870):436-42, 2002.

[2] Irizarry, R.A., et al., "Exploration, normalization, and summaries of high density oligonucleotide array probe level data", Biostatistics, 4(2):249-64, 2003.

[3] Wu, Z., et al., "A model based background adjustment for oligonucleotide expression arrays", Journal of the American Statistical Association, 99(468):909-17, 2004.

[4] Bolstad, B.M., et al., "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias", Bioinformatics, 19(2):185-93, 2003.

[5] Naef, F., and Magnasco, M.O. "Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays", Physical Review, E, Statistical, Nonlinear and Soft Matter Physics, 68(1Pt1):011906, 2003.