RegressionTree class

Superclasses: CompactRegressionTree

Regression tree

Description

A decision tree with binary splits for regression. An object of class RegressionTree can predict responses for new data with the predict method. The object contains the data used for training, so it can also compute resubstitution predictions.

Construction

tree = fitrtree(x,y) returns a regression tree based on the input variables (also known as predictors, features, or attributes) x and output (response) y. tree is a binary tree where each branching node is split based on the values of a column of x.

tree = fitrtree(x,y,Name,Value) fits a tree with additional options specified by one or more Name,Value pair arguments.

Input Arguments

x

A matrix of predictor values. Each column of x represents one variable, and each row represents one observation.

fitrtree considers NaN values in x to be missing values. fitrtree does not use observations with all missing values for x in the fit. fitrtree uses observations with some missing values for x to find splits on variables for which these observations have valid values.

y

A numeric column vector with the same number of rows as x. Each entry in y is the response to the data in the corresponding row of x.

fitrtree considers NaN values in y to be missing values. fitrtree does not use observations with missing values for y in the fit.
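The missing-value rules above can be sketched with plain logical indexing. This is only an illustration of which rows would take part in the fit, not the internal code of fitrtree; the data and the variable names (allMissingX, missingY, used) are made up for the example.

```matlab
% Hypothetical data: four observations, two predictors.
x = [1 2; NaN NaN; 3 NaN; 4 5];
y = [10; 20; 30; NaN];

allMissingX = all(isnan(x), 2);   % rows where every predictor is NaN
missingY    = isnan(y);           % rows where the response is NaN

% Rows that remain usable under the rules described above.
% Row 3 stays: it has a valid value in column 1, so it can still
% contribute to splits on that variable.
used = ~(allMissingX | missingY);
disp(used')
```

Here rows 1 and 3 are kept, while row 2 (all predictors missing) and row 4 (missing response) are dropped.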

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes (' '). You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'CategoricalPredictors' — Categorical predictors list
numeric or logical vector | cell array of strings | character matrix | 'all'

Categorical predictors list, specified as the comma-separated pair consisting of 'CategoricalPredictors' and one of the following.

  • A numeric vector with indices from 1 to p, where p is the number of columns of x.

  • A logical vector of length p, where a true entry means that the corresponding column of x is a categorical variable.

  • A cell array of strings, where each element in the array is the name of a predictor variable. The names must match entries in the PredictorNames property.

  • A character matrix, where each row of the matrix is a name of a predictor variable. Pad the names with extra blanks so each row of the character matrix has the same length.

  • 'all', meaning all predictors are categorical.

Data Types: single | double | logical | char | struct | cell
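As a rough illustration of how the forms listed above relate to each other, the following sketch builds the logical, cell-array, and character-matrix forms from a numeric index vector. The predictor names and indices are hypothetical; any one of the resulting values could be passed as the 'CategoricalPredictors' value, provided the names match PredictorNames.

```matlab
% Hypothetical setup: three predictors, the last two are categorical.
predictorNames = {'Weight', 'Cyl', 'Year'};
p = numel(predictorNames);

catIdx = [2 3];                      % numeric index form

catLogical = false(1, p);            % logical form of length p
catLogical(catIdx) = true;

catNames = predictorNames(catIdx);   % cell array of strings form

catChar = char(catNames);            % character matrix form; char pads
                                     % shorter names with trailing blanks
disp(size(catChar))
```

Note that char pads 'Cyl' to 'Cyl ' so that both rows have the same length, which is the padding the character matrix form requires.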

'CrossVal' — Cross-validation flag
'off' (default) | 'on'

Cross-validation flag, specified as the comma-separated pair consisting of 'CrossVal' and either 'on' or 'off'.

If 'on', fitrtree grows a cross-validated decision tree with 10 folds. You can override this cross-validation setting using one of the 'KFold', 'Holdout', 'Leaveout', or 'CVPartition' name-value pair arguments. Note that you can only use one of these four options ('KFold', 'Holdout', 'Leaveout', or 'CVPartition') at a time when creating a cross-validated tree.

Alternatively, cross-validate tree later using the crossval method.

Example: 'CrossVal','on'
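A minimal sketch of the two routes described above, assuming the carsmall sample data set is available. The choice of predictors is arbitrary, and kfoldLoss is assumed here as the way to read the error from the cross-validated model.

```matlab
load carsmall                         % sample data: Weight, Horsepower, MPG
x = [Weight, Horsepower];

% Default cross-validation: 10 folds.
cvtree = fitrtree(x, MPG, 'CrossVal', 'on');

% Override the number of folds. Only one of 'KFold', 'Holdout',
% 'Leaveout', or 'CVPartition' may be given at a time.
cvtree5 = fitrtree(x, MPG, 'KFold', 5);

% Cross-validated mean squared error of each model.
cvErr  = kfoldLoss(cvtree);
cvErr5 = kfoldLoss(cvtree5);
```

Because the fold assignment is random, the two error values change from run to run unless the random number generator is seeded first.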

'CVPartition' — Partition for cross-validated tree
cvpartition object

Partition for cross-validated tree, specified as the comma-separated pair consisting of 'CVPartition' and an object created using cvpartition.

Note that if you use 'CVPartition', you cannot use any of the 'KFold', 'Holdout', or 'Leaveout' name-value pair arguments.

'Holdout' — Fraction of data for holdout validation
0 (default) | scalar value in the range [0,1]

Fraction of data used for holdout validation, specified as the comma-separated pair consisting of 'Holdout' and a scalar value in the range [0,1]. Holdout validation tests the specified fraction of the data, and uses the rest of the data for training.

Note that if you use 'Holdout', you cannot use any of the 'CVPartition', 'KFold', or 'Leaveout' name-value pair arguments.

Example: 'Holdout',0.1

Data Types: single | double

'KFold' — Number of folds
10 (default) | positive integer value

Number of folds to use in a cross-validated tree, specified as the comma-separated pair consisting of 'KFold' and a positive integer value.

Note that if you use 'KFold', you cannot use any of the 'CVPartition', 'Holdout', or 'Leaveout' name-value pair arguments.

Example: 'KFold',8

Data Types: single | double

'Leaveout' — Leave-one-out cross-validation flag
'off' (default) | 'on'

Leave-one-out cross-validation flag, specified as the comma-separated pair consisting of 'Leaveout' and either 'on' or 'off'. Set it to 'on' to use leave-one-out cross-validation.

Note that if you use 'Leaveout', you cannot use any of the 'CVPartition', 'Holdout', or 'KFold' name-value pair arguments.

Example: 'Leaveout','on'

'MergeLeaves' — Leaf merge flag
'on' (default) | 'off'

Leaf merge flag, specified as the comma-separated pair consisting of 'MergeLeaves' and either 'on' or 'off'. When 'on', fitrtree merges leaves that originate from the same parent node and whose risk values sum to a value greater than or equal to the risk associated with the parent node. When 'off', fitrtree does not merge leaves.

Example: 'MergeLeaves','off'

'MinLeaf' — Minimum number of leaf node observations
1 (default) | positive integer value

Minimum number of leaf node observations, specified as the comma-separated pair consisting of 'MinLeaf' and a positive integer value. Each leaf has at least MinLeaf observations. If you supply both MinParent and MinLeaf, fitrtree uses the setting that gives larger leaves: MinParent=max(MinParent,2*MinLeaf).

Example: 'MinLeaf',3

Data Types: single | double

'MinParent' — Minimum number of branch node observations
10 (default) | positive integer value

Minimum number of branch node observations, specified as the comma-separated pair consisting of 'MinParent' and a positive integer value. Each branch node in the tree has at least MinParent observations. If you supply both MinParent and MinLeaf, fitrtree uses the setting that gives larger leaves: MinParent=max(MinParent,2*MinLeaf).

Example: 'MinParent',8

Data Types: single | double
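The interaction between MinParent and MinLeaf is simple arithmetic. The following sketch evaluates the formula above for one hypothetical pair of settings; the variable names are illustrative only.

```matlab
% Hypothetical settings supplied by the user.
minLeaf   = 8;
minParent = 10;

% A branch node must be able to produce two leaves of at least minLeaf
% observations each, so the effective branch-node minimum is:
effectiveMinParent = max(minParent, 2*minLeaf);
disp(effectiveMinParent)
```

With these values the effective minimum is 16, because two leaves of 8 observations each need a parent of at least 16.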

'NVarToSample' — Number of predictors for split
'all' (default) | positive integer value

Number of predictors to select at random for each split, specified as the comma-separated pair consisting of 'NVarToSample' and a positive integer value. You can also specify 'all' to use all available predictors.

Example: 'NVarToSample',3

Data Types: single | double

'PredictorNames' — Predictor variable names
{'x1','x2',...} (default) | cell array of strings

Predictor variable names, specified as the comma-separated pair consisting of 'PredictorNames' and a cell array of strings containing the names for the predictor variables, in the order in which they appear in x.

Data Types: cell

'Prune' — Pruning flag
'on' (default) | 'off'

Pruning flag, specified as the comma-separated pair consisting of 'Prune' and either 'on' or 'off'. When 'on', fitrtree computes the full tree and the optimal sequence of pruned subtrees. When 'off', fitrtree computes the full tree without pruning.

Example: 'Prune','off'

'PruneCriterion' — Pruning criterion
'error' (default)

Pruning criterion, specified as the comma-separated pair consisting of 'PruneCriterion' and 'error'.

Example: 'PruneCriterion','error'

'QEToler' — Quadratic error tolerance
1e-6 (default) | positive scalar value

Quadratic error tolerance per node, specified as the comma-separated pair consisting of 'QEToler' and a positive scalar value. Splitting nodes stops when quadratic error per node drops below QEToler*QED, where QED is the quadratic error for the entire data computed before the decision tree is grown.

Example: 'QEToler',1e-4

'ResponseName' — Response variable name
'Y' (default) | string

Response variable name, specified as the comma-separated pair consisting of 'ResponseName' and a string containing the name of the response variable in y.

Example: 'ResponseName','Response'

Data Types: char

'ResponseTransform' — Response transform function
'none' (default) | function handle

Response transform function for transforming the raw response values, specified as the comma-separated pair consisting of 'ResponseTransform' and either a function handle or 'none'. The function handle should accept a matrix of response values and return a matrix of the same size. The default string 'none' means @(x)x, or no transformation.

Add or change a ResponseTransform function using dot notation:

tree.ResponseTransform = @function

Data Types: function_handle
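As a small sketch of what a suitable transform looks like, the handles below accept a matrix and return a matrix of the same size. The squaring transform is purely hypothetical and stands in for whatever function a given application needs.

```matlab
% Raw response values (hypothetical).
rawResponse = [0 1; 2 3];

% The default 'none' behaves like the identity function.
identityTransform = @(x) x;

% A hypothetical element-wise transform. Element-wise operators keep
% the output the same size as the input, as the description requires.
squareTransform = @(x) x.^2;

transformed = squareTransform(rawResponse);
disp(transformed)
```

A handle such as squareTransform could then be assigned with the dot notation shown above.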

'SplitCriterion' — Split criterion
'MSE' (default)

Split criterion, specified as the comma-separated pair consisting of 'SplitCriterion' and 'MSE', meaning mean squared error.

Example: 'SplitCriterion','MSE'

'Surrogate' — Surrogate decision splits flag
'off' | 'on' | 'all' | positive integer value

Surrogate decision splits flag, specified as the comma-separated pair consisting of 'Surrogate' and 'on', 'off', 'all', or a positive integer value.

  • When 'on', fitrtree finds at most 10 surrogate splits at each branch node.

  • When set to a positive integer value, fitrtree finds at most the specified number of surrogate splits at each branch node.

  • When set to 'all', fitrtree finds all surrogate splits at each branch node. The 'all' setting can use much time and memory.

Use surrogate splits to improve the accuracy of predictions for data with missing values. The setting also enables you to compute measures of predictive association between predictors.

Example: 'Surrogate','on'

Data Types: single | double

'Weights' — Observation weights
ones(size(X,1),1) (default) | vector of scalar values

Observation weights, specified as the comma-separated pair consisting of 'Weights' and a vector of scalar values. The length of Weights is the number of rows in x.

Data Types: single | double

Properties

CategoricalPredictors

List of categorical predictors, a numeric vector with indices from 1 to p, where p is the number of columns of X.

CatSplit

An n-by-2 cell array, where n is the number of categorical splits in tree. Each row in CatSplit gives left and right values for a categorical split. For each branch node with categorical split j based on a categorical predictor variable z, the left child is chosen if z is in CatSplit(j,1) and the right child is chosen if z is in CatSplit(j,2). The splits are in the same order as nodes of the tree. Nodes for these splits can be found by running cuttype and selecting 'categorical' cuts from top to bottom.

Children

An n-by-2 array containing the numbers of the child nodes for each node in tree, where n is the number of nodes. Leaf nodes have child node 0.

CutCategories

An n-by-2 cell array of the categories used at branches in tree, where n is the number of nodes. For each branch node i based on a categorical predictor variable x, the left child is chosen if x is among the categories listed in CutCategories{i,1}, and the right child is chosen if x is among those listed in CutCategories{i,2}. Both columns of CutCategories are empty for branch nodes based on continuous predictors and for leaf nodes.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

CutPoint

An n-element vector of the values used as cut points in tree, where n is the number of nodes. For each branch node i based on a continuous predictor variable x, the left child is chosen if x<CutPoint(i) and the right child is chosen if x>=CutPoint(i). CutPoint is NaN for branch nodes based on categorical predictors and for leaf nodes.
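To make the continuous split rule concrete, here is a sketch that walks a tiny hand-built tree using arrays shaped like the Children, CutPoint, and NodeMean properties. All values are invented, and cutCol (the column of the predictor matrix tested at each node) is a hypothetical stand-in for looking up CutVar by name.

```matlab
% Hand-built three-node tree: node 1 is the root, nodes 2 and 3 are leaves.
children = [2 3; 0 0; 0 0];      % leaf nodes have child node 0
cutPoint = [3000; NaN; NaN];     % NaN for leaf nodes
cutCol   = [1; 0; 0];            % hypothetical: predictor column per node
nodeMean = [23; 28; 17];         % mean response in each node

xnew = [2500 4];                 % one new observation

node = 1;                        % start at the root
while any(children(node, :) ~= 0)
    if xnew(cutCol(node)) < cutPoint(node)
        node = children(node, 1);    % x <  CutPoint(i): left child
    else
        node = children(node, 2);    % x >= CutPoint(i): right child
    end
end
prediction = nodeMean(node);
disp(prediction)
```

Since 2500 is less than the cut point 3000, the observation lands in node 2, and the prediction is that node's mean, 28.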

CutType

An n-element cell array indicating the type of cut at each node in tree, where n is the number of nodes. For each node i, CutType{i} is:

  • 'continuous' — If the cut is defined in the form x < v for a variable x and cut point v.

  • 'categorical' — If the cut is defined by whether a variable x takes a value in a set of categories.

  • '' — If i is a leaf node.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

CutVar

An n-element cell array of the names of the variables used for branching in each node in tree, where n is the number of nodes. These variables are sometimes known as cut variables. For leaf nodes, CutVar contains an empty string.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

IsBranch

An n-element logical vector ib that is true for each branch node and false for each leaf node of tree.

ModelParameters

Object holding parameters of tree.

NumObservations

Number of observations in the training data, a numeric scalar. NumObservations can be less than the number of rows of input data X when there are missing values in X or response Y.

NodeErr

An n-element vector e of the errors of the nodes in tree, where n is the number of nodes. e(i) is the mean squared error for node i.

NodeMean

An n-element numeric array with mean values in each node of tree, where n is the number of nodes in the tree. Every element in NodeMean is the average of the true Y values over all observations in the node.

NodeProb

An n-element vector p of the probabilities of the nodes in tree, where n is the number of nodes. The probability of a node is computed as the proportion of observations from the original data that satisfy the conditions for the node. This proportion is adjusted for any prior probabilities assigned to each class.

NodeRisk

An n-element vector of the risk of the nodes in the tree, where n is the number of nodes. The risk for each node is the node error weighted by the node probability.
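The relationship between node error, node probability, and node risk can be sketched with invented values for a root node and its two leaves. The arrays mimic the NodeErr and NodeProb properties but are not taken from a fitted tree.

```matlab
% Hypothetical three-node tree: root (node 1) and two leaves (nodes 2, 3).
nodeErr  = [16; 10; 4];      % error of each node
nodeProb = [1; 0.5; 0.5];    % fraction of observations reaching each node

% Risk is the node error weighted by the node probability.
nodeRisk = nodeErr .* nodeProb;
disp(nodeRisk')

% Compare the summed risk of the two leaves with the parent's risk,
% which is the comparison the 'MergeLeaves' option relies on.
leafRiskSum = sum(nodeRisk(2:3));
wouldMerge  = leafRiskSum >= nodeRisk(1);
```

Here the leaves have a combined risk of 7 against a parent risk of 16, so the split reduces risk and the merge condition is not met.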

NodeSize

An n-element vector sizes of the sizes of the nodes in tree, where n is the number of nodes. The size of a node is defined as the number of observations from the data used to create the tree that satisfy the conditions for the node.

NumNodes

The number of nodes n in tree.

Parent

An n-element vector p containing the number of the parent node for each node in tree, where n is the number of nodes. The parent of the root node is 0.

PredictorNames

A cell array of names for the predictor variables, in the order in which they appear in X.

PruneAlpha

Numeric vector with one element per pruning level. If the pruning level ranges from 0 to M, then PruneAlpha has M + 1 elements sorted in ascending order. PruneAlpha(1) is for pruning level 0 (no pruning), PruneAlpha(2) is for pruning level 1, and so on.

PruneList

An n-element numeric vector with the pruning levels in each node of tree, where n is the number of nodes. The pruning levels range from 0 (no pruning) to M, where M is the distance between the deepest leaf and the root node.
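A short sketch of how PruneAlpha and PruneList relate to the prune method, assuming the carsmall sample data and the default 'Prune','on' setting. The predictors are chosen arbitrarily, and the 'Level' argument of prune is used as an assumption about that method's interface.

```matlab
load carsmall
tree = fitrtree([Weight, Horsepower], MPG);   % pruning sequence computed by default

maxLevel  = max(tree.PruneList);      % deepest pruning level M
numAlphas = numel(tree.PruneAlpha);   % M + 1 entries, per the description above

% Remove the nodes that belong to pruning level 1.
prunedTree = prune(tree, 'Level', 1);
fprintf('Full tree: %d nodes, pruned tree: %d nodes\n', ...
        tree.NumNodes, prunedTree.NumNodes);
```

Higher levels remove more of the tree; pruning to level maxLevel leaves only the root node.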

ResponseName

Name of the response variable Y, a string.

ResponseTransform

Function handle for transforming the raw response values. The function handle should accept a matrix of response values and return a matrix of the same size. The default string 'none' means @(x)x, or no transformation.

Add or change a ResponseTransform function using dot notation:

tree.ResponseTransform = @function

SurrCutCategories

An n-element cell array of the categories used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrCutCategories{k} is a cell array. The length of SurrCutCategories{k} is equal to the number of surrogate predictors found at this node. Every element of SurrCutCategories{k} is either an empty string for a continuous surrogate predictor, or is a two-element cell array with categories for a categorical surrogate predictor. The first element of this two-element cell array lists categories assigned to the left child by this surrogate split, and the second element of this two-element cell array lists categories assigned to the right child by this surrogate split. The order of the surrogate split variables at each node is matched to the order of variables in SurrCutVar. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrCutCategories contains an empty cell.

SurrCutFlip

An n-element cell array of the numeric cut assignments used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrCutFlip{k} is a numeric vector. The length of SurrCutFlip{k} is equal to the number of surrogate predictors found at this node. Every element of SurrCutFlip{k} is either zero for a categorical surrogate predictor, or a numeric cut assignment for a continuous surrogate predictor. The numeric cut assignment can be either –1 or +1. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z < C and the cut assignment for this surrogate split is +1, or if Z ≥ C and the cut assignment for this surrogate split is –1. Similarly, the right child is chosen if Z ≥ C and the cut assignment for this surrogate split is +1, or if Z < C and the cut assignment for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables in SurrCutVar. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrCutFlip contains an empty array.
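The cut-assignment rule for continuous surrogate splits can be written as a single predicate. The handle below is an illustrative sketch, not part of the class; z is the predictor value, c the numeric cut, and flip the cut assignment (+1 or -1).

```matlab
% True when the observation goes to the left child, false for the right.
goesLeft = @(z, c, flip) (flip == 1 && z < c) || (flip == -1 && z >= c);

a = goesLeft(2, 3,  1);   % z <  c with assignment +1: left child
b = goesLeft(2, 3, -1);   % z <  c with assignment -1: right child
d = goesLeft(5, 3, -1);   % z >= c with assignment -1: left child
e = goesLeft(5, 3,  1);   % z >= c with assignment +1: right child
disp([a b d e])
```

An assignment of -1 simply mirrors the usual direction of the comparison.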

SurrCutPoint

An n-element cell array of the numeric values used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrCutPoint{k} is a numeric vector. The length of SurrCutPoint{k} is equal to the number of surrogate predictors found at this node. Every element of SurrCutPoint{k} is either NaN for a categorical surrogate predictor, or a numeric cut for a continuous surrogate predictor. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z < C and SurrCutFlip for this surrogate split is +1, or if Z ≥ C and SurrCutFlip for this surrogate split is –1. Similarly, the right child is chosen if Z ≥ C and SurrCutFlip for this surrogate split is +1, or if Z < C and SurrCutFlip for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables returned by SurrCutVar. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrCutPoint contains an empty cell.

SurrCutType

An n-element cell array indicating types of surrogate splits at each node in tree, where n is the number of nodes in tree. For each node k, SurrCutType{k} is a cell array with the types of the surrogate split variables at this node. The variables are sorted by the predictive measure of association with the optimal predictor in the descending order, and only variables with the positive predictive measure are included. The order of the surrogate split variables at each node is matched to the order of variables in SurrCutVar. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrCutType contains an empty cell. A surrogate split type can be either 'continuous' if the cut is defined in the form Z < V for a variable Z and cut point V or 'categorical' if the cut is defined by whether Z takes a value in a set of categories.

SurrCutVar

An n-element cell array of the names of the variables used for surrogate splits in each node in tree, where n is the number of nodes in tree. Every element of SurrCutVar is a cell array with the names of the surrogate split variables at this node. The variables are sorted by the predictive measure of association with the optimal predictor in the descending order, and only variables with the positive predictive measure are included. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrCutVar contains an empty cell.

SurrVarAssoc

An n-element cell array of the predictive measures of association for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrVarAssoc{k} is a numeric vector. The length of SurrVarAssoc{k} is equal to the number of surrogate predictors found at this node. Every element of SurrVarAssoc{k} gives the predictive measure of association between the optimal split and this surrogate split. The order of the surrogate split variables at each node is the order of variables in SurrCutVar. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrVarAssoc contains an empty cell.

W

The scaled weights, a vector with length n, the number of rows in X.

X

A matrix of predictor values. Each column of X represents one variable, and each row represents one observation.

Y

A numeric column vector with the same number of rows as X. Each entry in Y is the response to the data in the corresponding row of X.

Methods

compact        Compact regression tree
crossval       Cross-validated decision tree
cvloss         Regression error by cross validation
prune          Produce sequence of subtrees by pruning
resubLoss      Regression error by resubstitution
resubPredict   Predict resubstitution response of tree

Inherited Methods

loss                  Regression error
meanSurrVarAssoc      Mean predictive measure of association for surrogate splits in decision tree
predict               Predict response of regression tree
predictorImportance   Estimates of predictor importance
view                  View tree

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

Examples

Construct a Regression Tree

Load the sample data.

load carsmall;

Construct a regression tree using the sample data.

tree = fitrtree([Weight, Cylinders],MPG,...
                'categoricalpredictors',2,'MinParent',20,...
                'PredictorNames',{'W','C'})
tree = 

  RegressionTree
           PredictorNames: {'W'  'C'}
             ResponseName: 'Y'
        ResponseTransform: 'none'
    CategoricalPredictors: 2
          NumObservations: 94


  Properties, Methods

Predict the mileage of 4,000-pound cars with 4, 6, and 8 cylinders.

mileage4K = predict(tree,[4000 4; 4000 6; 4000 8])
mileage4K =
   19.2778
   19.2778
   14.3889
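Because the object stores its training data, the resubstitution methods listed under Methods can be called without passing the data again. The following sketch refits the same tree as above and then checks its fit on the training data; it assumes the carsmall sample data is available.

```matlab
load carsmall
tree = fitrtree([Weight, Cylinders], MPG, ...
                'CategoricalPredictors', 2, 'MinParent', 20, ...
                'PredictorNames', {'W','C'});

resubMSE = resubLoss(tree);      % mean squared error on the training data
yfit     = resubPredict(tree);   % fitted response for each stored observation

view(tree)                       % print a text description of the splits
```

The resubstitution error is usually optimistic compared with a cross-validated estimate, since the tree is evaluated on the same observations it was grown from.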
